Hopes: Estimators#

Roadmap#

  • [x] Implement Inverse Probability Weighting (IPW) estimator

  • [x] Implement Self-Normalized Inverse Probability Weighting (SNIPW) estimator

  • [x] Implement Direct Method (DM) estimator

  • [x] Implement Trajectory-wise Importance Sampling (TIS) estimator

  • [x] Implement Self-Normalized Trajectory-wise Importance Sampling (SNTIS) estimator

  • [x] Implement Per-Decision Importance Sampling (PDIS) estimator

  • [x] Implement Self-Normalized Per-Decision Importance Sampling (SNPDIS) estimator

  • [x] Implement Doubly Robust (DR) estimator

Implemented estimators#

Currently, the following estimators are implemented:

hopes.ope.estimators.BaseEstimator

Base class for all estimators.

hopes.ope.estimators.InverseProbabilityWeighting

Inverse Probability Weighting (IPW) estimator.

hopes.ope.estimators.SelfNormalizedInverseProbabilityWeighting

Self-Normalized Inverse Probability Weighting (SNIPW) estimator.

hopes.ope.estimators.DirectMethod

Direct Method (DM) estimator.

hopes.ope.estimators.TrajectoryWiseImportanceSampling

Trajectory-wise Importance Sampling (TIS) estimator.

hopes.ope.estimators.SelfNormalizedTrajectoryWiseImportanceSampling

Self-Normalized Trajectory-wise Importance Sampling (SNTIS) estimator.

hopes.ope.estimators.PerDecisionImportanceSampling

Per-Decision Importance Sampling (PDIS) estimator.

hopes.ope.estimators.SelfNormalizedPerDecisionImportanceSampling

Self-Normalized Per-Decision Importance Sampling (SNPDIS) estimator.

hopes.ope.estimators.SequentialDoublyRobust

Sequential Doubly Robust estimator.

Estimators documentation#

class hopes.ope.estimators.InverseProbabilityWeighting#

Bases: BaseEstimator

Inverse Probability Weighting (IPW) estimator.

\(V_{IPW}(\pi_e, D)=\frac {1}{n} \sum_{t=1}^n p(s_t,a_t) r_t\)

Where:
  • \(D\) is the offline collected dataset.

  • \(p(s_t,a_t)\) is the importance weight defined as \(p(s_t,a_t)=\frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}\).

  • \(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.

  • \(r_t\) is the reward observed at time \(t\) for the behavior policy.

  • \(n\) is the number of samples.

This estimator has generally high variance, especially on small datasets, and can be improved by using self-normalized importance weights.

References

https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs

estimate_policy_value() float#

Estimate the value of the target policy using the IPW estimator.

estimate_weighted_rewards() ndarray#

Estimate the weighted rewards using the IPW estimator.
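The IPW formula above can be sketched directly in NumPy. The following is an illustrative sketch only, not the library API: the policy probabilities and rewards are synthetic, and in practice they would come from logged data and policy models.

```python
import numpy as np

# Sketch of V_IPW = (1/n) * sum_t p(s_t, a_t) * r_t,
# with p(s_t, a_t) = pi_e(a_t|s_t) / pi_b(a_t|s_t).
rng = np.random.default_rng(0)
n = 1000
pi_e = rng.uniform(0.1, 0.9, size=n)   # target policy probs for logged actions
pi_b = rng.uniform(0.1, 0.9, size=n)   # behavior policy probs for logged actions
rewards = rng.normal(loc=1.0, size=n)  # rewards observed under pi_b

weights = pi_e / pi_b                  # importance weights p(s_t, a_t)
weighted_rewards = weights * rewards
v_ipw = weighted_rewards.mean()        # (1/n) * sum of weighted rewards
```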

class hopes.ope.estimators.SelfNormalizedInverseProbabilityWeighting#

Bases: InverseProbabilityWeighting

Self-Normalized Inverse Probability Weighting (SNIPW) estimator.

\(V_{SNIPW}(\pi_e, D)= \frac {\sum_{t=1}^n p(s_t,a_t) r_t}{\sum_{t=1}^n p(s_t,a_t)}\)

Where:
  • \(D\) is the offline collected dataset.

  • \(p(s_t,a_t)\) is the importance weight defined as \(p(s_t,a_t)=\frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}\).

  • \(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.

  • \(r_t\) is the reward at time \(t\).

  • \(n\) is the number of samples.

References

https://papers.nips.cc/paper_files/paper/2015/hash/39027dfad5138c9ca0c474d71db915c3-Abstract.html

estimate_policy_value() float#

Estimate the value of the target policy using the SNIPW estimator.

estimate_weighted_rewards() ndarray#

Estimate the weighted rewards using the SNIPW estimator.
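The difference from IPW is only the denominator: SNIPW divides by the sum of the importance weights instead of \(n\), trading a small bias for lower variance. A minimal NumPy sketch on synthetic data (illustrative only, not the library API):

```python
import numpy as np

# Same synthetic data for both estimators, to contrast the denominators.
rng = np.random.default_rng(0)
n = 1000
pi_e = rng.uniform(0.1, 0.9, size=n)
pi_b = rng.uniform(0.1, 0.9, size=n)
rewards = rng.normal(loc=1.0, size=n)

weights = pi_e / pi_b
v_ipw = (weights * rewards).sum() / n                 # IPW: divide by n
v_snipw = (weights * rewards).sum() / weights.sum()   # SNIPW: divide by sum of weights
```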

class hopes.ope.estimators.TrajectoryWiseImportanceSampling(steps_per_episode: int, discount_factor: float = 1.0)#

Bases: BaseEstimator, TrajectoryPerDecisionMixin

Trajectory-wise Importance Sampling (TIS) estimator.

\(V_{TIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w^{(i)}_{0:T-1} r_t^{(i)}\)

Where:

  • \(D\) is the offline collected dataset.

  • \(w^{(i)}_{0:T-1}\) is the importance weight of the trajectory \(i\) defined as \(w_{0:T-1} = \prod_{t=0}^{T-1} \frac {\pi_e(a_t|s_t)} {\pi_b(a_t|s_t)}\)

  • \(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.

  • \(n\) is the number of trajectories.

  • \(T\) is the length of the trajectory.

  • \(\gamma\) is the discount factor, applied as \(\gamma^t\) at time \(t\).

  • \(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).

TIS can suffer from high variance due to the product of per-step importance weights, especially for long trajectories or large action spaces.

References

https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs

check_parameters() None#

Check if the estimator parameters are valid.

estimate_policy_value() float#

Estimate the value of the target policy using the Trajectory-wise Importance Sampling estimator.

estimate_weighted_rewards() ndarray#

Estimate the weighted rewards using the Trajectory-wise Importance Sampling estimator.

Returns:

the weighted rewards, or here the policy value per trajectory.

short_name() str#

Return the short name of the estimator.

This method can be overridden by subclasses to customize the short name.

Returns:

the short name of the estimator. By default, it returns the abbreviation of the class name, e.g. “IPW”.
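The trajectory-wise computation can be sketched on synthetic trajectories. The shapes below are assumptions for illustration, not the library's data layout: each trajectory contributes one weight, the product of its per-step ratios.

```python
import numpy as np

# Sketch of V_TIS: per-trajectory weight w_{0:T-1} = prod_t pi_e / pi_b.
rng = np.random.default_rng(0)
n, T, gamma = 50, 10, 0.99
pi_e = rng.uniform(0.5, 1.0, size=(n, T))  # target probs, shape (n, T)
pi_b = rng.uniform(0.5, 1.0, size=(n, T))  # behavior probs, shape (n, T)
rewards = rng.normal(size=(n, T))

traj_weights = np.prod(pi_e / pi_b, axis=1)       # w^{(i)}_{0:T-1}, shape (n,)
discounts = gamma ** np.arange(T)                  # gamma^t, shape (T,)
per_traj = traj_weights * (discounts * rewards).sum(axis=1)
v_tis = per_traj.mean()
```

The `np.prod` over the time axis is the source of the variance issue noted above: a few extreme per-step ratios compound multiplicatively.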

class hopes.ope.estimators.SelfNormalizedTrajectoryWiseImportanceSampling(steps_per_episode: int, discount_factor: float = 1.0)#

Bases: TrajectoryWiseImportanceSampling

Self-Normalized Trajectory-wise Importance Sampling (SNTIS) estimator.

\[V_{SNTIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac {w^{(i)}_{0:T-1}} {\frac {1}{n} \sum_{j=1}^n w^{(j)}_{0:T-1}} r_t^{(i)}\]

Where:

  • \(D\) is the offline collected dataset.

  • \(w^{(i)}_{0:T-1}\) is the importance weight of the trajectory \(i\) defined as \(w_{0:T-1} = \prod_{t=0}^{T-1} \frac {\pi_e(a_t|s_t)} {\pi_b(a_t|s_t)}\)

  • \(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.

  • \(n\) is the number of trajectories.

  • \(T\) is the length of the trajectory.

  • \(\gamma\) is the discount factor, applied as \(\gamma^t\) at time \(t\).

  • \(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).

SNTIS is a variance reduction technique for TIS. It divides the weighted rewards by the mean of the importance weights of the trajectories.

References

https://arxiv.org/abs/1906.03735

normalize(weights: ndarray) ndarray#

Normalize the importance weights using the self-normalization strategy.

It uses self-normalization to reduce the variance of the estimator, using the mean of the importance weights over the trajectories.

Parameters:

weights – the importance weights to normalize.

Returns:

the normalized importance weights.

short_name() str#

Return the short name of the estimator.

This method can be overridden by subclasses to customize the short name.

Returns:

the short name of the estimator. By default, it returns the abbreviation of the class name, e.g. “IPW”.

class hopes.ope.estimators.PerDecisionImportanceSampling(steps_per_episode: int, discount_factor: float = 1.0)#

Bases: BaseEstimator, TrajectoryPerDecisionMixin

Per-Decision Importance Sampling (PDIS) estimator.

\(V_{PDIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w^{(i)}_{t} r_t^{(i)}\)

Where:

  • \(D\) is the offline collected dataset.

  • \(w^{(i)}_{t}\) is the importance weight of the decision \(t\) of trajectory \(i\) defined as \(w_{t} = \frac {\pi_e(a_t|s_t)} {\pi_b(a_t|s_t)}\)

  • \(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.

  • \(n\) is the number of trajectories.

  • \(T\) is the length of the trajectory.

  • \(\gamma\) is the discount factor, applied as \(\gamma^t\) at time \(t\).

  • \(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).

References

https://arxiv.org/abs/1906.03735

check_parameters() None#

Check if the estimator parameters are valid.

estimate_policy_value() float#

Estimate the value of the target policy using the Per-Decision Importance Sampling estimator.

estimate_weighted_rewards() ndarray#

Estimate the weighted rewards using the Per-Decision Importance Sampling estimator.

Returns:

the weighted rewards, or here the policy value per trajectory.
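Following the formula above, each reward is weighted by its own per-decision ratio rather than the whole trajectory's product. A NumPy sketch with assumed shapes (illustrative only, not the library API):

```python
import numpy as np

# Sketch of V_PDIS: per-decision weight w_t = pi_e(a_t|s_t) / pi_b(a_t|s_t).
rng = np.random.default_rng(0)
n, T, gamma = 50, 10, 0.99
pi_e = rng.uniform(0.5, 1.0, size=(n, T))
pi_b = rng.uniform(0.5, 1.0, size=(n, T))
rewards = rng.normal(size=(n, T))

step_weights = pi_e / pi_b                  # w^{(i)}_t, shape (n, T)
discounts = gamma ** np.arange(T)
v_pdis = (discounts * step_weights * rewards).sum(axis=1).mean()
```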

class hopes.ope.estimators.SelfNormalizedPerDecisionImportanceSampling(*, steps_per_episode: int, discount_factor: float = 1.0, normalization: str = 'per_timestep', eps: float = 1e-12)#

Bases: PerDecisionImportanceSampling

Self-Normalized Per-Decision Importance Sampling (SNPDIS) estimator.

\[V_{SNPDIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac {w^{(i)}_{t}} {\frac {1}{n} \sum_{j=1}^n w^{(j)}_{t}} r_t^{(i)}\]

Where:

  • \(D\) is the offline collected dataset.

  • \(w^{(i)}_{t}\) is the importance weight of the decision \(t\) of trajectory \(i\) defined as \(w_{t} = \frac {\pi_e(a_t|s_t)} {\pi_b(a_t|s_t)}\)

  • \(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.

  • \(n\) is the number of trajectories.

  • \(T\) is the length of the trajectory.

  • \(\gamma\) is the discount factor, applied as \(\gamma^t\) at time \(t\).

  • \(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).

SNPDIS is a variance reduction technique for PDIS. It divides each per-decision weight by the mean of the weights at the same timestep across trajectories.

References

https://arxiv.org/abs/1906.03735

check_parameters() None#

Check if the estimator parameters are valid.

estimate_policy_value() float#

Estimate the value of the target policy using the Self-Normalized Per-Decision Importance Sampling estimator.

estimate_weighted_rewards() ndarray#

Estimate the weighted rewards using the Self-Normalized Per-Decision Importance Sampling estimator.

Returns:

the weighted rewards, or here the policy value per trajectory.

normalize(weights: ndarray) ndarray#

Normalize the importance weights using the self-normalization strategy.

It uses self-normalization to reduce the variance of the estimator, using the mean of the importance weights across trajectories at each timestep.

Parameters:

weights – the importance weights to normalize.

Returns:

the normalized importance weights.

class hopes.ope.estimators.DirectMethod(q_model_cls: type[RegressionBasedRewardModel], behavior_policy_obs: ndarray, behavior_policy_act: ndarray, behavior_policy_rewards: ndarray, steps_per_episode: int, discount_factor: float = 1.0, q_model_type: str = 'random_forest', q_model_params: dict | None = None)#

Bases: BaseEstimator

Direct Method (DM) estimator.

\(V_{DM}(\pi_e, D, Q)=\frac {1}{n} \sum_{i=1}^n \sum_{a \in A} \pi_e(a|s^i_0) Q(s^i_0, a)\)

Where:
  • \(D = \{\{ (s_t, a_t, r_t) \}^{T-1}_{t=0}\}^n_{i=1}\) is the offline collected dataset consisting of \(n\) trajectories.

  • \(\pi_e\) is the target policy.

  • \(Q(s^i_0, a)\) is the Q model trained to estimate the expected discounted sum of rewards from the initial state \(s^i_0\) when taking action \(a\) under the behavior policy.

  • \(n\) is the number of episodes/trajectories.

  • \(a\) is the action taken in the set of actions \(A\).

  • \(s^i_0\) is the initial state of the i-th trajectory.

This estimator trains a Q model using supervised learning on initial states and their corresponding discounted cumulative returns, then uses it to estimate the expected value under the target policy. The performance of this estimator depends on the quality of the Q model.

check_parameters() None#

Check if the estimator parameters are valid.

Base estimator checks plus additional checks for the Q model.

estimate_policy_value() float#

Estimate the value of the target policy using the Direct Method estimator.

estimate_weighted_rewards() ndarray#

Estimate the weighted rewards using the Direct Method estimator.

For each episode \(i\), computes: \(V(s^i_0) = \sum_{a \in A} \pi_e(a|s^i_0) Q(s^i_0, a)\)

Where \(Q(s^i_0, a)\) is the predicted discounted cumulative return from the initial state \(s^i_0\).

Returns:

the estimated values for each episode/trajectory.

fit() dict[str, float] | None#

Fit the Q model to estimate the expected discounted sum of rewards from the initial state.

The Q model is trained on (initial_state, initial_action) pairs with their corresponding discounted cumulative returns computed as:

\(G_0 = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots + \gamma^{T-1} r_{T-1}\)

Returns:

the fit statistics of the Q model.
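Once a Q model is available, the DM value computation itself is a policy-weighted average of the Q-values at the initial states. The sketch below uses random arrays in place of a trained Q model and target policy, with assumed shapes; it illustrates only the final averaging step, not the model training the library performs.

```python
import numpy as np

# Sketch of V_DM = (1/n) * sum_i sum_a pi_e(a|s^i_0) * Q(s^i_0, a).
rng = np.random.default_rng(0)
n_episodes, n_actions = 20, 4
q_values = rng.normal(size=(n_episodes, n_actions))   # stand-in for Q(s^i_0, a)
logits = rng.normal(size=(n_episodes, n_actions))
pi_e = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # pi_e(a|s^i_0)

per_episode = (pi_e * q_values).sum(axis=1)   # V(s^i_0) for each episode
v_dm = per_episode.mean()
```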

class hopes.ope.estimators.SequentialDoublyRobust(*, steps_per_episode: int, discount_factor: float = 1.0, eps: float = 1e-12, clip: float | None = None)#

Bases: BaseEstimator

Sequential Doubly Robust estimator.

This estimator computes a per-decision doubly robust estimate using a temporal-difference-style formulation. It combines model-based predictions with cumulative importance weights built from behavior and target policy action probabilities.

The per-episode estimate is computed as:

\[\hat{V}_{\mathrm{DR}}^{(i)} = \hat{V}(s_{i,0}) + \sum_{t=0}^{T-1} W_{i,t} \left( r_{i,t} + \gamma \hat{V}(s_{i,t+1}) - \hat{Q}(s_{i,t}, a_{i,t}) \right)\]

where:

\[W_{i,t} = \prod_{k=0}^{t} \rho_{i,k}\]

and

\[\rho_{i,t} = \frac{\pi_e(a_{i,t} \mid s_{i,t})}{\pi_b(a_{i,t} \mid s_{i,t})}\]

with:

  • \(i\) denoting the episode index,

  • \(t\) denoting the timestep index,

  • \(r_{i,t}\) the observed reward at timestep \(t\),

  • \(\hat{Q}(s_{i,t}, a_{i,t})\) the estimated action-value for the logged action,

  • \(\hat{V}(s_{i,t})\) the estimated state value under the target policy,

  • \(\gamma\) the discount factor,

  • \(\pi_e\) the target policy,

  • \(\pi_b\) the behavior policy.

If precomputed step-wise importance ratios are provided, they are used directly. Otherwise, the ratios are constructed from the target and behavior policy action probabilities and the logged actions.

Stickiness handling, when needed, must be applied upstream during preprocessing.

References

https://arxiv.org/abs/1511.03722
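For a single episode, the per-episode estimate above can be sketched with NumPy as follows. All arrays are synthetic stand-ins with assumed shapes (illustrative only, not the library API): in practice \(\hat{Q}\) and \(\hat{V}\) come from a fitted Q model and the target policy.

```python
import numpy as np

# Sketch of one episode's DR estimate:
# V_hat(s_0) + sum_t W_t * (r_t + gamma * V_hat(s_{t+1}) - Q_hat(s_t, a_t)).
rng = np.random.default_rng(0)
T, gamma = 10, 0.99
rho = rng.uniform(0.8, 1.2, size=T)     # per-step ratios pi_e / pi_b
rewards = rng.normal(size=T)
q_hat = rng.normal(size=T)              # Q_hat(s_t, a_t) for logged actions
v_hat = rng.normal(size=T + 1)          # V_hat(s_t), including terminal s_T

W = np.cumprod(rho)                     # W_t = prod_{k<=t} rho_k
td = rewards + gamma * v_hat[1:] - q_hat  # TD-style correction terms
v_dr = v_hat[0] + (W * td).sum()
```

When the Q model is accurate, the correction terms are small and the importance weights contribute little variance; when it is poor, the weighted corrections recover an unbiased estimate, which is the doubly robust property.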

check_parameters() None#

Check if the estimator parameters are valid.

estimate_policy_value() float#

Estimate the value of the target policy.

estimate_weighted_rewards() ndarray#

Estimate episode-level sequential DR contributions.

Returns:

Episode-level DR estimates, shape (n_episodes, 1).

fit(*, obs_flat: ndarray, act_flat: ndarray, rew_flat: ndarray, num_actions: int, q_model_params: dict | None = None, random_state: int = 0) RTGQModelHGBoost#

Fit an internal RTGQModelHGBoost Q model from raw logged data, then populate q_values and logged_actions automatically.

This mirrors the design of DirectMethod.fit(): rather than building and fitting the Q model externally and injecting the predictions via set_model_predictions(), you can pass the raw trajectory data directly and let the estimator handle the model training. The two workflows remain interchangeable — set_model_predictions() is still available for cases where Q-values come from an external source.

Parameters:
  • obs_flat – Observations, shape (n_samples, obs_dim).

  • act_flat – Discrete action indices, shape (n_samples,).

  • rew_flat – Rewards, shape (n_samples,).

  • num_actions – Total number of discrete actions.

  • q_model_params – Optional hyper-parameters forwarded to RTGQModelHGBoost.

  • random_state – Random seed for the underlying gradient-boosting model.

Returns:

The fitted RTGQModelHGBoost instance.

set_logged_actions(logged_actions: ndarray) None#

Set logged actions.

Parameters:

logged_actions – Logged action indices with shape (n_samples,).

set_model_predictions(*, q_values: ndarray) None#

Set model-based predictions used by the sequential DR estimator.

This can be used to inject Q-value predictions from an external model, or to set the predictions after fitting an internal model via fit(). The provided Q-values must be aligned with the logged data, meaning each row corresponds to the estimated action-values for the state at the same index in the logged data as in \([\hat{Q}(s_t, a)]_{a \in \mathcal{A}}\).

Parameters:

q_values – Estimated action-values for all actions, shape (n_samples, n_actions). Each row must contain the estimated action-values for the corresponding state.

short_name() str#

Return the short name of the estimator.

This method can be overridden by subclasses to customize the short name.

Returns:

the short name of the estimator. By default, it returns the abbreviation of the class name, e.g. “IPW”.

Implementing a new estimator#

To implement a new estimator, you need to subclass hopes.ope.estimators.BaseEstimator and implement its two abstract methods, estimate_policy_value() and estimate_weighted_rewards().

Optionally, you can implement hopes.ope.estimators.BaseEstimator.short_name() to provide a short name for the estimator. When not implemented, the uppercase letters of the class name are used.

Below is the BaseEstimator class documentation.

class hopes.ope.estimators.BaseEstimator#

Base class for all estimators.

abstract estimate_policy_value() float#

Estimate the value of the target policy.

This method should be overridden by subclasses to implement the specific estimator. The typical implementation should call estimate_weighted_rewards() to compute the weighted rewards, then compute the policy value.

Returns:

the estimated value of the target policy.

abstract estimate_weighted_rewards() ndarray#

Estimate the weighted rewards.

This method should be overridden by subclasses to implement the specific estimator.

Returns:

the weighted rewards.

short_name() str#

Return the short name of the estimator.

This method can be overridden by subclasses to customize the short name.

Returns:

the short name of the estimator. By default, it returns the abbreviation of the class name, e.g. “IPW”.
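A skeleton of a new estimator, following the abstract interface documented above. So the sketch is self-contained, ``BaseEstimator`` below is a minimal stand-in for ``hopes.ope.estimators.BaseEstimator`` that reproduces only the documented behavior; in real code, subclass the library class instead, and note that how data is registered on the estimator may differ.

```python
import numpy as np


class BaseEstimator:
    """Minimal stand-in for hopes.ope.estimators.BaseEstimator."""

    def short_name(self) -> str:
        # Default documented behavior: uppercase letters of the class name.
        return "".join(c for c in type(self).__name__ if c.isupper())

    def estimate_policy_value(self) -> float:
        raise NotImplementedError

    def estimate_weighted_rewards(self) -> np.ndarray:
        raise NotImplementedError


class MyWeightedEstimator(BaseEstimator):
    """Toy estimator: mean of externally supplied weighted rewards."""

    def __init__(self, weighted_rewards: np.ndarray) -> None:
        self.weighted_rewards = weighted_rewards

    def estimate_weighted_rewards(self) -> np.ndarray:
        return self.weighted_rewards

    def estimate_policy_value(self) -> float:
        # Typical pattern: derive the policy value from the weighted rewards.
        return float(self.estimate_weighted_rewards().mean())


est = MyWeightedEstimator(np.array([1.0, 2.0, 3.0]))
```

The default short_name() would render this class as “MWE”; override it to pick a different label.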