Hopes: Estimators#
Roadmap#
[x] Implement Inverse Probability Weighting (IPW) estimator
[x] Implement Self-Normalized Inverse Probability Weighting (SNIPW) estimator
[x] Implement Direct Method (DM) estimator
[x] Implement Trajectory-wise Importance Sampling (TIS) estimator
[x] Implement Self-Normalized Trajectory-wise Importance Sampling (SNTIS) estimator
[x] Implement Per-Decision Importance Sampling (PDIS) estimator
[x] Implement Self-Normalized Per-Decision Importance Sampling (SNPDIS) estimator
[x] Implement Doubly Robust (DR) estimator
Implemented estimators#
Currently, the following estimators are implemented:
BaseEstimator: Base class for all estimators.
InverseProbabilityWeighting: Inverse Probability Weighting (IPW) estimator.
SelfNormalizedInverseProbabilityWeighting: Self-Normalized Inverse Probability Weighting (SNIPW) estimator.
DirectMethod: Direct Method (DM) estimator.
TrajectoryWiseImportanceSampling: Trajectory-wise Importance Sampling (TIS) estimator.
SelfNormalizedTrajectoryWiseImportanceSampling: Self-Normalized Trajectory-wise Importance Sampling (SNTIS) estimator.
PerDecisionImportanceSampling: Per-Decision Importance Sampling (PDIS) estimator.
SelfNormalizedPerDecisionImportanceSampling: Self-Normalized Per-Decision Importance Sampling (SNPDIS) estimator.
SequentialDoublyRobust: Sequential Doubly Robust (DR) estimator.
Estimators documentation#
- class hopes.ope.estimators.InverseProbabilityWeighting#
Bases: BaseEstimator

Inverse Probability Weighting (IPW) estimator.
\(V_{IPW}(\pi_e, D)=\frac {1}{n} \sum_{t=1}^n p(s_t,a_t) r_t\)
- Where:
\(D\) is the offline collected dataset.
\(p(s_t,a_t)\) is the importance weight defined as \(p(s_t,a_t)=\frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}\).
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(r_t\) is the reward observed at time \(t\) under the behavior policy.
\(n\) is the number of samples.
This estimator has generally high variance, especially on small datasets, and can be improved by using self-normalized importance weights.
References
https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
- estimate_policy_value() float#
Estimate the value of the target policy using the IPW estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the IPW estimator.
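As a sketch of the formula above, the IPW estimate can be computed directly with NumPy. The probability and reward arrays below are hypothetical toy values, not produced by the library:

```python
import numpy as np

# Toy logged data: per-sample action probabilities under each policy.
target_probs = np.array([0.9, 0.2, 0.7, 0.5])    # pi_e(a_t|s_t)
behavior_probs = np.array([0.5, 0.5, 0.5, 0.5])  # pi_b(a_t|s_t)
rewards = np.array([1.0, 0.0, 1.0, 1.0])         # r_t

weights = target_probs / behavior_probs  # importance weights p(s_t, a_t)
v_ipw = np.mean(weights * rewards)       # V_IPW(pi_e, D)
```

Note how a single large ratio can dominate the mean, which is exactly the source of the estimator's variance on small datasets.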
- class hopes.ope.estimators.SelfNormalizedInverseProbabilityWeighting#
Bases: InverseProbabilityWeighting

Self-Normalized Inverse Probability Weighting (SNIPW) estimator.
\(V_{SNIPW}(\pi_e, D)= \frac {\sum_{t=1}^n p(s_t,a_t) r_t}{\sum_{t=1}^n p(s_t,a_t)}\)
- Where:
\(D\) is the offline collected dataset.
\(p(s_t,a_t)\) is the importance weight defined as \(p(s_t,a_t)=\frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}\).
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(r_t\) is the reward at time \(t\).
\(n\) is the number of samples.
References
https://papers.nips.cc/paper_files/paper/2015/hash/39027dfad5138c9ca0c474d71db915c3-Abstract.html
- estimate_policy_value() float#
Estimate the value of the target policy using the SNIPW estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the SNIPW estimator.
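Continuing the same toy setup as the IPW sketch, self-normalization replaces the \(\frac{1}{n}\) factor with the sum of the importance weights (all numbers hypothetical):

```python
import numpy as np

# Same toy logged data as in the IPW sketch.
target_probs = np.array([0.9, 0.2, 0.7, 0.5])    # pi_e(a_t|s_t)
behavior_probs = np.array([0.5, 0.5, 0.5, 0.5])  # pi_b(a_t|s_t)
rewards = np.array([1.0, 0.0, 1.0, 1.0])

weights = target_probs / behavior_probs
# Dividing by the sum of weights (instead of n) keeps the estimate inside
# the observed reward range, trading a small bias for lower variance.
v_snipw = np.sum(weights * rewards) / np.sum(weights)
```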
- class hopes.ope.estimators.TrajectoryWiseImportanceSampling(steps_per_episode: int, discount_factor: float = 1.0)#
Bases: BaseEstimator, TrajectoryPerDecisionMixin

Trajectory-wise Importance Sampling (TIS) estimator.
\(V_{TIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^ n\sum_{t=0}^{T-1} \gamma^t w^{(i)}_{0:T-1} r_t^{(i)}\)
Where:
\(D\) is the offline collected dataset.
\(w^{(i)}_{0:T-1}\) is the importance weight of the trajectory \(i\) defined as \(w_{0:T-1} = \prod_{t=0}^{T-1} \frac {\pi_e(a_t|s_t)} {\pi_b(a_t|s_t)}\)
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(n\) is the number of trajectories.
\(T\) is the length of the trajectory.
\(\gamma\) is the discount factor (applied as \(\gamma^t\) at time \(t\)).
\(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).
TIS can suffer from high variance due to the product of importance weights over the trajectory, especially when trajectories are long or the action space is large.
References
https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
- check_parameters() None#
Check if the estimator parameters are valid.
- estimate_policy_value() float#
Estimate the value of the target policy using the Trajectory-wise Importance Sampling estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the Trajectory-wise Importance Sampling estimator.
- Returns:
the weighted rewards; here, the estimated policy value per trajectory.
- short_name() str#
Return the short name of the estimator.
This method can be overridden by subclasses to customize the short name.
- Returns:
the short name of the estimator. By default, it returns the abbreviation of the class name, i.e. “IPW”.
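A minimal NumPy sketch of the trajectory-wise computation above, using a hypothetical batch of step-wise probability ratios and rewards (toy numbers, not library output):

```python
import numpy as np

# Hypothetical batch: n=2 trajectories of T=3 steps each, shape (n, T).
ratios = np.array([[1.2, 0.8, 1.0],     # pi_e / pi_b at each step
                   [0.5, 2.0, 1.0]])
rewards = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0]])
gamma = 0.9

# One weight per trajectory: the product of its step-wise ratios.
w_traj = np.prod(ratios, axis=1, keepdims=True)   # w_{0:T-1}, shape (n, 1)
discounts = gamma ** np.arange(ratios.shape[1])   # gamma^t
v_tis = np.mean(np.sum(discounts * w_traj * rewards, axis=1))
```

The `np.prod` over the time axis is the product operation responsible for the variance blow-up on long trajectories.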
- class hopes.ope.estimators.SelfNormalizedTrajectoryWiseImportanceSampling(steps_per_episode: int, discount_factor: float = 1.0)#
Bases: TrajectoryWiseImportanceSampling

Self-Normalized Trajectory-wise Importance Sampling (SNTIS) estimator.

\[V_{SNTIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac {w^{(i)}_{0:T-1}} {\frac {1}{n} \sum_{j=1}^n w^{(j)}_{0:T-1}} r_t^{(i)}\]

Where:
\(D\) is the offline collected dataset.
\(w^{(i)}_{0:T-1}\) is the importance weight of the trajectory \(i\) defined as \(w_{0:T-1} = \prod_{t=0}^{T-1} \frac {\pi_e(a_t|s_t)} {\pi_b(a_t|s_t)}\)
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(n\) is the number of trajectories.
\(T\) is the length of the trajectory.
\(\gamma\) is the discount factor (applied as \(\gamma^t\) at time \(t\)).
\(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).
SNTIS is a variance reduction technique for TIS. It divides the weighted rewards by the mean of the importance weights of the trajectories.
References
https://arxiv.org/abs/1906.03735
- normalize(weights: ndarray) ndarray#
Normalize the importance weights using the self-normalization strategy.
It uses self-normalization to reduce the variance of the estimator, using the mean of the importance weights over the trajectories.
- Parameters:
weights – the importance weights to normalize.
- Returns:
the normalized importance weights.
- short_name() str#
Return the short name of the estimator.
This method can be overridden by subclasses to customize the short name.
- Returns:
the short name of the estimator. By default, it returns the abbreviation of the class name, i.e. “IPW”.
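The normalization step described above can be sketched as follows, with hypothetical trajectory weights:

```python
import numpy as np

# Hypothetical trajectory weights w_{0:T-1} for four trajectories.
w = np.array([0.96, 1.0, 2.4, 0.2])

# Self-normalization: divide by the mean weight across trajectories.
w_norm = w / w.mean()
```

After normalization the weights average to 1, so no single trajectory can dominate the \(\frac{1}{n}\) average as easily.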
- class hopes.ope.estimators.PerDecisionImportanceSampling(steps_per_episode: int, discount_factor: float = 1.0)#
Bases: BaseEstimator, TrajectoryPerDecisionMixin

Per-Decision Importance Sampling (PDIS) estimator.
\(V_{PDIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w^{(i)}_{t} r_t^{(i)}\)
Where:
\(D\) is the offline collected dataset.
\(w^{(i)}_{t}\) is the per-decision importance weight at time \(t\) of trajectory \(i\), defined as the cumulative product \(w_{t} = \prod_{k=0}^{t} \frac {\pi_e(a_k|s_k)} {\pi_b(a_k|s_k)}\)
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(n\) is the number of trajectories.
\(T\) is the length of the trajectory.
\(\gamma\) is the discount factor (applied as \(\gamma^t\) at time \(t\)).
\(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).
References
https://arxiv.org/abs/1906.03735
- check_parameters() None#
Check if the estimator parameters are valid.
- estimate_policy_value() float#
Estimate the value of the target policy using the Per-Decision Importance Sampling estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the Per-Decision Importance Sampling estimator.
- Returns:
the weighted rewards; here, the estimated policy value per trajectory.
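A sketch of the per-decision computation, assuming the per-decision weights are the cumulative products of the step-wise ratios (toy numbers, not library output):

```python
import numpy as np

# Hypothetical batch: 2 trajectories of T=3 steps, shape (n, T).
ratios = np.array([[1.2, 0.8, 1.0],     # pi_e / pi_b at each step
                   [0.5, 2.0, 1.0]])
rewards = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0]])
gamma = 0.9

w = np.cumprod(ratios, axis=1)                  # cumulative weights w_{0:t}
discounts = gamma ** np.arange(ratios.shape[1])
v_pdis = np.mean(np.sum(discounts * w * rewards, axis=1))
```

Unlike TIS, each reward is weighted only by the ratios accumulated up to its own timestep, which generally lowers variance.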
- class hopes.ope.estimators.SelfNormalizedPerDecisionImportanceSampling(*, steps_per_episode: int, discount_factor: float = 1.0, normalization: str = 'per_timestep', eps: float = 1e-12)#
Bases: PerDecisionImportanceSampling

Self-Normalized Per-Decision Importance Sampling (SNPDIS) estimator.

\[V_{SNPDIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac {w^{(i)}_{t}} {\frac {1}{n} \sum_{j=1}^n w^{(j)}_{t}} r_t^{(i)}\]

Where:
\(D\) is the offline collected dataset.
\(w^{(i)}_{t}\) is the per-decision importance weight at time \(t\) of trajectory \(i\), defined as the cumulative product \(w_{t} = \prod_{k=0}^{t} \frac {\pi_e(a_k|s_k)} {\pi_b(a_k|s_k)}\)
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(n\) is the number of trajectories.
\(T\) is the length of the trajectory.
\(\gamma\) is the discount factor (applied as \(\gamma^t\) at time \(t\)).
\(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).
SNPDIS is a variance reduction technique for PDIS.
References
https://arxiv.org/abs/1906.03735
- check_parameters() None#
Check if the estimator parameters are valid.
- estimate_policy_value() float#
Estimate the value of the target policy using the Self-Normalized Per-Decision Importance Sampling estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the Self-Normalized Per-Decision Importance Sampling estimator.
- Returns:
the weighted rewards; here, the estimated policy value per trajectory.
- normalize(weights: ndarray) ndarray#
Normalize the importance weights using the self-normalization strategy.
It uses self-normalization to reduce the variance of the estimator, using the mean of the importance weights at each timestep across trajectories (per-timestep normalization).
- Parameters:
weights – the importance weights to normalize.
- Returns:
the normalized importance weights.
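Assuming the default `'per_timestep'` normalization divides each timestep's weights by their mean across trajectories (a schematic reading of the constructor parameters, not confirmed internals), the step can be sketched as:

```python
import numpy as np

eps = 1e-12  # guards against division by zero, as in the constructor default
# Hypothetical cumulative per-decision weights, shape (n_trajectories, T).
w = np.array([[1.2, 0.96],
              [0.5, 1.0]])

# Schematic 'per_timestep' normalization: divide each timestep's weights
# by their mean across trajectories.
w_norm = w / (w.mean(axis=0, keepdims=True) + eps)
```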
- class hopes.ope.estimators.DirectMethod(q_model_cls: type[RegressionBasedRewardModel], behavior_policy_obs: ndarray, behavior_policy_act: ndarray, behavior_policy_rewards: ndarray, steps_per_episode: int, discount_factor: float = 1.0, q_model_type: str = 'random_forest', q_model_params: dict | None = None)#
Bases: BaseEstimator

Direct Method (DM) estimator.
\(V_{DM}(\pi_e, D, Q)=\frac {1}{n} \sum_{i=1}^n \sum_{a \in A} \pi_e(a|s^i_0) Q(s^i_0, a)\)
- Where:
\(D = \{\{ (s_t, a_t, r_t) \}^{T-1}_{t=0}\}^n_{i=1}\) is the offline collected dataset consisting of \(n\) trajectories.
\(\pi_e\) is the target policy.
\(Q(s^i_0, a)\) is the Q model trained to estimate the expected discounted sum of rewards from the initial state \(s^i_0\) when taking action \(a\) under the behavior policy.
\(n\) is the number of episodes/trajectories.
\(a\) is the action taken in the set of actions \(A\).
\(s^i_0\) is the initial state of the i-th trajectory.
This estimator trains a Q model using supervised learning on initial states and their corresponding discounted cumulative returns, then uses it to estimate the expected value under the target policy. The performance of this estimator depends on the quality of the Q model.
- check_parameters() None#
Check if the estimator parameters are valid.
Base estimator checks plus additional checks for the Q model.
- estimate_policy_value() float#
Estimate the value of the target policy using the Direct Method estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the Direct Method estimator.
For each episode \(i\), computes: \(V(s^i_0) = \sum_{a \in A} \pi_e(a|s^i_0) Q(s^i_0, a)\)
Where \(Q(s^i_0, a)\) is the predicted discounted cumulative return from the initial state \(s^i_0\).
- Returns:
the estimated values for each episode/trajectory.
- fit() dict[str, float] | None#
Fit the Q model to estimate the expected discounted sum of rewards from the initial state.
The Q model is trained on (initial_state, initial_action) pairs with their corresponding discounted cumulative returns computed as:
\(G_0 = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots + \gamma^{T-1} r_{T-1}\)
- Returns:
the fit statistics of the Q model.
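Once a Q model is fitted, the DM value reduces to a policy-weighted average of Q-values at the initial states. A sketch in NumPy, where the Q-values below are placeholders standing in for a fitted model's predictions (not real model output):

```python
import numpy as np

# Placeholder Q-values for 3 initial states and 2 actions, standing in
# for the predictions of a fitted regression model.
q_values = np.array([[1.0, 0.5],
                     [0.2, 0.8],
                     [0.6, 0.6]])
# Target policy action probabilities at the same initial states.
pi_e = np.array([[0.7, 0.3],
                 [0.4, 0.6],
                 [0.5, 0.5]])

per_episode = np.sum(pi_e * q_values, axis=1)  # sum_a pi_e(a|s_0) Q(s_0, a)
v_dm = per_episode.mean()                      # V_DM
```

Since no importance weights appear, the estimate has low variance but inherits any bias of the Q model.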
- class hopes.ope.estimators.SequentialDoublyRobust(*, steps_per_episode: int, discount_factor: float = 1.0, eps: float = 1e-12, clip: float | None = None)#
Bases: BaseEstimator

Sequential Doubly Robust estimator.
This estimator computes a per-decision doubly robust estimate using a temporal-difference-style formulation. It combines model-based predictions with cumulative importance weights built from behavior and target policy action probabilities.
The per-episode estimate is computed as:
\[\hat{V}_{\mathrm{DR}}^{(i)} = \hat{V}(s_{i,0}) + \sum_{t=0}^{T-1} W_{i,t} \left( r_{i,t} + \gamma \hat{V}(s_{i,t+1}) - \hat{Q}(s_{i,t}, a_{i,t}) \right)\]

where:

\[W_{i,t} = \prod_{k=0}^{t} \rho_{i,k}\]

and

\[\rho_{i,t} = \frac{\pi_e(a_{i,t} \mid s_{i,t})}{\pi_b(a_{i,t} \mid s_{i,t})}\]

with:
\(i\) denoting the episode index,
\(t\) denoting the timestep index,
\(r_{i,t}\) the observed reward at timestep \(t\),
\(\hat{Q}(s_{i,t}, a_{i,t})\) the estimated action-value for the logged action,
\(\hat{V}(s_{i,t})\) the estimated state value under the target policy,
\(\gamma\) the discount factor,
\(\pi_e\) the target policy,
\(\pi_b\) the behavior policy.
If precomputed step-wise importance ratios are provided, they are used directly. Otherwise, the ratios are constructed from the target and behavior policy action probabilities and the logged actions.
Stickiness handling, when needed, must be applied upstream during preprocessing.

References

https://arxiv.org/abs/1511.03722
- check_parameters() None#
Check if the estimator parameters are valid.
- estimate_policy_value() float#
Estimate the value of the target policy.
- estimate_weighted_rewards() ndarray#
Estimate episode-level sequential DR contributions.
- Returns:
Episode-level DR estimates, shape (n_episodes, 1).
- fit(*, obs_flat: ndarray, act_flat: ndarray, rew_flat: ndarray, num_actions: int, q_model_params: dict | None = None, random_state: int = 0) RTGQModelHGBoost#
Fit an internal RTGQModelHGBoost Q model from raw logged data, then populate q_values and logged_actions automatically.

This mirrors the design of DirectMethod.fit(): rather than building and fitting the Q model externally and injecting the predictions via set_model_predictions(), you can pass the raw trajectory data directly and let the estimator handle the model training. The two workflows remain interchangeable: set_model_predictions() is still available for cases where Q-values come from an external source.

- Parameters:
obs_flat – Observations, shape (n_samples, obs_dim).
act_flat – Discrete action indices, shape (n_samples,).
rew_flat – Rewards, shape (n_samples,).
num_actions – Total number of discrete actions.
q_model_params – Optional hyper-parameters forwarded to RTGQModelHGBoost.
random_state – Random seed for the underlying gradient-boosting model.
- Returns:
The fitted RTGQModelHGBoost instance.
- set_logged_actions(logged_actions: ndarray) None#
Set logged actions.
- Parameters:
logged_actions – Logged action indices with shape (n_samples,).
- set_model_predictions(*, q_values: ndarray) None#
Set model-based predictions used by the sequential DR estimator.
This can be used to inject Q-value predictions from an external model, or to set the predictions after fitting an internal model via fit(). The provided Q-values must be aligned with the logged data: each row holds the estimated action-values \([\hat{Q}(s_t, a)]_{a \in \mathcal{A}}\) for the state at the same index in the logged data.

- Parameters:
q_values – Estimated action-values for all actions, shape (n_samples, n_actions). Each row must contain the estimated action-values for the corresponding state.
- short_name() str#
Return the short name of the estimator.
This method can be overridden by subclasses to customize the short name.
- Returns:
the short name of the estimator. By default, it returns the abbreviation of the class name, i.e. “IPW”.
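The per-episode DR estimate defined above can be sketched for a single toy episode; the \(\hat{Q}\) and \(\hat{V}\) arrays below are hypothetical values standing in for a fitted model's predictions:

```python
import numpy as np

# One hypothetical episode with T=3 steps; all numbers are toy values.
rho = np.array([1.2, 0.8, 1.0])         # per-step ratios pi_e / pi_b
rewards = np.array([1.0, 0.0, 1.0])     # r_t
q_hat = np.array([0.9, 0.4, 0.8])       # Q-hat(s_t, a_t) for logged actions
v_hat = np.array([0.8, 0.5, 0.7, 0.0])  # V-hat(s_t), terminal value 0
gamma = 0.9

w = np.cumprod(rho)                      # W_t = prod_{k<=t} rho_k
# TD-style corrections, weighted by the cumulative importance ratios.
td = rewards + gamma * v_hat[1:] - q_hat
v_dr = v_hat[0] + np.sum(w * td)
```

When the model is accurate the TD terms shrink toward zero, so the importance weights matter less; when the model is poor, the weighted corrections keep the estimate unbiased.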
Implementing a new estimator#
To implement a new estimator, you need to subclass hopes.ope.estimators.BaseEstimator and implement:
hopes.ope.estimators.BaseEstimator.estimate_weighted_rewards(): it should return the estimated weighted rewards.
hopes.ope.estimators.BaseEstimator.estimate_policy_value(): it should return the estimated value of the target policy, typically computed from the estimated weighted rewards.
Optionally, you can implement hopes.ope.estimators.BaseEstimator.short_name() to provide a short name for the estimator.
When not implemented, the uppercase letters of the class name are used.
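As an illustration, a minimal estimator sketch. The BaseEstimator stub and the attribute names used here (target_policy_action_probabilities, etc.) are schematic assumptions made so the example runs standalone; they are not the library's actual internals:

```python
import numpy as np
from abc import ABC, abstractmethod

# Schematic stand-in for hopes.ope.estimators.BaseEstimator; the real base
# class provides the logged probabilities and rewards through its own API.
class BaseEstimator(ABC):
    target_policy_action_probabilities: np.ndarray
    behavior_policy_action_probabilities: np.ndarray
    rewards: np.ndarray

    @abstractmethod
    def estimate_weighted_rewards(self) -> np.ndarray: ...

    @abstractmethod
    def estimate_policy_value(self) -> float: ...

    def short_name(self) -> str:
        # Default short name: the uppercase letters of the class name.
        return "".join(c for c in type(self).__name__ if c.isupper())

# A new estimator only needs the two abstract methods.
class MyNewEstimator(BaseEstimator):
    def estimate_weighted_rewards(self) -> np.ndarray:
        w = (self.target_policy_action_probabilities
             / self.behavior_policy_action_probabilities)
        return w * self.rewards

    def estimate_policy_value(self) -> float:
        return float(self.estimate_weighted_rewards().mean())
```

With no short_name() override, the default yields “MNE” for this class.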
Below is the BaseEstimator class documentation.
- class hopes.ope.estimators.BaseEstimator#
Base class for all estimators.
- abstract estimate_policy_value() float#
Estimate the value of the target policy.
This method should be overridden by subclasses to implement the specific estimator. The typical implementation should call estimate_weighted_rewards() to compute the weighted rewards, then compute the policy value.

- Returns:
the estimated value of the target policy.
- abstract estimate_weighted_rewards() ndarray#
Estimate the weighted rewards.
This method should be overridden by subclasses to implement the specific estimator.
- Returns:
the weighted rewards.
- short_name() str#
Return the short name of the estimator.
This method can be overridden by subclasses to customize the short name.
- Returns:
the short name of the estimator. By default, it returns the abbreviation of the class name, i.e. “IPW”.