Hopes: Estimators#
Roadmap#
[x] Implement Inverse Probability Weighting (IPW) estimator
[x] Implement Self-Normalized Inverse Probability Weighting (SNIPW) estimator
[x] Implement Direct Method (DM) estimator
[x] Implement Trajectory-wise Importance Sampling (TIS) estimator
[x] Implement Self-Normalized Trajectory-wise Importance Sampling (SNTIS) estimator
[x] Implement Per-Decision Importance Sampling (PDIS) estimator
[x] Implement Self-Normalized Per-Decision Importance Sampling (SNPDIS) estimator
[x] Implement Doubly Robust (DR) estimator
Implemented estimators#
Currently, the following estimators are implemented:
BaseEstimator: Base class for all estimators.
InverseProbabilityWeighting: Inverse Probability Weighting (IPW) estimator.
SelfNormalizedInverseProbabilityWeighting: Self-Normalized Inverse Probability Weighting (SNIPW) estimator.
DirectMethod: Direct Method (DM) estimator.
TrajectoryWiseImportanceSampling: Trajectory-wise Importance Sampling (TIS) estimator.
SelfNormalizedTrajectoryWiseImportanceSampling: Self-Normalized Trajectory-wise Importance Sampling (SNTIS) estimator.
PerDecisionImportanceSampling: Per-Decision Importance Sampling (PDIS) estimator.
SelfNormalizedPerDecisionImportanceSampling: Self-Normalized Per-Decision Importance Sampling (SNPDIS) estimator.
SequentialDoublyRobust: Sequential Doubly Robust (DR) estimator.
Estimators documentation#
- class hopes.ope.estimators.InverseProbabilityWeighting#
Bases: BaseEstimator

Inverse Probability Weighting (IPW) estimator.
\(V_{IPW}(\pi_e, D)=\frac {1}{n} \sum_{t=1}^n p(s_t,a_t) r_t\)
- Where:
\(D\) is the offline collected dataset.
\(p(s_t,a_t)\) is the importance weight defined as \(p(s_t,a_t)=\frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}\).
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(r_t\) is the reward observed at time \(t\) under the behavior policy.
\(n\) is the number of samples.
This estimator has generally high variance, especially on small datasets, and can be improved by using self-normalized importance weights.
References
https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
- estimate_policy_value() float#
Estimate the value of the target policy using the IPW estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the IPW estimator.
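As a sketch of the formula above, the IPW estimate can be computed directly with NumPy. The probability and reward arrays below are hypothetical toy values, not produced by the library:

```python
import numpy as np

# Toy logged data: per-sample action probabilities under each policy.
target_probs = np.array([0.9, 0.2, 0.7, 0.5])    # pi_e(a_t|s_t)
behavior_probs = np.array([0.5, 0.5, 0.5, 0.5])  # pi_b(a_t|s_t)
rewards = np.array([1.0, 0.0, 1.0, 1.0])         # r_t

weights = target_probs / behavior_probs  # importance weights p(s_t, a_t)
v_ipw = np.mean(weights * rewards)       # V_IPW(pi_e, D)
```

Note how a single large ratio can dominate the mean, which is exactly the source of the estimator's variance on small datasets.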
- class hopes.ope.estimators.SelfNormalizedInverseProbabilityWeighting#
Bases: InverseProbabilityWeighting

Self-Normalized Inverse Probability Weighting (SNIPW) estimator.
\(V_{SNIPW}(\pi_e, D)= \frac {\sum_{t=1}^n p(s_t,a_t) r_t}{\sum_{t=1}^n p(s_t,a_t)}\)
- Where:
\(D\) is the offline collected dataset.
\(p(s_t,a_t)\) is the importance weight defined as \(p(s_t,a_t)=\frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}\).
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(r_t\) is the reward at time \(t\).
\(n\) is the number of samples.
References
https://papers.nips.cc/paper_files/paper/2015/hash/39027dfad5138c9ca0c474d71db915c3-Abstract.html
- estimate_policy_value() float#
Estimate the value of the target policy using the SNIPW estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the SNIPW estimator.
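Continuing the same toy setup as the IPW sketch, self-normalization replaces the \(\frac{1}{n}\) factor with the sum of the importance weights (all numbers hypothetical):

```python
import numpy as np

# Same toy logged data as in the IPW sketch.
target_probs = np.array([0.9, 0.2, 0.7, 0.5])    # pi_e(a_t|s_t)
behavior_probs = np.array([0.5, 0.5, 0.5, 0.5])  # pi_b(a_t|s_t)
rewards = np.array([1.0, 0.0, 1.0, 1.0])

weights = target_probs / behavior_probs
# Dividing by the sum of weights (instead of n) keeps the estimate inside
# the observed reward range, trading a small bias for lower variance.
v_snipw = np.sum(weights * rewards) / np.sum(weights)
```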
- class hopes.ope.estimators.TrajectoryWiseImportanceSampling(steps_per_episode: int, discount_factor: float = 1.0)#
Bases: BaseEstimator, TrajectoryPerDecisionMixin

Trajectory-wise Importance Sampling (TIS) estimator.
\(V_{TIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^ n\sum_{t=0}^{T-1} \gamma^t w^{(i)}_{0:T-1} r_t^{(i)}\)
Where:
\(D\) is the offline collected dataset.
\(w^{(i)}_{0:T-1}\) is the importance weight of the trajectory \(i\) defined as \(w_{0:T-1} = \prod_{t=0}^{T-1} \frac {\pi_e(a_t|s_t)} {\pi_b(a_t|s_t)}\)
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(n\) is the number of trajectories.
\(T\) is the length of the trajectory.
\(\gamma\) is the discount factor (applied as \(\gamma^t\) at time \(t\)).
\(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).
TIS can suffer from high variance due to the product of importance weights over the trajectory, especially when trajectories are long or the action space is large.
References
https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
- check_parameters() None#
Check if the estimator parameters are valid.
- estimate_policy_value() float#
Estimate the value of the target policy using the Trajectory-wise Importance Sampling estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the Trajectory-wise Importance Sampling estimator.
- Returns:
the weighted rewards; here, the estimated policy value per trajectory.
- short_name() str#
Return the short name of the estimator.
This method can be overridden by subclasses to customize the short name.
- Returns:
the short name of the estimator. By default, it returns the abbreviation of the class name, i.e. “IPW”.
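A minimal NumPy sketch of the trajectory-wise computation above, using a hypothetical batch of step-wise probability ratios and rewards (toy numbers, not library output):

```python
import numpy as np

# Hypothetical batch: n=2 trajectories of T=3 steps each, shape (n, T).
ratios = np.array([[1.2, 0.8, 1.0],     # pi_e / pi_b at each step
                   [0.5, 2.0, 1.0]])
rewards = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0]])
gamma = 0.9

# One weight per trajectory: the product of its step-wise ratios.
w_traj = np.prod(ratios, axis=1, keepdims=True)   # w_{0:T-1}, shape (n, 1)
discounts = gamma ** np.arange(ratios.shape[1])   # gamma^t
v_tis = np.mean(np.sum(discounts * w_traj * rewards, axis=1))
```

The `np.prod` over the time axis is the product operation responsible for the variance blow-up on long trajectories.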
- class hopes.ope.estimators.SelfNormalizedTrajectoryWiseImportanceSampling(steps_per_episode: int, discount_factor: float = 1.0)#
Bases: TrajectoryWiseImportanceSampling

Self-Normalized Trajectory-wise Importance Sampling (SNTIS) estimator.

\[V_{SNTIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac {w^{(i)}_{0:T-1}} {\frac {1}{n} \sum_{j=1}^n w^{(j)}_{0:T-1}} r_t^{(i)}\]

Where:
\(D\) is the offline collected dataset.
\(w^{(i)}_{0:T-1}\) is the importance weight of the trajectory \(i\) defined as \(w_{0:T-1} = \prod_{t=0}^{T-1} \frac {\pi_e(a_t|s_t)} {\pi_b(a_t|s_t)}\)
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(n\) is the number of trajectories.
\(T\) is the length of the trajectory.
\(\gamma\) is the discount factor (applied as \(\gamma^t\) at time \(t\)).
\(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).
SNTIS is a variance reduction technique for TIS. It divides the weighted rewards by the mean of the importance weights of the trajectories.
References
https://arxiv.org/abs/1906.03735
- normalize(weights: ndarray) ndarray#
Normalize the importance weights using the self-normalization strategy.
It uses self-normalization to reduce the variance of the estimator, using the mean of the importance weights over the trajectories.
- Parameters:
weights – the importance weights to normalize.
- Returns:
the normalized importance weights.
- short_name() str#
Return the short name of the estimator.
This method can be overridden by subclasses to customize the short name.
- Returns:
the short name of the estimator. By default, it returns the abbreviation of the class name, i.e. “IPW”.
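The normalization step described above can be sketched as follows, with hypothetical trajectory weights:

```python
import numpy as np

# Hypothetical trajectory weights w_{0:T-1} for four trajectories.
w = np.array([0.96, 1.0, 2.4, 0.2])

# Self-normalization: divide by the mean weight across trajectories.
w_norm = w / w.mean()
```

After normalization the weights average to 1, so no single trajectory can dominate the \(\frac{1}{n}\) average as easily.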
- class hopes.ope.estimators.PerDecisionImportanceSampling(steps_per_episode: int, discount_factor: float = 1.0)#
Bases: BaseEstimator, TrajectoryPerDecisionMixin

Per-Decision Importance Sampling (PDIS) estimator.
\(V_{PDIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w^{(i)}_{t} r_t^{(i)}\)
Where:
\(D\) is the offline collected dataset.
\(w^{(i)}_{t}\) is the per-decision importance weight at time \(t\) of trajectory \(i\), defined as the cumulative product \(w_{t} = \prod_{k=0}^{t} \frac {\pi_e(a_k|s_k)} {\pi_b(a_k|s_k)}\)
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(n\) is the number of trajectories.
\(T\) is the length of the trajectory.
\(\gamma\) is the discount factor (applied as \(\gamma^t\) at time \(t\)).
\(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).
References
https://arxiv.org/abs/1906.03735
- check_parameters() None#
Check if the estimator parameters are valid.
- estimate_policy_value() float#
Estimate the value of the target policy using the Per-Decision Importance Sampling estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the Per-Decision Importance Sampling estimator.
- Returns:
the weighted rewards; here, the estimated policy value per trajectory.
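A sketch of the per-decision computation, assuming the per-decision weights are the cumulative products of the step-wise ratios (toy numbers, not library output):

```python
import numpy as np

# Hypothetical batch: 2 trajectories of T=3 steps, shape (n, T).
ratios = np.array([[1.2, 0.8, 1.0],     # pi_e / pi_b at each step
                   [0.5, 2.0, 1.0]])
rewards = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0]])
gamma = 0.9

w = np.cumprod(ratios, axis=1)                  # cumulative weights w_{0:t}
discounts = gamma ** np.arange(ratios.shape[1])
v_pdis = np.mean(np.sum(discounts * w * rewards, axis=1))
```

Unlike TIS, each reward is weighted only by the ratios accumulated up to its own timestep, which generally lowers variance.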
- class hopes.ope.estimators.SelfNormalizedPerDecisionImportanceSampling(*, steps_per_episode: int, discount_factor: float = 1.0, normalization: str = 'per_timestep', eps: float = 1e-12)#
Bases: PerDecisionImportanceSampling

Self-Normalized Per-Decision Importance Sampling (SNPDIS) estimator.

\[V_{SNPDIS} (\pi_e, D) = \frac {1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac {w^{(i)}_{t}} {\frac {1}{n} \sum_{j=1}^n w^{(j)}_{t}} r_t^{(i)}\]

Where:
\(D\) is the offline collected dataset.
\(w^{(i)}_{t}\) is the per-decision importance weight at time \(t\) of trajectory \(i\), defined as the cumulative product \(w_{t} = \prod_{k=0}^{t} \frac {\pi_e(a_k|s_k)} {\pi_b(a_k|s_k)}\)
\(\pi_e\) is the target policy and \(\pi_b\) is the behavior policy.
\(n\) is the number of trajectories.
\(T\) is the length of the trajectory.
\(\gamma\) is the discount factor (applied as \(\gamma^t\) at time \(t\)).
\(r_t^{(i)}\) is the reward at time \(t\) of trajectory \(i\).
SNPDIS is a variance reduction technique for PDIS.
References
https://arxiv.org/abs/1906.03735
- check_parameters() None#
Check if the estimator parameters are valid.
- estimate_policy_value() float#
Estimate the value of the target policy using the Self-Normalized Per-Decision Importance Sampling estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the Self-Normalized Per-Decision Importance Sampling estimator.
- Returns:
the weighted rewards; here, the estimated policy value per trajectory.
- normalize(weights: ndarray) ndarray#
Normalize the importance weights using the self-normalization strategy.
It uses self-normalization to reduce the variance of the estimator, using the mean of the importance weights at each timestep across trajectories (per-timestep normalization).
- Parameters:
weights – the importance weights to normalize.
- Returns:
the normalized importance weights.
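Assuming the default `'per_timestep'` normalization divides each timestep's weights by their mean across trajectories (a schematic reading of the constructor parameters, not confirmed internals), the step can be sketched as:

```python
import numpy as np

eps = 1e-12  # guards against division by zero, as in the constructor default
# Hypothetical cumulative per-decision weights, shape (n_trajectories, T).
w = np.array([[1.2, 0.96],
              [0.5, 1.0]])

# Schematic 'per_timestep' normalization: divide each timestep's weights
# by their mean across trajectories.
w_norm = w / (w.mean(axis=0, keepdims=True) + eps)
```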
- class hopes.ope.estimators.DirectMethod(q_model_cls: type[RegressionBasedRewardModel], behavior_policy_obs: ndarray, behavior_policy_act: ndarray, behavior_policy_rewards: ndarray, steps_per_episode: int, discount_factor: float = 1.0, q_model_type: str = 'random_forest', q_model_params: dict | None = None)#
Bases: BaseEstimator

Direct Method (DM) estimator.
\(V_{DM}(\pi_e, D, Q)=\frac {1}{n} \sum_{i=1}^n \sum_{a \in A} \pi_e(a|s^i_0) Q(s^i_0, a)\)
- Where:
\(D = \{\{ (s_t, a_t, r_t) \}^{T-1}_{t=0}\}^n_{i=1}\) is the offline collected dataset consisting of \(n\) trajectories.
\(\pi_e\) is the target policy.
\(Q(s^i_0, a)\) is the Q model trained to estimate the expected discounted sum of rewards from the initial state \(s^i_0\) when taking action \(a\) under the behavior policy.
\(n\) is the number of episodes/trajectories.
\(a\) is the action taken in the set of actions \(A\).
\(s^i_0\) is the initial state of the i-th trajectory.
This estimator trains a Q model using supervised learning on initial states and their corresponding discounted cumulative returns, then uses it to estimate the expected value under the target policy. The performance of this estimator depends on the quality of the Q model.
- check_parameters() None#
Check if the estimator parameters are valid.
Base estimator checks plus additional checks for the Q model.
- estimate_policy_value() float#
Estimate the value of the target policy using the Direct Method estimator.
- estimate_weighted_rewards() ndarray#
Estimate the weighted rewards using the Direct Method estimator.
For each episode \(i\), computes: \(V(s^i_0) = \sum_{a \in A} \pi_e(a|s^i_0) Q(s^i_0, a)\)
Where \(Q(s^i_0, a)\) is the predicted discounted cumulative return from the initial state \(s^i_0\).
- Returns:
the estimated values for each episode/trajectory.
- fit() dict[str, float] | None#
Fit the Q model to estimate the expected discounted sum of rewards from the initial state.
The Q model is trained on (initial_state, initial_action) pairs with their corresponding discounted cumulative returns computed as:
\(G_0 = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots + \gamma^{T-1} r_{T-1}\)
- Returns:
the fit statistics of the Q model.
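Once a Q model is fitted, the DM value reduces to a policy-weighted average of Q-values at the initial states. A sketch in NumPy, where the Q-values below are placeholders standing in for a fitted model's predictions (not real model output):

```python
import numpy as np

# Placeholder Q-values for 3 initial states and 2 actions, standing in
# for the predictions of a fitted regression model.
q_values = np.array([[1.0, 0.5],
                     [0.2, 0.8],
                     [0.6, 0.6]])
# Target policy action probabilities at the same initial states.
pi_e = np.array([[0.7, 0.3],
                 [0.4, 0.6],
                 [0.5, 0.5]])

per_episode = np.sum(pi_e * q_values, axis=1)  # sum_a pi_e(a|s_0) Q(s_0, a)
v_dm = per_episode.mean()                      # V_DM
```

Since no importance weights appear, the estimate has low variance but inherits any bias of the Q model.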
- class hopes.ope.estimators.SequentialDoublyRobust(*, steps_per_episode: int, discount_factor: float = 1.0, eps: float = 1e-12, clip: float | None = None)#
Bases: BaseEstimator

Sequential Doubly Robust estimator.
This estimator computes a per-decision doubly robust estimate using a temporal-difference-style formulation. It combines model-based predictions with cumulative importance weights built from behavior and target policy action probabilities.
The per-episode estimate is computed as:
\[\hat{V}_{\mathrm{DR}}^{(i)} = \hat{V}(s_{i,0}) + \sum_{t=0}^{T-1} W_{i,t} \left( r_{i,t} + \gamma \hat{V}(s_{i,t+1}) - \hat{Q}(s_{i,t}, a_{i,t}) \right)\]

where:

\[W_{i,t} = \prod_{k=0}^{t} \rho_{i,k}\]

and

\[\rho_{i,t} = \frac{\pi_e(a_{i,t} \mid s_{i,t})}{\pi_b(a_{i,t} \mid s_{i,t})}\]

with:
\(i\) denoting the episode index,
\(t\) denoting the timestep index,
\(r_{i,t}\) the observed reward at timestep \(t\),
\(\hat{Q}(s_{i,t}, a_{i,t})\) the estimated action-value for the logged action,
\(\hat{V}(s_{i,t})\) the estimated state value under the target policy,
\(\gamma\) the discount factor,
\(\pi_e\) the target policy,
\(\pi_b\) the behavior policy.
If precomputed step-wise importance ratios are provided, they are used directly. Otherwise, the ratios are constructed from the target and behavior policy action probabilities and the logged actions.
Stickiness handling, when needed, must be applied upstream during preprocessing.

References

https://arxiv.org/abs/1511.03722
- check_parameters() None#
Check if the estimator parameters are valid.
- estimate_policy_value() float#
Estimate the value of the target policy.
- estimate_weighted_rewards() ndarray#
Estimate episode-level sequential DR contributions.
- Returns:
Episode-level DR estimates, shape (n_episodes, 1).
- fit(*, obs_flat: ndarray, act_flat: ndarray, rew_flat: ndarray, num_actions: int, q_model_params: dict | None = None, random_state: int = 0) RTGQModelHGBoost#
Fit an internal RTGQModelHGBoost Q model from raw logged data, then populate q_values and logged_actions automatically.

This mirrors the design of DirectMethod.fit(): rather than building and fitting the Q model externally and injecting the predictions via set_model_predictions(), you can pass the raw trajectory data directly and let the estimator handle the model training. The two workflows remain interchangeable: set_model_predictions() is still available for cases where Q-values come from an external source.

- Parameters:
obs_flat – Observations, shape (n_samples, obs_dim).
act_flat – Discrete action indices, shape (n_samples,).
rew_flat – Rewards, shape (n_samples,).
num_actions – Total number of discrete actions.
q_model_params – Optional hyper-parameters forwarded to RTGQModelHGBoost.
random_state – Random seed for the underlying gradient-boosting model.
- Returns:
The fitted RTGQModelHGBoost instance.
- set_logged_actions(logged_actions: ndarray) None#
Set logged actions.
- Parameters:
logged_actions – Logged action indices with shape (n_samples,).
- set_model_predictions(*, q_values: ndarray) None#
Set model-based predictions used by the sequential DR estimator.
This can be used to inject Q-value predictions from an external model, or to set the predictions after fitting an internal model via fit(). The provided Q-values must be aligned with the logged data: each row holds the estimated action-values \([\hat{Q}(s_t, a)]_{a \in \mathcal{A}}\) for the state at the same index in the logged data.

- Parameters:
q_values – Estimated action-values for all actions, shape (n_samples, n_actions). Each row must contain the estimated action-values for the corresponding state.
- short_name() str#
Return the short name of the estimator.
This method can be overridden by subclasses to customize the short name.
- Returns:
the short name of the estimator. By default, it returns the abbreviation of the class name, i.e. “IPW”.
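The per-episode DR estimate defined above can be sketched for a single toy episode; the \(\hat{Q}\) and \(\hat{V}\) arrays below are hypothetical values standing in for a fitted model's predictions:

```python
import numpy as np

# One hypothetical episode with T=3 steps; all numbers are toy values.
rho = np.array([1.2, 0.8, 1.0])         # per-step ratios pi_e / pi_b
rewards = np.array([1.0, 0.0, 1.0])     # r_t
q_hat = np.array([0.9, 0.4, 0.8])       # Q-hat(s_t, a_t) for logged actions
v_hat = np.array([0.8, 0.5, 0.7, 0.0])  # V-hat(s_t), terminal value 0
gamma = 0.9

w = np.cumprod(rho)                      # W_t = prod_{k<=t} rho_k
# TD-style corrections, weighted by the cumulative importance ratios.
td = rewards + gamma * v_hat[1:] - q_hat
v_dr = v_hat[0] + np.sum(w * td)
```

When the model is accurate the TD terms shrink toward zero, so the importance weights matter less; when the model is poor, the weighted corrections keep the estimate unbiased.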
Implementing a new estimator#
To implement a new estimator, you need to subclass hopes.ope.estimators.BaseEstimator and implement:
hopes.ope.estimators.BaseEstimator.estimate_weighted_rewards(): it should return the estimated weighted rewards.
hopes.ope.estimators.BaseEstimator.estimate_policy_value(): it should return the estimated value of the target policy, typically computed from the estimated weighted rewards.
Optionally, you can implement hopes.ope.estimators.BaseEstimator.short_name() to provide a short name for the estimator.
When not implemented, the uppercase letters of the class name are used.
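As an illustration, a minimal estimator sketch. The BaseEstimator stub and the attribute names used here (target_policy_action_probabilities, etc.) are schematic assumptions made so the example runs standalone; they are not the library's actual internals:

```python
import numpy as np
from abc import ABC, abstractmethod

# Schematic stand-in for hopes.ope.estimators.BaseEstimator; the real base
# class provides the logged probabilities and rewards through its own API.
class BaseEstimator(ABC):
    target_policy_action_probabilities: np.ndarray
    behavior_policy_action_probabilities: np.ndarray
    rewards: np.ndarray

    @abstractmethod
    def estimate_weighted_rewards(self) -> np.ndarray: ...

    @abstractmethod
    def estimate_policy_value(self) -> float: ...

    def short_name(self) -> str:
        # Default short name: the uppercase letters of the class name.
        return "".join(c for c in type(self).__name__ if c.isupper())

# A new estimator only needs the two abstract methods.
class MyNewEstimator(BaseEstimator):
    def estimate_weighted_rewards(self) -> np.ndarray:
        w = (self.target_policy_action_probabilities
             / self.behavior_policy_action_probabilities)
        return w * self.rewards

    def estimate_policy_value(self) -> float:
        return float(self.estimate_weighted_rewards().mean())
```

With no short_name() override, the default yields “MNE” for this class.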
Below is the BaseEstimator class documentation.
- class hopes.ope.estimators.BaseEstimator#
Base class for all estimators.
- abstract estimate_policy_value() float#
Estimate the value of the target policy.
This method should be overridden by subclasses to implement the specific estimator. The typical implementation should call estimate_weighted_rewards() to compute the weighted rewards, then compute the policy value.

- Returns:
the estimated value of the target policy.
- abstract estimate_weighted_rewards() ndarray#
Estimate the weighted rewards.
This method should be overridden by subclasses to implement the specific estimator.
- Returns:
the weighted rewards.
- short_name() str#
Return the short name of the estimator.
This method can be overridden by subclasses to customize the short name.
- Returns:
the short name of the estimator. By default, it returns the abbreviation of the class name, i.e. “IPW”.