Hopes: Selection#

Roadmap#

  • [x] Confidence Interval estimation using Bootstrap

  • [x] Confidence Interval estimation using t-test

Introduction#

Running an Off-Policy Evaluation (OPE) experiment and then selecting the best policies with Hopes is simple.

Here is an example with a synthetic, random dataset.
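The variables obs, act, rew and num_actions used below are assumed to come from such a dataset. A minimal sketch of how they could be generated with NumPy follows; the shapes and sizes are illustrative assumptions, not part of the Hopes API:

import numpy as np

# illustrative synthetic dataset: 1000 logged steps, 5-dimensional observations
num_actions = 3
num_obs = 1000
obs = np.random.randn(num_obs, 5)                     # observations
act = np.random.randint(num_actions, size=num_obs)    # logged discrete actions
rew = np.random.rand(num_obs)                         # logged rewards in [0, 1)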

# create the behavior policy
behavior_policy = ClassificationBasedPolicy(
    obs=obs, act=act, classification_model="logistic"
)
behavior_policy.fit()

# create the target policies
target_policy_1 = RandomPolicy(num_actions=num_actions).with_name("p1")
target_policy_2 = RandomPolicy(num_actions=num_actions).with_name("p2")
target_policy_3 = ClassificationBasedPolicy(
    obs=obs, act=act, classification_model="random_forest"
).with_name("p3")
target_policy_3.fit()

# initialize the estimators
estimators = [
    InverseProbabilityWeighting(),
    SelfNormalizedInverseProbabilityWeighting(),
]

# run the off-policy evaluation
ope = OffPolicyEvaluation(
    obs=obs,
    rewards=rew,
    behavior_policy=behavior_policy,
    estimators=estimators,
    fail_fast=True,
    ci_method="t-test",
    ci_significance_level=0.1,
)

results = [
    ope.evaluate(target_policy)
    for target_policy in [target_policy_1, target_policy_2, target_policy_3]
]

# select the top k policies based on the lower bound of the 90% confidence interval
top_k_results = OffPolicySelection.select_top_k(results, metric="lower_bound", top_k=1)
print(top_k_results[0])

This should produce an output similar to:

Policy: p2
Confidence interval: +- 90.0%
=====  ========  ==========  =============  =============
..         mean         std    lower_bound    upper_bound
=====  ========  ==========  =============  =============
IPW    0.510251  0.00788465       0.497324       0.522907
SNIPW  0.499158  0.00523288       0.490235       0.507513
=====  ========  ==========  =============  =============

Note that the confidence interval (CI) calculation can be based on one of several methods:

  • bootstrap (default)

  • t-test

The documentation of the CI calculation can be found in BaseEstimator.estimate_policy_value_with_confidence_interval; see the implementation details below for more information.
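For instance, to switch back to the default bootstrap method with a 95% confidence interval (significance level 0.05), the evaluation above could be configured as follows; this is a sketch that only relies on the constructor signature documented below:

ope = OffPolicyEvaluation(
    obs=obs,
    rewards=rew,
    behavior_policy=behavior_policy,
    estimators=estimators,
    fail_fast=True,
    ci_method="bootstrap",
    ci_significance_level=0.05,
)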

Implementation details#

class hopes.ope.evaluation.OffPolicyEvaluation(obs: ndarray, rewards: ndarray, behavior_policy: Policy, estimators: list[BaseEstimator], fail_fast: bool = True, ci_method: str = 'bootstrap', ci_significance_level: float = 0.05)#

Bases: object

Off-Policy evaluation of a target policy using a behavior policy and a set of estimators.

Example usage:

# create the behavior policy
behavior_policy = ClassificationBasedPolicy(obs=collected_obs, act=collected_act, classification_model="logistic")
behavior_policy.fit()

# create the target policy
target_policy = RandomPolicy(num_actions=num_actions)

# initialize the estimators
estimators = [
    InverseProbabilityWeighting(),
    SelfNormalizedInverseProbabilityWeighting(),
]

# run the off-policy evaluation
ope = OffPolicyEvaluation(
    obs=obs,
    rewards=rew,
    behavior_policy=behavior_policy,
    estimators=estimators,
    fail_fast=True,
    ci_significance_level=0.1,
)
results = ope.evaluate(target_policy)
evaluate(target_policy: Policy) → OffPolicyEvaluationResults#

Run the off-policy evaluation and return the estimated value of the target policy.

Returns:

the evaluation results, with one OffPolicyEvaluationResult per estimator

class hopes.ope.selection.OffPolicySelection#

Bases: object

static select_top_k(evaluation_results: list[OffPolicyEvaluationResults], metric: str = 'mean', top_k: int = 1) → list[OffPolicyEvaluationResults]#

Select the top-k policies based on the given metric.

Parameters:
  • evaluation_results – The results of the off-policy evaluation for multiple policies.

  • metric – The metric to use for the selection. Can be “mean”, “lower_bound”, “upper_bound”.

  • top_k – The number of policies to select.

Returns:

The top-k policies based on the given metric.
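For example, reusing the results list from the introduction, the two best policies by estimated mean (rather than by the conservative lower bound) could be selected as follows; this sketch only relies on the signature above:

top_2_results = OffPolicySelection.select_top_k(results, metric="mean", top_k=2)
for result in top_2_results:
    print(result)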

class hopes.ope.estimators.BaseEstimator#

Base class for all estimators.

estimate_policy_value_with_confidence_interval(method: str = 'bootstrap', significance_level: float = 0.05, num_samples: int = 1000) → dict[str, float]#

Estimate the confidence interval of the policy value.

The bootstrap method uses bootstrapping to estimate the confidence interval of the policy value. Bootstrapping consists in resampling the data with replacement to infer the distribution of the estimated weighted rewards. The confidence interval is then computed as the quantiles of the bootstrapped samples.
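A minimal, library-agnostic sketch of this procedure, assuming weighted_rewards is a 1-D NumPy array of estimated weighted rewards; the two-sided quantile convention used here is an assumption and may differ from the library's implementation:

import numpy as np

def bootstrap_ci(weighted_rewards, significance_level=0.05, num_samples=1000):
    # resample with replacement and record the mean of each resample
    n = len(weighted_rewards)
    means = np.array([
        np.mean(np.random.choice(weighted_rewards, size=n, replace=True))
        for _ in range(num_samples)
    ])
    # the CI bounds are quantiles of the bootstrapped means
    return {
        "lower_bound": float(np.quantile(means, significance_level / 2)),
        "upper_bound": float(np.quantile(means, 1 - significance_level / 2)),
        "mean": float(np.mean(means)),
        "std": float(np.std(means)),
    }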

The t-test method (or Student’s t-test) uses the t-distribution of the estimated weighted rewards, assuming that the weighted rewards are normally distributed, to estimate the confidence interval of the policy value. The statistic \(t = \frac{\hat{\mu} - \mu}{\hat{\sigma} / \sqrt{n}}\) follows a t-distribution with \(n - 1\) degrees of freedom, where \(\hat{\mu}\) is the mean of the weighted rewards, \(\mu\) is the true mean of the weighted rewards, \(\hat{\sigma}\) is the standard deviation of the weighted rewards, and \(n\) is the number of samples. The confidence interval is then computed as:

\[[\hat{\mu} - t_{\mathrm{test}}(1 - \alpha, n-1) \frac{\hat{\sigma}}{\sqrt{n}}, \hat{\mu} + t_{\mathrm{test}}(1 - \alpha, n-1) \frac{\hat{\sigma}}{\sqrt{n}}]\]
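The same interval can be sketched with SciPy's t-distribution quantile; this follows the formula above literally and is an illustration, not the library's exact code:

import numpy as np
from scipy import stats

def t_test_ci(weighted_rewards, significance_level=0.05):
    n = len(weighted_rewards)
    mu_hat = np.mean(weighted_rewards)
    sigma_hat = np.std(weighted_rewards, ddof=1)
    # t-quantile at 1 - alpha with n - 1 degrees of freedom, as in the formula above
    t_val = stats.t.ppf(1 - significance_level, n - 1)
    margin = t_val * sigma_hat / np.sqrt(n)
    return {
        "lower_bound": float(mu_hat - margin),
        "upper_bound": float(mu_hat + margin),
        "mean": float(mu_hat),
        "std": float(sigma_hat),
    }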

The input data is sampled from the estimated weighted rewards, using estimate_weighted_rewards().

Example:

ipw = InverseProbabilityWeighting()
ipw.set_parameters(
    target_policy_action_probabilities=target_policy_action_probabilities,
    behavior_policy_action_probabilities=behavior_policy_action_probabilities,
    rewards=rewards,
)
metrics = ipw.estimate_policy_value_with_confidence_interval(
    method="bootstrap", significance_level=0.05
)
print(metrics)

Should output something like:

{
    "lower_bound": 10.2128,
    "upper_bound": 10.6167,
    "mean": 10.4148,
    "std": 6.72408,
}
Parameters:
  • method – the method to use for estimating the confidence interval. Currently, only “bootstrap” and “t-test” are supported.

  • significance_level – the significance level of the confidence interval.

  • num_samples – the number of bootstrap samples to use. Only used when method is “bootstrap”.

Returns:

a dictionary containing the confidence interval of the policy value. The keys are:

  • “lower_bound”: the lower bound of the policy value, given the significance level.

  • “upper_bound”: the upper bound of the policy value, given the significance level.

  • “mean”: the mean of the policy value.

  • “std”: the standard deviation of the policy value.