Hopes: Policies#

Implemented policies#

The base, abstract policy class is Policy.

The following policies implement it:

hopes.policy.RandomPolicy

A random policy that selects actions uniformly at random.

hopes.policy.ClassificationBasedPolicy

A policy that uses a classification model to predict the log-probabilities of actions given observations.

hopes.policy.PiecewiseLinearPolicy

A piecewise linear policy that selects actions based on a set of linear segments defined by thresholds and slopes.

hopes.policy.FunctionBasedPolicy

A policy based on a deterministic function that maps observations to actions.

hopes.policy.HttpPolicy

A policy that uses a remote HTTP server that returns log-probabilities for actions given observations.

hopes.policy.OnnxModelBasedPolicy

A policy that uses an existing ONNX model to predict the log-probabilities of actions given observations.

These classes provide an integration with actual control policies, so that they can be used in off-policy evaluation.

Policies documentation#

class hopes.policy.ClassificationBasedPolicy(obs: ndarray, act: ndarray, classification_model: str = 'logistic', model_params: dict | None = None)#

Bases: Policy

A policy that uses a classification model to predict the log-probabilities of actions given observations.

In the absence of an actual control policy, this can be used to train a policy on a dataset of (obs, act) pairs collected offline.

It currently supports logistic regression, random forest and MLP models.

Example usage:

from hopes.policy import ClassificationBasedPolicy

# train a classification-based policy on offline-logged (obs, act) pairs:
# train_obs has shape (n_samples, obs_dim) and train_act holds the discrete actions
reg_policy = ClassificationBasedPolicy(obs=train_obs, act=train_act, classification_model="random_forest")
fit_stats = reg_policy.fit()

# compute action probabilities for new observations
act_probs = reg_policy.compute_action_probs(obs=new_obs)

fit() → dict[str, float]#

Fit the classification model on the training data and return performance statistics computed on the training data.

Returns:

the accuracy and F1 score on the training data.

log_probabilities(obs: ndarray) → ndarray#

Compute the log-probabilities of the actions under the classification-based policy for a given set of observations.

class hopes.policy.PiecewiseLinearPolicy(num_segments: int, obs: ndarray, act: ndarray, epsilon: float, actions_bins: list[float | int] | None = None)#

Bases: Policy

A piecewise linear policy that selects actions based on a set of linear segments defined by thresholds and slopes.

This can be used to estimate a probability distribution over actions drawn from a BMS reset rule, for instance an outdoor air reset that is a function of outdoor air temperature and is bounded by a minimum and maximum on both axes. It can also be helpful for modeling a simple schedule, where the action is a function of time.

Since the output of the piecewise linear model is deterministic, the log-probabilities are computed by assigning a probability of ~1 to the action returned by the model and an almost-zero probability to all other actions.

Also, since the output of the piecewise linear policy is continuous, the action space must be discretized to compute the log-probabilities. This is done by binning each action to the nearest action in the discretized action space.
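
Example usage (a minimal sketch: the variable names, data shapes and action bins below are illustrative assumptions, not part of the library):

import numpy as np

from hopes.policy import PiecewiseLinearPolicy

# hypothetical offline data: outdoor air temperature (obs) and supply temperature setpoint (act)
outdoor_temps = np.random.uniform(-10, 30, size=500).reshape(-1, 1)
setpoints = np.clip(70 - outdoor_temps.ravel(), 40, 70)

# fit a 3-segment piecewise linear policy; actions are binned to the given discrete values
pwl_policy = PiecewiseLinearPolicy(
    num_segments=3,
    obs=outdoor_temps,
    act=setpoints,
    epsilon=0.01,
    actions_bins=[40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0],
)
fit_stats = pwl_policy.fit()  # {"rmse": ..., "r2": ...}

# log-probabilities over the discretized action space for new observations
log_probs = pwl_policy.log_probabilities(obs=np.array([[5.0], [20.0]]))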

fit() → dict[str, float]#

Fit the piecewise linear model on the training data and return the RMSE and R² computed on the training data.

Returns:

the RMSE (‘rmse’) and R² (‘r2’) on the training data.

log_probabilities(obs: ndarray) → ndarray#

Compute the log-probabilities of the actions under the piecewise linear policy for a given set of observations.

class hopes.policy.OnnxModelBasedPolicy(onnx_model_path: str | Path, obs_input: tuple[str, dtype], state_dim: tuple[int, int, int] | None = None, seq_len: int | None = None, prev_n_actions: int | None = None, prev_n_rewards: int | None = None, state_input: tuple[str, dtype] | None = None, seq_len_input: tuple[str, dtype] | None = None, prev_actions_input: tuple[str, dtype] | None = None, prev_rewards_input: tuple[str, dtype] | None = None, state_output_name: str | None = None, action_output_name: str | None = None, action_probs_output_name: str | None = None, action_log_probs_output_name: str | None = None, action_dist_inputs_output_name: str | None = None)#

Bases: Policy

A policy that uses an existing ONNX model to predict the log-probabilities of actions given observations.

This class makes some opinionated assumptions about the structure of the ONNX model. You may need to override some methods if your model does not fit this structure.

It supports models with attention mechanisms, where the state of the model is updated at each step. The action log probabilities are computed from the output of the model, with 3 options depending on the output layer of the underlying model:

  • from the action probabilities output.

  • from the action log probabilities output.

  • from the action distribution inputs output.

Example usage, based on a model pre-trained with Ray RLlib and saved using ray.rllib.algorithms.algorithm.Algorithm.export_policy_model(). This model uses an attention-based Transformer and passes the 10 previous actions as inputs. The action log-probabilities are computed from the action distribution inputs output.

import numpy as np

from hopes.policy import OnnxModelBasedPolicy

onnx_file_path = "model.onnx"
policy = OnnxModelBasedPolicy(
    onnx_model_path=onnx_file_path,
    obs_input=("default_policy/obs:0", np.float32),
    state_dim=(1, 10, 32),  # (num_transformers, memory, attention_dim)
    seq_len=10,
    prev_n_actions=10,
    prev_n_rewards=0,
    state_input=("default_policy/state_in_0:0", np.float32),
    seq_len_input=("default_policy/seq_lens:0", np.int32),
    prev_actions_input=("default_policy/prev_actions:0", np.int64),
    state_output_name="default_policy/Reshape_5:0",
    action_output_name="default_policy/cond_1/Merge:0",
    action_probs_output_name=None,
    action_log_probs_output_name=None,
    action_dist_inputs_output_name="default_policy/model_2/dense_6/BiasAdd:0",
)

policy.log_probabilities(obs=np.random.rand(1, 15))

compute_reward(obs: ndarray, action: int) → float#

Compute the reward for the given observations and actions. Only necessary if prev_n_rewards is > 0.

Parameters:
  • obs – the observations for which to compute the reward.

  • action – the action for which to compute the reward.

Returns:

the computed reward.
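
If your model does consume previous rewards, one option is to override this method in a subclass; a minimal, purely illustrative sketch (the class name, reward formula and observation indexing are assumptions):

import numpy as np

from hopes.policy import OnnxModelBasedPolicy


class MyOnnxPolicy(OnnxModelBasedPolicy):
    def compute_reward(self, obs: np.ndarray, action: int) -> float:
        # illustrative only: penalize the distance of the first observation feature from a target of 21
        return -abs(float(np.asarray(obs).ravel()[0]) - 21.0)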

log_probabilities(obs: ndarray) → ndarray#

Compute the log-probabilities of the actions under the policy for a given set of observations.

Parameters:

obs – the observation for which to compute the log-probabilities, shape: (batch_size, obs_dim).

map_inputs(obs: ndarray) → dict[str, ndarray]#

Prepare the inputs for the ONNX model.

Parameters:

obs – the observations for which to prepare the inputs.

Returns:

the inputs for the ONNX model.

property output_names: list[str]#

The names of the outputs of the ONNX model.

By default, returns all output names. It can be overridden to return a subset of output names.

reset_state() → None#

Reset the state of the policy.

class hopes.policy.FunctionBasedPolicy(policy_function: callable, epsilon: float, actions_bins: list[float | int])#

Bases: Policy

A policy based on a deterministic function that maps observations to actions.

Log-probabilities are computed by assuming the function is deterministic and assigning a probability of ~1 to the action returned by the function and an almost-zero probability to all other actions. The action space is discretized using the provided actions_bins to compute the log-probabilities.
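
Example usage (a minimal sketch: the schedule function, epsilon and action bins are illustrative assumptions, and the function is written in vectorized form since the exact calling convention is not documented here):

import numpy as np

from hopes.policy import FunctionBasedPolicy

def schedule(hour_of_day):
    # hypothetical deterministic rule: day setpoint between 8:00 and 18:00, night setpoint otherwise
    return np.where((hour_of_day >= 8) & (hour_of_day < 18), 70.0, 60.0)

policy = FunctionBasedPolicy(
    policy_function=schedule,
    epsilon=0.01,
    actions_bins=[60.0, 70.0],
)

# log-probabilities over the discretized action space for a batch of observations
log_probs = policy.log_probabilities(obs=np.array([[6.0], [12.0]]))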

log_probabilities(obs: ndarray) → ndarray#

Compute the log-probabilities of the actions under the function-based policy for a given set of observations.

class hopes.policy.RandomPolicy(num_actions: int)#

Bases: Policy

A random policy that selects actions uniformly at random.

It can serve as a baseline policy for comparison with other policies.
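
Example usage (a minimal sketch; the observation shape is an illustrative assumption):

import numpy as np

from hopes.policy import RandomPolicy

# a uniform baseline over 3 discrete actions
rnd_policy = RandomPolicy(num_actions=3)

# each action should get probability ~1/3 for every observation in the batch
act_probs = rnd_policy.compute_action_probs(obs=np.random.rand(10, 5))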

log_probabilities(obs: ndarray) → ndarray#

Compute the log-probabilities of the actions under the random policy for a given set of observations.

Implementing a new policy#

To implement a new policy, you need to subclass Policy and implement the hopes.policy.Policy.log_probabilities() method, as sketched below.
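
A minimal sketch (the class name and policy logic are purely illustrative, and the (batch_size, num_actions) output shape is an assumption based on compute_action_probs()):

import numpy as np

from hopes.policy import Policy


class BiasedPolicy(Policy):
    """Illustrative policy that picks action 0 with probability 1 - bias and spreads the rest uniformly."""

    def __init__(self, num_actions: int, bias: float = 0.1):
        self.num_actions = num_actions
        self.bias = bias

    def log_probabilities(self, obs: np.ndarray) -> np.ndarray:
        # one probability distribution per observation, shape (batch_size, num_actions)
        probs = np.full((obs.shape[0], self.num_actions), self.bias / (self.num_actions - 1))
        probs[:, 0] = 1.0 - self.bias
        return np.log(probs)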

class hopes.policy.Policy#

Bases: ABC

An abstract class for policies.

The policy must be subclassed and the log_probabilities method must be implemented.

compute_action_probs(obs: ndarray) → ndarray#

Compute the action probabilities under a given policy for a given set of observations.

Parameters:

obs – the observation for which to compute the action probabilities.

Returns:

the action probabilities.

property epsilon#

abstract log_probabilities(obs: ndarray) → ndarray#

Compute the log-probabilities of the actions under the policy for a given set of observations.

Parameters:

obs – the observation for which to compute the log-probabilities, shape: (batch_size, obs_dim).

property name#

select_action(obs: ndarray, deterministic: bool = False) → ndarray#

Select actions under the policy for given observations.

Parameters:
  • obs – the observation(s) for which to select an action, shape (batch_size, obs_dim).

  • deterministic – whether to select actions deterministically.

Returns:

the selected action(s).

with_epsilon(epsilon: float) → Policy#

Set the epsilon value for epsilon-greedy action selection. This is only needed if the policy is used for action selection and epsilon-greedy action selection is desired.

Parameters:

epsilon – the epsilon value for epsilon-greedy action selection.

with_name(name: str) → Policy#

Set the name of the policy. This is optional but can be useful for logging, visualization and comparison with other policies.

Parameters:

name – the name of the policy.
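
Putting these together, a minimal sketch of epsilon-greedy action selection (chaining with_name() and with_epsilon() assumes they return the policy instance, as suggested by their → Policy return type):

import numpy as np

from hopes.policy import RandomPolicy

# configure a named policy with epsilon-greedy action selection
policy = RandomPolicy(num_actions=3).with_name("baseline").with_epsilon(0.1)

obs = np.random.rand(4, 5)
greedy_actions = policy.select_action(obs=obs, deterministic=True)
explored_actions = policy.select_action(obs=obs)  # epsilon-greedy, stochastic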