RLlib Configuration

Link to GitHub project: maxpumperla/rllib-trainer

Defining a trainer

Here's how you define and run a Trainer, both manually (a DQN Trainer) and with Tune (a PPO Trainer):

# Manual RLlib Trainer setup.
from rllib.dqn import DQNConfig

dqn_config = DQNConfig() \
    .training(gamma=0.9, lr=0.01) \
    .resources(num_gpus=0) \
    .rollouts(num_rollout_workers=1)
dqn_trainer = dqn_config.build(env="CartPole-v1")
print(dqn_trainer.train())


# With Tune.
from ray import tune
from rllib.ppo import PPOConfig

ppo_config = PPOConfig(kl_coeff=0.1).environment(env="CartPole-v1")
# Add a Tune grid search over the learning rate.
ppo_config.training(lr=tune.grid_search([0.001, 0.0001]))

tune.run(
    "PPO",
    stop={"episode_reward_mean": 150.0},
    config=ppo_config.to_dict(),
)

And here’s an example for a DQN trainer:

# With an evaluation sub-config.
from rllib.dqn import DQNConfig

dqn_config = DQNConfig().evaluation(
    evaluation_interval=1,
    evaluation_num_workers=2,
    evaluation_config=DQNConfig().exploration(explore=False),
)
dqn_trainer = dqn_config.build(env="CartPole-v1")
results = dqn_trainer.train()
assert "evaluation" in results

If you define a DQN Trainer with the wrong config, your IDE will tell you right at definition time:

from rllib.dqn import DQNConfig

# "kl_coeff" is not defined in DQNConfig, your IDE will bark at you.
config = DQNConfig(kl_coeff=0.3) \
    .training(gamma=0.9, lr=0.01) \
    .resources(num_gpus=0) \
    .workers(num_workers=4)

Here’s a snapshot from PyCharm:

[Screenshot: PyCharm flagging the unexpected "kl_coeff" argument to DQNConfig (dqn_fail.png)]

How to document Trainers in a less annoying way

Instead of maintaining long, flat lists of parameters, we can simply auto-generate class documentation, complete with types and docstrings. Users might actually understand what's going on!
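As a minimal sketch of that idea (the class and method names here are hypothetical, not part of the rllib-trainer repo), a typed builder method with a proper docstring is all a tool such as Sphinx autodoc needs to generate the kind of reference shown below:

from typing import Optional


class MyTrainerConfig:
    """Builds a trainer from a fluent, typed configuration (illustrative only)."""

    def __init__(self):
        self.gamma: float = 0.99
        self.lr: float = 0.001

    def training(self, *, gamma: Optional[float] = None,
                 lr: Optional[float] = None) -> "MyTrainerConfig":
        """Sets training-related options.

        Args:
            gamma: Discount factor of the Markov Decision Process.
            lr: The default learning rate.

        Returns:
            This updated config object (for call chaining).
        """
        if gamma is not None:
            self.gamma = gamma
        if lr is not None:
            self.lr = lr
        return self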

TrainerConfig

class rllib.trainer.TrainerConfig(trainer_class=None)

Bases: object

An RLlib TrainerConfig builds an RLlib Trainer from a given configuration.

Example

>>> from rllib.trainer import TrainerConfig
>>> config = TrainerConfig().training(gamma=0.9, lr=0.01) \
...     .environment(env="CartPole-v1") \
...     .resources(num_gpus=0) \
...     .workers(num_workers=4)
to_dict() → dict

Converts all settings into a legacy config dict for backward compatibility.

Returns

A complete TrainerConfigDict, usable in backward-compatible Tune/RLlib use cases, e.g. with tune.run().

build(env: Optional[Union[str, Any]] = None, logger_creator: Optional[Callable[[], ray.tune.logger.Logger]] = None)

Builds a Trainer from the TrainerConfig.

Parameters
  • env – Name of the environment to use (e.g. a gym-registered str), a full class path (e.g. “ray.rllib.examples.env.random_env.RandomEnv”), or an Env class directly. Note that this arg can also be specified via the “env” key in config.

  • logger_creator – Callable that creates a ray.tune.Logger object. If unspecified, a default logger is created.

Returns

A Trainer object built from this config (for example, a ray.rllib.agents.dqn.DQNTrainer when built from a DQNConfig).

training(gamma: Optional[float] = None, lr: Optional[float] = None, train_batch_size: Optional[int] = None, model: Optional[dict] = None, optimizer: Optional[dict] = None) → rllib.trainer.TrainerConfig

Sets the config’s training settings.

Parameters
  • gamma – Float specifying the discount factor of the Markov Decision Process.

  • lr – The default learning rate.

  • train_batch_size – Training batch size, if applicable.

  • model – Arguments passed into the policy model. See models/catalog.py for a full list of the available model options. A usage sketch follows below.

  • optimizer – Arguments to pass to the policy optimizer.

Returns

This updated TrainerConfig object.
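For illustration, here is a hedged usage sketch of training(). The model keys shown ("fcnet_hiddens", "fcnet_activation") are standard RLlib model options from models/catalog.py; that this prototype passes the model dict through to RLlib unchanged is an assumption:

from rllib.ppo import PPOConfig

# Sketch: typed training settings plus a pass-through model dict.
config = PPOConfig() \
    .training(
        gamma=0.99,
        lr=5e-5,
        train_batch_size=4000,
        model={"fcnet_hiddens": [64, 64], "fcnet_activation": "relu"},
    )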

rollouts(*, num_rollout_workers: Optional[int] = None, num_envs_per_worker: Optional[int] = None, create_env_on_local_worker: Optional[bool] = None, rollout_fragment_length: Optional[int] = None, batch_mode: Optional[str] = None, remote_worker_envs: Optional[bool] = None, remote_env_batch_wait_ms: Optional[float] = None) → rllib.trainer.TrainerConfig

Sets the rollout worker configuration.

Parameters
  • num_rollout_workers – Number of rollout worker actors to create for parallel sampling. Setting this to 0 will force rollouts to be done in the local worker (driver process or the Trainer actor when using Tune).

  • num_envs_per_worker – Number of environments to evaluate vector-wise per worker. This enables model inference batching, which can improve performance for inference bottlenecked workloads.

  • create_env_on_local_worker – When num_workers > 0, the driver (local_worker; worker-idx=0) does not need an environment, because it neither samples (done by remote_workers; worker_indices > 0) nor evaluates (done by evaluation workers; see below). If you still want to create an env on the local worker anyway (e.g. for debugging), set this to True.

  • rollout_fragment_length – Divide episodes into fragments of this many steps each during rollouts. Sample batches of this size are collected from rollout workers and combined into a larger batch of train_batch_size for learning. For example, given rollout_fragment_length=100 and train_batch_size=1000: 1. RLlib collects 10 fragments of 100 steps each from rollout workers. 2. These fragments are concatenated and we perform an epoch of SGD. When using multiple envs per worker, the fragment size is multiplied by num_envs_per_worker, because we collect steps from multiple envs in parallel. For example, if num_envs_per_worker=5, then rollout workers will return experiences in chunks of 5*100 = 500 steps. The dataflow here can vary per algorithm; PPO, for example, further divides the train batch into minibatches for multi-epoch SGD. See the configuration sketch after this method for a worked example.

  • batch_mode – How to build per-Sampler (RolloutWorker) batches, which are then usually concat’d to form the train batch. Note that “steps” below can mean different things (either env- or agent-steps) and depends on the count_steps_by (multiagent) setting below. “truncate_episodes”: Each produced batch (when calling RolloutWorker.sample()) will contain exactly rollout_fragment_length steps. This mode guarantees evenly sized batches, but increases variance as the future return must now be estimated at truncation boundaries. “complete_episodes”: Each unroll happens exactly over one episode, from beginning to end. Data collection will not stop unless the episode terminates or a configured horizon (hard or soft) is hit.

  • remote_worker_envs – If using num_envs_per_worker > 1, whether to create those new envs in remote processes instead of in the same worker. This adds overhead, but can make sense if your envs take a long time to step or reset (e.g., for StarCraft). Use this cautiously; the overhead is significant.

  • remote_env_batch_wait_ms – Timeout that remote workers wait on when polling environments. 0 (continue as soon as at least one env is ready) is a reasonable default, but the optimal value can be found by measuring your environment's step/reset and model-inference performance.

Returns

This updated TrainerConfig object.
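Here is a small sketch of the sampling arithmetic described under rollout_fragment_length above; the concrete numbers are illustrative only, not recommendations:

from rllib.ppo import PPOConfig

config = PPOConfig() \
    .rollouts(
        num_rollout_workers=2,        # two parallel sampling actors
        num_envs_per_worker=5,        # each worker steps 5 envs vector-wise
        rollout_fragment_length=100,  # so each worker returns chunks of 5 * 100 = 500 steps
    ) \
    .training(train_batch_size=1000)  # two 500-step chunks are concatenated per train batch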

environment(*, env: Optional[Union[str, Any]] = None, env_config: Optional[dict] = None, observation_space: Optional[gym.spaces.space.Space] = None, action_space: Optional[gym.spaces.space.Space] = None) → rllib.trainer.TrainerConfig

Sets the config’s environment settings.

Parameters
  • env – The environment specifier. This can either be a tune-registered env, via tune.register_env([name], lambda env_ctx: [env object]) (see the sketch below), or a string specifier of an RLlib supported type. In the latter case, RLlib will try to interpret the specifier as either an OpenAI Gym env, a PyBullet env, a ViZDoomGym env, or a fully qualified classpath to an Env class, e.g. “ray.rllib.examples.env.random_env.RandomEnv”.

  • env_config – Arguments dict passed to the env creator as an EnvContext object (which is a dict plus the properties: num_workers, worker_index, vector_index, and remote).

  • observation_space – The observation space for the Policies of this Trainer.

  • action_space – The action space for the Policies of this Trainer.

Returns

This updated TrainerConfig object.
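As a sketch of the tune.register_env path and of env_config arriving at the env creator as an EnvContext. MyCustomEnv is a placeholder for a user-defined gym.Env, not part of the repo:

import gym
import numpy as np
from gym import spaces
from ray import tune
from rllib.ppo import PPOConfig


class MyCustomEnv(gym.Env):
    """Tiny placeholder env; 'size' arrives via env_config (an EnvContext)."""

    def __init__(self, size: int = 10):
        self.size = size
        self.observation_space = spaces.Box(0.0, 1.0, shape=(size,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self._steps = 0

    def reset(self):
        self._steps = 0
        return np.zeros(self.size, dtype=np.float32)

    def step(self, action):
        self._steps += 1
        return np.zeros(self.size, dtype=np.float32), 1.0, self._steps >= 10, {}


# Register under a name, then refer to the env by that name in the config.
tune.register_env("my_custom_env", lambda env_ctx: MyCustomEnv(size=env_ctx["size"]))

config = PPOConfig().environment(env="my_custom_env", env_config={"size": 10})
trainer = config.build()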

exploration(*, explore: Optional[bool] = None, exploration_config: Optional[dict] = None)

Sets the config’s exploration settings.

Parameters
  • explore – Default exploration behavior, iff explore=None is passed into compute_action(s). Set to False for no exploration behavior (e.g., for evaluation).

  • exploration_config – A dict specifying the Exploration object’s config (see the sketch below).

Returns

This updated TrainerConfig object.
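A hedged sketch of an exploration_config dict. The keys shown ("type": "EpsilonGreedy" with initial_epsilon, final_epsilon, epsilon_timesteps) are standard RLlib Exploration options; that this prototype forwards them unchanged is an assumption:

from rllib.dqn import DQNConfig

config = DQNConfig().exploration(
    explore=True,
    exploration_config={
        "type": "EpsilonGreedy",     # annealed epsilon-greedy exploration
        "initial_epsilon": 1.0,
        "final_epsilon": 0.02,
        "epsilon_timesteps": 10000,  # timesteps over which to anneal epsilon
    },
)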

evaluation(*, evaluation_interval: Optional[int] = None, evaluation_duration: Optional[int] = None, evaluation_duration_unit: Optional[str] = None, evaluation_parallel_to_training: Optional[bool] = None, evaluation_config: Optional[Union[rllib.trainer.TrainerConfig, dict]] = None, evaluation_num_workers: Optional[int] = None, custom_evaluation_function: Optional[Callable] = None, always_attach_evaluation_results: Optional[bool] = None)

Sets the config’s evaluation settings.

Parameters
  • evaluation_interval – Evaluate every evaluation_interval training iterations. The evaluation stats will be reported under the “evaluation” metric key. Note that, for Ape-X, metrics are only reported for the lowest-epsilon (least random) workers. Set to None (or 0) for no evaluation.

  • evaluation_duration – Duration for which to run evaluation each evaluation_interval. The unit for the duration can be set via evaluation_duration_unit to either “episodes” (default) or “timesteps”. If using multiple evaluation workers (evaluation_num_workers > 1), the load will be split amongst them. If the value is “auto”: - For evaluation_parallel_to_training=True: run as many episodes/timesteps as fit into the (parallel) training step. - For evaluation_parallel_to_training=False: error.

  • evaluation_duration_unit – The unit, with which to count the evaluation duration. Either “episodes” (default) or “timesteps”.

  • evaluation_parallel_to_training – Whether to run evaluation in parallel to a Trainer.train() call using threading. Default=False. E.g. evaluation_interval=2 -> For every other training iteration, the Trainer.train() and Trainer.evaluate() calls run in parallel. Note: This is experimental. Possible pitfalls could be race conditions for weight synching at the beginning of the evaluation loop.

  • evaluation_config – Typical usage is to pass extra args to evaluation env creator and to disable exploration by computing deterministic actions. IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here will result in the evaluation workers not using this optimal policy!

  • evaluation_num_workers – Number of parallel workers to use for evaluation. Note that this is set to zero by default, which means evaluation will be run in the trainer process (only if evaluation_interval is not None). If you increase this, it will increase the Ray resource usage of the trainer since evaluation workers are created separately from rollout workers (used to sample data for training).

  • custom_evaluation_function – Customize the evaluation method. This must be a function of signature (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the Trainer.evaluate() method for the default implementation. The Trainer guarantees all eval workers have the latest policy state before this function is called. A minimal sketch follows below.

  • always_attach_evaluation_results – Make sure the latest available evaluation results are always attached to a step result dict. This may be useful if Tune or some other meta controller needs access to evaluation metrics all the time.

Returns

This updated TrainerConfig object.
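A minimal sketch of a custom evaluation function with the documented (trainer, eval_workers) -> metrics signature. The WorkerSet and SampleBatch calls below follow RLlib's public API; exactly what this prototype hands to the function is an assumption here:

import ray
from rllib.dqn import DQNConfig


def custom_eval(trainer, eval_workers):
    # One sampling pass on every remote evaluation worker.
    batches = ray.get([w.sample.remote() for w in eval_workers.remote_workers()])
    # Report how many timesteps were gathered during this evaluation round.
    return {"custom_eval_timesteps": sum(b.count for b in batches)}


config = DQNConfig().evaluation(
    evaluation_interval=2,
    evaluation_num_workers=2,
    custom_evaluation_function=custom_eval,
)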

resources(*, num_gpus: Optional[Union[float, int]] = None, _fake_gpus: Optional[bool] = None, num_cpus_per_worker: Optional[int] = None, num_gpus_per_worker: Optional[Union[float, int]] = None, custom_resources_per_worker: Optional[dict] = None, num_cpus_for_local_worker: Optional[int] = None)

Specifies resources allocated for a Trainer and its Ray actors/workers.

Parameters
  • num_gpus – Number of GPUs to allocate to the trainer process. Note that not all algorithms can take advantage of trainer GPUs. Support for multi-GPU is currently only available for tf-[PPO/IMPALA/DQN/PG]. This can be fractional (e.g., 0.3 GPUs).

  • _fake_gpus – Set to True for debugging (multi-)GPU functionality on a CPU-only machine. GPU towers will be simulated by graphs located on CPUs in this case. Use num_gpus to test for different numbers of fake GPUs (see the sketch below).

  • num_cpus_per_worker – Number of CPUs to allocate per worker.

  • num_gpus_per_worker – Number of GPUs to allocate per worker. This can be fractional. This is usually needed only if your env itself requires a GPU (i.e., it is a GPU-intensive video game), or model inference is unusually expensive.

  • custom_resources_per_worker – Any custom Ray resources to allocate per worker.

  • num_cpus_for_local_worker – Number of CPUs to allocate for the trainer. Note: this only takes effect when running in Tune. Otherwise, the trainer runs in the main program (driver).

Returns

This updated TrainerConfig object.
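Two short, hedged resource sketches (the concrete numbers are illustrative): a fractional trainer GPU, and fake GPUs for debugging multi-GPU code paths on a CPU-only machine:

from rllib.dqn import DQNConfig

# Share a physical GPU: allocate only half of it to the trainer process.
config = DQNConfig().resources(num_gpus=0.5, num_cpus_per_worker=1)

# Debug multi-GPU code paths on a machine without any GPUs.
debug_config = DQNConfig().resources(num_gpus=2, _fake_gpus=True)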

PPO

class rllib.ppo.PPOConfig(use_critic: bool = True, use_gae: bool = True, lambda_: float = 1.0, kl_coeff: float = 0.2)

Bases: rllib.trainer.TrainerConfig

Defines a PPOTrainer from the given configuration.

Parameters
  • use_critic – Whether to use a critic as a baseline (required for using GAE; if False, no value baseline is used).

  • use_gae – If true, use the Generalized Advantage Estimator (GAE) with a value function, see https://arxiv.org/pdf/1506.02438.pdf.

  • lambda_ – The GAE (lambda) parameter.

  • kl_coeff – Initial coefficient for KL divergence.

  • rollout_fragment_length – Size of batches collected from each worker.

  • train_batch_size – Number of timesteps collected for each SGD round. This defines the size of each SGD epoch.

Example

>>> from rllib.ppo import PPOConfig
>>> config = PPOConfig(kl_coeff=0.3).training(gamma=0.9, lr=0.01) \
...     .resources(num_gpus=0) \
...     .workers(num_workers=4)
>>> print(config.to_dict())
>>> trainer = config.build(env="CartPole-v1")
>>> trainer.train()

Example

>>> from ray import tune
>>> from rllib.ppo import PPOConfig
>>> trainer = PPOConfig().build(env="CartPole-v1")
>>> config_dict = trainer.get_config()
>>>
>>> config_dict.update({
...     "lr": tune.grid_search([0.01, 0.001, 0.0001]),
... })
>>> tune.run(
...     "PPO",
...     stop={"episode_reward_mean": 200},
...     config=config_dict,
... )

DQN

class rllib.dqn.DQNConfig(dueling=True, hiddens=None, double_q=True, n_step=1)

Bases: rllib.trainer.TrainerConfig

Defines a DQNTrainer from the given configuration.

Parameters
  • dueling (bool) – Whether to use a dueling architecture.

  • hiddens – Dense-layer setup for each of the advantage branch and the value branch in a dueling architecture.

  • double_q (bool) – Whether to use double Q-learning.

  • n_step (int) – N-step for Q-learning.

Example

>>> from rllib.dqn import DQNConfig
>>> config = DQNConfig(dueling=False).training(gamma=0.9, lr=0.01) \
...     .environment(env="CartPole-v1") \
...     .resources(num_gpus=0) \
...     .workers(num_workers=4)
>>> trainer = config.build()