Distributed REINFORCE with Flame Runner
This report describes the reinforcement-learning example merged in PR #424, summarizes its design against current upstream sources, and presents single-run throughput and reward metrics for Ant-v5 and CartPole-v1. Test hardware and the Podman-based Flame cluster (1 FSM / 3 FEM / 1 FOC) are spelled out in the executive summary; runs were not on a datacenter or Linux-only rig—treat all timings as indicative, not universal baselines. Reference implementation: examples/rl/ in the Flame repository.
Executive summary
| Item | Finding |
|---|---|
| Artifact | Distributed REINFORCE training using flamepy.runner; discrete and continuous Gymnasium environments. |
| Test environment | Host: Apple M4 MacBook Pro (8 CPU cores, 24 GB unified memory). VM: Podman Machine (5 vCPU, 8 GB RAM). Flame on Podman Compose: 1× FSM, 3× FEM, 1× FOC (same-machine cluster; not a datacenter or bare-metal Linux farm). |
Ant-v5 (10000 episodes) |
Distributed 29.8 eps/sec vs local 20.8 eps/sec; relative throughput gain approximately 43.3%. |
CartPole-v1 (10000 episodes) |
Distributed 19.3 eps/sec vs local 8.1 eps/sec; relative throughput gain approximately 138.3%. |
| Reward quality | Single-trial outcomes differ by mode and environment; multi-seed study recommended before policy conclusions. |
Scope and motivation
The primary objective is to assess whether Flame Runner reduces rollout bottlenecks in policy-gradient training. In RL pipelines, episode collection is typically the dominant wall-clock cost and is naturally parallelizable because episodes are independent. The example adopts an actor-learner structure:
- Remote executors collect trajectories.
- A centralized learner computes gradients and updates policy weights.
- Updated weights are published for the next collection round.
Supported environments
| Environment | Type | Observation | Action | Typical Episode Time |
|---|---|---|---|---|
cartpole |
Discrete | 4 | 2 | ~1ms |
hopper |
Continuous (MuJoCo) | 11 | 3 | ~15ms |
halfcheetah |
Continuous (MuJoCo) | 17 | 6 | ~20ms |
walker2d |
Continuous (MuJoCo) | 17 | 6 | ~20ms |
ant |
Continuous (MuJoCo) | 105 | 8 | ~50ms |
System architecture
The training loop follows a distributed actor-learner flow:
- The learner keeps a local policy network.
- Policy weights are serialized and sent to remote workers.
- Workers run complete episodes and return trajectories.
- The learner computes discounted returns and policy gradients.
- Updated weights are broadcast for the next iteration.
Flame Runner is responsible for remote task scheduling and result collection; policy optimization remains in a single local learner loop.
Core implementation (flamepy.runner)
Excerpt — learner control path. The listing below is abbreviated from upstream examples/rl/basic/main.py; the learner-side loss and optimizer step are omitted for brevity.
from functools import partial
from flamepy.runner import Runner
collect_fn = partial(collect_episode, env_name=env_name)
with Runner(f"rl-{env_name}") as rr:
collector = rr.service(collect_fn)
for iteration in range(num_iterations):
weights_ref = rr.put_object(policy.state_dict())
futures = [collector(weights_ref) for _ in range(episodes_per_iteration)]
episodes = rr.get(futures)
# ... policy loss, backward(), optimizer.step()
Excerpt — remote worker. The worker callable matches the upstream structure: third-party and framework modules are imported inside the function body so execution environments resolve them at task run time (see upstream file for the full rollout loop).
def collect_episode(weights, env_name: str) -> dict:
import gymnasium as gym
import torch
from model import ENV_CONFIGS, create_policy
env_config = ENV_CONFIGS[env_name]
model = create_policy(env_config)
model.load_state_dict(weights)
model.eval()
env = gym.make(env_config.name)
# ... rollout until termination ...
return {
"states": states,
"actions": actions,
"rewards": rewards,
"total_reward": sum(rewards),
}
Parallelism. Each invocation of collector(...) may execute on a distinct remote executor, yielding concurrent episode collection subject to cluster capacity and scheduling policy.
flamepy.runner interface notes
Runner(name)
Runner is the lifecycle container for a distributed run:
with Runner("rl-ant") as rr:
...
- Initializes a named runner context.
- Manages runner/session connectivity during execution.
- Releases resources on context exit.
rr.service(callable)
Registers a callable as a runner service and returns an invokable handle:
collector = rr.service(collect_fn)
- Accepts a function or callable class implementing remote work.
- Returns a handle that is invoked like a local function.
- Each call schedules remote execution and returns a future-like object.
Remote invocation and futures
The training loop fans out episode jobs:
futures = [collector(weights_ref) for _ in range(episodes_per_iteration)]
- Calls are asynchronous.
- Pattern: fan-out task submission, then fan-in result collection.
rr.get(futures)
Waits for completion and returns resolved outputs:
episodes = rr.get(futures)
- Typical pattern: submit batch -> resolve batch -> run learner update.
- Preserves clear separation between collection and optimization.
rr.put_object(value)
Publishes a payload on the active Runner session and yields a reference suitable for remote task arguments. Upstream training uses the policy state_dict() as the published value (examples/rl/basic/main.py):
weights_ref = rr.put_object(policy.state_dict())
episodes = rr.get([collector(weights_ref) for _ in range(episodes_per_iteration)])
- Publication is scoped to the runner instance (
rr.put_object); earlier variants that used a module-levelput_objectwith explicit string keys are not required here. - Workers observe a resolved
state_dictwhere the runtime maps the reference to materialized data (per upstream docstring onweights).
Summary. The listed APIs suffice for an actor-learner layout with localized changes relative to a single-process trainer.
Reproduction procedure
Environment
- Start the Flame cluster (repository default compose workflow):
docker compose up -d
- Open a shell in the console container and set the working directory to the example:
docker compose exec -it flame-console /bin/bash
cd /opt/examples/rl
Commands exercised in this report
Distributed Ant-v5:
uv run main.py --env ant
Local Ant-v5 baseline:
uv run main.py --env ant --local
Additional command patterns (not all re-measured here):
# distributed cartpole (default env)
uv run main.py
# distributed mujoco envs
uv run main.py --env halfcheetah
uv run main.py --env hopper
uv run main.py --env walker2d
# custom training schedule
uv run main.py --env ant --iterations 50 --episodes-per-iter 50
# reward plot
uv run main.py --plot
CLI options
| Flag | Description | Default |
|---|---|---|
--env |
cartpole, halfcheetah, hopper, walker2d, ant |
cartpole |
--local |
Run locally without Flame cluster | off |
--iterations |
Number of training iterations | 100 |
--episodes-per-iter |
Episodes collected per iteration | 100 |
--plot |
Show reward plot after training | off |
Benchmark configuration (both environments reported below).
- Training iterations:
100 - Episodes per iteration:
100 - Total episodes:
10000
Test platform. All runs were executed on an Apple M4 MacBook Pro (8 CPU cores, 24 GB unified memory) under macOS. For distributed mode, the Flame cluster ran inside Podman Machine (5 vCPU, 8 GB RAM) using Podman Compose with 1 FSM, 3 FEM, and 1 FOC replicas—i.e. a single-host, VM-contained stack rather than a datacenter or Linux-only rig. Hardware, VM sizing, and service replica counts were not varied within this note.
Results — Ant-v5
Data source. Single pair of runs on that macOS setup; logs supplied for this note.
Measurements.
| Mode | Total Time | Episodes | Throughput | Final Mean Reward |
|---|---|---|---|---|
Distributed (--env ant) |
335.71s | 10000 | 29.8 eps/sec | -252.6 |
Local (--env ant --local) |
481.70s | 10000 | 20.8 eps/sec | -166.1 |
Throughput
Relative improvement of distributed over local throughput:
\[\frac{29.8 - 20.8}{20.8} \approx 43.3\%\]Reward
Under the stated single trial, the local configuration achieved a higher terminal mean reward than the distributed configuration (-166.1 vs -252.6), while distributed mode completed in less wall time. Policy-gradient estimators are sensitive to sample noise; generalization of reward ordering requires replicated experiments with controlled randomness.
Results — CartPole-v1
Data source. Single pair of runs on the same macOS setup; logs supplied for this note.
Measurements.
| Mode | Total Time | Episodes | Throughput | Final Mean Reward |
|---|---|---|---|---|
Distributed (uv run main.py) |
517.94s | 10000 | 19.3 eps/sec | 499.8 |
Local (uv run main.py --local) |
1230.12s | 10000 | 8.1 eps/sec | 193.4 |
Throughput
Relative improvement of distributed over local throughput:
\[\frac{19.3 - 8.1}{8.1} \approx 138.3\%\]Reward
Distributed mode reported a higher terminal mean reward than local mode (499.8 vs 193.4). The local trace exhibits a sharp regression at the final logged iteration, which is consistent with stochastic policy-gradient training rather than a definitive ranking of modes.
Interpretation — when distribution helps
Distributed execution incurs coordination and transfer overhead; net benefit scales with per-episode simulation cost and batching parameters:
| Workload | Expected Distributed Benefit |
|---|---|
| Very fast envs (for example CartPole) | Environment-dependent; can still benefit significantly in practice |
| Medium-cost envs (for example Hopper/HalfCheetah) | Moderate, depends on batch parallelism |
| Heavier envs (for example Ant) | Strong with enough episodes per iteration |
| Expensive real-world/sim workloads | Very strong; often essential |
Observations under this report’s configuration.
Throughput and reward ordering below were measured only on the M4 MacBook Pro + Podman Machine (5 vCPU / 8 GB) + 1 FSM / 3 FEM / 1 FOC setup summarized in the executive summary. Absolute rates (episodes per second) will change on other CPUs, memory pressure, or different Flame replica layouts; relative distributed-vs-local gains may also shift once coordination overhead dominates or the VM is resized.
Ant-v5: distributed throughput approximately 43.3% above local.CartPole-v1: distributed throughput approximately 138.3% above local.
Recommendations (operations)
Operational parameters suggested by the example and measured runs:
- Increase
--episodes-per-iterto amortize scheduling and transfer overhead. - Prefer heavier environments when validating speedups.
- Scale executor count with rollout demand.
- Use
--localfor lightweight environments and debugging loops.
Conclusion
Within the two single-run configurations documented here—and only on the stated M4 host and Podman-backed Flame cluster—Flame Runner–backed distributed rollout increased episode throughput relative to local rollout for both Ant-v5 and CartPole-v1. Terminal reward favored local training in the Ant trial and distributed training in the CartPole trial, underscoring that throughput and policy quality should be evaluated on separate criteria and, where reward matters, with multi-seed statistical replication.
Implementation status (upstream examples/rl/)
The following items are reflected in the current main branch layout:
- Shared policy and environment configuration in
model.pyto stabilize imports on remote workers. - Discounted return computation implemented with linear-time reverse indexing.
- Weight publication via
Runner.put_object(policy.state_dict())(session-scoped object handle).
References
- Flame: Add RL example — PR #424
- Flame repository: xflops/flame
- Reference code:
examples/rl/basic/main.py - Reference documentation:
examples/rl/basic/README.md