Evaluation Metrics

The ECML 2026 challenge is the newest competition around the Flatland environment.

In this edition, we encourage participants to develop innovative solutions that leverage reinforcement learning, and the scenario setup and evaluation metrics are designed accordingly. We remain open to other approaches, such as operations research, and encourage participants to benchmark their state-of-the-art algorithms.

⚖ Evaluation metrics#

Normalized Episode Rewards#

The primary metric is the normalized return of your agents: the higher, the better.

What is the normalized return?

The return is the sum of Flatland’s default rewards that your agents accumulate during an episode, as described in rewards.md.
To normalize the returns, we scale them so that they stay in the range \([0.0, 1.0]\). To guarantee this, each agent's cumulative reward is clipped from below at -max_episode_steps, so the penalty per agent is at most max_episode_steps. The normalized reward makes results comparable across environments of different dimensions and different numbers of agents.
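Written as a formula, the normalization reads as follows. The symbols are introduced here for illustration only: \(R_i\) is the cumulative reward of agent \(i\), \(N\) is the number of agents, and \(T_{\max}\) stands for max_episode_steps.

\[
R_{\text{norm}} = 1 + \frac{\sum_{i=1}^{N} \max\left(R_i,\, -T_{\max}\right)}{N \cdot T_{\max}}
\]

Since the rewards are penalties, each clipped term lies between \(-T_{\max}\) and \(0\), so the fraction lies between \(-1\) and \(0\) and the result stays in \([0.0, 1.0]\).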

In code:

normalized_reward = sum(
    max(cumulative_rewards[agent.handle], -self.env.max_episode_steps)
    for agent in agents
) / (self.env.max_episode_steps * self.env.get_num_agents()) + 1

The episodes finish when all the trains have reached their target, or when the maximum number of time steps is reached. Therefore:

The minimum possible value (i.e. the worst) is 0.0, which occurs if none of the agents reaches its target during the episode.
The maximum possible value (i.e. the best) is 1.0, which occurs if all agents reach their targets and intermediate stops on time, i.e. receive no penalty.
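To try the normalization outside the evaluator, the following is a minimal standalone sketch. It assumes that the cumulative rewards are non-positive penalties keyed by agent handle. The function signature and the sample numbers are made up for illustration and are not part of Flatland's API.

```python
from typing import Mapping


def normalized_reward(cumulative_rewards: Mapping[int, float], max_episode_steps: int) -> float:
    """Return the normalized episode reward in [0.0, 1.0].

    cumulative_rewards maps an agent handle to its summed reward for the
    episode (assumed to be <= 0, since rewards are penalties).
    """
    num_agents = len(cumulative_rewards)
    # Clip each agent's return from below at -max_episode_steps.
    clipped = [max(r, -max_episode_steps) for r in cumulative_rewards.values()]
    # Scale to [-1, 0] and shift to [0, 1].
    return sum(clipped) / (max_episode_steps * num_agents) + 1


# Hypothetical episode with 3 agents and max_episode_steps = 200:
# agent 0 has no penalty, agent 1 a small one, agent 2 is clipped at -200.
print(normalized_reward({0: 0, 1: -50, 2: -400}, max_episode_steps=200))
```

In this example the clipped returns are 0, -50 and -200, so the result is 1 - 250/600, roughly 0.58.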

Submission Score#

The submission score is the sum of the normalized scenario rewards.

Evaluation stops after a level (5 scenarios) in which the submission does not reach the threshold of 25% completed agents.
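The scoring rule and the early stop can be sketched as below. This is an illustration under assumptions, not the evaluator's code: the `(normalized_reward, fraction_done)` tuple layout is hypothetical, and the sketch assumes the 25% threshold is checked against the mean over the scenarios of a level, with the failing level still counted in the score.

```python
from typing import Sequence, Tuple

LEVEL_SIZE = 5                        # scenarios per level
PERCENTAGE_COMPLETE_THRESHOLD = 0.25  # minimum mean fraction of done agents


def submission_score(results: Sequence[Tuple[float, float]]) -> float:
    """Sum normalized rewards, stopping after the first level below the threshold.

    Each entry is (normalized_reward, fraction_of_agents_done) for one
    scenario, in evaluation order (hypothetical layout for this sketch).
    """
    score = 0.0
    for start in range(0, len(results), LEVEL_SIZE):
        level = results[start:start + LEVEL_SIZE]
        score += sum(reward for reward, _ in level)
        mean_done = sum(done for _, done in level) / len(level)
        if mean_done < PERCENTAGE_COMPLETE_THRESHOLD:
            break  # the failing level counts, later levels are not evaluated
    return score


# Hypothetical run: level 1 passes, level 2 fails, level 3 is never evaluated.
results = [(0.5, 1.0)] * 5 + [(0.25, 0.0)] * 5 + [(1.0, 1.0)] * 5
print(submission_score(results))  # 3.75
```

Because the score is a sum rather than a mean, passing more levels raises the attainable score.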

Factors in reward function#

The factors for the reward function in this competition are:

factor	value
journey not started (cancellation factor)	5
cancellation time buffer	0
delay at target	1
target not reached minimum penalty	100
intermediate stop not served	50
intermediate late arrival	0.5
intermediate early departure	0.5
collision	250

This configuration is implemented using --rewards flatland.envs.rewards.ECML2026Rewards.

⛽ Time and Resource limits#

The agents have to act within time limits:

You are allowed up to 30 minutes per scenario.
The full evaluation must finish in 5 hours.
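One way to stay within these limits is to track the remaining time inside your own agent. The helper below is a hypothetical sketch, not part of Flatland or the evaluator. It measures local wall-clock time, which may differ from how the platform accounts for time.

```python
import time

RUNNING_TIME_LIMIT = 1800         # seconds allowed per scenario (30 min)
TOTAL_RUNNING_TIME_LIMIT = 18000  # seconds allowed for all scenarios (5 h)


class TimeBudget:
    """Tracks elapsed time against the per-scenario and total limits."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._evaluation_start = clock()
        self._scenario_start = self._evaluation_start

    def start_scenario(self) -> None:
        """Call at the beginning of each scenario to reset the scenario timer."""
        self._scenario_start = self._clock()

    def remaining(self) -> float:
        """Seconds left before either limit is hit (never negative)."""
        now = self._clock()
        scenario_left = RUNNING_TIME_LIMIT - (now - self._scenario_start)
        total_left = TOTAL_RUNNING_TIME_LIMIT - (now - self._evaluation_start)
        return max(0.0, min(scenario_left, total_left))


budget = TimeBudget()
budget.start_scenario()
print(budget.remaining() <= RUNNING_TIME_LIMIT)  # True
```

An agent could, for example, switch to a cheaper fallback policy once `remaining()` drops below a safety margin. Note that only 10 scenarios that each use the full 30 minutes fit into the 5-hour total.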

The agents are evaluated in a container with the following resource limits:

4 CPU cores
15 GB of main memory.

We do not provide GPUs.

Detailed overview of resource limits#

Limit[1]	Value	Submission outcome	Details
`dailyLimit`	`2`	Not created	Error in frontend as error `429 TOO_MANY_REQUESTS` from backend.
`WAIT_FOR_POD_TO_RUN_LIMIT`	`1200` (20 min)	Failure	submission pod should be listed by now, i.e. pulling has started by now.
`WAIT_FOR_POD_TO_START_LIMIT`	`1200` (20 min)	Failure	submission pod should have reached running state by now, i.e. pulling should be done by now
`RUNNING_TIME_LIMIT`	`1800` (30 min)	Success with termination cause	per scenario; evaluation terminated; results do not include the overlong scenario
`TOTAL_RUNNING_TIME_LIMIT`	`18000` (5h)	Success with termination cause	all scenarios, excluding technical overhead for starting pods and running offline trajectory evaluation; results do not include the overlong scenario
`ACTIVE_DEADLINE_SECONDS`	`3600` (1h)	Failure/cleanup	everything including technical overhead for starting pods for submission
`PERCENTAGE_COMPLETE_THRESHOLD`	`0.25` (25%)	Success with termination cause	`Mean percentage of done agents during the last test was too low`; results do include the test, but stop after the test.
`ORCHESTRATION_JOB_K8S_RESOURCE_ALLOCATION`	`{"requests": {"memory": "5Gi", "cpu": "1"}, "limits": {"memory": "5Gi", "cpu": "1"}}`	Failure	resource limits for the pod running the orchestration job
`K8S_RESOURCE_ALLOCATION`	`{"requests": {"memory": "15Gi", "cpu": "4"}, "limits": {"memory": "15Gi", "cpu": "4"}}`	Failure	resource limits for pod running the submission
`ORCHESTRATION_JOB_ACTIVE_DEADLINE_SECONDS`	`28800` (8h)	Failure/cleanup	everything including technical overhead for starting pods for orchestration and evaluation

📪 Daily Submission Limits and Submission Closure#

You can submit up to 2 times per day.
