Enabling Symbiosis in Multi-Robot Systems via MARL (ICPS 2025)

Post content provided by: Xuezhi

Authors: Xuezhi Niu, Natalia Calvo Barajas and Didem Gürdür Broo
Multi-robot systems often waste energy and time because agents act in isolation. We frame coordination as symbiosis: agents share essential state (battery) during training so policies learn to cooperate under CTDE. We implement a VDN-style MARL with a centralized critic and decentralized execution; the critic sees each robot’s battery, encouraging cooperative charging and task allocation. In a warehouse case (2 AMRs, continuous jobs), this improves delivery time and balances workload/energy. In simulation, symbiotic MARL cuts task completion time by 10.7% and improves workload/energy balance by 13.81% on average vs. non-symbiotic/heuristic baselines.

Introduction

Warehouse robot teams must coordinate charging and delivery under partial observability and shared resources. We study a core question: can minimal state sharing (battery levels only) during CTDE improve cooperation among heterogeneous AMRs? We implement a VDN-based MARL framework in which a centralized critic observes the team's battery levels and trains decentralized policies to schedule charging and allocate tasks. We take inspiration from mycorrhizal networks: sharing a minimal vital signal (battery) enables cooperative resource use. During training, agents expose their battery state to the critic; at execution, each runs locally. This reduces interference (e.g., charger contention) and stabilizes learning.
Formally, each agent learns an individual target \(Q^{*i}(s^i,a^i)\) and an online \(Q^i(\tilde{s}^i,a^i)\), where \(\tilde{s}^i = [s_t^i,\, b_t^1, \ldots, b_t^{i-1},\, b_t^{i+1}, \ldots, b_t^N]\); the team value is \(Q^{\text{tot}}(\tilde{s}_t, a_t) = \sum_{i=1}^N Q^i(\tilde{s}_t^i, a_t^i)\). Rewards combine a local term (deliveries, low-battery penalty) and a global term (task/energy balance, collisions at chargers).
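For concreteness, here is a minimal Python sketch of how the symbiotic state \(\tilde{s}^i\) and the additive team value can be assembled (names and shapes are our illustrative assumptions, not values from the paper):

```python
import numpy as np

def symbiotic_state(local_state, batteries, i):
    """Build s~_t^i: agent i's local state s_t^i concatenated with the
    battery levels of every *other* agent, i.e. [b_t^1..b_t^N] minus b_t^i."""
    other_batteries = np.delete(batteries, i)
    return np.concatenate([local_state, other_batteries])

def q_tot(per_agent_q_values):
    """VDN team value: Q_tot(s~_t, a_t) = sum_i Q^i(s~_t^i, a_t^i)."""
    return float(np.sum(per_agent_q_values))

# Toy example: 3 agents, 4-dimensional local states (illustrative shapes).
batteries = np.array([0.82, 0.47, 0.91])
s_tilde_0 = symbiotic_state(np.zeros(4), batteries, i=0)  # 4 + 2 = 6 dims
```

The only coupling between agents at this stage is the shared battery vector; everything else stays local.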

Symbiosis Mechanism

Symbiosis is integrated into the framework by augmenting each agent's state representation with the battery levels of all other agents, enabling the critic to optimize energy efficiency and cooperative task allocation. For each agent \(i\), the online Q-network is updated by: $$ Q^i_{t}(\tilde{s}^i_t,a^i_t) \leftarrow Q^i_t(\tilde{s}^i_t,a^i_t) + \alpha \left( r_t + \gamma \max\limits_{a^i_{t+1}} Q^{*i}_{t}(s^i_{t+1}, a^i_{t+1}) - Q^i_t(\tilde{s}^i_t,a^i_t) \right) $$ where \(Q^i_t(\tilde{s}^i_t,a^i_t)\) is the online Q-value at time \(t\) for the symbiotic state \(\tilde{s}^i_t\) and action \(a^i_t\), \(\alpha\) is the learning rate controlling the magnitude of updates, and \(Q^{*i}\) is the individual target Q-value function over local states \(s^i_t\). The target network is updated periodically by a soft update, \(Q^{*i}_{t}(s^i_t,a^i_t) \leftarrow Q^{*i}_t(s^i_t,a^i_t) + \tau \left( Q^i_t(\tilde{s}^i_t,a^i_t) - Q^{*i}_t(s^i_t,a^i_t) \right)\), where \(\tau\) controls how quickly the target tracks the online network.
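Both updates are easy to sketch in PyTorch. This is a hedged illustration, not our exact implementation: we assume the online and target networks share one architecture (e.g., the target receives the local state zero-padded to the symbiotic input dimension), so the soft update can run in parameter space.

```python
import torch

GAMMA, TAU = 0.99, 0.01  # illustrative hyperparameters

def td_update(online_net, target_net, optimizer, s_tilde, a, r, s_next):
    """One step of Q^i <- Q^i + alpha * (r + gamma * max_a' Q*^i(s', a') - Q^i);
    alpha is absorbed into the optimizer's learning rate."""
    with torch.no_grad():
        td_target = r + GAMMA * target_net(s_next).max()
    td_error = td_target - online_net(s_tilde)[a]
    loss = td_error.pow(2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

@torch.no_grad()
def soft_update(target_net, online_net):
    """Q*^i <- Q*^i + tau * (Q^i - Q*^i), applied parameter-wise."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - TAU).add_(TAU * p_online)
```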

Figure: Symbiotic MARL framework. Agents share battery information via symbiosis connections (blue dashed lines) while maintaining individual Q-networks for local decision-making. The framework integrates environment sampling (orange arrows), symbiotic information sharing, and learning.
Figure: Pseudocode of the symbiotic training algorithm.

The centralized critic \(Q^{\text{tot}}(\tilde{s}_t, a_t)\) is defined as the sum of online Q-values, \(Q^{\text{tot}}(\tilde{s}_t, a_t) = \sum_{i=1}^N Q^i(\tilde{s}_t^i, a_t^i)\), where \(\tilde{s}_t^i\) includes agent \(i\)’s current state \(s_t^i\) and the battery levels of all other agents. This joint value guides the loss computation, \(L = \mathbb{E}[(y^\text{tot}_t - Q^{\text{tot}}_t)^2]\), with the target \(y^\text{tot}_t = r_t + \gamma \max\limits_{\mathbf{a}_{t+1}} Q^{\text{tot}}(\tilde{s}_{t+1}, \mathbf{a}_{t+1})\). The loss is backpropagated during training, enabling the critic to learn a joint policy that incorporates symbiotic information for coordinated behavior. The centralized critic evaluates group-level symbiotic metrics, while agents act on local observations, supporting both cooperation and scalability.
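A compact sketch of the resulting loss (PyTorch; the `batch` field names are assumptions). Because \(Q^{\text{tot}}\) is a sum of per-agent Q-values, the joint maximization in \(y^\text{tot}_t\) decomposes into independent per-agent maxima, which is what keeps VDN tractable:

```python
import torch
import torch.nn.functional as F

def vdn_loss(online_nets, target_nets, batch, gamma=0.99):
    """L = E[(y_tot - Q_tot)^2] with Q_tot = sum_i Q^i(s~^i, a^i)."""
    # Q_tot for the actions actually taken (batched transitions).
    q_tot = sum(net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                for net, s, a in zip(online_nets, batch["s_tilde"], batch["actions"]))
    with torch.no_grad():
        # Additivity lets the joint max split into per-agent greedy maxima.
        next_q_tot = sum(net(s_next).max(dim=1).values
                         for net, s_next in zip(target_nets, batch["s_next"]))
        y_tot = batch["reward"] + gamma * next_q_tot
    return F.mse_loss(q_tot, y_tot)
```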

Warehouse Case and Actions

Environment: 60 m × 60 m, with one loading station, three unloading stations, two chargers, and obstacles/restricted areas. Two AMRs repeatedly cycle through load, deliver, and unload; the chargers also serve as start locations. Actions: continue, recharge at station 1, or recharge at station 2. Charger positions are randomized per episode for robustness.
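The action interface is small enough to sketch directly (a hypothetical encoding; the paper does not prescribe these identifiers):

```python
from enum import IntEnum
import random

class AMRAction(IntEnum):
    CONTINUE = 0    # keep executing the current load-deliver-unload job
    RECHARGE_1 = 1  # divert to charging station 1
    RECHARGE_2 = 2  # divert to charging station 2

def reset_chargers(free_cells):
    """Randomize the two charger positions each episode for robustness;
    chargers double as the robots' start locations."""
    return random.sample(free_cells, k=2)
```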

Figure: Overview of the multi-robot warehouse system architecture. The diagram shows the control loops: the plant, local controllers, and the global controller. The plant models robot dynamics, energy consumption, and behaviors. Local controllers manage path planning and execution, while the global controller handles collision avoidance and task allocation and integrates the MARL planner for optimized task execution and energy management.
Figure: Simulated warehouse layout (60 m × 60 m) highlighting key areas: one loading station, three unloading stations (Unloading 1–3), and two charging stations (Charging 1–2). Blue and green lines show example robot trajectories; black regions indicate obstacles and restricted areas.

Notes for practitioners (a minimal observation/reward sketch follows the list):

  • Robots: simulation matches the reported endurance (~180 min).
  • Observations: per-agent [delivered %, \(b_t\), \(b_{t-1}\)], normalized.
  • Rewards: a local delivery bonus plus a sharp penalty for \(b_t < 10\%\); a global term penalizes charger conflicts and balances workload.
  • Learning: VDN with double-DQN-style targets, soft target updates, and CTDE with battery-level sharing only.
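The observation and reward construction from these notes, as a minimal sketch (the weights are illustrative assumptions, not the paper's values):

```python
import numpy as np

def observation(delivered_frac, b_t, b_prev):
    """Per-agent observation [delivered%, b_t, b_{t-1}], normalized to [0, 1]."""
    return np.array([delivered_frac, b_t, b_prev], dtype=np.float32)

def reward(delivered_now, b_t, charger_conflict, workload_imbalance,
           w_deliver=1.0, w_low=5.0, w_conflict=1.0, w_balance=0.5):
    """Local term (delivery bonus, sharp low-battery penalty) plus a global
    term penalizing charger conflicts and workload/energy imbalance."""
    local = w_deliver * float(delivered_now) - (w_low if b_t < 0.10 else 0.0)
    global_term = -w_conflict * float(charger_conflict) - w_balance * workload_imbalance
    return local + global_term
```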

Results at a Glance

Training curves show earlier reward gains and steadier package throughput under symbiotic MARL; non-symbiotic baselines plateau and show imbalance (a “lazy” agent). Evaluation against Static Recharging (a threshold-based policy) confirms better completion time and a fairer task/energy split across milestones (200–500 packages).

  • Faster completion: −10.7% time.
  • Fairer distribution: −13.81% variance in tasks/energy.
  • Fewer simultaneous low-battery events and less charger contention.

Citation

If you find the idea useful, please consider citing our work:

                                    
@inproceedings{niu2025enabling,
  title     = {Enabling Symbiosis in Multi-Robot Systems Through Multi-Agent Reinforcement Learning},
  author    = {Niu, Xuezhi and Barajas, Natalia Calvo and Broo, Didem Gürdür},
  booktitle = {2025 IEEE 8th International Conference on Industrial Cyber-Physical Systems (ICPS)},
  year      = {2025},
  pages     = {1--7},
  doi       = {10.1109/ICPS65515.2025.11087893}
}

Event Gallery

Technical tour at Volkswagen’s Emden plant (est. 1964): longtime home of the Passat, now upgraded for EVs like the ID.7; on the North Sea with its own export port. Xuezhi is third from the right.
Technical tour at Meyer Werft, Papenburg (founded 1795). A leading cruise-ship yard with covered docks for year-round builds. Xuezhi is fifth from the right.
Gala Dinner!
Gala dinner with colleagues at the conference; Xuezhi is second from left.

Slides

ICPS.pdf