Enabling Symbiosis in Multi-Robot Systems via MARL (ICPS 2025)
Post content provided by: Xuezhi
Authors: Xuezhi Niu, Natalia Calvo Barajas and Didem Gürdür Broo
Multi-robot systems often waste energy and time because agents act in isolation. We frame coordination as symbiosis:
agents share essential state (battery levels) during training so policies learn to cooperate under centralized training with
decentralized execution (CTDE). We implement a VDN-style MARL algorithm in which a centralized critic sees each robot’s
battery, encouraging cooperative charging and task allocation while execution stays decentralized. In a warehouse case
(2 AMRs, continuous jobs), this reduces delivery time and balances workload and energy. In simulation, symbiotic MARL
cuts task completion time by 10.7% and improves workload/energy balance by 13.81% on average versus non-symbiotic and heuristic baselines.
Introduction
Warehouse robot teams must coordinate charging and delivery under partial observability and shared resources. We study
a core question: can minimal state sharing (battery levels only) during CTDE improve cooperation in heterogeneous
AMRs? We implement a VDN-based MARL where a centralized critic observes team batteries and trains decentralized
policies to schedule charging and allocate tasks. We take inspiration from mycorrhizal networks: sharing a
minimal vital signal (battery) enables cooperative resource use. During training, agents expose battery state to the critic;
at execution, each runs locally. This reduces interference (e.g., charger contention) and stabilizes learning.
Formally, each agent learns an individual target network \(Q^{*i}(s^i,a^i)\) and an online network \(Q^i(\tilde{s}^i,a^i)\), where
\(\tilde{s}^i = [s_t^i, b_t^1, \ldots, b_t^{i-1}, b_t^{i+1}, \ldots, b_t^N]\); the team value is \(Q^{\text{tot}}(\tilde{s}_t, a_t)
= \sum_{i=1}^N Q^i(\tilde{s}_t^i, a_t^i)\). Rewards combine a local term (deliveries, low-battery penalty)
and a global term (task/energy balance, collisions at chargers).
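To make the notation concrete, here is a minimal Python/NumPy sketch (not the paper's code) of how the symbiotic state \(\tilde{s}^i\) can be assembled by appending teammates' battery levels, and how the VDN-style team value sums the chosen per-agent Q-values; function and variable names are illustrative.

```python
# Illustrative sketch only: symbiotic state construction and VDN team value.
import numpy as np

def augment_state(local_state: np.ndarray, batteries: np.ndarray, i: int) -> np.ndarray:
    """Concatenate agent i's local state with the battery levels of all other agents."""
    others = np.delete(batteries, i)          # b^1, ..., b^{i-1}, b^{i+1}, ..., b^N
    return np.concatenate([local_state, others])

def q_tot(chosen_per_agent_q: np.ndarray) -> float:
    """VDN team value: sum of the per-agent Q-values for the chosen actions."""
    return float(np.sum(chosen_per_agent_q))
```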
Symbiosis Mechanism
Symbiosis is integrated into the framework by augmenting each agent's state representation with the battery levels of all other agents, enabling the critic to optimize energy efficiency and cooperative task allocation. For each agent \(i\), the online Q-network is updated by:

$$ Q^i_{t}(\tilde{s}^i_t,a^i_t) \leftarrow Q^i_t(\tilde{s}^i_t,a^i_t) + \alpha \left( r_t + \gamma \max\limits_{a^i_{t+1}} Q^{*i}_{t}(s^i_{t+1},a^i_{t+1}) - Q^i_t(\tilde{s}^i_t,a^i_t) \right) $$

where \(Q^i_t(\tilde{s}^i_t,a^i_t)\) is the online Q-value at time \(t\) for the symbiotic state \(\tilde{s}^i_t\) and action \(a^i_t\), \(\alpha\) is the learning rate controlling the magnitude of updates, and \(Q^{*i}\) is the individual target Q-value function over local states \(s^i_t\). The target network is updated periodically via a soft update:

$$ Q^{*i}_{t}(s^i_t,a^i_t) \leftarrow Q^{*i}_{t}(s^i_t,a^i_t) + \tau \left( Q^i_{t}(\tilde{s}^i_t,a^i_t) - Q^{*i}_{t}(s^i_t,a^i_t) \right) $$
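The soft target update above corresponds to standard Polyak averaging. A minimal PyTorch sketch, assuming online and target networks with identical architectures; the names `online_q`, `target_q`, and `tau` are ours, not taken from the paper:

```python
# Sketch of the soft (Polyak) target update described above.
import torch

@torch.no_grad()
def soft_update(target_q: torch.nn.Module, online_q: torch.nn.Module, tau: float) -> None:
    for p_target, p_online in zip(target_q.parameters(), online_q.parameters()):
        # Q* <- Q* + tau * (Q - Q*)  ==  (1 - tau) * Q* + tau * Q
        p_target.mul_(1.0 - tau)
        p_target.add_(p_online, alpha=tau)
```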


The centralized critic \(Q^{\text{tot}}\left( \tilde{s}_t , a_t \right)\) is defined as the sum of online Q-values, \(Q^{\text{tot}}( \tilde{s}_t, a_t ) = \sum_{i=1}^N Q^i \left( \tilde{s}_t^i, a_t^i \right)\), where \(\tilde{s}_t^i\) includes agent \(i\)’s current state \(s_t^i\) and the battery levels of all other agents. This joint value guides the loss computation, \(L = \mathbb{E}\left[ \left( y^\text{tot}_t - Q^{\text{tot}}_t \right)^2 \right]\), with the target \(y^\text{tot}_t = r_t + \gamma \max\limits_{ \mathbf{a}_{t+1} } Q^{\text{tot}} \left( \tilde{s}_{t+1}, \mathbf{a}_{t+1} \right)\). The loss is backpropagated during training, enabling the critic to learn a joint policy that incorporates symbiotic information for coordinated behavior. The centralized critic evaluates group-level symbiotic metrics, while agents act based on local observations, supporting cooperation and scalability.
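A minimal sketch of the resulting VDN loss, assuming per-agent online and target DQNs and a batch of transitions; tensor names and shapes are illustrative assumptions, not the authors' implementation:

```python
# Illustrative VDN loss: sum per-agent Q-values, form a TD target, regress with MSE.
import torch
import torch.nn.functional as F

def vdn_loss(online_qs, target_qs, obs, actions, reward, next_obs, gamma):
    # obs/next_obs: list of per-agent tensors [B, obs_dim_i]; actions: [B, N] (long);
    # reward: [B, 1]; gamma: scalar discount factor.
    chosen, next_best = [], []
    for i, (q_i, q_i_target) in enumerate(zip(online_qs, target_qs)):
        q_vals = q_i(obs[i])                                   # [B, num_actions]
        chosen.append(q_vals.gather(1, actions[:, i:i + 1]))   # Q^i(s~^i_t, a^i_t)
        with torch.no_grad():
            next_best.append(q_i_target(next_obs[i]).max(dim=1, keepdim=True).values)
    q_tot = torch.stack(chosen, dim=0).sum(dim=0)              # [B, 1]
    y_tot = reward + gamma * torch.stack(next_best, dim=0).sum(dim=0)
    return F.mse_loss(q_tot, y_tot)
```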
Warehouse Case and Actions
Environment: 60 m × 60 m with one loading station, three unloading stations, two charging stations, and obstacles/restricted areas. Two AMRs repeatedly load, deliver, and unload packages; the chargers also serve as start locations. Actions: continue, recharge at station 1, recharge at station 2. Charger locations are randomized per episode for robustness.
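For illustration, the three-action space and per-episode charger randomization could be encoded as below; the enum names and the uniform placement (which ignores obstacles) are assumptions made for the sketch, not details from the paper:

```python
# Hypothetical encoding of the action space and episode setup described above.
from enum import IntEnum
import random

class Action(IntEnum):
    CONTINUE = 0       # keep executing the current delivery job
    RECHARGE_ST1 = 1   # head to charging station 1
    RECHARGE_ST2 = 2   # head to charging station 2

def sample_charger_positions(grid_size=60.0, seed=None):
    """Randomize the two charger locations at the start of each episode."""
    rng = random.Random(seed)
    return [(rng.uniform(0, grid_size), rng.uniform(0, grid_size)) for _ in range(2)]
```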


Robots: the simulated AMRs match the reported endurance (~180 min). Observations: per-agent [delivered fraction, \(b_t\), \(b_{t-1}\)], normalized. Rewards: a local delivery bonus plus a sharp penalty for \(b_t < 10\%\); a global term penalizes charger conflicts and balances workload. Learning: VDN with double-DQN-style targets, soft target updates, and CTDE with battery sharing only.
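A hedged sketch of the reward shaping just described: a local delivery bonus, a sharp penalty below 10% battery, and a global term penalizing charger conflicts and workload/energy imbalance. All coefficients are placeholders, not the paper's values:

```python
# Illustrative reward shaping; weights are arbitrary placeholders.
import numpy as np

def local_reward(delivered: bool, battery: float) -> float:
    r = 1.0 if delivered else 0.0          # delivery bonus
    if battery < 0.10:                     # sharp penalty near depletion
        r -= 5.0
    return r

def global_reward(tasks_done: np.ndarray, energy_used: np.ndarray,
                  charger_conflict: bool) -> float:
    r = -float(np.var(tasks_done)) - float(np.var(energy_used))  # balance workload/energy
    if charger_conflict:                                         # both AMRs at one charger
        r -= 2.0
    return r
```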
Results at a Glance
Training curves show earlier reward gains and steadier package throughput under Sym-MARL; non-symbiotic baselines plateau and show imbalance (one agent becomes “lazy”). Evaluation against a Static Recharging baseline (a threshold-based policy) confirms better completion times and a fairer task/energy split across delivery milestones (200–500 packages).
- Faster completion: −10.7% time.
- Fairer distribution: −13.81% variance in tasks/energy.
- Fewer simultaneous low-battery events and less charger contention.
Citation
If you find the idea useful, please consider citing our work:
@inproceedings{niu2025enabling,
  title     = {Enabling Symbiosis in Multi-Robot Systems Through Multi-Agent Reinforcement Learning},
  author    = {Niu, Xuezhi and Barajas, Natalia Calvo and Broo, Didem Gürdür},
  booktitle = {2025 IEEE 8th International Conference on Industrial Cyber-Physical Systems (ICPS)},
  year      = {2025},
  doi       = {10.1109/ICPS65515.2025.11087893},
  pages     = {1--7}
}