Training the DRL Agent
Our framework enables training agents using Deep Reinforcement Learning (DRL) algorithms across three types of observation and action spaces:
Global: The traditional setting used in CyberBattleSim and most of the existing literature.
Local: Introduced by the study Leveraging Deep Reinforcement Learning for Cyber-Attack Paths Prediction (Terranova et al.). This view focuses only on a pair of nodes as observation and the selection of a local vulnerability or switching action at each timestep.
Continuous: Proposed in this project and detailed in our accompanying paper, enabling more expressive and flexible agent actions.
The framework integrates all major algorithms from the Stable Baselines3 library and is built for flexibility and advanced experimentation.
Training Command
To launch training, use the following command:
cd cyberbattle/agent
python3 train_agent.py \
--name LOGS_FOLDER_NAME \
--num_runs NUM_RUNS \
--environment_type {global, local, continuous} \
--algorithm {TRPO, PPO, A2C, DQN, DDPG, SAC, TD3} \
[--load_seeds SEEDS_FILE] \
[--random_seeds] \
[--load_envs ENVS_FOLDER] \
[--holdout] \
[--early_stopping PATIENCE] \
--goal {control, discovery, disruption} \
--nlp_extractor NLP_EXTRACTOR \
[--finetune_model MODEL_PATH]
--name: Name of the logs folder.--num_runs: Number of training runs (for robustness across different seeds).--environment_type: Type of observation/action space (global, local, or continuous).--algorithm: DRL algorithm from Stable Baselines3.--load_seeds/--random_seeds: Specify a file containing seeds or generate randomly the seeds to use.--load_envs: Folder with scenario files (if not set, default ones from rootconfig.yamlfile will be used).--holdout: Use training/validation split.--early_stopping: Patience before stopping on no validation improvement.--goal: Agent threat model betweencontrol,discovery, ordisruption(set can be extended).--nlp_extractor: NLP model used for encoding vulnerabilities (defines part of the action space) and the content of the feature vectors forming the observation vector.--finetune_model: Path to a pre-trained model to fine-tune, without starting from a randomly initialized model.
More detailed settings (e.g., agent architecture, episode config, reward shaping, normalization) can be configured via the following configutation files:
cyberbattle/agent/config/train_config.yamlcyberbattle/agent/config/algo_config.yamlcyberbattle/agent/config/rewards_config.yaml
Training Output
After training, the log folder contains models, checkpoints, environments, and TensorBoard logs:
LOGS_FOLDER
├── app.log
├── checkpoints/
│ └── 1/
│ ├── checkpoint_10000_steps.zip
│ └── ...
├── envs/
│ ├── 1.pkl
│ └── ...
├── seeds.yaml
├── split.yaml
├── train_config.yaml
├── TRPO_1/
│ └── events.out.tfevents...
└── validation/
└── 1/
├── checkpoint_123.5_reward.zip
└── ...
checkpoints/: Saved models during training.
validation/: Best validation checkpoints.
envs/: Pickled environment scenarios.
YAML files: Used configs and seeds for reproducibility.
Monitoring with TensorBoard
To visualize training and validation logs:
tensorboard --logdir LOGS_FOLDER_NAME
Then open the browser at the provided URL. Tracked metrics include:
Exploited vulnerabilities (local/remote)
Action selection stats
Nodes controlled/discovered/disrupted
Episode reward and length
Action success rates
…
Batch Sampling
To evaluate performance across all combinations of goals and NLP extractors you can use the following command:
python3 sample_agent.py ...
The option and configuration files are the same of train_agent.py script, without the --goal and --nlp_extractor parameters.
This produces separate log folders for each combination:
LOGS_FOLDER/
├── ALGO_1_control_bert/
├── ALGO_2_discovery_roberta/
└── ...
The configuration and command-line options are identical to the main training script.
Hyperparameter Optimization
Hyperparameters can be optimized using Optuna with the following command:
python3 cyberbattle/agent/hyperopt_agent.py ... \
--num_trials NUM_TRIALS \
--optimization_type {grid, random, tpe, cmaes, ...}
New parameters:
--num_trials: Number of optimization trials.--optimization_type: Search strategy (e.g.,tpe,random).
Hyperparameter search spaces are defined in cyberbattle/agent/config/hyperparams_ranges.yaml.
The default optimization goal is to maximize average validation reward, sampling across goals and NLP extractors.
Visualization with Optuna Dashboard
To visualize the hyperparameter optimization process you can use the following command:
optuna dashboard --storage sqlite:///cyberbattle/agent/logs/LOGS_FOLDER/ALGO_hyperopt.db
Access the dashboard via the provided URL to track trial performance, parameter importances, and convergence.
Bibliography
Leveraging Deep Reinforcement Learning for Cyber-Attack Paths Prediction. Franco Terranova, Abdelkader Lahmadi, and Isabelle Chrisment. 2024. Leveraging Deep Reinforcement Learning for Cyber-Attack Paths Prediction: Formulation, Generalization, and Evaluation. In Proceedings of the 27th International Symposium on Research in Attacks, Intrusions and Defenses (RAID ‘24). Association for Computing Machinery.