Threat Models and Reward Function

The agent’s reward function—and consequently the actions and outcomes it learns to prioritize—depends on the threat model or goal under which it is trained. Each threat model defines specific objectives and termination criteria for the episode. The table below summarizes the different goals supported by the framework:

Threat Models / Agent Goals

Goal

Description

Prioritized Outcomes

Episode End Condition

Control

Gain control over as many nodes as possible in the network.

CredentialAccess, LateralMove, PrivilegeEscalation

Terminates when all controllable nodes reachable from the starter node are owned with ROOT privilege.

Disruption

Disable the largest possible portion of the network.

DenialOfService

Terminates when all stoppable nodes reachable from the starter node are disrupted.

Discovery

Achieve full visibility over the topology, collect all data, and exfiltrate it.

Reconnaissance, Discovery, Collection, Exfiltration

Terminates when all nodes reachable from the starter node are discovered, made visible, and their data collected and exfiltrated.

Reward Function Formulation

The agent’s learning is guided by a reward function that combines a weighted sum of bonuses and penalties:

\[R(o, a) = \sum_j w^B_j \cdot B_j(o, a) - \sum_k w^P_k \cdot P_k(o, a)\]

Where:

  • \(o\): Agent’s current observation

  • \(a\): Selected action

  • \(B_j(o, a)\): Bonus terms rewarding desirable behavior (e.g., owning nodes, collecting data)

  • \(P_k(o, a)\): Penalty terms for undesirable outcomes (e.g., hitting firewalls, unnecessary actions)

  • \(w^B_j, w^P_k\): Weights determining the importance of each term

This formulation allows the agent to learn trade-offs appropriate for its goal.

Bonus and Penalty Terms

Bonus terms include both fixed rewards and scaled coefficients tied to metrics such as the number of owned, disrupted, or discovered nodes. Penalty terms account for various costs and limitations, such as:

  • Exploitation cost based on CVSS Exploitability Score

  • Distance in embedding space (to improve latent space structure)

  • Actions blocked by firewalls

  • Attempting invalid actions (e.g., scanning already scanned nodes)

The reward structure thus supports goal-aligned behavior through nuanced and flexible design. The following table provides an overview of the types of coefficients, constant rewards, and penalty terms currently implemented:

Reward Function Components

Weighted Coefficients

Constant Rewards

Penalty Terms

Node Ownership Value Coefficient

Data Collected Reward

Exploitation Cost

Denial of Service Coefficient

Privilege Escalation Reward

No Enough Privileges

Node Discovery Coefficient

Visibility Acquired Reward

Failed Success Rate

Data Exfiltration Reward

No Data to Collect

Defense Evasion Reward

No Data to Discover

Winning Episode Reward

No Data to Exfiltrate

Already Persistent

Machine Already Stopped

Node Already Owned

Node Already Visible

Already Defense Evasion

Scanning Unopen Port

Privilege Escalation in Node Not Owned

Privilege Escalation to Level Already Had

Blocked by Local Firewall

Blocked by Remote Firewall

Distance Penalty

Losing Episode Penalty