Threat Models and Reward Function

The agent’s reward function—and consequently the actions and outcomes it learns to prioritize—depends on the threat model or goal under which it is trained. Each threat model defines specific objectives and termination criteria for the episode. The table below summarizes the different goals supported by the framework:

**Threat Models / Agent Goals**
| Goal | Description | Prioritized Outcomes | Episode End Condition |
| --- | --- | --- | --- |
| Control | Gain control over as many nodes as possible in the network. | `CredentialAccess`, `LateralMove`, `PrivilegeEscalation` | Terminates when all controllable nodes reachable from the starter node are owned with ROOT privilege. |
| Disruption | Disable the largest possible portion of the network. | `DenialOfService` | Terminates when all stoppable nodes reachable from the starter node are disrupted. |
| Discovery | Achieve full visibility over the topology, collect all data, and exfiltrate it. | `Reconnaissance`, `Discovery`, `Collection`, `Exfiltration` | Terminates when all nodes reachable from the starter node are discovered, made visible, and their data collected and exfiltrated. |
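
The goal table can be read as a mapping from each goal to its prioritized outcomes plus a termination predicate over the nodes reachable from the starter node. The following is a minimal sketch of that reading, not the framework's implementation: `Goal`, `NodeState`, `PRIORITIZED_OUTCOMES`, and `episode_done` are hypothetical names, and the per-node flags are assumptions about what state would need to be tracked.

```python
from dataclasses import dataclass
from enum import Enum


class Goal(Enum):
    CONTROL = "control"
    DISRUPTION = "disruption"
    DISCOVERY = "discovery"


# Outcome names come from the goal table; the dict layout is illustrative.
PRIORITIZED_OUTCOMES = {
    Goal.CONTROL: ("CredentialAccess", "LateralMove", "PrivilegeEscalation"),
    Goal.DISRUPTION: ("DenialOfService",),
    Goal.DISCOVERY: ("Reconnaissance", "Discovery", "Collection", "Exfiltration"),
}


@dataclass
class NodeState:
    # Hypothetical per-node flags; the framework's real state may differ.
    controllable: bool = True
    stoppable: bool = True
    owned_as_root: bool = False
    disrupted: bool = False
    discovered: bool = False
    visible: bool = False
    data_collected: bool = False
    data_exfiltrated: bool = False


def episode_done(goal, reachable_nodes):
    """Return True when the goal's end condition holds for every relevant node."""
    if goal is Goal.CONTROL:
        # Only controllable nodes count toward the Control goal.
        return all(n.owned_as_root for n in reachable_nodes if n.controllable)
    if goal is Goal.DISRUPTION:
        # Only stoppable nodes count toward the Disruption goal.
        return all(n.disrupted for n in reachable_nodes if n.stoppable)
    if goal is Goal.DISCOVERY:
        return all(
            n.discovered and n.visible and n.data_collected and n.data_exfiltrated
            for n in reachable_nodes
        )
    raise ValueError(f"unknown goal: {goal}")
```

Note that in this sketch a goal with no relevant nodes (for example, no stoppable nodes under Disruption) is treated as already satisfied, because `all()` over an empty sequence is `True`. Whether the framework behaves the same way is not stated in the source.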

Reward Function Formulation

The agent's learning is guided by a reward function computed as a weighted sum of bonus terms minus a weighted sum of penalty terms:

\[R(o, a) = \sum_j w^B_j \cdot B_j(o, a) - \sum_k w^P_k \cdot P_k(o, a)\]

Where:

\(o\): Agent’s current observation
\(a\): Selected action
\(B_j(o, a)\): Bonus terms rewarding desirable behavior (e.g., owning nodes, collecting data)
\(P_k(o, a)\): Penalty terms for undesirable outcomes (e.g., hitting firewalls, unnecessary actions)
\(w^B_j, w^P_k\): Weights determining the importance of each term

Adjusting the weights changes which behaviors dominate the reward, so the same formulation can encode the trade-offs appropriate to each goal.
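
To make the formula concrete, here is a minimal sketch of how \(R(o, a)\) could be evaluated. It assumes each bonus or penalty term is a plain function of the observation and action, and that weights are looked up by term name. The function name `reward`, the dictionary layout, and the example terms and weights are all hypothetical, not the framework's API.

```python
def reward(obs, action, bonuses, penalties, w_bonus, w_penalty):
    """Weighted sum of bonus terms minus weighted sum of penalty terms.

    bonuses / penalties map a term name to a function f(obs, action) -> float.
    w_bonus / w_penalty map the same names to their weights.
    """
    bonus_total = sum(
        w_bonus[name] * term(obs, action) for name, term in bonuses.items()
    )
    penalty_total = sum(
        w_penalty[name] * term(obs, action) for name, term in penalties.items()
    )
    return bonus_total - penalty_total


# Illustrative terms and weights (placeholders, not framework values).
obs = {"newly_owned": 2, "blocked_by_firewall": True}
bonuses = {"ownership": lambda o, a: float(o["newly_owned"])}
penalties = {"firewall": lambda o, a: 1.0 if o["blocked_by_firewall"] else 0.0}
w_bonus = {"ownership": 10.0}
w_penalty = {"firewall": 3.0}

r = reward(obs, "lateral_move", bonuses, penalties, w_bonus, w_penalty)
print(r)  # 10.0 * 2 - 3.0 * 1 = 17.0
```

Keeping weights separate from the terms means a different goal can reuse the same term functions with a different weight table.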

Bonus and Penalty Terms

Bonus terms include both fixed rewards and scaled coefficients tied to metrics such as the number of owned, disrupted, or discovered nodes. Penalty terms account for various costs and limitations, such as:

Exploitation cost based on CVSS Exploitability Score
Distance in embedding space (to improve latent space structure)
Actions blocked by firewalls
Attempting invalid actions (e.g., scanning already scanned nodes)
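
The difference between a scaled coefficient, a fixed reward, and a penalty can be sketched as follows. The field names echo entries from the Reward Function Components table, but the `RewardConfig` class, the `step_terms` function, the outcome flags, and every numeric value are assumptions made for illustration. The exploitation-cost and distance penalties are omitted because the source does not specify how they are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # All values are placeholders chosen for illustration only.
    node_ownership_coefficient: float = 5.0   # scaled: multiplied by a metric
    data_collected_reward: float = 2.0        # fixed: granted once per event
    node_already_owned_penalty: float = 1.0   # invalid-action penalty
    blocked_by_firewall_penalty: float = 1.5  # action stopped by a firewall


def step_terms(cfg, newly_owned, data_collected, target_already_owned, blocked):
    """Return (bonus, penalty) for one step from simple outcome flags."""
    # Scaled bonus grows with the metric (here: nodes owned this step).
    bonus = cfg.node_ownership_coefficient * newly_owned
    # Fixed bonus is added once when the event occurs.
    if data_collected:
        bonus += cfg.data_collected_reward

    penalty = 0.0
    if target_already_owned:
        penalty += cfg.node_already_owned_penalty
    if blocked:
        penalty += cfg.blocked_by_firewall_penalty
    return bonus, penalty


bonus, penalty = step_terms(
    RewardConfig(),
    newly_owned=2,
    data_collected=True,
    target_already_owned=False,
    blocked=True,
)
print(bonus, penalty)  # 12.0 1.5
```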

Together, these terms let the reward be tuned to each goal. The following table lists the weighted coefficients, constant rewards, and penalty terms currently implemented:

**Reward Function Components**
| Weighted Coefficients | Constant Rewards | Penalty Terms |
| --- | --- | --- |
| Node Ownership Value Coefficient | Data Collected Reward | Exploitation Cost |
| Denial of Service Coefficient | Privilege Escalation Reward | No Enough Privileges |
| Node Discovery Coefficient | Visibility Acquired Reward | Failed Success Rate |
| | Data Exfiltration Reward | No Data to Collect |
| | Defense Evasion Reward | No Data to Discover |
| | Winning Episode Reward | No Data to Exfiltrate |
| | | Already Persistent |
| | | Machine Already Stopped |
| | | Node Already Owned |
| | | Node Already Visible |
| | | Already Defense Evasion |
| | | Scanning Unopen Port |
| | | Privilege Escalation in Node Not Owned |
| | | Privilege Escalation to Level Already Had |
| | | Blocked by Local Firewall |
| | | Blocked by Remote Firewall |
| | | Distance Penalty |
| | | Losing Episode Penalty |