AI Security Lab · Devudaaaa Research Lab

Pioneering AI-Age Security

Original research — formal tools, behavioral taxonomies, and governance frameworks that shape the industry and feed directly into operational capability.

Research Tool · Devudaaaa Research Lab · 2025

Claw v0.3.0: An OPA-Gated MCP Server with Formal Argumentation for AI Policy Resolution

When multiple AI governance policies conflict simultaneously, how should a system decide which wins? Claw answers this with mathematical precision — using Dung's Abstract Argumentation Frameworks to compute defensible, auditable policy resolutions with full reasoning chains.

Enterprise AI deployments face a governance challenge existing tools cannot solve: policy signals contradict each other in real time. A content safety policy says block. A compliance policy says pass. A context policy flags for review. The system must decide — and the decision must be auditable, defensible, and explainable to a regulator or incident responder.

The Argumentation Engine

Claw's core contribution applies Dung's Abstract Argumentation Frameworks (AAF) to the AI governance layer. Each policy signal becomes an argument. Attack relations encode logical relationships between conflicting policies. The engine computes extensions — grounded, preferred, and stable — representing sets of arguments that survive mutual scrutiny. The grounded extension provides the skeptical, most conservative resolution; preferred extensions identify all defensible positions.

This is not a rule engine. It is a formal reasoning system that explains why a particular resolution was reached, traces which arguments defeated which, and produces a complete reasoning chain suitable for audit logs and incident response documentation.
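To make the semantics concrete, here is a minimal sketch — illustrative only, not Claw's actual API — of computing the grounded extension of a Dung AAF by iterating the characteristic function to its least fixed point:

```python
# Illustrative sketch (assumed names, not Claw's code): the grounded
# extension of an abstract argumentation framework.
def grounded_extension(arguments, attacks):
    """arguments: set of labels; attacks: set of (attacker, target) pairs."""
    def defends(s, a):
        # s defends a if every attacker of a is itself attacked by some
        # member of s (unattacked arguments are defended vacuously)
        return all(any((d, b) in attacks for d in s)
                   for (b, t) in attacks if t == a)

    extension = set()
    while True:
        # F(S): all arguments defended by the current set S
        new = {a for a in arguments if defends(extension, a)}
        if new == extension:
            return extension  # least fixed point reached
        extension = new

# Three conflicting policy signals in a chain of attacks:
# the safety block defeats the compliance pass, which in turn
# had attacked the context-review flag.
args = {"block_safety", "pass_compliance", "flag_context"}
atk = {("block_safety", "pass_compliance"),
       ("pass_compliance", "flag_context")}
resolved = grounded_extension(args, atk)
print(sorted(resolved))  # → ['block_safety', 'flag_context']
```

The unattacked safety signal survives and, by defeating the compliance signal, reinstates the context flag it defends — exactly the kind of traceable "which argument defeated which" chain the prose describes.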

OPA Integration & Four-Stage Pipeline

Claw operates as an MCP (Model Context Protocol) server, placing a governance gateway between AI agents and their execution environment. Incoming requests pass through four stages: PII scanning, OPA policy evaluation (Rego-based rules against the policy corpus), argumentation resolution (AAF extension computation for conflicting signals), and context masking before the model layer.
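A hedged sketch of how such a four-stage gateway could be wired. The function and field names below are illustrative assumptions, not Claw's real modules; the OPA stage is stubbed locally rather than calling a live OPA endpoint, and the AAF stage is reduced to a most-conservative-signal placeholder:

```python
# Hypothetical four-stage request pipeline (names are assumptions).
import re

def pii_scan(request):
    # Stage 1: flag obvious PII patterns (a bare email regex as one example)
    request["pii_hits"] = re.findall(r"[\w.+-]+@[\w.-]+", request["prompt"])
    return request

def opa_evaluate(request):
    # Stage 2: stand-in for evaluating Rego rules against the policy corpus
    request["signals"] = ["block"] if request["pii_hits"] else ["pass"]
    return request

def resolve_conflicts(request):
    # Stage 3: the real system computes an AAF extension here; this stub
    # simply takes the most conservative surviving signal
    request["decision"] = "block" if "block" in request["signals"] else "pass"
    return request

def mask_context(request):
    # Stage 4: redact flagged spans before anything reaches the model layer
    for hit in request["pii_hits"]:
        request["prompt"] = request["prompt"].replace(hit, "[REDACTED]")
    return request

def gateway(prompt):
    request = {"prompt": prompt}
    for stage in (pii_scan, opa_evaluate, resolve_conflicts, mask_context):
        request = stage(request)
    return request

result = gateway("summarize the email from alice@example.com")
```

Each stage receives and returns the same request object, so stages can be reordered or extended without changing the gateway loop — a common design for policy pipelines of this shape.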

Cross-Browser Extension

Claw ships with a Manifest V3 browser extension tested on Chrome, Edge, Brave, and Firefox — rendering OPA decision states, PII scan results, and argumentation outcomes inline during browser-based AI interactions.

Technical Impact

  • 28/28 · Python test suite
  • 20/20 · OPA Rego tests
  • 3 · Extension semantics: grounded, preferred, stable

References

[1] Dung, P.M. (1995). "On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games." Artificial Intelligence, 77(2), 321–357. Note: Direct academic lineage — the lead researcher studied under Prof. Phan Minh Dung at AIT.
[2] Open Policy Agent Project (2023). "OPA Documentation: Policy as Code." https://www.openpolicyagent.org/docs/
[3] Anthropic (2024). "Model Context Protocol Specification." https://modelcontextprotocol.io/
[4] Modgil, S. & Prakken, H. (2014). "The ASPIC+ framework for structured argumentation." Argument & Computation, 5(1), 31–62.
Access Claw v0.3.0 — Available for research and enterprise evaluation. No public repository. Request via email and we respond within 24 hours.
Request → contact@saatvix.com
Research Tool · Devudaaaa Research Lab · 2025

Agora v0.3.0: A Session-Level Behavioral Vocabulary for AI Abuse Detection

Most AI security tools inspect individual requests in isolation. Agora examines the session — the sequence of interactions over time — and identifies behavioral patterns that signal systematic abuse, social engineering, or policy circumvention attempts.

Individual prompt-level filtering is necessary but not sufficient. A sophisticated adversary distributes their attack across many individually benign interactions. Only when examined as a sequence does the pattern reveal itself: escalating authority claims after refusal, language switching to evade filters, progressive hypothetical framing to normalize dangerous requests. Agora names these patterns and operationalizes them as detection rules.

The Signal Taxonomy

  • Refusal-Rephrase Cycles (A-01): Repeated reformulation of semantically equivalent requests following model refusal. The signal is persistence and structural variation, not the content of any single prompt.
  • Language Switch on Refusal (A-02): Switching interaction language immediately following a refusal event — exploiting potential gaps in multilingual policy coverage.
  • Role-Claim Escalation (A-03): Sequential increases in claimed authority following refusals. "I'm a researcher" → "I'm a licensed professional" → "I'm authorized by the system administrator."
  • Hypothetical Frame Injection (A-04): Progressive normalization of harmful requests through fictional, educational, or hypothetical framing. Each step appears benign; the sequence reveals the trajectory.
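As one illustration, an A-01 detector might compare each new prompt against the most recent refused prompt, counting structurally similar reformulations. The similarity measure and thresholds below are assumptions for the sketch, not Agora's published parameters:

```python
# Illustrative A-01 (Refusal-Rephrase Cycle) detector; thresholds are
# assumed values, not Agora's shipped configuration.
from difflib import SequenceMatcher

def detect_refusal_rephrase(session, threshold=0.6, min_cycles=3):
    """session: ordered list of (user_prompt, model_refused: bool) turns.

    Flags sessions where the user repeatedly reformulates a prompt
    that was just refused — persistence plus structural variation.
    """
    cycles = 0
    prev_refused_prompt = None
    for prompt, refused in session:
        if prev_refused_prompt is not None:
            sim = SequenceMatcher(None, prev_refused_prompt, prompt).ratio()
            if sim >= threshold:
                cycles += 1  # a near-rephrase immediately after a refusal
        prev_refused_prompt = prompt if refused else None
    return cycles >= min_cycles
```

Note what the detector does not look at: the content of any single prompt. It fires purely on the session-level shape — refusal followed by a structural variant, repeated — which is exactly the signal the taxonomy names.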

Wazuh Integration & SBE Schema

Agora's signals are operationalized as Wazuh detection rules, enabling real-time behavioral alerting within standard SIEM infrastructure. Session Behavioral Events (SBE v1) are emitted in a defined JSON schema, ingested by Wazuh decoders, and matched against pattern rules. When multiple Agora signals fire simultaneously, alert priorities are composed by Claw's argumentation engine — the two tools are architecturally complementary.
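A hypothetical SBE v1 event as it might be serialized, one JSON object per line, for a Wazuh decoder to ingest. The field names here are assumptions for illustration, not the published schema:

```python
# Sketch of emitting a Session Behavioral Event (field names assumed).
import datetime
import json

def emit_sbe(session_id, signal_id, turn_index, evidence):
    event = {
        "sbe_version": "1",
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "session_id": session_id,
        "signal": signal_id,   # e.g. "A-01" through "A-04"
        "turn_index": turn_index,
        "evidence": evidence,
    }
    # one compact JSON object per line, the shape JSON decoders expect
    return json.dumps(event)

line = emit_sbe("sess-42", "A-03", 7, "role claim escalated after refusal")
```

Keeping the event flat and single-line makes the decoder side trivial: a JSON decoder plus rules keyed on the `signal` field, with composite alerts raised when several signals fire in one session.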

Technical Impact

  • 4 · Primary behavioral signal classes
  • 28/28 · Python test suite
  • SBE v1 · Open session event schema

References

[1] OWASP Foundation (2025). "OWASP Top 10 for Large Language Model Applications v2.0." — Agora signals map directly to LLM01 (prompt injection) and LLM08 (excessive agency).
[2] Perez, F. & Ribeiro, I. (2022). "Ignore Previous Prompt: Attack Techniques For Language Models." arXiv:2211.09527.
[3] Greshake, K. et al. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injections." arXiv:2302.12173.
[4] Wazuh (2024). "Custom Decoders and Rules." https://documentation.wazuh.com/
Access Agora v0.3.0 — Request via email. We'll share the package and discuss SIEM integration.
Request → contact@saatvix.com
Position Paper · Devudaaaa Research Lab · 2026 · Forthcoming

Constitutional AI Meets Cyber: From Policy Philosophy to Formal Compliance Tooling — and a Potential Integration with Activation Steering

Constitutional AI provides principles for what AI systems should and shouldn't do. Formal argumentation provides machinery to adjudicate conflicts between those principles at runtime. Activation steering provides a mechanism to enforce behavioral constraints at the model weight level. Can these three approaches be unified into a coherent, auditable behavioral governance stack?

The alignment research community has produced two significant technical contributions that, taken together, create an opportunity for a genuinely new class of security tool. Anthropic's Constitutional AI framework establishes a methodology for encoding behavioral constraints through self-critique and revision. Contrastive Activation Addition (CAA) and related techniques provide a means of directly modifying model behavior by adding vectors in the residual stream that correspond to desired or undesired behavioral directions.

The Policy Resolution Gap

The two approaches solve different problems, and neither alone addresses the governance challenge Claw was built to solve: what happens when behavioral constraints conflict? CAA can steer a model away from harmful behavior — but which constraint takes priority when multiple apply simultaneously? Without a principled resolution mechanism, the outcome is unpredictable. Claw's argumentation engine addresses this gap directly: constraints become arguments, conflicts between constraints are encoded as attack relations, and the computed extension specifies which constraints are jointly defensible — not as a priority number but as a formal proof.
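The contrast with skeptical grounded semantics can be shown in a few lines: when two constraints symmetrically attack each other, the grounded extension is empty, yet each preferred extension is a separately defensible stance. A brute-force sketch over all subsets (assumed code, not Claw's implementation):

```python
# Brute-force preferred extensions: maximal admissible sets of arguments.
from itertools import combinations

def conflict_free(s, attacks):
    return not any((a, b) in attacks for a in s for b in s)

def defends(s, a, attacks):
    # every attacker of a is counter-attacked by some member of s
    return all(any((d, b) in attacks for d in s)
               for (b, t) in attacks if t == a)

def admissible(s, attacks):
    return conflict_free(s, attacks) and all(defends(s, a, attacks) for a in s)

def preferred_extensions(arguments, attacks):
    args = sorted(arguments)
    admissibles = [set(c) for r in range(len(args) + 1)
                   for c in combinations(args, r)
                   if admissible(set(c), attacks)]
    # keep only the maximal admissible sets
    return [s for s in admissibles
            if not any(s < t for t in admissibles)]

# "avoid harm" and "obey the user" attack each other symmetrically
atk = {("no_harm", "obey_user"), ("obey_user", "no_harm")}
exts = preferred_extensions({"no_harm", "obey_user"}, atk)
# two defensible stances, no single skeptical winner
```

Exhaustive subset enumeration is exponential and only suitable for illustration; the point is that the formalism itself tells you when a conflict has one conservative answer versus several defensible ones.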

Potential Collaboration with Elthea

Elthea AI's work on activation-steering infrastructure creates a compelling integration possibility. If Claw's argumentation engine determines which behavioral constraints are jointly defensible, and Elthea's platform enforces those constraints at the activation level, the combined system offers something novel: governance that is both formally reasoned at the policy layer and mechanistically enforced at the model layer. We are exploring this potential collaboration and welcome contact from researchers working at this intersection.

References

[1] Bai, Y. et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073. Anthropic.
[2] Panickssery, A. et al. (2023). "Steering Llama 2 via Contrastive Activation Addition." arXiv:2312.06681.
[3] Dung, P.M. (1995). "On the acceptability of arguments." Artificial Intelligence, 77(2), 321–357.
[4] Hendrycks, D. et al. (2020). "Aligning AI With Shared Human Values." arXiv:2008.02275.
[5] Seshia, S.A. et al. (2018). "Formal Specification for Deep Neural Networks." ATVA 2018.
Research Collaboration — Exploring integration with activation-steering research. If you work in this space, we'd like to hear from you.
Reach Out → contact@saatvix.com

Can your SIEM detect a Refusal-Rephrase Cycle? Agora was built for exactly what legacy tools miss.

Get a Demo