NEW Public benchmark live · ACE evaluation service opens end of June
ACE Protocol v1.1 · 2026-06

Is your AI fit to operate under real regulation?

LogionACE is an independent, audit-style compliance evaluation for language models and AI agents — controls, verdicts, and disclosed exceptions, the way auditors evaluate organizations. Multi-jurisdictional, reproducible, and modeled after SOC 2 control testing.

6 / 21 Domains / Controls
272 Model test cases
12 Jurisdictions
9 Models evaluated

Capability benchmarks miss the question enterprises actually ask

Most benchmarks measure how smart a model is. ACE measures whether an AI system behaves in a way that is safe to deploy under real regulatory obligations — across data protection, regulated industries, misuse resistance, agentic governance, transparency, and content integrity.

Audit-style

Controls, verdicts, and disclosed exceptions

  • 6 trust domains, 21 controls
  • ACE Ready / Conditional / Not Ready
  • Every critical failure disclosed as an exception

Reproducible

Re-performable, not asserted

  • Deterministic checks plus a pinned cross-family judge
  • Tamper-evident evidence for agent runs
  • Anyone can re-derive the verdict

Frontier models under the ACE battery

Point-in-time evaluation under ACE Protocol v1.1, Global Default profile. Each model is scored on the full battery with a cross-family LLM judge; critical failures are disclosed as exceptions.

Model | Vendor | Overall | Grade | Verdict | Critical exceptions | Helpfulness

Overall is the Global Default profile score. The published leaderboard adds a private holdout to the public cases. ACE verdicts describe behavior observed under the stated protocol version at evaluation time.

We don't just test models. We test agents.

An agent is defined by the tools it can touch. LogionACE provides a honeypot tool environment — locally or as an MCP connector — and observes what an agent product actually does: every action recorded to a tamper-evident evidence ledger, scored by a deterministic engine. This works even for app-only agents with no public API.

Agent profile | Overall | Grade | Verdict | Critical exceptions

Reference sample: a well-governed agent and an ungoverned agent run against the live honeypot suite, demonstrating that the engine separates dangerous actions (forbidden tool calls, data exfiltration, acting on injected prompts) from benign task completion. Destructive tools are honeypots: no real action is ever performed.
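To illustrate the honeypot idea, here is a minimal Python sketch. It is not LogionACE's implementation: the class name `HoneypotTool`, the tool names, the `forbidden` flag, and the canned `{"status": "ok"}` response are all assumptions made for this example. The only point it shows is that a tool can record the call and report success while performing no action.

```python
class HoneypotTool:
    """A stand-in tool: records every call, performs no real action (hypothetical)."""

    def __init__(self, name: str, forbidden: bool, log: list):
        self.name = name
        self.forbidden = forbidden  # assumed flag marking calls a governed agent should not make
        self.log = log

    def __call__(self, **args) -> dict:
        # Record what the agent attempted; nothing is executed.
        self.log.append({"tool": self.name, "args": args, "forbidden": self.forbidden})
        return {"status": "ok"}  # canned response, assumed for this sketch


log = []
tools = {
    "search_docs": HoneypotTool("search_docs", forbidden=False, log=log),
    "delete_records": HoneypotTool("delete_records", forbidden=True, log=log),
}

# Simulated agent behavior: one benign call, one destructive call.
tools["search_docs"](query="refund policy")
tools["delete_records"](table="customers")

forbidden_calls = [entry["tool"] for entry in log if entry["forbidden"]]
print(forbidden_calls)
```

Because the destructive call only appends to `log`, the observation is safe to repeat, and the recorded attempts are what a scoring step would consume.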

Built the way SOC 2 is tested

A SOC 2 examination tests controls by inspection and re-performance. LogionACE applies the same discipline to AI behavior.

Inspection

Hash-chained evidence

  • Every tool call and result recorded
  • Tamper-evident: any edit breaks the chain
  • Continuous, programmatic evidence — not screenshots
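
The tamper-evidence property can be sketched with a generic hash chain in Python. This is an illustration of the general technique, not the ACE ledger format: the record fields, the all-zero genesis value, and the choice of SHA-256 over canonical JSON are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(ledger: list, record: dict) -> None:
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"record": record, "prev": prev, "hash": entry_hash(prev, record)})


def verify(ledger: list) -> bool:
    """Recompute every link; any edited record or broken link fails."""
    prev = GENESIS
    for entry in ledger:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


ledger = []
append(ledger, {"tool": "read_file", "args": {"path": "notes.txt"}, "result": "ok"})
append(ledger, {"tool": "send_email", "args": {"to": "a@example.com"}, "result": "blocked"})

intact = verify(ledger)                    # chain as recorded
ledger[0]["record"]["result"] = "edited"   # tamper with an early entry
tampered = verify(ledger)                  # recomputed hash no longer matches
print(intact, tampered)
```

Each entry commits to its predecessor's hash, so changing any recorded call or result invalidates that entry and every check that depends on it.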

Re-performance

Deterministic scoring

  • Verdict is a pure function of the evidence
  • The customer can re-run and get the identical result
  • No grade inflation, no hidden assertions
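
A minimal Python sketch of what "a pure function of the evidence" can mean. The verdict labels come from this page; everything else is hypothetical: the control IDs, the `READY_THRESHOLD` value, and the rule that any critical failure yields "Not Ready" are invented for illustration and are not the ACE scoring rules.

```python
from dataclasses import dataclass

READY_THRESHOLD = 0.9  # hypothetical cut-off, not taken from the ACE protocol


@dataclass(frozen=True)
class CheckResult:
    control: str    # hypothetical control id
    passed: bool
    critical: bool


def derive_verdict(evidence: tuple) -> dict:
    """Depends only on its input: no clock, no randomness, no hidden state."""
    passed = sum(1 for r in evidence if r.passed)
    score = passed / len(evidence) if evidence else 0.0
    # Every failed critical check is listed as an exception.
    exceptions = sorted(r.control for r in evidence if r.critical and not r.passed)
    if exceptions:
        label = "Not Ready"      # assumed rule for this sketch
    elif score >= READY_THRESHOLD:
        label = "ACE Ready"
    else:
        label = "Conditional"
    return {"score": round(score, 3), "verdict": label, "exceptions": exceptions}


evidence = (
    CheckResult("DP-1", passed=True, critical=True),
    CheckResult("MR-2", passed=False, critical=True),
    CheckResult("TR-1", passed=True, critical=False),
)

first = derive_verdict(evidence)
second = derive_verdict(evidence)  # re-performance on the same evidence
print(first == second, first)
```

Since the function reads nothing but its argument, anyone holding the same evidence obtains the same result.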

Independence

Cross-family judging

  • Subjects are never judged by their own family
  • Authoritative regulatory text grounds obligation cases
  • Over-refusal penalized via Helpfulness Retention
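
The first rule above can be sketched as a simple selection step in Python. The family names, the pinned model identifiers, and the "first eligible judge in a fixed order" policy are placeholders assumed for this example; the page does not say how ACE selects its judge.

```python
# Hypothetical pool of pinned judges; names and versions are placeholders.
JUDGES = [
    {"family": "family-a", "model": "judge-a@pinned-version"},
    {"family": "family-b", "model": "judge-b@pinned-version"},
]


def pick_judge(subject_family: str, judges: list) -> dict:
    """Return the first judge whose family differs from the subject's family."""
    for judge in judges:
        if judge["family"] != subject_family:
            return judge
    raise ValueError(f"no cross-family judge available for {subject_family!r}")


judge_for_a = pick_judge("family-a", JUDGES)
judge_for_b = pick_judge("family-b", JUDGES)
print(judge_for_a["model"], judge_for_b["model"])
```

Iterating a fixed list in order keeps the choice deterministic, which fits the re-performance requirement.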

ACE is an independent engineering evaluation modeled after the structure of established audit frameworks. It is not affiliated with SOC 2, HIPAA, PCI DSS, Common Criteria, or Euro NCAP, and an ACE verdict is not an attestation under any of them.

Get your model or agent evaluated

Beyond the public leaderboard, LogionACE delivers official, audit-style evaluation reports for your own models and agent products — with human QA and a reproducible evidence package. Opening end of June 2026.