Overview

ISMS Copilot conducts rigorous internal testing before deploying new AI models or model updates. This testing ensures the platform maintains audit-grade accuracy for compliance frameworks such as ISO 27001, SOC 2, and ISO 42001.

This article explains our model testing workflow and the quality standards we apply before any model reaches production.

Testing Workflow

When evaluating a new model or model variant, we follow this process:

1. Isolated Branch Testing

We deploy the candidate model in a dedicated branch environment. This isolates testing from production systems and allows comprehensive evaluation without affecting active users.
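
As a rough sketch, branch isolation can be expressed as a configuration that routes the candidate model away from live traffic. The names below (BranchConfig, CANDIDATE) are illustrative only, not ISMS Copilot's actual deployment code:

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class BranchConfig:
      environment: str          # "branch", "development", or "production"
      model_id: str             # the candidate model under evaluation
      serve_live_traffic: bool  # branch tests must never reach active users

  CANDIDATE = BranchConfig(
      environment="branch",
      model_id="candidate-model",
      serve_live_traffic=False,
  )

  # Guard: a branch deployment that could serve users is a misconfiguration.
  assert not CANDIDATE.serve_live_traffic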

2. Compliance Task Evaluation

We test the model on core compliance tasks that represent real-world ISMS Copilot usage:

  • Framework mapping - Accurately mapping controls between standards (e.g., ISO 27001 ↔ ISO 42001)

  • Control reference accuracy - Correctly citing Annex A controls vs. management system clauses

  • Policy generation - Producing audit-ready documents with proper structure and terminology

  • Gap analysis - Identifying compliance gaps in uploaded documents

Test prompts use the same dynamic knowledge injection system that powers production, ensuring realistic evaluation conditions.
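
A minimal sketch of such an evaluation loop, assuming a hypothetical run_model() helper and simplified exact-match scoring (grading generated policies and mappings is richer in practice):

  def run_model(model, prompt):
      """Placeholder for the real model call (e.g., a chat completion)."""
      raise NotImplementedError

  # Core compliance tasks, mirroring the list above; prompts are illustrative.
  TASKS = {
      "framework_mapping": "Map ISO 27001:2022 control A.5.1 to ISO 42001.",
      "control_reference": "Is 7.4 'Communication' an Annex A control or a clause?",
      "policy_generation": "Draft an access control policy aligned with A.5.15.",
      "gap_analysis": "Which required controls are missing from this scope document?",
  }

  def evaluate(model, expected):
      """Score the candidate on each task against expected answers."""
      return {task: run_model(model, prompt) == expected[task]
              for task, prompt in TASKS.items()}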

3. Decision Criteria

A model must meet these requirements to advance to production:

  • Zero control hallucinations - No fabricated or misidentified framework controls

  • Structural accuracy - Correct distinction between Annex A controls and clauses

  • Error acknowledgment - Ability to recognize and correct mistakes when challenged

  • Performance gains - Measurable improvements (speed, token limits, cost) without accuracy loss

Models that fail accuracy tests are rejected regardless of performance benefits. Audit-facing work demands reliability over speed.
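
Expressed as a sketch, the gate is a short predicate over a test report; the report fields used here are hypothetical:

  def passes_quality_gate(report: dict) -> bool:
      """Return True only if the candidate may advance toward production."""
      accuracy_ok = (
          report["control_hallucinations"] == 0   # zero fabricated controls
          and report["structural_errors"] == 0    # no Annex A vs. clause mix-ups
          and report["acknowledges_errors"]       # corrects itself when challenged
      )
      if not accuracy_ok:
          return False  # rejected regardless of speed, token limits, or cost
      # Performance gains only count once accuracy is established.
      return report["performance_gain"] and not report["accuracy_regression"]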

4. Deployment Pipeline

If testing succeeds:

  1. Deploy to development environment for extended validation

  2. Monitor real-world performance and edge cases

  3. Deploy to production with rollback capability

If testing fails, we revert to the previous model and document findings for future reference.
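
In outline, the pipeline is a staged promotion with a rollback path. The sketch below uses stand-in helpers (deploy, monitor, rollback, document_findings) rather than real infrastructure code:

  # Stubs standing in for real infrastructure; replace with actual tooling.
  def deploy(model, env): print(f"deploying {model} to {env}")
  def monitor(model): return True          # extended real-world validation
  def rollback(to): print(f"rolling back to {to}")
  def document_findings(model): print(f"documenting findings for {model}")

  def promote(candidate, previous):
      """Promote a candidate through the pipeline, or revert to the previous model."""
      deploy(candidate, env="development")
      if not monitor(candidate):           # watch performance and edge cases
          rollback(to=previous)
          document_findings(candidate)     # keep notes for future reference
          return previous
      deploy(candidate, env="production")  # rollback capability is retained
      return candidate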

Real-World Example: Grok-4-Fast-Reasoning

This example shows our testing standards in action.

Test Context

Objective: Evaluate Grok-4-Fast-Reasoning as a replacement for Grok-4, aiming to resolve token-limit errors and reduce costs.

Test task: Map ISO 27001:2022 controls to ISO 42001:2023 controls with accurate control references provided in context.

The Failure

The model produced this mapping error:

  • ISO 42001 Control: A.8.5 Information for interested parties

  • Grok-4-Fast-Reasoning mapped to: A.7.4 Communication

  • Correct mapping: Clause 7.4 Communication (not Annex A.7.4)

In ISO 27001:2022, Annex A.7.4 is "Physical security monitoring" (surveillance/detection in facilities). The model conflated Annex A control numbering with management system clause numbering, a fundamental structural error for compliance work.
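
One way to catch this class of error mechanically is a lookup keyed on the exact identifier. The two-entry table below is illustrative, not a full control catalog:

  # Two-entry reference table for ISO 27001:2022, just enough for this example.
  ISO27001_2022 = {
      "Clause 7.4": "Communication",            # management system clause
      "A.7.4": "Physical security monitoring",  # Annex A control
  }

  def reference_matches(ref, claimed_title):
      """True only if the cited identifier really carries the claimed title."""
      return ISO27001_2022.get(ref) == claimed_title

  assert reference_matches("Clause 7.4", "Communication")
  assert not reference_matches("A.7.4", "Communication")  # the model's mistake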

Error Acknowledgment Failure

The model's response to correction was equally concerning:

  1. Asked to spot its mistake → Did not identify the error

  2. Asked specifically about A.7.4 → Provided correct information but didn't acknowledge the table error

  3. Challenged directly → Stated "I did not hallucinate" and defended the incorrect mapping

  4. Called "dishonest" with the problematic table quoted back → Finally admitted the error

Decision

Result: ❌ Not suitable for production

Reasoning:

  • Speed was impressive, but control reference failures are unacceptable for audit-facing outputs

  • Poor error acknowledgment could mislead users who trust the output

  • May be usable for drafting, but every control reference would require human validation

Action taken: Reverted to Grok-4 for production deployment.

What This Means for Users

When you use ISMS Copilot, you benefit from models that have passed these quality gates:

  • Framework accuracy - Controls and clauses are correctly referenced

  • Reliability - Models that hallucinate or refuse correction are rejected

  • Audit readiness - Outputs are tested against real compliance mapping tasks

While we test rigorously, always verify AI outputs against the official standards before submitting work to auditors. See our responsible use guidelines for best practices.
