Content Moderation & Safety - ISMS Copilot

ISMS Copilot uses automated content moderation to detect and prevent inappropriate or harmful content in chat messages. This process runs in the background to maintain a safe, compliant environment for all users while preserving your privacy and workflow speed.

Moderation runs asynchronously after you send a message — it adds zero latency to your chat experience.

How Moderation Works

When you send a chat message, ISMS Copilot saves it immediately and delivers your AI response without delay. In parallel, a content moderation check runs in the background:

  1. Message analyzed — Your message is sent to a moderation API (OpenAI by default, Mistral AI for Advanced Data Protection users)

  2. Categories checked — The API scans for policy violations including hate speech, harassment, violence, self-harm, and other harmful content

  3. Result recorded — The moderation result is stored in our audit logs with category scores and timestamps

  4. Admins alerted — If content is flagged, our team receives an automated alert for review

This process is fully automated and fire-and-forget — your chat continues without interruption.
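Conceptually, the flow can be sketched in a few lines of async Python. Everything below is a simplified, hypothetical model of the pipeline described above; the helper functions are stubs, not ISMS Copilot's actual code.

```python
import asyncio

async def save_message(user_id: str, text: str) -> None:
    ...  # persist the chat message (stub)

async def generate_ai_response(text: str) -> str:
    return "..."  # call the underlying model (stub)

async def check_moderation(text: str) -> dict:
    # Call OpenAI or Mistral depending on the user's settings (stub);
    # see the provider sketch in the next section.
    return {"flagged": False, "category_scores": {}}

async def record_result(user_id: str, result: dict) -> None:
    ...  # write category scores + timestamp to the audit log (stub)

async def notify_admins(user_id: str, result: dict) -> None:
    ...  # automated alert for human review (stub)

async def moderate_in_background(user_id: str, text: str) -> None:
    result = await check_moderation(text)
    await record_result(user_id, result)
    if result["flagged"]:
        await notify_admins(user_id, result)

async def handle_chat_message(user_id: str, text: str) -> str:
    await save_message(user_id, text)         # message saved immediately
    reply = await generate_ai_response(text)  # response delivered without delay

    # Fire-and-forget: schedule the moderation check without awaiting it,
    # so it adds zero latency to the chat round-trip.
    asyncio.create_task(moderate_in_background(user_id, text))
    return reply
```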

Moderation Providers

ISMS Copilot uses different moderation APIs based on your data protection settings:

  • OpenAI Moderation API — Default for all users. Checks for: sexual content, hate, harassment, violence, self-harm

  • Mistral AI Moderation API — Used when Advanced Data Protection is enabled. Checks for: sexual content, hate and discrimination, violence and threats, dangerous and criminal content, self-harm, health, financial, law, personally identifiable information (PII)

Because Mistral's additional categories (health, financial, law, and PII) can occasionally flag legitimate ISMS compliance discussions, our team reviews every alert so that false positives don't result in action.
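To make the routing concrete, here is a hedged sketch in Python using the official OpenAI and Mistral SDKs. The model names and response fields follow the SDKs' public documentation, but the client setup and surrounding logic are assumptions, not ISMS Copilot's implementation.

```python
from openai import OpenAI      # pip install openai
from mistralai import Mistral  # pip install mistralai

def is_flagged(text: str, advanced_data_protection: bool) -> bool:
    """Return True if the moderation provider flags the message."""
    if advanced_data_protection:
        # Mistral's endpoint adds health, financial, law, and PII categories.
        client = Mistral(api_key="...")  # placeholder key
        resp = client.classifiers.moderate(
            model="mistral-moderation-latest", inputs=[text]
        )
        # Each result maps category names to booleans.
        return any(resp.results[0].categories.values())

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.moderations.create(
        model="omni-moderation-latest", input=text
    )
    return resp.results[0].flagged
```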

Advanced Data Protection and Moderation

If you've enabled Advanced Data Protection, your chat messages are normally not stored on our servers or sent to third-party AI providers. However, content moderation creates one exception:

  • Clean messages — Message content NOT stored; only metadata and moderation scores retained for 30 days

  • Flagged messages — Full content always stored for 1 year and included in admin alerts, regardless of ADP setting

Safety override: Flagged content is always stored and shared with our team, even with Advanced Data Protection enabled. This is necessary for legal compliance, abuse prevention, and maintaining platform safety for all users.

This override rests on GDPR Article 6(1)(f): preventing harm and enforcing our Acceptable Use Policy is a legitimate interest that outweighs individual data protection preferences in flagged cases.
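In pseudocode terms, the storage decision looks something like the sketch below. The record shape and field names are illustrative; the key point is that the flagged branch deliberately ignores the ADP setting, as described above.

```python
from dataclasses import dataclass

CLEAN_RETENTION_DAYS = 30     # metadata + scores only
FLAGGED_RETENTION_DAYS = 365  # full content, for audit and legal compliance

@dataclass
class ModerationRecord:
    user_id: str
    flagged: bool
    category_scores: dict
    content: str | None   # None when message content is not stored
    retention_days: int

def build_record(user_id: str, text: str, flagged: bool,
                 scores: dict, adp_enabled: bool) -> ModerationRecord:
    if flagged:
        # Safety override: full content is stored regardless of adp_enabled.
        return ModerationRecord(user_id, True, scores, text,
                                FLAGGED_RETENTION_DAYS)
    # Clean messages: content is never stored in the moderation log;
    # only metadata and scores are kept for 30 days.
    return ModerationRecord(user_id, False, scores, None,
                            CLEAN_RETENTION_DAYS)
```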

Data Retention

Moderation events are retained according to the following schedule:

  • Non-flagged events — Metadata and moderation scores retained for 30 days; message content NOT stored

  • Flagged events — Full message content and metadata retained for 1 year for audit and legal compliance purposes

Flagged message content may be retained longer if required for ongoing investigations, legal proceedings, or regulatory obligations.
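A scheduled cleanup job along these lines could enforce that schedule. The table name, columns, and legal_hold flag are hypothetical; the sketch just shows the two retention windows and the carve-out for ongoing investigations.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def purge_expired_events(db: sqlite3.Connection) -> None:
    """Delete moderation events past their retention window
    (illustrative schema, not ISMS Copilot's actual database)."""
    now = datetime.now(timezone.utc)
    db.execute(
        # Non-flagged events: metadata expires after 30 days.
        "DELETE FROM moderation_events "
        "WHERE flagged = 0 AND created_at < ?",
        ((now - timedelta(days=30)).isoformat(),),
    )
    db.execute(
        # Flagged events: kept for 1 year, unless a legal hold
        # (investigation, litigation, regulatory duty) extends that.
        "DELETE FROM moderation_events "
        "WHERE flagged = 1 AND legal_hold = 0 AND created_at < ?",
        ((now - timedelta(days=365)).isoformat(),),
    )
    db.commit()
```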

What Happens When Content Is Flagged

When the moderation API flags your message as potentially violating our policies:

  1. Alert sent — Our admin team receives a webhook notification with the flagged categories, timestamp, and message preview

  2. Human review — A team member reviews the message and context to confirm whether it violates our Acceptable Use Policy

  3. Action (if confirmed) — We may contact you, issue a warning, suspend features, or terminate your account depending on severity and repeat violations

  4. False positives — If the flag was incorrect (e.g., legitimate compliance discussion), no action is taken

Rate limiting: You can only trigger one moderation alert per hour. Subsequent flagged messages within that window are logged but don't generate duplicate alerts.
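Here is a minimal sketch of that rate limit, keyed by user in process memory (a real deployment would more likely use a shared store such as Redis; the webhook URL and payload shape below are placeholders):

```python
import time
import requests  # assumed HTTP client for the webhook call

ALERT_WEBHOOK_URL = "https://example.invalid/moderation-alerts"  # placeholder
ALERT_WINDOW_SECONDS = 3600  # at most one alert per user per hour

_last_alert_at: dict[str, float] = {}

def maybe_alert_admins(user_id: str, categories: list[str], preview: str) -> None:
    """Send a webhook alert unless this user already triggered one within
    the last hour; later flags are still logged, just not re-alerted."""
    now = time.monotonic()
    last = _last_alert_at.get(user_id)
    if last is not None and now - last < ALERT_WINDOW_SECONDS:
        return  # suppress duplicate alert inside the window
    _last_alert_at[user_id] = now
    requests.post(ALERT_WEBHOOK_URL, json={
        "user_id": user_id,
        "categories": categories,
        "preview": preview[:200],  # message preview, not the full content
        "timestamp": time.time(),
    }, timeout=10)
```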

Privacy and Transparency

We're committed to transparency about our moderation practices:

  • No silent censorship — We don't block or filter your messages in real time. Moderation is for safety enforcement, not content control

  • Third-party processors — OpenAI (US-based) and Mistral AI (France-based) act as sub-processors for moderation only. See our Register of Processing Activities for details

  • Full disclosure — This policy and our Privacy Policy document all moderation data flows and legal bases

Content moderation is based on:

  • Legitimate interest (GDPR Art. 6(1)(f)) — Preventing abuse, enforcing our terms, and maintaining platform safety

  • Contractual necessity (GDPR Art. 6(1)(b)) — Enforcing our Terms of Service and Acceptable Use Policy

  • Legal obligation (GDPR Art. 6(1)(c)) — Complying with applicable laws requiring removal or reporting of illegal content

Your Rights

Under GDPR, you have rights regarding your moderation data:

  • Access — Request copies of moderation events associated with your account

  • Rectification — Request correction of inaccurate moderation records

  • Erasure — Request deletion of non-flagged moderation data (flagged data may be retained for legal compliance)

  • Object — Object to moderation processing, though we may continue if we have compelling legitimate grounds (safety, legal obligations)

To exercise your rights or ask questions about moderation, contact us at [email protected].

Questions?

For more information about our privacy and safety practices, see our Privacy Policy and our Register of Processing Activities, or contact us at [email protected].
