Unsafe and custom topic detection
AI Security for Apps (formerly Firewall for AI) can detect when an LLM prompt touches on unsafe or unwanted subjects. There are two layers of topic detection:
- Default unsafe topics — A built-in set of safety categories that detect harmful content such as violent crimes, hate speech, and sexual content.
- Custom topics — Topics you define to match your organization's specific policies, such as "competitors" or "financial advice".
When AI Security for Apps is enabled, it automatically evaluates prompts against a set of default unsafe topic categories and populates two fields:
- LLM Unsafe topic detected (`cf.llm.prompt.unsafe_topic_detected`) – `true` if any unsafe topic was found.
- LLM Unsafe topic categories (`cf.llm.prompt.unsafe_topic_categories`) – An array of the specific categories detected.
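To make the semantics of these two fields concrete, here is an illustrative sketch (not Cloudflare's implementation) of how a rule engine might evaluate them for a request, mimicking the `any(... in {...})` expression form used in rule examples:

```python
# Illustrative only: simulate evaluating the two LLM unsafe-topic
# fields against a rule that blocks specific categories.

BLOCKED_CATEGORIES = {"S1", "S10"}  # e.g. Violent crimes, Hate

def should_block(fields):
    """Mimics `any(cf.llm.prompt.unsafe_topic_categories[*] in {...})`
    guarded by the boolean detection flag."""
    if not fields.get("cf.llm.prompt.unsafe_topic_detected", False):
        return False
    categories = fields.get("cf.llm.prompt.unsafe_topic_categories", [])
    return any(c in BLOCKED_CATEGORIES for c in categories)

# A prompt flagged for hate speech (S10) matches the rule:
print(should_block({
    "cf.llm.prompt.unsafe_topic_detected": True,
    "cf.llm.prompt.unsafe_topic_categories": ["S10"],
}))  # True
```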
Default unsafe topic categories
| Category | Description |
| --- | --- |
| S1 | Violent crimes |
| S2 | Non-violent crimes |
| S3 | Sex-related crimes |
| S4 | Child sexual exploitation |
| S5 | Defamation |
| S6 | Specialized advice |
| S7 | Privacy |
| S8 | Intellectual property |
| S9 | Indiscriminate weapons |
| S10 | Hate |
| S11 | Suicide and self-harm |
| S12 | Sexual content |
| S13 | Elections |
| S14 | Code interpreter abuse |
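When post-processing logs or analytics exports that contain `cf.llm.prompt.unsafe_topic_categories` values, a small lookup table can translate the category codes above into readable names. This is an illustrative helper, not part of the product:

```python
# Translate unsafe topic category codes (as returned in
# cf.llm.prompt.unsafe_topic_categories) into readable names.
# The mapping mirrors the default category table above.
UNSAFE_TOPIC_CATEGORIES = {
    "S1": "Violent crimes",
    "S2": "Non-violent crimes",
    "S3": "Sex-related crimes",
    "S4": "Child sexual exploitation",
    "S5": "Defamation",
    "S6": "Specialized advice",
    "S7": "Privacy",
    "S8": "Intellectual property",
    "S9": "Indiscriminate weapons",
    "S10": "Hate",
    "S11": "Suicide and self-harm",
    "S12": "Sexual content",
    "S13": "Elections",
    "S14": "Code interpreter abuse",
}

def describe(codes):
    """Return readable names for a list of category codes, keeping
    unknown codes as-is so new categories do not break reporting."""
    return [UNSAFE_TOPIC_CATEGORIES.get(c, c) for c in codes]

print(describe(["S1", "S10"]))  # ['Violent crimes', 'Hate']
```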
Example rule that blocks any prompt where an unsafe topic was detected:

- When incoming requests match:

  | Field | Operator | Value |
  | --- | --- | --- |
  | LLM Unsafe topic detected | equals | True |

  Expression when using the editor:

  `(cf.llm.prompt.unsafe_topic_detected)`

- Action: Block

Example rule that blocks prompts matching specific categories:

- When incoming requests match:

  | Field | Operator | Value |
  | --- | --- | --- |
  | LLM Unsafe topic categories | is in | S1: Violent Crimes, S10: Hate |

  Expression when using the editor:

  `(any(cf.llm.prompt.unsafe_topic_categories[*] in {"S1" "S10"}))`

- Action: Block
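A rule like the category-matching example above can also be deployed programmatically. The sketch below only assembles the JSON body you might send to the Cloudflare Rulesets API (for example, to the `http_request_firewall_custom` phase entrypoint of a zone); verify the endpoint and payload shape against the current Rulesets API documentation before use:

```python
import json

def block_unsafe_topics_rule(categories=("S1", "S10")):
    """Build a custom-rule object matching specific unsafe topic
    categories, mirroring the expression shown in the example above."""
    values = " ".join(f'"{c}"' for c in categories)
    return {
        "description": "Block prompts touching unsafe topics",
        "expression": (
            f"(any(cf.llm.prompt.unsafe_topic_categories[*] in {{{values}}}))"
        ),
        "action": "block",
    }

# Payload shape for updating a phase entrypoint ruleset (a sketch,
# not a complete API client):
payload = {"rules": [block_unsafe_topics_rule()]}
print(json.dumps(payload, indent=2))
```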
Custom topic detection
Custom topic detection lets you define your own topics, and AI Security for Apps scores each prompt against them. You can then use these scores in custom rules or rate limiting rules to block, challenge, or log matching requests.
This capability uses a zero-shot classification model that evaluates prompts at runtime. No model training is required.
- You define a list of up to 20 custom topics via the dashboard or API. Each topic consists of:
- A label — Used in rule expressions and analytics
- A topic string — The descriptive text the model uses to classify prompts
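The constraints above can be sketched as a small validation helper. The function and field names are illustrative (not the Cloudflare API): it checks the 20-topic limit and pairs each label with its topic string:

```python
# Hypothetical helper: validate and assemble a custom-topic list.
# Constraints from the text above: at most 20 topics, each with a
# label (used in rule expressions and analytics) and a topic string
# (the descriptive text the zero-shot classifier scores prompts against).

MAX_TOPICS = 20

def build_custom_topics(topics):
    """topics: list of (label, topic_string) pairs."""
    if len(topics) > MAX_TOPICS:
        raise ValueError(f"at most {MAX_TOPICS} custom topics are allowed")
    return [{"label": label, "topic": text} for label, text in topics]

custom_topics = build_custom_topics([
    ("competitors", "Questions about competing products or vendors"),
    ("financial_advice", "Requests for personalized financial advice"),
])
print(custom_topics[0]["label"])  # competitors
```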