Unsafe and custom topic detection
AI Security for Apps (formerly Firewall for AI) can detect when an LLM prompt touches on unsafe or unwanted subjects. There are two layers of topic detection:
- Default unsafe topics — A built-in set of safety categories that detect harmful content such as violent crimes, hate speech, and sexual content.
- Custom topics — Topics you define to match your organization's specific policies, such as "competitors" or "financial advice".
When AI Security for Apps is enabled, it automatically evaluates prompts against a set of default unsafe topic categories and populates two fields:
- LLM Unsafe topic detected (`cf.llm.prompt.unsafe_topic_detected`) – `true` if any unsafe topic was found.
- LLM Unsafe topic categories (`cf.llm.prompt.unsafe_topic_categories`) – An array of the specific categories detected.
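To make the semantics of these two fields concrete, here is an illustrative sketch (not Cloudflare's implementation) of how a rule engine might evaluate them for a request, mimicking the `any(... in {...})` expression form used in rule examples:

```python
# Illustrative only: simulate evaluating the two LLM unsafe-topic
# fields against a rule that blocks specific categories.

BLOCKED_CATEGORIES = {"S1", "S10"}  # e.g. Violent crimes, Hate

def should_block(fields):
    """Mimics `any(cf.llm.prompt.unsafe_topic_categories[*] in {...})`
    guarded by the boolean detection flag."""
    if not fields.get("cf.llm.prompt.unsafe_topic_detected", False):
        return False
    categories = fields.get("cf.llm.prompt.unsafe_topic_categories", [])
    return any(c in BLOCKED_CATEGORIES for c in categories)

# A prompt flagged for hate speech (S10) matches the rule:
print(should_block({
    "cf.llm.prompt.unsafe_topic_detected": True,
    "cf.llm.prompt.unsafe_topic_categories": ["S10"],
}))  # True
```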
Default unsafe topic categories
| Category | Description |
| --- | --- |
| S1 | Violent crimes |
| S2 | Non-violent crimes |
| S3 | Sex-related crimes |
| S4 | Child sexual exploitation |
| S5 | Defamation |
| S6 | Specialized advice |
| S7 | Privacy |
| S8 | Intellectual property |
| S9 | Indiscriminate weapons |
| S10 | Hate |
| S11 | Suicide and self-harm |
| S12 | Sexual content |
| S13 | Elections |
| S14 | Code interpreter abuse |
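When post-processing logs or analytics exports that contain `cf.llm.prompt.unsafe_topic_categories` values, a small lookup table can translate the category codes above into readable names. This is an illustrative helper, not part of the product:

```python
# Translate unsafe topic category codes (as returned in
# cf.llm.prompt.unsafe_topic_categories) into readable names.
# The mapping mirrors the default category table above.
UNSAFE_TOPIC_CATEGORIES = {
    "S1": "Violent crimes",
    "S2": "Non-violent crimes",
    "S3": "Sex-related crimes",
    "S4": "Child sexual exploitation",
    "S5": "Defamation",
    "S6": "Specialized advice",
    "S7": "Privacy",
    "S8": "Intellectual property",
    "S9": "Indiscriminate weapons",
    "S10": "Hate",
    "S11": "Suicide and self-harm",
    "S12": "Sexual content",
    "S13": "Elections",
    "S14": "Code interpreter abuse",
}

def describe(codes):
    """Return readable names for a list of category codes, keeping
    unknown codes as-is so new categories do not break reporting."""
    return [UNSAFE_TOPIC_CATEGORIES.get(c, c) for c in codes]

print(describe(["S1", "S10"]))  # ['Violent crimes', 'Hate']
```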
Example rule that blocks any prompt where an unsafe topic was detected:

- When incoming requests match:

  | Field | Operator | Value |
  | --- | --- | --- |
  | LLM Unsafe topic detected | equals | True |

  Expression when using the editor:

  `(cf.llm.prompt.unsafe_topic_detected)`

- Action: Block

Example rule that blocks prompts matching specific categories:

- When incoming requests match:

  | Field | Operator | Value |
  | --- | --- | --- |
  | LLM Unsafe topic categories | is in | S1: Violent Crimes, S10: Hate |

  Expression when using the editor:

  `(any(cf.llm.prompt.unsafe_topic_categories[*] in {"S1" "S10"}))`

- Action: Block
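A rule like the category-matching example above can also be deployed programmatically. The sketch below only assembles the JSON body you might send to the Cloudflare Rulesets API (for example, to the `http_request_firewall_custom` phase entrypoint of a zone); verify the endpoint and payload shape against the current Rulesets API documentation before use:

```python
import json

def block_unsafe_topics_rule(categories=("S1", "S10")):
    """Build a custom-rule object matching specific unsafe topic
    categories, mirroring the expression shown in the example above."""
    values = " ".join(f'"{c}"' for c in categories)
    return {
        "description": "Block prompts touching unsafe topics",
        "expression": (
            f"(any(cf.llm.prompt.unsafe_topic_categories[*] in {{{values}}}))"
        ),
        "action": "block",
    }

# Payload shape for updating a phase entrypoint ruleset (a sketch,
# not a complete API client):
payload = {"rules": [block_unsafe_topics_rule()]}
print(json.dumps(payload, indent=2))
```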
Custom topic detection
Custom topic detection lets you define your own topics, and AI Security for Apps scores each prompt against them. You can then use these scores in custom rules or rate limiting rules to block, challenge, or log matching requests.
This capability uses a zero-shot classification model that evaluates prompts at runtime. No model training is required.
- You define a list of up to 20 custom topics via the dashboard or API. Each topic consists of:
- A label — Used in rule expressions and analytics
- A topic string — The descriptive text the model uses to classify prompts
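The constraints above can be sketched as a small validation helper. The function and field names are illustrative (not the Cloudflare API): it checks the 20-topic limit and pairs each label with its topic string:

```python
# Hypothetical helper: validate and assemble a custom-topic list.
# Constraints from the text above: at most 20 topics, each with a
# label (used in rule expressions and analytics) and a topic string
# (the descriptive text the zero-shot classifier scores prompts against).

MAX_TOPICS = 20

def build_custom_topics(topics):
    """topics: list of (label, topic_string) pairs."""
    if len(topics) > MAX_TOPICS:
        raise ValueError(f"at most {MAX_TOPICS} custom topics are allowed")
    return [{"label": label, "topic": text} for label, text in topics]

custom_topics = build_custom_topics([
    ("competitors", "Questions about competing products or vendors"),
    ("financial_advice", "Requests for personalized financial advice"),
])
print(custom_topics[0]["label"])  # competitors
```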