This directory contains a catalog of real-world benchmarking workloads. Extending beyond standard single-turn prompts, it models the distinct system constraints of agentic workflows—where models generate commands, pause for external tool execution, and inject results into subsequent prompts. By simulating these causal dependencies and inter-request wait times, users can evaluate KV cache retention, prefix-aware routing, and memory fragmentation under realistic agent-driven load.
The aim of this catalog is to provide a standardized, reproducible way to benchmark real-world workloads. Regardless of the benchmark harness used underneath, each catalog entry should contain enough detail to reproduce the workload's performance characteristics.
Each workload in the catalog is organized into a directory containing:
- `config.json`: Describes the use-case-related parameters in detail (e.g., input/output sequence lengths, distributions, number of turns). This file is intended to be harness-agnostic and purely descriptive of the workload's statistical profile.
- `inference-perf.yaml`: A benchmark configuration file specific to `inference-perf` that can be used to run the workload directly.
- `README.md`: Documentation explaining the use case, the rationale for the chosen distributions, reference datasets, and system impact.
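For example, the `interactive-chat` workload listed later in this document is laid out as:

```
interactive-chat/
├── config.json          # harness-agnostic statistical profile
├── inference-perf.yaml  # inference-perf run configuration
└── README.md            # use case rationale and system impact notes
```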
The `config.json` file in each workload directory defines the statistical profile of the workload. Below is a description of its fields:
| Field | Type | Description |
|---|---|---|
| `input_sequence_length` | Object | Distribution of input tokens (prompt size). Contains `min`, `max`, `mean`, `standard_deviation`, `distribution_type`. |
| `output_sequence_length` | Object | Distribution of output tokens (completion size). Contains `min`, `max`, `mean`, `standard_deviation`, `distribution_type`. |
| `number_of_turns` | Object | Distribution of the number of turns in the conversation. |
| `time_between_turns` | Object | Distribution of the time (in seconds) between turns. |
| `system_prompt` | Object | Distribution of the system prompt length. |
| `input_sequence_length_per_turn` | Object | Distribution of input tokens added per turn (for multi-turn workloads). |
| `multi_turn` | Boolean | Indicates whether the workload is multi-turn. |
| `metadata` | Object | Additional parameters for workload classification and optimization mapping. |
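Below is a minimal sketch of a `config.json`. The numbers are purely illustrative and the `metadata` keys are hypothetical, not taken from any catalog entry:

```json
{
  "input_sequence_length": {
    "min": 128, "max": 4096, "mean": 1024,
    "standard_deviation": 512, "distribution_type": "lognormal"
  },
  "output_sequence_length": {
    "min": 16, "max": 1024, "mean": 256,
    "standard_deviation": 128, "distribution_type": "lognormal"
  },
  "multi_turn": true,
  "number_of_turns": {"min": 1, "max": 20, "mean": 6, "standard_deviation": 3, "distribution_type": "normal"},
  "time_between_turns": {"min": 1, "max": 60, "mean": 10, "standard_deviation": 5, "distribution_type": "exponential"},
  "system_prompt": {"min": 0, "max": 512, "mean": 256, "standard_deviation": 64, "distribution_type": "normal"},
  "input_sequence_length_per_turn": {"min": 32, "max": 512, "mean": 128, "standard_deviation": 64, "distribution_type": "lognormal"},
  "metadata": {
    "category": "prefill_heavy",
    "shared_prefix": true
  }
}
```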
The `metadata` field contains additional parameters that help classify the workload (e.g., prefill-heavy vs. decode-heavy). These can be used to:
- Filter and Query: Classify and filter benchmarks based on workload characteristics, such as finding workloads with specific prefill/decode ratios (see the sketch after this list).
- Map to Optimizations: Understand how workload characteristics map to underlying optimizations at the inference server level. For example:
- Prefill-Heavy Workloads: May benefit from chunked prefill or prefill/decode disaggregation.
- Shared Prefixes / System Prompts: Benefit from prefix caching or prefix-aware routing.
- Long Contexts: May require KV cache offloading or specific quantization techniques.
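As a minimal sketch of the filter-and-query use, the snippet below scans a catalog directory (one subdirectory per workload, each with a `config.json`, as described above). Rather than relying on any particular `metadata` schema, it derives a prefill-heavy classification from the mean sequence lengths; the 4.0 ratio cutoff is an arbitrary illustrative choice:

```python
import json
from pathlib import Path


def load_workloads(catalog_dir: str) -> dict[str, dict]:
    """Load every workload's config.json, keyed by workload directory name."""
    workloads = {}
    for config_path in Path(catalog_dir).glob("*/config.json"):
        workloads[config_path.parent.name] = json.loads(config_path.read_text())
    return workloads


def is_prefill_heavy(config: dict, ratio_threshold: float = 4.0) -> bool:
    """Classify as prefill-heavy when the mean input length dominates the
    mean output length; the 4.0 cutoff is illustrative, not normative."""
    isl = config["input_sequence_length"]["mean"]
    osl = config["output_sequence_length"]["mean"]
    return isl / osl >= ratio_threshold


if __name__ == "__main__":
    for name, config in sorted(load_workloads(".").items()):
        traits = []
        if is_prefill_heavy(config):
            traits.append("prefill-heavy (chunked prefill / P-D disaggregation)")
        if config.get("multi_turn"):
            traits.append("multi-turn (prefix caching / prefix-aware routing)")
        print(f"{name}: {', '.join(traits) or 'no special traits'}")
```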
When these workloads are run through a harness (like inference-perf), they should produce a report in the upcoming standard benchmarking format. This will make it easier to compare results across different serving stacks and hardware configurations.
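A workload can then be launched directly with its bundled harness configuration. A minimal sketch, assuming a local `inference-perf` installation (the exact CLI flag names may differ between versions, so check the inference-perf documentation):

```
inference-perf --config_file interactive-chat/inference-perf.yaml
```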
The catalog currently includes the following workloads:

- interactive-chat: Simulates multi-turn chat conversations with human-scale inter-request latency.
- code-generation: Simulates coding agents (e.g., Tree of Thought) evaluating parallel paths, stressing prefix caching across shared codebase contexts.
- deep-research: Simulates an autonomous research agent (e.g., Sequential ReAct) executing sequential tool calls, testing KV cache retention during causally dependent wait periods.
- reasoning: Simulates step-by-step reasoning tasks (e.g., Chain of Thought) characterized by high output sequence lengths (OSL) and continuous decode phases.
- batch-summarization-rag: Simulates RAG or batch summarization with long inputs.
- batch-synthetic-data-generation: Simulates high-throughput synthetic data generation.