allow shared prefix question and system prompt variance and calculate… (#301)
This pull request introduces several enhancements and new features to
the inference performance benchmarking and reporting framework. The main
focus is on supporting Service Level Objective (SLO) tracking for the
latency metrics time-to-first-token (TTFT) and time-per-output-token
(TPOT), making prompt and output length distributions more flexible, and
improving metric calculation and reporting. The changes touch data
models, configuration, data generation, metric collection, and reporting.
**Key changes include:**
### SLO Tracking and Metric Enhancements
* Added new fields to `RequestLifecycleMetric` (`ttft`, `tpot`,
`ttft_slo`, `tpot_slo`, `ttft_slo_met`, `tpot_slo_met`, `ntpot`) to
track time-to-first-token, time-per-output-token, their SLO thresholds,
and attainment status. (`inference_perf/apis/base.py`)
* Extended `APIConfig` to allow configuration of SLO thresholds and
header names for TTFT and TPOT, and updated the OpenAI client to
calculate these metrics and evaluate SLO attainment for each request.
(`inference_perf/config.py`,
`inference_perf/client/modelserver/openai_client.py`)
* Introduced a `calculate_slo_metrics` function to aggregate SLO
attainment statistics and goodput, and integrated these metrics into the
summary reporting. (`inference_perf/reportgen/base.py`)
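To make the shape of these changes concrete, here is a minimal sketch of how the new fields and the aggregation could fit together. `RequestLifecycleMetric` and `calculate_slo_metrics` are the names from this PR, but the exact field set, function signature, `output_tokens` field, and goodput definition below are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestLifecycleMetric:
    # Illustrative subset of the new fields; the real model in
    # inference_perf/apis/base.py may differ in names and types.
    ttft: Optional[float] = None      # time to first token (seconds)
    tpot: Optional[float] = None      # mean time per output token (seconds)
    ttft_slo: Optional[float] = None  # TTFT SLO threshold
    tpot_slo: Optional[float] = None  # TPOT SLO threshold
    ttft_slo_met: bool = False
    tpot_slo_met: bool = False
    output_tokens: int = 0            # assumed field, used for goodput


def calculate_slo_metrics(metrics: List[RequestLifecycleMetric], duration_s: float) -> dict:
    """Aggregate per-request SLO attainment into summary statistics (sketch).

    Assumes a non-empty metrics list and a positive benchmark duration.
    """
    total = len(metrics)
    ttft_met = sum(m.ttft_slo_met for m in metrics)
    tpot_met = sum(m.tpot_slo_met for m in metrics)
    both = [m for m in metrics if m.ttft_slo_met and m.tpot_slo_met]
    # Goodput is assumed here to be output tokens from requests that met
    # both SLOs, divided by wall-clock benchmark time.
    goodput = sum(m.output_tokens for m in both) / duration_s
    return {
        "ttft_slo": {"attainment_pct": round(100 * ttft_met / total), "requests_met": ttft_met, "total_requests": total},
        "tpot_slo": {"attainment_pct": round(100 * tpot_met / total), "requests_met": tpot_met, "total_requests": total},
        "combined_slo": {"requests_met": len(both), "total_requests": total, "goodput_rate": goodput},
    }
```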
### Flexible Prompt and Output Length Distribution
* Added support for specifying standard deviation, min, and max for both
question and output lengths in `SharedPrefix` config, and updated the
data generator to use these parameters for more realistic prompt and
output length distributions. (`inference_perf/config.py`,
`inference_perf/datagen/shared_prefix_datagen.py`)
* Ensured that prompt and user session shuffling is handled correctly to
avoid ordering effects in data generation.
(`inference_perf/datagen/shared_prefix_datagen.py`)
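For instance, a clipped normal draw gives the flavor of what the new standard deviation, min, and max parameters enable. The helper below is a sketch under that assumption, not the actual generator code in `shared_prefix_datagen.py`:

```python
import random


def sample_length(mean: float, std_dev: float, min_len: int, max_len: int) -> int:
    """Draw a token length from a normal distribution, clipped to [min_len, max_len]."""
    return max(min_len, min(max_len, round(random.gauss(mean, std_dev))))


# e.g. question lengths centred on 50 tokens, bounded to [10, 100]
question_lens = [sample_length(50, 10, 10, 100) for _ in range(1000)]

# Shuffling the generated prompts avoids the ordering effects noted above.
random.shuffle(question_lens)
```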
### Streaming API and Payload Improvements
* Updated `to_payload` methods for chat and completion APIs to include
`stream_options` when streaming, and fixed a parameter name for clarity
in user session completion API data. (`inference_perf/apis/chat.py`,
`inference_perf/apis/completion.py`,
`inference_perf/apis/user_session.py`)
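As a rough illustration (not the actual method from `inference_perf/apis/completion.py`), the payload construction might look like the sketch below; `include_usage` is the standard OpenAI-style stream option for reporting token usage on streamed responses, though the exact options set in this PR is an assumption:

```python
def to_payload(prompt: str, max_tokens: int, streaming: bool) -> dict:
    """Build a completion request body (illustrative sketch, not the real method)."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": streaming,
    }
    if streaming:
        # stream_options asks the server to emit a usage block on the
        # stream so output-token counts can still be recorded.
        payload["stream_options"] = {"include_usage": True}
    return payload
```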
### Test Updates
* Updated streaming API tests to account for the new `stream_options`
field in the payload. (`tests/apis/test_completion.py`)
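A sketch of what the updated assertion could look like, exercising the illustrative `to_payload` above (the real test layout in `tests/apis/test_completion.py` may differ):

```python
def test_streaming_payload_includes_stream_options() -> None:
    payload = to_payload("Hello", max_tokens=50, streaming=True)
    assert payload["stream"] is True
    assert payload["stream_options"] == {"include_usage": True}
```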
Example of the new `slo_metrics` block added to `stage_x_lifecycle_metric.json`:
"slo_metrics": {
"ttft_slo": {
"attainment_pct": 83,
"requests_met": 166,
"requests_failed": 34,
"total_requests": 200,
"slo": 2
},
"tpot_slo": {
"attainment_pct": 100,
"requests_met": 200,
"requests_failed": 0,
"total_requests": 200,
"slo": 0.2
},
"combined_slo": {
"attainment_pct": 83,
"requests_met": 166,
"requests_failed": 34,
"total_requests": 200,
"ttft_slo": 2,
"tpot_slo": 0.2,
"goodput_rate": 23397.1484983487
}
}
},
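Each `attainment_pct` follows directly from the counts (166 / 200 = 83% for TTFT, 200 / 200 = 100% for TPOT), and the combined attainment counts only requests that met both SLOs, here the same 166. `goodput_rate` is presumably throughput (output tokens per second) measured over the SLO-compliant requests only.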
**docs/config.md** (25 additions, 10 deletions):
````diff
@@ -22,15 +22,20 @@ This document provides complete documentation for all configuration options avai
 
 ### API Configuration
 
-Controls the API interaction behavior:
+Controls the API interaction behavior. If SLO headers are present, each request is evaluated for SLO compliance and SLO-related metrics are reported:
 
 ```yaml
 api:
-  type: completion # API type (completion|chat) (default: completion), completion is the default since the chat API is not typically enabled on model servers such as vLLM by default without additional configuration.
-  streaming: false # Enable/disable streaming (default: false), needs to be enabled for metrics like TTFT, ITL and TPOT to be measured
-  headers: # Add custom http headers to the request sent to the inference server
+  type: completion # API type (completion|chat). completion is the default since chat may require extra server config
+  streaming: true # Enable streaming for TTFT, ITL, and TPOT metrics
+  headers: # Optional custom HTTP headers
     x-inference-model: llama
     x-routing-strategy: round-robin
+    x-slo-tpot-ms: "2"
+    x-slo-ttft-ms: "1000"
+  slo_unit: "ms" # Optional SLO unit (e.g., ms, s), default is ms
+  slo_tpot_header: "x-slo-tpot-ms" # Optional header name for the TPOT SLO, default is x-slo-tpot-ms
+  slo_ttft_header: "x-slo-ttft-ms" # Optional header name for the TTFT SLO, default is x-slo-ttft-ms
 ```
 
 ### Data Generation
@@ -53,12 +58,22 @@ data:
     mean: 50
     std_dev: 10
   total_count: 100
-  shared_prefix:
-    num_unique_system_prompts: 10 # Number of distinct shared prefixes (formerly num_groups)
-    num_users_per_system_prompt: 10 # Number of unique questions per shared prefix (formerly num_prompts_per_group)
-    system_prompt_len: 100 # Length of the shared prefix (in tokens)
-    question_len: 50 # Length of the unique question part (in tokens)
-    output_len: 50 # Target length for the model's generated output (in tokens)
+  shared_prefix: # For shared_prefix type
+    num_groups: 10 # Number of shared prefix groups
+    num_prompts_per_group: 10 # Unique questions per group
````