allow shared prefix question and system prompt variance and calculate… #301
Conversation
@jjk-g @Bslabe123 This PR has some modifications I needed for Inference Gateway experiments. Specifically, I added variance for question/output lengths in shared-prefix scenarios and enabled SLO attainment calculation. These seem generally applicable beyond my use case, so I wanted to upstream them. Let me know your thoughts!
Thanks for raising!
@huaxig to recommend a test to add
Thanks. Sure, I will work on these once I am back from vacation on 02/01.
@jjk-g @Bslabe123 just a gentle nudge whenever you get a chance to take a look 🙂
@jjk-g addressed the comments. PTAL
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: jjk-g, kaushikmitr. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This pull request introduces several enhancements and new features to the inference performance benchmarking and reporting framework. The main focus is on supporting Service Level Objective (SLO) tracking for latency metrics (TTFT and TPOT), making prompt and output length distributions more flexible, and improving metric calculation and reporting. The changes touch data models, configuration, data generation, metric collection, and reporting.
Key changes include:
SLO Tracking and Metric Enhancements
- Added new fields to `RequestLifecycleMetric` (`ttft`, `tpot`, `ttft_slo`, `tpot_slo`, `ttft_slo_met`, `tpot_slo_met`, `ntpot`) to track time-to-first-token, time-per-output-token, their SLO thresholds, and attainment status. (`inference_perf/apis/base.py`)
- Extended `APIConfig` to allow configuration of SLO thresholds and header names for TTFT and TPOT, and updated the OpenAI client to calculate these metrics and evaluate SLO attainment for each request. (`inference_perf/config.py`, `inference_perf/client/modelserver/openai_client.py`) [1] [2]
- Added a `calculate_slo_metrics` function to aggregate SLO attainment statistics and goodput, and integrated these metrics into the summary reporting. (`inference_perf/reportgen/base.py`)

Flexible Prompt and Output Length Distribution
- Added variance parameters for question and output lengths to the `SharedPrefix` config, and updated the data generator to use these parameters for more realistic prompt and output length distributions. (`inference_perf/config.py`, `inference_perf/datagen/shared_prefix_datagen.py`)

Streaming API and Payload Improvements
- Updated the `to_payload` methods for the chat and completion APIs to include `stream_options` when streaming, and fixed a parameter name for clarity in the user session completion API data. (`inference_perf/apis/chat.py`, `inference_perf/apis/completion.py`, `inference_perf/apis/user_session.py`)

Test Updates
- Updated tests to verify the `stream_options` field in the payload. (`tests/apis/test_completion.py`)

Example added in the stage_x_lifecycle_metric.json:
"slo_metrics": {
"ttft_slo": {
"attainment_pct": 83,
"requests_met": 166,
"requests_failed": 34,
"total_requests": 200,
"slo": 2
},
"tpot_slo": {
"attainment_pct": 100,
"requests_met": 200,
"requests_failed": 0,
"total_requests": 200,
"slo": 0.2
},
"combined_slo": {
"attainment_pct": 83,
"requests_met": 166,
"requests_failed": 34,
"total_requests": 200,
"ttft_slo": 2,
"tpot_slo": 0.2,
"goodput_rate": 23397.1484983487
}
}
},
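The aggregation behind this output can be sketched as follows. This is a minimal illustration, not the PR's actual implementation (which lives in `inference_perf/reportgen/base.py`): the `RequestMetric` dataclass, the function signature, and the `duration_s` parameter are assumptions for the sake of the example; only the output shape mirrors the JSON above.

```python
from dataclasses import dataclass

@dataclass
class RequestMetric:
    """Hypothetical per-request record, loosely modeled on RequestLifecycleMetric."""
    ttft: float          # time to first token, seconds
    tpot: float          # time per output token, seconds
    output_tokens: int   # tokens generated by this request

def calculate_slo_metrics(requests, ttft_slo, tpot_slo, duration_s):
    """Aggregate SLO attainment and goodput into a dict shaped like the
    slo_metrics JSON example. Goodput here counts only tokens from
    requests that met BOTH SLOs, divided by the benchmark duration."""
    total = len(requests)

    def attainment(met):
        return {
            "attainment_pct": round(100 * met / total),
            "requests_met": met,
            "requests_failed": total - met,
            "total_requests": total,
        }

    ttft_met = sum(1 for r in requests if r.ttft <= ttft_slo)
    tpot_met = sum(1 for r in requests if r.tpot <= tpot_slo)
    good = [r for r in requests if r.ttft <= ttft_slo and r.tpot <= tpot_slo]

    return {
        "ttft_slo": {**attainment(ttft_met), "slo": ttft_slo},
        "tpot_slo": {**attainment(tpot_met), "slo": tpot_slo},
        "combined_slo": {
            **attainment(len(good)),
            "ttft_slo": ttft_slo,
            "tpot_slo": tpot_slo,
            "goodput_rate": sum(r.output_tokens for r in good) / duration_s,
        },
    }
```

Note the combined attainment can be lower than either individual number, since a request must satisfy both thresholds at once to count toward goodput.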