allow shared prefix question and system prompt variance and calculate…#301

Merged
jjk-g merged 48 commits into kubernetes-sigs:main from tomatillo-and-multiverse:main on Feb 13, 2026

@kaushikmitr
Contributor

@kaushikmitr kaushikmitr commented Dec 8, 2025

This pull request introduces several enhancements and new features to the inference performance benchmarking and reporting framework. The main focus is on supporting Service Level Objective (SLO) tracking for latency metrics (TTFT and TPOT), making prompt and output length distributions more flexible, and improving metric calculation and reporting. The changes touch data models, configuration, data generation, metric collection, and reporting.

Key changes include:

SLO Tracking and Metric Enhancements

  • Added new fields to RequestLifecycleMetric (ttft, tpot, ttft_slo, tpot_slo, ttft_slo_met, tpot_slo_met, ntpot) to track time-to-first-token, time-per-output-token, their SLO thresholds, and attainment status. (inference_perf/apis/base.py)
  • Extended APIConfig to allow configuration of SLO thresholds and header names for TTFT and TPOT, and updated the OpenAI client to calculate these metrics and evaluate SLO attainment for each request. (inference_perf/config.py, inference_perf/client/modelserver/openai_client.py) [1] [2]
  • Introduced a calculate_slo_metrics function to aggregate SLO attainment statistics and goodput, and integrated these metrics into the summary reporting. (inference_perf/reportgen/base.py)
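
The per-request bookkeeping described above can be sketched roughly as follows. The field names mirror those listed for RequestLifecycleMetric, but the `RequestSLOResult` dataclass, the `evaluate_slo` helper, and the TPOT formula (decode time divided by the number of inter-token gaps) are assumptions made for illustration, not code taken from this PR. `ntpot` is left out because its definition is not given in the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestSLOResult:
    # Hypothetical container; field names follow the PR description.
    ttft: Optional[float]
    tpot: Optional[float]
    ttft_slo: Optional[float]
    tpot_slo: Optional[float]
    ttft_slo_met: Optional[bool]
    tpot_slo_met: Optional[bool]


def evaluate_slo(
    start_time: float,
    token_times: List[float],
    ttft_slo: Optional[float] = None,
    tpot_slo: Optional[float] = None,
) -> RequestSLOResult:
    """Compute TTFT/TPOT in seconds from token arrival times and check thresholds."""
    ttft = token_times[0] - start_time if token_times else None
    # Assumed definition: mean gap between consecutive output tokens.
    tpot = None
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    # Attainment is left as None when the metric or its threshold is unavailable.
    ttft_met = None if ttft is None or ttft_slo is None else ttft <= ttft_slo
    tpot_met = None if tpot is None or tpot_slo is None else tpot <= tpot_slo
    return RequestSLOResult(ttft, tpot, ttft_slo, tpot_slo, ttft_met, tpot_met)


result = evaluate_slo(
    start_time=0.0,
    token_times=[1.5, 1.6, 1.7, 1.8],
    ttft_slo=2.0,
    tpot_slo=0.2,
)
```

Keeping the attainment flags as `None` when no threshold is configured lets a later aggregation step skip requests that were never subject to an SLO, rather than counting them as failures.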

Flexible Prompt and Output Length Distribution

  • Added support for specifying standard deviation, min, and max for both question and output lengths in SharedPrefix config, and updated the data generator to use these parameters for more realistic prompt and output length distributions. (inference_perf/config.py, inference_perf/datagen/shared_prefix_datagen.py)
  • Ensured that prompt and user session shuffling is handled correctly to avoid ordering effects in data generation. (inference_perf/datagen/shared_prefix_datagen.py)
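
One way to picture the length parameters above is a normal distribution clamped to the configured bounds. This is only a sketch: the `sample_lengths` helper, its parameter names, and the choice of a clamped normal are assumptions, and the generator in this PR may sample or truncate differently.

```python
import random
from typing import List


def sample_lengths(
    rng: random.Random,
    mean: float,
    std: float,
    min_len: int,
    max_len: int,
    count: int,
) -> List[int]:
    """Draw `count` token lengths from a normal distribution, clamped to [min_len, max_len]."""
    lengths = []
    for _ in range(count):
        value = round(rng.gauss(mean, std))
        lengths.append(max(min_len, min(max_len, value)))
    return lengths


rng = random.Random(0)  # fixed seed so runs are reproducible
question_lens = sample_lengths(rng, mean=50, std=10, min_len=20, max_len=80, count=8)
output_lens = sample_lengths(rng, mean=100, std=25, min_len=10, max_len=200, count=8)

# Pair each question length with its target output length, then shuffle the
# pairs together so request order does not follow generation order.
requests = list(zip(question_lens, output_lens))
rng.shuffle(requests)
```

Shuffling the paired tuples, rather than each list separately, keeps every prompt attached to its own output length while still removing ordering effects.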

Streaming API and Payload Improvements

  • Updated to_payload methods for chat and completion APIs to include stream_options when streaming, and fixed a parameter name for clarity in user session completion API data. (inference_perf/apis/chat.py, inference_perf/apis/completion.py, inference_perf/apis/user_session.py)
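
A minimal sketch of the payload change, assuming an OpenAI-style completions body. The standalone `to_payload` function and its parameters are hypothetical (the PR's methods live on its API data classes), and `{"include_usage": True}` is an assumed value for `stream_options`, since the description does not say what the field contains.

```python
from typing import Any, Dict


def to_payload(model: str, prompt: str, max_tokens: int, stream: bool) -> Dict[str, Any]:
    """Build a completions-style request body (hypothetical helper, not the PR's method)."""
    payload: Dict[str, Any] = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": stream,
    }
    if stream:
        # Assumed value: ask the server to append a final usage chunk to the stream.
        payload["stream_options"] = {"include_usage": True}
    return payload


streaming_body = to_payload("example-model", "Hello", max_tokens=16, stream=True)
plain_body = to_payload("example-model", "Hello", max_tokens=16, stream=False)
```

Adding the field only when `stream` is true leaves non-streaming requests unchanged, which is why only the streaming tests needed updating.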

Test Updates

  • Updated streaming API tests to account for the new stream_options field in the payload. (tests/apis/test_completion.py)

Example added in the stage_x_lifecycle_metric.json:

  "slo_metrics": {
    "ttft_slo": {
      "attainment_pct": 83,
      "requests_met": 166,
      "requests_failed": 34,
      "total_requests": 200,
      "slo": 2
    },
    "tpot_slo": {
      "attainment_pct": 100,
      "requests_met": 200,
      "requests_failed": 0,
      "total_requests": 200,
      "slo": 0.2
    },
    "combined_slo": {
      "attainment_pct": 83,
      "requests_met": 166,
      "requests_failed": 34,
      "total_requests": 200,
      "ttft_slo": 2,
      "tpot_slo": 0.2,
      "goodput_rate": 23397.1484983487
    }
  }
},
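
The counts in this example are consistent with a simple tally of per-request attainment flags, sketched below. The `summarize` and `_counts` helpers are hypothetical, and treating the combined SLO as "both thresholds met" is an assumption that happens to match the numbers shown. `goodput_rate` is left out because the thread does not say how it is computed.

```python
from typing import Dict, List, Tuple


def _counts(flags: List[bool]) -> Dict[str, float]:
    """Tally how many requests met a threshold."""
    total = len(flags)
    met = sum(flags)
    return {
        "attainment_pct": 100.0 * met / total if total else 0.0,
        "requests_met": met,
        "requests_failed": total - met,
        "total_requests": total,
    }


def summarize(
    results: List[Tuple[bool, bool]], ttft_slo: float, tpot_slo: float
) -> Dict[str, Dict[str, float]]:
    """Aggregate (ttft_met, tpot_met) pairs into per-SLO and combined statistics."""
    ttft = _counts([t for t, _ in results])
    ttft["slo"] = ttft_slo
    tpot = _counts([p for _, p in results])
    tpot["slo"] = tpot_slo
    # Assumption: a request counts toward the combined SLO only if it met both.
    combined = _counts([t and p for t, p in results])
    combined["ttft_slo"] = ttft_slo
    combined["tpot_slo"] = tpot_slo
    return {"ttft_slo": ttft, "tpot_slo": tpot, "combined_slo": combined}


# 166 requests meet both thresholds; 34 miss TTFT but meet TPOT.
results = [(True, True)] * 166 + [(False, True)] * 34
summary = summarize(results, ttft_slo=2, tpot_slo=0.2)
```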

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 8, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 8, 2025
@SachinVarghese SachinVarghese requested a review from jjk-g December 11, 2025 17:28
@SachinVarghese SachinVarghese added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Dec 18, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 18, 2025
@kaushikmitr
Contributor Author

@jjk-g @Bslabe123 This PR has some modifications I needed for Inference Gateway experiments. Specifically, I added variance for question/output lengths in shared-prefix scenarios and enabled SLO attainment calculation. These seem generally applicable beyond my use case, so I wanted to upstream them. Let me know your thoughts!

@jjk-g
Collaborator

jjk-g commented Jan 8, 2026

Thanks for raising!

  1. Can you take a look at the check failures (lint and unit test)?
  2. Can you add example output to the PR?
  3. Can you add documentation to a relevant README?

@huaxig to recommend a test to add

@kaushikmitr
Contributor Author

Thanks. Sure, I will work on these once I am back from vacation on 02/01.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 29, 2026
@kaushikmitr
Contributor Author

@jjk-g addressed 1, 2, and 3

@kaushikmitr
Contributor Author

@jjk-g @Bslabe123 just a gentle nudge whenever you get a chance to take a look 🙂

Comment thread inference_perf/client/modelserver/openai_client.py Outdated
Comment thread inference_perf/config.py Outdated
Comment thread inference_perf/datagen/shared_prefix_datagen.py Outdated
Comment thread inference_perf/reportgen/base.py Outdated
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 5, 2026
@kaushikmitr
Contributor Author

@jjk-g addressed the comments. PTAL

@kaushikmitr kaushikmitr requested a review from jjk-g February 11, 2026 21:09
Contributor Author

@kaushikmitr kaushikmitr left a comment

resolved

Comment thread inference_perf/config.py Outdated
@jjk-g
Collaborator

jjk-g commented Feb 13, 2026

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jjk-g, kaushikmitr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2026
@jjk-g jjk-g merged commit fc877d3 into kubernetes-sigs:main Feb 13, 2026
4 of 5 checks passed
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

4 participants