allow shared prefix question and system prompt variance and calculate… #301
Conversation
@jjk-g @Bslabe123 This PR has some modifications I needed for Inference Gateway experiments. Specifically, I added variance for question/output lengths in shared-prefix scenarios and enabled SLO attainment calculation. These seem generally applicable beyond my use case, so I wanted to upstream them. Let me know your thoughts!
Thanks for raising!
@huaxig to recommend a test to add
Thanks. Sure, I will work on these once I am back from vacation on 02/01.
@jjk-g @Bslabe123 just a gentle nudge whenever you get a chance to take a look 🙂
@jjk-g addressed the comments. PTAL
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: jjk-g, kaushikmitr. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This pull request introduces several enhancements and new features to the inference performance benchmarking and reporting framework. The main focus is on supporting Service Level Objective (SLO) tracking for latency metrics (TTFT and TPOT), making prompt and output length distributions more flexible, and improving metric calculation and reporting. The changes touch data models, configuration, data generation, metric collection, and reporting.
Key changes include:
SLO Tracking and Metric Enhancements
- Added new fields to `RequestLifecycleMetric` (`ttft`, `tpot`, `ttft_slo`, `tpot_slo`, `ttft_slo_met`, `tpot_slo_met`, `ntpot`) to track time-to-first-token, time-per-output-token, their SLO thresholds, and attainment status. (`inference_perf/apis/base.py`)
- Extended `APIConfig` to allow configuration of SLO thresholds and header names for TTFT and TPOT, and updated the OpenAI client to calculate these metrics and evaluate SLO attainment for each request. (`inference_perf/config.py`, `inference_perf/client/modelserver/openai_client.py`) [1] [2]
- Added a `calculate_slo_metrics` function to aggregate SLO attainment statistics and goodput, and integrated these metrics into the summary reporting. (`inference_perf/reportgen/base.py`)

Flexible Prompt and Output Length Distribution
- Added variance parameters for question and output lengths to the `SharedPrefix` config, and updated the data generator to use these parameters for more realistic prompt and output length distributions. (`inference_perf/config.py`, `inference_perf/datagen/shared_prefix_datagen.py`)

Streaming API and Payload Improvements
- Updated the `to_payload` methods for the chat and completion APIs to include `stream_options` when streaming, and fixed a parameter name for clarity in the user session completion API data. (`inference_perf/apis/chat.py`, `inference_perf/apis/completion.py`, `inference_perf/apis/user_session.py`)

Test Updates
- Updated tests to verify the `stream_options` field in the payload. (`tests/apis/test_completion.py`)

Example added in the stage_x_lifecycle_metric.json:
"slo_metrics": {
"ttft_slo": {
"attainment_pct": 83,
"requests_met": 166,
"requests_failed": 34,
"total_requests": 200,
"slo": 2
},
"tpot_slo": {
"attainment_pct": 100,
"requests_met": 200,
"requests_failed": 0,
"total_requests": 200,
"slo": 0.2
},
"combined_slo": {
"attainment_pct": 83,
"requests_met": 166,
"requests_failed": 34,
"total_requests": 200,
"ttft_slo": 2,
"tpot_slo": 0.2,
"goodput_rate": 23397.1484983487
}
}
},
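The aggregation behind this output can be sketched as follows. This is a minimal illustration, not the PR's actual implementation (which lives in `inference_perf/reportgen/base.py`): the `RequestMetric` dataclass, the function signature, and the `duration_s` parameter are assumptions for the sake of the example; only the output shape mirrors the JSON above.

```python
from dataclasses import dataclass

@dataclass
class RequestMetric:
    """Hypothetical per-request record, loosely modeled on RequestLifecycleMetric."""
    ttft: float          # time to first token, seconds
    tpot: float          # time per output token, seconds
    output_tokens: int   # tokens generated by this request

def calculate_slo_metrics(requests, ttft_slo, tpot_slo, duration_s):
    """Aggregate SLO attainment and goodput into a dict shaped like the
    slo_metrics JSON example. Goodput here counts only tokens from
    requests that met BOTH SLOs, divided by the benchmark duration."""
    total = len(requests)

    def attainment(met):
        return {
            "attainment_pct": round(100 * met / total),
            "requests_met": met,
            "requests_failed": total - met,
            "total_requests": total,
        }

    ttft_met = sum(1 for r in requests if r.ttft <= ttft_slo)
    tpot_met = sum(1 for r in requests if r.tpot <= tpot_slo)
    good = [r for r in requests if r.ttft <= ttft_slo and r.tpot <= tpot_slo]

    return {
        "ttft_slo": {**attainment(ttft_met), "slo": ttft_slo},
        "tpot_slo": {**attainment(tpot_met), "slo": tpot_slo},
        "combined_slo": {
            **attainment(len(good)),
            "ttft_slo": ttft_slo,
            "tpot_slo": tpot_slo,
            "goodput_rate": sum(r.output_tokens for r in good) / duration_s,
        },
    }
```

Note the combined attainment can be lower than either individual number, since a request must satisfy both thresholds at once to count toward goodput.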