Bug Report: Flaky stage hangs in multi-worker load generation
Symptoms
Load generator intermittently gets stuck mid-stage with zero progress. Workers are alive (event loops running) but not processing any requests. No errors or timeouts reported. The issue is flaky — same config works on some runs but hangs on others. Observed even at low QPS (rate=4).
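The "alive but zero progress" pattern above is easiest to catch with a stall watchdog that fires when no request completes within a window. A minimal sketch (class and method names are illustrative, not from the repo):

```python
import time

class StallWatchdog:
    """Flags a stage as stalled when no request has completed
    within `stall_seconds` (illustrative helper, not repo code)."""

    def __init__(self, stall_seconds: float = 30.0):
        self.stall_seconds = stall_seconds
        self.completed = 0
        self._last_completed = 0
        self._last_change = time.monotonic()

    def record_completion(self) -> None:
        self.completed += 1

    def check(self) -> bool:
        """Return True if no completion occurred within the stall window."""
        now = time.monotonic()
        if self.completed != self._last_completed:
            self._last_completed = self.completed
            self._last_change = now
            return False
        return (now - self._last_change) > self.stall_seconds
```

Polling `check()` from a side task would distinguish this hang (workers alive, counter frozen) from ordinary slow progress.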
Last known good commit
fc877d3 — allow shared prefix question and system prompt variance (#301)
Config used with 10 model servers on H100-80 GB (vLLM, tp = 2, qwen-32b)
load:
  type: poisson
  num_workers: 6
  worker_max_concurrency: 1000
  stages:
    - rate: 1
      duration: 100
    - rate: 2
      duration: 100
    # ... up to rate: 9
data:
  type: shared_prefix
  shared_prefix:
    num_groups: 150
    num_prompts_per_group: 5
    system_prompt_len: 6000
    question_distribution:
      min: 1
      max: 10000
      mean: 1200
      std_dev: 360
    output_distribution:
      min: 1
      max: 10000
      mean: 1000
      std_dev: 300
  enable_multi_turn_chat: true
Regression window
The following commits between fc877d3 and c2cb68b (current HEAD) may have introduced the issue:
4ad7d4b — fix: use worker_id in request queue for concurrent load generation (#347)
f25a0de — Refactor OpenAI client to fix connection leaks (#247)
c2cb68b — Refactor RequestQueueData to NamedTuple (#333)
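Since 4ad7d4b changed how requests are keyed by `worker_id`, one hypothesis worth checking is queue mis-routing: if the producer keys requests by one worker-id scheme and consumers pop by another, some queues fill while others stay empty, and those workers sit alive but idle — exactly the symptom above. A hypothetical illustration (not the repo's code; all names here are made up):

```python
import asyncio

async def demo_misrouted_queues(num_workers: int = 2) -> list[int]:
    """Show how a routing mismatch starves workers: the producer
    sends everything to queue 0, so worker 1 never sees a request."""
    queues = {i: asyncio.Queue() for i in range(num_workers)}

    # Simulated bug: all requests land on worker 0's queue.
    for req in range(4):
        queues[0].put_nowait(req)

    async def worker(wid: int) -> int:
        handled = 0
        while True:
            try:
                # A real worker would block on get() forever; the demo
                # polls with a timeout so it terminates instead of hanging.
                await asyncio.wait_for(queues[wid].get(), timeout=0.1)
                handled += 1
            except asyncio.TimeoutError:
                return handled

    return await asyncio.gather(*(worker(i) for i in range(num_workers)))
```

Here worker 1 handles zero requests while its event loop keeps running — the "workers alive, zero progress" signature. If this hypothesis holds, the hang should disappear when 4ad7d4b is reverted, which a `git bisect` across the window above could confirm.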