Skip to content

Bug Report: Flaky stage hangs in multi-worker load generation #370

@kaushikmitr

Description

@kaushikmitr

Bug Report: Flaky stage hangs in multi-worker load generation

Symptoms

Load generator intermittently gets stuck mid-stage with zero progress. Workers are alive (event loops running) but not processing any requests. No errors or timeouts reported. The issue is flaky — same config works on some runs but hangs on others. Observed even at low QPS (rate=4).

Last known good commit

fc877d3 — allow shared prefix question and system prompt variance (#301)

Config used with 10 model server on H100-80 GB (vllm, tp = 2, qwen-32b)

load:
  type: poisson
  num_workers: 6
  worker_max_concurrency: 1000
  stages:
    - rate: 1
      duration: 100
    - rate: 2
      duration: 100
    # ... up to rate: 9

data:
  type: shared_prefix
  shared_prefix:
    num_groups: 150
    num_prompts_per_group: 5
    system_prompt_len: 6000
    question_distribution:
      min: 1
      max: 10000
      mean: 1200
      std_dev: 360
    output_distribution:
      min: 1
      max: 10000
      mean: 1000
      std_dev: 300
    enable_multi_turn_chat: true

Regression window

The following commits between fc877d3 and c2cb68b (current HEAD) may have introduced the issue:

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind/bugCategorizes issue or PR as related to a bug.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions