Agent-ready playbooks for LLM serving benchmarks, torch-profiler triage, SGLang optimization, production incidents, and model PR intelligence.
This repository is built for AI infrastructure engineers who want agents to do real work, not recite generic prompts.
It gives an agent the operational memory needed to benchmark SGLang, vLLM, and TensorRT-LLM fairly; split prefill and decode profiler evidence; turn traces into kernel and fusion opportunities; triage SGLang production incidents from a replay; and keep model-family optimization history close to the code that actually changed.
If this saves you one stale model-support assumption, one misleading profiler trace, or one late-night benchmark loop, a star helps more AI-infra engineers find it.
| Signal | What makes it useful |
|---|---|
| 7 core operational skills | Small, focused playbooks for benchmark search, profiler analysis, SOTA loops, incidents, architecture diagrams, and H100 runs. |
| 58 model optimization runbooks | SGLang and vLLM model-family skills for DeepSeek, Qwen, GLM, Kimi, MiniMax, Llama, Mistral, Nemotron, and more. |
| 58 PR history dossiers | Diff-backed model evolution notes that record what changed, where it changed, and what risks remain. |
| Stage-separated profiler workflow | Prefill and decode are profiled as separate workloads so hot kernels do not get misattributed. |
| Framework-neutral benchmark schema | Compare SGLang, vLLM, and TensorRT-LLM with the same workload, SLA, artifact layout, and result table. |
| Profiler-to-action fusion catalog | Connect torch-profiler rows to known SGLang/vLLM fusion, overlap, and torch.compile patterns. |
| Replay-first incident triage | Preserve evidence, reproduce the request path, and choose the next debug tool before patching. |
| Goal | Start here |
|---|---|
| Search the best serving command across frameworks | llm-serving-auto-benchmark |
| Explain a torch-profiler trace with kernel, overlap, and fusion tables | llm-torch-profiler-analysis |
| Drive a full SGLang performance loop against vLLM/TensorRT-LLM | sglang-sota-performance |
| Debug a live or recent SGLang serving incident from evidence | sglang-prod-incident-triage |
| Find original public model architecture diagrams | model-architecture-diagram |
| Reuse model-family optimization knowledge | skills/model-optimization |
| Read model PR evolution by framework | model-pr-optimization-history |
| Skill | Use it when |
|---|---|
| llm-serving-auto-benchmark | You need a fair, bounded serving benchmark search for SGLang, vLLM, TensorRT-LLM, or another OpenAI-compatible stack. |
| llm-torch-profiler-analysis | You need a three-table profiler report that keeps extend/prefill and decode evidence separate (see the capture sketch after this table). |
| sglang-sota-performance | You want SGLang to match or beat the best observed framework result for a specific model and workload. |
| sglang-prod-incident-triage | You need to turn queue growth, timeouts, wrong outputs, crashes, or distributed stalls into a replay and a next debug step. |
| model-architecture-diagram | You need original public architecture diagrams for popular LLM, VLM, MoE, OCR, and diffusion model families. |
| h100 | You need an H100 operator runbook for SGLang validation in the configured remote environment. |
| h100-sglang-diffusion | You need the H100 workflow with diffusion-specific paths and validation expectations. |
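A minimal sketch of the stage-separated capture that llm-torch-profiler-analysis expects, assuming a local SGLang server launched with SGLANG_TORCH_PROFILER_DIR pointing at a writable directory; the port, prompt counts, and sequence lengths below are illustrative, not repo defaults:

```sh
# Prefill-dominated trace: long inputs, one output token per request.
curl -X POST http://localhost:30000/start_profile
python -m sglang.bench_serving --backend sglang --dataset-name random \
  --num-prompts 64 --random-input-len 4096 --random-output-len 1
curl -X POST http://localhost:30000/stop_profile

# Decode-dominated trace: short inputs, long generations.
curl -X POST http://localhost:30000/start_profile
python -m sglang.bench_serving --backend sglang --dataset-name random \
  --num-prompts 64 --random-input-len 64 --random-output-len 1024
curl -X POST http://localhost:30000/stop_profile
```

Keeping the two captures in separate trace files is what lets the kernel, overlap, and fusion tables stay attributed to the right stage.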
The model optimization layer is intentionally larger than the core skill set. Core skills teach an agent how to work; model runbooks teach it what has already happened for each model family.
| Framework | Runbooks | PR histories |
|---|---|---|
| SGLang | 29 | 29 |
| vLLM | 29 | 29 |
Covered families include:
DeepSeek V3/R1/V3.1/V3.2/V4, Qwen3, Qwen3-Coder, Qwen3-Next,
Qwen3.5/Qwen3.6, Qwen VLM/Omni/ASR, GLM 4.5/4.6/4.7/5,
Kimi, MiniMax, Llama 4, Mistral Small 4, Mixtral, Nemotron,
Gemma, Ernie 4.5, Intern-S1, InternVL, Hunyuan, MOSS-VL,
GPT-OSS, Step 3.5, Mimo, and model-specific MoE/quantization paths.
Each model-family history is designed to answer practical questions (a hypothetical entry shape is sketched after this list):
- Which PRs changed this model path?
- Was the PR merged, closed, or still open?
- Which files and symbols moved?
- What optimization or correctness risk should be checked before touching it?
- Which upstream idea should be compared before writing a new kernel or fusion?
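A purely hypothetical entry shape that answers those questions; the authoritative format is the shared standard under skills/model-optimization/model-pr-diff-dossier, and every value below is a placeholder:

```sh
# Illustrative only; the PR number, paths, and symbols are placeholders.
cat <<'EOF'
## PR #NNNNN — <short title>
- Status: merged | closed | open
- Files/symbols: python/sglang/srt/models/<family>.py :: forward()
- Risk: <optimization or correctness surface to re-check before touching it>
- Prior art: <upstream kernel or fusion idea to compare before writing a new one>
EOF
```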
The repo is opinionated about evidence because performance work gets noisy fast.
- Benchmark rows should include model, framework, GPU count, workload, request rate or concurrency, SLA status, launch command, benchmark command, and raw artifacts (see the row sketch after this list).
- Profiler reports should keep prefill and decode separate, then emit the same three tables: kernel table, overlap-opportunity table, and fuse-opportunity table.
- SOTA claims should be scoped to the exact model, hardware, framework commits, precision, workload, and SLA used in the run.
- Incident triage should start from replayable evidence instead of changing code from symptoms alone.
- Model optimization notes should point back to PRs, files, diffs, and risk surfaces rather than vague summary text.
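For the benchmark-row point, a minimal sketch of one complete row plus its raw-artifact pointer; every path, field name, and value here is a hypothetical example, not a schema the repo mandates:

```sh
mkdir -p runs/qwen3-32b/sglang/raw
cat > runs/qwen3-32b/sglang/row.json <<'EOF'
{
  "model": "Qwen3-32B",
  "framework": "sglang",
  "gpu_count": 4,
  "workload": "random 1024-in / 256-out, 500 prompts",
  "request_rate": 8.0,
  "sla_status": "pass",
  "launch_command": "python -m sglang.launch_server --model-path Qwen/Qwen3-32B --tp 4",
  "benchmark_command": "python -m sglang.bench_serving --backend sglang --num-prompts 500",
  "artifacts": "runs/qwen3-32b/sglang/raw/"
}
EOF
```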
Copy only the skills you want into your agent skill directory:
```sh
cp -r skills/llm-serving-auto-benchmark <agent-skill-dir>/llm-serving-auto-benchmark
cp -r skills/llm-torch-profiler-analysis <agent-skill-dir>/llm-torch-profiler-analysis
cp -r skills/sglang-sota-performance <agent-skill-dir>/sglang-sota-performance
cp -r skills/sglang-prod-incident-triage <agent-skill-dir>/sglang-prod-incident-triage
cp -r skills/model-architecture-diagram <agent-skill-dir>/model-architecture-diagram
```

Install a model-family skill when you are working on that exact family:
```sh
cp -r skills/model-optimization/sglang/sglang-qwen3-core-optimization <agent-skill-dir>/sglang-qwen3-core-optimization
cp -r skills/model-optimization/vllm/vllm-qwen3-core-optimization <agent-skill-dir>/vllm-qwen3-core-optimization
```

The H100 skills document a concrete operator environment. If you adapt them, replace the SSH alias, container name, and workspace paths in one pass, and keep secrets such as Hugging Face tokens out of the repository.
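A hedged sketch of that one-pass swap; the old values (h100-box, sglang-dev, /old/workspace) and their replacements are invented placeholders, so read the skill files first to find the real alias, container name, and paths:

```sh
# Find every file that mentions one of the environment-specific values,
# then rewrite all three in a single sed pass.
grep -rl -e 'h100-box' -e 'sglang-dev' -e '/old/workspace' \
    skills/h100 skills/h100-sglang-diffusion \
  | xargs sed -i \
      -e 's/h100-box/my-h100/g' \
      -e 's/sglang-dev/my-container/g' \
      -e 's#/old/workspace#/new/workspace#g'
```

Re-running the grep afterwards is a cheap check that nothing was missed.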
```
skills/
├── llm-serving-auto-benchmark/    # serving benchmark search and comparison
├── llm-torch-profiler-analysis/   # profiler capture and trace triage
├── sglang-sota-performance/       # end-to-end SGLang optimization loop
├── sglang-prod-incident-triage/   # replay-first serving incident workflow
├── model-architecture-diagram/    # public architecture diagram resolver
├── h100/                          # H100 operator runbook
├── h100-sglang-diffusion/         # H100 diffusion operator runbook
└── model-optimization/
    ├── model-pr-diff-dossier/     # shared PR dossier standard
    ├── sglang/                    # 29 SGLang model-family runbooks
    └── vllm/                      # 29 vLLM model-family runbooks

model-pr-optimization-history/
├── sglang/                        # 29 SGLang model-family histories
└── vllm/                          # 29 vLLM model-family histories
```