Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing
The primary APST paper introducing a depth-oriented framework for measuring LLM safety and reliability under repeated inference.
Measure hidden AI failure risk on your real production traffic. Discover where failures cluster. Validate reductions before deployment.
Built on Accelerated Prompt Stress Testing (APST) research, SafeFlow stress-tests agent workflows under retries, tool calls, and stochastic execution paths — locating failure hotspots, optimizing the prompts and policies driving them, and proving the reduction in empirical failure probability before deployment.
Based on APST research introduced by Keita Broadwater, available on arXiv and presented at CCAI 2026.
SafeFlow operationalizes AI reliability as an engineering loop — not a one-off benchmark. Each stage feeds the next, turning stochastic LLM behavior into a managed risk surface.
Quantify how often LLMs and agents fail under repeated inference across real prompt distributions.
APST-G graph-guided exploration surfaces semantically related prompt regions where failures concentrate.
APST-PO evaluates candidate system prompts and workflow policies that reduce risk while preserving utility.
Re-run reliability tests to confirm reduced failure probability and document the improvement.
SafeFlow is a closed reliability loop grounded in your real prompt distribution. Every stage feeds the next, and every optimization is validated against the same traffic distribution it was tuned on.
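The Measure stage can be pictured with a minimal sketch: sample the same prompt many times, judge each response, and report the failure fraction with an uncertainty interval. The function names, the Wilson interval choice, and the 6-in-200 outcome counts below are illustrative assumptions, not SafeFlow's or the Starter Kit's actual API.

```python
import math

def empirical_p_fail(outcomes: list[bool]) -> float:
    """Fraction of repeated-inference samples judged as failures (True = failure)."""
    if not outcomes:
        raise ValueError("need at least one judged sample")
    return sum(outcomes) / len(outcomes)

def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = failures / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical judged outcomes: 200 samples of one prompt, 6 judged as failures.
outcomes = [True] * 6 + [False] * 194
p_hat = empirical_p_fail(outcomes)
low, high = wilson_interval(sum(outcomes), len(outcomes))
print(f"p_fail={p_hat:.4f}, approx. 95% interval=({low:.4f}, {high:.4f})")
```

An interval is reported alongside the point estimate because a few hundred samples per prompt leave real uncertainty around small failure rates.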
Production LLMs are queried repeatedly, retried, sampled at nonzero temperature, and embedded into agent workflows. Failures cluster in regions that shallow benchmarks never reach, and a 0.5% failure rate works out to roughly 5,000 expected incidents per million inferences.
Risky prompts cluster in semantic neighborhoods that standard evals don't sample densely enough to expose.
Models that look comparable on one-shot benchmarks diverge sharply when stress-tested at deployment depth.
Low per-call failure rates translate into recurring operational, safety, and trust incidents in production.
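To make the scale effect concrete, here is a small sketch of the arithmetic, assuming each inference fails independently with the same probability. That independence assumption is ours for illustration (clustered failures would violate it), and the function names are hypothetical.

```python
import math

def expected_incidents(p_fail: float, inferences: int) -> float:
    """Expected number of failures across a given inference volume."""
    return p_fail * inferences

def prob_at_least_one_failure(p_fail: float, inferences: int) -> float:
    """P(at least one failure) = 1 - (1 - p)^N, assuming independent calls."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if p_fail == 1.0:
        return 1.0
    # expm1/log1p keep the result accurate when p_fail is very small.
    return -math.expm1(inferences * math.log1p(-p_fail))

P_FAIL = 0.005  # the 0.5% per-call failure rate used as the example above
for volume in (10_000, 100_000, 1_000_000):
    print(volume, expected_incidents(P_FAIL, volume))
```

Under these assumptions a 0.5% rate corresponds to about 50, 500, and 5,000 expected incidents at 10K, 100K, and 1M inferences.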
SafeFlow combines measurement, discovery, optimization, and validation in a single workflow built for teams deploying LLMs and agents to production.
Estimate empirical failure probability on real production traffic — not synthetic benchmarks decoupled from your users.
Stress-test agents under retries, tool calls, refusals, and stochastic execution paths — exposing failures that single-shot evals never trigger.
APST-G expands semantic neighborhoods around risky prompts to surface clustered failure regions across your traffic.
APST-PO hardens system prompts, tool policies, refusal behavior, routing prompts, and orchestration constraints against measured risk.
Translate empirical p_fail into projected incidents per 10K / 100K / 1M inferences across your real workload.
Re-run APST against the same distribution to prove that an optimization actually lowered empirical failure probability.
Combine sampled production prompts with curated safety benchmarks for evaluations grounded in how your system is actually used.
Bilingual reports for governance, risk, and security reviews — operational, not academic.
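As a rough illustration of the validation step, the sketch below compares baseline and optimized failure counts with a pooled two-proportion z-test. The choice of test is our assumption, not a documented SafeFlow method, and the counts are hypothetical values picked to match failure rates of 0.0312 and 0.0058.

```python
import math

def relative_reduction(p_base: float, p_opt: float) -> float:
    """Fractional drop in failure probability relative to the baseline."""
    if p_base <= 0:
        raise ValueError("baseline failure probability must be positive")
    return (p_base - p_opt) / p_base

def two_proportion_z(f_base: int, n_base: int, f_opt: int, n_opt: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, one-sided p-value).

    The one-sided alternative is that the optimized failure rate is lower.
    """
    p1, p2 = f_base / n_base, f_opt / n_opt
    pooled = (f_base + f_opt) / (n_base + n_opt)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_opt))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p1 - p2) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return z, p_value

# Hypothetical counts: 312 / 10,000 baseline failures vs 58 / 10,000 optimized.
z, p_value = two_proportion_z(312, 10_000, 58, 10_000)
reduction = relative_reduction(312 / 10_000, 58 / 10_000)
print(f"reduction={reduction:.1%}, z={z:.2f}, one-sided p={p_value:.3g}")
```

Running both arms against the same prompt distribution keeps the comparison about the optimization rather than about a shift in traffic.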
Agents fail differently than single-shot prompts. Compounding stochasticity across tool calls, retries, and routing decisions creates failure modes that one-shot evals never expose. SafeFlow stress-tests the entire workflow on your real production traffic and reports where reliability breaks.
Stress-test plan → act → observe loops, branching, and recovery behavior under stochastic execution.
Inject realistic tool failures, latency, and malformed responses to surface brittle tool policies.
Measure how often refusal logic and guardrails drift, over-trigger, or get bypassed under depth.
Evaluate router prompts and sub-agent handoffs as first-class reliability surfaces, not glue code.
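One way to see why multi-step workflows fail differently from single calls is a toy model, sketched below under strong assumptions: every step fails independently with the same probability, and each retry is an independent attempt. Real agent failures are often correlated, so treat this as intuition only; the function names are hypothetical.

```python
import random

def workflow_failure_closed_form(p_step: float, steps: int, retries: int = 0) -> float:
    """P(workflow fails) when a step fails only if all of its attempts fail."""
    step_fail = p_step ** (retries + 1)
    return 1 - (1 - step_fail) ** steps

def simulate_workflow(p_step: float, steps: int, retries: int, runs: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, useful as a sanity check."""
    rng = random.Random(seed)
    failed_runs = 0
    for _ in range(runs):
        for _ in range(steps):
            # all() short-circuits as soon as one attempt succeeds.
            if all(rng.random() < p_step for _ in range(retries + 1)):
                failed_runs += 1
                break  # one unrecovered step fails the whole run
    return failed_runs / runs

exact = workflow_failure_closed_form(0.01, steps=10)
estimate = simulate_workflow(0.01, steps=10, retries=0, runs=20_000)
print(f"closed form={exact:.4f}, simulated={estimate:.4f}")
```

In this model a 1% per-step failure rate grows to roughly 9.6% across a ten-step workflow without retries, and a single independent retry per step brings it down to about 0.1%.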
SafeFlow uses graph-guided prompt exploration to identify semantically related prompt regions where failures concentrate. Instead of sampling uniformly, APST-G expands from known failures into their semantic neighborhoods — surfacing clustered risk that standard evaluations miss.
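The idea of expanding from known failures into their semantic neighborhoods can be sketched with a similarity graph and a bounded breadth-first search. This is not the APST-G algorithm itself, whose details are not given here; the toy 2-D embeddings, the cosine threshold, and all names are assumptions for illustration.

```python
import math
from collections import deque

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_similarity_graph(embeddings: dict[str, list[float]], threshold: float) -> dict[str, set[str]]:
    """Connect two prompts when their embeddings are at least `threshold` similar."""
    graph: dict[str, set[str]] = {pid: set() for pid in embeddings}
    ids = list(embeddings)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def expand_from_failures(graph: dict[str, set[str]], seeds: set[str], max_hops: int) -> set[str]:
    """Return prompts within `max_hops` edges of any known failure, excluding the seeds."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen - set(seeds)

# Toy embeddings (hypothetical): p1 is a known failure, p4 is unrelated.
embeddings = {"p1": [1.0, 0.0], "p2": [0.95, 0.1], "p3": [0.85, 0.35], "p4": [0.0, 1.0]}
graph = build_similarity_graph(embeddings, threshold=0.95)
candidates = expand_from_failures(graph, {"p1"}, max_hops=2)
print(sorted(candidates))
```

The prompts returned by the expansion are candidates for denser sampling, which is where the extra stress-testing budget would go instead of uniform sampling.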
APST-PO doesn't just rewrite prose — it hardens system prompts, tool policies, refusal behavior, routing prompts, and orchestration constraints against measured failure probability. Every candidate is stress-tested under repeated inference and ranked by validated risk reduction.
Baseline system prompt: You are a helpful assistant. Answer the user's question.
Optimized system prompt: You are a careful assistant. Refuse unsafe requests and PII queries. Cite sources for factual claims. If uncertain, ask one clarifying question.
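Ranking candidates by measured risk while preserving utility might look like the sketch below. The data class, the utility score, the 95% utility floor, and every number in it are hypothetical stand-ins; how APST-PO actually scores and selects candidates is not specified here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateResult:
    name: str
    failures: int   # judged failures under repeated inference
    trials: int     # total sampled responses
    utility: float  # hypothetical task-quality score in [0, 1]

    @property
    def p_fail(self) -> float:
        return self.failures / self.trials

def rank_candidates(baseline: CandidateResult, candidates: list[CandidateResult],
                    min_utility_ratio: float = 0.95) -> list[CandidateResult]:
    """Keep candidates that lower p_fail without dropping utility too far, best first."""
    kept = [
        c for c in candidates
        if c.p_fail < baseline.p_fail and c.utility >= min_utility_ratio * baseline.utility
    ]
    return sorted(kept, key=lambda c: c.p_fail)

# Hypothetical measurements for a baseline prompt and three candidate prompts.
baseline = CandidateResult("baseline", failures=312, trials=10_000, utility=0.90)
candidates = [
    CandidateResult("careful-v1", failures=58, trials=10_000, utility=0.89),
    CandidateResult("refuse-everything", failures=5, trials=10_000, utility=0.40),
    CandidateResult("careful-v2", failures=120, trials=10_000, utility=0.91),
]
ranked = rank_candidates(baseline, candidates)
print([c.name for c in ranked])
```

The utility floor is what stops an over-refusing prompt from winning on failure rate alone.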
A preview of what SafeFlow delivers: baseline vs. optimized failure probability, measured reduction, discovered hotspots, unsafe prompt categories, mitigation actions, and a recommended optimized system prompt — ready for deployment review.
The open-source Starter Kit is a local-first tool for researchers, practitioners, and AI teams who want to experiment with repeated-inference measurement on their own machines.
$ pip install apst-starter-kit
$ apst demo --lang both
▸ loading mock prompt set (50 prompts)
▸ sampling (n=200, T=0.7)
▸ judging (rule-based + llm)
▸ estimating p_fail ✓
▸ discovering hotspots (apst-g) ✓
▸ optimizing system prompt (apst-po) ✓
▸ validating reduction ✓
baseline p_fail 0.0312
optimized p_fail 0.0058
reduction −81.4%
bilingual report ./safeflow-report.md
A focused engagement for teams deploying LLMs, agents, or AI assistants that need to measure, reduce, and validate operational failure risk before production.
SafeFlow runs the full Measure → Discover → Optimize → Validate loop on your prompts, workflows, and deployment assumptions. You receive an executive-ready report with empirical failure probabilities, discovered hotspots, optimized prompts, validated reductions, and projected operational exposure.
APST converts stochastic LLM behavior into measurable, projectable risk metrics. APST-G and APST-PO extend it into discovery and optimization.
SafeFlow productizes the Accelerated Prompt Stress Testing research line introduced by Keita Broadwater — extended with graph-guided discovery (APST-G) and prompt optimization (APST-PO) for operational use.
Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing: the primary APST paper, introducing a depth-oriented framework for measuring LLM safety and reliability under repeated inference.
A CCAI 2026 conference paper presenting APST through the lens of reliability gaps in LLM safety evaluation.
Start free with the Starter Kit, run a focused optimization pilot, or scale SafeFlow across your organization.
A focused Measure → Discover → Optimize → Validate engagement on one high-value LLM workflow, with empirical failure probability, validated reduction, and an executive-ready report.
Multi-model, multi-workflow APST evaluation with bilingual reporting, governance support, prompt optimization, monitoring recommendations, and executive briefings.
SafeFlow does not guarantee safety. Engagements measure, discover, optimize, validate, and project LLM reliability to support deployment decisions.
SafeFlow supports English, Chinese, and bilingual reporting. The Starter Kit runs locally using OpenAI-compatible APIs, Ollama, vLLM, or local model servers — practical for teams in the United States, China, and international enterprise environments.
SafeFlow supports English, Chinese, and bilingual reliability reports.
Run a Reliability Assessment, stress-test your agent workflow, or try the Starter Kit locally.
SafeFlow does not guarantee safety. We measure, discover, optimize, validate, and recommend — giving your team the data to make deployment decisions.