Reliability optimization for LLMs and agent workflows

Reduce operational AI failures before deployment.

Measure hidden AI failure risk on your real production traffic. Discover where failures cluster. Validate reductions before deployment.

Built on Accelerated Prompt Stress Testing (APST) research, SafeFlow stress-tests agent workflows under retries, tool calls, and stochastic execution paths — locating failure hotspots, optimizing the prompts and policies driving them, and proving the reduction in empirical failure probability before deployment.

Local-first: prompts, traffic, and outputs stay in your environment.

Based on APST research introduced by Keita Broadwater, available on arXiv and presented at CCAI 2026.

Reliability Optimization (validated)
  • Baseline p_fail: 0.0312 (before optimization)
  • Optimized p_fail: 0.0058 (after APST-PO)
  • Failure-rate reduction: −81.4%
  • Hotspots discovered: 7
  • Prompts explored: 3.2K
  • Validated: before/after
The reliability loop

Measure. Discover. Optimize. Validate.

SafeFlow operationalizes AI reliability as an engineering loop — not a one-off benchmark. Each stage feeds the next, turning stochastic LLM behavior into a managed risk surface.

01
Measure
Estimate empirical failure probability

Quantify how often LLMs and agents fail under repeated inference across real prompt distributions.

02
Discover
Find hidden failure hotspots

APST-G graph-guided exploration surfaces semantically related prompt regions where failures concentrate.

03
Optimize
Generate safer system prompts

APST-PO evaluates candidate system prompts and workflow policies that reduce risk while preserving utility.

04
Validate
Prove before / after reduction

Re-run reliability tests to confirm reduced failure probability and document the improvement.

System architecture

From production traffic to validated reduction.

SafeFlow is a closed reliability loop grounded in your real prompt distribution. Every stage feeds the next, and every optimization is validated against the same prompt distribution it was optimized on.

  • stage 01 · Production prompt distribution · your real traffic, sampled
  • stage 02 · Reliability profiling · empirical p_fail under repeated inference
  • stage 03 · Hotspot discovery · APST-G graph-guided exploration
  • stage 04 · Prompt & policy optimization · APST-PO system prompts, tools, refusals, routing
  • stage 05 · Validated reduction · before / after Δp_fail proven on the same distribution
  • stage 06 · Deployment monitoring · re-validation cadence on live traffic
Closed loop: monitoring feeds new traffic back into reliability profiling. Runs locally against your environment.
The problem

Stochastic failures don't show up in one-shot evaluations.

Production LLMs are queried repeatedly, retried, sampled at temperature, and embedded into agent workflows. Failures cluster in regions that shallow benchmarks never reach, and a 0.5% failure rate works out to about 5,000 expected failures per 1M inferences.

Hidden failure regions

Risky prompts cluster in semantic neighborhoods that standard evals don't sample densely enough to expose.

Rankings shift under depth

Models that look comparable on one-shot benchmarks diverge sharply when stress-tested at deployment depth.

Rare failures compound

Low per-call failure rates translate into recurring operational, safety, and trust incidents in production.

Platform capabilities

A reliability engineering platform — not a benchmark.

SafeFlow combines measurement, discovery, optimization, and validation in a single workflow built for teams deploying LLMs and agents to production.

Reliability profiling on your production prompt distribution

Estimate empirical failure probability on real production traffic — not synthetic benchmarks decoupled from your users.

Agent workflow stress testing

Stress-test agents under retries, tool calls, refusals, and stochastic execution paths — exposing failures that single-shot evals never trigger.

Graph-guided hotspot discovery

APST-G expands semantic neighborhoods around risky prompts to surface clustered failure regions across your traffic.

System prompt, tool, and policy optimization

APST-PO hardens system prompts, tool policies, refusal behavior, routing prompts, and orchestration constraints against measured risk.

Deployment-scale risk projection

Translate empirical p_fail into projected incidents per 10K / 100K / 1M inferences across your real workload.

Before / after reduction validation

Re-run APST against the same distribution to prove that an optimization actually lowered empirical failure probability.

Production traffic + benchmark blending

Combine sampled production prompts with curated safety benchmarks for evaluations grounded in how your system is actually used.

Executive-ready reliability reports

Bilingual reports for governance, risk, and security reviews — operational, not academic.

Agent workflow reliability

Stress-test agent workflows under retries, tool calls, and stochastic execution paths.

Agents fail differently than single-shot prompts. Compounding stochasticity across tool calls, retries, and routing decisions creates failure modes that one-shot evals never expose. SafeFlow stress-tests the entire workflow on your real production traffic and reports where reliability breaks.
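As a rough illustration of how per-step failures compound, the sketch below assumes every step of a workflow fails independently with the same probability. Real agent steps are usually correlated, so this is a back-of-the-envelope estimate rather than SafeFlow's model; the function name and the step counts are illustrative.

```python
def workflow_failure_probability(p_step: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps fails.

    Assumption: steps fail independently with identical probability p_step.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # The workflow succeeds only if every step succeeds.
    return 1.0 - (1.0 - p_step) ** steps


if __name__ == "__main__":
    # 0.5% per-step failure rate, as in the example above.
    for steps in (1, 5, 10, 20):
        p = workflow_failure_probability(0.005, steps)
        print(f"{steps:>2} steps -> workflow p_fail ~ {p:.4f}")
```

Under these assumptions, a per-step rate that looks negligible grows roughly linearly with the number of tool calls, retries, and routing decisions in the path.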

Multi-step orchestration

Stress-test plan → act → observe loops, branching, and recovery behavior under stochastic execution.

Tool calls & APIs

Inject realistic tool failures, latency, and malformed responses to surface brittle tool policies.
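A minimal sketch of tool fault injection, assuming tools are plain Python callables. The wrapper, the `ToolFault` exception, and the `lookup_order` stand-in are hypothetical names for illustration, not part of SafeFlow or the Starter Kit; latency injection is left out to keep the example short.

```python
import random
from typing import Any, Callable


class ToolFault(Exception):
    """Raised when the wrapper simulates a tool failure."""


def with_injected_faults(
    tool: Callable[..., Any],
    failure_rate: float,
    malformed_rate: float,
    rng: random.Random,
) -> Callable[..., Any]:
    """Wrap `tool` so some calls raise and some return a malformed payload."""

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        roll = rng.random()  # uniform in [0, 1)
        if roll < failure_rate:
            raise ToolFault("simulated tool outage")
        if roll < failure_rate + malformed_rate:
            return {"unexpected": None}  # payload missing the expected keys
        return tool(*args, **kwargs)

    return wrapped


def lookup_order(order_id: str) -> dict:
    """Stand-in for a real tool call (hypothetical)."""
    return {"order_id": order_id, "status": "shipped"}


if __name__ == "__main__":
    flaky = with_injected_faults(lookup_order, 0.1, 0.1, random.Random(0))
    outcomes = {"ok": 0, "malformed": 0, "error": 0}
    for i in range(1000):
        try:
            result = flaky(str(i))
        except ToolFault:
            outcomes["error"] += 1
        else:
            outcomes["ok" if "status" in result else "malformed"] += 1
    print(outcomes)
```

Passing a seeded `random.Random` keeps a stress run reproducible, so a brittle tool policy can be re-tested against the same fault sequence after a fix.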

Refusals & guardrails

Measure how often refusal logic and guardrails drift, over-trigger, or get bypassed under depth.

Routing & sub-agents

Evaluate router prompts and sub-agent handoffs as first-class reliability surfaces, not glue code.

APST-G · Graph-guided discovery

Discover hidden failure hotspots.

SafeFlow uses graph-guided prompt exploration to identify semantically related prompt regions where failures concentrate. Instead of sampling uniformly, APST-G expands from known failures into their semantic neighborhoods — surfacing clustered risk that standard evaluations miss.

  • Prompts represented as graph neighborhoods
  • Expansion guided by judge feedback and embedding distance
  • Risky regions cluster into named hotspots with example prompts
  • Discovery budget bounded — converges on the most informative samples
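The bullets above can be pictured with a small best-first search. This is an illustrative sketch only, not the APST-G algorithm: it assumes prompts have precomputed embeddings (toy 2-D vectors here), that a judge returns a pass/fail verdict, and that only neighbors of failing prompts are queued. All names and data are hypothetical.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]


def nearest_neighbors(
    embeddings: Dict[str, Vector], pid: str, k: int
) -> List[Tuple[float, str]]:
    """Return the k closest prompts to `pid` as (distance, prompt_id) pairs."""
    dists = [
        (math.dist(embeddings[pid], vec), other)
        for other, vec in embeddings.items()
        if other != pid
    ]
    return heapq.nsmallest(k, dists)


def explore(
    embeddings: Dict[str, Vector],
    seeds: Sequence[str],
    is_failure: Callable[[str], bool],
    k: int = 2,
    budget: int = 10,
) -> Tuple[Set[str], Set[str]]:
    """Best-first expansion from seeds; returns (failures, judged)."""
    frontier = [(0.0, s) for s in seeds]
    heapq.heapify(frontier)
    judged: Set[str] = set()
    failures: Set[str] = set()
    while frontier and len(judged) < budget:  # budget caps judge calls
        _, pid = heapq.heappop(frontier)
        if pid in judged:
            continue
        judged.add(pid)
        if is_failure(pid):
            failures.add(pid)
            # Judge feedback: expand only around prompts that failed.
            for dist, other in nearest_neighbors(embeddings, pid, k):
                if other not in judged:
                    heapq.heappush(frontier, (dist, other))
    return failures, judged


# Toy data: one risky cluster near the origin, one safe cluster far away.
EMBEDDINGS = {
    "a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.2, 0.1),
    "x": (5.0, 5.0), "y": (5.1, 5.0), "z": (5.2, 5.1),
}
FAILING = {"a", "b", "c"}

if __name__ == "__main__":
    found, judged = explore(EMBEDDINGS, ["a", "x"], lambda p: p in FAILING)
    print("failures:", sorted(found), "judged:", sorted(judged))
```

In this toy run the safe seed is judged once and never expanded, so the budget is spent inside the failing neighborhood instead of being spread uniformly.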
[Figure: APST-G prompt graph. Nodes are marked safe, risky, or failure; labeled hotspots: pii-leakage, instruction-drift.]
  • Seed prompts: 312
  • Neighbors explored: 3,184
  • Hotspots found: 7
APST-PO · Workflow hardening

Optimize the prompts and policies driving production risk.

APST-PO doesn't just rewrite prose — it hardens system prompts, tool policies, refusal behavior, routing prompts, and orchestration constraints against measured failure probability. Every candidate is stress-tested under repeated inference and ranked by validated risk reduction.

System prompts
Tool policies
Refusal behavior
Routing prompts
Orchestration constraints
baseline · system prompt v0 (high risk)
You are a helpful assistant.
Answer the user's question.
  • p_fail: 0.0312 (empirical)
  • Failures per 100K inferences: 3,120 (projected)
  • Utility: 0.84 (task score)
optimized · system prompt v7 (−81% p_fail)
You are a careful assistant.
Refuse unsafe requests and PII queries.
Cite sources for factual claims.
If uncertain, ask one clarifying question.
  • p_fail: 0.0058 (empirical)
  • Failures per 100K inferences: 580 (projected)
  • Utility: 0.86 (task score)
Validation re-runs APST against the same prompt distribution to confirm the reduction in empirical failure probability before any change reaches production.
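One way to picture "reduce risk while preserving utility" is as a constrained selection. The sketch below is an assumption about how such a ranking could work, not APST-PO itself. The v0 and v7 figures come from the example on this page; candidate v3 and its numbers are invented purely to show a candidate being rejected for losing utility.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    p_fail: float   # empirical failure probability under repeated inference
    utility: float  # task score on the same prompt distribution


def select_candidate(
    candidates: List[Candidate],
    baseline: Candidate,
    max_utility_drop: float = 0.0,
) -> Optional[Candidate]:
    """Lowest-p_fail candidate that beats the baseline and keeps utility."""
    eligible = [
        c
        for c in candidates
        if c.p_fail < baseline.p_fail
        and c.utility >= baseline.utility - max_utility_drop
    ]
    return min(eligible, key=lambda c: c.p_fail, default=None)


BASELINE = Candidate("v0", p_fail=0.0312, utility=0.84)
CANDIDATES = [
    Candidate("v3", p_fail=0.0040, utility=0.70),  # hypothetical: safer but less useful
    Candidate("v7", p_fail=0.0058, utility=0.86),
]

if __name__ == "__main__":
    best = select_candidate(CANDIDATES, BASELINE)
    print(best.name if best else "no candidate beats the baseline")
```

The utility floor is the design choice that matters here: without it, a prompt that refuses everything would always rank first.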
Sample deliverable

Sample Reliability Optimization Report

A preview of what SafeFlow delivers: baseline vs. optimized failure probability, measured reduction, discovered hotspots, unsafe prompt categories, mitigation actions, and a recommended optimized system prompt — ready for deployment review.

safeflow-report · preview
Customer-assistant workflow · n=2,000 samples · v1.0
  • Baseline p_fail: 0.0312 (before)
  • Optimized p_fail: 0.0058 (after)
  • Reduction: −81% (validated)
  • Failures avoided per 100K inferences: 2,540 (projected)
Top unsafe categories
  • Unsafe completion: 38%
  • Hallucinated citation: 27%
  • Instruction drift: 19%
  • PII leakage: 9%
  • Tool-call error: 7%
Discovered hotspot clusters
  • pii-leakage · email parsing: 0.071 (high)
  • instruction-drift · multi-turn: 0.044 (high)
  • citation · medical queries: 0.029 (med)
Recommended mitigation actions
  • Adopt optimized system prompt v7 (validated −81% p_fail)
  • Add rule-based PII pre-filter for email-parsing workflow
  • Lower temperature to 0.3 for citation-heavy medical queries
  • Re-validate weekly against sampled production traffic
Recommended system prompt
You are a careful assistant.
Refuse unsafe requests and PII queries.
Cite sources for factual claims.
If uncertain, ask one clarifying question.
Local-first: prompts and outputs stay in your environment.
APST Starter Kit

Try the reliability loop locally.

The open-source Starter Kit is a local-first tool for researchers, practitioners, and AI teams who want to experiment with repeated-inference measurement on their own machines.

  • Run a mock demo without an API key
  • Connect OpenAI, Together.ai, Ollama, vLLM, or any OpenAI-compatible endpoint
  • Keep prompts and outputs in your own environment
  • Generate English, Chinese, or bilingual reports
  • Available on GitHub and PyPI
terminal
$ pip install apst-starter-kit
$ apst demo --lang both

▸ loading mock prompt set      (50 prompts)
▸ sampling                     (n=200, T=0.7)
▸ judging                      (rule-based + llm)
▸ estimating p_fail            ✓
▸ discovering hotspots         (apst-g)  ✓
▸ optimizing system prompt     (apst-po) ✓
▸ validating reduction         ✓

  baseline p_fail     0.0312
  optimized p_fail    0.0058
  reduction           −81.4%
  bilingual report    ./safeflow-report.md
Enterprise offering

Reliability Optimization Audit

A focused engagement for teams deploying LLMs, agents, or AI assistants that need to measure, reduce, and validate operational failure risk before production.

SafeFlow runs the full Measure → Discover → Optimize → Validate loop on your prompts, workflows, and deployment assumptions. You receive an executive-ready report with empirical failure probabilities, discovered hotspots, optimized prompts, validated reductions, and projected operational exposure.

Pilot audits typically run 1–2 weeks and focus on one high-value LLM workflow.
Deliverables
  1. Custom prompt-risk taxonomy aligned to your workflow
  2. APST measurement across selected models and configurations
  3. APST-G hotspot discovery with named, scoped risk clusters
  4. APST-PO candidate system prompts and policies
  5. Validated before/after failure probability reduction
  6. Projected operational exposure per 10K / 100K / 1M inferences
  7. Executive report in English, Chinese, or bilingual format
  8. Monitoring and re-validation cadence recommendations
Method

The math behind the reliability loop.

APST converts stochastic LLM behavior into measurable, projectable risk metrics. APST-G and APST-PO extend it into discovery and optimization.

  1. Profile reliability on your prompt distribution
  2. Discover hotspots with APST-G graph-guided exploration
  3. Generate candidate prompts and policies with APST-PO
  4. Stress-test candidates under repeated inference
  5. Validate measured reduction in failure probability
  6. Recommend deployment thresholds and monitoring
Core formulas
p_fail = failures / valid judged generations
Empirical failure probability
Δp = p_fail_baseline − p_fail_optimized
Validated reduction
Expected failures = p_fail × n
Deployment-scale projection
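The three formulas above translate directly into code. This is a minimal sketch, assuming each judged generation is recorded as True (failure), False (pass), or None (the judge could not produce a valid verdict, so the generation is excluded from the denominator). The function names and that encoding are illustrative, not the Starter Kit's API.

```python
from typing import Iterable, Optional


def empirical_p_fail(judgments: Iterable[Optional[bool]]) -> float:
    """p_fail = failures / valid judged generations (None = invalid)."""
    valid = [j for j in judgments if j is not None]
    if not valid:
        raise ValueError("no valid judged generations")
    return sum(valid) / len(valid)


def validated_reduction(p_fail_baseline: float, p_fail_optimized: float) -> float:
    """Δp = p_fail_baseline − p_fail_optimized."""
    return p_fail_baseline - p_fail_optimized


def expected_failures(p_fail: float, n: int) -> float:
    """Expected failures = p_fail × n."""
    return p_fail * n


if __name__ == "__main__":
    # One failure out of four valid judgments; the None is excluded.
    print(empirical_p_fail([True, False, False, None, False]))

    # Figures from the sample report on this page.
    delta = validated_reduction(0.0312, 0.0058)
    print(f"delta p = {delta:.4f}")
    print(f"relative reduction = {delta / 0.0312:.1%}")
    for n in (10_000, 100_000, 1_000_000):
        print(n, round(expected_failures(0.0312, n)), round(expected_failures(0.0058, n)))
```

Dividing Δp by the baseline p_fail gives the relative reduction, which is how the 0.0312 to 0.0058 change corresponds to the −81.4% headline figure.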
Research foundation

Research-backed, deployment-focused.

SafeFlow productizes the Accelerated Prompt Stress Testing research line introduced by Keita Broadwater — extended with graph-guided discovery (APST-G) and prompt optimization (APST-PO) for operational use.

Paper

Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

The primary APST paper introducing a depth-oriented framework for measuring LLM safety and reliability under repeated inference.

Paper

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

A CCAI 2026 conference paper presenting APST through the lens of reliability gaps in LLM safety evaluation.

Who it's for

Built for teams shipping AI into production.

AI platform teams
Enterprise AI deployment teams
Agent workflow builders
AI governance and risk teams
Security and red-team teams
Financial and healthcare AI teams
Pricing

Engagements sized to your deployment.

Start free with the Starter Kit, run a focused optimization pilot, or scale SafeFlow across your organization.

Starter Kit
Free
open source

Local-first APST demo for researchers and practitioners.

View Starter Kit
Most requested
Reliability Optimization Pilot
$5K–$10K USD
typical 1–2 week pilot

A focused Measure → Discover → Optimize → Validate engagement on one high-value LLM workflow, with empirical failure probability, validated reduction, and an executive-ready report.

Request Pilot Audit
Enterprise Reliability Program
Custom
tailored scope

Multi-model, multi-workflow APST evaluation with bilingual reporting, governance support, prompt optimization, monitoring recommendations, and executive briefings.

Discuss Enterprise Scope

SafeFlow does not guarantee safety. Engagements measure, discover, optimize, validate, and project LLM reliability to support deployment decisions.

International readiness

Designed for international AI teams.

SafeFlow supports English, Chinese, and bilingual reporting. The Starter Kit runs locally using OpenAI-compatible APIs, Ollama, vLLM, or local model servers — practical for teams in the United States, China, and international enterprise environments.

SafeFlow supports English and Chinese reliability reports.

SafeFlow 支持英文、中文和双语可靠性报告。

Contact

Ready to reduce hidden AI failure risk?

Run a Reliability Assessment, stress-test your agent workflow, or try the Starter Kit locally.

SafeFlow does not guarantee safety. We measure, discover, optimize, validate, and recommend — giving your team the data to make deployment decisions.

We respond within 1–2 business days.