Reliability optimization for LLMs and agent workflows

Reduce operational AI failures before deployment.

Measure hidden AI failure risk on your real production traffic. Discover where failures cluster. Validate reductions before deployment.

Reduce your p_fail · See the architecture

Built on Accelerated Prompt Stress Testing (APST) research, SafeFlow stress-tests agent workflows under retries, tool calls, and stochastic execution paths — locating failure hotspots, optimizing the prompts and policies driving them, and proving the reduction in empirical failure probability before deployment.

Local-first: prompts, traffic, and outputs stay in your environment.

Based on APST research introduced by Keita Broadwater, available on arXiv and presented at CCAI 2026.

Reliability Optimization

validated

Baseline p_fail

0.0312

before optimization

Optimized p_fail

0.0058

after APST-PO

Failure-rate reduction: −81.4%

Hotspots

discovered

Prompts

3.2K

explored

Validated

✓

before/after

The reliability loop

Measure. Discover. Optimize. Validate.

SafeFlow operationalizes AI reliability as an engineering loop — not a one-off benchmark. Each stage feeds the next, turning stochastic LLM behavior into a managed risk surface.

Measure

Estimate empirical failure probability

Quantify how often LLMs and agents fail under repeated inference across real prompt distributions.

Discover

Find hidden failure hotspots

APST-G graph-guided exploration surfaces semantically related prompt regions where failures concentrate.

Optimize

Generate safer system prompts

APST-PO evaluates candidate system prompts and workflow policies that reduce risk while preserving utility.

Validate

Prove before / after reduction

Re-run reliability tests to confirm reduced failure probability and document the improvement.
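As a rough illustration of the Measure stage, the sketch below computes an empirical failure probability from judged generations and attaches a Wilson score interval. The interval is an addition for illustration, not something this page says SafeFlow reports, and the function names and the counts (312 failures in 10,000 valid judgements) are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def empirical_p_fail(judgements: Sequence[Optional[bool]]) -> Tuple[float, int]:
    """Return (p_fail, n_valid).

    True = judged failure, False = judged pass, None = invalid or unjudged
    generation, which is excluded from the denominator.
    """
    valid = [j for j in judgements if j is not None]
    if not valid:
        raise ValueError("no valid judged generations")
    return sum(valid) / len(valid), len(valid)


def wilson_interval(failures: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical run: 312 failures, 9,688 passes, 5 generations the judge rejected.
judgements = [True] * 312 + [False] * 9688 + [None] * 5
p_fail, n_valid = empirical_p_fail(judgements)
low, high = wilson_interval(312, n_valid)
print(f"p_fail={p_fail:.4f} over n={n_valid}, interval=({low:.4f}, {high:.4f})")
```

With these illustrative counts the point estimate is 0.0312 and the interval spans roughly 0.028 to 0.035, which shows how much uncertainty remains around a single headline number.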

System architecture

From production traffic to validated reduction.

SafeFlow is a closed reliability loop grounded in your real prompt distribution. Every stage feeds the next, and every optimization is validated against the same prompt distribution it was optimized on.

stage 01 · Production prompt distribution · your real traffic, sampled

stage 02 · Reliability profiling · empirical p_fail under repeated inference

stage 03 · Hotspot discovery · APST-G graph-guided exploration

stage 04 · Prompt & policy optimization · APST-PO system prompts, tools, refusals, routing

stage 05 · Validated reduction · before / after Δp_fail proven on the same distribution

stage 06 · Deployment monitoring · re-validation cadence on live traffic

Closed loop: monitoring feeds new traffic back into reliability profiling. Runs locally against your environment.

The problem

Stochastic failures don't show up in one-shot evaluations.

Production LLMs are queried repeatedly, retried, sampled at temperature, and embedded into agent workflows. Failures cluster in regions that shallow benchmarks never reach, and a 0.5% failure rate means roughly 5,000 incidents per million inferences.

Hidden failure regions

Risky prompts cluster in semantic neighborhoods that standard evals don't sample densely enough to expose.

Rankings shift under depth

Models that look comparable on one-shot benchmarks diverge sharply when stress-tested at deployment depth.

Rare failures compound

Low per-call failure rates translate into recurring operational, safety, and trust incidents in production.

Platform capabilities

A reliability engineering platform — not a benchmark.

SafeFlow combines measurement, discovery, optimization, and validation in a single workflow built for teams deploying LLMs and agents to production.

Reliability profiling on your production prompt distribution

Estimate empirical failure probability on real production traffic — not synthetic benchmarks decoupled from your users.

Agent workflow stress testing

Stress-test agents under retries, tool calls, refusals, and stochastic execution paths — exposing failures that single-shot evals never trigger.

Graph-guided hotspot discovery

APST-G expands semantic neighborhoods around risky prompts to surface clustered failure regions across your traffic.

System prompt, tool, and policy optimization

APST-PO hardens system prompts, tool policies, refusal behavior, routing prompts, and orchestration constraints against measured risk.

Deployment-scale risk projection

Translate empirical p_fail into projected incidents per 10K / 100K / 1M inferences across your real workload.

Before / after reduction validation

Re-run APST against the same distribution to prove that an optimization actually lowered empirical failure probability.

Production traffic + benchmark blending

Combine sampled production prompts with curated safety benchmarks for evaluations grounded in how your system is actually used.

Executive-ready reliability reports

Bilingual reports for governance, risk, and security reviews — operational, not academic.
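The deployment-scale projection listed above is simple arithmetic: expected failures equal p_fail times the number of inferences. A minimal sketch, using the baseline and optimized p_fail figures shown on this page; the function name is hypothetical and not part of any SafeFlow API.

```python
from typing import Dict, Sequence


def projected_failures(
    p_fail: float,
    volumes: Sequence[int] = (10_000, 100_000, 1_000_000),
) -> Dict[int, float]:
    """Expected number of failures at each inference volume (p_fail * n)."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability in [0, 1]")
    return {n: p_fail * n for n in volumes}


for label, p in (("baseline", 0.0312), ("optimized", 0.0058)):
    for n, expected in projected_failures(p).items():
        print(f"{label:>9}  n={n:>9,}  expected failures ~ {expected:,.0f}")
```

The projection assumes the measured p_fail carries over unchanged to live traffic, which is exactly what periodic re-validation is meant to check.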

Agent workflow reliability

Stress-test agent workflows under retries, tool calls, and stochastic execution paths.

Agents fail differently than single-shot prompts. Compounding stochasticity across tool calls, retries, and routing decisions creates failure modes that one-shot evals never expose. SafeFlow stress-tests the entire workflow on your real production traffic and reports where reliability breaks.
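To make the compounding concrete, here is a small sketch of how per-step failure rates combine across a workflow. It assumes steps fail independently, that a retry is an independent fresh attempt, and that failures are detected so a retry actually fires. Those are simplifying assumptions for illustration only; clustered failures, which this page argues are common, would violate them.

```python
from typing import Sequence


def workflow_p_fail(step_p_fail: Sequence[float], retries: int = 0) -> float:
    """Probability that at least one step of a workflow ultimately fails.

    A step fails only if its first attempt and every retry fail, so its
    effective failure probability is p ** (retries + 1).
    """
    if retries < 0:
        raise ValueError("retries must be non-negative")
    p_all_ok = 1.0
    for p in step_p_fail:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each step probability must be in [0, 1]")
        p_all_ok *= 1.0 - p ** (retries + 1)
    return 1.0 - p_all_ok


# Five steps at a 1% per-call failure rate (illustrative numbers).
steps = [0.01] * 5
print(f"no retries: {workflow_p_fail(steps):.4f}")
print(f"one retry:  {workflow_p_fail(steps, retries=1):.6f}")
```

Without retries, five steps at 1% each give a workflow-level failure rate of about 4.9%, nearly five times the per-call figure.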

Stress-test my agent workflow

Multi-step orchestration

Stress-test plan → act → observe loops, branching, and recovery behavior under stochastic execution.

Tool calls & APIs

Inject realistic tool failures, latency, and malformed responses to surface brittle tool policies.

Refusals & guardrails

Measure how often refusal logic and guardrails drift, over-trigger, or get bypassed under depth.

Routing & sub-agents

Evaluate router prompts and sub-agent handoffs as first-class reliability surfaces, not glue code.
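One way to picture tool-failure injection is a wrapper that randomly replaces a tool's response with an error or a malformed payload. This is a hypothetical sketch, not SafeFlow's implementation: the names, rates, and truncated-JSON payload are made up for illustration, and latency injection is left out for brevity.

```python
import json
import random
from collections import Counter
from typing import Callable


class InjectedToolError(RuntimeError):
    """Raised in place of a real tool error during a stress test."""


def with_injected_faults(
    tool: Callable[..., str],
    error_rate: float,
    malformed_rate: float,
    rng: random.Random,
) -> Callable[..., str]:
    """Wrap a tool so a fraction of calls raise or return broken JSON."""

    def wrapped(*args, **kwargs) -> str:
        roll = rng.random()
        if roll < error_rate:
            raise InjectedToolError("injected tool failure")
        if roll < error_rate + malformed_rate:
            return '{"status": "ok"'  # truncated JSON, cannot be parsed
        return tool(*args, **kwargs)

    return wrapped


def count_outcomes(tool: Callable[[str], str], calls: int) -> Counter:
    """Call the tool repeatedly and tally ok / error / malformed results."""
    outcomes: Counter = Counter()
    for i in range(calls):
        try:
            json.loads(tool(f"query-{i}"))
            outcomes["ok"] += 1
        except InjectedToolError:
            outcomes["error"] += 1
        except json.JSONDecodeError:
            outcomes["malformed"] += 1
    return outcomes


flaky = with_injected_faults(
    lambda query: '{"status": "ok"}', 0.2, 0.2, random.Random(0)
)
print(count_outcomes(flaky, 1000))
```

In a real stress test the tally would be replaced by the agent itself, so the measurement is how often the workflow recovers rather than how often the tool breaks.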

APST-G · Graph-guided discovery

Discover hidden failure hotspots.

SafeFlow uses graph-guided prompt exploration to identify semantically related prompt regions where failures concentrate. Instead of sampling uniformly, APST-G expands from known failures into their semantic neighborhoods — surfacing clustered risk that standard evaluations miss.

Prompts represented as graph neighborhoods
Expansion guided by judge feedback and embedding distance
Risky regions cluster into named hotspots with example prompts
Discovery budget bounded — converges on the most informative samples
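The properties listed above can be illustrated with a toy best-first expansion: judge the seed prompts, then repeatedly expand the riskiest judged prompt into its nearest unexplored neighbors until a judging budget runs out. This is a generic sketch under stated assumptions, not the APST-G algorithm itself. The `explore` function, the one-dimensional embeddings, and the judge are all hypothetical.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple


def explore(
    embeddings: Dict[str, Sequence[float]],
    seeds: List[str],
    judge: Callable[[str], float],
    budget: int,
    k: int = 3,
) -> Dict[str, float]:
    """Best-first expansion from seeds; returns a risk score per judged prompt.

    budget caps the number of judge calls; k is the number of nearest
    unexplored neighbors judged on each expansion.
    """
    risk: Dict[str, float] = {}
    frontier: List[Tuple[float, str]] = []  # min-heap keyed on negative risk

    def evaluate(prompt: str) -> None:
        risk[prompt] = judge(prompt)
        heapq.heappush(frontier, (-risk[prompt], prompt))

    for seed in seeds:
        if len(risk) >= budget:
            break
        if seed not in risk:
            evaluate(seed)

    while frontier and len(risk) < budget:
        _, current = heapq.heappop(frontier)  # riskiest judged prompt so far
        unexplored = [p for p in embeddings if p not in risk]
        nearest = sorted(
            unexplored, key=lambda p: math.dist(embeddings[p], embeddings[current])
        )[:k]
        for prompt in nearest:
            if len(risk) >= budget:
                break
            evaluate(prompt)
    return risk


# Toy data: ten prompts on a line, with failures concentrated at positions >= 7.
points = {name: (float(i),) for i, name in enumerate("abcdefghij")}
found = explore(points, ["a", "i"], lambda p: float(points[p][0] >= 7), budget=5, k=2)
print(sorted(found))
```

With this toy data the budget is spent around the risky seed rather than the safe one, which is the behavior the bounded discovery budget is meant to produce.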

Figure: apst-g · prompt graph (legend: safe, risky, failure)

Seed prompts

312

Neighbors explored

3,184

Hotspots found

APST-PO · Workflow hardening

Optimize the prompts and policies driving production risk.

APST-PO doesn't just rewrite prose — it hardens system prompts, tool policies, refusal behavior, routing prompts, and orchestration constraints against measured failure probability. Every candidate is stress-tested under repeated inference and ranked by validated risk reduction.

System prompts

Tool policies

Refusal behavior

Routing prompts

Orchestration constraints
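Ranking candidates by validated risk reduction while preserving utility could look like the sketch below. The selection rule (keep candidates that beat the baseline p_fail without dropping utility below a floor, then sort by p_fail) is an assumption for illustration, not a description of APST-PO internals. The v0 and v7 figures are the ones reported on this page; `vX` is a made-up candidate included only to show the utility filter.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    p_fail: float   # empirical failure probability under repeated inference
    utility: float  # task score, higher is better


def rank_candidates(
    baseline: Candidate,
    candidates: Sequence[Candidate],
    max_utility_drop: float = 0.0,
) -> List[Candidate]:
    """Keep candidates that lower p_fail and preserve utility; best first."""
    floor = baseline.utility - max_utility_drop
    eligible = [
        c for c in candidates
        if c.p_fail < baseline.p_fail and c.utility >= floor
    ]
    return sorted(eligible, key=lambda c: c.p_fail)


baseline = Candidate("v0", p_fail=0.0312, utility=0.84)
candidates = [
    Candidate("v7", p_fail=0.0058, utility=0.86),
    Candidate("vX", p_fail=0.0040, utility=0.70),  # hypothetical over-refusing prompt
]
for c in rank_candidates(baseline, candidates):
    reduction = (baseline.p_fail - c.p_fail) / baseline.p_fail
    print(f"{c.name}: p_fail={c.p_fail:.4f}, utility={c.utility:.2f}, reduction={reduction:.1%}")
```

The hypothetical `vX` has the lowest p_fail but is filtered out because its utility falls below the baseline, which is the trade-off the utility constraint exists to catch.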

baseline · system prompt v0

high risk

You are a helpful assistant.
Answer the user's question.

p_fail

0.0312

empirical

/ 100K

3,120

projected

Utility

0.84

task score

optimized · system prompt v7

−81% p_fail

You are a careful assistant.
Refuse unsafe requests and PII queries.
Cite sources for factual claims.
If uncertain, ask one clarifying question.

p_fail

0.0058

empirical

/ 100K

580

projected

Utility

0.86

task score

Validation re-runs APST against the same prompt distribution to confirm the reduction in empirical failure probability before any change reaches production.
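One common way to check that a before/after difference is larger than sampling noise is a one-sided two-proportion z-test. The page does not say which statistical test SafeFlow uses, so treat this as an illustrative stand-in. The counts are hypothetical, and the test assumes the two runs are independent samples.

```python
import math
from typing import Tuple


def two_proportion_z(
    fail_base: int, n_base: int, fail_opt: int, n_opt: int
) -> Tuple[float, float]:
    """One-sided z-test for H1: baseline p_fail > optimized p_fail.

    Returns (z, p_value) using the pooled standard error.
    """
    if n_base <= 0 or n_opt <= 0:
        raise ValueError("sample sizes must be positive")
    p_base = fail_base / n_base
    p_opt = fail_opt / n_opt
    pooled = (fail_base + fail_opt) / (n_base + n_opt)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_opt))
    if se == 0:
        raise ValueError("standard error is zero; no failures or all failures")
    z = (p_base - p_opt) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail P(Z >= z)
    return z, p_value


# Hypothetical counts consistent with p_fail of 0.0312 and 0.0058.
z, p_value = two_proportion_z(312, 10_000, 58, 10_000)
print(f"z={z:.2f}, one-sided p-value={p_value:.2e}")
```

Because validation re-runs the same prompt distribution, a paired comparison per prompt may be more appropriate in practice; the unpaired test here is the simpler illustration.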

Sample deliverable

Sample Reliability Optimization Report

A preview of what SafeFlow delivers: baseline vs. optimized failure probability, measured reduction, discovered hotspots, unsafe prompt categories, mitigation actions, and a recommended optimized system prompt — ready for deployment review.

Request sample report

safeflow-report · preview

Customer-assistant workflow · n=2,000 samples

v1.0

Baseline p_fail

0.0312

before

Optimized p_fail

0.0058

after

Reduction

−81%

validated

/ 100K saved

2,540

projected

Top unsafe categories

Unsafe completion: 38%
Hallucinated citation: 27%
Instruction drift: 19%
PII leakage: 9%
Tool-call error: 7%

Discovered hotspot clusters

pii-leakage · email parsing · 0.071 · high
instruction-drift · multi-turn · 0.044 · high
citation · medical queries · 0.029 · med

Recommended mitigation actions

Adopt optimized system prompt v7 (validated −81% p_fail)
Add rule-based PII pre-filter for email-parsing workflow
Lower temperature to 0.3 for citation-heavy medical queries
Re-validate weekly against sampled production traffic

Recommended system prompt

You are a careful assistant.
Refuse unsafe requests and PII queries.
Cite sources for factual claims.
If uncertain, ask one clarifying question.

Local-first: prompts and outputs stay in your environment.

APST Starter Kit

Try the reliability loop locally.

The open-source Starter Kit is a local-first tool for researchers, practitioners, and AI teams who want to experiment with repeated-inference measurement on their own machines.

Run a mock demo without an API key
Connect OpenAI, Together.ai, Ollama, vLLM, or any OpenAI-compatible endpoint
Keep prompts and outputs in your own environment
Generate English, Chinese, or bilingual reports
Available on GitHub and PyPI

View GitHub · View PyPI · Read Quickstart

terminal

$ pip install apst-starter-kit
$ apst demo --lang both

▸ loading mock prompt set      (50 prompts)
▸ sampling                     (n=200, T=0.7)
▸ judging                      (rule-based + llm)
▸ estimating p_fail            ✓
▸ discovering hotspots         (apst-g)  ✓
▸ optimizing system prompt     (apst-po) ✓
▸ validating reduction         ✓

  baseline p_fail     0.0312
  optimized p_fail    0.0058
  reduction           −81.4%
  bilingual report    ./safeflow-report.md

Enterprise offering

Reliability Optimization Audit

A focused engagement for teams deploying LLMs, agents, or AI assistants that need to measure, reduce, and validate operational failure risk before production.

SafeFlow runs the full Measure → Discover → Optimize → Validate loop on your prompts, workflows, and deployment assumptions. You receive an executive-ready report with empirical failure probabilities, discovered hotspots, optimized prompts, validated reductions, and projected operational exposure.

Pilot audits typically run 1–2 weeks and focus on one high-value LLM workflow.

Request a Reliability Optimization Audit

Deliverables

01. Custom prompt-risk taxonomy aligned to your workflow
02. APST measurement across selected models and configurations
03. APST-G hotspot discovery with named, scoped risk clusters
04. APST-PO candidate system prompts and policies
05. Validated before/after failure probability reduction
06. Projected operational exposure per 10K / 100K / 1M inferences
07. Executive report in English, Chinese, or bilingual format
08. Monitoring and re-validation cadence recommendations

Method

The math behind the reliability loop.

APST converts stochastic LLM behavior into measurable, projectable risk metrics. APST-G and APST-PO extend it into discovery and optimization.

01. Profile reliability on your prompt distribution
02. Discover hotspots with APST-G graph-guided exploration
03. Generate candidate prompts and policies with APST-PO
04. Stress-test candidates under repeated inference
05. Validate measured reduction in failure probability
06. Recommend deployment thresholds and monitoring

Core formulas

p_fail = failures / valid judged generations

Empirical failure probability

Δp = p_fail_baseline − p_fail_optimized

Validated reduction

Expected failures = p_fail × n

Deployment-scale projection
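As a worked example, the three formulas can be applied to the baseline and optimized figures shown on this page. The relative-reduction line is an assumption about how the −81.4% headline is derived; it is not listed among the core formulas, but it is consistent with the reported numbers.

```latex
\begin{align*}
p_{\text{fail}} &= \frac{\text{failures}}{\text{valid judged generations}} \\[4pt]
\Delta p &= p_{\text{fail}}^{\text{baseline}} - p_{\text{fail}}^{\text{optimized}}
          = 0.0312 - 0.0058 = 0.0254 \\[4pt]
\frac{\Delta p}{p_{\text{fail}}^{\text{baseline}}} &= \frac{0.0254}{0.0312} \approx 0.814
  \quad \text{(an } 81.4\% \text{ relative reduction)} \\[4pt]
\text{Expected failures} &= p_{\text{fail}} \times n: \quad
  0.0312 \times 100{,}000 = 3{,}120, \qquad
  0.0058 \times 100{,}000 = 580
\end{align*}
```

The difference between the two projections, 3,120 minus 580, gives the 2,540 failures per 100K inferences that the sample report lists as saved.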

Research foundation

Research-backed, deployment-focused.

SafeFlow productizes the Accelerated Prompt Stress Testing research line introduced by Keita Broadwater — extended with graph-guided discovery (APST-G) and prompt optimization (APST-PO) for operational use.

Paper

Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

The primary APST paper introducing a depth-oriented framework for measuring LLM safety and reliability under repeated inference.

arXiv · DOI

Paper

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

A CCAI 2026 conference paper presenting APST through the lens of reliability gaps in LLM safety evaluation.

arXiv · DOI

Who it's for

Built for teams shipping AI into production.

AI platform teams

Enterprise AI deployment teams

Agent workflow builders

AI governance and risk teams

Security and red-team teams

Financial and healthcare AI teams

Pricing

Engagements sized to your deployment.

Start free with the Starter Kit, run a focused optimization pilot, or scale SafeFlow across your organization.

Starter Kit

Free

open source

Local-first APST demo for researchers and practitioners.

View Starter Kit

Most requested

Reliability Optimization Pilot

$5K–$10K USD

typical 1–2 week pilot

A focused Measure → Discover → Optimize → Validate engagement on one high-value LLM workflow, with empirical failure probability, validated reduction, and an executive-ready report.

Request Pilot Audit

Enterprise Reliability Program

Custom

tailored scope

Multi-model, multi-workflow APST evaluation with bilingual reporting, governance support, prompt optimization, monitoring recommendations, and executive briefings.

Discuss Enterprise Scope

SafeFlow does not guarantee safety. Engagements measure, discover, optimize, validate, and project LLM reliability to support deployment decisions.

International readiness

Designed for international AI teams.

SafeFlow supports English, Chinese, and bilingual reporting. The Starter Kit runs locally using OpenAI-compatible APIs, Ollama, vLLM, or local model servers — practical for teams in the United States, China, and international enterprise environments.

SafeFlow supports English, Chinese, and bilingual reliability reports.

SafeFlow 支持英文、中文和双语可靠性报告。

Contact

Ready to reduce hidden AI failure risk?

Request a Reliability Optimization Audit, stress-test your agent workflow, or try the Starter Kit locally.

Request a Reliability Optimization Audit · View Starter Kit

SafeFlow does not guarantee safety. We measure, discover, optimize, validate, and recommend — giving your team the data to make deployment decisions.