Score your voicebot quality with CAIQS.

Make sure that your voicebot meets your customers’ expectations. CAIQS is a tool to quickly score your voicebot’s quality under real conditions, from poor to perfect.

You trust your newest voicebot to handle customer orders and requests. It works fine when you test it yourself. But has the voicebot experienced real life yet – with different phone types, background noise, indistinct speech, and real networks in between?

A score for QA

The voicebot score is built for end‑to‑end evaluation over telephony and similar call flows. It measures what users actually experience, not just synthetic transcripts. To do so, we run automated end‑to‑end tests using our real‑device phone lab (no emulation): real networks with controlled realism, plus configurable caller profiles, environments, and network conditions.

Every run produces an easy-to-read score that gives you a clear sense of the voicebot’s quality – the CAIQS (Conversational AI Quality Score). To dig deeper, you can review detailed, shareable reports with audio recordings and transcripts.

CAIQS is a 0–100 composite metric built from objective and versioned measurements: task success, understanding accuracy, robustness, latency, MOS audio quality, compliance, security — and long‑dialog degradation. New in CAIQS v1.1: LDR (Long‑Dialog Resilience), a dedicated subscore that captures quality drift when conversations run long and context starts to degrade.

CAIQS formula

Voice quality failures rarely come from one place. Accordingly, a voicebot’s quality score is calculated from 11 components (see table below). CAIQS separates task correctness, understanding, robustness, latency, audio, and compliance/security so teams can diagnose and improve systematically. Each component produces a normalized score from 0 to 100.

The composite CAIQS is a weighted sum. The weights match the CAIQS-Default v1.1 definition, resulting in the formula:

CAIQS = Σ (wᵢ · Sᵢ)

wᵢ … weight of a component
Sᵢ … subscore 0-100

Summary

  • CAIQS is an easy-to-read scoring system by QiTASC that indicates voicebot quality.
  • The score improves real-world reliability by testing scenarios that involve real phones and real networks.

The 11 components and their weights in the formula that ranks a voicebot’s quality:
| Category | Meaning | Weight | What is measured? |
| --- | --- | --- | --- |
| TS | Task success | 0.30 | Objective goal achievement (appointment booked, price collected, etc.). |
| UA | Understanding accuracy | 0.15 | Black-box semantic correctness per turn (intent/slot/state match). |
| SD | Slang robustness | 0.15 | Handling of slang and non‑standard phrasing variants. |
| FR | Fallback ratio | 0.10 | How often the bot falls back (“Sorry, I didn’t understand”). Lower is better (inverted). |
| DT | Dialog efficiency | 0.04 | Actual turns vs. scenario “oracle” minimum turns. |
| LAT95 | Turn-taking latency | 0.03 | 95th percentile latency between end-of-user-speech and start-of-bot-speech (scaled). |
| LDR | Long‑dialog resilience | 0.03 | Quality drift later vs. earlier in long dialogs (context loss, repair loops, slowdowns). |
| AQ | Audio quality (MOS) | 0.05 | Objective MOS (e.g., ITU‑T P.563/P.863, locked by profile). |
| CT | Compliance & tone | 0.05 | Rule-based compliance + judgment-based tone (versioned). |
| SP | Security & privacy | 0.05 | Penalty-based score for privacy/security failures (PII leakage, auth bypass, etc.). |
| FI | Filler intelligence | 0.05 | Appropriate acknowledgements/fillers when latency would otherwise harm UX. |
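The weights sum to 1.0, so the composite stays on the 0–100 scale. As a minimal sketch (not the official tooling), here is how the weighted sum can be computed from the table above; the subscores in the example run are made-up values for illustration:

```python
# CAIQS-Default v1.1 weights, taken from the table above.
WEIGHTS = {
    "TS": 0.30, "UA": 0.15, "SD": 0.15, "FR": 0.10, "DT": 0.04,
    "LAT95": 0.03, "LDR": 0.03, "AQ": 0.05, "CT": 0.05, "SP": 0.05,
    "FI": 0.05,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 1

def caiqs(subscores: dict[str, float]) -> float:
    """Composite CAIQS = sum of w_i * S_i over all 11 components (S_i in 0..100)."""
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)

# Hypothetical subscores for one test run (illustration only).
run = {"TS": 92, "UA": 88, "SD": 75, "FR": 80, "DT": 90, "LAT95": 70,
       "LDR": 95, "AQ": 85, "CT": 100, "SP": 100, "FI": 60}
print(f"CAIQS = {caiqs(run):.1f}")  # roughly 85.9 for these example values
```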

Sample criterion: LDR, the long‑dialog degradation subscore

A relevant criterion is the quality of a long-running dialog. LDR measures whether the bot performs worse later than earlier in the same call. It is intentionally different from dialog efficiency (DT): DT cares about how many turns a dialog takes, LDR about time-dependent degradation.

CAIQS uses LDR as a subscore focusing on long dialogs. This matters for real-life quality because long calls can degrade: the model loses context, repair loops increase, latency creeps up, and fallbacks become more frequent. LDR quantifies that drift by comparing the early and late segments of the same dialog.

Inputs used by LDR (per turn):

| Input | Meaning |
| --- | --- |
| correctₜ | semantic correctness |
| fallbackₜ | fallback classification |
| latencyₜ | turn-taking latency |

In long dialogs, LDR compares early vs. late windows and computes a “degradation” delta.

```
# Long-dialog threshold: T_long = 8 turns
# Window size: W = max(3, ceil(0.25 * T))
# Early window = turns 1..W
# Late window  = turns (T-W+1)..T

UA_X    = mean(correct_t)  over window X
FR_X    = mean(fallback_t) over window X
LAT95_X = P95(latency_t)   over window X
sLAT_X  = scale_lat_v1_0(LAT95_X) / 100

Q_X         = (UA_X + (1 - FR_X) + sLAT_X) / 3
Degradation = max(0, Q_early - Q_late)
LDR         = clamp(0, 1, 1 - Degradation)
S_LDR       = 100 * LDR
```
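The pseudocode translates directly into a runnable sketch. Note that scale_lat_v1_0 is defined by the official CAIQS profile; the linear latency ramp below is only a stand-in assumption to keep the example self-contained:

```python
import math
from dataclasses import dataclass

@dataclass
class Turn:
    correct: bool     # semantic correctness of the bot's reply (correct_t)
    fallback: bool    # turn classified as a fallback (fallback_t)
    latency_s: float  # turn-taking latency in seconds (latency_t)

def p95(values: list[float]) -> float:
    """95th percentile, nearest-rank method."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

def scale_lat(lat95_s: float) -> float:
    """Stand-in for scale_lat_v1_0 (0..100, lower latency = better).
    This linear ramp (100 at 0 s, 0 at 3 s) is an illustrative assumption,
    not the official scaling."""
    return max(0.0, min(100.0, 100.0 * (1.0 - lat95_s / 3.0)))

def window_quality(turns: list[Turn]) -> float:
    """Q_X = (UA_X + (1 - FR_X) + sLAT_X) / 3 for one window."""
    ua = sum(t.correct for t in turns) / len(turns)
    fr = sum(t.fallback for t in turns) / len(turns)
    s_lat = scale_lat(p95([t.latency_s for t in turns])) / 100.0
    return (ua + (1.0 - fr) + s_lat) / 3.0

def ldr_subscore(turns: list[Turn], t_long: int = 8) -> float:
    """S_LDR in 0..100; short calls (T < T_long) get no drift penalty."""
    t = len(turns)
    if t < t_long:
        return 100.0
    w = max(3, math.ceil(0.25 * t))
    early, late = turns[:w], turns[-w:]  # turns 1..W and (T-W+1)..T
    degradation = max(0.0, window_quality(early) - window_quality(late))
    return 100.0 * max(0.0, min(1.0, 1.0 - degradation))
```

For a 12-turn call, for example, W = max(3, ceil(0.25 · 12)) = 3, so the quality of the first three turns is compared against the quality of the last three.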

A high LDR (close to 100) means the bot stays stable as the call runs long. A low LDR indicates quality drift (the late segment is worse than the early one), which often matches real user complaints: “It started fine, then it got confused.”

LDR is intentionally focused on long dialogs: if a call is short (T < 8 turns), LDR defaults to 100 (no drift penalty).

How to interpret CAIQS

CAIQS is most valuable when used for trend and regression (version‑to‑version comparisons) and for diagnostics (which subscore is pulling you down). If short tests look fine but customers complain, run longer scenarios and focus on LDR to find context drift and compounding failures.

Remember: For defensible comparisons, keep CAIQS version, profile id, and scaling/judge versions consistent across runs. Treat CAIQS as a measurement system, not a marketing number.

| Score | Rating | Interpretation |
| --- | --- | --- |
| 90–100 | Best in class | Perfect. Typically stable long journeys (high LDR), low fallbacks, strong task success. Suitable for critical flows. |
| 80–89 | Production ready | Good. Minor gaps remain (often latency spikes, occasional drift in long dialogs, or slang coverage). |
| 70–79 | OK | Needs optimization. Expect notable friction for some users. Check FR, UA/SD, and LDR on long scenarios. |
| 60–69 | Weak | High risk. Significant issues likely visible in recordings. Often high fallback loops or slow turn-taking. |
| < 60 | Poor | Not recommended. Fundamental quality gaps. A long dialog will likely collapse (very low LDR) and task success suffers. |
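If you want to surface the rating band next to the raw score in reports, a simple threshold lookup does the job; the band names below are taken straight from the table above:

```python
def caiqs_band(score: float) -> str:
    """Map a CAIQS value (0..100) to the rating bands in the table above."""
    if score >= 90:
        return "Best in class"
    if score >= 80:
        return "Production ready"
    if score >= 70:
        return "OK"
    if score >= 60:
        return "Weak"
    return "Poor"

print(caiqs_band(85.9))  # -> Production ready
```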

Conclusion

CAIQS: Turning voicebot performance into measurable business impact

With CAIQS, voicebot quality moves beyond gut feeling and isolated KPIs: it becomes transparent, structured, and objectively measurable. The underlying formula converts real interaction performance into a standardized quality score. This enables organizations to benchmark systems, compare providers, and clearly identify optimization potential, creating a reliable foundation for continuous improvement and strategic decision-making.

For businesses, the benefits go far beyond technical insight. CAIQS transforms voicebot performance into a controllable business asset, driving higher customer satisfaction, greater efficiency, and sustainable competitive advantage.