DeltaBench Evaluator Pro

Measure AI response quality with precision metrics


Key Metrics We Measure

Compression

Measures how much shorter the Δ response is than the original without losing meaning, calculated as the percentage reduction in length.
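The page doesn't specify the unit of length, so this is a minimal sketch assuming word counts; the function name `compression_ratio` is an illustrative choice, not the tool's API:

```python
def compression_ratio(original: str, rewritten: str) -> float:
    """Percentage reduction in word count from original to rewritten.

    Assumes whitespace tokenization; a character-based variant would
    swap len(x.split()) for len(x).
    """
    orig_len = len(original.split())
    new_len = len(rewritten.split())
    if orig_len == 0:
        return 0.0
    return 100.0 * (orig_len - new_len) / orig_len
```

A rewrite that halves the word count scores 50.0; a longer rewrite yields a negative value, which flags expansion rather than compression.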

Semantic Similarity

BERT-based cosine similarity score (0-1) showing how closely the original and Δ versions match in meaning.
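The core of this metric is plain cosine similarity over embedding vectors. The sketch below implements just that math; in practice the vectors would come from a BERT-family encoder (e.g. a sentence-transformers model), which is an assumption here, not something the page names:

```python
import math
from typing import Sequence

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors.

    Raw cosine lies in [-1, 1]; a 0-1 score implies the tool either
    clips negatives or rescales, which is not specified on the page.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # degenerate (all-zero) embedding
    return dot / (norm_u * norm_v)
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which matches the intuition that 1 means "same meaning" and values near 0 mean "unrelated."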

Consent Markers

Tracks the relative increase in collaborative language patterns such as "would you like" or "can I" from the original to the Δ version.
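A relative increase in marker counts can be computed with simple substring matching. The marker list below is hypothetical (the page names only two phrases), as is the choice to return infinity when markers appear where there were none:

```python
# Hypothetical marker list; only the first two phrases appear on the page.
CONSENT_MARKERS = ("would you like", "can i", "shall we", "if you prefer")

def consent_marker_count(text: str) -> int:
    """Count case-insensitive occurrences of consent-marker phrases."""
    lowered = text.lower()
    return sum(lowered.count(marker) for marker in CONSENT_MARKERS)

def consent_marker_increase(original: str, rewritten: str) -> float:
    """Percentage increase in consent markers from original to rewritten."""
    before = consent_marker_count(original)
    after = consent_marker_count(rewritten)
    if before == 0:
        return float("inf") if after > 0 else 0.0
    return 100.0 * (after - before) / before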

Directive Reduction

Measures the decrease in imperative language ("you must", "do not"), which can feel authoritarian.
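This is the mirror image of the consent-marker metric: count directive phrases in each version and report the percentage drop. The phrase list beyond the two quoted examples is an assumption:

```python
# Hypothetical directive list; only "you must" and "do not" come from the page.
DIRECTIVE_PHRASES = ("you must", "do not", "you have to", "never")

def directive_count(text: str) -> int:
    """Count case-insensitive occurrences of directive phrases."""
    lowered = text.lower()
    return sum(lowered.count(phrase) for phrase in DIRECTIVE_PHRASES)

def directive_reduction(original: str, rewritten: str) -> float:
    """Percentage decrease in directive phrases from original to rewritten.

    Returns 0.0 when the original has no directives to reduce.
    """
    before = directive_count(original)
    if before == 0:
        return 0.0
    after = directive_count(rewritten)
    return 100.0 * (before - after) / before
```

A rewrite that drops every directive scores 100%; a negative score would mean the Δ version became more imperative, not less.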

Readability

Flesch Reading Ease scores (0-100) comparing how easy each version is to understand.
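Flesch Reading Ease has a standard published formula: 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words). The syllable counter below is a crude vowel-group heuristic standing in for the dictionary-based counters real tools use, so exact scores will differ:

```python
import re

def naive_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels; crude but dependency-free."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher is easier; typical texts fall in 0-100,
    though the formula itself is unbounded on both ends."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(naive_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
```

Comparing the two versions' scores, as the page describes, only requires that the same (even approximate) syllable counter is applied to both.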

Sentiment

Detects polarity shifts between positive, negative, and neutral from the original to the Δ version.
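Shift detection reduces to classifying each version and comparing labels. The toy lexicon scorer below is a stand-in for whatever sentiment model the tool actually uses (e.g. a VADER-style analyzer); the word sets are invented for illustration:

```python
# Toy lexicons for illustration only; a real tool would use a trained model.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "sad", "hate", "awful"}

def polarity(text: str) -> str:
    """Classify text as positive, negative, or neutral by lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def polarity_shift(original: str, delta: str):
    """Return (old_label, new_label) on a shift, or None if unchanged."""
    before, after = polarity(original), polarity(delta)
    return None if before == after else (before, after)
```

Returning the label pair rather than a boolean lets a report say which direction the Δ version moved, not just that it moved.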

Ready to Benchmark Your AI Responses?

Get precise metrics to improve your conversational AI frameworks.