Benchmark Evaluator
Key Metrics We Measure
Compression
Measures how much shorter the response becomes while keeping its meaning, calculated as a percentage reduction in length.
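As a rough sketch, the percentage reduction could be computed like this. The function name `compression_percent` is hypothetical, and counting whitespace-separated words (rather than characters or tokens) is an assumption, since the unit of length is not stated here.

```python
def compression_percent(original: str, revised: str) -> float:
    """Percentage reduction in word count from original to revised.

    Assumption: length is measured in whitespace-separated words.
    A negative result means the revised text is longer.
    """
    original_words = len(original.split())
    revised_words = len(revised.split())
    if original_words == 0:
        # Reduction is undefined for an empty original; report no change.
        return 0.0
    return 100.0 * (original_words - revised_words) / original_words
```

For example, shortening a four-word response to three words gives a 25% reduction.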
Semantic Similarity
A BERT-based cosine similarity score (0-1) showing how closely the two responses match in meaning.
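The cosine step can be illustrated on plain vectors. This sketch assumes each response has already been turned into an embedding by some BERT-style model, which is not shown. Raw cosine similarity ranges from -1 to 1, so clamping negative values to 0 to fit the 0-1 range is also an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors, clamped to [0, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as no similarity.
        return 0.0
    # Assumption: negative cosine values are clamped to 0.
    return max(0.0, dot / (norm_a * norm_b))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.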
Consent Markers
Tracks the relative increase in collaborative language patterns such as "would you like" or "can I".
Directive Reduction
Measures the decrease in imperative language ("you must", "do not"), which can feel authoritarian.
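Both phrase-based metrics, Consent Markers and Directive Reduction, can be sketched as counting phrases before and after a rewrite. The phrase lists below contain only the examples quoted on this page, so a real list would presumably be longer. The helper names `count_phrases` and `relative_change` are hypothetical.

```python
import re
from typing import Iterable, Optional

# Only the example phrases quoted on this page; a fuller list is assumed.
CONSENT_PHRASES = ("would you like", "can i")
DIRECTIVE_PHRASES = ("you must", "do not")


def count_phrases(text: str, phrases: Iterable[str]) -> int:
    """Count case-insensitive, whole-word occurrences of each phrase."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in phrases
    )


def relative_change(before: int, after: int) -> Optional[float]:
    """Change relative to the starting count; None when there is no baseline."""
    if before == 0:
        return None
    return (after - before) / before


original = "You must reset it. Do not skip steps. Can I help?"
revised = "Would you like to reset it? Can I walk you through the steps?"

consent_change = relative_change(
    count_phrases(original, CONSENT_PHRASES),
    count_phrases(revised, CONSENT_PHRASES),
)
directive_change = relative_change(
    count_phrases(original, DIRECTIVE_PHRASES),
    count_phrases(revised, DIRECTIVE_PHRASES),
)
```

In this example, consent markers go from one to two (a relative change of 1.0) and directives drop from two to zero (a relative change of -1.0).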
Readability
Flesch Reading Ease scores (0-100) comparing how easy each version is to understand.
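The standard Flesch Reading Ease formula is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below applies it with a crude vowel-group syllable counter, which is an assumption; production tools typically use a pronunciation dictionary or more careful rules. The raw formula can also fall outside 0-100 for very simple or very dense text, and whether scores are clamped is not stated here.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease using the standard formula and heuristic counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        # Score is undefined for empty input; return 0.0 as a placeholder.
        return 0.0
    syllables = sum(count_syllables(word) for word in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )
```

Higher scores mean easier text, so a rewrite that raises the score has become easier to read by this measure.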
Sentiment
Detects shifts in polarity between positive, negative, and neutral in the Δ version.
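A polarity shift check might look like the sketch below. The tiny word lists are hypothetical placeholders for illustration only, and a real evaluator would more likely use a trained sentiment model. The function names are also assumptions.

```python
import re
from typing import Optional, Tuple

# Hypothetical toy lexicon, for illustration only.
POSITIVE_WORDS = {"great", "happy", "thanks", "glad"}
NEGATIVE_WORDS = {"bad", "wrong", "never", "fail"}


def polarity(text: str) -> str:
    """Label text positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def polarity_shift(original: str, revised: str) -> Optional[Tuple[str, str]]:
    """Return (before, after) labels if polarity changed, otherwise None."""
    before, after = polarity(original), polarity(revised)
    return None if before == after else (before, after)
```

A result of `None` means both versions carry the same polarity label.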
Ready to Benchmark Your AI Responses?
Get precise metrics to improve your conversational AI frameworks.