DeltaBench Evaluator

Benchmark Evaluator

Prompt

Baseline (NON-Δ) Output

Δ-Framework Output

Measures how much shorter the Δ response is than the baseline, expressed as a percentage reduction in length. Read it alongside the similarity score to confirm that meaning was preserved.
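A minimal sketch of the percentage-reduction calculation follows. The page does not say whether length is counted in characters, words, or tokens, so the whitespace-delimited word count used here is an assumption, and `compression_pct` is a hypothetical name.

```python
def compression_pct(baseline: str, delta: str) -> float:
    """Percentage reduction in word count from the baseline to the Δ response.

    Assumption: length is the number of whitespace-delimited words.
    A negative result means the Δ response is longer than the baseline.
    """
    base_len = len(baseline.split())
    if base_len == 0:
        raise ValueError("baseline must contain at least one word")
    delta_len = len(delta.split())
    return 100.0 * (base_len - delta_len) / base_len
```

For example, a four-word baseline shortened to three words gives a 25% reduction.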

BERT-based cosine similarity score (0-1) showing how well the responses match in meaning.
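The cosine step can be sketched on its own, assuming each response has already been turned into a fixed-length embedding vector by some BERT-style encoder (the encoder itself is not shown, and the page does not name one). Note that raw cosine similarity lies in [-1, 1]; how the tool arrives at a 0-1 score is not stated, so no rescaling is applied here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.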

Tracks the relative increase in collaborative language patterns such as "would you like" or "can I".
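One way to approximate this is to count pattern matches in each response and compare the counts. The sketch below uses only the two phrases quoted above; the tool's full pattern list is not given, and `count_collaborative` and `relative_increase` are hypothetical names.

```python
import re
from typing import Optional

# Only the two phrases quoted on the page; the tool's real list is unknown.
COLLABORATIVE_PATTERNS = (r"\bwould you like\b", r"\bcan i\b")


def count_collaborative(text: str) -> int:
    """Count case-insensitive, whole-word matches of the collaborative patterns."""
    return sum(
        len(re.findall(pattern, text, flags=re.IGNORECASE))
        for pattern in COLLABORATIVE_PATTERNS
    )


def relative_increase(baseline: str, delta: str) -> Optional[float]:
    """Relative change in match count; None when the baseline has no matches."""
    base = count_collaborative(baseline)
    new = count_collaborative(delta)
    if base == 0:
        return None  # a relative increase from zero is undefined
    return (new - base) / base
```

The word boundaries keep "can it" from being counted as "can I". Returning `None` for a zero baseline is a design choice made here, not something the page specifies.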

Measures decrease in imperative language ("you must", "do not") which can feel authoritarian.
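As a rough illustration, the decrease could be measured as the drop in the share of sentences that contain an imperative marker. The per-sentence share, the naive sentence splitter, and the two-item marker list (taken from the examples above) are all assumptions of this sketch.

```python
import re

# Only the two markers quoted on the page; matched as lowercase substrings.
IMPERATIVE_MARKERS = ("you must", "do not")


def imperative_share(text: str) -> float:
    """Fraction of sentences containing at least one imperative marker."""
    # Naive split on ., ! and ? (assumption: good enough for short responses).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    flagged = sum(
        any(marker in sentence.lower() for marker in IMPERATIVE_MARKERS)
        for sentence in sentences
    )
    return flagged / len(sentences)


def imperative_decrease(baseline: str, delta: str) -> float:
    """Drop in imperative share from baseline to Δ; positive means fewer imperatives."""
    return imperative_share(baseline) - imperative_share(delta)
```

Using a share rather than a raw count keeps the metric from rewarding a response simply for being shorter.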

Flesch Reading Ease scores (0-100) comparing how easy each version is to understand.
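The standard Flesch Reading Ease formula is 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The sketch below applies it with a crude vowel-group syllable counter, which is an assumption and will miscount some words. The raw formula can also fall outside 0-100 for very simple or very dense text, so the tool presumably clamps or reports typical values.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per run of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease using naive sentence, word, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )
```

Higher scores mean easier text, so comparing the two versions amounts to computing the score for each and taking the difference.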

Detects shifts in sentiment polarity (positive, negative, or neutral) between the baseline and the Δ version.
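A toy sketch of shift detection follows. The page does not say which sentiment model the tool uses, so the tiny hand-written word lists below are purely illustrative stand-ins, and `polarity` and `polarity_shift` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Illustrative word lists only; a real evaluator would use a proper sentiment model.
POSITIVE = {"great", "good", "happy", "glad", "thanks"}
NEGATIVE = {"bad", "wrong", "never", "fail", "sorry"}


def polarity(text: str) -> str:
    """Label text by the sign of (positive word count - negative word count)."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def polarity_shift(baseline: str, delta: str) -> Optional[Tuple[str, str]]:
    """Return (baseline_label, delta_label) when the labels differ, else None."""
    before, after = polarity(baseline), polarity(delta)
    return None if before == after else (before, after)
```

With these lists, moving from "That is wrong." to "Good question, thanks!" is reported as a shift from negative to positive.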

Get precise metrics to improve your conversational AI frameworks.