A2UI Benchmark

Local-inference UI generation · LLM-as-Judge
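The LLM-as-Judge step can be sketched as follows: the judge model is shown the user prompt and the generated A2UI payload and asked for a one-line verdict, which is then parsed into a pass/fail grade. The prompt template and the `VERDICT:` reply format below are illustrative assumptions, not the benchmark's actual templates.

```python
import re

# Hypothetical judge prompt template (an assumption, not the benchmark's own wording).
JUDGE_TEMPLATE = """You are grading a UI-generation attempt.
User prompt:
{prompt}

Model output (A2UI JSON):
{payload}

Reply with exactly one line: VERDICT: correct or VERDICT: incorrect."""


def build_judge_prompt(prompt: str, payload: str) -> str:
    """Fill the judge template with the prompt and the model's payload."""
    return JUDGE_TEMPLATE.format(prompt=prompt, payload=payload)


def parse_verdict(judge_reply: str) -> bool:
    """Return True only when the judge's reply contains 'VERDICT: correct'."""
    m = re.search(r"VERDICT:\s*(correct|incorrect)", judge_reply, re.IGNORECASE)
    return bool(m) and m.group(1).lower() == "correct"
```

A run would send `build_judge_prompt(...)` to the judge model and feed the reply to `parse_verdict`; the share of `True` verdicts over a selection is the semantic-accuracy figure reported below.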

Benchmark reports

One generated PDF per machine, containing full prompts, completions, and judge rationales
DGX Spark GB10 · Linux · vLLM + llama.cpp · 16 model evaluations (PDF)
MacBook Pro M2 Max · macOS · llama.cpp · 8 model evaluations (PDF)
MacBook Pro M1 Max · macOS · llama.cpp · 8 model evaluations (PDF)
Summary (current selection)
Semantic accuracy (%) · average across selection, with per-model best
Schema accuracy (%) · share of outputs that are valid A2UI JSON
Avg inference (s) · per-prompt latency
Prompts evaluated · count in current selection
Output tokens · total generated
Distinct models · with engine and family counts
Metric definitions
Schema accuracy · % of prompts producing a valid A2UI payload (higher is better)
Semantic accuracy · % of prompts judged correct by the LLM judge (higher is better)
Inference time · average seconds per prompt (lower is better)
Output tokens generated · total decoded tokens over the selection (sum)
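The metric definitions above can be sketched as a small aggregation over per-prompt run records. The record field names (`payload`, `judge_correct`, `seconds`, `output_tokens`) and the validity check (output parses as a JSON object) are assumptions for illustration, not the benchmark's actual schema.

```python
import json


def is_valid_a2ui(payload: str) -> bool:
    """Schema accuracy counts outputs that parse as a JSON object.
    (A real check would also validate against the A2UI schema; this is
    a simplified assumption.)"""
    try:
        return isinstance(json.loads(payload), dict)
    except json.JSONDecodeError:
        return False


def summarize(runs: list[dict]) -> dict:
    """Compute the four headline metrics over a selection of runs."""
    n = len(runs)
    return {
        # higher is better
        "schema_accuracy": 100.0 * sum(is_valid_a2ui(r["payload"]) for r in runs) / n,
        # higher is better (judge verdicts)
        "semantic_accuracy": 100.0 * sum(r["judge_correct"] for r in runs) / n,
        # lower is better
        "avg_inference_s": sum(r["seconds"] for r in runs) / n,
        # plain sum of decoded tokens
        "output_tokens": sum(r["output_tokens"] for r in runs),
    }


# Illustrative records: one valid + judged-correct output, one invalid output.
runs = [
    {"payload": '{"type": "button"}', "judge_correct": True,  "seconds": 1.2, "output_tokens": 40},
    {"payload": "not json",           "judge_correct": False, "seconds": 0.8, "output_tokens": 25},
]
print(summarize(runs))
```

With these two sample records, both accuracy figures come out at 50%, average inference is 1.0 s, and 65 output tokens are counted.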
Per-model results
Click a row to inspect per-prompt grading