Benchmark reports
Generated PDF per machine — full prompts, completions, judge rationales.

Summary cards (over the current selection):

- Semantic accuracy (%): average across selection · best
- Schema accuracy (%): share of valid A2UI JSON (checked roughly as sketched below)
- Avg inference (s): per-prompt latency
- Prompts evaluated: count in the current selection
- Output tokens: total generated
- Distinct models: engines · families
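The "valid A2UI JSON share" card implies a per-completion schema check. A minimal sketch of one way to do it, assuming the A2UI schema is available as a JSON Schema document and validated with the `jsonschema` package; the function name and this validation route are illustrative assumptions, not the report generator's actual mechanism:

```python
import json
import jsonschema  # pip install jsonschema

def is_valid_a2ui(completion: str, a2ui_schema: dict) -> bool:
    """True when the completion parses as JSON and matches the A2UI schema."""
    try:
        payload = json.loads(completion)
        jsonschema.validate(instance=payload, schema=a2ui_schema)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False
```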
Metric definitions:

- Schema accuracy: % of prompts producing a valid A2UI payload (higher is better)
- Semantic accuracy: % of prompts judged correct by the LLM judge (higher is better)
- Inference time: average seconds per prompt (lower is better)
- Output tokens generated: total decoded tokens over the selection (sum)
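A minimal sketch of how these four aggregates could be computed from per-prompt records; the `PromptResult` fields (`schema_valid`, `judge_correct`, `latency_s`, `input_tokens`, `output_tokens`) are assumed names, not the report generator's actual data model:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class PromptResult:
    schema_valid: bool    # completion parsed and matched the A2UI schema
    judge_correct: bool   # LLM judge marked the completion correct
    latency_s: float      # wall-clock seconds for this prompt
    input_tokens: int     # prompt tokens fed to the model
    output_tokens: int    # decoded tokens in the completion

def summarize(results: list[PromptResult]) -> dict:
    """The four headline aggregates over the current selection."""
    n = len(results)
    return {
        "schema_accuracy_pct": 100 * sum(r.schema_valid for r in results) / n,
        "semantic_accuracy_pct": 100 * sum(r.judge_correct for r in results) / n,
        "avg_inference_s": mean(r.latency_s for r in results),
        "output_tokens_total": sum(r.output_tokens for r in results),
    }
```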
Per-model results
Click a row to inspect per-prompt grading. When no rows match the current filters, the table shows the empty state "No rows match the current filters."

Each completed run expands into a per-prompt detail view that flags the best (★) and worst (▼) prompts and reports, per model:

- Schema / pass
- Semantic C / P / I
- Avg / median (s)
- Min / max (s)
- Tokens in / out
- Quantization

Runs with no stored grading data show "No per-prompt detail available."
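A sketch of how these per-model rows might be assembled, reusing the hypothetical `PromptResult` above; the best/worst ranking (judge-correct first, then latency) is an illustrative assumption, since the report's actual criterion isn't shown:

```python
from statistics import mean, median

def model_row(model: str, results: list[PromptResult]) -> dict:
    """Aggregate one model's per-prompt results into a table row."""
    latencies = sorted(r.latency_s for r in results)
    # Assumed ranking: judge-correct prompts first, fast before slow,
    # so ranked[0] is the "★ Best" prompt and ranked[-1] the "▼ Worst".
    ranked = sorted(results, key=lambda r: (not r.judge_correct, r.latency_s))
    return {
        "model": model,
        "best": ranked[0],
        "worst": ranked[-1],
        "avg_s": mean(latencies),
        "median_s": median(latencies),
        "min_s": latencies[0],
        "max_s": latencies[-1],
        "tokens_in": sum(r.input_tokens for r in results),
        "tokens_out": sum(r.output_tokens for r in results),
    }
```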