A2UI Benchmark

Local-inference UI generation · LLM-as-Judge
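The LLM-as-Judge step can be sketched as follows: the judge model is shown the user prompt and the generated A2UI payload and asked for a one-line verdict, which is then parsed into a pass/fail grade. The prompt template and the `VERDICT:` reply format below are illustrative assumptions, not the benchmark's actual templates.

```python
import re

# Hypothetical judge prompt template (an assumption, not the benchmark's own wording).
JUDGE_TEMPLATE = """You are grading a UI-generation attempt.
User prompt:
{prompt}

Model output (A2UI JSON):
{payload}

Reply with exactly one line: VERDICT: correct or VERDICT: incorrect."""


def build_judge_prompt(prompt: str, payload: str) -> str:
    """Fill the judge template with the prompt and the model's payload."""
    return JUDGE_TEMPLATE.format(prompt=prompt, payload=payload)


def parse_verdict(judge_reply: str) -> bool:
    """Return True only when the judge's reply contains 'VERDICT: correct'."""
    m = re.search(r"VERDICT:\s*(correct|incorrect)", judge_reply, re.IGNORECASE)
    return bool(m) and m.group(1).lower() == "correct"
```

A run would send `build_judge_prompt(...)` to the judge model and feed the reply to `parse_verdict`; the share of `True` verdicts over a selection is the semantic-accuracy figure reported below.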

Benchmark reports

One generated PDF per machine, containing full prompts, completions, and judge rationales
DGX Spark GB10 · Linux · vLLM + llama.cpp · 16 model evaluations (PDF)
MacBook Pro M2 Max · macOS · llama.cpp · 8 model evaluations (PDF)
MacBook Pro M1 Max · macOS · llama.cpp · 8 model evaluations (PDF)
Summary (current selection)
Semantic accuracy (%) · average across selection, with per-model best
Schema accuracy (%) · share of outputs that are valid A2UI JSON
Avg inference (s) · per-prompt latency
Prompts evaluated · count in current selection
Output tokens · total generated
Distinct models · with engine and family counts
Metric definitions
Schema accuracy · % of prompts producing a valid A2UI payload (higher is better)
Semantic accuracy · % of prompts judged correct by the LLM judge (higher is better)
Inference time · average seconds per prompt (lower is better)
Output tokens generated · total decoded tokens over the selection (sum)
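The metric definitions above can be sketched as a small aggregation over per-prompt run records. The record field names (`payload`, `judge_correct`, `seconds`, `output_tokens`) and the validity check (output parses as a JSON object) are assumptions for illustration, not the benchmark's actual schema.

```python
import json


def is_valid_a2ui(payload: str) -> bool:
    """Schema accuracy counts outputs that parse as a JSON object.
    (A real check would also validate against the A2UI schema; this is
    a simplified assumption.)"""
    try:
        return isinstance(json.loads(payload), dict)
    except json.JSONDecodeError:
        return False


def summarize(runs: list[dict]) -> dict:
    """Compute the four headline metrics over a selection of runs."""
    n = len(runs)
    return {
        # higher is better
        "schema_accuracy": 100.0 * sum(is_valid_a2ui(r["payload"]) for r in runs) / n,
        # higher is better (judge verdicts)
        "semantic_accuracy": 100.0 * sum(r["judge_correct"] for r in runs) / n,
        # lower is better
        "avg_inference_s": sum(r["seconds"] for r in runs) / n,
        # plain sum of decoded tokens
        "output_tokens": sum(r["output_tokens"] for r in runs),
    }


# Illustrative records: one valid + judged-correct output, one invalid output.
runs = [
    {"payload": '{"type": "button"}', "judge_correct": True,  "seconds": 1.2, "output_tokens": 40},
    {"payload": "not json",           "judge_correct": False, "seconds": 0.8, "output_tokens": 25},
]
print(summarize(runs))
```

With these two sample records, both accuracy figures come out at 50%, average inference is 1.0 s, and 65 output tokens are counted.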
Per-model results
Click a row to inspect per-prompt grading