3D Judge
To reach SoTA, the system should:
- outperform CLIP-style and image-only judge baselines,
- outperform ImageReward-style preference predictors,
- and beat Eval3D's own evaluators on human correlation or pairwise agreement.
The goal is a judge that measures real 3D structural correctness, not just 2D plausibility.
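Both targets reduce to simple statistics once judge scores and human ratings are paired per prompt. A minimal sketch of the two metrics, using toy numbers and illustrative names (not Eval3D data), with Spearman rank correlation as one common reading of "human correlation":

```python
# Minimal sketch: human correlation and pairwise agreement for one prompt.
# Scores below are toy values; real evaluation would loop over all prompts.
from itertools import combinations
from scipy.stats import spearmanr

judge_scores = [0.81, 0.42, 0.67, 0.90]  # hypothetical judge outputs
human_scores = [4.0, 2.0, 3.5, 4.5]      # hypothetical human ratings

# Human correlation: rank correlation between judge and human scores.
rho, _ = spearmanr(judge_scores, human_scores)

# Pairwise agreement: fraction of asset pairs where the judge prefers
# the same asset as the humans (ties in human ratings skipped).
pairs = [
    (i, j) for i, j in combinations(range(len(judge_scores)), 2)
    if human_scores[i] != human_scores[j]
]
agreement = sum(
    (judge_scores[i] > judge_scores[j]) == (human_scores[i] > human_scores[j])
    for i, j in pairs
) / len(pairs)

print(f"Spearman rho: {rho:.2f}, pairwise agreement: {agreement:.2%}")
```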
Research Gap
- Janus and view inconsistency
- 3D outputs can look plausible in one view while failing across views (a naive cross-view check is sketched after this list).
- Data scarcity
- Good 3D data is scarce, expensive, and narrow.
- Benchmark weakness
- Current metrics over-rely on image-text similarity and under-measure true 3D correctness.
- Compute and representation limits
- 3D models remain expensive in tokens, compute, and representation overhead.
- Open limitations across the literature
- Existing systems still struggle with robust structural verification.
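To make the view-inconsistency point concrete, the sketch below runs a naive cross-view probe: embed several renders of one asset with CLIP and inspect the worst pairwise similarity. This is exactly the kind of 2D-embedding heuristic the gap list criticizes, shown only to illustrate why single-view plausibility is not enough; the render directory, model choice, and threshold are assumptions, not part of any cited system.

```python
# Naive cross-view consistency probe over pre-rendered views of one asset.
# Paths, model, and threshold are placeholders.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical directory of renders taken at evenly spaced azimuths.
views = [Image.open(p) for p in sorted(Path("renders/asset_0001").glob("*.png"))]

with torch.no_grad():
    feats = model.get_image_features(**processor(images=views, return_tensors="pt"))
feats = feats / feats.norm(dim=-1, keepdim=True)

# Minimum pairwise cosine similarity across views; a single bad viewpoint
# (e.g. a Janus back face) drags this down even when the front view looks fine.
sim = feats @ feats.T
min_sim = sim[~torch.eye(len(views), dtype=torch.bool)].min().item()
print(f"min cross-view similarity: {min_sim:.3f}")  # flag below ~0.7 (arbitrary cutoff)
```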
Benchmarks
- Eval3D
- Primary benchmark for a new 3D judge.
- 160 prompts, generated 3D assets, and human ratings.
- Main target: exceed the commonly cited 83% to 88% human-agreement range.
- T3-Bench
- Secondary benchmark; its older multi-view image-text evaluators serve as additional baselines.
- Core grading axes (a per-axis scoring sketch follows this list)
- Geometric consistency: detect floaters, noisy surfaces, and texture-geometry mismatch.
- Structural consistency: detect Janus failures, duplicated parts, and impossible shape layouts.
- Semantic consistency: detect objects whose identity changes across viewpoints.
- Prompt alignment: detect missing prompt-critical details.
- 3D MM-Vet
- Useful only if the judge is an MLLM.
- Validates basic 3D spatial and visual grounding.
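A judge that reports one number per grading axis keeps results comparable across these benchmarks. A minimal sketch of such a score record, with assumed field names and a uniform weighting that is not part of Eval3D:

```python
# Illustrative per-axis score record mirroring the grading axes above;
# field names, weights, and the aggregation rule are assumptions.
from dataclasses import dataclass

@dataclass
class JudgeScore:
    geometric: float         # floaters, noisy surfaces, texture-geometry mismatch
    structural: float        # Janus faces, duplicated parts, impossible layouts
    semantic: float          # identity drift across viewpoints
    prompt_alignment: float  # missing prompt-critical details

    def overall(self, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
        axes = (self.geometric, self.structural, self.semantic, self.prompt_alignment)
        return sum(w * a for w, a in zip(weights, axes))

score = JudgeScore(geometric=0.9, structural=0.4, semantic=0.8, prompt_alignment=0.7)
print(f"overall: {score.overall():.2f}")  # a Janus failure tanks the structural axis
```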