3D Judge
Goal: build a judge that measures real 3D structural correctness, not just 2D plausibility. The system should:
- outperform CLIP-style and image-only judge baselines,
- outperform prior preference predictors such as ImageReward-style approaches,
- and beat Eval3D on human correlation or pairwise agreement.
Research Gap
- Janus and view inconsistency
- A core failure mode is structural inconsistency across views: multiple faces, impossible topology, or objects that change identity as the camera moves.
- Data scarcity
- High-quality 3D data is limited, expensive to label, and often too narrow.
- Benchmark weakness
- Many current metrics over-index on text-image similarity and under-measure geometry, structure, and spatial coherence.
- There is still a gap between “looks plausible in rendered views” and “is actually correct as a 3D asset.”
- Compute and representation limits
- 3D models still pay heavy costs in pretraining time, token count, and representation overhead.
- Dynamic scenes remain especially expensive because they require continuous updates and multi-view consistency.
- Open limitations across the literature
- Current systems still struggle with static-only assumptions, weak room-level context, limited part-level understanding, and imperfect foundation-model backbones.
- This leaves room for a judge that is explicitly optimized for robust structural verification instead of generic generation or captioning.
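One way to operationalize "robust structural verification" is a cross-view identity check: embed each rendered view and flag assets whose views drift apart semantically (the Janus / identity-change failure above). A minimal sketch, assuming per-view embeddings already come from some CLIP-style encoder; the encoder, the number of views, and any pass/fail threshold are assumptions, not part of these notes:

```python
import numpy as np

def view_consistency_score(view_embeddings: np.ndarray) -> float:
    """Score cross-view semantic consistency for one asset.

    view_embeddings: (n_views, dim) array of per-view image embeddings
    (hypothetically from a CLIP-style encoder). Returns the minimum
    pairwise cosine similarity across views: a low value suggests the
    object changes identity as the camera moves.
    """
    normed = view_embeddings / np.linalg.norm(view_embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                 # pairwise cosine similarities
    n = sims.shape[0]
    off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarities
    return float(off_diag.min())

# Toy usage with synthetic embeddings: four near-identical views score
# higher than a set where one view drifts to an unrelated embedding.
rng = np.random.default_rng(0)
base = rng.normal(size=8)
consistent = np.stack([base + 0.01 * rng.normal(size=8) for _ in range(4)])
drifted = consistent.copy()
drifted[3] = rng.normal(size=8)              # unrelated fourth view
assert view_consistency_score(consistent) > view_consistency_score(drifted)
```

A real judge would combine this signal with geometry-aware checks, since embedding similarity alone still rewards 2D plausibility.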
Benchmarks
- Eval3D is the primary benchmark
- Use the Eval3D benchmark as the main reference set for a new 3D judge.
- It includes 160 prompts across single objects, multiple objects, and more complex scene compositions.
- The key target is pairwise agreement with human preference. A competitive judge should exceed the commonly cited Eval3D baseline range of roughly 83% to 88%.
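The target metric above is simple to state precisely: over a set of human preference pairs, count how often the judge ranks the human-preferred asset higher. A minimal sketch with made-up IDs and scores (the data structures are assumptions, not Eval3D's actual format):

```python
def pairwise_agreement(judge_scores: dict, human_prefs: list) -> float:
    """judge_scores: asset_id -> scalar judge score.
    human_prefs: list of (winner_id, loser_id) human preference pairs.
    Returns the fraction of pairs where the judge ranks the human
    winner strictly above the human loser."""
    correct = sum(judge_scores[w] > judge_scores[l] for w, l in human_prefs)
    return correct / len(human_prefs)

scores = {"a": 0.9, "b": 0.4, "c": 0.7}       # hypothetical judge outputs
prefs = [("a", "b"), ("c", "b"), ("b", "a")]  # hypothetical human pairs
agreement = pairwise_agreement(scores, prefs)  # 2 of 3 pairs agree
```

Hitting the 83-88% range on this metric is the headline comparison against the Eval3D baselines.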
- T3-Bench is the secondary baseline
- It is helpful for comparing against prior evaluators built around multi-view image-text scoring.
- Use it to show that your judge improves on earlier prompt-alignment and subjective-quality baselines, not just newer judge models.
- Core grading axes
- Geometric consistency: detect floaters, noisy surfaces, and texture-geometry mismatch.
- Structural consistency: detect Janus failures, duplicated parts, and impossible shape layouts.
- Semantic consistency: detect objects whose identity changes across viewpoints.
- Prompt alignment: detect mismatches between the generated asset and the prompt's objects, attributes, and spatial relations.
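The four axes above suggest a per-axis report rather than a single scalar. A minimal sketch of that schema; the field names mirror the axes, but the equal-weight aggregate is a placeholder assumption (real weights would be fit to human preference data):

```python
from dataclasses import dataclass

@dataclass
class JudgeReport:
    geometric: float         # floaters, noisy surfaces, texture-geometry mismatch
    structural: float        # Janus faces, duplicated parts, impossible layouts
    semantic: float          # identity stable across viewpoints
    prompt_alignment: float  # match to the text prompt

    def overall(self) -> float:
        # Equal weights as a placeholder; fit to human data in practice.
        return (self.geometric + self.structural
                + self.semantic + self.prompt_alignment) / 4.0

report = JudgeReport(geometric=0.8, structural=0.5,
                     semantic=0.9, prompt_alignment=0.7)
```

Keeping the axes separate also makes failure analysis possible: an asset can align with the prompt while failing structurally.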
- 3D MM-Vet is useful for LLM-based judges
- If the judge is an MLLM rather than a pure scoring model, use 3D MM-Vet to validate that it actually understands 3D spatial relations, object parts, and visual grounding before deploying it as a judge.