3D Judge
To reach SoTA, the system should:
- outperform CLIP-style and image-only judge baselines,
- outperform ImageReward-style preference predictors,
- and beat Eval3D's own evaluators on human correlation or pairwise agreement.
The goal is a judge that measures real 3D structural correctness, not just 2D plausibility.
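Both targets reduce to simple statistics once judge scores and human ratings are paired per prompt. A minimal sketch of the two metrics, using toy numbers and illustrative names (not Eval3D data), with Spearman rank correlation as one common reading of "human correlation":

```python
# Minimal sketch: human correlation and pairwise agreement for one prompt.
# Scores below are toy values; real evaluation would loop over all prompts.
from itertools import combinations
from scipy.stats import spearmanr

judge_scores = [0.81, 0.42, 0.67, 0.90]  # hypothetical judge outputs
human_scores = [4.0, 2.0, 3.5, 4.5]      # hypothetical human ratings

# Human correlation: rank correlation between judge and human scores.
rho, _ = spearmanr(judge_scores, human_scores)

# Pairwise agreement: fraction of asset pairs where the judge prefers
# the same asset as the humans (ties in human ratings skipped).
pairs = [
    (i, j) for i, j in combinations(range(len(judge_scores)), 2)
    if human_scores[i] != human_scores[j]
]
agreement = sum(
    (judge_scores[i] > judge_scores[j]) == (human_scores[i] > human_scores[j])
    for i, j in pairs
) / len(pairs)

print(f"Spearman rho: {rho:.2f}, pairwise agreement: {agreement:.2%}")
```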
Research Gap
- Janus and view inconsistency
- 3D outputs can look plausible in one view while failing across views (a naive cross-view check is sketched after this list).
- Data scarcity
- Good 3D data is scarce, expensive, and narrow.
- Benchmark weakness
- Current metrics over-rely on image-text similarity and under-measure true 3D correctness.
- Compute and representation limits
- 3D models remain expensive in tokens, compute, and representation overhead.
- Open limitations across the literature
- Existing systems still struggle with robust structural verification.
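To make the view-inconsistency point concrete, the sketch below runs a naive cross-view probe: embed several renders of one asset with CLIP and inspect the worst pairwise similarity. This is exactly the kind of 2D-embedding heuristic the gap list criticizes, shown only to illustrate why single-view plausibility is not enough; the render directory, model choice, and threshold are assumptions, not part of any cited system.

```python
# Naive cross-view consistency probe over pre-rendered views of one asset.
# Paths, model, and threshold are placeholders.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical directory of renders taken at evenly spaced azimuths.
views = [Image.open(p) for p in sorted(Path("renders/asset_0001").glob("*.png"))]

with torch.no_grad():
    feats = model.get_image_features(**processor(images=views, return_tensors="pt"))
feats = feats / feats.norm(dim=-1, keepdim=True)

# Minimum pairwise cosine similarity across views; a single bad viewpoint
# (e.g. a Janus back face) drags this down even when the front view looks fine.
sim = feats @ feats.T
min_sim = sim[~torch.eye(len(views), dtype=torch.bool)].min().item()
print(f"min cross-view similarity: {min_sim:.3f}")  # flag below ~0.7 (arbitrary cutoff)
```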
Benchmarks
- Eval3D
- Primary benchmark for a new 3D judge.
- 160 prompts, generated 3D assets, and human ratings.
- Main target: exceed the commonly cited 83% to 88% human-agreement range.
- T3-Bench
- Secondary benchmark; its older multi-view image-text evaluators serve as additional baselines.
- Core grading axes (a per-axis scoring sketch follows this list)
- Geometric consistency: detect floaters, noisy surfaces, and texture-geometry mismatch.
- Structural consistency: detect Janus failures, duplicated parts, and impossible shape layouts.
- Semantic consistency: detect objects whose identity changes across viewpoints.
- Prompt alignment: detect missing prompt-critical details.
- 3D MM-Vet
- Useful only if the judge is an MLLM.
- Validates basic 3D spatial and visual grounding.
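A judge that reports one number per grading axis keeps results comparable across these benchmarks. A minimal sketch of such a score record, with assumed field names and a uniform weighting that is not part of Eval3D:

```python
# Illustrative per-axis score record mirroring the grading axes above;
# field names, weights, and the aggregation rule are assumptions.
from dataclasses import dataclass

@dataclass
class JudgeScore:
    geometric: float         # floaters, noisy surfaces, texture-geometry mismatch
    structural: float        # Janus faces, duplicated parts, impossible layouts
    semantic: float          # identity drift across viewpoints
    prompt_alignment: float  # missing prompt-critical details

    def overall(self, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
        axes = (self.geometric, self.structural, self.semantic, self.prompt_alignment)
        return sum(w * a for w, a in zip(weights, axes))

score = JudgeScore(geometric=0.9, structural=0.4, semantic=0.8, prompt_alignment=0.7)
print(f"overall: {score.overall():.2f}")  # a Janus failure tanks the structural axis
```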