- Use Pairwise Comparisons for Subjectivity: Instead of asking a model to rate a response on a scale of 1–10 (direct scoring), give it two responses and ask it to pick the better one. This significantly reduces variance and aligns much more closely with human preferences (see the pairwise sketch after this list).
- Force Chain-of-Thought (CoT): Always prompt the evaluator to explain its reasoning step by step before it outputs a final score or decision. This drastically improves accuracy (see the reason-then-score sketch below).
- Use a Panel of LLMs (PoLL): Relying on a single massive model (like GPT-4) is expensive and can introduce "intra-model bias." Instead, ensemble three smaller, diverse models and have them vote on the best response. It is cheaper and often more accurate (see the panel-voting sketch below).
- Tell Smart Models Not to Overthink: For simple tasks like checking if a generated answer matches a ground-truth reference, large models often over-analyze or inject outside knowledge. Explicitly prompting them with "do not overthink" keeps them on track (see the reference-match sketch below).
- Be Hyper-Specific with Criteria: Broad instructions like "evaluate for quality" lead to poor results. Break down your evaluation into strict, granular criteria (e.g., "does the response contain offensive language?" or "does it follow the JSON format strictly?"). Evaluate one dimension per prompt (see the per-criterion sketch below).
- Use Cross-Examination for Facts: If you need to check for factual accuracy, set up a multi-turn prompt where an "examiner" LLM asks the "examinee" LLM follow-up questions to sniff out inconsistencies (see the cross-examination sketch below).
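
Below is a minimal sketch of the pairwise pattern. `call_llm` is a hypothetical stand-in for whatever chat-completion client you use, and the prompt wording and the A/B parsing are illustrative rather than canonical.

```python
# Sketch: pairwise comparison judge. `call_llm` is a placeholder for your own
# LLM client; swap in whatever chat-completion call you actually use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM client")

PAIRWISE_PROMPT = """You are comparing two responses to the same question.

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Which response better answers the question? Reply with exactly "A" or "B"."""

def pick_better(question: str, response_a: str, response_b: str) -> str:
    verdict = call_llm(PAIRWISE_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b
    )).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```

In practice it also helps to run the comparison a second time with A and B swapped, since judges tend to show position bias; only count it as a win if both orderings agree.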
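
A sketch of the reason-then-score pattern, reusing the same hypothetical `call_llm` stub; asking for a JSON object on the last line is just one convenient way to keep the free-form reasoning separable from the final score.

```python
import json

def call_llm(prompt: str) -> str:  # stand-in for your LLM client
    raise NotImplementedError

COT_JUDGE_PROMPT = """Evaluate the response below against the criterion.

Criterion: {criterion}
Response: {response}

First, reason step by step about how well the response meets the criterion.
Then output a JSON object on the last line: {{"score": <1-5>}}"""

def judge_with_cot(criterion: str, response: str) -> int:
    raw = call_llm(COT_JUDGE_PROMPT.format(criterion=criterion, response=response))
    # The reasoning comes first; the score sits alone on the final line.
    return int(json.loads(raw.strip().splitlines()[-1])["score"])
```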
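
A sketch of panel voting under the same assumptions: the names in `PANEL` are placeholders for whichever three diverse judge models you pick, and `call_model` is a hypothetical wrapper that routes a prompt to a named model.

```python
from collections import Counter

def call_model(model: str, prompt: str) -> str:  # stand-in for your LLM client
    raise NotImplementedError

PANEL = ["judge-small-1", "judge-small-2", "judge-small-3"]  # placeholder names

VOTE_PROMPT = """Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Reply with exactly "A" or "B"."""

def panel_vote(question: str, response_a: str, response_b: str) -> str:
    votes = []
    for model in PANEL:
        verdict = call_model(model, VOTE_PROMPT.format(
            question=question, response_a=response_a, response_b=response_b
        )).strip().upper()
        votes.append("A" if verdict.startswith("A") else "B")
    # Majority vote across the panel decides the winner.
    return Counter(votes).most_common(1)[0][0]
```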
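
A sketch of the "do not overthink" reference check, again with a hypothetical `call_llm` stub; the exact wording matters less than being explicit that the judge should compare directly and not bring in outside knowledge.

```python
def call_llm(prompt: str) -> str:  # stand-in for your LLM client
    raise NotImplementedError

MATCH_PROMPT = """Does the candidate answer match the reference answer?
Do not overthink this: compare them directly, do not use outside knowledge,
and ignore wording differences that do not change the meaning.

Reference answer: {reference}
Candidate answer: {candidate}

Reply with exactly "MATCH" or "NO MATCH"."""

def matches_reference(candidate: str, reference: str) -> bool:
    verdict = call_llm(MATCH_PROMPT.format(reference=reference, candidate=candidate))
    return verdict.strip().upper().startswith("MATCH")
```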
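
A sketch of per-criterion evaluation: each criterion is a narrow yes/no question asked in its own prompt, and the results are collected separately. The criteria, the prompt wording, and `call_llm` are all placeholders to adapt.

```python
def call_llm(prompt: str) -> str:  # stand-in for your LLM client
    raise NotImplementedError

# Each criterion is a single narrow question, judged in its own prompt.
CRITERIA = {
    "contains_offensive_language": "Does the response contain offensive language?",
    "follows_json_format": "Does the response strictly follow the required JSON format?",
}

CRITERION_PROMPT = """Answer the question about the response with exactly "YES" or "NO".

Question: {question}
Response: {response}"""

def evaluate_criteria(response: str) -> dict[str, bool]:
    results = {}
    for name, question in CRITERIA.items():
        verdict = call_llm(CRITERION_PROMPT.format(question=question, response=response))
        results[name] = verdict.strip().upper().startswith("YES")
    return results
```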
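
Finally, a sketch of the cross-examination loop. A single hypothetical `call_llm` stub stands in for both roles here; in a real setup the examinee calls would go to the model that produced the original answer, and the examiner would be a separate judge.

```python
def call_llm(prompt: str) -> str:  # stand-in for your LLM client
    raise NotImplementedError

def cross_examine(question: str, answer: str, n_rounds: int = 3) -> str:
    """Examiner probes the examinee's answer for factual inconsistencies."""
    transcript = f"Original question: {question}\nExaminee's answer: {answer}\n"
    for _ in range(n_rounds):
        follow_up = call_llm(
            "You are an examiner checking an answer for factual errors.\n"
            f"{transcript}\nAsk one short follow-up question that would expose "
            "an inconsistency if the answer is wrong."
        )
        reply = call_llm(
            f"{transcript}\nFollow-up question: {follow_up}\nAnswer briefly."
        )
        transcript += f"\nExaminer: {follow_up}\nExaminee: {reply}"
    return call_llm(
        f"{transcript}\n\nBased on this cross-examination, is the original answer "
        'factually consistent? Reply with exactly "CONSISTENT" or "INCONSISTENT".'
    )
```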