Evals — Council vs Single

The Council should beat a single model, and that has to be proven. The multi-perspective claim must be shown to be measurably better through evals, not just asserted.

The evals/ directory holds Council-vs-Single comparison data: the same tasks answered by a single model and by the Council, scored and published.
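As a rough illustration of what one scored comparison could look like, here is a minimal sketch. The record shape, the field names, the family labels, and the 0.0 to 1.0 rubric score are all assumptions made for this example, not the actual layout of evals/. The numbers are made up to show the computation and are not results.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PairedResult:
    """One task answered twice: by a single model and by the Council.

    Hypothetical record shape; the real evals/ format may differ.
    """
    task_id: str
    family: str           # assumed labels, e.g. "code_review"
    single_score: float   # assumed rubric score in [0.0, 1.0]
    council_score: float  # same rubric, same task


def mean_delta_by_family(results):
    """Return {family: mean(council_score - single_score)}.

    A positive value means the Council scored higher on average
    for that family; a negative value means it scored lower.
    """
    deltas = defaultdict(list)
    for r in results:
        deltas[r.family].append(r.council_score - r.single_score)
    return {family: mean(values) for family, values in deltas.items()}


# Made-up scores, purely to show the shape of the computation.
example = [
    PairedResult("cr-001", "code_review", 0.50, 0.75),
    PairedResult("cr-002", "code_review", 0.25, 0.50),
    PairedResult("dm-001", "decision_memo", 0.75, 0.50),
]
summary = mean_delta_by_family(example)
```

Pairing scores by task, rather than averaging each side separately, keeps the comparison on identical inputs. A negative mean for a family is exactly the kind of result that publishing the data is meant to expose.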

Candidate task families

The most convincing evidence comes from tasks where a skeptic/reviewer perspective changes the outcome:

Code review: does the Council catch real bugs a single chat misses?
Decision memos: does surfaced dissent improve the recommendation?
Document triage: does multi-perspective reading reduce false conclusions?
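For the code review family, one possible way to make "catches real bugs" measurable is recall against a set of known bugs seeded into the code under review. The sketch below assumes that setup. The bug labels and the `bug_recall` helper are hypothetical, and the flagged sets are invented for illustration.

```python
def bug_recall(seeded_bugs, flagged):
    """Fraction of known (seeded) bugs that a reviewer flagged.

    Both arguments are sets of bug identifiers. Findings that are
    not in seeded_bugs (e.g. style nits) do not affect the result.
    """
    if not seeded_bugs:
        raise ValueError("need at least one seeded bug")
    return len(seeded_bugs & flagged) / len(seeded_bugs)


# Hypothetical seeded bugs and invented review output.
seeded = {"off_by_one", "null_deref", "race_condition", "sql_injection"}
single_flags = {"off_by_one", "style_nit"}
council_flags = {"off_by_one", "null_deref", "race_condition", "style_nit"}

single_recall = bug_recall(seeded, single_flags)    # 1 of 4 seeded bugs
council_recall = bug_recall(seeded, council_flags)  # 3 of 4 seeded bugs
```

Recall alone ignores false positives, so a real eval would probably pair it with a precision-style measure to avoid rewarding a reviewer that flags everything.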

Choosing the most convincing eval set is an open question.

Why it’s in the repo

Publishing the eval data makes the central differentiator falsifiable. If the Council doesn’t beat a single model on a task family, that shows up in the published scores, and a contributor can improve the recipes until it does. This work lands in roadmap phase v0.4.