TRUST & METHODOLOGY
The research, the mechanism, and the principles behind multi-model deliberation.
In statistical learning, ensembles of diverse models typically outperform their individual members. The same principle applies to language models: aggregating diverse perspectives reduces the variance of any single predictor and surfaces blind spots that one model alone would miss.
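The variance claim can be made concrete with the textbook formula for averaged predictors. The sketch below is only an analogy: it assumes numeric predictors with equal variance `sigma2` and equal pairwise correlation `rho`, and the function name is hypothetical. Language-model answers are not literally averaged, but the formula shows why diversity (low correlation) matters.

```python
def ensemble_variance(sigma2: float, n: int, rho: float = 0.0) -> float:
    """Variance of the mean of n predictors.

    Assumes every predictor has variance sigma2 and every pair has
    correlation rho:  Var(mean) = rho * sigma2 + (1 - rho) * sigma2 / n
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return rho * sigma2 + (1.0 - rho) * sigma2 / n


# Four uncorrelated predictors cut the variance to a quarter.
print(ensemble_variance(1.0, 4))           # 0.25
# Correlated predictors help much less: the rho * sigma2 term never shrinks.
print(ensemble_variance(1.0, 4, rho=0.5))  # 0.625
```

Under these assumptions, adding more members only shrinks the second term, so a council of near-identical models gains little.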
A 2024 study by MIT researchers, "Improving Factuality and Reasoning in Language Models through Multiagent Debate", found that models give more accurate answers on arithmetic, factuality, and reasoning benchmarks when they critique each other's responses.
Independent testing across 100 expert-level questions in finance, law, medicine, and technology has shown multi-model synthesis matching or outperforming the best individual frontier model — without performance degradation on the questions where a single model was already correct.
This isn't a theory. It's a documented, replicated finding across multiple research groups. See our live benchmark page for current frontier-model scores and how a council composite compares.
No model sees the others' work during Stage 1. This prevents anchoring bias — the well-documented effect where the first answer in a sequence pulls subsequent answers toward it.
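A minimal sketch of how independent Stage 1 querying could look, assuming each model is wrapped in an async callable. The names `ModelFn`, `run_stage_one`, and `make_stub` are illustrative, not LLM Counsel's actual API. The point is that every member receives only the original question.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical type: a model client takes a prompt and returns its answer.
ModelFn = Callable[[str], Awaitable[str]]


async def run_stage_one(question: str, council: Dict[str, ModelFn]) -> Dict[str, str]:
    """Query every council member concurrently with the same question."""
    names = list(council)
    # Each call receives only the question; no answer is passed to another member.
    answers = await asyncio.gather(*(council[name](question) for name in names))
    return dict(zip(names, answers))


def make_stub(reply: str) -> ModelFn:
    """Stand-in for a real model client, used only for illustration."""
    async def stub(prompt: str) -> str:
        return f"{reply} (saw: {prompt})"
    return stub
```

Calling `asyncio.run(run_stage_one("q", {"m1": make_stub("yes"), "m2": make_stub("no")}))` returns one answer per member, each produced from the question alone.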
During Stage 2, each model evaluates responses labelled A, B, C, D — without knowing which model produced which. This prevents reputation effects and removes any bias toward "what the famous model would say."
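One way the blind labelling could be implemented is sketched below. The function name and the shuffle-then-label approach are assumptions for illustration. Reviewers would see only the labelled responses, while the label-to-model mapping is kept separately.

```python
import random
import string
from typing import Dict, Optional, Tuple


def anonymise(
    responses: Dict[str, str], rng: Optional[random.Random] = None
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (label -> response text, label -> model name).

    Models are shuffled first so that a label does not reveal the
    order in which models were configured.
    """
    if len(responses) > len(string.ascii_uppercase):
        raise ValueError("more responses than available labels")
    rng = rng or random.Random()
    names = list(responses)
    rng.shuffle(names)
    labelled: Dict[str, str] = {}
    mapping: Dict[str, str] = {}
    for label, name in zip(string.ascii_uppercase, names):
        labelled[label] = responses[name]  # shown to reviewers
        mapping[label] = name              # kept private
    return labelled, mapping
```

Only `labelled` would be placed in a reviewer's prompt; `mapping` stays with the caller until labels need to be resolved back to model names.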
The synthesis model, called the chairman, has one job: integration, not opinion. It weighs the peer reviews, not its own preferences. The chairman is never also a council member, and that separation is enforced in code.
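The source does not show how the separation is enforced. A hypothetical configuration check, with an illustrative function name and model identifiers as plain strings, might look like this:

```python
from typing import Iterable, List


def validate_roles(chairman: str, council: Iterable[str]) -> List[str]:
    """Reject any configuration where the chairman also sits on the council."""
    members = list(council)
    if not members:
        raise ValueError("council must have at least one member")
    if chairman in members:
        raise ValueError(f"chairman {chairman!r} must not also sit on the council")
    return members
```

Running a check like this before any model is called would make an invalid configuration fail early rather than producing a synthesis biased toward the chairman's own answer.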
LLM Counsel was inspired by Andrej Karpathy's open-source LLM Council — a lightweight interface he built in a single weekend to query multiple models and have them review each other's work. The concept was simple: important questions deserve more than one opinion.
We took that idea and built the production infrastructure around it — streaming, caching, presets, confidence scoring, OpenAI-compatible auth, an MCP server, and an API that drops into any existing workflow.
The MIT "Improving Factuality and Reasoning in Language Models through Multiagent Debate" paper (2024) is the most cited. We also reference Artificial Analysis's 2026 divergence study and the LLM Consensus 100-question expert benchmark.
The chairman never doubles as a council member. The chairman and the council are different model instances by design; that separation is enforced in code and is the structural reason synthesis doesn't collapse back into one model's view.
During Stage 2, each model evaluates responses labelled A, B, C, D — without knowing which model produced which. The mapping is held server-side and only de-anonymised for the final reasoning trace shown to the user.
The synthesis is not an average of the council's answers, because averaging would flatten disagreement. The chairman is asked to integrate the highest-ranked points, name conflicts, and present caveats. The goal is a defensible synthesis, not a mean.
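To illustrate the difference between ranking and averaging, the sketch below turns peer rankings into an ordering and builds a chairman prompt from it. The Borda count used here is an assumed aggregation rule chosen for illustration, not a documented part of LLM Counsel. The function names and prompt wording are likewise hypothetical.

```python
from typing import Dict, List


def borda_scores(rankings: List[List[str]]) -> Dict[str, int]:
    """Each ranking lists labels best-first; position i of n earns n - 1 - i points."""
    scores: Dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] = scores.get(label, 0) + (n - 1 - position)
    return scores


def build_chairman_prompt(
    question: str, labelled: Dict[str, str], rankings: List[List[str]]
) -> str:
    """Order responses by peer score and ask for integration, not a mean."""
    scores = borda_scores(rankings)
    # Highest score first; ties are broken alphabetically by label.
    ordered = sorted(labelled, key=lambda label: (-scores.get(label, 0), label))
    parts = [f"Question: {question}", "Responses, highest peer-ranked first:"]
    for label in ordered:
        parts.append(f"[{label}] (peer score {scores.get(label, 0)}): {labelled[label]}")
    parts.append(
        "Integrate the highest-ranked points, name any conflicts between "
        "responses, and state caveats. Do not average the answers."
    )
    return "\n".join(parts)
```

The rankings affect only the order and emphasis of what the chairman reads. The response texts themselves are passed through unchanged, so disagreements remain visible for the chairman to name.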