TRUST & METHODOLOGY

A council doesn't work because it's bigger. It works because it's honest.

The research, the mechanism, and the principles behind multi-model deliberation.

The research

In statistical learning, ensembles often outperform their individual members. The same principle applies to language models: combining diverse perspectives gives a result that varies less than any single predictor's, and it surfaces blind spots that one model alone would miss.
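To make the variance claim concrete, here is a minimal sketch of the classical result for averaged numeric predictors. It assumes n predictors with the same error variance and the same pairwise correlation; the function name and parameters are ours, for illustration. Language models are not numeric predictors, so treat this as an analogy rather than a model of a council.

```python
def ensemble_variance(sigma2: float, rho: float, n: int) -> float:
    """Variance of the mean of n predictors.

    Assumes every predictor has error variance sigma2 and every pair
    has correlation rho. The classical identity is:
        Var(mean) = rho * sigma2 + (1 - rho) * sigma2 / n
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return rho * sigma2 + (1.0 - rho) * sigma2 / n


# Fully independent predictors: variance shrinks as 1/n.
independent = ensemble_variance(1.0, 0.0, 4)

# Perfectly correlated predictors: adding more of them changes nothing.
identical = ensemble_variance(1.0, 1.0, 4)

# Partly correlated predictors: some benefit, bounded below by rho * sigma2.
partial = ensemble_variance(1.0, 0.5, 4)
```

The correlation term is why diversity matters: members that fail in the same way add little, however many of them there are.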

A 2024 study by MIT researchers ("Improving Factuality and Reasoning in Language Models through Multiagent Debate") found that, across arithmetic, factuality, and reasoning benchmarks, models produce more accurate answers when they critique each other's responses and revise their own.

Independent testing on 100 expert-level questions in finance, law, medicine, and technology has shown multi-model synthesis matching or outperforming the best individual frontier model, with no loss of accuracy on the questions a single model had already answered correctly.

This isn't just a theory. It's a documented finding, replicated by multiple research groups. See our live benchmark page for current frontier-model scores and how a council composite compares.

The mechanism — how anonymous peer review prevents groupthink

1. Models respond independently

No model sees the others' work during Stage 1. This prevents anchoring bias — the well-documented effect where the first answer in a sequence pulls subsequent answers toward it.

2. Reviews are anonymous

During Stage 2, each model evaluates responses labelled A, B, C, D — without knowing which model produced which. This prevents reputation effects and removes any bias toward "what the famous model would say."

3. The chairman is designated, not dominant

The synthesis model's job is integration, not opinion. It weighs the peer reviews, not its own preferences. The chairman is never also a council member — that separation is enforced in code.
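The three rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production implementation: the `ask` callable, the model names, and every function name here are hypothetical, and the separation check compares plain model identifiers.

```python
import asyncio
import random
import string
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical client signature: (model_name, prompt) -> answer text.
AskFn = Callable[[str, str], Awaitable[str]]


def check_separation(council: List[str], chairman: str) -> None:
    """Rule 3: refuse to run if the chairman also sits on the council."""
    if chairman in council:
        raise ValueError(f"chairman {chairman!r} must not be a council member")


async def stage_one(question: str, council: List[str], ask: AskFn) -> Dict[str, str]:
    """Rule 1: each model receives only the question, never a peer's answer."""
    answers = await asyncio.gather(*(ask(model, question) for model in council))
    return dict(zip(council, answers))


def anonymise(
    answers: Dict[str, str], rng: random.Random
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Rule 2: relabel answers A, B, C, ... in shuffled order."""
    models = list(answers)
    if len(models) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 council members")
    rng.shuffle(models)  # label order must not reveal council order
    label_to_model = dict(zip(string.ascii_uppercase, models))
    labelled = {label: answers[model] for label, model in label_to_model.items()}
    return labelled, label_to_model


async def fake_ask(model: str, prompt: str) -> str:
    # Stand-in for a real API call. A real answer would not name its author.
    return f"reply from {model}"


council = ["model-1", "model-2", "model-3"]
check_separation(council, chairman="model-4")
answers = asyncio.run(stage_one("Is this clause enforceable?", council, fake_ask))
labelled, label_to_model = anonymise(answers, random.Random(0))
```

Reviewers would be shown only `labelled`; `label_to_model` stays with the caller. Running the separation check before any model is queried means a misconfigured council fails early instead of producing a synthesis that quietly favours one member.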

From a weekend project to a production platform

LLM Counsel was inspired by Andrej Karpathy's open-source LLM Council — a lightweight interface he built in a single weekend to query multiple models and have them review each other's work. The concept was simple: important questions deserve more than one opinion.

We took that idea and built the production infrastructure around it — streaming, caching, presets, confidence scoring, OpenAI-compatible auth, an MCP server, and an API that drops into any existing workflow.

What we believe

Deliberate, not fast. Calm precision over speed. The value is in the judgment.
Plain and precise. Clarity beats cleverness. We use the simplest accurate word.
Honest about uncertainty. Good counsel names its confidence and its caveats.
On your side. We exist to help someone decide — not to impress them.

Frequently asked questions

Where can I read the research?

The MIT "Improving Factuality and Reasoning in Language Models through Multiagent Debate" paper (2024) is the most cited. We also reference Artificial Analysis's 2026 divergence study and the LLM Consensus 100-question expert benchmark.

Is the chairman ever also a council member?

No. The chairman and the council are different model instances by design — that separation is enforced in code and is the structural reason synthesis doesn't collapse back into one model's view.

How does anonymity work in peer review?

During Stage 2, each model evaluates responses labelled A, B, C, D — without knowing which model produced which. The mapping is held server-side and only de-anonymised for the final reasoning trace shown to the user.
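As a rough sketch of the de-anonymising step, assuming the server keeps a simple label-to-model dictionary (the mapping, model names, and function name below are hypothetical):

```python
from typing import Dict, List


def deanonymise_ranking(ranking: List[str], label_to_model: Dict[str, str]) -> List[str]:
    """Translate a reviewer's ranking of labels back into model names."""
    return [label_to_model[label] for label in ranking]


# Hypothetical server-side mapping; reviewers only ever saw the labels.
label_to_model = {"A": "model-3", "B": "model-1", "C": "model-2"}

# A reviewer ranked the anonymous responses best to worst.
reviewer_ranking = ["B", "A", "C"]

# Names are restored only when building the trace shown to the user.
trace = deanonymise_ranking(reviewer_ranking, label_to_model)
```

Because the lookup runs only after every review is in, no reviewer prompt ever contains a model name.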

Is multi-model AI just averaging?

No. Averaging would flatten disagreement. The chairman is asked to integrate the highest-ranked points, name conflicts, and present caveats — the goal is a defensible synthesis, not a mean.
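One way to decide which responses count as "highest-ranked" is to aggregate the peer rankings before the chairman sees them. The sketch below uses a Borda count, which is our assumption for illustration; this page does not specify the aggregation method, and the function name and sample rankings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def borda_scores(rankings: List[List[str]]) -> Dict[str, int]:
    """Score each label: first place among n earns n - 1 points, last earns 0."""
    scores: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += n - 1 - position
    return dict(scores)


# Three reviewers each rank the anonymous responses best to worst.
rankings = [
    ["B", "A", "C"],
    ["B", "C", "A"],
    ["A", "B", "C"],
]
scores = borda_scores(rankings)

# Highest score first; ties broken alphabetically so the order is stable.
ordered = sorted(scores, key=lambda label: (-scores[label], label))
```

The scores only tell the chairman where reviewers agreed and where they split. The synthesis itself is still written, not computed, so a lone dissenting review can surface as a named conflict or caveat.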