How Nodelulu Works
Our methodology — transparent, multi-layered, and independently verifiable.
The Problem
AI-generated text looks confident even when it's wrong. Misinformation spreads faster than corrections. Whether you're reading a research paper, a news article, or content from a chatbot, you deserve to know what's real and what's delulu.
Our Approach — The Science Behind Nodelulu
Nodelulu is an ensemble verification system designed to accurately identify AI hallucinations. The core idea: if each AI checker independently has a good chance of catching an error, then aggregating their judgments reduces the overall error rate — often dramatically. This isn't a guess. It's the same logic behind ensemble learning in machine learning, redundancy in safety-critical engineering, and centuries-old statistical theory.
The research that led us here
- Ensemble methods in machine learning
Dietterich (2000), Breiman (1996), Freund & Schapire (1997). Combining multiple independently trained models consistently outperforms any single model. This is one of the most replicated findings in all of AI research: the foundation behind Random Forests, boosting, and every modern prediction system that matters.
- Condorcet's Jury Theorem (1785)
If each independent evaluator is more likely right than wrong, the probability that the majority is correct approaches 100% as evaluators are added. With 4 checkers each at 70% accuracy, the probability of a majority being wrong drops to roughly 8%, down from 30% for a single checker (worked through in the sketch after this list).
- Wisdom of Crowds
Galton (1907), Surowiecki (2004). Independent, diverse judgments aggregated together are more accurate than individual expert judgments. The crucial conditions: judges must be genuinely independent and bring different perspectives. Same model, same prompt, same blind spots = no gain.
- Self-Consistency in LLM reasoning
Wang et al. (2022). Aggregating multiple reasoning paths by majority vote improves correctness in modern language models. This is the closest direct precedent to what Nodelulu does, except that we go further by using different models, different providers, and different analytical specialties rather than sampling from a single model.
- Redundancy in safety-critical systems
Aviation, medical devices, nuclear engineering. Independent verification is standard practice in every domain where getting it wrong has consequences. Two-person integrity, redundant flight computers, independent safety reviews. The principle is the same: independent checks reduce undetected failure.
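The Condorcet numbers above are easy to verify yourself. A minimal calculation in plain Python (no Nodelulu internals, just binomial arithmetic):

```python
from math import comb

def majority_wrong(n: int, p: float) -> float:
    """Probability that a strict majority of n independent checkers,
    each correct with probability p, is wrong on a given claim.
    With an even panel, a tie is possible and counts as neither majority."""
    q = 1 - p
    return sum(comb(n, k) * q**k * p**(n - k) for k in range(n // 2 + 1, n + 1))

print(f"1 checker at 70%:  {majority_wrong(1, 0.7):.1%}")   # 30.0%
print(f"4 checkers at 70%: {majority_wrong(4, 0.7):.1%}")   # 8.4%
```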
How Nodelulu is built to satisfy these conditions
The research is clear: ensemble gains depend on meeting specific conditions. Cut corners on any of them and the advantage disappears. Here's what we built and why:
- Genuine independence and diversity: Four premium reasoning models from three leading providers (OpenAI, Anthropic, Google). Different companies, different architectures, different training data, different blind spots. Each model is given a unique analytical specialty through a tested configuration that took months to develop and validate.
- Premium reasoning power: Every model in the system is a top-tier reasoning model, the most capable available from each provider. The science only works when each checker is individually good at catching errors; lightweight or free-tier models don't have the reasoning depth to make this approach viable.
- Structured, specific verification: Each model doesn't just read your text and give an opinion. It verifies specific claim types — facts, numbers, citations, logic, contradictions, sources — using structured outputs with defined categories. Precision, not vibes.
- Weighted consensus aggregation: Findings aren't merged by simple majority vote. Multi-model agreement at high severity carries significantly more weight than a single model flagging something with hedged language. The system knows the difference between four models shouting and one model whispering.
A single AI — even the most expensive one available — will always have blind spots. The science says so, and our testing confirms it. Nodelulu exists because accurate hallucination detection requires a system, not a single model. Four premium reasoning models, three providers, weighted consensus, and independent web search verification — built to do what no single AI can do alone.
Stage 1 — Safety & Gatekeeping
Before any AI model sees your text, Nodelulu runs two checks:
- Content moderation: Prohibited content (CSAM, weapons instructions, targeted harassment) is blocked immediately. This protects our infrastructure and complies with AI provider policies. We don't censor opinions or controversial topics — only content that is illegal or violates provider terms.
- Prompt injection defence: Documents containing adversarial patterns designed to manipulate AI models are sanitised before processing. Your text is preserved; we just mark the suspicious patterns so models treat them as data, not instructions.
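We don't publish the production rule set, but the shape of the defence is easy to sketch. A minimal illustration in Python; the patterns and the marker format here are hypothetical, not Nodelulu's actual rules:

```python
import re

# Hypothetical adversarial patterns; the real rule set is more extensive.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now .{0,40}(mode|persona)", re.IGNORECASE),
    re.compile(r"system prompt", re.IGNORECASE),
]

def sanitise(text: str) -> str:
    """Wrap suspicious spans in markers so downstream models treat
    them as quoted data rather than instructions. The text itself
    is preserved, only annotated."""
    for pattern in INJECTION_PATTERNS:
        text = pattern.sub(lambda m: f"[UNTRUSTED: {m.group(0)}]", text)
    return text
```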
Stage 2 — Multi-Model Analysis
Your text is sent simultaneously to four frontier AI models from three independent providers:
| Model | Provider | Analytical Focus |
|---|---|---|
| GPT-5.2 Pro | OpenAI | Broad analysis — factual, logical, and opinion detection |
| Claude Opus 4.6 | Anthropic | Consistency analysis — internal contradictions, logical structure |
| Gemini 3 Pro | Google | Precision analysis — numerical accuracy, fabricated sources |
| GPT-5.2 Codex | OpenAI | Broad analysis — factual, logical, and opinion detection |
Each model independently analyses your text and returns structured findings. Two models run a broad analysis sweep, one focuses on internal contradictions and logical structure, and one specialises in numerical precision and source verification — three specialised analytical lenses across four premium reasoning models. This is the ensemble approach in practice: genuine diversity of perspective, not the same analysis repeated four times.
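Conceptually, Stage 2 is a parallel scatter-gather. A minimal sketch of the shape, where `call_model` is a hypothetical stand-in for the real per-provider API clients and the findings schema is simplified:

```python
import asyncio

MODELS = ["gpt-5.2-pro", "claude-opus-4.6", "gemini-3-pro", "gpt-5.2-codex"]

async def call_model(model: str, text: str) -> list[dict]:
    # Hypothetical stand-in for the real per-provider API client.
    # Each model returns structured findings in a shared schema.
    return [{"model": model, "span": (0, 24), "severity": "doubt",
             "issue": "example finding"}]

async def analyse(text: str) -> list[list[dict]]:
    # All four models run concurrently and independently; no model
    # sees another's output, which preserves the independence the
    # ensemble results depend on.
    return await asyncio.gather(*(call_model(m, text) for m in MODELS))

per_model_findings = asyncio.run(analyse("Document text goes here."))
```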
Stage 3 — Merge by Consensus
Four models often flag the same problem in different words. The merge engine clusters findings that refer to the same issue using:
- Anchor-span overlap (primary): If two models highlight overlapping spans of your document, their findings are about the same thing. Deterministic — no ambiguity.
- Text similarity (fallback): When anchors can't be resolved (e.g. for omissions), the engine compares the issue descriptions using word overlap to decide if they're the same finding.
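In code, the clustering test looks roughly like this (a simplified sketch; the 0.5 similarity threshold is illustrative, not our production value):

```python
def spans_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Primary test: two findings anchored to overlapping character
    spans of the document refer to the same issue."""
    return a[0] < b[1] and b[0] < a[1]

def word_overlap(desc_a: str, desc_b: str, threshold: float = 0.5) -> bool:
    """Fallback test (e.g. for omissions with no anchor): Jaccard
    overlap between the words of the two issue descriptions."""
    a, b = set(desc_a.lower().split()), set(desc_b.lower().split())
    return len(a & b) / len(a | b) >= threshold

def same_finding(f1: dict, f2: dict) -> bool:
    if f1.get("span") and f2.get("span"):
        return spans_overlap(f1["span"], f2["span"])
    return word_overlap(f1["issue"], f2["issue"])
```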
Each merged finding gets a confidence score (0–100) based on how many models independently flagged it and at what severity. More models agreeing = higher confidence that the finding is real. A single-model finding with hedged language is heavily discounted.
This is the weighted voting classifier described in Our Approach: Condorcet's Jury Theorem and Dawid–Skene-style aggregation put into practice.
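A simplified sketch of that weighting, with illustrative severity weights rather than our production values:

```python
N_MODELS = 4
SEVERITY_WEIGHT = {"dispute": 1.0, "doubt": 0.6, "note": 0.3}  # illustrative

def confidence(cluster: list[dict]) -> int:
    """Confidence (0-100) for one merged finding: every model that
    flagged the issue contributes weight according to its severity."""
    score = sum(SEVERITY_WEIGHT[f["severity"]] for f in cluster) / N_MODELS
    # A lone model using hedged language is heavily discounted.
    if len(cluster) == 1 and cluster[0].get("hedged"):
        score *= 0.5
    return round(min(score, 1.0) * 100)
```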
Stage 4 — Web Search Verification
Model analysis alone isn't enough — models can be wrong too. For findings that can be fact-checked (factual errors, numerical claims, fabricated sources, temporal issues), Nodelulu runs web search verification against live sources:
- Web search verification: Structured search results with snippets, source verification, and freshness filtering
- Academic database verification: CrossRef and Semantic Scholar for citation verification — catching fabricated journal articles and papers
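For illustration, a fabricated-citation check against CrossRef's public REST API might look like this (a minimal sketch; the title-matching logic here is deliberately naive compared to production):

```python
import requests

def citation_exists(title: str) -> bool:
    """Query CrossRef's public API for a bibliographic match.
    A fabricated paper typically returns no close title match."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return any(
        title.lower() in " ".join(item.get("title", [])).lower()
        for item in items
    )
```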
Findings about opinions, logical structure, or omissions are analytical judgments — they can't be web-searched and are verified by model consensus only.
Web search verification adjusts your score. Findings confirmed by multiple web sources increase in confidence; fabricated sources with zero web evidence are penalised more heavily. This means web search verification directly affects the final Degree of Delulu — it's not just decoration.
Stage 5 — The Degree of Delulu
Every document starts at 100 (clean). Each finding deducts points based on:
- Confidence score: High-confidence findings (many models agree + web evidence) deduct more
- Diminishing returns: The first confirmed error is devastating to trust; subsequent errors add progressively less. One major error and ten minor nitpicks don't make a document ten times worse.
- Category weighting: Omissions (missing context) penalise less than fabricated sources. Missing information isn't the same as wrong information.
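A simplified sketch of a deduction scheme with these three properties; the base deduction, category weights, and decay factor are illustrative, not our production values:

```python
CATEGORY_WEIGHT = {  # illustrative
    "fabricated_source": 1.0,
    "factual_error": 0.9,
    "numerical_error": 0.8,
    "omission": 0.3,
}

def degree_of_delulu(findings: list[dict]) -> int:
    """Start at 100; deduct per finding with diminishing returns."""
    # Sort so the most damaging finding lands the full first hit.
    ordered = sorted(
        findings,
        key=lambda f: f["confidence"] * CATEGORY_WEIGHT[f["category"]],
        reverse=True,
    )
    score, decay = 100.0, 1.0
    for f in ordered:
        deduction = 30 * (f["confidence"] / 100) * CATEGORY_WEIGHT[f["category"]]
        score -= deduction * decay
        decay *= 0.6  # each subsequent finding counts less
    return max(0, round(score))
```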
| Score Range | Meaning |
|---|---|
| 80–100 | Clean — minimal or no concerns |
| 60–79 | Mostly clean — some issues worth checking |
| 40–59 | Mixed — several findings, verify before trusting |
| 20–39 | Significant issues — approach with caution |
| 0–19 | Unreliable — major errors or fabrications detected |
Finding Severity Levels
| Severity | Meaning | Example |
|---|---|---|
| Dispute | Clearly wrong or fabricated | "The Eiffel Tower is in Berlin" |
| Doubt | Likely wrong or misleading | "Approximately 400 metres tall" (actual: 330m) |
| Note | Worth flagging but uncertain | Missing context that may be relevant |
What We Check For
- Factual errors: Statements that contradict established facts
- Numerical errors: Wrong numbers, dates, statistics, or measurements
- Fabricated sources: Citations to journals, papers, or studies that don't exist
- Internal contradictions: The document contradicts itself in different places
- Logical errors: Conclusions that don't follow from the evidence presented
- Opinions as fact: Subjective claims presented as objective truth
- Temporal issues: Possibly outdated information or time-dependent claims
- Omissions: Important context missing within the topic's scope
Honest Limitations
We believe in transparency. Here's what Nodelulu can't do:
- We can't guarantee 100% accuracy — AI models make mistakes, and web sources can be wrong too
- Very recent events (last few hours) may not have web evidence yet
- Private, proprietary, or classified information can't be verified through web search
- Subjective findings (opinions, logical structure) rely on model judgment, not objective evidence
- Results should be a starting point for your own verification, not the final word
Model Availability
The frontier models we use are operated by third-party providers (OpenAI, Anthropic, Google). If a provider experiences downtime, that model's analysis is skipped and the remaining models continue. Your Degree of Delulu adjusts its confidence level to reflect how many models contributed.
When a model is unavailable, that's the provider's issue — not ours. We always tell you which models contributed to your results.
Your Data
Text you submit is processed in real-time and is not permanently stored. Results are briefly cached in memory to avoid re-processing duplicate submissions, then discarded. See our Privacy Policy for full details.