How Nodelulu Works
Our methodology — transparent, multi-layered, and independently verifiable.
The Problem
AI-generated text looks confident even when it's wrong. Misinformation spreads faster than corrections. Whether you're reading a research paper, a news article, or content from a chatbot, you deserve to know what's real and what's delulu.
Our Approach — The Science Behind Nodelulu
Nodelulu is an ensemble verification system designed to accurately identify AI hallucinations. The core idea: if each AI checker independently has a good chance of catching an error, then aggregating their judgments reduces the overall error rate — often dramatically. This isn't a guess. It's the same logic behind ensemble learning in machine learning, redundancy in safety-critical engineering, and centuries-old statistical theory.
The research that led us here
- Ensemble methods in machine learning
Dietterich (2000), Breiman (1996), Freund & Schapire (1997). Combining multiple independently trained models consistently outperforms any single model. This is one of the most replicated findings in all of AI research: the foundation behind Random Forests, boosting, and every modern prediction system that matters.
- Condorcet's Jury Theorem (1785)
If each independent evaluator is more likely right than wrong, the probability that the majority is correct approaches 100% as evaluators are added. With 4 checkers each at 70% accuracy, the probability of a majority being wrong drops to roughly 8%, down from 30% for a single checker (worked through in the sketch after this list).
- Wisdom of Crowds
Galton (1907), Surowiecki (2004). Independent, diverse judgments aggregated together are more accurate than individual expert judgments. The crucial conditions: judges must be genuinely independent and bring different perspectives. Same model, same prompt, same blind spots = no gain.
- Self-Consistency in LLM reasoning
Wang et al. (2022). Aggregating multiple reasoning paths by majority vote improves correctness in modern language models. This is the closest direct precedent to what Nodelulu does, except that we go further by using different models, different providers, and different analytical specialties rather than sampling from a single model.
- Redundancy in safety-critical systems
Aviation, medical devices, nuclear engineering. Independent verification is standard practice in every domain where getting it wrong has consequences. Two-person integrity, redundant flight computers, independent safety reviews. The principle is the same: independent checks reduce undetected failure.
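The Condorcet numbers above are easy to verify yourself. A minimal calculation in plain Python (no Nodelulu internals, just binomial arithmetic):

```python
from math import comb

def majority_wrong(n: int, p: float) -> float:
    """Probability that a strict majority of n independent checkers,
    each correct with probability p, is wrong on a given claim.
    With an even panel, a tie is possible and counts as neither majority."""
    q = 1 - p
    return sum(comb(n, k) * q**k * p**(n - k) for k in range(n // 2 + 1, n + 1))

print(f"1 checker at 70%:  {majority_wrong(1, 0.7):.1%}")   # 30.0%
print(f"4 checkers at 70%: {majority_wrong(4, 0.7):.1%}")   # 8.4%
```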
How Nodelulu is built to satisfy these conditions
The research is clear: ensemble gains depend on meeting specific conditions. Cut corners on any of them and the advantage disappears. Here's what we built and why:
- Genuine independence and diversity: Four premium reasoning models from three leading providers (OpenAI, Anthropic, Google). Different companies, different architectures, different training data, different blind spots. Each model is given a unique analytical specialty through a tested configuration that took months to develop and validate.
- Premium reasoning power: Every model in the system is a top-tier reasoning model, the most capable available from each provider. The science only works when each checker is individually good at catching errors; lightweight or free-tier models don't have the reasoning depth to make this approach viable.
- Structured, specific verification: Each model doesn't just read your text and give an opinion. It verifies specific claim types — facts, numbers, citations, logic, contradictions, sources — using structured outputs with defined categories. Precision, not vibes.
- Weighted consensus aggregation: Findings aren't merged by simple majority vote. Multi-model agreement at high severity carries significantly more weight than a single model flagging something with hedged language. The system knows the difference between four models shouting and one model whispering.
A single AI — even the most expensive one available — will always have blind spots. The science says so, and our testing confirms it. Nodelulu exists because accurate hallucination detection requires a system, not a single model. Four premium reasoning models, three providers, weighted consensus, and independent web search verification — built to do what no single AI can do alone.
Stage 1 — Safety & Gatekeeping
Before any AI model sees your text, Nodelulu runs two checks:
- Content moderation: Prohibited content (CSAM, weapons instructions, targeted harassment) is blocked immediately. This protects our infrastructure and complies with AI provider policies. We don't censor opinions or controversial topics — only content that is illegal or violates provider terms.
- Prompt injection defence: Documents containing adversarial patterns designed to manipulate AI models are sanitised before processing. Your text is preserved; we just mark the suspicious patterns so models treat them as data, not instructions.
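We don't publish the production rule set, but the shape of the defence is easy to sketch. A minimal illustration in Python; the patterns and the marker format here are hypothetical, not Nodelulu's actual rules:

```python
import re

# Hypothetical adversarial patterns; the real rule set is more extensive.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now .{0,40}(mode|persona)", re.IGNORECASE),
    re.compile(r"system prompt", re.IGNORECASE),
]

def sanitise(text: str) -> str:
    """Wrap suspicious spans in markers so downstream models treat
    them as quoted data rather than instructions. The text itself
    is preserved, only annotated."""
    for pattern in INJECTION_PATTERNS:
        text = pattern.sub(lambda m: f"[UNTRUSTED: {m.group(0)}]", text)
    return text
```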
Stage 2 — Multi-Model Analysis
Your text is sent simultaneously to four frontier AI models from three independent providers:
| Model | Provider | Analytical Focus |
|---|---|---|
| GPT-5.2 Pro | OpenAI | Broad analysis — factual, logical, and opinion detection |
| Claude Opus 4.6 | Anthropic | Consistency analysis — internal contradictions, logical structure |
| Gemini 3 Pro | Google | Precision analysis — numerical accuracy, fabricated sources |
| GPT-5.2 Codex | OpenAI | Broad analysis — factual, logical, and opinion detection |
Each model independently analyses your text and returns structured findings. Two models run a broad analysis sweep, one focuses on internal contradictions and logical structure, and one specialises in numerical precision and source verification — three specialised analytical lenses across four premium reasoning models. This is the ensemble approach in practice: genuine diversity of perspective, not the same analysis repeated four times.
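Conceptually, Stage 2 is a parallel scatter-gather. A minimal sketch of the shape, where `call_model` is a hypothetical stand-in for the real per-provider API clients and the findings schema is simplified:

```python
import asyncio

MODELS = ["gpt-5.2-pro", "claude-opus-4.6", "gemini-3-pro", "gpt-5.2-codex"]

async def call_model(model: str, text: str) -> list[dict]:
    # Hypothetical stand-in for the real per-provider API client.
    # Each model returns structured findings in a shared schema.
    return [{"model": model, "span": (0, 24), "severity": "doubt",
             "issue": "example finding"}]

async def analyse(text: str) -> list[list[dict]]:
    # All four models run concurrently and independently; no model
    # sees another's output, which preserves the independence the
    # ensemble results depend on.
    return await asyncio.gather(*(call_model(m, text) for m in MODELS))

per_model_findings = asyncio.run(analyse("Document text goes here."))
```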
Stage 3 — Merge by Consensus
Four models often flag the same problem in different words. The merge engine clusters findings that refer to the same issue using:
- Anchor-span overlap (primary): If two models highlight overlapping spans of your document, their findings are about the same thing. Deterministic — no ambiguity.
- Text similarity (fallback): When anchors can't be resolved (e.g. for omissions), the engine compares the issue descriptions using word overlap to decide if they're the same finding.
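In code, the clustering test looks roughly like this (a simplified sketch; the 0.5 similarity threshold is illustrative, not our production value):

```python
def spans_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Primary test: two findings anchored to overlapping character
    spans of the document refer to the same issue."""
    return a[0] < b[1] and b[0] < a[1]

def word_overlap(desc_a: str, desc_b: str, threshold: float = 0.5) -> bool:
    """Fallback test (e.g. for omissions with no anchor): Jaccard
    overlap between the words of the two issue descriptions."""
    a, b = set(desc_a.lower().split()), set(desc_b.lower().split())
    return len(a & b) / len(a | b) >= threshold

def same_finding(f1: dict, f2: dict) -> bool:
    if f1.get("span") and f2.get("span"):
        return spans_overlap(f1["span"], f2["span"])
    return word_overlap(f1["issue"], f2["issue"])
```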
Each merged finding gets a confidence score (0–100) based on how many models independently flagged it and at what severity. More models agreeing = higher confidence that the finding is real. A single-model finding with hedged language is heavily discounted.
This is the weighted voting classifier described in Our Approach: Condorcet's Jury Theorem and Dawid–Skene-style aggregation put into practice.
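A simplified sketch of that weighting, with illustrative severity weights rather than our production values:

```python
N_MODELS = 4
SEVERITY_WEIGHT = {"dispute": 1.0, "doubt": 0.6, "note": 0.3}  # illustrative

def confidence(cluster: list[dict]) -> int:
    """Confidence (0-100) for one merged finding: every model that
    flagged the issue contributes weight according to its severity."""
    score = sum(SEVERITY_WEIGHT[f["severity"]] for f in cluster) / N_MODELS
    # A lone model using hedged language is heavily discounted.
    if len(cluster) == 1 and cluster[0].get("hedged"):
        score *= 0.5
    return round(min(score, 1.0) * 100)
```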
Stage 4 — Web Search Verification
Model analysis alone isn't enough — models can be wrong too. For findings that can be fact-checked (factual errors, numerical claims, fabricated sources, temporal issues), Nodelulu runs web search verification against live sources:
- Web search verification: Structured search results with snippets, source verification, and freshness filtering
- Academic database verification: CrossRef and Semantic Scholar for citation verification — catching fabricated journal articles and papers
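For illustration, a fabricated-citation check against CrossRef's public REST API might look like this (a minimal sketch; the title-matching logic here is deliberately naive compared to production):

```python
import requests

def citation_exists(title: str) -> bool:
    """Query CrossRef's public API for a bibliographic match.
    A fabricated paper typically returns no close title match."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return any(
        title.lower() in " ".join(item.get("title", [])).lower()
        for item in items
    )
```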
Findings about opinions, logical structure, or omissions are analytical judgments — they can't be web-searched and are verified by model consensus only.
Web search verification adjusts your score. Findings confirmed by multiple web sources increase in confidence; fabricated sources with zero web evidence are penalised more heavily. This means web search verification directly affects the final Degree of Delulu — it's not just decoration.
Stage 5 — The Degree of Delulu
Every document starts at 100 (clean). Each finding deducts points based on:
- Confidence score: High-confidence findings (many models agree + web evidence) deduct more
- Diminishing returns: The first confirmed error is devastating to trust; subsequent errors add progressively less. One major error and ten minor nitpicks don't make a document ten times worse.
- Category weighting: Omissions (missing context) penalise less than fabricated sources. Missing information isn't the same as wrong information.
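A simplified sketch of a deduction scheme with these three properties; the base deduction, category weights, and decay factor are illustrative, not our production values:

```python
CATEGORY_WEIGHT = {  # illustrative
    "fabricated_source": 1.0,
    "factual_error": 0.9,
    "numerical_error": 0.8,
    "omission": 0.3,
}

def degree_of_delulu(findings: list[dict]) -> int:
    """Start at 100; deduct per finding with diminishing returns."""
    # Sort so the most damaging finding lands the full first hit.
    ordered = sorted(
        findings,
        key=lambda f: f["confidence"] * CATEGORY_WEIGHT[f["category"]],
        reverse=True,
    )
    score, decay = 100.0, 1.0
    for f in ordered:
        deduction = 30 * (f["confidence"] / 100) * CATEGORY_WEIGHT[f["category"]]
        score -= deduction * decay
        decay *= 0.6  # each subsequent finding counts less
    return max(0, round(score))
```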
| Score Range | Meaning |
|---|---|
| 80–100 | Clean — minimal or no concerns |
| 60–79 | Mostly clean — some issues worth checking |
| 40–59 | Mixed — several findings, verify before trusting |
| 20–39 | Significant issues — approach with caution |
| 0–19 | Unreliable — major errors or fabrications detected |
Finding Severity Levels
| Severity | Meaning | Example |
|---|---|---|
| Dispute | Clearly wrong or fabricated | "The Eiffel Tower is in Berlin" |
| Doubt | Likely wrong or misleading | "Approximately 400 metres tall" (actual: 330m) |
| Note | Worth flagging but uncertain | Missing context that may be relevant |
What We Check For
- Factual errors: Statements that contradict established facts
- Numerical errors: Wrong numbers, dates, statistics, or measurements
- Fabricated sources: Citations to journals, papers, or studies that don't exist
- Internal contradictions: The document contradicts itself in different places
- Logical errors: Conclusions that don't follow from the evidence presented
- Opinions as fact: Subjective claims presented as objective truth
- Temporal issues: Possibly outdated information or time-dependent claims
- Omissions: Important context missing within the topic's scope
Honest Limitations
We believe in transparency. Here's what Nodelulu can't do:
- We can't guarantee 100% accuracy — AI models make mistakes, and web sources can be wrong too
- Very recent events (last few hours) may not have web evidence yet
- Private, proprietary, or classified information can't be verified through web search
- Subjective findings (opinions, logical structure) rely on model judgment, not objective evidence
- Results should be a starting point for your own verification, not the final word
Model Availability
The frontier models we use are operated by third-party providers (OpenAI, Anthropic, Google). If a provider experiences downtime, that model's analysis is skipped and the remaining models continue. Your Degree of Delulu adjusts its confidence level to reflect how many models contributed.
When a model is unavailable, that's the provider's issue — not ours. We always tell you which models contributed to your results.
Your Data
Text you submit is processed in real-time and is not permanently stored. Results are briefly cached in memory to avoid re-processing duplicate submissions, then discarded. See our Privacy Policy for full details.