Lab Notes: CiteVerify - Fake citation detection and evidence verification for AI research

  • Writer: FirstPrinciples
  • May 21
  • 7 min read

Authored by Shaghayegh Sadeghi & Khashayar Khajavi, Machine Learning Associates - FirstPrinciples


Scientific writing has always relied on the implicit contract that claims must remain traceable to evidence. Large language models complicate that relationship. Modern systems can generate reports that appear rigorous and extensively sourced while quietly introducing fabricated citations. CiteVerify is FirstPrinciples’ answer to the rise of fake citations. 


A recurring problem in LLM-generated reports, especially in systems like Deep Literature Search (DLS) at FirstPrinciples, is that the writing looks trustworthy long before the evidence actually is. A report may sound polished and cite in all the right places, yet still fail at the most important step: linking each claim to a real and genuinely supporting source. These failures show up as fabricated references, corrupted metadata, or real papers attached to claims they do not support. CiteVerify is designed to expose and measure this reliability gap in DLS.



Citation reliability is not one problem, but at least three. Does the cited source exist? Does it actually support the attached claim? If not, can the system recover a better source? Evaluation setups that collapse these into a single judgment make it hard to see what failed. CiteVerify separates the stages so that hallucinated citations, metadata mismatches, improperly assigned references, and true evidence gaps can be distinguished, making the final reports both auditable and traceable.


AI citation verification pipeline

The pipeline starts from a generated report and turns it into structured claim-citation pairs. From there, it verifies citation existence against external academic sources rather than relying only on local metadata. Candidate references are checked across sources such as CrossRef, Semantic Scholar, OpenAlex, and arXiv, and each citation is scored for match quality. Citations that cannot be verified are filtered out before downstream evidence checking.
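
One way to picture the match-quality scoring step is as a fuzzy comparison between the citation as written in the report and a candidate record returned by an external index. The sketch below is an illustration only: the Citation fields, the weights, and the 0.9 threshold are assumptions rather than CiteVerify's actual scoring rules, and no real API is called.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Sequence, Tuple


@dataclass
class Citation:
    # Hypothetical minimal record; real metadata would carry more fields.
    title: str
    year: Optional[int] = None
    doi: Optional[str] = None


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so formatting noise does not count.
    return " ".join(text.lower().split())


def match_score(cited: Citation, candidate: Citation) -> float:
    """Return a score in [0, 1]; an identical DOI counts as a perfect match."""
    if cited.doi and candidate.doi and cited.doi.lower() == candidate.doi.lower():
        return 1.0
    title_sim = SequenceMatcher(
        None, normalize(cited.title), normalize(candidate.title)
    ).ratio()
    year_ok = cited.year is not None and cited.year == candidate.year
    # Assumed weighting: title similarity dominates, year agreement adds a bonus.
    return 0.9 * title_sim + (0.1 if year_ok else 0.0)


def best_match(
    cited: Citation, candidates: Sequence[Citation], threshold: float = 0.9
) -> Optional[Tuple[float, Citation]]:
    """Pick the best candidate, or None if nothing clears the assumed threshold."""
    scored = [(match_score(cited, c), c) for c in candidates]
    if not scored:
        return None
    score, candidate = max(scored, key=lambda pair: pair[0])
    return (score, candidate) if score >= threshold else None
```

In a setup like this, a citation for which best_match returns None across every queried source would be the one filtered out before evidence checking.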



Flowchart of the CiteVerify pipeline with stages: Report Generation, Claim-Citation Extraction, Hallucination Detection, Citation Filter, Claim-Evidence Verification, Evidence Finder, and Evaluation.
Figure 1. CiteVerify pipeline overview. Seven stages from report generation through evaluation. Citation existence and claim evidence checking are treated as separate problems, followed by evidence recovery and final analysis.

The remaining pairs go through claim-evidence verification, where the system retrieves abstracts and, when needed, full-text passages to decide whether a citation supports, contradicts, or fails to provide enough evidence for a claim.


Unsupported cases can trigger an evidence finder that searches for replacement sources. This decomposition matters because fabricated and misused citations are different kinds of failure. If a reference cannot be resolved, the issue is upstream in citation construction. If it resolves cleanly but does not support the claim, the problem is more subtle. Separating the two gives a much clearer picture of how citation-grounded generation breaks down in practice.
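
The decomposition can be read as a small decision procedure that reports the first stage at which a citation fails. The sketch below is one possible reading, not the CiteVerify implementation: the Failure labels follow the categories named earlier in this post, while the boolean inputs and the rule that a recovered replacement source marks a reference as improperly assigned are assumptions made for illustration.

```python
from enum import Enum


class Failure(Enum):
    NONE = "no failure"
    HALLUCINATED = "hallucinated citation"
    METADATA_MISMATCH = "metadata mismatch"
    MISASSIGNED = "improperly assigned reference"
    EVIDENCE_GAP = "true evidence gap"


def diagnose(
    resolved: bool,
    metadata_exact: bool,
    supports: bool,
    replacement_found: bool,
) -> Failure:
    """Return the first failing stage for one claim-citation pair."""
    if not resolved:
        # The reference could not be found in any external source.
        return Failure.HALLUCINATED
    if not metadata_exact:
        # The paper exists, but the DOI, URL, or identifier is off.
        return Failure.METADATA_MISMATCH
    if supports:
        return Failure.NONE
    # Assumed rule: if the evidence finder recovers a better source, the claim
    # was supportable and the original reference was simply the wrong one.
    return Failure.MISASSIGNED if replacement_found else Failure.EVIDENCE_GAP
```

Reporting only the first failing stage is a simplification; a citation with corrupted metadata may also fail to support its claim.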


AI citation hallucination detection

Before verifying claim-evidence alignment, the pipeline checks whether cited sources actually exist. Across the 46 Configuration A reports (725 citations), the hallucination detector flagged 12 problematic references: 10 minor (corrupted metadata, wrong DOI, truncated URL, or mismatched identifier) and 2 major (entirely fabricated). Configuration B, with 907 citations, contained 2 minor hallucinations and no major ones. The exact match rate rose from 98.3 percent to 99.8 percent.


Bar charts showing "Citation volume" and "Hallucination counts" for Configurations A and B. A has 725 citations, 10 minor, 2 major hallucinations. B has 907 citations, 2 minor, 0 major hallucinations.
Figure 2. Hallucination detection summary. Configuration B achieves 99.8 percent exact match across 907 citations, eliminating all major hallucinations. Match method shifted entirely to URL based verification.

Key takeaways

Hallucinations drop from 12 to 2 (-83 percent), with major hallucinations reduced to zero. The exact match rate improves from 98.3 percent to 99.8 percent while citation volume increases by 25 percent. Configuration B relies entirely on URL-based matching, improving consistency and auditability.
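
The headline percentages can be recomputed from the raw counts. The short check below assumes the exact match rate is the share of citations that were not flagged; that definition is not spelled out in the text, but it reproduces the reported figures.

```python
def exact_match_rate(total: int, flagged: int) -> float:
    """Share of citations not flagged by the detector, in percent (assumed definition)."""
    return 100.0 * (total - flagged) / total


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


a_total, a_flagged = 725, 12  # Configuration A: 10 minor + 2 major
b_total, b_flagged = 907, 2   # Configuration B: 2 minor + 0 major

print(round(exact_match_rate(a_total, a_flagged), 1))  # 98.3
print(round(exact_match_rate(b_total, b_flagged), 1))  # 99.8
print(round(pct_change(a_flagged, b_flagged), 1))      # -83.3
print(round(pct_change(a_total, b_total), 1))          # 25.1
```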


Claim evidence alignment

Once hallucinated citations are filtered out, the central question is whether each remaining citation actually supports its attached claim. For every pair, the system retrieves the cited paper's abstract and compares it against the claim using an LLM.


If the abstract is insufficient, it escalates to full-text retrieval and passage-level comparison. Each pair receives one of three verdicts: supports (clear evidence), insufficient evidence (topically related but not directly supporting), or contradicts (conflicting information).


This three-class verdict is what distinguishes CiteVerify from simpler retrieval checks. A citation can be real, correctly formatted, and topically relevant yet still fail to support the specific quantitative or forward-looking claim it is attached to. The alignment stage is designed to catch exactly those cases by separating genuine evidential support from loose topical association.
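
The two-stage flow described above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the CiteVerify code: the judge stands in for the LLM comparison and is passed in as a plain callable, fetch_full_text is a hypothetical retrieval hook, and escalation is assumed to trigger when the abstract is missing or yields an insufficient verdict.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Verdict(Enum):
    SUPPORTS = "supports"
    INSUFFICIENT = "insufficient evidence"
    CONTRADICTS = "contradicts"


# A judge maps (claim, evidence_text) to a Verdict. In a real system this
# would wrap an LLM call; any callable works here, which keeps the flow testable.
Judge = Callable[[str, str], Verdict]


def verify_pair(
    claim: str,
    abstract: Optional[str],
    fetch_full_text: Callable[[], Optional[str]],
    judge: Judge,
) -> Tuple[Verdict, str]:
    """Return (verdict, stage) with stage in {'abstract', 'full_text', 'no_evidence'}."""
    if abstract:
        verdict = judge(claim, abstract)
        if verdict is not Verdict.INSUFFICIENT:
            # A decisive abstract-level verdict ends the check early.
            return verdict, "abstract"
    # Assumed escalation rule: only fetch full text when the abstract
    # is missing or not decisive.
    passages = fetch_full_text()
    if passages is None:
        stage = "abstract" if abstract else "no_evidence"
        return Verdict.INSUFFICIENT, stage
    return judge(claim, passages), "full_text"
```

Recording the stage alongside the verdict is what makes it possible to report how many decisions were settled at the abstract and how many needed full text.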


Key takeaways

Alignment is evaluated as a three-class decision: supports, insufficient evidence, or contradicts. Most pairs resolve at the abstract stage. Full-text escalation captures claims requiring detailed evidence. This separation prevents topically related but unsupported citations from being counted as valid support.


Results

We evaluated CiteVerify on 46 DLS-generated physics reports across 10 subtopics, comparing Configuration A and Configuration B of the DLS pipeline. Replacing the LLM-written references section with a deterministic CitationRegistry removed a major class of citation failures, such as fabricated references, inconsistent numbering across subtopics, duplicate entries, and corrupted metadata, while enforcing globally consistent and verifiable citation mappings.
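
To make the idea of a deterministic registry concrete, here is a hypothetical sketch of how such a component could work. It is not the DLS CitationRegistry: the class name, the normalization rule, and the output format are all assumptions chosen for illustration. The point is that numbering and the reference list come from a lookup table rather than from generated text, so duplicates and inconsistent numbers cannot arise.

```python
from typing import Dict, List


class CitationRegistrySketch:
    """Hypothetical illustration; not the DLS CitationRegistry implementation."""

    def __init__(self) -> None:
        self._numbers: Dict[str, int] = {}   # normalized key -> citation number
        self._entries: List[dict] = []       # metadata in numbering order

    @staticmethod
    def _key(identifier: str) -> str:
        # Assumed normalization: trim whitespace and a trailing slash.
        return identifier.strip().rstrip("/")

    def cite(self, identifier: str, metadata: dict) -> int:
        """Return the global number for a source, assigning one on first use."""
        key = self._key(identifier)
        if key not in self._numbers:
            self._entries.append(dict(metadata, identifier=key))
            self._numbers[key] = len(self._entries)
        return self._numbers[key]

    def references(self) -> List[str]:
        """Render the reference list directly from stored metadata."""
        return [
            f"[{number}] {entry['title']}. {entry['identifier']}"
            for number, entry in enumerate(self._entries, start=1)
        ]
```

With a structure like this, two subtopic sections citing the same URL receive the same number, and the rendered list contains only sources that were actually registered.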


Configuration A analyzed 3,293 claim-citation pairs; Configuration B analyzed 4,611. In both, most pairs resolve at the abstract stage, with a smaller but important fraction requiring full-text analysis. That escalation reflects claims referencing specific experimental results, numerical bounds, or methodological details available only in the body of the paper. The metrics below show that Configuration B does more than clean up citations; it increases evaluable coverage, strengthens evidence tracing, and makes DLS reports more auditable.


Bar chart titled "Verdict distribution across configurations." Configuration A (orange) and B (green) show verdicts: Supports, Insufficient, Contradicts.
Figure 3. Verdict distribution across the two system configurations. Supported pairs dominate in both settings, while contradictions remain a relatively small but important category.

The contradiction count rose from 65 to 86, reflecting stricter evaluation over a larger pair set; as a share of analyzed pairs it stayed at roughly 2 percent. Many failures are not random hallucinations but subtle support errors. For example, the cited paper may contain a numeric bound the report says is missing, be relevant to the topic without supporting the specific milestone or quantitative claim, support only a narrower statement than the report implies, or be a background reference stretched into evidence for a forward-looking claim. This reinforces a main lesson of the project: citation verification requires reasoning over the relationship between a claim and what the source really says.


Two donut charts compare "Configuration A" and "B" showing data breakdowns for pairs. Colors: green, blue, orange, gray. Titles and labels present.
Figure 4. Decision stage breakdown. Most decisions happen at the abstract stage, with a smaller share requiring full text analysis.

Figure 5. Retrieval and performance summary. Both configurations achieve strong retrieval coverage, with differing tradeoffs in runtime and depth of analysis.

Performance tradeoffs

Configuration B is faster per citation. Mean latency dropped from 5.72 seconds to 4.82 seconds, driven largely by more efficient abstract fetching (0.99 seconds to 0.49 seconds). Total runtime still grew from about 18,839 seconds to 22,226 seconds because the system analyzes 40 percent more pairs, and peak memory rose modestly (1,489 MB to 1,559 MB). Given the jump in coverage and support rate, this is a favorable trade.
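
The runtime figures are consistent with one another, which a quick calculation shows. The check below assumes the reported mean latency is averaged over the analyzed claim-citation pairs, so that pairs multiplied by mean latency approximates total runtime; the text does not state this, but the products land within a few seconds of the reported totals.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


runs = {
    "A": {"pairs": 3293, "mean_latency_s": 5.72, "reported_total_s": 18839},
    "B": {"pairs": 4611, "mean_latency_s": 4.82, "reported_total_s": 22226},
}

for name, run in runs.items():
    # Estimated total runtime under the per-pair latency assumption.
    estimate = run["pairs"] * run["mean_latency_s"]
    print(name, round(estimate), run["reported_total_s"])

print(round(pct_change(3293, 4611), 1))    # 40.0  (more pairs analyzed)
print(round(pct_change(5.72, 4.82), 1))    # -15.7 (lower mean latency)
print(round(pct_change(18839, 22226), 1))  # 18.0  (higher total runtime)
```

A 40 percent increase in pairs combined with a roughly 16 percent drop in mean latency yields the roughly 18 percent growth in total runtime.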


Bar chart comparing mean citation latency for subtopics in configurations A (orange) and B (green); includes global means: A=5.72s, B=4.82s.
Figure 6. Mean per-citation latency by subtopic. Configuration B is generally faster (global mean 4.82 seconds versus 5.72 seconds), reflecting more efficient abstract fetching despite deeper evidence checks.

Bar chart shows latency for Config A (red) and Config B (green) across stages: Abstract Fetch, Abstract Compare, Full-Text Fetch, Passage Extract, Passage Compare.
Figure 7. Pipeline stage latency breakdown (mean seconds per citation). Abstract fetch time halved under Configuration B (0.99 seconds to 0.49 seconds). Compare and passage stages remain similar, indicating the speedup is retrieval-driven.

Key takeaways

Configuration B lowers mean latency per citation (5.72 seconds to 4.82 seconds) while increasing total evaluated coverage. Speed gains come primarily from abstract fetch efficiency (0.99 seconds to 0.49 seconds). Total runtime still grows because Configuration B evaluates 40 percent more claim-citation pairs.


Subtopic level variation

Gains are not uniform across subtopics. Some improved dramatically under Configuration B, a pattern typical when retrieval and evidence handling get better. Nuclear physics report 22 jumped from 49.5 percent to 81.6 percent support, and particle physics report 26 from 50.0 percent to 88.9 percent. The single-report subtopics (biomolecular simulations, precision tests, and molecular motors) also showed consistent gains.


Bar chart showing support rates by physics subtopic. Configuration A (orange) and B (green) bars compare mean percentages. Range 58.7%-91.9%.
Figure 8. Support rate by physics subtopic. All 10 subtopics shown. Largest gains under Configuration B in nuclear physics and plasma physics. Soft matter shows the widest within subtopic variance.

A few categories became harder under Configuration B, especially some soft matter, plasma, and condensed matter reports. That is not necessarily bad news. In several cases the system is now analyzing many more pairs than before, surfacing harder or weaker ones that Configuration A never evaluated. Part of the drop reflects broader scrutiny rather than worse reasoning.
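
A toy calculation shows how broader scrutiny alone can lower a support rate. The counts below are invented purely for illustration and are not taken from the evaluation: the same 40 pairs receive identical verdicts in both runs, and the second run additionally evaluates 20 harder pairs that the first never reached.

```python
def support_rate(supported: int, total: int) -> float:
    """Supported pairs as a percentage of evaluated pairs."""
    return 100.0 * supported / total


# Hypothetical counts, for illustration only.
shared_supported, shared_total = 32, 40   # pairs evaluated in both runs
extra_supported, extra_total = 10, 20     # harder pairs only the second run reaches

before = support_rate(shared_supported, shared_total)
after = support_rate(
    shared_supported + extra_supported, shared_total + extra_total
)

print(before)  # 80.0
print(after)   # 70.0
```

The rate falls by 10 percentage points even though no verdict on the shared pairs changed, which is why a per-report drop needs to be read alongside the change in pair volume.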


Bar chart compares claim-citation volumes by subtopic. Red bars for Config A, green for Config B. Categories include Astrophysics, Biophysics.
Figure 9. Claim citation pair volume by subtopic. Configuration B consistently analyzes more pairs across all 10 subtopics, with the largest absolute increases in biophysics and plasma physics.

Bar chart titled "Per-report support rate change" shows green bars (improved) and orange bars (regressed) for Config B vs Config A.
Figure 10. Per report support rate change, ranked by improvement. Most reports improve. The largest gains, above 30 percentage points, occur in reports with low baseline support. Negative deltas often reflect increased pair volume exposing weak citations.

Retrieval sources

The retrieval cascade draws on multiple sources. URL content and arXiv dominate in both configurations, but Configuration B shows a large increase in OpenAlex usage (628 to 1,190) and CrossRef usage (509 to 882), reflecting a more aggressive cascade. Web search usage dropped sharply (151 to 48), suggesting earlier cascade stages succeed more often.
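
A retrieval cascade of this kind can be sketched as an ordered list of fetchers where the first source to return text wins and later, more expensive sources are only tried on a miss. The example below is a sketch under assumptions: the source order, the stage names, and the stub fetchers are made up for illustration, and no real client or network call is involved.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Fetcher = Callable[[str], Optional[str]]


def fetch_with_cascade(
    citation_id: str, cascade: List[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each source in order; return (source_name, text) or (None, None)."""
    for name, fetch in cascade:
        try:
            text = fetch(citation_id)
        except Exception:
            text = None  # treat a failing source as a miss and fall through
        if text:
            return name, text
    return None, None


def source_usage(citation_ids: List[str], cascade: List[Tuple[str, Fetcher]]) -> Counter:
    """Count which source ended up serving each citation."""
    usage: Counter = Counter()
    for citation_id in citation_ids:
        name, _ = fetch_with_cascade(citation_id, cascade)
        usage[name or "unresolved"] += 1
    return usage


# Stub fetchers standing in for real clients (hypothetical behavior).
def from_url(cid: str) -> Optional[str]:
    return "abstract from url" if cid.startswith("http") else None


def from_index(cid: str) -> Optional[str]:
    return {"doi:demo": "abstract from index"}.get(cid)


def from_web_search(cid: str) -> Optional[str]:
    return "abstract from web search" if cid == "Some Title" else None


cascade = [
    ("url_content", from_url),
    ("metadata_index", from_index),
    ("web_search", from_web_search),
]
```

Tallying the winning source per citation, as source_usage does, gives the kind of distribution shown in the retrieval figures: when earlier stages succeed more often, the counts for the last-resort stage shrink.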


Two pie charts compare abstract retrieval sources for Configurations A and B, with segment colors representing different sources.
Figure 11. Abstract retrieval source distribution. URL content and arXiv dominate both configurations. Configuration B's stronger cascade is visible in the OpenAlex and CrossRef growth, with reduced reliance on web search.

Two donut charts compare full-text retrieval distributions. Configuration A has 1,204 calls, B has 1,370. Main sources are URL content, LLM Web Search.
Figure 12. Full text retrieval source distribution. URL content dominates full text retrieval in both configurations. Configuration B eliminated arXiv HTML and PDF fallbacks, relying on cleaner URL based retrieval and increased PMC usage.

Limitations

These comparisons are based on single-run snapshots and reflect the current retrieval stack, model choices, and source availability at evaluation time. Absolute latency and verdict distributions may shift with API behavior, model updates, or source-side changes. The strongest conclusions are therefore comparative: under identical evaluation framing, Configuration B improves citation validity, expands evaluable coverage, and increases support-aligned outcomes.


Looking ahead: Citation reliability as model evaluation

The larger point is that citation reliability deserves to be treated as a first-class part of model evaluation. Fluent or even factual-looking writing is not enough. In scientific and technical settings, trust depends on whether the evidence is real and whether it actually supports what the system says.


Separating citation existence, claim support, and evidence recovery is a practical way to make progress. Configuration B is still imperfect and brittle in some domains, but it verifies more pairs, retrieves better, reduces major hallucinations to zero, and grounds a larger fraction of claims. That is a meaningful step toward report generation systems that do not just sound authoritative, but earn it.

 
 