Lab Notes: CiteVerify - Fake citation detection and evidence verification for AI research
- FirstPrinciples

- May 21
Authored by Shaghayegh Sadeghi & Khashayar Khajavi, Machine Learning Associates - FirstPrinciples
Scientific writing has always relied on the implicit contract that claims must remain traceable to evidence. Large language models complicate that relationship. Modern systems can generate reports that appear rigorous and extensively sourced while quietly introducing fabricated citations. CiteVerify is FirstPrinciples’ answer to the rise of fake citations.
A recurring problem in LLM-generated reports, especially in systems like Deep Literature Search (DLS) at FirstPrinciples, is that the writing looks trustworthy long before the evidence actually is. A report may sound polished and cite in all the right places, yet still fail at the most important step: linking each claim to a real and genuinely supporting source. These failures show up as fabricated references, corrupted metadata, or real papers attached to claims they do not support. CiteVerify is designed to expose and measure this reliability gap in DLS.

Citation reliability is not one problem, but at least three. Does the cited source exist? Does it actually support the attached claim? If not, can the system recover a better source? Evaluation setups that collapse these into a single judgment make it hard to see what failed. CiteVerify separates the stages so that hallucinated citations, metadata mismatches, improperly assigned references, and true evidence gaps can be distinguished, making the final reports both auditable and traceable.
AI citation verification pipeline
The pipeline starts from a generated report and turns it into structured claim-citation pairs. From there, it runs citation existence verification against external academic sources rather than relying only on local metadata. Candidate references are checked across sources such as CrossRef, Semantic Scholar, OpenAlex, and arXiv, and each citation is scored for match quality. Citations that cannot be verified are filtered out before downstream evidence checking.
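As a rough illustration of the scoring step, the sketch below rates a cited reference against candidate records that have already been fetched. It is not CiteVerify's actual scoring rule: the `Reference` type, the DOI shortcut, the title-similarity measure, and the 0.9 threshold are all assumptions made for this example, and the lookups against the external sources are left out.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Reference:
    """Hypothetical minimal record for a cited or retrieved reference."""
    title: str
    doi: Optional[str] = None


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.lower().split())


def match_score(cited: Reference, candidate: Reference) -> float:
    """Score in [0, 1]; an identical DOI is treated as a perfect match (assumption)."""
    if cited.doi and candidate.doi and cited.doi.lower() == candidate.doi.lower():
        return 1.0
    return SequenceMatcher(
        None, normalize(cited.title), normalize(candidate.title)
    ).ratio()


def best_match(
    cited: Reference, candidates: list[Reference]
) -> tuple[Optional[Reference], float]:
    """Return the highest-scoring candidate across all sources, or (None, 0.0)."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = match_score(cited, cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score


def is_verified(score: float, threshold: float = 0.9) -> bool:
    """Hypothetical cutoff: citations scoring below it are filtered out."""
    return score >= threshold
```

In this sketch a citation with no candidate above the threshold never reaches evidence checking, which mirrors the filtering described above.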

The remaining pairs go through claim-evidence verification, where the system retrieves abstracts and, when needed, full-text passages to decide whether a citation supports, contradicts, or fails to provide enough evidence for a claim.
Unsupported cases can trigger an evidence finder that searches for replacement sources. This decomposition matters because fabricated and misused citations are different kinds of failure. If a reference cannot be resolved, the issue is upstream in citation construction. If it resolves cleanly but does not support the claim, the problem is more subtle. Separating the two gives a much clearer picture of how citation-grounded generation breaks down in practice.
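The distinction between the two failure kinds can be made concrete with a small triage sketch. The `PairResult` type, the category labels, and the rule that every resolved but unsupported pair goes to the evidence finder are assumptions for illustration, not the pipeline's actual data model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class PairResult:
    """Hypothetical outcome of the staged checks for one claim-citation pair."""
    resolved: bool                 # did the citation resolve to a real source?
    verdict: Optional[str] = None  # "supports", "insufficient evidence", or "contradicts"


def classify(result: PairResult) -> str:
    """Map stage outcomes to a failure kind (labels are illustrative)."""
    if not result.resolved:
        # Upstream problem: the citation itself could not be resolved.
        return "unresolved_citation"
    if result.verdict == "supports":
        return "supported"
    # The source is real but does not back the claim it is attached to.
    return "misused_citation"


def needs_evidence_finder(result: PairResult) -> bool:
    """Only resolved but unsupported pairs trigger a search for a replacement."""
    return classify(result) == "misused_citation"


def summarize(results: list[PairResult]) -> Counter:
    """Count pairs per failure kind so each stage can be reported separately."""
    return Counter(classify(r) for r in results)
```

Keeping the two categories separate in the tally is what lets a report say how many failures came from citation construction and how many from claim support.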
AI citation hallucination detection
Before verifying claim-evidence alignment, the pipeline checks whether cited sources actually exist. Across the 46 Configuration A reports (725 citations), the hallucination detector flagged 12 problematic references: 10 minor (corrupted metadata, wrong DOI, truncated URL, or mismatched identifier) and 2 major (entirely fabricated). Configuration B, with 907 citations, contained 2 minor hallucinations and zero major ones. The exact match rate rose from 98.3 percent to 99.8 percent.
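The headline percentages for this stage can be reproduced from the raw counts. The short check below assumes that the exact match rate is the share of citations not flagged by the detector; the post does not define the metric explicitly, so treat that as an inference that happens to match the reported figures.

```python
def exact_match_rate(total: int, flagged: int) -> float:
    """Share of citations not flagged, in percent (assumed definition)."""
    return 100.0 * (total - flagged) / total


citations_a, flagged_a = 725, 12  # Configuration A
citations_b, flagged_b = 907, 2   # Configuration B

rate_a = exact_match_rate(citations_a, flagged_a)
rate_b = exact_match_rate(citations_b, flagged_b)
hallucination_drop = 100.0 * (flagged_a - flagged_b) / flagged_a
volume_growth = 100.0 * (citations_b - citations_a) / citations_a

print(f"exact match: {rate_a:.1f}% -> {rate_b:.1f}%")      # 98.3% -> 99.8%
print(f"hallucinations down {hallucination_drop:.0f}%")    # 83%
print(f"citation volume up {volume_growth:.0f}%")          # 25%
```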

Key takeaways
Hallucinations drop from 12 to 2 (an 83 percent reduction), with major hallucinations reduced to zero. Exact match quality improves from 98.3 percent to 99.8 percent while citation volume increases by 25 percent. Configuration B relies entirely on URL-based matching, improving consistency and auditability.
Claim evidence alignment
Once hallucinated citations are filtered out, the central question is whether each remaining citation actually supports its attached claim. For every pair, the system retrieves the cited paper's abstract and compares it against the claim using an LLM.
If the abstract is insufficient, it escalates to full-text retrieval and passage-level comparison. Each pair receives one of three verdicts: supports (clear evidence), insufficient evidence (topically related but not directly supporting), or contradicts (conflicting information).
This three-class verdict is what distinguishes CiteVerify from simpler retrieval checks. A citation can be real, correctly formatted, and topically relevant yet still fail to support the specific quantitative or forward-looking claim it is attached to. The alignment stage is designed to catch exactly those cases by separating genuine evidential support from loose topical association.
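The abstract-first, full-text-second flow might look like the sketch below. The `judge` callable stands in for the LLM comparison and `fetch_passages` for full-text retrieval; both names and signatures are hypothetical, and the rule "escalate whenever the abstract verdict is insufficient evidence" is one plausible reading of the description rather than a confirmed detail.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class Verdict(Enum):
    SUPPORTS = "supports"
    INSUFFICIENT = "insufficient evidence"
    CONTRADICTS = "contradicts"


# Hypothetical judge: (claim, source_text) -> Verdict. In a real system this
# would wrap an LLM call; here it is any callable with that shape.
Judge = Callable[[str, str], Verdict]


def verify_pair(
    claim: str,
    abstract: Optional[str],
    fetch_passages: Callable[[], Iterable[str]],
    judge: Judge,
) -> tuple[Verdict, str]:
    """Return (verdict, stage), where stage is 'abstract' or 'full_text'."""
    if abstract:
        verdict = judge(claim, abstract)
        if verdict is not Verdict.INSUFFICIENT:
            return verdict, "abstract"  # decided without escalation

    # Escalate: full text is fetched lazily, only when the abstract
    # was missing or did not settle the question.
    for passage in fetch_passages():
        verdict = judge(claim, passage)
        if verdict is not Verdict.INSUFFICIENT:
            return verdict, "full_text"
    return Verdict.INSUFFICIENT, "full_text"
```

Returning the stage alongside the verdict makes it possible to report how many pairs resolve at the abstract stage and how many need escalation.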
Key takeaways
Alignment is evaluated as a three-class decision: supports, insufficient evidence, or contradicts. Most pairs resolve at the abstract stage. Full-text escalation captures claims requiring detailed evidence. This separation prevents topically related but unsupported citations from being counted as valid support.
Results
We evaluated CiteVerify on 46 DLS-generated physics reports across 10 subtopics, comparing Configuration A and Configuration B of the DLS pipeline. Replacing the LLM-written references section with a deterministic CitationRegistry removed a major class of citation failures, such as fabricated references, inconsistent numbering across subtopics, duplicate entries, and corrupted metadata, while enforcing globally consistent and verifiable citation mappings.
Configuration A analyzed 3,293 claim-citation pairs; Configuration B analyzed 4,611. In both, most pairs resolve at the abstract stage, with a smaller but important fraction requiring full-text analysis. That escalation simply reflects claims that reference specific experimental results, numerical bounds, or methodological details available only in the body of the paper. The metrics below show that Configuration B does more than clean up citations; it increases evaluable coverage, strengthens evidence tracing, and makes DLS reports more auditable.

The contradiction count rose from 65 to 86, reflecting stricter evaluation over a larger pair set. Many failures are not random hallucinations but subtle support errors. Typical cases include a cited paper that contains a numeric bound the report says is missing, a paper that is relevant to the topic but does not support the specific milestone or quantitative claim, a paper that supports only a narrower statement than the report implies, and a background reference stretched into evidence for a forward-looking claim. This reinforces a main lesson of the project: citation verification requires reasoning over the relationship between a claim and what the source really says.


Performance tradeoffs
Configuration B is faster per citation. Mean latency dropped from 5.72 seconds to 4.82 seconds, driven largely by more efficient abstract fetching (0.99 seconds down to 0.49 seconds). Total runtime still grew from about 18,839 seconds to 22,226 seconds because the system analyzes 40 percent more pairs, and peak memory rose modestly, from 1,489 MB to 1,559 MB. Given the jump in coverage and support rate, this is a favorable trade.
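The relationship between per-item latency and total runtime can be checked with a few lines of arithmetic. The calculation below multiplies the reported mean latency by the number of claim-citation pairs; that this product lands within a few seconds of the reported totals suggests the mean is measured per pair, though that reading is an inference from the numbers rather than something stated in the post.

```python
pairs_a, pairs_b = 3293, 4611      # claim-citation pairs analyzed
latency_a, latency_b = 5.72, 4.82  # mean latency in seconds
reported_a, reported_b = 18839, 22226  # reported total runtime in seconds

est_a = pairs_a * latency_a  # about 18,836 s
est_b = pairs_b * latency_b  # about 22,225 s

pair_growth = 100.0 * (pairs_b - pairs_a) / pairs_a              # about +40%
latency_change = 100.0 * (latency_b - latency_a) / latency_a     # about -16%
runtime_growth = 100.0 * (reported_b - reported_a) / reported_a  # about +18%

print(f"estimated runtime A: {est_a:.0f} s (reported {reported_a})")
print(f"estimated runtime B: {est_b:.0f} s (reported {reported_b})")
print(f"pairs {pair_growth:+.0f}%, latency {latency_change:+.0f}%, runtime {runtime_growth:+.0f}%")
```

A roughly 40 percent increase in pairs combined with a roughly 16 percent drop in latency yields the observed runtime growth of about 18 percent, since 1.40 × 0.84 ≈ 1.18.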


Key takeaways
Configuration B lowers mean latency per citation (5.72 seconds to 4.82 seconds) while increasing total evaluated coverage. Speed gains come primarily from abstract fetch efficiency (0.99 seconds to 0.49 seconds). Total runtime still grows because Configuration B evaluates 40 percent more claim-citation pairs.
Subtopic level variation
Gains are not uniform across subtopics. Some improved dramatically under Configuration B, a pattern typical when retrieval and evidence handling get better. Nuclear physics report 22 jumped from 49.5 percent to 81.6 percent support, and particle physics report 26 from 50.0 percent to 88.9 percent. The single-report subtopics (biomolecular simulations, precision tests, and molecular motors) also showed consistent gains.

A few categories became harder under Configuration B, especially some soft matter, plasma, and condensed matter reports. That is not necessarily bad news. In several cases the system is now analyzing many more pairs than before, surfacing harder or weaker ones that Configuration A never evaluated. Part of the drop reflects broader scrutiny rather than worse reasoning.


Retrieval sources
The retrieval cascade draws on multiple sources. URL content and arXiv dominate in both configurations, but Configuration B shows a large increase in the use of OpenAlex (628 to 1,190) and CrossRef (509 to 882), reflecting a more aggressive cascade. Web search usage dropped sharply (151 to 48), suggesting that earlier cascade stages succeed more often.
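A cascade of this kind can be sketched as an ordered list of fetchers tried in turn, with a tally of which stage succeeded. The stage names, their order, the `Fetcher` signature, and the choice to treat a failing source as a miss are all assumptions for illustration; the post does not specify the actual ordering beyond implying that web search comes late.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical fetcher: takes a citation key, returns text or None on a miss.
Fetcher = Callable[[str], Optional[str]]


def run_cascade(
    citation_key: str, stages: list[tuple[str, Fetcher]]
) -> tuple[Optional[str], Optional[str]]:
    """Try each source in order; return (source_name, text) for the first hit."""
    for name, fetch in stages:
        try:
            text = fetch(citation_key)
        except Exception:
            text = None  # treat a failing source as a miss and move on
        if text:
            return name, text
    return None, None


def tally_sources(
    citation_keys: list[str], stages: list[tuple[str, Fetcher]]
) -> Counter:
    """Count how often each stage supplied the text, as in the usage figures."""
    usage: Counter = Counter()
    for key in citation_keys:
        name, _ = run_cascade(key, stages)
        usage[name or "unresolved"] += 1
    return usage
```

Under this design, a drop in late-stage usage such as web search is what one would expect when earlier stages succeed more often, because later fetchers are only called after every earlier one has missed.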


Limitations
These comparisons are based on single-run snapshots and reflect the current retrieval stack, model choices, and source availability at evaluation time. Absolute latency and verdict distributions may shift with API behavior, model updates, or source-side changes. The strongest conclusions are therefore comparative: under identical evaluation framing, Configuration B improves citation validity, expands evaluable coverage, and increases support-aligned outcomes.
Looking ahead: Citation reliability as model evaluation
The larger point is that citation reliability deserves to be treated as a first-class part of model evaluation. Fluent or even factual-looking writing is not enough. In scientific and technical settings, trust depends on whether the evidence is real and whether it actually supports what the system says.
Separating citation existence, claim support, and evidence recovery is a practical way to make progress. Configuration B is still imperfect and brittle in some domains, but it verifies more pairs, retrieves better, reduces major hallucinations to zero, and grounds a larger fraction of claims. That is a meaningful step toward report generation systems that do not just sound authoritative, but earn it.






