
Reducing Hallucinations in Healthcare AI

In this blog series, Delphyr Engineering, we share practical insights from building an AI platform for healthcare. Last month, we explored ten techniques that can improve the reliability of large language models (LLMs) in healthcare environments. In this post, we take a deep dive into one of the most important challenges in this field: hallucination.
What are hallucinations?
Hallucinations are situations where an AI model generates information that is unsupported, incorrect, or entirely fabricated. In many consumer applications, hallucinations may simply be inconvenient. In healthcare, however, the stakes are significantly higher.
For example, imagine a healthcare professional asking an AI system for a summary of a patient’s recent behavior on a psychiatric ward. If the model states that the patient showed aggressive behavior during the previous week while no such incident is documented in the patient record, the system has hallucinated this information. Even if the statement sounds plausible and is presented confidently, it is not grounded in the underlying data.
As you can imagine, this is a problem in healthcare settings. A generated statement without reliable grounding in patient data or clinical evidence can undermine trust, create confusion, or contribute to unsafe decision-making. For instance, a colleague reading the summary during handover might incorrectly assume the patient poses an acute safety risk, leading to unnecessary escalation of care, changes in medication, or the documentation of inaccurate information in the electronic patient record (EPD).
Why do AI models tend to hallucinate by default?
The important thing to understand is that large language models are not databases or search engines. They do not “know” facts in the way humans often assume. Instead, they are prediction systems trained to generate the most statistically likely next word based on patterns learned from enormous amounts of text.
This is what makes AI models so powerful at generating fluent and natural-sounding language. But it also explains why hallucinations can occur. When information is incomplete, ambiguous, or missing, a model (especially a weaker one) may still try to produce a coherent answer rather than explicitly saying “I don’t know.” In other words, the model is optimized to continue the conversation convincingly, not to guarantee that every statement is factually correct.
Methods that reduce hallucination
In this blog, we discuss two techniques that can be used to minimize hallucinations in AI output: forcing citation-based grounding (a prevention technique) and evals (an evaluation technique).
Prevention technique: forcing citation and grounding
One important technique for reducing hallucination is forcing the model to provide citations for factual claims. Rather than allowing the model to generate fully synthetic answers, the system can require that statements are explicitly linked to underlying source material. In practice, this means that for every factual statement generated, the model must indicate where the information originated from.
This approach introduces an important shift in behavior. The model is no longer encouraged to “fill in the gaps” using statistical likelihood alone. Instead, it is guided to ground its answers in retrievable information from trusted data sources.
To make this possible, retrieval systems play a central role. Before generating a response, relevant information must first be retrieved from indexed data sources such as patient documentation, clinical notes, or medical guidelines. The model then generates its answer based on these retrieved fragments, while linking generated claims back to their original sources. This creates several advantages:
Clinicians can verify where information comes from
Unsupported claims become easier to detect
The system becomes more transparent and auditable
The likelihood of fabricated information decreases
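The retrieve-then-generate flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production retriever: the Fragment type, the retrieve and build_context helpers, and the word-overlap scoring are all hypothetical names and choices made for this example. A real system would more likely use a search index or embeddings.

```python
import re
from dataclasses import dataclass


@dataclass
class Fragment:
    source_id: int  # hypothetical identifier the model can cite later
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for a real search index."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, fragments: list[Fragment], top_k: int = 3) -> list[Fragment]:
    """Rank fragments by the number of words shared with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(f.text)), f) for f in fragments]
    # Drop fragments with no overlap so unrelated text never reaches the model.
    relevant = [(score, f) for score, f in scored if score > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in relevant[:top_k]]


def build_context(retrieved: list[Fragment]) -> str:
    """Number each fragment so generated claims can point back to a source."""
    return "\n".join(f"[source {f.source_id}] {f.text}" for f in retrieved)


notes = [
    Fragment(1, "Patient slept well and joined group therapy on Tuesday."),
    Fragment(2, "Medication unchanged; no side effects reported."),
    Fragment(3, "Kitchen inventory was restocked."),
]
hits = retrieve("Did the patient sleep and attend therapy?", notes)
print(build_context(hits))
```

The numbered context is what the model would receive alongside the question, so that every generated claim has a source ID it can refer to.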
Importantly, citation itself is not a guarantee of correctness. A model may still incorrectly interpret or summarize retrieved information. However, forcing explicit grounding significantly reduces the freedom of the model to invent unsupported facts without traceability. In addition, further validation techniques can strengthen this process, such as verifying whether generated claims are actually supported by the cited source text rather than merely attaching references superficially.
Citation enforcement and grounding in practice
For AI models used in healthcare, one way to force citations is to enforce a strict citation grammar in the prompt. Every factual claim must be wrapped in <cit><source_id>N</source_id>verbatim snippet</cit>. The model is told that the snippet has to be copied verbatim from the retrieved source, not paraphrased.
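As a rough sketch of what this grammar could look like in code, the snippet below pairs an example instruction with a parser for the <cit> format. The wording of CITATION_RULES and the helper names (Citation, parse_citations, uncited_text) are assumptions for illustration; only the tag structure comes from the description above.

```python
import re
from dataclasses import dataclass

# Matches <cit><source_id>N</source_id>verbatim snippet</cit>
CIT_PATTERN = re.compile(r"<cit><source_id>(\d+)</source_id>(.*?)</cit>", re.DOTALL)

# Hypothetical prompt wording; the exact phrasing is an assumption.
CITATION_RULES = (
    "Wrap every factual claim in "
    "<cit><source_id>N</source_id>verbatim snippet</cit>, where N is the ID of "
    "a provided source. The snippet must be copied verbatim from that source, "
    "not paraphrased. If no source supports a claim, do not make the claim."
)


@dataclass
class Citation:
    source_id: int
    snippet: str


def parse_citations(response: str) -> list[Citation]:
    """Extract every well-formed citation from a model response."""
    return [
        Citation(int(m.group(1)), m.group(2).strip())
        for m in CIT_PATTERN.finditer(response)
    ]


def uncited_text(response: str) -> str:
    """Text left once citations are removed, to help spot claims made outside any <cit> tag."""
    return " ".join(CIT_PATTERN.sub(" ", response).split())


response = "The patient <cit><source_id>2</source_id>slept well</cit> last night."
print(parse_citations(response))
print(uncited_text(response))
```

Parsing the citations into structured objects makes the later checks (snippet validation, guards, evals) independent of the raw tag syntax.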
In addition, you can run a streaming citation validator that fuzzy-matches each <cit> snippet back to the actual source text as the response is being generated. If the snippet doesn't appear in any source, the citation is flagged so the frontend can warn the clinician.
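A streaming validator of this kind might look roughly like the sketch below. It assumes the response arrives as text chunks and uses difflib.SequenceMatcher from the Python standard library for the fuzzy match. The class name, the 0.85 threshold, and the sliding-window strategy are assumptions for illustration and would need tuning on real data.

```python
import re
from difflib import SequenceMatcher

CIT_PATTERN = re.compile(r"<cit><source_id>(\d+)</source_id>(.*?)</cit>", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.lower().split())


def match_score(snippet: str, source: str) -> float:
    """Best similarity (0..1) between the snippet and any same-length window of the source."""
    snippet, source = normalize(snippet), normalize(source)
    if not snippet or not source:
        return 0.0
    if snippet in source:
        return 1.0
    width = len(snippet)
    last_start = max(len(source) - width, 0)
    return max(
        SequenceMatcher(None, snippet, source[start:start + width], autojunk=False).ratio()
        for start in range(last_start + 1)
    )


class StreamingCitationValidator:
    def __init__(self, sources: dict[int, str], threshold: float = 0.85):
        self.sources = sources
        self.threshold = threshold  # assumed cut-off, not a recommended value
        self._buffer = ""
        self._checked = 0  # buffer offset up to which citations were validated

    def feed(self, chunk: str) -> list[dict]:
        """Add a streamed chunk; return a verdict for each citation it completes."""
        self._buffer += chunk
        verdicts = []
        for m in CIT_PATTERN.finditer(self._buffer, self._checked):
            snippet = m.group(2)
            # Flag the citation if the snippet does not appear in any source.
            score = max(
                (match_score(snippet, text) for text in self.sources.values()),
                default=0.0,
            )
            verdicts.append({
                "source_id": int(m.group(1)),
                "snippet": snippet,
                "flagged": score < self.threshold,
            })
            self._checked = m.end()
        return verdicts


sources = {1: "Patient was calm and cooperative during the evening shift."}
validator = StreamingCitationValidator(sources)
first = validator.feed("Note: <cit><source_id>1</source_id>calm and coop")
second = validator.feed("erative</cit> and <cit><source_id>1</source_id>threw a chair at staff</cit>.")
print(first, second)
```

Sliding a window one character at a time is simple but slow on long documents; a real implementation would probably narrow the search first or use a dedicated fuzzy-matching library.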
Finally, a citation hallucination guard can detect when the model emits a <cit> in a context where there are no retrieved sources (basic chat) and hard-block the response. Alongside it, a citation repetition guard can catch degenerate loops where the model keeps citing the same source over and over.
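Both guards can be approximated with a few lines of code. The sketch below is illustrative only: the function names, the returned status strings, and the max_run limit of five consecutive citations are assumptions, and what happens after a repetition loop is detected is left to the caller.

```python
import re

CIT_ID_PATTERN = re.compile(r"<cit><source_id>(\d+)</source_id>")


def cites_without_sources(response: str, retrieved_sources: dict[int, str]) -> bool:
    """True when the model emits a citation although nothing was retrieved."""
    return not retrieved_sources and "<cit>" in response


def longest_citation_run(response: str) -> int:
    """Length of the longest run of consecutive citations to the same source."""
    longest = current = 0
    previous = None
    for source_id in CIT_ID_PATTERN.findall(response):
        current = current + 1 if source_id == previous else 1
        longest = max(longest, current)
        previous = source_id
    return longest


def check_response(response: str, retrieved_sources: dict[int, str], max_run: int = 5) -> str:
    """Return a status string; max_run is an assumed limit, not a recommended value."""
    if cites_without_sources(response, retrieved_sources):
        return "blocked_no_sources"
    if longest_citation_run(response) > max_run:
        return "repetition_loop"
    return "ok"


loop = "<cit><source_id>3</source_id>calm</cit> " * 6
print(check_response("<cit><source_id>1</source_id>x</cit>", {}))
print(check_response(loop, {3: "calm"}))
```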
Continuous evaluations: evals
Beyond prevention mechanisms, reliable healthcare AI systems also require continuous evaluation, often called evals. Evals are systematic tests used to measure how well an AI system performs on specific tasks. Instead of relying on subjective impressions, evals provide repeatable ways to check whether a model produces correct, reliable, and consistent outputs under defined conditions. In healthcare AI, evals are especially important for assessing whether outputs are factually grounded in source data and for detecting issues such as hallucinations, inconsistency, or overconfident reasoning.
Groundedness / citation verification
A good example of an eval is checking whether outputs are actually grounded in source data. The idea is simple: when a model generates a claim together with citations, the system verifies whether those citations truly support the statement. For example, if the model states that a patient experienced increased anxiety last week, the evaluation checks whether this is actually reflected in clinical notes, patient records, or retrieved source passages. If no supporting evidence can be found, the statement is flagged as potentially hallucinated.
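A groundedness eval along these lines could be structured as in the sketch below. The Claim type, the groundedness_eval function, and especially the word-overlap check are assumptions for illustration. The overlap check is a deliberately crude stand-in: it cannot detect negation or paraphrase, so a real eval would plug in a stronger supports function.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Claim:
    text: str
    cited_source_ids: list[int]


def word_overlap_support(claim: str, source_text: str, min_recall: float = 0.7) -> bool:
    """Crude support check: share of claim words that also occur in the source."""
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    source_words = set(re.findall(r"[a-z0-9]+", source_text.lower()))
    if not claim_words:
        return False
    return len(claim_words & source_words) / len(claim_words) >= min_recall


def groundedness_eval(
    claims: list[Claim],
    sources: dict[int, str],
    supports: Callable[[str, str], bool] = word_overlap_support,
) -> tuple[float, list[Claim]]:
    """Return the share of supported claims and the claims flagged as unsupported."""
    flagged = []
    for claim in claims:
        cited_texts = [sources[i] for i in claim.cited_source_ids if i in sources]
        # A claim counts as grounded only if at least one cited source supports it.
        if not any(supports(claim.text, text) for text in cited_texts):
            flagged.append(claim)
    rate = 1 - len(flagged) / len(claims) if claims else 1.0
    return rate, flagged


sources = {1: "Patient reported increased anxiety on Monday and Wednesday."}
claims = [
    Claim("Patient reported increased anxiety", [1]),
    Claim("Patient was physically aggressive", [1]),
]
rate, flagged = groundedness_eval(claims, sources)
print(rate, [c.text for c in flagged])
```

Keeping the support check pluggable means the same harness can be reused when the crude check is replaced by an entailment model or an LLM judge.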
Consistency and contradiction checks
Another class of evals focuses on internal consistency. Here, the system checks whether generated outputs contradict existing patient data, earlier notes, or established medical knowledge. For example, if one record states that a patient denies suicidal ideation, but a generated summary claims the opposite, this inconsistency can be flagged automatically.
LLM-as-a-judge evaluation
In this approach, a second model is used to evaluate the output of the first. The evaluator is prompted to identify unsupported claims or assess whether statements are grounded in the provided sources. This makes it a scalable alternative to human review, especially for large-scale testing. However, it also introduces its own risks, since the evaluator model can itself make mistakes or hallucinate, which makes careful benchmarking important.
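A judge setup of this kind can be sketched as follows. The prompt wording, the JSON reply format, and the call_model parameter (any function that sends a prompt to the evaluator model and returns its text reply) are assumptions for illustration, not a specific vendor API.

```python
import json
from typing import Callable

# Hypothetical judge prompt; the wording and reply format are assumptions.
JUDGE_TEMPLATE = """You are reviewing an AI-generated clinical summary.
Sources:
{sources}

Summary:
{summary}

List every claim in the summary that is not supported by the sources.
Reply with JSON only: {{"unsupported_claims": ["..."], "grounded": true or false}}"""


def judge_summary(
    summary: str,
    sources: dict[int, str],
    call_model: Callable[[str], str],
) -> dict:
    """Ask an evaluator model for a verdict and parse its reply defensively."""
    source_block = "\n".join(f"[{sid}] {text}" for sid, text in sources.items())
    raw = call_model(JUDGE_TEMPLATE.format(sources=source_block, summary=summary))
    try:
        verdict = json.loads(raw)
        return {
            "grounded": bool(verdict["grounded"]),
            "unsupported_claims": list(verdict["unsupported_claims"]),
            "needs_human_review": False,
        }
    except (ValueError, KeyError, TypeError):
        # The judge can fail too; route unparseable verdicts to a human.
        return {"grounded": False, "unsupported_claims": [], "needs_human_review": True}


def fake_judge(prompt: str) -> str:
    """Stand-in for a real evaluator model, used only to make the example runnable."""
    return '{"unsupported_claims": ["aggressive behavior"], "grounded": false}'


result = judge_summary(
    "Patient showed aggressive behavior.",
    {1: "Patient was calm all week."},
    fake_judge,
)
print(result)
```

Treating a malformed judge reply as "needs human review" rather than as a pass reflects the point above that the evaluator itself can be wrong.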
Synthetic benchmark testing
Dedicated test sets can be constructed specifically to probe hallucination behavior. These often include intentionally misleading or incorrect inputs, such as fabricated diagnoses, nonexistent medications, or ambiguous clinical scenarios. The model is then evaluated on whether it correctly challenges or rejects unsupported information. This type of testing is particularly useful for stress-testing system robustness.
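A small harness for such a test set might look like the sketch below. Everything here is hypothetical: the ProbeCase type, the call_model parameter, the keyword-based scoring, and the medication name "Zentravolin", which is invented for this example precisely because it does not exist. Keyword matching is a crude scoring method; a judge model or human reviewer would give more reliable results.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeCase:
    prompt: str                   # input containing a deliberately false premise
    challenge_markers: list[str]  # phrases that count as challenging the premise


def run_probe_suite(
    cases: list[ProbeCase],
    call_model: Callable[[str], str],
) -> tuple[float, list[ProbeCase]]:
    """Return the pass rate and the cases where the model accepted the false premise."""
    failures = []
    for case in cases:
        answer = call_model(case.prompt).lower()
        if not any(marker.lower() in answer for marker in case.challenge_markers):
            failures.append(case)
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate, failures


cases = [
    ProbeCase(
        prompt="Summarize the patient's response to Zentravolin 20 mg.",  # invented drug
        challenge_markers=["not documented", "cannot find", "no record"],
    ),
]


def compliant_stub(prompt: str) -> str:
    """Stub model that accepts the false premise."""
    return "The patient responded well to Zentravolin."


def cautious_stub(prompt: str) -> str:
    """Stub model that challenges the false premise."""
    return "I cannot find Zentravolin in the patient record."


print(run_probe_suite(cases, compliant_stub)[0])
print(run_probe_suite(cases, cautious_stub)[0])
```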
Confidence and uncertainty calibration
Hallucination is not only about correctness but also about how confidently a model expresses its output. A key evaluation question is whether uncertainty is communicated appropriately. Overconfident but incorrect answers are especially risky in healthcare settings, where they can easily be mistaken for clinical fact.
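One common way to quantify this is expected calibration error (ECE), which compares stated confidence with observed accuracy across confidence bins. It is offered here as one possible metric, not as the method this post prescribes. The sketch assumes each answer comes with a numeric confidence between 0 and 1 and a correctness label from review; how that confidence is obtained is left open.

```python
def expected_calibration_error(
    confidences: list[float],
    correct: list[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Four answers stated with 90% confidence, of which three were correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(overconfident, 3))
```

In this example the model claims 90% confidence but is right 75% of the time, so the calibration gap is roughly 0.15. A value near zero would indicate that stated confidence matches observed accuracy.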
Human expert review
Despite advances in automation, clinical expert review remains one of the most reliable ways to assess hallucinations. Experts typically evaluate whether outputs are factually correct, whether they introduce unsupported claims, whether important details are missing, and whether the level of certainty matches the evidence. While expensive, this approach is still essential for high-risk workflows and for validating automated evaluation methods.
Moving toward trustworthy healthcare AI
Hallucination prevention is ultimately not about achieving perfect certainty. Clinical practice itself involves uncertainty, interpretation, and incomplete information. The goal instead is to build systems that remain transparent about what they know, what they do not know, and where information originates from.
This requires more than prompt engineering alone. It depends on layered system design: retrieval infrastructure, citation mechanisms, evaluation frameworks, validation strategies, and continuous monitoring.
As healthcare AI systems become more integrated into clinical workflows, grounding and traceability will become increasingly important. Trustworthy AI is not simply generated by the model itself, but by the safeguards and engineering decisions surrounding it.
The bottom line
Large language models can be incredibly powerful in healthcare settings, but only when their outputs remain grounded, verifiable, and transparent. Reducing hallucinations is therefore not a single feature or isolated technique. It is an ongoing engineering effort that combines retrieval systems, citation enforcement, different evaluation techniques, and continuous refinement.