Large Language Models (LLMs) have moved into clinical workflows quickly. They now draft clinical notes, triage patient queries, surface drug information, and guide protocol decisions across a range of healthcare products. The adoption is real, and it is running ahead of the conversation about how to make these systems safe enough for those settings.
Healthcare doesn't tolerate mistakes the way most domains do. If a chatbot hallucinates a restaurant recommendation, you end up with a slightly disappointing dinner. If it hallucinates a drug dose, someone gets hurt. That asymmetry matters. It changes what "good enough" looks like, and it argues for a different engineering posture: clinical claims shouldn't come from a model's fuzzy compression of its training data — they need to be grounded in sources that are actually authoritative.
A common approach to grounding is to give models structured access to authoritative knowledge at inference time through tools, MCP servers, or retrieval pipelines, rather than relying entirely on what they absorbed during pretraining. The approach works. But how much it helps, on which tasks, and whether models actually use these tools well are empirical questions. Without benchmarks designed for those questions, "we have grounding" can become a checkbox with little behind it.
These questions get harder when the context is Indian healthcare.
India's clinical reality is not a regional flavor on top of Western medicine. It is, for AI purposes, a different problem space.
The country has more than 500,000 branded drugs in active circulation, most with no US or EU equivalent. Hundreds of manufacturers produce variants of the same generic under different brand names, and the same brand name occasionally maps to different compositions across manufacturers. An LLM that can confidently identify Tylenol or Lipitor often cannot tell you what Jorolac or Cefotel Suspension contains. That is not a model weakness; it is a data density problem. These molecules do not appear with enough frequency in pretraining corpora for parametric memory to handle them reliably.
Clinical protocols are similarly local. ICMR, RSSDI, IAP, MoHFW, and various state health ministries publish guidelines that diverge from FDA or NICE recommendations, partly because of population phenotypic differences and partly because the operational realities of Indian healthcare delivery are different. The National Formulary of India tailors pharmacological standards to the Indian population. Estimating glomerular filtration rate (eGFR), for example, is complicated by differences in muscle mass, diet, and body surface area (BSA) relative to the Western populations in which the standard formulas were developed. Day-to-day clinical practice likewise draws on protocols from Indian bodies such as ICMR, RSSDI, and IAP, which must be factored in on top of existing Western literature and guidelines.
A model that scores well on USMLE is, in this context, being measured on a test that does not reflect the conditions it would be deployed into. Most medical AI benchmarks are built for Western healthcare; performance on those benchmarks tells you how a model handles Western drugs, Western protocols, Western clinical conventions. The distance between that and Indian clinical practice is real, consistent, and not closed by simply scaling up the model.
Given the scale of healthcare delivery in India, that gap is not a small one to leave unaddressed.
To address this, we built two things at EkaCare: a remote MCP server that gives any compatible LLM client structured access to Indian medical knowledge, and four open datasets that let the community measure how well models actually use that access.
The MedAI MCP server exposes four knowledge domains as four tools.
Indian Branded Drug Search indexes 500,000+ branded medications drawn from real EMR prescription data — the long tail of drugs Indian doctors actually prescribe, not textbook examples. The tool turns a coverage problem into a lookup problem.
Indian Treatment Protocol Search indexes hundreds of published guidelines from ICMR, RSSDI, IAP, MoHFW, and international publishers in a vector knowledge base. Results return as the actual protocol pages, preserving table structure and clinical notation rather than text re-rendered through another model. The tool exposes two intents: publishers (discover available guideline sources) and search (retrieve by clinical query and publisher name). That two-step design turns out to matter a lot; more on this below.
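A minimal sketch of the discover-then-search pattern, assuming a generic MCP client wrapper `call_tool(name, args)`. The tool name and argument schema here are illustrative, not the server's exact contract:

```python
# Sketch of the two-intent protocol lookup. `call_tool` stands in for any
# MCP client; the payload shapes below are assumptions for illustration.

def fetch_protocols(call_tool, clinical_query):
    # Intent 1: discover canonical publisher names instead of guessing them.
    publishers = call_tool("indian_treatment_protocol_search",
                           {"intent": "publishers"})
    results = []
    for publisher in publishers:
        # Intent 2: search with the canonical name, so free-form aliases
        # never reach the index and come back empty.
        hits = call_tool("indian_treatment_protocol_search",
                         {"intent": "search",
                          "query": clinical_query,
                          "publisher": publisher})
        results.extend(hits)
    return results
```

The point of the sketch is the ordering: discovery output feeds the search call, so the model never has to invent a publisher string.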
Indian Pharmacology Details provides structured lookups from the National Formulary of India 2011: indications, contraindications, dosage, adverse effects, and pregnancy safety for generic drugs.
Medical Calculators expose 403 clinical calculators across 26 categories — cardiovascular, nephrology, hematology, OB-GYN, pulmonology, diabetes, pediatrics, and more — as three separate tools that mirror how a clinician approaches an unfamiliar formula: discover available calculators, fetch the input schema, then compute. The model cannot silently misremember the Parkland equation or approximate CKD-EPI. It either runs the right calculator with the right inputs, or it does not.
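The three-step workflow can be sketched as follows. The tool names and payload shapes are assumptions for illustration; the tool reference documents the real schemas:

```python
# Sketch of the discover -> schema -> compute loop for the calculator tools.

def run_calculator(call_tool, category, calc_id, inputs):
    # Step 1: discover which calculators exist in this category.
    available = call_tool("list_calculators", {"category": category})
    if calc_id not in available:
        raise ValueError(f"unknown calculator: {calc_id}")
    # Step 2: fetch the input schema and check the inputs against it,
    # instead of letting the model improvise field names or units.
    schema = call_tool("get_calculator_schema", {"id": calc_id})
    missing = [field for field in schema["required"] if field not in inputs]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    # Step 3: compute deterministically on the server.
    return call_tool("compute_calculator", {"id": calc_id, "inputs": inputs})
```

Each step constrains the next, which is what makes the final computation auditable rather than a number the model produced from memory.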
To make this evaluable, not just usable, we open-sourced four new datasets on HuggingFace under the ekacare organization. Each is also integrated into KARMA, our open-source medical AI evaluation framework.
We chose to release these openly because closed benchmarks tend to reflect the priorities of one team, while shared benchmarks let a community converge on what works.
We benchmarked frontier LLMs against all four datasets, with and without tool access. The pattern that emerges is more interesting than a simple "tools help."
Without tools, both frontier models score in the high 70s to low 80s — reasonable, but well short of what a clinical drug identification application would need. With tool access, both clear 99%.
The gap here is not pharmacological understanding. It is data density. 500,000 brand names and manufacturer variants do not appear often enough in pretraining for parametric memory to handle them reliably. The tool closes the gap by making real, accurate, continually-growing data accessible at inference time. No amount of model scaling fixes a coverage problem.
GPT-5-mini is the outlier, and a useful one. Even with the same tool access, it gains only 4.4 points. The pattern repeats across our other evaluations: smaller models struggle to reliably invoke tools and parse structured outputs. Tool access helps only when the model uses the tool well.
Pharmacological principles transfer better across training corpora. A model that learned general pharmacology from Western literature is not starting from zero on mechanisms and indications. But the no-tool ceiling still falls meaningfully below what clinical use requires. The NFI tool moves both frontier models into the mid-80s, recovering accuracy on India-specific dosing conventions and prescription patterns — the kind of detail that matters for handling local AMR realities.
This is the largest lift in the entire evaluation: 38 percentage points for Claude Sonnet 4.6 on calculators. Models do not fail calculator questions because they do not know what BMI means. They fail because they misremember formula variants, mishandle unit conversions, or compound arithmetic errors across multi-step chains. Deterministic computation eliminates all of that.
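To make the deterministic-computation point concrete, consider the Parkland formula. The formula itself is standard; the function below is a minimal sketch of what a server-side calculator might do, not the server's actual implementation:

```python
def parkland_fluid_ml(weight_kg, tbsa_pct):
    """Parkland formula: total crystalloid volume for the first 24 hours
    after a burn is 4 mL x body weight (kg) x % total body surface area
    burned, with half of the total given in the first 8 hours."""
    total_24h = 4.0 * weight_kg * tbsa_pct
    first_8h = total_24h / 2.0
    return total_24h, first_8h
```

A model can misremember the coefficient or fumble the 8-hour split; this function cannot. That asymmetry is the entire argument for offloading formulas to tools.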
But aggregate accuracy hides where models are still failing. Error analysis across 1,066 samples reveals two failure classes that persist even with tool access. The first is wrong calculator selected — picking a semantically adjacent but distinct tool, like the generic BMI calculator where a sex-specific variant is required. The second, affecting 12–13% of samples across all three models, is right calculator, wrong inputs: either enum misselection (mapping "sedentary" to "very_active" in a TDEE activity field) or unit-conversion overreach, where the model receives a correct result, decides it should be in a different unit, and modifies the answer — introducing an error the calculator did not make.
GPT-5-mini exhibits one additional failure: it passes natural-language category names instead of the required snake_case slugs, gets back empty lists, and falls back to self-computation. That single bug accounts for most of its 7.8% "listed but never computed" rate, versus 1–2.4% for the frontier models.
The takeaway: aggregate accuracy hides whether a failure is at tool invocation, calculator selection, input mapping, or output handling. Each is a different problem with a different fix.
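A failure taxonomy like this is straightforward to automate. The sketch below buckets an evaluated sample into the four stages; the trace fields are hypothetical, and a real harness would populate them from logged MCP calls and the model's final answer:

```python
# Classify a tool-use trajectory into one of the failure stages named above.

def classify_failure(trace):
    if not trace["tool_called"]:
        return "tool_invocation"       # never reached the tool at all
    if trace["calculator_used"] != trace["calculator_expected"]:
        return "calculator_selection"  # semantically adjacent but wrong tool
    if trace["inputs_sent"] != trace["inputs_expected"]:
        return "input_mapping"         # enum misselection, wrong units
    if trace["final_answer"] != trace["tool_result"]:
        return "output_handling"       # e.g. a post-hoc unit "correction"
    return "correct"
```

Counting samples per bucket, rather than only per final score, is what makes each failure class fixable on its own terms.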
This is where the evaluation gets most interesting.
Claude Sonnet 4.6 starts at 68.6% without tools, slightly behind GPT-5.2's 72.2% baseline. With tool access, Claude jumps to 83.8%. GPT-5.2 actually drops to 70.6% — a 1.6 pp regression, from the same tools, on the same dataset.
The difference is not access. It is strategy.
The protocol tool has two intents: publishers (list available guideline sources) and search (retrieve text by query and publisher). All three models call publishers before searching most of the time — Claude in 97% of samples, GPT-5.2 in 91%, GPT-5-mini in 75%. Sequencing is largely consistent across the board.
What separates the models is what they do after the tool returns. Claude's average answer is 38% longer with tools than without — the retrieved protocol text actually shows up in what it writes. GPT-5.2's average answer is slightly shorter with tools than without. It calls the tool, reads the result, and then produces close to the same answer it would have written from memory. The retrieval reaches the model's context but not its output. The tool use is going through the motions.
The rubric breakdown confirms this. Claude's gain is concentrated in factual accuracy: 64.6% to 82.1%, a 17.5-point jump. GPT-5.2's accuracy actually drops by a point. When retrievals are partial or imperfect, a model that ignores them stays close to its prior, while a model that integrates them takes a small hit on the messy cases and a large benefit on the rest. Claude pays that cost on a few samples and recovers it many times over. GPT-5.2 doesn't, because it isn't really integrating either way.
We also evaluated the queries each LLM made to the tools.
Retrieval hygiene widens the gap further. Claude's search calls come back with useful content 90% of the time; GPT-5.2's 84%; GPT-5-mini's only 75%. Claude sticks to the canonical publisher names returned by discovery (73 distinct strings across the whole run). GPT-5-mini uses 113 — the same source under multiple aliases, like KDIGO and Kidney Disease Improving Global Outcomes (KDIGO) showing up about 50 times each. Every invented variant is a missed lookup, and a quarter of GPT-5-mini's searches return nothing as a result.
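The alias problem is mechanical enough that an evaluation harness can detect it directly. Below is a sketch of collapsing free-form publisher strings onto the canonical names returned by the publishers intent, using a simple containment heuristic chosen purely for illustration:

```python
import re

def canonicalize(publisher, canonical_names):
    """Map a model-generated publisher string to a canonical name, or None."""
    def norm(s):
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    p = norm(publisher)
    for name in canonical_names:
        c = norm(name)
        # "kidney disease improving global outcomes kdigo" contains "kdigo",
        # so long-form aliases collapse onto the canonical short name.
        if c in p or p in c:
            return name
    return None
```

Running every logged search argument through a mapping like this is how we counted distinct publisher strings per model; it also suggests a cheap server-side mitigation for alias-driven empty lookups.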
The headline number — Claude finishing 13.2 points ahead of GPT-5.2 despite starting 3.6 points behind — comes down to two facts: Claude retrieves better, using the canonical publisher names, and it lets the retrieved evidence change its answer. GPT-5.2 mostly does neither.
The analysis points to concrete interventions, each tied to a specific failure mode.
Test whether tools change the answer at all. Before measuring accuracy with and without tools, measure answer length and content with and without tools. If a model's outputs barely move when retrieved context is added, the bottleneck isn't the tool — it's the model's willingness to use what came back. Prompt tweaks to sequencing won't fix this. It's a training-and-defaults problem.
Explicit instructions to ground answers in retrieved text. For models that retrieve well but write the same answer regardless, system prompts that ask for direct citation or quoting from retrieved passages can pull the integration step into the foreground. This is the lowest-cost fix and should be tested first for models like GPT-5.2 that already sequence correctly.
Fine-tuning on tool-calling trajectories. For production deployments, fine-tuning on examples of full discover-retrieve-integrate flows changes the model's default behavior. This is the durable fix when prompt engineering plateaus.
Benchmark retrieval and integration, not just final accuracy. Final-answer scores can't tell you whether a model got the right answer because it retrieved the right source or because its training data happened to cover the question. Tracking retrieval success, answer-length response to context, and rubric-level accuracy gains separately is the difference between an evaluation that tells you a model is "worse with tools" and one that tells you why.
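The first intervention above, checking whether retrieved context changes the answer at all, can be sketched as a cheap diagnostic. All names here are placeholders for whatever an evaluation harness produces:

```python
def integration_signal(answer_without_tools, answer_with_tools, retrieved_text):
    """Two crude signals of whether retrieval reached the model's output."""
    baseline = max(len(answer_without_tools), 1)
    # If this stays near zero across a dataset, retrieval is not reaching
    # the output, whatever the accuracy numbers say.
    length_delta = (len(answer_with_tools) - len(answer_without_tools)) / baseline
    # Grounding proxy: how much of the retrieved text's vocabulary
    # actually shows up in the tooled answer.
    retrieved = set(retrieved_text.lower().split())
    answered = set(answer_with_tools.lower().split())
    overlap = len(retrieved & answered) / max(len(retrieved), 1)
    return {"length_delta": length_delta, "context_overlap": overlap}
```

Both signals are deliberately crude; their value is that they separate "didn't retrieve" from "retrieved but ignored it" before anyone argues about accuracy.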
The MCP server is remote and hosted, and speaks standard MCP over HTTP. No SDK to install, no local process to run.
The quickstart documentation has per-client setup with config snippets, the tool reference covers all six tools and the three-step calculator workflow, and the authentication page documents both the OIDC flow and the direct-token path.
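Because the transport is standard MCP over HTTP, which is JSON-RPC 2.0 under the hood, any compliant client can talk to the server. The sketch below only builds the request payloads; the tool name in `tools/call` is an assumption for illustration, and the endpoint URL, headers, and auth follow the documentation above:

```python
# Build standard MCP JSON-RPC payloads. Actually sending them (endpoint,
# auth headers) is the client's job and is covered in the quickstart.

def mcp_request(method, params, req_id=1):
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}

list_tools = mcp_request("tools/list", {})
drug_lookup = mcp_request("tools/call", {
    "name": "indian_branded_drug_search",   # hypothetical tool identifier
    "arguments": {"query": "Cefotel Suspension"},
}, req_id=2)
```

`tools/list` and `tools/call` are the standard MCP methods for discovery and invocation, which is what "no SDK to install" means in practice.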
Both the datasets and the KARMA evaluation framework are public. KARMA (Knowledge Assessment and Reasoning for Medical Applications) is an open-source toolkit for evaluating medical AI systems across text, image, and audio, with particular focus on India's healthcare environment. It includes standardized metrics, out-of-the-box support for major model providers, and a registry system that lets researchers integrate their own models and datasets with minimal friction.
pip install karma-medeval
karma eval --model <your-model> --datasets ekacare/indian_drug_mcqa
karma eval --model <your-model> --datasets ekacare/Eka_NFI_MCQA
karma eval --model <your-model> --datasets ekacare/medical_calculator_eval
karma eval --model <your-model> --datasets ekacare/protocol_retrieval_rubrics

Run both tracks — with tools and without. The lift between them is often more diagnostic than the absolute score. If you benchmark a model not listed here, we would like to see the results. Contribute at karma.eka.care or join the conversation on GitHub.
The drug and calculator results establish that tool access produces gains large enough to shift from clinically inadequate to clinically viable — 80% to 99% on drug identification, 43% to 82% on calculators. These are tasks where parametric memory does not compete well with authoritative lookup, regardless of model size or training quality. For Indian healthcare AI products deployed without grounding for these capabilities, the gap between model output and clinical acceptability is large enough to warrant attention before scaling further.
The protocol results establish something more nuanced and, for decision-makers, more important. The infrastructure is necessary but insufficient. Two models with equivalent baselines diverge by 13 percentage points based entirely on how they make tool calls and integrate tool outputs into their answers. The calculator error analysis shows the same dynamic from a different angle — models that nominally use tools but mishandle enum fields, unit conventions, or category slug formats leave substantial accuracy on the table in ways aggregate benchmarks do not surface.
For teams deploying AI in Indian healthcare contexts, this points to two parallel workstreams. First, invest in tool infrastructure — drug databases, calculators, protocol search — built for Indian clinical reality, not adapted from Western equivalents. Second, invest in evaluating how your models use those tools, not just whether they have access. Treat tool-calling strategy as a variable you measure and optimize, not an implementation detail you assume is handled.
For senior stakeholders evaluating frontier models for clinical deployment, two gaps deserve attention before procurement decisions are finalised. The gap between "this model passes USMLE" and "this model handles Indian clinical context reliably" is real and measurable. So is the gap between "this model has tools" and "this model uses them well." We have tried to make both gaps visible, and to put the infrastructure — the tools, the datasets, the evaluation framework — into the public domain so the Indian medical AI community can run, extend, and improve them together.
Rigorous evaluation, covering not only accuracy but also retrieval behaviour, tool-calling strategy, and error-class breakdowns, is the part of this work we think is most undersupplied today. We hope the resources released here are useful starting points, and we welcome contributions from teams working on similar problems.
This work would not have come together without Nikhil Kasukurthi, who led much of the engineering and evaluation effort behind MedAI Tools and KARMA. The tool design, the dataset curation, the benchmarking pipeline, and most of the analysis surfaced in this post are the result of his work. Thanks, Nikhil.