MedAssist: why Polish clinical NLP is different from English clinical NLP.

Clinical NLP products are almost all English-first. The vocabulary databases, the fine-tuning corpora, and the clinical terminologies (SNOMED CT, UMLS, MedDRA) were all built in English and translated outward. For Polish clinical work, that translation layer fails in ways that matter at the point of care.

The three things that actually break

1. Declension and the lost subject. Polish is a heavily inflected language. A medication name changes form depending on grammatical case: paracetamol (nominative), paracetamolu (genitive), paracetamolem (instrumental). English clinical NLP tokenisers treat these as three unrelated strings, so a dosage lookup against a drug database indexed on English forms misses the inflected Polish forms entirely. Polish also routinely drops the grammatical subject, so a note often never states outright who or what a finding refers to.

2. Clinical Polish is not formal Polish. Polish clinicians abbreviate heavily: "OB podwyższone, CRP 48, PLT 450, bez gorączki" (ESR elevated, CRP 48, PLT 450, no fever). Standard Polish NLP models handle newspaper Polish and legal Polish, and they tokenise clinical abbreviations as noise. A Polish clinical NLP model has to be trained on actual clinical notes, not Polish Wikipedia.

3. The regulatory frame is Polish and different. ICD-10-PL has Polish-specific codes. Drug-interaction checks reference the Polish drug formulary, not the US FDA Orange Book. Reimbursement codes come from NFZ (Narodowy Fundusz Zdrowia, the National Health Fund), not CPT. A "good" English clinical NLP system ported to Poland will produce the wrong code and report high confidence while doing so.
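The declension problem in point 1 can be sketched in a few lines. This is a minimal illustration, not MedAssist's implementation: the suffix list, the function name normalise_drug_name and the two-entry index keyed by nominative form are all assumptions made for the example. A production system would more likely rely on a full Polish morphological analyser than on suffix stripping.

```python
from typing import Dict, Optional

# Hypothetical, incomplete list of singular case endings that attach to
# masculine drug names such as "paracetamol". Not a full Polish morphology.
CASE_SUFFIXES = ("em", "owi", "ie", "u", "a")

# Hypothetical drug index keyed by nominative (dictionary) form,
# mapping to the drug's ATC code.
DRUG_INDEX: Dict[str, str] = {
    "paracetamol": "N02BE01",
    "ibuprofen": "M01AE01",
}


def normalise_drug_name(token: str, index: Dict[str, str]) -> Optional[str]:
    """Return the index key that `token` inflects from, or None.

    Tries the token as written first, then retries with each known
    case suffix stripped, accepting a stem only if it is in the index.
    """
    word = token.lower()
    if word in index:
        return word
    for suffix in CASE_SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in index:
                return stem
    return None


def exact_lookup(token: str, index: Dict[str, str]) -> Optional[str]:
    """The naive behaviour: an exact string match on the token."""
    return index.get(token.lower())
```

With this sketch, exact_lookup("paracetamolu", DRUG_INDEX) finds nothing, while normalise_drug_name("paracetamolu", DRUG_INDEX) resolves to the "paracetamol" entry. Accepting a stripped stem only when it already exists in the index keeps the rule from inventing drug names out of unrelated words.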

What we actually did

MedAssist’s transcription agent (agent/transcribe-pl) was trained on Polish clinical notes and real consultation transcripts, not general-language Polish corpora. Its vocabulary is the vocabulary clinicians actually use on shift. Its handling of abbreviations is medical-specific, not general.
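To make "medical-specific abbreviation handling" concrete, here is a small rule-based sketch that turns the shorthand note quoted earlier into structured findings. It is an illustration only: agent/transcribe-pl is a trained model, and the lexicon LAB_ABBREVIATIONS, the pattern FINDING and the function parse_note are hypothetical names invented for this example. It also assumes commas separate findings, which breaks on Polish decimal commas such as "4,5".

```python
import re
from typing import Dict, List, Union

# Hypothetical three-entry lexicon; real clinical shorthand is far larger.
LAB_ABBREVIATIONS: Dict[str, str] = {
    "OB": "erythrocyte sedimentation rate",
    "CRP": "C-reactive protein",
    "PLT": "platelet count",
}

# An upper-case abbreviation followed by one value token,
# e.g. "CRP 48" or "OB podwyższone".
FINDING = re.compile(r"^(?P<abbr>[A-ZĄĆĘŁŃÓŚŹŻ]{2,5})\s+(?P<value>\S+)$")

Finding = Dict[str, Union[str, int]]


def parse_note(note: str) -> List[Finding]:
    """Split a comma-separated shorthand note into structured findings.

    Segments that match a known lab abbreviation become test/value
    records; everything else is kept as free text rather than dropped.
    """
    findings: List[Finding] = []
    for segment in (part.strip() for part in note.split(",")):
        if not segment:
            continue
        match = FINDING.match(segment)
        if match and match["abbr"] in LAB_ABBREVIATIONS:
            raw = match["value"]
            findings.append({
                "abbreviation": match["abbr"],
                "test": LAB_ABBREVIATIONS[match["abbr"]],
                # Whole numbers become ints; qualitative words stay as text.
                "value": int(raw) if raw.isdigit() else raw,
            })
        else:
            findings.append({"free_text": segment})
    return findings
```

Calling parse_note("OB podwyższone, CRP 48, PLT 450, bez gorączki") yields three lab records and one free-text segment. The point of the sketch is the contrast with a general-language tokeniser: "OB" is treated as a named test with a value, not as noise.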

The ICD-10 coding agent (agent/code-icd10) references ICD-10-PL as its primary source, with a fallback to ICD-10 WHO where the Polish extension is silent. The drug-interaction agent (agent/check-interactions) prioritises Polish formulary data, which has different brand-name coverage than UK or US formularies.
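The primary-then-fallback lookup described above can be sketched as an ordered search over code tables. Everything here is an assumption for illustration: the names CodeHit and resolve_code are hypothetical, and the two sample tables are deliberately tiny. That J18.9 is missing from the sample Polish table only exercises the fallback path and says nothing about what ICD-10-PL actually covers.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class CodeHit(NamedTuple):
    code: str
    description: str
    source: str  # which table answered, so downstream steps can audit it


# Illustrative sample tables only; not extracts of the real classifications.
ICD10_PL: Dict[str, str] = {
    "I10": "Nadciśnienie samoistne (pierwotne)",
}
ICD10_WHO: Dict[str, str] = {
    "I10": "Essential (primary) hypertension",
    "J18.9": "Pneumonia, unspecified",
}

# Order matters: the Polish table is consulted first.
SOURCES: Tuple[Tuple[str, Dict[str, str]], ...] = (
    ("ICD-10-PL", ICD10_PL),
    ("ICD-10 WHO", ICD10_WHO),
)


def resolve_code(code: str) -> Optional[CodeHit]:
    """Return the first table's entry for `code`, or None if no table has it."""
    for source, table in SOURCES:
        description = table.get(code)
        if description is not None:
            return CodeHit(code, description, source)
    return None
```

Recording the source on every hit is the design choice worth noting: a reviewer can see at a glance whether a code came from the Polish table or from the WHO fallback, and an unknown code returns None instead of a guess.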

The QA gate (agent/qa-clinical) flags where the agent’s confidence drops below threshold — which, in practice, is where a naive ported system would still report high confidence and be wrong.
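A confidence gate of this kind reduces to a threshold filter. The sketch below assumes each agent reports a confidence score between 0.0 and 1.0. The AgentOutput type, the function needs_review and the 0.85 threshold are hypothetical values chosen for the example, not MedAssist's actual settings.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AgentOutput:
    agent: str         # e.g. "agent/code-icd10"
    value: str         # the code, drug pair or text the agent produced
    confidence: float  # assumed to lie in [0.0, 1.0]


REVIEW_THRESHOLD = 0.85  # hypothetical; would be tuned per agent in practice


def needs_review(
    outputs: Iterable[AgentOutput],
    threshold: float = REVIEW_THRESHOLD,
) -> List[AgentOutput]:
    """Return the outputs whose confidence falls below the threshold."""
    return [output for output in outputs if output.confidence < threshold]


if __name__ == "__main__":
    batch = [
        AgentOutput("agent/code-icd10", "I10", 0.97),
        AgentOutput("agent/code-icd10", "J18.9", 0.62),
    ]
    for flagged in needs_review(batch):
        print(f"review: {flagged.agent} -> {flagged.value} ({flagged.confidence:.2f})")
```

A filter like this is only as useful as the confidence scores are honest. If a ported model reports 0.97 on a code it gets wrong, no threshold catches it, which is why the gate depends on agents whose confidence was calibrated on Polish clinical data.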

Why this is a defensible moat

Clinical NLP quality compounds on clinical data. An English-tuned system retrained on Polish sees fewer examples, covers fewer abbreviations, and hits the regulatory-frame mismatch at every encounter. A system built Polish-first from the outset doesn’t carry those debts.

And the regulatory frame moves. RODO (the Polish name for GDPR) has specific clinical-data provisions. NFZ billing codes change by reimbursement decree. The drug formulary shifts quarterly. A product that treats Polish as a translation output will lag those changes. One that treats Polish as its native frame keeps pace.

"Clinical NLP is not translation. It is a regulatory, linguistic, and epistemic rebuild per jurisdiction."

What this means for clinics buying clinical NLP

Ask the vendor: what’s the first language you trained on? If the answer is English with Polish as a locale, you’re buying a translation layer. If it’s Polish from the start, you’re buying the right product.

MedAssist is in active build, with pilot-clinic slots opening for Q3 2026 deployment. Enquire: enterprise@blackflake.com.

Bartek Kubas · Founder-architect · Blackflake · 15 May 2026 · Łódź, Poland