Large language models for clinical text analysis

In collaboration with Durham and Manchester, we're developing software to analyse over 10 million clinical free-text notes.

We are using large language models (LLMs), which are huge neural networks trained to understand text. Our main efforts centre on a family of smaller LLMs: Bidirectional Encoder Representations from Transformers (BERT). We've adapted BERT models through additional training on veterinary clinical text, creating PetBERT, DogBERT and EquineBERT. We can fine-tune these on labelled records to create classifiers (language models that decide whether the text implies a syndrome is present or not) and entity extractors (language models that identify types of objects, such as drugs or diagnoses). These classifiers and extractors use supervised learning, which requires manually annotated records for training. For this we can draw on thousands of record labels applied over the last ten years by undergraduate and postgraduate researchers using our in-house record analysis system, SAVSNET-Datalab.

Additionally, we can use these models to cluster records by their semantic content and identify which terms are most important in each cluster. This is called unsupervised topic modelling, and it has the potential to identify unexpected or novel combinations of features.
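As a rough illustration of the "most important terms in each cluster" step, the sketch below ranks terms with a class-based TF-IDF score, a common choice in topic modelling but not necessarily the method used in our published work. The example notes are invented, the cluster labels are hard-coded stand-ins for what clustering of model embeddings would produce, and the function name, stop-word list and parameters are all hypothetical.

```python
import math
import re
from collections import Counter

# Invented toy notes for illustration only; NOT real clinical records.
records = [
    "vomiting and diarrhoea for two days",
    "acute vomiting, diarrhoea after scavenging",
    "pruritus and erythema on ventral abdomen",
    "pruritus with erythema, ear scratching",
]
# Assumed cluster labels; in practice these would come from clustering
# embeddings of the records produced by a language model.
cluster_ids = [0, 0, 1, 1]

# Tiny hypothetical stop-word list, enough for the toy notes above.
STOPWORDS = {"and", "for", "on", "with", "after", "the", "a", "of"}


def tokenize(text):
    """Lower-case the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def top_terms_per_cluster(records, cluster_ids, n_terms=2):
    """Rank the terms of each cluster by a class-based TF-IDF score."""
    # Treat every cluster as one merged bag of words.
    cluster_counts = {}
    for text, cid in zip(records, cluster_ids):
        cluster_counts.setdefault(cid, Counter()).update(tokenize(text))

    # Frequency of each term across all clusters.
    total = Counter()
    for counts in cluster_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(cluster_counts)

    result = {}
    for cid, counts in cluster_counts.items():
        size = sum(counts.values())
        # Frequent in this cluster and rare overall gives a high score.
        scores = {
            term: (freq / size) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[cid] = ranked[:n_terms]
    return result


print(top_terms_per_cluster(records, cluster_ids))
```

With these toy notes, the top terms are "diarrhoea" and "vomiting" for the first cluster and "erythema" and "pruritus" for the second, the kind of summary a researcher can scan to decide whether a cluster is clinically interesting.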

Increasingly, we are exploring how to use generative language models (similar to ChatGPT, Gemini and Claude). These generate text from a prompt (e.g. “What symptoms are described in the following text?”). They use much more energy than the BERT models, and the online versions would require transmitting clinical data outside the University network, which raises data governance concerns. Consequently, we are exploring locally installed, smaller versions of the models and energy-efficient computer architectures to run them.

Links to our research:

  • ChatGPT - mining of clinical records for obesity. Vet Record.
  • Topic modelling - unsupervised identification of a disease outbreak in dogs. PLOS ONE.
  • Language models - PetBERT. Automated ICD-11 syndrome disease classification. Scientific Reports.
  • Part 1 - text mining review - from lexical structures to pattern recognition. Front. Vet. Sci.
  • Part 2 - text mining review - training computers to identify features in clinical text. Front. Vet. Sci.
  • Predictors of outcome - explainable text-tabular models for predicting mortality. Scientific Reports.
  • Topic modelling - comprehensive representation of health in 1,000,000 dogs. Preprint.
