Large language models for clinical text analysis

We're developing software to analyse over 10 million clinical free-text notes, captured by the Small Animal Veterinary Surveillance Network (SAVSNET), in collaboration with Durham University and The University of Manchester.

At the heart of this project are large language models (LLMs), huge neural networks trained to understand text. Our main focus is a group of smaller LLMs known as BERT (Bidirectional Encoder Representations from Transformers).

We've adapted BERT models through additional training on veterinary clinical text, creating PetBERT, DogBERT and EquineBERT. We can then fine-tune these models on labelled records to create classifiers and extractors. For example:

Classifiers: language models that identify features of the text that indicate whether a syndrome is present.
Entity extractors: language models that identify types of objects, such as drugs or diagnoses.

These classifiers and extractors use supervised learning, requiring manually annotated records for training. However, we can leverage thousands of record labels, applied by undergraduate and postgraduate researchers, using our in-house record analysis system called SAVSNET Datalab.

Additionally, we can use these models to cluster records by their semantic content and identify which terms are most important in each cluster. This is called unsupervised topic modelling, and it has the potential to identify unexpected or novel combinations of features.
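
As a rough illustration of the second half of that process, the sketch below ranks the terms in each cluster once records have already been grouped. It assumes the cluster labels come from an earlier step (for example, clustering model embeddings), uses a simple class-based TF-IDF score chosen here for illustration rather than taken from the project, and runs on invented toy records. The function name top_terms_per_cluster is hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def top_terms_per_cluster(records, labels, n_terms=3):
    """Rank terms per cluster with a simple class-based TF-IDF score.

    Illustrative only: `labels` is assumed to come from an earlier
    clustering step, which is not shown here.
    """
    # Pool the tokens of every record assigned to the same cluster.
    cluster_counts = defaultdict(Counter)
    for text, label in zip(records, labels):
        cluster_counts[label].update(re.findall(r"[a-z]+", text.lower()))

    n_clusters = len(cluster_counts)

    # Count the number of clusters in which each term appears at least once.
    cluster_freq = Counter()
    for counts in cluster_counts.values():
        cluster_freq.update(counts.keys())

    top = {}
    for label, counts in cluster_counts.items():
        total = sum(counts.values())
        # Frequent within the cluster, rare across clusters => high score.
        scores = {
            term: (count / total) * math.log(1 + n_clusters / cluster_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        top[label] = ranked[:n_terms]
    return top


# Invented toy records and cluster labels, for illustration only.
records = [
    "vomiting and diarrhoea since yesterday",
    "vomiting overnight, diarrhoea today",
    "pruritus and erythema on ventral abdomen",
    "pruritus, erythema around ears",
]
labels = [0, 0, 1, 1]

top = top_terms_per_cluster(records, labels, n_terms=2)
for label, terms in sorted(top.items()):
    print(label, terms)
```

Terms that are frequent within one cluster but absent from the others score highest, so a word such as "and", which appears in both toy clusters, is ranked below the clinical terms.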

Increasingly, we are exploring how to use generative language models (similar to ChatGPT, Gemini and Claude). These can generate text from a prompt (for example, “what symptoms are described in the following text?”). However, these models use much more energy, and the online versions require transmitting clinical data outside of the University network, raising data governance concerns. Consequently, we are exploring the use of smaller, locally installed versions of the models, alongside energy-efficient computer architectures.

The visualisation below is based on topic modelling of 1 million canine consultations and highlights how breeds can vary in their disease susceptibility.
