Large language models for clinical text analysis
We're developing software to analyse over 10 million clinical free-text notes captured by the Small Animal Veterinary Surveillance Network (SAVSNET), in collaboration with Durham University and The University of Manchester.
At the heart of this project are large language models (LLMs), huge neural networks trained to understand text. Our main focus is on a family of smaller LLMs known as Bidirectional Encoder Representations from Transformers (BERT).
We've adapted BERT models through additional training on veterinary clinical text, creating PetBERT, DogBERT and EquineBERT. We can then fine-tune these models on labelled records to create classifiers and extractors. For example:
- Classifiers, language models that identify features of the text indicating whether or not a syndrome is present.
- Entity extractors, language models that identify types of objects such as drugs or diagnoses.
These classifiers and extractors use supervised learning, which requires manually annotated records for training. For this, we can draw on thousands of record labels applied by undergraduate and postgraduate researchers using our in-house record analysis system, SAVSNET Datalab.
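To make the idea of an entity extractor more concrete, the sketch below shows one common way the output of such a model can be turned into usable entities. It assumes the model assigns each token a tag in the widely used BIO scheme (B = beginning of an entity, I = inside, O = outside). The tag names `DIAGNOSIS` and `DRUG`, the function `decode_bio` and the example sentence are all hypothetical illustrations and are not taken from the SAVSNET models.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group per-token BIO tags into (entity_type, entity_text) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    entities: List[Tuple[str, str]] = []
    current_type = None          # type of the entity being built, if any
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new entity starts here (a stray I- tag is treated as a start).
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label, [token]
        elif prefix == "I":
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical model output for a short, made-up note.
example_tokens = ["otitis", "externa", "treated", "with", "prednisolone"]
example_tags = ["B-DIAGNOSIS", "I-DIAGNOSIS", "O", "O", "B-DRUG"]
print(decode_bio(example_tokens, example_tags))
```

In practice the tags would come from a fine-tuned token-classification model rather than being written by hand; the decoding step is the same either way.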
Additionally, we can use these models to cluster records by their semantic content and identify which terms are most important in each cluster. This is called unsupervised topic modelling, which has the potential to identify unexpected or novel combinations of features.
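The second half of that process, finding the most important terms in each cluster, can be sketched as follows. This is a minimal illustration and not the project's actual method: it assumes the records have already been embedded and clustered upstream, so each note arrives with a cluster label, and it scores terms with a simple class-based TF-IDF style weighting. The function name `top_terms_per_cluster`, the weighting formula and the example notes are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def top_terms_per_cluster(
    notes: List[str], labels: List[int], k: int = 3
) -> Dict[int, List[str]]:
    """Return the k highest-scoring terms for each cluster label."""
    # Pool the word counts of all notes that share a cluster label.
    cluster_counts: Dict[int, Counter] = {}
    for note, label in zip(notes, labels):
        tokens = re.findall(r"[a-z]+", note.lower())
        cluster_counts.setdefault(label, Counter()).update(tokens)

    n_clusters = len(cluster_counts)

    # For each term, count how many clusters it appears in.
    cluster_freq: Counter = Counter()
    for counts in cluster_counts.values():
        cluster_freq.update(counts.keys())

    top: Dict[int, List[str]] = {}
    for label, counts in cluster_counts.items():
        total = sum(counts.values())
        # Frequent within this cluster, rare across clusters => high score.
        scores = {
            term: (count / total) * math.log(1 + n_clusters / cluster_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        top[label] = ranked[:k]
    return top


# Made-up notes with cluster labels assumed to come from an earlier step.
example_notes = [
    "vomiting and diarrhoea since yesterday",
    "diarrhoea with vomiting overnight",
    "itchy ears and skin",
    "skin itchy around ears",
]
example_labels = [0, 0, 1, 1]
print(top_terms_per_cluster(example_notes, example_labels, k=2))
```

Words that appear in every cluster, such as "and" here, receive a lower weight than words concentrated in one cluster, which is what lets the top terms act as a rough description of each topic.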
Increasingly, we are exploring how to use generative language models (similar to ChatGPT, Gemini and Claude). These can generate text from a prompt (for example, “what symptoms are described in the following text?”). However, these models use much more energy than the smaller BERT models, and the online versions require transmitting clinical data outside of the University network, raising data governance concerns. Consequently, we are exploring the use of smaller, locally installed versions of the models, alongside energy-efficient computer architectures.
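As a rough illustration of prompting, the sketch below wraps a clinical note in the example question and passes it to a text-generation function. The `generate` argument is a placeholder for whichever locally installed model is used; no particular library or model is implied, and `build_prompt`, `ask_model` and `fake_generate` are hypothetical names introduced here.

```python
from typing import Callable

# Prompt template built around the example question; the layout is an assumption.
PROMPT_TEMPLATE = (
    "What symptoms are described in the following text?\n\n"
    "Text: {note}\n"
    "Answer:"
)


def build_prompt(note: str) -> str:
    """Insert a clinical note into the prompt template."""
    return PROMPT_TEMPLATE.format(note=note.strip())


def ask_model(note: str, generate: Callable[[str], str]) -> str:
    """Send the prompt to a caller-supplied generation function."""
    return generate(build_prompt(note)).strip()


def fake_generate(prompt: str) -> str:
    """Stand-in for a local model, used only so the sketch runs end to end."""
    return " vomiting, lethargy "


print(build_prompt("Vomiting x3 overnight, lethargic this morning."))
print(ask_model("Vomiting x3 overnight, lethargic this morning.", fake_generate))
```

Keeping the model behind a plain function argument means the note never has to leave the machine running the code, provided the function supplied calls a local model.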
Related articles
- Comprehensive representation of health-related phenotypes in one million dogs using topic modelling of electronic health records
- Evaluating ChatGPT text mining of clinical records for companion animal obesity monitoring
- Using topic modelling for unsupervised annotation of electronic health records to identify an outbreak of disease in UK dogs
- PetBERT: automated ICD-11 syndromic disease coding for outbreak detection in first opinion veterinary electronic health records
- Text mining for disease surveillance in veterinary clinical data: part one, the language of veterinary clinical records and searching for words
- Text mining for disease surveillance in veterinary clinical data: part two, training computers to identify features in clinical text
- Explainable text-tabular models for predicting mortality risk in companion animals