Student Stories: PhD student Matthew Carter shares how he's utilising novel data streams and machine learning to help in the fight against COVID-19. | Stories | Centre for Doctoral Training in Distributed Algorithms

Read Matthew's blog to find out about a day in the life with the CDT in Distributed Algorithms.

"The outbreak of the Coronavirus Disease (COVID-19) has caused widespread disruption to societies and economies around the world. Conventional data streams, such as the daily reports by the Office for National Statistics, are reliable but provide limited insight into the pandemic. Our team of researchers, led by Professor Simon Maskell, is developing systems that use novel data streams to provide deeper insights into the pandemic as it unfolds.

With social distancing measures in place, a large amount of discourse relating to COVID-19 now takes place on social media platforms such as Twitter. These platforms contain a treasure trove of information that can help us answer questions such as ‘How many people are exhibiting coronavirus symptoms today?’ However, not all information is created equal: these platforms also contain a lot of misinformation, which could potentially cause harm to members of the public.

We developed a system to track and analyse tweets that mention symptoms of COVID-19. The system ‘listens’ for such tweets and passes each one through a machine learning classifier, which identifies whether the tweet relates to the user’s own symptoms, relates to someone else’s symptoms, or contains misinformation.
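As a rough illustration of this two-stage flow (filter, then label), here is a minimal Python sketch. It is not the team's system: the symptom keywords are hypothetical, and the `toy_label` function is a crude rule-based stand-in for the trained machine learning classifier, which this post does not describe in detail.

```python
import re

# Hypothetical symptom keywords; the real system's list is not given in this post.
SYMPTOM_PATTERN = re.compile(
    r"\b(cough|fever|sore throat|loss of smell|loss of taste)\b",
    re.IGNORECASE,
)


def mentions_symptom(text):
    """Stage 1: keep only tweets that mention a symptom keyword."""
    return SYMPTOM_PATTERN.search(text) is not None


def toy_label(text):
    """Stage 2: rule-based stand-in for the trained classifier (assumption)."""
    lowered = text.lower()
    if re.search(r"\b(cure|cures|hoax)\b", lowered):
        return "misinformation"
    if re.search(r"\bi\b", lowered):
        return "personal"
    return "other_person"


def process(tweets):
    """Filter tweets for symptom mentions, then label each remaining tweet."""
    return [(text, toy_label(text)) for text in tweets if mentions_symptom(text)]


example_tweets = [
    "I have a fever and a dry cough",
    "My mum has a sore throat",
    "Garlic cures a fever",
    "Lovely weather today",
]
labelled = process(example_tweets)
```

In a real pipeline the second stage would be a model trained on labelled tweets; the rules here only show where that model sits in the flow.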

Image: Dashboard used to monitor Twitter data

We can also use geolocation data to calculate the number of users who tweet about symptoms in each region of a given country (where geolocation is permitted by the user). From this data, it is also possible to determine the number of users who travel between different regions of a given country. This information could potentially help to identify new outbreak clusters within a country and provide insight into how members of the public responded to lockdown measures.
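The two counts described here can be sketched in a few lines of Python. This is an illustration under assumed inputs, not the team's code: each record is a hypothetical `(user_id, day, region)` tuple derived from a geotagged tweet, and a "movement" is assumed to mean a change of region between a user's consecutive tweets.

```python
from collections import Counter, defaultdict


def users_per_region(records):
    """Count distinct users who tweeted about symptoms in each region."""
    seen = defaultdict(set)
    for user, _day, region in records:
        seen[region].add(user)
    return {region: len(users) for region, users in seen.items()}


def region_transitions(records):
    """Count moves between regions across each user's consecutive tweets."""
    by_user = defaultdict(list)
    for user, day, region in records:
        by_user[user].append((day, region))
    moves = Counter()
    for visits in by_user.values():
        visits.sort()  # order each user's tweets by day
        for (_, src), (_, dst) in zip(visits, visits[1:]):
            if src != dst:
                moves[(src, dst)] += 1
    return moves


# Hypothetical records: (user_id, day, region)
records = [
    ("a", 1, "North West"),
    ("a", 3, "London"),
    ("b", 1, "London"),
    ("b", 2, "London"),
    ("c", 2, "North West"),
]
counts = users_per_region(records)
moves = region_transitions(records)
```

Counting distinct users rather than tweets avoids one prolific user inflating a region's total.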

To make this information easily accessible, we developed a ‘Symptom Watch’ dashboard, which reports a daily count of the number of tweets that mention symptoms. These counts are currently provided per state in the USA and at various levels (local and upper tier authority, NHS region and national) in the UK. This functionality will be extended to other countries in the near future.

We have also been working with Evergreen Life to analyse data from their health and wellness app. In response to COVID-19, Evergreen Life have been asking app users questions to gain insight into the pandemic. Users are asked to report, for example, if they are isolating or if they or someone in their household has symptoms. The depth and breadth of the data collected are impressive, and the data could help answer a wide range of questions.

Image: Evergreen Life dashboard data display

The team has developed solutions to answer some of these questions, for example: how long, on average, does an individual experience symptoms of COVID-19? User reports to the Evergreen Life app are sporadic, so we do not see a complete timeline of reports for the full period in which an individual is exhibiting symptoms. To deal with this, we defined and fitted a Bayesian model in the ‘Stan’ programming language, which enabled us to determine that users were most likely to experience symptoms for 3.06 days.
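To give a feel for how sporadic reports can still inform a duration estimate, here is a small Python sketch. It is not the Stan model described above, whose details are not given here. It assumes, purely for illustration, that symptom durations follow an exponential distribution, that each user's reports only bracket their true duration between a lower and an upper bound in days, and that a flat prior over a grid of candidate means is acceptable. The data are invented.

```python
import math


def interval_likelihood(mean_days, lower, upper):
    """P(lower <= duration <= upper) for an exponential with the given mean."""
    def cdf(t):
        return 1.0 - math.exp(-t / mean_days)
    return cdf(upper) - cdf(lower)


def posterior_over_mean(intervals, grid):
    """Normalised posterior over candidate means, with a flat prior on the grid."""
    log_post = [
        sum(math.log(interval_likelihood(m, lo, hi)) for lo, hi in intervals)
        for m in grid
    ]
    peak = max(log_post)  # subtract the peak for numerical stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Invented data: each pair brackets one user's symptom duration in days.
intervals = [(1, 4), (2, 6), (0, 3), (3, 8), (1, 5)]
grid = [0.5 + 0.1 * i for i in range(96)]  # candidate means, 0.5 to 10.0 days
posterior = posterior_over_mean(intervals, grid)
map_mean = grid[posterior.index(max(posterior))]  # most probable mean on the grid
```

The key idea is the interval likelihood: instead of needing the exact duration, each user contributes the probability that their duration fell somewhere between the bounds their reports imply.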

Where users report a household member exhibiting symptoms, we can gain insight into how COVID-19 spreads within households by determining the time between two household members falling ill. We also know whether a user is isolating and whether they subsequently develop symptoms. From these reports, we can quantify whether isolating reduces a person's chances of developing coronavirus symptoms. We analysed data collected between March and June this year and determined that individuals who did not isolate were 35% more likely to report symptoms within 7 days of reporting that they were not isolating.
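One common way to express a statement like '35% more likely' is as a relative risk, the ratio of the proportion reporting symptoms in one group to the proportion in the other. The Python sketch below shows that arithmetic under the assumption that this is the intended measure. The counts are invented solely to make the calculation easy to follow and are not taken from the Evergreen Life data.

```python
def relative_risk(cases_a, total_a, cases_b, total_b):
    """Ratio of the symptom-reporting proportion in group A to that in group B."""
    risk_a = cases_a / total_a
    risk_b = cases_b / total_b
    return risk_a / risk_b


# Invented counts, for illustration only:
# group A = not isolating, group B = isolating.
rr = relative_risk(27, 1000, 20, 1000)
percent_increase = (rr - 1.0) * 100.0  # how much more likely group A is, in percent
```

With these made-up counts the ratio is about 1.35, which is what a '35% more likely' statement would correspond to under this interpretation.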

The work we have done so far demonstrates how novel data streams can be utilised to gain a deeper understanding of the COVID-19 pandemic. When combined with more conventional data streams, these novel data streams could aid governments in making more informed decisions to combat the virus."