Overview
This PhD will develop rigorous benchmarks for active learning in molecular design, testing when and why AL methods outperform baselines under realistic data and experimental constraints. Fragment screening and DNA-encoded libraries will serve as contrasting real-world application domains for prospective validation.
About this opportunity
Active learning (AL) is widely promoted as a route to more efficient molecular discovery, yet there is limited consensus on which AL strategies work, under what conditions, and why. Reported successes are often based on retrospective simulations, favourable datasets, or bespoke experimental setups, making it difficult to compare methods or assess their robustness. The ALGeNeM project addresses this gap by developing principled, transparent benchmarks for AL in chemistry, grounded in realistic data regimes and prospective experimental feedback.
Aim.
The central aim of this PhD is to systematically benchmark active learning strategies for molecular design, quantifying their benefits and failure modes across different data modalities, objectives, and uncertainty regimes. Rather than proposing a single new AL algorithm, the project will focus on understanding performance trade-offs between acquisition functions, surrogate models, and representations, and on defining best practices for deploying AL in real discovery settings.
What the candidate will do.
Define benchmarking frameworks: Design evaluation protocols for AL that reflect realistic constraints (small data, noisy measurements, batch selection, delayed feedback, and model misspecification). Establish strong non-AL baselines (random, diversity-based, greedy optimisation).
Compare AL strategies: Benchmark commonly used acquisition functions (e.g. uncertainty sampling, expected improvement, information-theoretic criteria) across multiple surrogate model classes and molecular representations.
Uncertainty and realism: Investigate how uncertainty estimation quality, distribution shift, and dataset bias impact AL performance, including cases where AL provides no benefit or is actively harmful.
Prospective and semi-prospective validation: Apply the benchmarking framework in two contrasting experimental contexts:
Fragment-based and structure-enabled optimisation, where data are sparse but information-rich.
DNA-encoded library (DEL) or large library screening, where data are abundant but noisy and biased.
These two contexts will be used to stress-test the benchmark's conclusions; they are not the primary focus of the thesis.
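To make the benchmarking idea concrete, a retrospective comparison of acquisition strategies against a random baseline might look roughly like the sketch below. It is illustrative only and is not the project's framework: the synthetic candidate pool, the bootstrap ridge ensemble used as a surrogate, the top-5% recall metric, and all function names and parameter values are assumptions introduced here.

```python
import numpy as np


def make_pool(n=500, d=8, noise=0.3, seed=0):
    """Hypothetical candidate pool: random features and a noisy property."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w = rng.normal(size=d)
    # The sine term makes a linear surrogate mildly misspecified.
    y_true = X @ w + 0.5 * np.sin(3.0 * X[:, 0])
    y_obs = y_true + noise * rng.normal(size=n)  # noisy "measurements"
    return X, y_true, y_obs


def ensemble_predict(X_train, y_train, X_query, rng, n_models=20, alpha=1.0):
    """Bootstrap ensemble of ridge regressions; returns mean and spread."""
    n, d = X_train.shape
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap resample
        A, b = X_train[idx], y_train[idx]
        coef = np.linalg.solve(A.T @ A + alpha * np.eye(d), A.T @ b)
        preds.append(X_query @ coef)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)


def run_campaign(X, y_true, y_obs, strategy,
                 n_init=10, batch=10, rounds=8, seed=0):
    """Simulate one batch-mode campaign; return top-5% recall per round."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labelled = rng.choice(n, size=n_init, replace=False).tolist()
    top = set(np.argsort(y_true)[-max(1, n // 20):].tolist())
    recall = []
    for _ in range(rounds):
        pool = np.setdiff1d(np.arange(n), labelled)
        if strategy == "random":  # non-AL baseline
            scores = rng.random(len(pool))
        else:
            mean, std = ensemble_predict(
                X[labelled], y_obs[labelled], X[pool], rng)
            scores = {
                "greedy": mean,       # exploit the predicted value
                "uncertainty": std,   # explore ensemble disagreement
                "ucb": mean + std,    # simple trade-off between the two
            }[strategy]
        picked = pool[np.argsort(scores)[-batch:]]
        labelled.extend(picked.tolist())
        recall.append(len(top.intersection(labelled)) / len(top))
    return recall


if __name__ == "__main__":
    X, y_true, y_obs = make_pool()
    for strategy in ("random", "greedy", "uncertainty", "ucb"):
        finals = [run_campaign(X, y_true, y_obs, strategy, seed=s)[-1]
                  for s in range(5)]
        print(f"{strategy:12s} final top-5% recall: {np.mean(finals):.2f}")
```

Repeating each campaign over several seeds and reporting the full recall curve, not just a single final number, is one simple way to see whether an acquisition function actually beats the random and greedy baselines on a given dataset, or whether the apparent gain is within run-to-run variation.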
Training and collaboration.
The student will receive interdisciplinary training across the Materials Innovation Factory (MIF) and the Department of Chemistry, covering machine learning for molecular data, statistical decision-making, experimental design, and reproducible research software engineering. Through ALGeNeM collaborations, the student will interact with experimental scientists to understand how algorithmic choices translate into real cost, time, and risk trade-offs.
Project structure.
Year 1: Core training; literature review on AL theory and chemical applications; definition of benchmarking criteria and datasets; implementation of baseline pipelines.
Year 2: Systematic benchmarking of AL strategies across simulated and historical datasets; analysis of uncertainty, bias, and failure modes; first methods/benchmarking publication.
Year 3: Application of the framework to fragment screening and DEL-style datasets with prospective or semi-prospective evaluation; synthesis of general design principles for AL in chemistry.
Final period: Thesis completion, dissemination of open benchmarking tools, and submission of final publications.
This project will produce actionable guidance for the community on when active learning is worth using in molecular discovery, and when it is not, alongside reusable benchmarking infrastructure aligned with ALGeNeM’s broader goals.