Diagnosing the data challenges in cancer research

9 December 2024

Dr Mireia Crispin is an award-winning Assistant Professor in the Department of Oncology at the University of Cambridge with a PhD in Particle Physics. Dr Crispin’s group at the University’s Early Cancer Institute is spearheading a new approach to cancer research and treatment. She and her colleagues are bringing together diverse datasets to decode one of the most fatal and least studied forms of gynaecological cancer: high-grade serous ovarian cancer.

Drawing on information ranging from radiological imaging to biopsy data, patient demographics, treatment history, and tumour DNA markers, Dr Crispin and her team are developing sophisticated AI tools that personalise cancer diagnosis and treatment. These tools allow them to gain a deeper understanding of each patient’s unique cancer and to predict how patients might respond to treatment. “Once these techniques are more mature and have been validated,” Dr Crispin says, “the vision is that women who have been diagnosed with ovarian cancer will be able to find the best possible treatment for them and for their specific cancer.” What used to take hours of painstaking research for radiologists reviewing radiology images and CT scans can now be done at pace with AI – allowing huge amounts of data to be processed and analysed within a matter of minutes. The highly curated cohort of patients Dr Crispin is working with today numbers more than 1,000 women who have been treated for ovarian cancer. She is also analysing data from a clinical trial involving 600 patients – numbers that were unthinkable even a few years ago.

Data roadblocks

However, new challenges are emerging when it comes to accessing the data Dr Crispin needs for her cutting-edge research. The infrastructure for accessing the data doesn’t seem to be keeping pace with the speed of scientific progress. Although clinical health records are digitised for patients in Cambridge, which makes them relatively easy to access in theory, that isn’t always the case for historical pathology records. For retrospective studies, for example, some of the older biopsy and medical images Dr Crispin and her team need have to be retrieved from data storage in Wales and scanned in manually, making data curation a long and arduous process.

The original patient data, which is owned by the NHS, is highly confidential. Data from studies approved for analysis are extracted and stored anonymously on University servers in a secure data environment. Only researchers with valid NHS letters of access and research passports can access the data. Dr Crispin’s team works with a dedicated data manager who liaises with the NHS to make sure that the data is retrieved in a structured way that follows NHS frameworks and protocols assiduously. The data manager, who then also cleans and curates the anonymised data so that it’s suitable for specific research purposes, is paid for by Cancer Research UK (CRUK), which funds Cambridge Cancer Centre infrastructure costs. Any CRUK-funded research also needs to follow the FAIR principles – Findable, Accessible, Interoperable and Reusable – which are designed to ensure fairness, inclusivity and transparency in research.

Embedding these principles in the Centre’s data management enables Dr Crispin’s team to publish enough data for each research project through scientific research papers so that the scientific community can reproduce and verify the results. Although each project varies in terms of publishing protocols, replicability is Dr Crispin’s minimum threshold when it comes to sharing data, which includes fully anonymised patient biomarkers and images. These datasets can be stored in public databases, the University’s own data sharing system (Apollo), or together with the code as part of the software repository.

However, there are several other challenges when it comes to accessing data that hamper Dr Crispin’s work: “The barriers are generally that the data is not set up to be mined on a large scale,” says Dr Crispin. “Both from the point of view of how the data was taken in the first place and also from the point of view of the infrastructure, people and facilities that would enable you to rapidly access those data.”

At Cambridge University Hospitals NHS Foundation Trust, for example, only a limited number of people are permitted access to the highly protected patient data that researchers across the University need for their work, which inevitably leads to long delays. It’s also hard to find data managers who are suitably qualified and willing to work for a university or an NHS trust for a relatively modest salary when they could be earning far more in industry, Dr Crispin says. This challenge is exacerbated by the fact that clinical IT infrastructures are completely separate from University computers, which is not the case at competitor institutions in Europe and the US. This causes two main problems.

Firstly, the NHS IT infrastructure is not set up to deal with data-heavy research in the clinic, which makes returning prognostic models from the lab to the clinic very challenging. While the University has access to world-leading supercomputers, the NHS doesn’t have access to the kind of computational infrastructure that researchers ideally need. Dr Crispin’s colleague, Director of Clinical Integration Dr Sarah Burge, puts it this way: “It’s like building a Formula 1 engine in the lab, and expecting it to fit into a Fiesta chassis in the clinic.”

Secondly, moving data between the two different governances – even with all the ethical approvals in place – is a challenge, involving data transfer agreements that have to be constructed anew each time: “This just doesn’t exist in countries and hospitals where the research and clinical IT domains are under one governance system,” says Dr Burge.

There are also barriers to collaboration between different NHS trusts when it comes to sharing clinical data. “It’s extremely painful, in my experience,” says Dr Crispin. “You effectively have to go to each hospital one by one and set up meetings and deal with completely different governance structures and data management systems … and there’s no guarantee that you’re ever going to get the full data.”

She contrasts this approach to clinical data in the UK with her experience as a postdoctoral researcher in the US, where access to clinical records for strictly defined research purposes was much more straightforward. And in her original area of research, particle physics, data is shared by scientists in a far more collaborative way: “Everyone has access to everything,” she says. “Everything is shared.”

Hope on the horizon

With ambitions to build a new cancer research hospital in Cambridge, however, there are plans in progress to set up a centralised, more synchronised way to query patient datasets both internally and across the region. Dr Crispin also cites a new Electronic Health Record Research and Innovation Database (ERIN) in the pipeline, which will include all patient-level data collected as part of providing patient care at the NHS Trust in Cambridge.

However, there’s far more that could be done at a national level, according to Dr Crispin, to harness the huge potential of AI and data for vital medical research like hers: “I think it should be a priority to have hospitals that are data ready and it should be a priority to have data management and IT teams with specific responsibilities around enabling and facilitating data management for research,” she says. “I think it’s clear to everyone that these powerful technologies could be very, very useful if applied properly in the medical setting. But you will never be able to do it properly unless you have some boots on the ground working within the NHS environment to make it possible. I know the budgets are stretched, but if we want to future-proof all of this, then there needs to be some investment in it.”

Dr Crispin also welcomes the proposal to create a National Data Library in the UK, citing existing resources such as the UK Biobank and databases like The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA), which are used heavily by cancer researchers even though they draw on a small set of data.

“If there was a National Data Library that was fully comprehensive that we could use to test hypotheses or ask new questions, I think it could be huge,” she says. “It would not be unprecedented – OpenSAFELY, led by the University of Oxford, is a great example that was developed during the COVID-19 pandemic. That is the direction we should be moving in.”

Written by Vicky Anning

Read the access to data report