« MissingBigData » project

Julie Josse, Professor of Statistics at the Centre de Mathématiques Appliquées de l'École Polytechnique (CMAP), and Gaël Varoquaux, a researcher in the Parietal team at the Inria Saclay - Île-de-France center, have decided to combine their skills to tackle the problem of missing data and propose new decision support methods.

 

The MissingBigData project was selected by the DATAIA Institute as part of its first call for research projects. How did this collaboration come about? What are the challenges of their interdisciplinary research? Julie and Gaël present MissingBigData.

 

Two subjects, one problem

Julie Josse works with the Traumabase group, which compiles data on more than 15,000 patients admitted for severe trauma, from hospital admission to discharge from intensive care. Severe trauma is the leading cause of death in young people and a major cause of severe disability, and it has a major socio-economic impact. The care of these patients is therefore a real public health issue. The aim of Julie's research is to analyze the data collected by Traumabase to provide decision-support tools for emergency physicians, enabling them, for example, to predict hemorrhagic shock as soon as the patient is taken into care by the SAMU (emergency medical service), so that an appropriate medical team can receive the patient on arrival at the hospital. But Julie faces the problem of missing data: “I use the data to see if I can create models to correctly predict hemorrhagic shock. Except that my data comes from lots of different sources, from several hospitals, which don't necessarily have the same practices.”

Gaël Varoquaux works on medical imaging and its use in epidemiology. In this context, Gaël analyzes large volumes of different types of data (medical imaging, health status, quality of life, etc.) of varying quality. In particular, he uses data collected by UK Biobank, which monitors the health and well-being of 500,000 volunteer participants, with the aim of improving the prevention, diagnosis and treatment of a wide range of serious and life-threatening diseases. Gaël is particularly interested in neuropsychiatry and the risk factors for mental illness (schizophrenia, autism, depression, etc.). Here too, the problem of missing data hinders the development of reliable predictive models.

How can we answer causal questions when we lack data?

Gaël explains: “If we compare people who die in hospital and those who don't, we can conclude that the hospital is very dangerous because so many people die there. This is clearly an error. This selection bias has to be mathematically compensated for. The problem is that we no longer know how to do this when there is missing data, particularly informative data.” Indeed, the omission of a measurement can itself be “informative”, i.e. the fact that a value is missing hides a systematic effect. The MissingBigData project aims to approach the problem from a different angle, proposing new, more powerful models that draw on larger data samples to impute missing values. “To avoid biasing conclusions, we will study multiple imputation and conditions on dependency in the data. Our project aims to reduce health risk factors, in particular by predicting better outcomes and identifying risk factors for undesirable outcomes. We are looking for an operational solution, from methodology to implementation, that integrates the diversity and volume of data [...] by considering several types of missing data.” (extract from the MissingBigData project)
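To make the idea of “informative” missing data concrete, here is a minimal sketch in Python. It uses purely simulated numbers, not Traumabase or UK Biobank data, and the variable names and missingness probabilities are assumptions chosen for illustration only. It compares the average of a measurement when values go missing at random with the average when high values are more likely to go unrecorded.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Compare the observed mean under two hypothetical missingness patterns."""
    rng = random.Random(seed)

    # Hypothetical standardized measurement (assumption: normally distributed).
    values = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Pattern 1: every value has the same 30% chance of being missing,
    # whatever its magnitude (missingness carries no information).
    observed_at_random = [v for v in values if rng.random() > 0.3]

    # Pattern 2: informative missingness. Values above zero are missing
    # 60% of the time, values below zero only 10% of the time.
    observed_informative = [
        v for v in values if rng.random() > (0.6 if v > 0 else 0.1)
    ]

    return (
        statistics.mean(values),
        statistics.mean(observed_at_random),
        statistics.mean(observed_informative),
    )


full_mean, mcar_mean, informative_mean = simulate()
print(f"mean of all values:             {full_mean:.3f}")
print(f"mean, values missing at random: {mcar_mean:.3f}")
print(f"mean, informative missingness:  {informative_mean:.3f}")
```

In this toy setting, the average computed from values missing at random stays close to the true average, whereas the average computed under informative missingness is pulled clearly downward. Simply ignoring the missing entries is therefore not enough in the second case, which is the kind of bias the project sets out to correct.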

Applications in the health sector, but not only

The aim of these two researchers is to produce a generic model and methods that can be applied in fields other than healthcare. “To add value to our work, we will be developing software that will be made available to the community. Our research problem is motivated by the application, for educational purposes, that everyone will be able to replicate,” Gaël points out.

Complementary skills

The interdisciplinarity of this team will enable a PhD student funded by the DATAIA Institute to share two team cultures, make presentations to different audiences, and communicate with people who speak different languages: mathematicians at the École Polytechnique and machine learning computer scientists at Inria. “The communities find it hard to understand each other, even though we have the same problems and complementary tools,” notes Julie. This call for projects will enable these communities to move forward with a common goal: reusability and the transfer of best practices for participatory science. To support Julie and Gaël, the MissingBigData team will be made up of Nicolas Prost, a PhD student; an engineer currently being recruited; Erwan Scornet, lecturer in the mathematics department at the École Polytechnique and head of the AI Master's program; Alexandre Gramfort, researcher at the Inria Saclay - Île-de-France center; and Balázs Kégl, researcher at the CNRS and head of the Center for Data Science Paris-Saclay.


Contacts: Gaël Varoquaux | Julie Josse