The «MissingBigData» project
Julie Josse, Professor of Statistics at the Centre de mathématiques appliquées de l'École polytechnique (CMAP), and Gaël Varoquaux, researcher in the Parietal team at the Inria Saclay Centre - Île-de-France, have decided to combine their skills to tackle the problems of missing data and propose new methods to support decision-making. The MissingBigData project was selected by the DATAIA Institute as part of its first call for research projects. How did this collaboration come about? What are the challenges of their interdisciplinary research? Julie and Gaël introduce us to MissingBigData.
Two subjects facing the same problem
Julie Josse works with the Traumabase group, which collects data on more than 15,000 patients admitted for severe trauma, from their initial hospital care to their discharge from intensive care. Severe trauma is the leading cause of death in young people and a major cause of severe disability. Its socio-economic impact is considerable, and the management of these patients is therefore a real public health issue. The objective of Julie's research is to analyze the data collected by Traumabase in order to provide decision-support tools to emergency physicians, for example to predict hemorrhagic shock as soon as the patient is taken in charge by the SAMU (the French emergency medical service), so that an appropriate medical team is ready when the patient arrives at the hospital. But Julie is faced with a problem of missing data: "From the data, I look to see if I can create models to correctly predict hemorrhagic shock. Except that my data come from many different sources, from several hospitals, which do not necessarily have the same practices."
For his part, Gaël Varoquaux works on medical imaging and its use, particularly in epidemiology. In this context, Gaël analyzes large volumes of data of different types (medical imaging, health status, quality of life, etc.) whose quality is not uniform. In particular, he uses data collected by UK Biobank, which monitors the health and well-being of 500,000 volunteer participants, to improve the prevention, diagnosis and treatment of a wide range of serious and life-threatening diseases. Gaël is particularly interested in neuropsychiatry and the risk factors for mental illness (schizophrenia, autism, depression, etc.). Here too, missing data hinder the development of reliable predictive models.
How can causal questions be answered when data are missing?
Gaël explains: "If we compare the people who die in the hospital with those who do not die in the hospital, we can conclude that the hospital is very dangerous because there are many people who die there. We realize that this is a mistake. This selection bias must be mathematically compensated for. The problem is that we no longer know how to do this when there are missing data, especially informative ones." Indeed, the absence of a measurement can be "informative", i.e. it can hide a systematic effect. The MissingBigData project aims to approach the problem from a different angle and to propose new, more powerful models that draw on larger data samples to impute missing values. "To avoid biasing the conclusions, we will study multiple imputation and conditions on dependence in the data. Our project aims to reduce health risk factors by predicting better outcomes and identifying risk factors for adverse outcomes. We are looking for an operational solution, from methodology to implementation, that integrates the diversity and volume of data [...] by considering several types of missing data." (from the MissingBigData project)
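To make the notion of "informative" missingness concrete, here is a minimal sketch on simulated data. It is an illustration only, not the project's method: the variable, the 30% missing rate and the logistic missingness mechanism are all assumptions chosen for the example. It compares the mean of a measurement when values are dropped completely at random with the mean when the chance of a value being missing depends on the value itself.

```python
import numpy as np


def simulate_informative_missingness(n=100_000, seed=0):
    """Compare complete-case means under two hypothetical missingness mechanisms."""
    rng = np.random.default_rng(seed)

    # Hypothetical standardized measurement (e.g. a clinical score), true mean 0.
    x = rng.normal(loc=0.0, scale=1.0, size=n)

    # Mechanism 1 (assumed): each value has the same 30% chance of being missing,
    # regardless of its magnitude.
    missing_at_random = rng.random(n) < 0.3

    # Mechanism 2 (assumed): the higher the value, the more likely it is missing.
    # The absence of the measurement is then "informative".
    p_missing = 1.0 / (1.0 + np.exp(-2.0 * x))
    missing_informative = rng.random(n) < p_missing

    return {
        "true_mean": float(x.mean()),
        "mean_random_missing": float(x[~missing_at_random].mean()),
        "mean_informative_missing": float(x[~missing_informative].mean()),
    }


result = simulate_informative_missingness()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this simulation, the mean computed on the observed values stays close to the true mean when values are missing at random, but is pulled clearly downward when high values are preferentially missing. Simply discarding incomplete records would therefore bias the conclusion in the second case, which is the kind of situation the project sets out to address.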
Applications in the health sector, but not only
The objective of these two researchers is to produce generic models and methods that are applicable in fields other than health. "To add value to our work, we will make the software we develop available to the community. Our research problem is motivated by the application and, for educational purposes, it is one that everyone can replicate," emphasizes Gaël.
The interdisciplinary nature of this team will allow a PhD student funded by the DATAIA Institute to share the cultures of two teams, to present to different audiences, and to communicate with people who speak different languages: mathematicians from the École polytechnique and computer scientists working on machine learning at Inria. "Communities have difficulty understanding each other even though we have the same problems and complementary tools," notes Julie. This call for projects will allow these communities to move forward with a common goal: reusability and the transfer of good practices for participatory science. To support Julie and Gaël, the MissingBigData team will be composed of Nicolas Prost, PhD student; an engineer whose recruitment is in progress; Erwan Scornet, lecturer at the Mathematics Department of the École polytechnique and head of the Master IA; Alexandre Gramfort, researcher at the Inria Saclay - Île-de-France centre; and Balázs Kégl, researcher at the CNRS and head of the Center for Data Science Paris-Saclay.