GenoMed4All's Women In Science - A chat with Marilena Bicchieri

On the occasion of International Women's Day and the ongoing #WomenInScience campaign, we want to celebrate the amazing women of GenoMed4All.
And that's why we sat down with our colleague Marilena Bicchieri, GenoMed4All's Scientific Coordinator and Healthcare Project Manager at Humanitas Research Hospital. Here's what she had to say about her role in our project, together with her experience building a successful and meaningful career and navigating the highs and lows of STEM as a woman.

What is your role as a scientific coordinator: from both a personal and professional point of view?

As scientific coordinator, my role is both stimulating and challenging. From a professional perspective, I am responsible for keeping up-to-date with the progress of the project and identifying any gaps or needs of each involved partner to ensure the project moves forward smoothly and efficiently. I serve as a bridge between the technical experts and the clinicians, who often have different perspectives. Therefore, I must be able to understand both points of view and effectively communicate shared information.

From a personal perspective, I understand the importance of constantly learning and improving to excel in my role. By continually enhancing my knowledge and skills, I can provide valuable feedback, advice, and insights to my colleagues, helping to steer the project in the right direction. I am also driven by my ambition to achieve outstanding results, which motivates me to stay proactive and engaged every day.

What is your experience in GenoMed4All? (As part of a team? Your vision of the project as a whole?)

GenoMed4All is an ambitious project that requires a high level of expertise and coordination across a diverse group of partners. As a member of the team, I feel privileged to be part of such a dedicated and talented group of individuals. The consortium is composed of experts from various fields, including researchers, clinicians, technical experts, and industry partners, who all bring unique perspectives and skills to the table.

Working together as a team is essential to the success of the project, and I believe that everyone is committed to this shared goal. While it can be challenging to stay on the same track, the sense of unity and purpose within the team makes the difference. I am continually impressed by the dedication and professionalism of everyone involved in the project.

I believe that GenoMed4All has the potential to be a game-changer in the field of personalized medicine, thanks to the exploitation of -omics data, and will provide one of the first federated platform implementations in the healthcare sector. By leveraging the latest advances in AI technology and applying them to the study of haematological diseases, we can gain new insights into diagnosis and prognosis while developing more effective treatments, with the ultimate goal of improving patient outcomes and quality of life. This is incredibly motivating for everyone involved, and I am excited to be part of this initiative and look forward to seeing the impact it will have on the fields of haematology and technology.

Could you share some of the main challenges (and highlights!) you have experienced in your career?

As a woman in STEM, I feel incredibly fortunate to have had a positive, supportive and respectful environment throughout my career. However, it's important to consider that being a woman in a male-oriented society can come with additional challenges, such as breaking stereotypes and biases.

I've sometimes had to work hard to prove myself which, as a positive effect, has helped me grow stronger and more resilient. Indeed, it is important to focus on the rewards that come with overcoming these challenges rather than feeling defeated. Many steps forward have been made in our society, and what I have experienced is the result of many years of battles for women's emancipation. I still believe that pursuing a career in STEM can be demanding and require a great deal of responsibility, but with the right approach and commitment, it can also be incredibly rewarding.

To succeed in STEM, it is essential to be motivated. Motivation will lead you to be open to learning, to take on new challenges and, importantly, to be willing to take risks. Self-improvement is crucial, as it allows you to grow and develop your skills and expertise. Lastly, setting clear goals and working towards them is necessary to stay focused and driven.

What would be your inspiring words to encourage girls and women to pursue the STEM path?

One important piece of advice I would give to girls and women committed to a STEM career is to be brave. Don't be afraid to take on new challenges, even if they seem daunting at first, and never be afraid to drive important decisions. Surround yourself with supportive colleagues and mentors who can help guide you along the way. And remember that the work you do in STEM can have a significant impact on the world around you: by pursuing a career in STEM, you have the opportunity to make a real difference and drive important advancements in science and technology.

I encourage all girls and women interested in STEM to follow their passions and believe in themselves. By doing so, we can continue to break down the barriers that still exist in the field and pave the way for a more diverse and inclusive STEM community, especially in the most impactful apical roles, where women are still significantly underrepresented.

The world needs more female scientists, engineers and innovators at the top of society, and I am confident that the next generation of girls will continue to make important contributions to the field.


Interview courtesy of Marilena Bicchieri, PhD - Healthcare Project Manager at Humanitas Research Hospital


A conversation on Federated Learning - Part 2

Welcome to Part 2 in our miniseries on Federated Learning!

You can find all the details of our conversation so far in Part 1. In the first installment, we traveled through the continuum of machine data, learned about the different flavors of Federated Learning and pondered on its added value in healthcare. Ultimately, we were left with a question: what is holding back the adoption of Federated Learning applications?

Now it is time to address the still untapped potential of Federated Learning.


How to build a Federated Learning framework?

The challenges ahead 

At first glance, we might be quick to assume that a cross-silo approach to Federated Learning is the easiest path to take implementation-wise: after all, we are dealing with a limited number of well-known, addressable edge systems, which are more powerful and reliable overall. However, this apparent ‘simplicity’ conceals another wide spectrum of issues to account for, whether from a business, data integration, security or platform perspective. Let’s dive right in.

Business challenges 

In a Federated Learning network, there is a risk that edge nodes may behave ‘selfishly’ in order to compromise between model accuracy and cost [1]. This delicate balance of risk and reward is intimately tied to the governance of the network itself and has many implications for what is commonly known as ‘health justice’. In the context of GenoMed4All, this theme crystallizes into how we define and enforce ‘equity’ among nodes in the network, and into our capability to properly adjust for discrepancies in overall performance and model accuracy when onboarding a ‘dissonant’ node. Anticipating these ‘dissonances’ in an FL network is key, since participants may not be evenly matched in terms of the resources –human and material alike– they are able to commit to this joint enterprise. On this point, the research community has already dedicated quite a lot of effort to finding out how we can maximize the benefit for each node with limited engagement: the answer seems to lie in the way we estimate both the motivation and the contribution of our network nodes.

On the topic of motivation, we may ask ourselves: how do I reward participation for each edge node in a way that ensures the central server can maintain optimal quality? Putting in place incentive mechanisms that work for all participants is key, especially in such a highly heterogeneous environment. These incentives or rewards may take multiple forms: access to specific central services, benefitting from models without contributing to their training, the opportunity to launch a new training plan… you name it. For the estimation of a node’s contribution, however, a reward can only be fixed if the ‘value’ each node brings to the network can be adequately quantified, and this is not a straightforward exercise: it has to consider both dataset size and quality, and the computation then needs to be correlated to the accuracy of the final model and updated with each training iteration [2].
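To make the idea of quantifying a node's 'value' concrete, here is a deliberately simplified sketch: each node's contribution is scored from its dataset size and an illustrative quality score, then normalized into shares that could drive rewards or aggregation weights. The field names and the size-times-quality formula are our own illustrative assumptions, not the project's actual contribution metric.

```python
# Hypothetical sketch: score each node's contribution from dataset
# size and a quality score (0-1), then normalize into shares.
def contribution_weights(nodes):
    """nodes: list of dicts with 'size' (samples) and 'quality' (0-1)."""
    raw = [n["size"] * n["quality"] for n in nodes]
    total = sum(raw)
    return [r / total for r in raw]

weights = contribution_weights([
    {"size": 1000, "quality": 0.9},   # large, clean dataset
    {"size": 500,  "quality": 0.5},   # smaller, noisier dataset
])
print(weights)  # the first node earns the larger share
```

In a real network, these shares would also be correlated with the final model's accuracy and re-estimated at every training iteration, as noted above.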

Data integration challenges 

When considering data usage, we must be mindful of how to onboard organizations operating across multiple geographic, political and regulatory scenarios –especially those dictating data protection regulations– onto this FL network. The first barrier we must be aware of in terms of data integration is the minimum anonymized dataset that needs to be shared for the initial FL model to be correctly tested, developed and bootstrapped. Another significant roadblock is the set of access policies that govern dataset extraction at the edge and define what is and is not allowed in terms of data science operations on metadata and model alike. As of today, there is marked interest in how to strike the right integration between the authorization policy language used to encode these access policies and the technology required to enforce them.

Additionally, there is the ever-present matter of data quality, which the distributed nature of Federated Learning only aggravates [3]. From the qualifying and onboarding phases to integration to monitoring, it permeates the whole FL lifecycle. Well before onboarding new edge nodes –another hospital, for example– to our network, we should have in place a clear, auditable set of qualifying criteria (e.g. incentive model, hosting capabilities, training resources, available datasets…) that potential candidates are expected to meet in order to officially become nodes. This pre-selection step, though critical to the whole network performance, does not usually get the recognition it deserves, due to either monetary or time constraints.

Immediately after, during onboarding itself, data quality must be assessed again. It also comes into play when adopting and integrating a Common Data Model (CDM), since training algorithms with datasets from heterogeneous sources –like Electronic Health Records (EHRs)– has a negative impact on network maintenance and scalability, which can only be mitigated by enforcing a single CDM for the central and edge nodes [4]. For GenoMed4All, our CDM pick is FHIR (Fast Healthcare Interoperability Resources), a standard that defines how healthcare data may be exchanged between nodes regardless of how it is actually stored in those nodes. Compared to other standard alternatives, FHIR shows large (and growing) adoption rates among care providers and has sufficient support for genomic data representation, two key and decisive arguments in the context of GenoMed4All. However, the healthcare industry seems to be slowly but surely edging towards more fluid scenarios that favor the co-existence of a wide variety of CDM standards – for instance, the emerging openEHR standard. This trend would be especially relevant in federated ecosystems like GenoMed4All’s, since they intend to amalgamate an ever-growing, wildly heterogeneous landscape of hospitals under a unique distributed umbrella.
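To give a feel for what exchanging data through a CDM like FHIR looks like in practice, here is a minimal pair of FHIR (R4) resources expressed as Python dictionaries: a Patient and a haemoglobin Observation coded with LOINC. The identifiers and values are purely illustrative; real resources carry many more fields.

```python
# A minimal FHIR (R4) Patient resource as a Python dict.
# All identifiers and values below are illustrative only.
patient = {
    "resourceType": "Patient",
    "id": "example-patient",
    "gender": "female",
    "birthDate": "1970-01-01",
}

# A haemoglobin measurement as a FHIR Observation, coded with LOINC,
# referencing the patient above. However the hospital stores this
# value internally, it is exchanged in this standard shape.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org",
                         "code": "718-7",
                         "display": "Hemoglobin [Mass/volume] in Blood"}]},
    "subject": {"reference": "Patient/example-patient"},
    "valueQuantity": {"value": 13.2, "unit": "g/dL"},
}

print(observation["subject"]["reference"])  # Patient/example-patient
```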

Monitoring data quality during training is also tricky: every time datasets are added to the network, we need to evaluate whether they are really up to standard, mainly to avoid entering a new training loop that ultimately pushes back an updated but poorer-quality model to the central server.
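One way to picture this safeguard is a simple acceptance gate on each incoming model update: the central server compares the candidate's validation metric against the current global model and rejects clear regressions. The function name, metric and tolerance below are illustrative assumptions, not part of any specific FL platform.

```python
# Hypothetical sketch: gate a node's model update before aggregation
# by comparing its validation metric against the current global model,
# so a poor-quality dataset cannot push back a worse global model.
def accept_update(update_metric, global_metric, tolerance=0.02):
    """Accept only if the update does not degrade validation accuracy
    by more than `tolerance` (threshold is an arbitrary example)."""
    return update_metric >= global_metric - tolerance

print(accept_update(0.84, 0.85))  # within tolerance -> True
print(accept_update(0.70, 0.85))  # clear regression -> False
```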

On top of this sizeable pile of issues lies the inescapable fact that we are operating in the healthcare realm, where challenges in data integration are always multi-faceted. New social determinants –linked to decision support, care pathways, medication…– and unconventional sources of information –social media, the Internet of Things– have started to permeate the way we look at and make sense of healthcare processes… and in turn, this heightened understanding has exposed a pressing need to outline and regulate data subject rights. As a result, the concept of ‘digital sovereignty’ has been coined to protect the individual’s right to autonomy in a predominantly digital world, and the EU has embraced this notion as the cornerstone of its strategy to usher in a new era of European digital leadership centered around ensuring citizens retain control over their personal data.

Security challenges 

In a cross-silo FL scenario, one of the most pressing issues currently under the spotlight is linked to data and client system security, or how to prevent information leaks during the multiple update iterations. Even if no data is exchanged between edge nodes and the central server, the model may still contain some patient-sensitive information in its parameters. The central server is typically the actor best positioned to exploit this vulnerability, since it centralizes client updates and has more control over the FL process as a whole. Solutions to this problem rely on Secure Multiparty Computation (SMC) to aggregate updates or Differential Privacy (DP) to distort client updates locally. Additionally, we might also need to protect the central server against potential malicious attacks from the edge nodes: those aiming to compromise the convergence of the global model by either disrupting the training process or providing false updates [5]. For GenoMed4All this is a less pressing issue, since all partners participating as clients are considered trusted nodes in the network.
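To illustrate the Differential Privacy route mentioned above, here is a rough sketch of privatizing a client update before it leaves the node: clip the update to a fixed norm, then add Gaussian noise. The clip norm and noise scale are arbitrary example values and are not calibrated to a formal (epsilon, delta) privacy budget, which a real DP mechanism would require.

```python
import random

# Illustrative DP-style client update: clip the parameter deltas to a
# fixed L2 norm, then add Gaussian noise before sending them to the
# central server. Values are examples, not a calibrated privacy budget.
def privatize(update, clip_norm=1.0, noise_sigma=0.1, seed=0):
    rng = random.Random(seed)
    l2 = sum(u * u for u in update) ** 0.5
    scale = min(1.0, clip_norm / l2) if l2 > 0 else 1.0
    clipped = [u * scale for u in update]
    return [u + rng.gauss(0.0, noise_sigma) for u in clipped]

noisy = privatize([0.8, -2.4, 1.1])
print(noisy)  # a distorted update; raw gradients never leave the node
```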

Platform challenges 

The current landscape of Federated Learning platforms –and their features– paints a picture of highly heterogeneous, research-specific and not yet mature alternatives (e.g. Flower, Fed-BioMed, FATE, TensorFlow Federated, PySyft, Paddle) that are emerging as an unequivocal sign of all the excitement and interest surrounding FL. Key platform capabilities like configurability, robustness, scalability, performance and user experience are non-negotiable for GenoMed4All’s ambition. After all, we are working on a production environment that intends to serve an ever-growing community with an increasing number of algorithms and use cases.

But how to make this vision a reality? The problem is, modern workspace software environments are sorely missing a model federation dimension. Nowadays, AI platforms are mature enough to handle everything from data exploration, testing, pre-processing and transformation and feature engineering to model validation and deployment… and yet, they still have not figured out how to support a model federation approach. As a result, we are missing out on several fronts: first, on metadata exploration tools for data scientists to build their models and features on; and second, on workspaces with adequate debug, development and testing capabilities to handle models with longer lifecycles and incremental contributions from edge nodes [6]. The inevitable conclusion? Team productivity and efficiency are greatly impacted.

Another contender for top platform challenge in FL is data extraction. Data scientists follow complex workflows for model development, and data extraction plays a major role in the selection, (cohort) transformation and feature extraction steps. These operations must first be formalized by the platform so they can then be automatically reproduced on the edge nodes. For data scientists, a platform that provides easy-to-use tools to step away from manual configuration before jumping to model deployment is certainly a bonus. That is why we are taking care to integrate a flexible ETL (Extract, Transform, Load) tool –containing data cohort definitions linked to the target model– to configure data extraction and transformation steps from the CDM to the algorithms in GenoMed4All’s platform.
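The idea of formalizing extraction so it can be replayed identically on every edge node can be sketched as a declarative cohort definition plus a feature mapping. The cohort criteria, field names and diagnosis codes below are hypothetical examples, not GenoMed4All's actual ETL configuration.

```python
# Hypothetical sketch of a formalized extraction step: a declarative
# cohort definition (filters) plus a feature mapping, so the same
# configuration can be replayed identically on every edge node.
COHORT = {"min_age": 18, "diagnosis": "MDS"}   # illustrative criteria
FEATURES = ["age", "hemoglobin"]               # illustrative features

def extract(records, cohort=COHORT, features=FEATURES):
    selected = [r for r in records
                if r["age"] >= cohort["min_age"]
                and r["diagnosis"] == cohort["diagnosis"]]
    return [[r[f] for f in features] for r in selected]

rows = extract([
    {"age": 64, "diagnosis": "MDS", "hemoglobin": 9.8},
    {"age": 15, "diagnosis": "MDS", "hemoglobin": 11.0},  # under age
    {"age": 70, "diagnosis": "AML", "hemoglobin": 8.9},   # other cohort
])
print(rows)  # [[64, 9.8]]
```

Because the cohort and feature definitions are plain data rather than hand-written code, the platform can ship them to each node and run the same extraction everywhere without manual configuration.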

All these challenges are represented in the scorecard below, described in the context of GenoMed4All and ranked in order of priority (i.e. we have marked with 3 stars those that we consider to be core challenges in the project).


The GenoMed4All project or why Federated Learning will serve rare disease research 

At GenoMed4All, we are building a Federated Learning platform where clinicians and researchers can work together in the definition, development, testing and validation of AI models to improve the way we currently diagnose and treat hematological diseases in the EU. We envision two complementary operational modes for this platform: a clinical mode, catering to the needs of healthcare professionals and patients in their daily practice; and a research mode, where data scientists can train and benchmark AI models from available data on hematological diseases.

For clinicians, GenoMed4All’s platform will act as a local decision support system to input new prospective and retrospective patient data, extracting insights from an ever-learning model. For researchers, GenoMed4All offers an AI sandbox to benchmark and train new AI models on real-world data and to ensure their clinical usability, a critical point that has so far hampered the real-world integration of AI applications in healthcare.

We believe that a radical shift in how we introduce these kinds of tools to a clinical setting is sorely needed to ensure their accountability, transparency and usefulness among healthcare professionals. Drawing a parallel to how we rely on solid pharmacovigilance processes to monitor adverse reactions and ultimately confirm that a certain drug is safe for use, we can certainly envision a similar clinical validation flow for these tools, one that undergoes the same level of scrutiny and meets the required standards for performance excellence in a clinical setting.


All in all, we have seen that Federated Learning is indeed an emerging technology that is still finding its footing within the research field. The cross-silo approach we have followed does provide a number of unquestionably attractive capabilities for AI applications in the clinical research space, namely those in the data privacy domain. However, several challenges lurk on the horizon… and must be addressed before this approach can finally become mainstream practice in the healthcare industry, so that Federated Learning can effectively deliver on all the promises we have navigated through in this miniseries.

In this research space, GenoMed4All plays a pioneering role as it explores the broad spectrum of issues raised by Federated Learning in healthcare: from platform technology selection and development all the way to defining the full data flow and Common Data Model, security, privacy and an end-to-end operational model. This close collaboration environment, spearheaded by multiple care providers in Europe, leading-edge research institutions and recognized industrial partners (meet our stellar team here!) is our core strength to pave the way forward and deliver on new innovation opportunities.

If you enjoyed this miniseries on Federated Learning, stay tuned for future Knowledge Pills!


Missed anything? Check out these references!


This knowledge pill was created by Vincent Planat (DEDALUS), Francesco Cremonesi (DATAWIZARD) and Diana López (AUSTRALO) from the GenoMed4All consortium
Photo by Milad Fakurian on Unsplash


A conversation on Federated Learning - Part 1

We are back with a new Knowledge Pill... and the spotlight is now on Federated Learning: the backbone of GenoMed4All's Artificial Intelligence framework for hematological diseases, and an emerging technology in the healthcare realm.

What is it exactly? Why does it matter? How does it work? – if any of these questions pique your interest, you are in the right place! Let us explore the ins and outs of Federated Learning and unravel its potential for healthcare in this two-part miniseries.

What is Federated Learning? 

Federated Learning is the latest piece in the ever-growing puzzle of machine data concepts and the underlying patterns that bind them together... and thus, a fundamental step in the evolution of how we look at and understand data. Throughout this journey, each milestone reached becomes a sort of 'breakpoint' in the continuum of change, a new push forward driven by the combined forces of emerging research outcomes, technological prowess and business needs colliding at a particular moment in time. These communities –research, technology and business– are converging, and this massive shift creates new and exciting opportunities for Federated Learning to bloom.

The continuum of machine data evolution 

The idea of data warehouses as a technology first appeared in the 80s and continued to reign supreme through the 90s as the go-to data architecture for big companies. The reason? It offered a way for them to integrate all analytics data into a single central repository, where they could then be optimally accessed, queried and visualized as a whole [1]. Data warehouses kicked off the conversation on data governance and continue to be relevant nowadays. However, they are not a one-size-fits-all solution, requiring highly structured data and usually running on third parties’ proprietary hardware. 

Then Hadoop showed up and changed everything. Now, thanks to the Map-Reduce paradigm [2], companies had free access to an open source solution that was able to process raw data –either unstructured or semi-structured– in a way that data warehouses never could… and the emerging cloud computing technology promised to free them from the shackles of on-premise infrastructure. With all this data suddenly there for the taking, the focus shifted to finding a better way to collect it and store it: the so-called data lake. 

The number of connected devices skyrocketed and so did the amount of raw data available. The Internet of Everything (IoE) took off and soon it became painfully clear that cloud computing could not keep up the pace in this new paradigm of ever-growing data: slow speeds, bandwidth issues and a deficient approach towards data privacy and security were the fuel to the flames of a new, brighter fire: edge computing [3]. This new era ushers in a shift in computing power: from a central node in the cloud to (unsurprisingly) the “edges” of the network –closer to the data sources–, where small-scale storage and data processing can be performed locally. 

Problem solved, right? Well, not really. For years, we have tirelessly gone through every trick in the book on processor chip design to try and keep up with Moore’s law, but engineers seem to have reached the limit in terms of transistor scale and capacity. In this race, traditional general-purpose processor chips –what we normally call CPUs (Central Processing Units)– have given way to specialized accelerators [4] –like GPUs (Graphical Processing Units) and TPUs (Tensor Processing Units)– capable of handling dedicated AI/ML workloads and performing much better under specific size, weight and/or power constraints.

In the highly distributed modern world of Big Data, a centralized approach is not going to cut it anymore: data is everywhere and the idea of accessing, processing and consuming it from a single central place seems lacking. How to scale up then? Let’s move on to the next frontier: federation [5]. 

A federated –or decentralized– query processing system reconciles the need to access data in a unified way with the inherent nature of the data itself: heterogeneous and distributed. These query engines have no control over the datasets and data providers are free to come and go in a federated scheme. Together with improved flexibility, this also introduces an additional level of performance complexity –mostly related to data fragmentation, disparities in access and processing power– that centralized systems are typically exempt from. 

The final leap from here to Federated Learning is once again motivated by the ebbs and flows of business –the juxtaposition of the prevalence of AI models and the supremacy of data, versus the need for numerical and resource frugality, and the real, often bitter implications on both privacy and ethics.

Reflecting on our journey through the research, technological and business dimensions, we can see that Federated Learning is indeed part of an evolution of machine data concepts and patterns that has been going on for quite some time. As such, every single one of these milestones we have briefly touched upon along the way –the ‘breakpoints’ in this continuum– emerged as a result of mixing new, exciting research outcomes with a keen ambition to address pressing business needs at the time. 

All in all, Federated Learning is here to stay… let’s see what it has to offer. 

Federated Learning: an overview 

Unlike standard Machine Learning approaches, the revolution in Federated Learning starts by bringing the model to the data, instead of the other way around. The concept is surprisingly simple: let us train a model on multiple decentralized edge nodes. Each holds its own local dataset, and none of them exchange data. Following this approach, a community is able to create and share a global AI/ML model without centralizing the data in a single location.


Like any other technology, Federated Learning comes with its own peculiarities. At the forefront of these, we find a decentralized learning process over multiple clients interconnected in a network, in which only model parameters and framework control data are exchanged. Moreover, training datasets never leave the edge nodes. Consequently, the model is trained locally and able to preserve data privacy. Every new training process is orchestrated in the same manner: first, we initialize the global model and broadcast it to the edge nodes in our network; next, we monitor each and every one of our training phases; and finally, we aggregate the results before deployment.
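The orchestration pattern just described can be sketched in a few lines of plain Python, in the spirit of federated averaging. The local training step below is a deliberate stand-in (it just nudges weights toward the node's data mean), and the hospital names are invented; a real deployment would use an FL framework, but the broadcast-train-aggregate loop is the same.

```python
# A minimal sketch of one federated training round, with model weights
# as plain lists. Local training here is purely illustrative.
def local_train(weights, local_data):
    # Stand-in for real local training: nudge each weight toward the
    # node's local data mean. Raw data never leaves this function.
    mean = sum(local_data) / len(local_data)
    return [w + 0.1 * (mean - w) for w in weights]

def aggregate(updates, sizes):
    # Weighted average of node updates by local dataset size.
    total = sum(sizes)
    dim = len(updates[0])
    return [sum(u[i] * s for u, s in zip(updates, sizes)) / total
            for i in range(dim)]

global_model = [0.0, 0.0]
node_data = {"hospital_a": [1.0, 2.0, 3.0], "hospital_b": [5.0]}

# 1) broadcast the global model, 2) train locally on each node,
# 3) aggregate the updates back into the global model.
updates = [local_train(global_model, d) for d in node_data.values()]
global_model = aggregate(updates, [len(d) for d in node_data.values()])
print(global_model)
```

Only `updates` (model parameters) travel over the network; the two local datasets stay where they are, which is exactly the privacy property described above.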


Federated Learning comes in different flavors [6]. The most common one is the so-called model-centric approach, in which training is carried out at the edge nodes and a central model is updated through a federation. As an umbrella term, this approach applies to two additional subtypes: cross-device and cross-silo. 

The cross-device alternative is based on horizontal data sharing: training is carried out in a large distributed network of multiple smart devices –as is the case for Google and Android– and thus, should carefully consider the capabilities at the edge in terms of connectivity, limited computing and battery life. In contrast, the cross-silo subtype relies instead on vertical data sharing: organizations (e.g. hospitals, private companies…) or data centers take on the role of edge nodes and share incentives to train a common model, while adhering to strict data security standards and refraining from peer-to-peer data sharing. 


Another flavor in this “FL menu” is the data-centric approach, a peer-to-peer distributed learning formula through which each node in the network graph sends and receives gradient steps from its neighbors in order to reach global model convergence via gradual consensus. 
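As a toy picture of this peer-to-peer consensus process, consider each node holding a single scalar parameter and repeatedly averaging it with its neighbours. The graph and values are invented for illustration; real data-centric FL exchanges gradient steps over full models, but the convergence-by-consensus idea is the same.

```python
# Rough sketch of the data-centric, peer-to-peer flavour: each node
# repeatedly averages its parameter with its neighbours until the
# network converges to a shared consensus value (a single scalar per
# node here for simplicity).
neighbors = {0: [1], 1: [0, 2], 2: [1]}   # a small line graph
values = {0: 0.0, 1: 3.0, 2: 6.0}

for _ in range(200):
    values = {n: (values[n] + sum(values[m] for m in neighbors[n]))
                 / (1 + len(neighbors[n]))
              for n in values}

print(values)  # all nodes approach the consensus value 3.0
```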

And what about GenoMed4All? Considering everything we have told you so far, our project falls neatly into the cross-silo category of Federated Learning, though there is always room for a potential exploration on how to reach a clinical consensus on horizontal data sharing. After all, any and all advances on rare diseases strongly rely on both collaboration and standardization efforts. That said, and for the sake of clarity, from this point onwards we will focus our attention on the cross-silo approach exclusively. 


Why Federated Learning? 

An attractive new concept for healthcare informatics 

It is no secret that systems and processes in the healthcare realm are, more often than not, extremely complex. This complexity directly contributes to a high level of fragmentation that mostly stems from the sensitive nature of healthcare data. Consequently, countless regulations have been developed to dictate the way we access and analyze this kind of data, which typically falls under what is called Protected Health Information (PHI). At their core, these regulatory efforts share the same ambition: ensure that sensitive patient data stays either within local institutions (clinical sites, hospitals) or with the individuals themselves, effectively protecting patient privacy. 

In light of all this, the value proposition of Federated Learning in healthcare boils down to two critical features: scalability and data protection [7]. First up, Federated Learning streamlines the scaling up process for training ML models in multiple edge nodes –think hospitals, clinical sites, research institutions in GenoMed4All– which can theoretically improve the performance of the predictive models, reducing selection bias and facilitating the onboarding of new data providers and sources to the whole training strategy. We are also talking about a framework with built-in re-training and expansion capabilities by design, which greatly simplifies reusability of predictive algorithms. 

Similarly, and as far as data protection goes, Federated Learning is a perfect fit for the healthcare domain [8], since data is not required to leave the clinical data provider’s premises at any point. This has huge implications in a clinical setting and may be key in navigating the current limitations imposed by data protection laws on highly sensitive data –like PHI, for example–, since technically there is no data sharing involved. Coincidentally, this also solves the fragmentation issue by effectively linking multiple health data sources for the purpose of training predictive models.  

All in all, Federated Learning seems to check all the boxes on the healthcare sector wishlist, but real-world applications are still few and hard to come by. According to Gartner (and as of 2021), Federated Learning is in the middle of a climb up the initial slope of the Privacy Hype Cycle, which effectively labels it as an ‘innovation trigger’ in the digital ethics landscape: a technology in its early stages, with some proof-of-concept stories to share and lots of media attention, but no proven commercial viability yet.


So, what exactly is holding us back? Keep an eye out for Part 2!

Missed anything? Check out these references!


This Knowledge Pill was created by Vincent Planat (DEDALUS), Francesco Cremonesi (DATAWIZARD) and Diana López (AUSTRALO) from the GenoMed4All consortium
Photo by Milad Fakurian on Unsplash

Machine Learning in healthcare – A brief introduction

From its undeniable relevance as a buzzword in our day-to-day to its multiple applications in healthcare, Machine Learning is here to stay. But what is it exactly and how does it come into play in GenoMed4All?

Through a series of short knowledge pills, we intend to bridge the gap between technical experts, clinicians and the general public, and walk you through the core concepts, challenges and use cases of Machine Learning techniques in a clear and engaging way.

Curious? Below is a sneak peek of the main ideas we will be tackling in this first knowledge pill. Let’s jump right in!


What is Machine Learning?

There is no point in providing a definition for Machine Learning (ML) without first addressing the wider context it exists in: the world of Artificial Intelligence.

As a concept, Artificial Intelligence (AI) refers to any technique that makes computers capable of mimicking human behaviour to address and solve problems. In the 1980s, one group of AI methods became more common: Machine Learning, which uses statistical techniques to - you guessed it - enable machines to recognize specific patterns by learning and improving through experience on a set of provided data.


And what does 'learning' mean in this context?

After getting a sense of the bigger picture and becoming familiar with all these definitions, the question arises: what is learning from a machine perspective? Surely, it must be something different from the human process of learning, but how so?

A tentative answer to this seemingly straightforward question can be found in the paper An Introduction to Machine Learning for Clinicians:

Machine learning (ML) describes the ability of an algorithm to 'learn' by finding patterns in large datasets. In other words, the 'answers' produced by ML algorithms are inferences made from statistical analysis of very large datasets.

The key here is to step away from any preconceived notions about learning as we humans understand it. Conventional programming combines a set of human-created rules with data to produce answers to a problem; machine learning, by contrast, uses data and answers to discover the rules behind a problem.

To learn the rules governing a phenomenon, machines must go through a learning process, trying different rules and learning from how well they perform. This is where reward and loss functions come into play: they allow the machine to automatically assess the rules it created.

Thus, for a machine, ‘learning’ is better understood as the process of maximizing its reward function, limited to the context of that specific task and training data. An ML model trained to identify patients at risk of heart attack is not guaranteed to work for identifying other types of heart failure, nor to be able to include additional information about the patient, and would probably perform poorly on cohorts of patients from other hospitals.
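As a toy illustration of this learning-as-optimization idea, here is a minimal sketch in plain Python (the data and the deliberately simple one-parameter ‘rule’ are invented for illustration; real ML models are far richer). The machine recovers the slope of a line by repeatedly nudging its candidate rule in the direction that reduces a squared-error loss:

```python
# Toy 'learning' loop: find the slope w that best maps x -> y
# by minimizing a squared-error loss with gradient descent.

# Training data generated by the hidden rule y = 3 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def loss(w):
    # Mean squared error of the candidate rule y = w * x
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = 0.0    # initial guess for the rule
lr = 0.01  # learning rate: how big a step to take each iteration
for _ in range(1000):
    # Gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # step in the direction that reduces the loss

print(round(w, 3))  # close to the hidden slope 3.0
```

Note that the loop never "understands" the line: it only keeps whichever adjustment lowers the loss, which is exactly the sense in which a machine learns.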


Machine Learning workflow

ML practitioners operate in a standard way when they approach a new problem. We can summarize a typical ML-based workflow into the following steps:

  • Data collection: gathering the data that the algorithm will learn from.
  • Data preparation: pre-processing this data into the optimal format, extracting important features (e.g. morphological parameters from an MRI image) and performing dimensionality reduction (i.e. reducing the number of input features). This latter step is important since high dimensionality (i.e. the number of features >> the number of available samples) can cause poor performance for ML algorithms.
  • Training: in this step the machine learning algorithm actually learns by showing it the data that has been collected and prepared.
  • Tuning/Validation: fine-tuning the model to maximize its performance.
  • Evaluation/Testing: testing the model on data not used in the training/validation step in order to evaluate how well it performs.
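The steps above can be sketched end-to-end in a few lines of plain Python. This is a toy illustration with synthetic data, not GenoMed4All code; the one-parameter model, the split sizes and the candidate learning rates are all arbitrary choices made for the example:

```python
import random

# Minimal sketch of the workflow: collect -> prepare -> train -> validate -> test.
random.seed(0)

# 1. Data collection: synthetic (feature, outcome) pairs from a noisy rule y = 2x
data = [(x, 2 * x + random.gauss(0, 0.1)) for x in range(100)]

# 2. Data preparation: shuffle, then split into train/validation/test sets
#    (real pipelines would also extract features and reduce dimensionality here)
random.shuffle(data)
train, valid, test = data[:70], data[70:85], data[85:]

def fit(samples, lr):
    # 3. Training: one-parameter model y = w * x, fitted by gradient descent
    w = 0.0
    for _ in range(200):
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w

def mse(w, samples):
    # Mean squared error of model w on a set of samples
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)

# 4. Tuning/Validation: pick the learning rate with the lowest validation error
candidates = {lr: fit(train, lr) for lr in (1e-5, 1e-4)}
best_lr = min(candidates, key=lambda lr: mse(candidates[lr], valid))

# 5. Evaluation/Testing: report performance on data never used for training or tuning
print(best_lr, round(mse(candidates[best_lr], test), 4))
```

The important pattern is that the test set is touched exactly once, at the very end: performance measured on data the model has already seen would be an overestimate.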

Machine Learning today is used for different tasks like image recognition, fraud detection, recommendation systems, text and speech systems, and so on. The main requirement for addressing all these tasks through a machine learning method is having a good representative set of data to train the learning model for the required task.

As an example, in the context of ML for image recognition, suppose the task is to identify different immune infiltrations from histopathological tissue images: if 80% of the images in the dataset show T-cell infiltrations, the learning model will have a difficult time correctly classifying infiltrations from macrophages or neutrophils.
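A tiny sketch in plain Python (hypothetical labels, invented purely for illustration) shows why such imbalance is misleading: a ‘model’ that always guesses the majority class looks accurate while never finding the minority classes at all.

```python
# A dataset with 80% of one class: a 'classifier' that always predicts
# the majority class scores well on accuracy but is useless in practice.
labels = ["T-cell"] * 80 + ["macrophage"] * 10 + ["neutrophil"] * 10

predictions = ["T-cell"] * len(labels)  # always guess the majority class

# Fraction of labels the constant guess gets right
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# How many minority-class samples are ever correctly identified
minority_hits = sum(
    p == t for p, t in zip(predictions, labels) if t != "T-cell"
)

print(accuracy)       # 0.8 -- looks decent
print(minority_hits)  # 0  -- no macrophage or neutrophil is ever found
```

This is why a representative, balanced training set (or metrics beyond plain accuracy) matters so much in clinical applications.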


Statistical vs ML techniques

At this stage, you might be wondering: this all sounds strangely similar to conventional statistics, so what is the catch? Well, both Machine Learning and statistics share the same ambition: to learn from data. It is only when we consider their approaches that the differences become apparent.

The goal of statistical methods is inference: to understand 'how' underlying relationships between variables work. This is radically different from Machine Learning, whose primary focus is on prediction: the ‘what’ itself, the results of this process. As explained by Bzdok et al.:

Inference creates a mathematical model of the data-generation process to formalize understanding or test a hypothesis about how the system behaves. Prediction aims at forecasting unobserved outcomes or future behavior, such as whether a mouse with a given gene expression pattern has a disease. Prediction makes it possible to identify best courses of action (e.g. treatment choice) without requiring understanding of the underlying mechanisms. In a typical research project, both inference and prediction can be of value - we want to know how biological processes work and what will happen next.

To give a concrete example of the difference between these two procedures, let’s consider a toy scenario: from a cohort of 1,000 subjects, we have blood pressure measurements taken under different environmental conditions, such as different temperatures and altitudes.

Considering blood pressure as the outcome, the inferential approach aims at describing how the available blood pressure measurements are affected by temperature and altitude, by building a model on these two variables and making some statistical assumptions about them (e.g. that they are normally distributed). The final model combining these variables is then tested for goodness of fit against all 1,000 available blood pressure measurements.

On the other hand, the predictive approach implemented by machine learning algorithms aims at predicting future blood pressure levels from the environmental conditions (temperature and altitude). To this end, part of the available measurements (e.g. those from 800 subjects) are used to train different models, assuming only that a relationship between the outcome (blood pressure) and the variables (temperature and altitude) exists, while the remaining subjects are used to assess the performance of the models on new samples.
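To make the predictive approach concrete, here is a minimal sketch in plain Python. The cohort is simulated, the numbers are invented, and a simple nearest-neighbour predictor stands in for whatever model an ML practitioner would actually train; the point is only the shape of the procedure: train on 800 subjects, assess on the 200 held out.

```python
import random

random.seed(42)

# Toy version of the blood-pressure example: 1,000 simulated subjects with
# (temperature, altitude) readings and a synthetic blood pressure outcome.
def simulate(n):
    cohort = []
    for _ in range(n):
        temp = random.uniform(-5, 35)   # degrees Celsius
        alt = random.uniform(0, 3000)   # metres above sea level
        bp = 110 - 0.2 * temp + 0.004 * alt + random.gauss(0, 1)
        cohort.append(((temp, alt), bp))
    return cohort

cohort = simulate(1000)
train, test = cohort[:800], cohort[800:]  # 800 to train, 200 held out

def predict(conditions):
    # Nearest-neighbour prediction: assume only that similar environmental
    # conditions give similar blood pressure, with no model of 'how'.
    temp, alt = conditions
    _, bp = min(
        train,
        key=lambda s: ((s[0][0] - temp) / 40) ** 2
        + ((s[0][1] - alt) / 3000) ** 2,
    )
    return bp

# Assess the model only on the 200 subjects never seen during training
errors = [abs(predict(cond) - bp) for cond, bp in test]
print(round(sum(errors) / len(errors), 2))  # mean absolute error, in mmHg
```

Notice that the predictor never estimates how temperature or altitude act on blood pressure; it only forecasts the ‘what’, which is exactly the contrast with the inferential approach above.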

Understanding the difference between inference and prediction is important in identifying the strengths and weaknesses of statistical learning and machine learning methods and being able to select the most appropriate approach based on the research goals. For example, we might want to infer which biological processes are associated with a specific disease but predict the best therapy for that disease.

A nice schematization of this juxtaposition can be found in Statistical Modeling: The Two Cultures, and you can see a modified version depicted below.


In our next knowledge pill, we will be covering different types of learning and turning the spotlight on specific subfields of Machine Learning that are predominant in a clinical setting.

Stay tuned! 


Missed anything? Here's a glossary for you!

  • Artificial Intelligence (AI) - Any technique that enables computers to mimic human behaviour.
  • Deep Learning (DL) - A subset of ML based on multi-layer ('deep') neural networks.
  • Generalization - The machine's ability to achieve good performance in the task it was trained to do with data it has never seen before; or the machine's ability to perform new tasks.
  • Inference - In statistical methods, the process of understanding how underlying relationships between variables determine the final output.
  • Loss and reward functions - A quantitative measure of how well the training task is being performed; the machine learns by either trying to minimize its loss function or maximize its reward function.
  • Machine Learning (ML) - A subset of AI techniques which use statistical methods to enable machines to improve with experience.
  • Neural Networks (NN) - A subset of ML algorithms that draw inspiration from the operations of a human brain to recognize relationships between data.
  • Prediction - In Machine Learning, the output of a function that, given some input, is as close as possible to the result of the natural process we are considering.


This knowledge pill was created by Francesco Cremonesi, Tiziana Sanavia, Maria Paola Boaro and Diana López from the GenoMed4All consortium
Photo by Fakurian Design on Unsplash