A conversation on Federated Learning - Part 1

We are back with a new Knowledge Pill... and the spotlight is now on Federated Learning! The backbone of GenoMed4All's Artificial Intelligence framework for hematological diseases, and an emerging technology for the healthcare realm.

What is it exactly? Why does it matter? How does it work? – if any of these questions pique your interest, you are in the right place! Let us explore the ins and outs of Federated Learning and unravel its potential for healthcare in this two-part miniseries.

What is Federated Learning? 

Federated Learning is the latest piece in the ever-growing puzzle of machine data concepts and the underlying patterns that bind them together... and thus, a fundamental step in the evolution of how we look at and understand data. Throughout this journey, each milestone reached becomes a sort of 'breakpoint' in the continuum of change, a new push forward driven by the combined forces of emerging research outcomes, technological prowess and business needs colliding at a particular moment in time. These communities –research, technology and business– are converging, and this massive shift creates new and exciting opportunities for Federated Learning to bloom.

The continuum of machine data evolution 

The idea of data warehouses as a technology first appeared in the 80s and continued to reign supreme through the 90s as the go-to data architecture for big companies. The reason? It offered a way for them to integrate all analytics data into a single central repository, where it could then be optimally accessed, queried and visualized as a whole [1]. Data warehouses kicked off the conversation on data governance and continue to be relevant nowadays. However, they are not a one-size-fits-all solution: they require highly structured data and usually run on third parties’ proprietary hardware.

Then Hadoop showed up and changed everything. Now, thanks to the Map-Reduce paradigm [2], companies had free access to an open source solution that was able to process raw data –either unstructured or semi-structured– in a way that data warehouses never could… and the emerging cloud computing technology promised to free them from the shackles of on-premise infrastructure. With all this data suddenly there for the taking, the focus shifted to finding a better way to collect it and store it: the so-called data lake. 

The number of connected devices skyrocketed and so did the amount of raw data available. The Internet of Everything (IoE) took off and soon it became painfully clear that cloud computing could not keep pace with this new paradigm of ever-growing data: slow speeds, bandwidth issues and a deficient approach towards data privacy and security were the fuel to the flames of a new, brighter fire: edge computing [3]. This new era ushered in a shift in computing power: from a central node in the cloud to (unsurprisingly) the “edges” of the network –closer to the data sources–, where small-scale storage and data processing can be performed locally.

Problem solved, right? Well, not really. For years, we have tirelessly gone through every trick in the book on processor chip design to try and keep up with Moore’s law, but engineers seem to have reached the limit in terms of transistor scale and capacity. In this race, traditional general-purpose processor chips –what we normally call CPUs (Central Processing Units)– have given way to specialized accelerators [4] –like GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units)– capable of handling dedicated AI/ML workloads and performing much better under specific size, weight and/or power constraints.

In the highly distributed modern world of Big Data, a centralized approach is not going to cut it anymore: data is everywhere and the idea of accessing, processing and consuming it from a single central place seems lacking. How to scale up then? Let’s move on to the next frontier: federation [5]. 

A federated –or decentralized– query processing system reconciles the need to access data in a unified way with the inherent nature of the data itself: heterogeneous and distributed. These query engines have no control over the datasets and data providers are free to come and go in a federated scheme. Together with improved flexibility, this also introduces an additional level of performance complexity –mostly related to data fragmentation, disparities in access and processing power– that centralized systems are typically exempt from. 

The final leap from here to Federated Learning is once again motivated by the ebbs and flows of business: the prevalence of AI models and the supremacy of data on one side, set against the need for digital and resource frugality and the real, often bitter implications for both privacy and ethics on the other.

Reflecting on our journey through the research, technological and business dimensions, we can see that Federated Learning is indeed part of an evolution of machine data concepts and patterns that has been going on for quite some time. As such, every single one of these milestones we have briefly touched upon along the way –the ‘breakpoints’ in this continuum– emerged as a result of mixing new, exciting research outcomes with a keen ambition to address pressing business needs at the time. 

All in all, Federated Learning is here to stay… let’s see what it has to offer. 

Federated Learning: an overview 

Unlike standard Machine Learning approaches, the revolution in Federated Learning starts by bringing the model to the data, instead of the other way around. The concept is surprisingly simple: let us train a model across multiple decentralized edge nodes. Each node holds its own local dataset, and none of them exchanges data with the others. Following this approach, a community will be able to create and share a global AI/ML model without centralizing the data in a single location.

 

Like any other technology, Federated Learning comes with its own peculiarities. At the forefront of these, we find a decentralized learning process over multiple clients interconnected in a network, in which only model parameters and framework control data are exchanged. Moreover, training datasets do not leave the edge nodes. Consequently, the model is trained locally and is able to preserve data privacy. Every new training process is orchestrated in the same manner: first, we initialize the global model and broadcast it to the edge nodes in our network; next, each node trains the model on its local data while we monitor every training phase; and finally, we aggregate the results before deployment.
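
To make this orchestration loop more tangible, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the GenoMed4All implementation: the function names (local_training, aggregate, federated_training), the three "hospitals" and their data points are all hypothetical, and the aggregation step is assumed to be a simple average of the node models weighted by local dataset size, which the text above does not specify.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]   # (x, y) records held privately by one node
Weights = Tuple[float, float]         # (slope, intercept) of a toy linear model


def local_training(weights: Weights, data: Dataset,
                   lr: float = 0.05, epochs: int = 20) -> Weights:
    """Run a few epochs of gradient descent on one node's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def aggregate(updates: List[Weights], sizes: List[int]) -> Weights:
    """Average the node models, weighting each by its local dataset size (assumption)."""
    total = sum(sizes)
    w = sum(u[0] * s for u, s in zip(updates, sizes)) / total
    b = sum(u[1] * s for u, s in zip(updates, sizes)) / total
    return w, b


def federated_training(nodes: List[Dataset], rounds: int = 50) -> Weights:
    global_weights: Weights = (0.0, 0.0)   # 1. initialize the global model
    for _ in range(rounds):
        # 2. broadcast: every node starts its local training from the global weights
        updates = [local_training(global_weights, data) for data in nodes]
        # 3. aggregate: only model weights travel back, never the (x, y) records
        global_weights = aggregate(updates, [len(d) for d in nodes])
    return global_weights


# Three hypothetical hospitals whose (made-up) data all follow y = 2x + 1
hospital_a = [(0.0, 1.0), (1.0, 3.0)]
hospital_b = [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
hospital_c = [(-1.0, -1.0), (0.5, 2.0)]

slope, intercept = federated_training([hospital_a, hospital_b, hospital_c])
print(f"global model: y = {slope:.3f}x + {intercept:.3f}")
```

Note that federated_training only ever sees the weights returned by each node: the raw records stay inside local_training, which is the property the paragraph above describes.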

 

Federated Learning comes in different flavors [6]. The most common one is the so-called model-centric approach, in which training is carried out at the edge nodes and a central model is updated through a federation. As an umbrella term, this approach applies to two additional subtypes: cross-device and cross-silo. 

The cross-device alternative is based on horizontal data sharing: training is carried out in a large distributed network of multiple smart devices –as is the case for Google and Android– and thus, should carefully consider the capabilities at the edge in terms of connectivity, limited computing and battery life. In contrast, the cross-silo subtype relies instead on vertical data sharing: organizations (e.g. hospitals, private companies…) or data centers take on the role of edge nodes and share incentives to train a common model, while adhering to strict data security standards and refraining from peer-to-peer data sharing. 

 

Another flavor in this “FL menu” is the data-centric approach, a peer-to-peer distributed learning formula through which each node in the network graph sends and receives gradient steps from its neighbors in order to reach global model convergence via gradual consensus. 
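
The idea of "gradual consensus" can be sketched with a deliberately simplified example. The code below is an assumption-laden illustration, not a full peer-to-peer learning algorithm: it leaves out the gradient steps and only shows the consensus part, where each node repeatedly averages its parameter with those of its direct neighbors. The ring topology, the function name gossip_round and the starting values are hypothetical.

```python
from typing import Dict, List


def gossip_round(values: List[float], neighbors: Dict[int, List[int]]) -> List[float]:
    """Each node replaces its value with the mean of itself and its neighbors."""
    new_values = []
    for node, value in enumerate(values):
        group = [value] + [values[n] for n in neighbors[node]]
        new_values.append(sum(group) / len(group))
    return new_values


# Hypothetical ring of four nodes: each one talks only to its two neighbors.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
local_params = [1.0, 2.0, 3.0, 6.0]   # e.g. one locally trained weight per node

params = local_params
for _ in range(60):
    params = gossip_round(params, ring)

print(params)  # every node ends up close to the network-wide mean of 3.0
```

No node ever sees the whole network, yet all of them drift towards the same value. On a ring, where every node has the same number of neighbors, that value is the plain average of the starting parameters; less regular topologies would need a more careful weighting scheme.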

And what about GenoMed4All? Considering everything we have told you so far, our project falls neatly into the cross-silo category of Federated Learning, though there is always room for a potential exploration on how to reach a clinical consensus on horizontal data sharing. After all, any and all advances on rare diseases strongly rely on both collaboration and standardization efforts. That said, and for the sake of clarity, from this point onwards we will focus our attention on the cross-silo approach exclusively. 

 

Why Federated Learning? 

An attractive new concept for healthcare informatics 

It is no secret that systems and processes in the healthcare realm are, more often than not, extremely complex. This complexity directly contributes to a high level of fragmentation that mostly stems from the sensitive nature of healthcare data. Consequently, countless regulations have been developed to dictate the way we access and analyze this kind of data, which typically falls under what is called Protected Health Information (PHI). At their core, these regulatory efforts share the same ambition: ensure that sensitive patient data stays either within local institutions (clinical sites, hospitals) or with the individuals themselves, effectively protecting patient privacy. 

In light of all this, the value proposition of Federated Learning in healthcare boils down to two critical features: scalability and data protection [7]. First up, Federated Learning streamlines the scaling up process for training ML models in multiple edge nodes –think hospitals, clinical sites, research institutions in GenoMed4All– which can theoretically improve the performance of the predictive models, reducing selection bias and facilitating the onboarding of new data providers and sources to the whole training strategy. We are also talking about a framework with built-in re-training and expansion capabilities by design, which greatly simplifies reusability of predictive algorithms. 

Similarly, and as far as data protection goes, Federated Learning is a perfect fit for the healthcare domain [8], since data is not required to leave the clinical data provider’s premises at any point. This has huge implications in a clinical setting and may be key in navigating the current limitations imposed by data protection laws on highly sensitive data –like PHI, for example–, since technically there is no data sharing involved. Coincidentally, this also solves the fragmentation issue by effectively linking multiple health data sources for the purpose of training predictive models.  

All in all, Federated Learning seems to check all the boxes in the healthcare sector wishlist, but real-world applications are still few and hard to come by. According to Gartner (and as of 2021), Federated Learning is still climbing the initial slope of the Privacy Hype Cycle, which effectively labels it as an ‘innovation trigger’ in the digital ethics landscape: a technology in its early stages with some proof-of-concept stories to share and lots of media attention, but with no proven commercial viability yet.

 

So, what exactly is holding us back? Keep an eye out for Part 2!


Missed anything? Check out these references!

 


This Knowledge Pill was created by Vincent Planat (DEDALUS), Francesco Cremonesi (DATAWIZARD) and Diana López (AUSTRALO) from the GenoMed4All consortium
Photo by Milad Fakurian on Unsplash

Machine Learning in healthcare – A brief introduction

From its undeniable relevance as a buzzword in our day-to-day to its multiple applications in healthcare, Machine Learning is here to stay. But what is it exactly and how does it come into play in GenoMed4All?

Through a series of short knowledge pills, we intend to bridge the gap between technical experts, clinicians and the general public, and walk you through the core concepts, challenges and use cases of Machine Learning techniques in a clear and engaging way.

Curious? Below is a sneak peek of the main ideas we will be tackling in this first knowledge pill. Let’s jump right in!

 

What is Machine Learning?

There is no point in providing a definition for Machine Learning (ML) without first addressing the wider context it exists in: the world of Artificial Intelligence.

As a concept, Artificial Intelligence (AI) refers to any technique that makes computers capable of mimicking human behaviour to address and solve problems. In the 1980s, one group of AI methods became more common: Machine Learning, which uses statistical techniques to - you guessed it - enable machines to recognize specific patterns by learning and improving through experience on a set of provided data.

 

And what does 'learning' mean in this context?

After getting a sense of the bigger picture and becoming familiar with all these definitions, the question arises: what is learning from a machine perspective? Surely, it must be something different from the human process of learning, but how so?

A tentative way of approaching the answer to this seemingly straightforward question can be found in the paper An Introduction to Machine Learning for Clinicians:

Machine learning (ML) describes the ability of an algorithm to 'learn' by finding patterns in large datasets. In other words, the 'answers' produced by ML algorithms are inferences made from statistical analysis of very large datasets.

The key here is to step away from any preconceived notions on the concept of learning, as we humans understand it. Instead of combining a set of human-created rules with data to create answers to a problem, as is done in conventional programming, machine learning uses data and answers to discover the rules behind a problem.

To learn the rules governing a phenomenon, machines must go through a learning process, trying different rules and learning from how well they perform. This is where reward and loss functions come into play: they allow the machine to automatically assess the rules it created.

Thus, for a machine, ‘learning’ is better understood as the process of maximizing its reward function, limited to the context of that specific task and training data. An ML model trained to identify patients at risk of heart attack is not guaranteed to work for identifying other types of heart failure, nor to be able to include additional information about the patient, and would probably perform poorly on cohorts of patients from other hospitals.
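
As a rough illustration of "trying different rules and learning from how well they perform", the sketch below lets a machine discover a one-parameter rule by repeatedly lowering a loss function. Everything in it is an assumption made for the example: the data points, the mean squared error as loss, the gradient descent update and the step size of 0.05.

```python
def loss(w, data):
    """Mean squared error of the candidate rule y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def gradient(w, data):
    """Slope of the loss with respect to w: tells us which way to adjust the rule."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


# Data and answers; the hidden rule (unknown to the machine) is y = 3x
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0                      # initial guess for the rule
history = [loss(w, data)]    # keep track of how well each attempt performs
for _ in range(100):
    w -= 0.05 * gradient(w, data)
    history.append(loss(w, data))

print(f"learned rule: y = {w:.4f} * x, final loss = {history[-1]:.2e}")
```

The machine never receives the rule itself, only data and answers, and the loss function is what lets it judge each attempt automatically. Minimizing a loss and maximizing a reward are two views of the same process.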

 

Machine Learning workflow

ML practitioners operate in a standard way when they approach a new problem. We can summarize a typical ML-based workflow into the following steps:

  • Data collection: gathering the data that the algorithm will learn from.
  • Data preparation: pre-processing this data into the optimal format, extracting important features (e.g. morphological parameters from an MRI image) and performing dimensionality reduction (i.e. reducing the number of input features). This latter step is important since high dimensionality (i.e. the number of features >> the number of available samples) can cause poor performance for ML algorithms.
  • Training: in this step the machine learning algorithm actually learns from the data that has been collected and prepared.
  • Tuning/Validation: fine-tuning the model to maximize its performance.
  • Evaluation/Testing: testing the model on data not used in the training/validation step in order to evaluate how well it performs.
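
The steps above can be sketched end to end on a toy problem. This is a hedged illustration only: the data is synthetic, the 200/50/50 split is arbitrary, and the model (a k-nearest-neighbors vote on a single feature, with k tuned on the validation set) is chosen for brevity rather than taken from the project.

```python
import random

rng = random.Random(0)

# 1. Data collection (synthetic here): one feature, label is 1 when it exceeds 0.6
samples = [(x, int(x > 0.6)) for x in (rng.random() for _ in range(300))]

# 2. Data preparation: shuffle, then split into train / validation / test sets
rng.shuffle(samples)
train, validation, test = samples[:200], samples[200:250], samples[250:]


def predict(x, train_set, k):
    """Majority vote among the k training samples closest to x."""
    nearest = sorted(train_set, key=lambda s: abs(s[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(eval_set, train_set, k):
    hits = sum(predict(x, train_set, k) == label for x, label in eval_set)
    return hits / len(eval_set)


# 3. Training: for k-nearest neighbors this amounts to storing the training set
# 4. Tuning/Validation: pick the k that performs best on the validation set
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(validation, train, k))

# 5. Evaluation/Testing: measure performance on data never used so far
test_accuracy = accuracy(test, train, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

Keeping the test set untouched until the very end is what makes the final number an honest estimate of how the model behaves on unseen data.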

Machine Learning today is used for different tasks like image recognition, fraud detection, recommendation systems, text and speech systems, and so on. The main requirement for addressing all these tasks through a machine learning method is having a good representative set of data to train the learning model for the required task.

As an example, in the context of ML for image recognition, suppose the task is to identify different immune infiltrations from histopathological tissue images. If 80% of the images in your dataset show T-cell infiltrations, the learning model will have a difficult time correctly classifying infiltrations from macrophages or neutrophils.
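
A small numerical sketch shows why such an imbalance is a problem. The 80% share of T-cell images comes from the example above; the 12/8 split between the other two classes and the "always predict the majority class" model are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical label counts for 100 images: 80% T-cell, the rest split arbitrarily
labels = ["T-cell"] * 80 + ["macrophage"] * 12 + ["neutrophil"] * 8

# A "lazy" model that has only learned to output the most frequent class
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)


def recall(cls):
    """Fraction of the images of a given class that the model labels correctly."""
    relevant = [p for p, t in zip(predictions, labels) if t == cls]
    return sum(p == cls for p in relevant) / len(relevant)


per_class_recall = {cls: recall(cls) for cls in sorted(set(labels))}

print(accuracy)           # 0.8
print(per_class_recall)   # {'T-cell': 1.0, 'macrophage': 0.0, 'neutrophil': 0.0}
```

An overall accuracy of 80% looks respectable, yet this model never recognizes a single macrophage or neutrophil infiltration, which is why a representative dataset (and per-class metrics) matter.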

 

Statistical vs ML techniques

At this stage, you might be wondering: this all sounds strangely similar to conventional statistics, what is the catch? Well, both Machine Learning and statistics have a shared ambition: to learn from data. It is only when we consider their approach that the differences become apparent.

The goal of statistical methods is inference: to understand 'how' underlying relationships between variables work. This is radically different from Machine Learning, whose primary focus is on prediction: the ‘what’ itself, the results of this process. As explained by Bzdok et al.:

Inference creates a mathematical model of the data-generation process to formalize understanding or test a hypothesis about how the system behaves. Prediction aims at forecasting unobserved outcomes or future behavior, such as whether a mouse with a given gene expression pattern has a disease. Prediction makes it possible to identify best courses of action (e.g. treatment choice) without requiring understanding of the underlying mechanisms. In a typical research project, both inference and prediction can be of value - we want to know how biological processes work and what will happen next.

To make the difference between these two procedures concrete, let’s look at a dummy example. Suppose we have, from a cohort of 1,000 subjects, blood pressure measurements taken under different environmental conditions, such as different temperatures and altitudes.

Considering blood pressure as an outcome, the inferential approach aims at describing how the blood pressure measurements currently available are affected by temperature and altitude by building a model on these two variables and making some statistical assumptions on them (e.g. they are normally distributed). The final model combining these variables is then tested for the goodness of fit against all the 1,000 blood pressure measurements available.

On the other hand, the predictive approach implemented by machine learning algorithms aims at predicting future blood pressure levels from the environmental conditions (temperature and altitude). To this end, part of the available measurements (e.g. those from 800 subjects) is used to train different models, assuming only that a relationship between the outcome (blood pressure) and the variables (temperature and altitude) exists, while the remaining subjects are used to assess the performance of the models on new samples.
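
The two approaches can be contrasted on a synthetic version of this dummy cohort. The sketch below is illustrative only: the simulated effect sizes (-0.3 per °C, +4.0 per km), the noise level and the helper names (solve, fit, predict) are invented for the example and make no physiological claim; a plain linear model fitted by least squares stands in for both the statistical model and the ML model.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(rows):
    """Ordinary least squares for bp = b0 + b1 * temperature + b2 * altitude."""
    xs = [[1.0, t, a] for t, a, _ in rows]
    ys = [bp for _, _, bp in rows]
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(3)] for i in range(3)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(3)]
    return solve(xtx, xty)


def predict(coef, t, a):
    return coef[0] + coef[1] * t + coef[2] * a


# Synthetic cohort of 1,000 subjects (all numbers are made up)
rng = random.Random(42)
cohort = []
for _ in range(1000):
    t = rng.uniform(0, 35)    # temperature in °C
    a = rng.uniform(0, 3)     # altitude in km
    bp = 120 - 0.3 * t + 4.0 * a + rng.gauss(0, 5)
    cohort.append((t, a, bp))

# Inference: fit on all 1,000 subjects, then read the coefficients and goodness of fit
coef_all = fit(cohort)
mean_bp = sum(bp for _, _, bp in cohort) / len(cohort)
ss_res = sum((bp - predict(coef_all, t, a)) ** 2 for t, a, bp in cohort)
ss_tot = sum((bp - mean_bp) ** 2 for _, _, bp in cohort)
r_squared = 1 - ss_res / ss_tot

# Prediction: train on 800 subjects, measure the error on the 200 left out
train, held_out = cohort[:800], cohort[800:]
coef_train = fit(train)
mae = sum(abs(bp - predict(coef_train, t, a)) for t, a, bp in held_out) / len(held_out)

print("inference :", [round(c, 2) for c in coef_all], "R^2 =", round(r_squared, 2))
print("prediction: mean absolute error on unseen subjects =", round(mae, 2))
```

The inferential output answers "how do temperature and altitude relate to blood pressure in this cohort?", while the predictive output answers "how far off would we be for a subject we have not measured yet?".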

Understanding the difference between inference and prediction is important in identifying the strengths and weaknesses of statistical learning and machine learning methods and being able to select the most appropriate approach based on the research goals. For example, we might want to infer which biological processes are associated with a specific disease but predict the best therapy for that disease.

A nice schematization of this juxtaposition can be found in Statistical Modeling: The Two Cultures, and you can see a modified version depicted below.

 

In our next knowledge pill, we will be covering different types of learning and turning the spotlight on specific subfields of Machine Learning that are predominant in a clinical setting.

Stay tuned! 

 


Missed anything? Here's a glossary for you!

  • Artificial Intelligence (AI) - Any technique that enables computers to mimic human behaviour.
  • Deep Learning (DL) - A subset of NN that makes computation of multi-layer neural networks feasible.
  • Generalization - The machine's ability to achieve good performance in the task it was trained to do with data it has never seen before; or the machine's ability to perform new tasks.
  • Inference - In statistical methods, the process of understanding how underlying relationships between variables determine the final output.
  • Loss and reward functions - A quantitative measure of how well the training task is being performed; the machine learns by either trying to minimize its loss function or maximize its reward function.
  • Machine Learning (ML) - A subset of AI techniques which use statistical methods to enable machines to improve with experience.
  • Neural Networks (NN) - A subset of ML algorithms that draw inspiration from the operations of a human brain to recognize relationships between data.
  • Prediction - In Machine Learning, the output of a function that, given some input, is as close as possible to the result of the natural process we are considering.

 


This knowledge pill was created by Francesco Cremonesi, Tiziana Sanavia, Maria Paola Boaro and Diana López from the GenoMed4All consortium
Photo by Fakurian Design on Unsplash