We are back with a new Knowledge Pill… and the spotlight is now on Federated Learning: the backbone of GenoMed4All’s Artificial Intelligence framework for hematological diseases, and an emerging technology in the healthcare realm.

What is it exactly? Why does it matter? How does it work? – if any of these questions pique your interest, you are in the right place! Let us explore the ins and outs of Federated Learning and unravel its potential for healthcare in this two-part miniseries.

What is Federated Learning? 

Federated Learning is the latest piece in the ever-growing puzzle of machine data concepts and the underlying patterns that bind them together… and thus, a fundamental step in the evolution of how we look at and understand data. Throughout this journey, each milestone reached becomes a sort of ‘breakpoint’ in the continuum of change, a new push forward driven by the combined forces of emerging research outcomes, technological prowess and business needs colliding at a particular moment in time. These communities –research, technology and business– are converging, and this massive shift creates new and exciting opportunities for Federated Learning to bloom.

The continuum of machine data evolution 

The idea of data warehouses as a technology first appeared in the 80s and continued to reign supreme through the 90s as the go-to data architecture for big companies. The reason? It offered a way for them to integrate all analytics data into a single central repository, where it could then be optimally accessed, queried and visualized as a whole [1]. Data warehouses kicked off the conversation on data governance and remain relevant today. However, they are not a one-size-fits-all solution, requiring highly structured data and usually running on third parties’ proprietary hardware.

Then Hadoop showed up and changed everything. Now, thanks to the Map-Reduce paradigm [2], companies had free access to an open-source solution that was able to process raw data –either unstructured or semi-structured– in a way that data warehouses never could… and the emerging cloud computing technology promised to free them from the shackles of on-premise infrastructure. With all this data suddenly there for the taking, the focus shifted to finding a better way to collect and store it: the so-called data lake.

The number of connected devices skyrocketed and so did the amount of raw data available. The Internet of Everything (IoE) took off and soon it became painfully clear that cloud computing could not keep pace in this new paradigm of ever-growing data: slow speeds, bandwidth issues and a deficient approach to data privacy and security fueled the flames of a new, brighter fire: edge computing [3]. This new era ushers in a shift in computing power: from a central node in the cloud to (unsurprisingly) the “edges” of the network –closer to the data sources–, where small-scale storage and data processing can be performed locally.

Problem solved, right? Well, not really. For years, we have tirelessly gone through every trick in the book on processor chip design to try and keep up with Moore’s law, but engineers seem to have reached the limit in terms of transistor scale and capacity. In this race, traditional general-purpose processor chips –what we normally call CPUs (Central Processing Units)– have given way to specialized accelerators [4] –like GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units)– capable of handling dedicated AI/ML workloads and performing much better under specific size, weight and/or power constraints.

In the highly distributed modern world of Big Data, a centralized approach is not going to cut it anymore: data is everywhere and the idea of accessing, processing and consuming it from a single central place seems lacking. How to scale up then? Let’s move on to the next frontier: federation [5]. 

A federated –or decentralized– query processing system reconciles the need to access data in a unified way with the inherent nature of the data itself: heterogeneous and distributed. These query engines have no control over the datasets and data providers are free to come and go in a federated scheme. Together with improved flexibility, this also introduces an additional level of performance complexity –mostly related to data fragmentation, disparities in access and processing power– that centralized systems are typically exempt from. 
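To ground the idea, here is a minimal sketch of the federated query pattern –two independent SQLite databases standing in for autonomous data providers, with the engine fanning the same query out to each and merging the partial results. The table name, schema and data are illustrative inventions, not any real engine’s API:

```python
# Toy federated query: fan the same query out to independent sources the
# engine does not control, then merge partial results -- no central copy.
import sqlite3

def make_source(rows):
    """Stand-in for an autonomous data provider with its own storage."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE patients (site TEXT, age INTEGER)")
    db.executemany("INSERT INTO patients VALUES (?, ?)", rows)
    return db

# Two providers, free to join or leave the federation at any time.
sources = [
    make_source([("A", 34), ("A", 51)]),
    make_source([("B", 29), ("B", 62), ("B", 45)]),
]

# The "federated engine": push the aggregation down to each source,
# then combine the partial counts and sums into a global answer.
partials = [db.execute("SELECT COUNT(*), SUM(age) FROM patients").fetchone()
            for db in sources]
total = sum(count for count, _ in partials)
mean_age = sum(age_sum for _, age_sum in partials) / total
print(f"{total} patients across {len(sources)} sources, mean age {mean_age:.1f}")
```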

The final leap from here to Federated Learning is once again motivated by the ebbs and flows of business: the prevalence of AI models and the supremacy of data, set against the need for computational and resource frugality, and the real, often bitter implications for both privacy and ethics.

Reflecting on our journey through the research, technological and business dimensions, we can see that Federated Learning is indeed part of an evolution of machine data concepts and patterns that has been going on for quite some time. As such, every single one of these milestones we have briefly touched upon along the way –the ‘breakpoints’ in this continuum– emerged as a result of mixing new, exciting research outcomes with a keen ambition to address pressing business needs at the time. 

All in all, Federated Learning is here to stay… let’s see what it has to offer. 

Federated Learning: an overview 

Unlike standard Machine Learning approaches, Federated Learning starts a revolution by bringing the model to the data, instead of the other way around. The concept is surprisingly simple: train a model across multiple decentralized edge nodes, each holding its own local dataset, with none of them exchanging raw data. Following this approach, a community can create and share a global AI/ML model without centralizing the data in a single location.


Like any other technology, Federated Learning comes with its own peculiarities. At the forefront of these, we find a decentralized learning process over multiple clients interconnected in a network, in which only model parameters and framework control data are exchanged. Moreover, training datasets never leave the edge nodes: the model is trained locally, preserving data privacy. Every new training process is orchestrated in the same manner: first, we initialize the global model and broadcast it to the edge nodes in our network; next, each node trains on its local data while we monitor every training phase; and finally, we aggregate the local results into an updated global model before deployment, as the sketch below illustrates.
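To make this orchestration concrete, here is a minimal, self-contained sketch of that loop in the spirit of FedAvg, on a toy linear-regression task. Every name, dataset and hyperparameter below is an illustrative assumption, not the GenoMed4All implementation:

```python
# A toy federated round: initialize, broadcast, train locally, aggregate.
# Only model weights travel between nodes; the datasets never move.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Gradient descent on one node's private data; returns new weights."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w

# Three hypothetical edge nodes, each holding its own private dataset.
true_w = np.array([2.0, -1.0])
nodes = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    nodes.append((X, y))

global_w = np.zeros(2)                       # 1) initialize the global model
for _ in range(10):                          # one federated round per loop
    locals_ = [local_update(global_w, X, y)  # 2) broadcast + local training
               for X, y in nodes]
    sizes = [len(y) for _, y in nodes]
    global_w = np.average(locals_, axis=0, weights=sizes)  # 3) aggregate

print("learned weights:", global_w)          # close to true_w, no data pooled
```

Note that the aggregation step weights each node’s contribution by the size of its local dataset, one common (but by no means only) design choice.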


Federated Learning comes in different flavors [6]. The most common is the so-called model-centric approach, in which training is carried out at the edge nodes and a central model is updated through federation. As an umbrella term, this approach covers two subtypes: cross-device and cross-silo.

The cross-device alternative is based on horizontal data sharing: training is carried out in a large distributed network of multiple smart devices –as is the case for Google and Android– and must therefore carefully consider the capabilities at the edge in terms of connectivity, limited computing power and battery life. In contrast, the cross-silo subtype relies on vertical data sharing: organizations (e.g. hospitals, private companies…) or data centers take on the role of edge nodes and share an incentive to train a common model, while adhering to strict data security standards and refraining from peer-to-peer data sharing.


Another flavor in this “FL menu” is the data-centric approach: a peer-to-peer distributed learning formula in which each node in the network graph sends gradient steps to its neighbors and receives theirs in return, reaching global model convergence through gradual consensus.
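As a rough illustration of that consensus dynamic, the toy sketch below puts four peers on a ring: each takes a gradient step on its own data, then averages its model with its two neighbors. The ring topology, step sizes and data are all assumptions made for the example:

```python
# Toy data-centric (peer-to-peer) learning: no central aggregator at all.
# Each node alternates a local gradient step with averaging over neighbors.
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([1.5, 0.5])

# Four peers on a ring, each with private data and its own model copy.
nodes = []
for _ in range(4):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    nodes.append({"X": X, "y": y, "w": np.zeros(2)})

k = len(nodes)
for _ in range(200):
    # 1) local gradient step on each node's own data
    for n in nodes:
        grad = n["X"].T @ (n["X"] @ n["w"] - n["y"]) / len(n["y"])
        n["w"] = n["w"] - 0.05 * grad
    # 2) gossip: average with ring neighbors (models travel, data does not)
    mixed = [(nodes[i]["w"] + nodes[(i - 1) % k]["w"] + nodes[(i + 1) % k]["w"]) / 3
             for i in range(k)]
    for n, w in zip(nodes, mixed):
        n["w"] = w

print([np.round(n["w"], 2) for n in nodes])  # all peers converge to ~true_w
```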

And what about GenoMed4All? Considering everything we have told you so far, our project falls neatly into the cross-silo category of Federated Learning, though there is room to explore how a clinical consensus on horizontal data sharing could be reached. After all, advances in rare diseases strongly rely on both collaboration and standardization efforts. That said, and for the sake of clarity, from this point onwards we will focus exclusively on the cross-silo approach.


Why Federated Learning? 

An attractive new concept for healthcare informatics 

It is no secret that systems and processes in the healthcare realm are, more often than not, extremely complex. This complexity directly contributes to a high level of fragmentation that mostly stems from the sensitive nature of healthcare data. Consequently, countless regulations have been developed to dictate the way we access and analyze this kind of data, which typically falls under what is called Protected Health Information (PHI). At their core, these regulatory efforts share the same ambition: ensure that sensitive patient data stays either within local institutions (clinical sites, hospitals) or with the individuals themselves, effectively protecting patient privacy. 

In light of all this, the value proposition of Federated Learning in healthcare boils down to two critical features: scalability and data protection [7]. First up, Federated Learning streamlines scaling up the training of ML models across multiple edge nodes –think hospitals, clinical sites and research institutions in GenoMed4All– which can theoretically improve the performance of the predictive models, reducing selection bias and facilitating the onboarding of new data providers and sources into the training strategy. We are also talking about a framework with built-in re-training and expansion capabilities by design, which greatly simplifies the reuse of predictive algorithms.

Similarly, and as far as data protection goes, Federated Learning is a perfect fit for the healthcare domain [8], since data is never required to leave the clinical data provider’s premises. This has huge implications in a clinical setting and may be key in navigating the current limitations imposed by data protection laws on highly sensitive data –like PHI, for example–, since technically there is no data sharing involved. Conveniently, this also addresses the fragmentation issue by effectively linking multiple health data sources for the purpose of training predictive models.

All in all, Federated Learning seems to check all the boxes on the healthcare sector’s wishlist, but real-world applications are still few and far between. According to Gartner (and as of 2021), Federated Learning is climbing the initial slope of the Privacy Hype Cycle, which effectively labels it an ‘innovation trigger’ in the digital ethics landscape: a technology in its early stages, with some proof-of-concept stories to share and plenty of media attention, but no proven commercial viability yet.


So, what exactly is holding us back? Keep an eye out for Part 2!


Missed anything? Check out these references!



This Knowledge Pill was created by Vincent Planat (DEDALUS), Francesco Cremonesi (DATAWIZARD) and Diana López (AUSTRALO) from the GenoMed4All consortium
Photo by Milad Fakurian on Unsplash