From its undeniable relevance as a buzzword in our day-to-day lives to its multiple applications in healthcare, Machine Learning is here to stay. But what is it exactly, and how does it come into play in GenoMed4All?
Through a series of short knowledge pills, we intend to bridge the gap between technical experts, clinicians and the general public, and walk you through the core concepts, challenges and use cases of Machine Learning techniques in a clear and engaging way.
Curious? Below is a sneak peek of the main ideas we will be tackling in this first knowledge pill. Let’s jump right in!
What is Machine Learning?
There is no point in providing a definition for Machine Learning (ML) without first addressing the wider context it exists in: the world of Artificial Intelligence.
As a concept, Artificial Intelligence (AI) refers to any technique that makes computers capable of mimicking human behaviour to address and solve problems. In the 1980s, one group of AI methods became more common: Machine Learning, which uses statistical techniques to – you guessed it – enable machines to recognize specific patterns by learning and improving through experience on a set of provided data.
And what does ‘learning’ mean in this context?
Now that we have a sense of the bigger picture and are familiar with these definitions, a question arises: what does learning mean from a machine’s perspective? Surely, it must be something different from the human process of learning, but how so?
A tentative answer to this seemingly straightforward question can be found in the paper An Introduction to Machine Learning for Clinicians:
Machine learning (ML) describes the ability of an algorithm to ‘learn’ by finding patterns in large datasets. In other words, the ‘answers’ produced by ML algorithms are inferences made from statistical analysis of very large datasets.
The key here is to step away from any preconceived notions about the concept of learning as we humans understand it. Instead of combining a set of human-created rules with data to produce answers to a problem, as conventional programming does, machine learning uses data and answers to discover the rules behind a problem.
To learn the rules governing a phenomenon, machines must go through a learning process, trying different rules and learning from how well they perform. This is where reward and loss functions come into play: they allow the machine to automatically assess the rules it created.
Thus, for a machine, ‘learning’ is better understood as the process of maximizing its reward function, limited to the context of that specific task and training data. An ML model trained to identify patients at risk of heart attack is not guaranteed to work for identifying other types of heart failure, cannot automatically incorporate additional information about the patient, and would probably perform poorly on cohorts of patients from other hospitals.
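To make this contrast concrete, here is a minimal Python sketch; the 140 mmHg threshold, the measurements and the labels are all invented for illustration. In conventional programming a human writes the rule, while in machine learning the machine recovers a comparable rule from data and answers:

```python
# Conventional programming vs machine learning (illustrative data only).
from sklearn.tree import DecisionTreeClassifier

# Conventional programming: a human hand-crafts the rule; data in, answers out.
def is_hypertensive(systolic_bp):
    return systolic_bp > 140  # rule written by a human

# Machine learning: data and answers in, rule out.
systolic_bp = [[118], [125], [142], [155], [132], [160], [138], [149]]
labels = [0, 0, 1, 1, 0, 1, 0, 1]  # 'answers' provided by clinicians

model = DecisionTreeClassifier(max_depth=1)  # learns a single threshold
model.fit(systolic_bp, labels)               # training optimizes a loss (impurity)
print(model.tree_.threshold[0])              # the rule the machine discovered: ~140.0
```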
Machine Learning workflow
ML practitioners operate in a standard way when they approach a new problem. We can summarize a typical ML-based workflow into the following steps, illustrated in the code sketch after the list:
- Data collection: gathering the data that the algorithm will learn from.
- Data preparation: pre-processing this data into the optimal format, extracting important features (e.g. morphological parameters from an MRI image) and performing dimensionality reduction (i.e. reducing the number of input features). This latter step is important since high dimensionality (i.e. when the number of features greatly exceeds the number of available samples) can cause ML algorithms to perform poorly.
- Training: the step in which the machine learning algorithm actually learns, by being shown the data that has been collected and prepared.
- Tuning/Validation: fine-tuning the model to maximize its performance, typically on a validation set kept separate from the training data.
- Evaluation/Testing: testing the model on data not used in the training/validation step in order to evaluate how well it performs.
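To see how these steps fit together in practice, here is a minimal sketch using scikit-learn; the bundled example dataset, the PCA step and the logistic regression model are illustrative choices, not a prescription:

```python
# A minimal end-to-end ML workflow (scikit-learn; all modelling choices illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Data collection: here, a dataset bundled with the library.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set for the final evaluation step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Data preparation: scaling plus dimensionality reduction (PCA).
pipeline = make_pipeline(StandardScaler(), PCA(n_components=10),
                         LogisticRegression(max_iter=1000))

# 3-4. Training and tuning/validation: cross-validated search over a hyperparameter.
search = GridSearchCV(pipeline, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 5. Evaluation/testing: performance on data never seen during training or tuning.
print("test accuracy:", search.score(X_test, y_test))
```

In a real project each step is of course far more involved (data collection alone can dominate the effort), but the skeleton stays the same.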
Machine Learning is used today for many different tasks, such as image recognition, fraud detection, recommendation systems, and text and speech processing. The main requirement for addressing any of these tasks with a machine learning method is having a good, representative set of data on which to train the learning model.
As an example, in the context of ML for image recognition, if the task is to identify different immune infiltrations in histopathological tissue images but 80% of the images in your dataset show T-cell infiltrations, the learning model will have a difficult time correctly classifying infiltrations from macrophages or neutrophils.
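A quick look at the label distribution is often the first safeguard against this problem. Below is a minimal sketch, assuming the labels are stored in a simple list whose counts mirror the hypothetical 80% example above:

```python
# Spotting class imbalance before training (label counts are invented).
from collections import Counter

labels = ["t_cell"] * 800 + ["macrophage"] * 120 + ["neutrophil"] * 80

for cell_type, n in Counter(labels).items():
    print(f"{cell_type}: {n} images ({n / len(labels):.0%})")

# A model that always predicts 't_cell' would score 80% accuracy here
# while misclassifying every macrophage and neutrophil infiltration.
```

Common mitigations include collecting more examples of the under-represented classes, resampling, or weighting the classes during training.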
Statistical vs ML techniques
At this stage, you might be wondering: this all sounds strangely similar to conventional statistics, so what is the catch? Well, both Machine Learning and statistics share the same ambition: to learn from data. It is only when we consider their approaches that the differences become apparent.
The goal of statistical methods is inference: to understand ‘how’ underlying relationships between variables work. This is radically different from Machine Learning, whose primary focus is on prediction: the ‘what’ itself, the results of this process. As explained by Bzdok et al.:
Inference creates a mathematical model of the data-generation process to formalize understanding or test a hypothesis about how the system behaves. Prediction aims at forecasting unobserved outcomes or future behavior, such as whether a mouse with a given gene expression pattern has a disease. Prediction makes it possible to identify best courses of action (e.g. treatment choice) without requiring understanding of the underlying mechanisms. In a typical research project, both inference and prediction can be of value – we want to know how biological processes work and what will happen next.
To give a concrete example of the difference between these two procedures, let’s consider a dummy scenario: from a cohort of 1,000 subjects, we have blood pressure measurements taken under different environmental conditions, such as different temperatures and altitudes.
Considering blood pressure as the outcome, the inferential approach aims at describing how the available blood pressure measurements are affected by temperature and altitude, by building a model on these two variables and making some statistical assumptions about them (e.g. that they are normally distributed). The final model combining these variables is then tested for goodness of fit against all 1,000 available blood pressure measurements.
On the other hand, the predictive approach implemented by machine learning algorithms aims at predicting future blood pressure levels from the environmental conditions (temperature and altitude). To this end, part of the available measurements (e.g. those from 800 subjects) are used to train different models, assuming only that a relationship between the outcome (blood pressure) and the variables (temperature and altitude) exists, while the remaining subjects are used to assess the models’ performance on new samples.
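The contrast can be sketched in a few lines of Python on synthetic data; the linear relationship, the noise level and the 800/200 split are all assumptions made to mirror the dummy example (a full inferential analysis would also test the statistical assumptions and the significance of each coefficient):

```python
# Inference vs prediction on synthetic blood pressure data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
temperature = rng.normal(20, 8, n)    # degrees Celsius
altitude = rng.normal(500, 300, n)    # metres
blood_pressure = 120 - 0.3 * temperature + 0.01 * altitude + rng.normal(0, 5, n)
X = np.column_stack([temperature, altitude])

# Inferential flavour: fit one model on ALL 1,000 subjects and inspect how each
# variable relates to the outcome (goodness of fit on the same data).
full_model = LinearRegression().fit(X, blood_pressure)
print("coefficients (temperature, altitude):", full_model.coef_)
print("R^2 on the full cohort:", full_model.score(X, blood_pressure))

# Predictive flavour: train on 800 subjects, then judge the model only by how
# well it forecasts blood pressure for the 200 subjects it has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, blood_pressure, test_size=200, random_state=0)
predictive_model = LinearRegression().fit(X_train, y_train)
print("R^2 on unseen subjects:", predictive_model.score(X_test, y_test))
```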
Understanding the difference between inference and prediction is important for identifying the strengths and weaknesses of statistical learning and machine learning methods, and for selecting the most appropriate approach based on the research goals. For example, we might want to infer which biological processes are associated with a specific disease, but predict the best therapy for that disease.
A nice schematization of this juxtaposition can be found in Statistical Modeling: The Two Cultures, and you can see a modified version depicted below.
In our next knowledge pill, we will be covering different types of learning and turning the spotlight on specific subfields of Machine Learning that are predominant in a clinical setting.
Stay tuned!
Missed anything? Here’s a glossary for you!
- Artificial Intelligence (AI) – Any technique that enables computers to mimic human behaviour.
- Deep Learning (DL) – A subset of NN techniques based on neural networks with many layers (multi-layer NNs).
- Generalization – The machine’s ability to achieve good performance in the task it was trained to do with data it has never seen before; or the machine’s ability to perform new tasks.
- Inference – In statistical methods, the process of understanding how underlying relationships between variables determine the final output.
- Loss and reward functions – A quantitative measure of how well the training task is being performed; the machine learns by either trying to minimize its loss function or maximize its reward function.
- Machine Learning (ML) – A subset of AI techniques which use statistical methods to enable machines to improve with experience.
- Neural Networks (NN) – A subset of ML algorithms that draw inspiration from the operations of a human brain to recognize relationships between data.
- Prediction – In Machine Learning, the output of a learned function that, given some input, aims to be as close as possible to the outcome of the natural process being modelled.