Data drives discovery. One would therefore expect more data to drive more discovery, or at least to accelerate it. To access more data, one often has to aggregate and centralize data from different sources. These sources can span a variety of devices, instruments, formats, languages, countries and jurisdictions. In the most straightforward approach, data is brought together and (pre-)processed at a single site and analysed centrally at the location of the AI model. This is ‘traditional’ centralized learning.
In practice, data is often trapped. It can be trapped in paper documents, PDFs, or physical libraries. Intelligent Document Processing is one way to approach this type of unstructured ‘trapped’ data. Alternatively, data can be trapped by ownership and privacy. For example, when an industrial partner has an interest in a dataset, the owner of the data may be less willing to share it. A patient, or an institution representing the patient, might be restricted from sharing the data for privacy reasons. A technical reason for data being trapped is that a device was developed without data aggregation in mind and therefore lacks any means for data harmonization.
Driven by advances in the Internet of Things, federated learning (FL) offers a solution to release data and drive discovery. In FL, the model is trained without the data leaving its source, so the party coordinating the training never sees or touches the raw data. Metaphorically speaking, the AI model travels to each dataset rather than the data being aggregated at the site of the AI model. FL stimulates data sharing collaborations across sites without compromising patient privacy or data protection legislation such as GDPR and HIPAA. In parallel, FL reduces the obstacles posed by the heterogeneity of local devices or local (clinical) data management systems.
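The idea of the model travelling to the data can be sketched with federated averaging (FedAvg), one widely used FL scheme. This is a minimal illustration under stated assumptions, not a description of any specific deployment: the three simulated sites, the toy linear model y = 3x + 1, the learning rate and number of rounds, and the helper names local_update, federated_average and make_site are all invented for the example.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """Train on one site's private data; only the updated weights leave the site."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = w * x + b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n

def federated_average(updates):
    """Combine the site models, weighting each by its number of local samples."""
    total = sum(n for _, n in updates)
    w = sum(site_weights[0] * n for site_weights, n in updates) / total
    b = sum(site_weights[1] * n for site_weights, n in updates) / total
    return (w, b)

def make_site(n, seed):
    """Hypothetical local dataset: noisy samples of y = 3x + 1 (assumption)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in xs]

# Three sites of different sizes; their raw data is never pooled.
sites = [make_site(20, 1), make_site(50, 2), make_site(10, 3)]

global_weights = (0.0, 0.0)
for _ in range(100):
    # The current global model "travels" to every site and is trained locally.
    updates = [local_update(global_weights, data) for data in sites]
    # Only model parameters and sample counts are sent back and averaged.
    global_weights = federated_average(updates)

print(global_weights)
```

In this sketch the coordinating loop only ever handles model parameters and sample counts. Real deployments typically add secure aggregation, encryption in transit, and handling of sites that drop out, none of which is shown here.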
An example in biomedical research: in many instances the number of cases of orphan and rare diseases per institution is too low to reap the benefits of AI-assisted prediction. Here FL trains a machine learning algorithm across multiple decentralized devices or servers without exchanging the underlying data or sharing any sensitive patient data. The federated approach promotes access to as much data as possible while lowering institutional barriers to data sharing. Deploying FL over multiple sites and continents prompted recent breakthroughs in the diagnosis of rare and aggressive cancers, such as triple-negative breast cancer and glioblastoma. It overcame the lack of sufficient data for an AI analysis at a single site and thus outperformed the locally trained AI models.
Sensoworks, a partner in the Frontiere network of companies, is in the initial phase of developing an FL model in collaboration with the Politecnico di Milano. Different FL architectures and types will be investigated to find the best way to deploy an anomaly detection model on data derived from sensors, for which the benefits of FL outweigh the strenuous effort of harmonizing data.
Contact: Dr. Remco Foppen