
Towards federated learning

Capgemini
2020-11-27

The problem

Imagine you are tasked with building a world-beating travel app. You want your customers to see it as the ultimate travel companion – a digital helper that assists with planning their next dream trip and a local guide that leads them to the best hidden gems. Someone who seems to have known them for ages and caters to their tastes.

… Or you work in finance and want to provide your clients – financial institutions – with the most sophisticated credit scoring algorithm possible.

… Or, better yet, you are at the forefront of a medical research effort, on a quest for a remedy for a mysterious disease.

In all three cases, in order to deliver, you need access to massive amounts of highly sensitive data – personal photos, financial records or patient data.

The challenge

Until very recently, the way to solve the problem at scale would be to pool the data into a central data store, creating one massive dataset, and then develop machine learning models on it.

By modern standards, this centralised solution is less than ideal, or even outright impossible. Challenges include

  1. Regulatory boundaries – different laws in different jurisdictions impose limitations on data residency, sovereignty and localisation; sharing personally identifiable information (PII) may not be possible
  2. Risk of leak or accidental disclosure – having a central data store is a “privacy nightmare” with risks ranging from unsolicited access to unintended data leaks (for example, through model memorisation)
  3. Lack of trust – hacking incidents, data breaches and data leak scandals have eroded public trust
  4. Practical limitations – such as limited bandwidth or datasets too large to move

In general, more data means better models, but these challenges can limit or completely block data gathering.

The solution

Federated learning is an approach that makes it possible to train machine learning models on distributed data. Each node trains on its local, private dataset and contributes to a global model by sending only non-sensitive model updates to a central server. The data owner retains control over the data, which never leaves the node.

Figure 1. A schematic example of federated learning on mobile phones.

The main advantage of the federated approach is that it ensures privacy by design. Since no raw data leaves the local store or is otherwise exchanged, it eliminates the single point of failure that a central data store represents in terms of data breaches.

Using federated learning together with privacy-enhancing technologies also simplifies control and compliance. Secure aggregation, differential privacy and encryption provide mathematically provable privacy guarantees. As a consequence, the resulting analytics can be considered anonymised under legal standards such as GDPR and HIPAA.
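To make the differential privacy idea concrete, here is a minimal sketch (function name and parameters are illustrative, not any particular library's API): before a client's update is sent, its L2 norm is clipped and Gaussian noise is added, so the server never sees the exact contribution of any one client.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip the update's L2 norm, then add Gaussian noise.
    Clipping bounds any single client's influence on the global model;
    the noise masks individual contributions (the Gaussian-mechanism
    idea behind differentially private federated learning)."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(0.0, noise_std, size=update.shape)

raw = np.array([3.0, 4.0])       # raw client update, L2 norm = 5.0
private = privatize_update(raw)  # norm clipped to 1.0, plus noise
```

The clipping threshold and noise scale together determine the privacy budget; choosing them is a trade-off between privacy and model quality.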

The catch

Even though federated learning avoids moving data, it relies on frequent communication between nodes during the learning process. Bandwidth and latency limitations have led to the creation of specialised algorithms, such as Federated Stochastic Gradient Descent and its generalisation, the Federated Averaging algorithm. The key idea is to use the processing power in each decentralised node – a mobile or IoT device, or a local machine – to compute high-quality updates of the model weights. This is often complemented with compression of the model updates, which further reduces communication costs.
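The Federated Averaging loop above can be sketched in a few lines (a toy linear-regression setup with hypothetical data, not a production implementation): each client runs several local gradient steps on its private data, and the server averages the returned weights, weighted by local dataset size.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Client side: a few epochs of gradient descent on local, private
    data for a linear model. Only the resulting weights leave the node."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_averaging(client_weights, client_sizes):
    """Server side: average client models, weighted by dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two hypothetical clients holding disjoint slices of data for y = 2x.
rng = np.random.default_rng(0)
clients = []
for lo, hi in [(0, 1), (1, 2)]:
    X = rng.uniform(lo, hi, (20, 1))
    clients.append((X, X @ np.array([2.0])))  # targets follow y = 2x

global_w = np.zeros(1)
for _ in range(20):  # communication rounds
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = federated_averaging(updates, [len(y) for _, y in clients])

print(global_w)  # converges towards [2.0]
```

Running several local epochs between communication rounds is what cuts the number of rounds compared with sending a gradient after every step.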

There are several statistical challenges arising from the federated approach. They include

  • Differences between local datasets’ distributions (non-IID data)
  • Lack of global data makes it harder to detect certain unwanted biases
  • Loss of model updates due to node failures affecting the global model

The end

Even though the field of federated learning is young, privacy-enhancing technologies are rapidly evolving. Among the notable products are

  • TensorFlow Federated – federated learning on decentralised data
  • OpenMined – homomorphic encryption, differential privacy, and federated learning
  • DataFleets – privacy-preserving engine for rapid access, agile analytics, and automated compliance
  • Scaleout – full-stack data science and federated learning

Finally, let me quote Bart Willemsen, Vice President Analyst at Gartner:

… by 2023, 65% of the world’s population will have its personal information covered under modern privacy regulations, up from 10% today

… by 2023, more than 80% of companies worldwide will be facing at least one privacy-focused data protection regulation

In light of that, the promise of federated learning looks even more appealing.

Thanks for reading!

Author


Dmitry Semashkov

Senior data scientist in the Insights & Data practice