Raft Protocol for Fault Tolerance and Self-Recovery in Federated Learning

Dautov, Rustem; Husom, Erik Johannes

dc.contributor.author	Dautov, Rustem
dc.contributor.author	Husom, Erik Johannes
dc.date.accessioned	2024-08-07T13:38:51Z
dc.date.available	2024-08-07T13:38:51Z
dc.date.created	2024-07-08T14:43:06Z
dc.date.issued	2024
dc.identifier.citation	SEAMS '24: Proceedings of the 19th International Symposium on Software Engineering for Adaptive and Self-Managing Systems. 2024, 110-121.	en_US
dc.identifier.isbn	979-8-4007-0585-4
dc.identifier.uri	https://hdl.handle.net/11250/3145149
dc.description.abstract	Federated Learning (FL) has emerged as a decentralised machine learning paradigm for distributed systems, particularly in edge and IoT environments. However, ensuring fault tolerance and self-recovery in such scenarios remains challenging, because centralised model aggregation acts as a single point of failure. A possible solution to this challenge relies on continuously replicating the global FL state across participating nodes and on making every node functionally able to replace the aggregator in case of failure. These functional requirements can be implemented using an existing distributed consensus algorithm, such as Raft. Our approach utilises Raft's leader election and log replication mechanisms to enable automatic stateful recovery after failures and thus to improve fault tolerance. The log replication process efficiently maintains consistency and coherence across distributed FL nodes, ensuring an uninterrupted training process and model convergence. This enhances the robustness of the overall FL system, especially in dynamic and unreliable cyber-physical conditions. To demonstrate the viability of our approach, we present a proof-of-concept implementation based on the existing FL framework Flower. We conduct a series of experiments to measure the aggregator re-election time and the traffic overheads associated with state replication. Although the traffic overheads grow, as expected, with the number of FL nodes, the results demonstrate a resilient self-recovering system capable of withstanding node failures while maintaining model consistency.	en_US
dc.language.iso	eng	en_US
dc.publisher	Association for Computing Machinery (ACM)	en_US
dc.relation.ispartof	Proceedings of the 19th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (SEAMS'24)
dc.rights	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/deed.no	*
dc.title	Raft Protocol for Fault Tolerance and Self-Recovery in Federated Learning	en_US
dc.title.alternative	Raft Protocol for Fault Tolerance and Self-Recovery in Federated Learning	en_US
dc.type	Chapter	en_US
dc.description.version	publishedVersion	en_US
dc.rights.holder	© 2024 Copyright held by the owner/author(s)	en_US
dc.source.pagenumber	110-121	en_US
dc.identifier.doi	10.1145/3643915.3644093
dc.identifier.cristin	2281630
dc.relation.project	EU – Horizon Europe (EC/HEU): 101120657	en_US
dc.relation.project	Research Council of Norway: 309700	en_US
dc.relation.project	EU – Horizon Europe (EC/HEU): 101135576	en_US
dc.relation.project	EC/H2020/101020416	en_US
dc.relation.project	EU – Horizon Europe (EC/HEU): 101095634	en_US
cristin.ispublished	true
cristin.fulltext	postprint

Associated file(s)

Filename: Dautov_2024_Raft_protocol_for_ ...
Size: 972.8Kb
Format: PDF

This item appears in the following collection(s)

Publications from CRIStin - SINTEF AS [5850]
SINTEF Digital [2523]

Except where otherwise noted, this item is licensed under Attribution 4.0 International