NVIDIA FLARE System Architecture

[Figure: FLARE Architecture Overview]

[Figure: FLARE Job Processing Architecture]

This document describes the overall system architecture of NVIDIA FLARE, including its layered structure, major subsystems, and how they interact. It covers the runtime components on both server and client sides, the communication framework, and the process model.

The FLARE architecture (shown above) comprises three main layers:

  • Foundation Layer - Communication infrastructure, messaging protocols, privacy preservation tools, and secure platform management.
  • Application Layer - Building blocks for federated learning, including federation workflows and learning algorithms.
  • Tooling - FL Simulator and POC CLI for experimentation and simulation, plus deployment and management tools for production workflows.

Core Components and Code Structure

Primary System Modules

Process Responsibilities

Server Parent (SP)

  • Runs FederatedServer
  • Manages client registration and heartbeat monitoring
  • Houses ServerEngine which orchestrates job scheduling via JobRunner
  • Spawns a Server Job (SJ) for each active job, as a process, Docker container, or pod, depending on the job launcher in use
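
The registration and heartbeat monitoring mentioned above can be pictured with a small stand-alone sketch. It is not FLARE code: the class name `ToyClientRegistry`, its methods, and the timeout rule are all hypothetical, chosen only to show the general idea of marking a client as lost when it has been silent for too long.

```python
from typing import Dict, List


class ToyClientRegistry:
    """Hypothetical registry: tracks when each registered client was last heard from."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def register(self, client: str, now: float) -> None:
        # Registration counts as the first sign of life.
        self._last_seen[client] = now

    def heartbeat(self, client: str, now: float) -> bool:
        """Record a heartbeat; return False if the client never registered."""
        if client not in self._last_seen:
            return False
        self._last_seen[client] = now
        return True

    def expired(self, now: float) -> List[str]:
        """Clients whose last heartbeat is older than the timeout."""
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Timestamps are passed in explicitly so the behavior is easy to reason about; a real monitor would read a clock and run the expiry check periodically.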

Server Job (SJ)

  • Runs ServerRunner
  • Executes workflow Controllers (e.g., ScatterAndGather)
  • Broadcasts tasks to client jobs and aggregates results
  • Separate process per job for isolation
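
As a rough illustration of the broadcast-and-aggregate pattern described above, the following sketch runs one round over in-process "clients". It does not use FLARE's Controller classes; the function names are hypothetical, and the sample-weighted average is an assumed aggregation rule chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, float]
# A client job is modeled as a function: global weights -> (updated weights, sample count).
ClientFn = Callable[[Weights], Tuple[Weights, int]]


def scatter_and_gather_round(global_weights: Weights, clients: List[ClientFn]) -> Weights:
    """Run one round: broadcast the task, collect the results, aggregate them."""
    # Scatter: every client receives its own copy of the current global weights.
    results = [client(dict(global_weights)) for client in clients]

    # Gather: sample-weighted average of the returned weights (assumed rule).
    total = sum(count for _, count in results)
    return {
        key: sum(weights[key] * count for weights, count in results) / total
        for key in global_weights
    }


def make_client(delta: float, num_samples: int) -> ClientFn:
    """Hypothetical client that 'trains' by shifting every weight by delta."""

    def train(weights: Weights) -> Tuple[Weights, int]:
        return {k: v + delta for k, v in weights.items()}, num_samples

    return train
```

With two clients that shift the weight by 1.0 (10 samples) and 3.0 (30 samples), a starting weight of 0.0 aggregates to 2.5, because the larger client contributes three times as much.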

Client Parent (CP)

  • Runs FederatedClient
  • Manages client registration with server
  • Houses ClientEngine which coordinates job execution
  • Spawns a Client Job (CJ) for each assigned job, as a process, Docker container, or pod, depending on the job launcher in use

Client Job (CJ)

  • Runs ClientRunner
  • Pulls tasks from server via Cell network
  • Launches training processes using JobExecutor
  • Routes task data to/from training process via Pipe

Training Process

  • User's ML training script
  • Uses Client API: flare.init(), flare.receive(), flare.send()
  • Communicates with CJ via FilePipe (file-based) or CellPipe (network-based)
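
A training script built on the three Client API calls listed above might be shaped like the sketch below. Only `flare.init()`, `flare.receive()`, and `flare.send()` come from this document; the `nvflare.client` import path, `flare.is_running()`, and `flare.FLModel` reflect our understanding of the Client API and should be checked against the installed NVFLARE version. `local_update` is a hypothetical placeholder for real training.

```python
from typing import Dict


def local_update(params: Dict[str, float], step: float = 0.1) -> Dict[str, float]:
    """Hypothetical stand-in for real training: nudge each parameter by `step`."""
    return {name: value + step for name, value in params.items()}


def main() -> None:
    # Imported inside main() so local_update can be tried without NVFLARE installed.
    import nvflare.client as flare

    flare.init()  # attach this training process to its Client Job (CJ)
    while flare.is_running():  # assumed loop condition, not named in this document
        task = flare.receive()  # task data routed from the CJ through the Pipe
        new_params = local_update(task.params)
        flare.send(flare.FLModel(params=new_params))  # result goes back to the CJ
```

The script would call `main()` when launched by the CJ. Real parameters are usually arrays or tensors rather than floats; the dictionary of floats keeps the sketch short.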

Communication Mechanisms

Cell Network: All parent and job processes communicate via F3 Cell objects that provide:

  • FQCN (Fully Qualified Cell Name) addressing (e.g., server.job_123)
  • Channel-based routing (SERVER_MAIN, CLIENT_MAIN, AUX_COMMUNICATION)
  • Secure, encrypted messaging with authentication
  • Streaming support for large data transfers
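
To make the FQCN addressing concrete, here is a minimal sketch of helpers that treat a name such as server.job_123 as a dot-separated path. These functions are hypothetical and are not FLARE's own FQCN utilities; the dot separator and the parent/child reading are inferred from the example above.

```python
from typing import List

SEPARATOR = "."  # inferred from the example "server.job_123"


def split_fqcn(fqcn: str) -> List[str]:
    """Split a name into its segments, rejecting empty segments."""
    parts = fqcn.split(SEPARATOR)
    if not all(parts):
        raise ValueError(f"invalid FQCN: {fqcn!r}")
    return parts


def root_of(fqcn: str) -> str:
    """First segment of the name, e.g. the site that owns the cell."""
    return split_fqcn(fqcn)[0]


def parent_of(fqcn: str) -> str:
    """Name of the enclosing cell, or an empty string for a root cell."""
    return SEPARATOR.join(split_fqcn(fqcn)[:-1])


def is_ancestor(ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is a strict prefix of `descendant`, segment by segment."""
    a, d = split_fqcn(ancestor), split_fqcn(descendant)
    return len(a) < len(d) and d[: len(a)] == a
```

Under these assumptions, `parent_of("server.job_123")` yields `"server"`, which matches the parent/job process relationship described earlier.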

Pipe Abstraction: Communication between the CJ and the training process goes through the Pipe interface, which has two implementations:

  • FilePipe: File system-based IPC for processes on the same machine
  • CellPipe: Network-based IPC that allows the training process to run on a different machine
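
The idea behind file-based IPC can be sketched with a toy pipe that exchanges JSON files through a shared directory. This is not FLARE's FilePipe: the class `ToyFilePipe`, its directory layout, and its message format are hypothetical and only illustrate how two same-machine processes can pass messages through the file system.

```python
import json
import os
import uuid
from pathlib import Path
from typing import Optional


class ToyFilePipe:
    """Hypothetical two-way pipe: each side writes to its outbox and reads its inbox."""

    def __init__(self, root: Path, me: str, peer: str) -> None:
        self.outbox = root / f"{me}_to_{peer}"
        self.inbox = root / f"{peer}_to_{me}"
        self.outbox.mkdir(parents=True, exist_ok=True)
        self.inbox.mkdir(parents=True, exist_ok=True)
        self._seq = 0

    def send(self, topic: str, data: dict) -> None:
        self._seq += 1
        # The sequence prefix keeps messages ordered; the uuid avoids name clashes.
        name = f"{self._seq:08d}_{uuid.uuid4().hex}.json"
        tmp = self.outbox / (name + ".tmp")
        tmp.write_text(json.dumps({"topic": topic, "data": data}))
        # Rename only after the write completes, so a reader never sees a partial file.
        os.replace(tmp, self.outbox / name)

    def receive(self) -> Optional[dict]:
        """Return the oldest pending message, or None if the inbox is empty."""
        files = sorted(self.inbox.glob("*.json"))
        if not files:
            return None
        message = json.loads(files[0].read_text())
        files[0].unlink()
        return message
```

Creating `ToyFilePipe(root, "cj", "trainer")` on one side and `ToyFilePipe(root, "trainer", "cj")` on the other gives each side the other's outbox as its inbox. A network-based pipe would keep the same send/receive shape and replace the directory with a connection.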

Deployment Modes

NVFLARE provides three deployment modes that share the same core runtime but differ in packaging, security, and deployment complexity. This design ensures consistency from development to production.

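Deployment Modes Comparison

==========  ===============================================  ============  =============================================================  =========================
Mode        Use Case                                         Security      Processes                                                      Setup Time
==========  ===============================================  ============  =============================================================  =========================
Simulator   Rapid prototyping, algorithm testing             None          Multiple threads; may create multiple processes in some cases  Seconds
POC         Local multi-client testing, workflow validation  Optional      Multiple processes on one machine                              Minutes
Production  Real-world deployment                            Full PKI/TLS  Distributed processes across machines                          Hours (with provisioning)
==========  ===============================================  ============  =============================================================  =========================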

Core FL Runtime

The Core FL Runtime is the execution engine that manages federated learning job processes and orchestration. This page documents the runtime components responsible for process lifecycle management, task coordination, and execution modes.

Scope and Components

The Core FL Runtime consists of:

  • ServerEngine : Server-side process orchestration and job lifecycle management
  • ClientEngine : Client-side process management and communication handling
  • JobRunner : Job scheduling, deployment, and monitoring
  • SimulatorRunner : Single-machine simulation for development

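Process Types

============  =========================  ==========================================
Process Type  Code Symbol                Description
============  =========================  ==========================================
SP            ProcessType.SERVER_PARENT  Server parent process running ServerEngine
SJ            ProcessType.SERVER_JOB     Server job process running ServerRunner
CP            ProcessType.CLIENT_PARENT  Client parent process running ClientEngine
CJ            ProcessType.CLIENT_JOB     Client job process running ClientRunner
============  =========================  ==========================================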

Inter-Process Communication

The runtime uses Cell-based communication between parent and job processes.

Cell Communication Channels

==================================  ============================  =============
Channel                             Purpose                       Used By
==================================  ============================  =============
CellChannel.SERVER_MAIN             Client-to-server FL messages  CP to SP
CellChannel.CLIENT_MAIN             Server-to-client FL messages  SP to CP
CellChannel.SERVER_COMMAND          Commands to server job        SP to SJ
CellChannel.CLIENT_COMMAND          Commands to client job        CP to CJ
CellChannel.SERVER_PARENT_LISTENER  Parent commands from SJ       SJ to SP
CellChannel.AUX_COMMUNICATION       Auxiliary messages            All processes
==================================  ============================  =============
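
Channel-based routing can be pictured as a lookup from a (channel, topic) pair to a handler. The sketch below is an in-memory toy, not the Cell API: `ToyRouter`, the plain-string channel names, and the "abort" topic are hypothetical, and real cells add addressing, security, and network transport on top of this idea.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[dict], dict]


class ToyRouter:
    """Hypothetical dispatcher: one handler per (channel, topic) pair."""

    def __init__(self) -> None:
        self._handlers: Dict[Tuple[str, str], Handler] = {}

    def register(self, channel: str, topic: str, handler: Handler) -> None:
        self._handlers[(channel, topic)] = handler

    def dispatch(self, channel: str, topic: str, payload: dict) -> dict:
        handler = self._handlers.get((channel, topic))
        if handler is None:
            # Unknown routes produce an error reply instead of raising.
            return {"status": "error", "reason": f"no handler for {channel}/{topic}"}
        return {"status": "ok", "body": handler(payload)}


# A server job might listen for commands sent by its parent on the command channel.
router = ToyRouter()
router.register("SERVER_COMMAND", "abort", lambda payload: {"aborted": payload["job_id"]})
```

Keeping the channel in the lookup key means the same topic name can be handled differently on, for example, the command channel and the auxiliary channel.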

JobRunner Architecture

JobRunner Component Structure

[Figure: FLARE Job Runner Architecture]

Communication Framework

Purpose and Scope

The Communication Framework, also known as F3 (FLARE Foundation Framework) and Cellnet, provides the foundational messaging infrastructure for all communication in NVIDIA FLARE. It implements a secure, scalable, and feature-rich messaging layer that handles all interactions between servers, clients, and administrative components.

This section provides an overview of the communication framework architecture, core components, and basic concepts.

  • CellNet Architecture - Detailed architecture and design patterns
  • Cell Communication Patterns - Message sending patterns and channel routing
  • Streaming and Data Transfer - Large data transfer and streaming protocols
  • Security and Encryption - Certificate management and message encryption

For more details, please refer to :ref:`cellnet_architecture`.

Security Architecture

Please refer to :ref:`flare_security_overview` for the security architecture.