FedML Octopus Example with MNIST + Logistic Regression

This example illustrates how to do real-world cross-silo federated learning with FedML Octopus. The source code locates at https://github.com/FedML-AI/FedML/tree/master/python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example在新窗口打开.

One line API

python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example/one_line

The highly encapsulated server and client API calls are shown as below:

run_server.sh is as follows:

#!/usr/bin/env bash

python3 server/torch_server.py --cf config/fedml_config.yaml --rank 0

server/torch_server.py

import fedml


if __name__ == "__main__":
    fedml.run_cross_silo_server()

run_client.sh

#!/usr/bin/env bash
RANK=$1
python3 torch_client.py --cf config/fedml_config.yaml --rank $RANK

client/torch_client.py

import fedml

if __name__ == "__main__":
    fedml.run_cross_silo_client()

At the client side, the client ID (a.k.a rank) starts from 1. Please also modify config/fedml_config.yaml, changing the worker_num the as the number of clients you plan to run.

At the server side, run the following script:

bash run_server.sh

For client 1, run the following script:

bash run_client.sh 1

For client 2, run the following script:

bash run_client.sh 2

config/fedml_config.yaml is shown below.

Here common_args.training_type is "cross_silo" type, and train_args.client_id_list needs to correspond to the client id in the client command line.

common_args:
  training_type: "cross_silo"
  random_seed: 0

data_args:
  dataset: "mnist"
  data_cache_dir: "./../../../data"
  partition_method: "hetero"
  partition_alpha: 0.5

model_args:
  model: "lr"
  model_file_cache_folder: "./model_file_cache" # will be filled by the server automatically
  global_model_file_path: "./model_file_cache/global_model.pt"

train_args:
  federated_optimizer: "FedAvg"
  client_id_list: 
  client_num_in_total: 1000
  client_num_per_round: 2
  comm_round: 50
  epochs: 1
  batch_size: 10
  client_optimizer: sgd
  learning_rate: 0.03
  weight_decay: 0.001

validation_args:
  frequency_of_the_test: 5

device_args:
  worker_num: 2
  using_gpu: false
  gpu_mapping_file: config/gpu_mapping.yaml
  gpu_mapping_key: mapping_default

comm_args:
  backend: "MQTT_S3"
  mqtt_config_path: config/mqtt_config.yaml
  s3_config_path: config/s3_config.yaml
  # If you want to use your customized MQTT or s3 server as training backends, you should uncomment and set the following lines.
  #customized_training_mqtt_config: {'BROKER_HOST': 'your mqtt server address or domain name', 'MQTT_PWD': 'your mqtt password', 'BROKER_PORT': 1883, 'MQTT_KEEPALIVE': 180, 'MQTT_USER': 'your mqtt user'}
  #customized_training_s3_config: {'CN_S3_SAK': 'your s3 aws_secret_access_key', 'CN_REGION_NAME': 'your s3 region name', 'CN_S3_AKI': 'your s3 aws_access_key_id', 'BUCKET_NAME': 'your s3 bucket name'}

tracking_args:
  log_file_dir: ./log
  enable_wandb: false
  wandb_key: ee0b5f53d949c84cee7decbe7a629e63fb2f8408
  wandb_project: fedml
  wandb_name: fedml_torch_fedavg_mnist_lr

Training Results

At the end of the 50th training round, the server window will see the following output:

FedML-Server(0) @device-id-0 - Wed, 27 Apr 2022 03:38:28 fedml_aggregator.py[line:199] INFO ################test_on_server_for_all_clients : 49
FedML-Server(0) @device-id-0 - Wed, 27 Apr 2022 03:38:38 fedml_aggregator.py[line:225] INFO {'training_acc': 0.7155714841722886, 'training_loss': 1.8997359397010631}
FedML-Server(0) @device-id-0 - Wed, 27 Apr 2022 03:38:38 mlops_metrics.py[line:67] INFO report_server_training_metric. message_json = {'run_id': '0', 'round_idx': 49, 'timestamp': 1651030718.803107, 'accuracy': 0.7156, 'loss': 1.8997}
FedML-Server(0) @device-id-0 - Wed, 27 Apr 2022 03:38:40 fedml_aggregator.py[line:262] INFO {'test_acc': 0.717948717948718, 'test_loss': 1.8972983557921448}
FedML-Server(0) @device-id-0 - Wed, 27 Apr 2022 03:38:40 mlops_metrics.py[line:74] INFO report_server_training_round_info. message_json = {'run_id': '0', 'round_index': 49, 'total_rounds': 50, 'running_time': 930.5741}
...

At the end of the 50th training round, the client1 window will see the following output:

FedML-Client(1) @device-id-1 - Wed, 27 Apr 2022 03:38:45 fedml_client_manager.py[line:145] INFO #######training########### round_id = 50
[2022-04-27 03:38:45,591] [INFO] [my_model_trainer_classification.py:56:train] Update Epoch: 0 [10/20 (50%)]    Loss: 1.984373
[2022-04-27 03:38:45,602] [INFO] [my_model_trainer_classification.py:56:train] Update Epoch: 0 [20/20 (100%)]   Loss: 2.053363
[2022-04-27 03:38:45,602] [INFO] [my_model_trainer_classification.py:63:train] Client Index = 1 Epoch: 0        Loss: 2.018868
FedML-Client(1) @device-id-1 - Wed, 27 Apr 2022 03:38:45 client_manager.py[line:107] INFO Sending message (type 3) to server
FedML-Client(1) @device-id-1 - Wed, 27 Apr 2022 03:38:45 mqtt_s3_multi_clients_comm_manager.py[line:240] INFO mqtt_s3.send_message: starting...
FedML-Client(1) @device-id-1 - Wed, 27 Apr 2022 03:38:45 mqtt_s3_multi_clients_comm_manager.py[line:296] INFO mqtt_s3.send_message: S3+MQTT msg sent, message_key = fedml_0_1_06180ace-1d4b-445c-b3d7-4ae765659471
FedML-Client(1) @device-id-1 - Wed, 27 Apr 2022 03:38:45 mqtt_s3_multi_clients_comm_manager.py[line:306] INFO mqtt_s3.send_message: to python client.
FedML-Client(1) @device-id-1 - Wed, 27 Apr 2022 03:38:47 mlops_metrics.py[line:81] INFO report_client_model_info. message_json = {'run_id': '0', 'edge_id': 1, 'round_idx': 51, 'client_model_s3_address': '...'}
...

At the end of the 50th training round, the client2 window will see the following output:

FedML-Client(2) @device-id-1 - Wed, 27 Apr 2022 03:38:58 fedml_client_manager.py[line:145] INFO #######training########### round_id = 50
[2022-04-27 03:38:58,128] [INFO] [my_model_trainer_classification.py:56:train] Update Epoch: 0 [10/20 (50%)]    Loss: 1.984373
[2022-04-27 03:38:58,137] [INFO] [my_model_trainer_classification.py:56:train] Update Epoch: 0 [20/20 (100%)]   Loss: 2.053363
[2022-04-27 03:38:58,137] [INFO] [my_model_trainer_classification.py:63:train] Client Index = 2 Epoch: 0        Loss: 2.018868
FedML-Client(2) @device-id-1 - Wed, 27 Apr 2022 03:38:58 client_manager.py[line:107] INFO Sending message (type 3) to server
FedML-Client(2) @device-id-1 - Wed, 27 Apr 2022 03:38:58 mqtt_s3_multi_clients_comm_manager.py[line:240] INFO mqtt_s3.send_message: starting...
FedML-Client(2) @device-id-1 - Wed, 27 Apr 2022 03:38:58 mqtt_s3_multi_clients_comm_manager.py[line:296] INFO mqtt_s3.send_message: S3+MQTT msg sent, message_key = fedml_0_2_a33a50ad-c289-41b8-a925-9205eea272f2
FedML-Client(2) @device-id-1 - Wed, 27 Apr 2022 03:38:58 mqtt_s3_multi_clients_comm_manager.py[line:306] INFO mqtt_s3.send_message: to python client.
FedML-Client(2) @device-id-1 - Wed, 27 Apr 2022 03:38:58 mlops_metrics.py[line:81] INFO report_client_model_info. message_json = {'run_id': '0', 'edge_id': 2, 'round_idx': 51, 'client_model_s3_address': '...'}
...

Five lines of APIs

torch_client.py

import fedml
from fedml.cross_silo import Client

if __name__ == "__main__":
    args = fedml.init()

    # init device
    device = fedml.device.get_device(args)

    # load data
    dataset, output_dim = fedml.data.load(args)

    # load model
    model = fedml.model.create(args, output_dim)

    # start training
    client = Client(args, device, dataset, model)
    client.run()

Custom data and model

The custom data and model example locates at the following folder:

python/examples/cross_silo/mqtt_s3_fedavg_mnist_lr_example/custum_data_and_model

def load_data(args):
    download_mnist(args.data_cache_dir)
    fedml.logger.info("load_data. dataset_name = %s" % args.dataset)

    """
    Please read through the data loader at to see how to customize the dataset for FedML framework.
    """
    (
        client_num,
        train_data_num,
        test_data_num,
        train_data_global,
        test_data_global,
        train_data_local_num_dict,
        train_data_local_dict,
        test_data_local_dict,
        class_num,
    ) = load_partition_data_mnist(
        args.batch_size,
        train_path=args.data_cache_dir + "MNIST/train",
        test_path=args.data_cache_dir + "MNIST/test",
    )
    """
    For shallow NN or linear models, 
    we uniformly sample a fraction of clients each round (as the original FedAvg paper)
    """
    args.client_num_in_total = client_num
    dataset = [
        train_data_num,
        test_data_num,
        train_data_global,
        test_data_global,
        train_data_local_num_dict,
        train_data_local_dict,
        test_data_local_dict,
        class_num,
    ]
    return dataset, class_num


class LogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LogisticRegression, self).__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        outputs = torch.sigmoid(self.linear(x))
        return outputs

if __name__ == "__main__":
    # init FedML framework
    args = fedml.init()

    # init device
    device = fedml.device.get_device(args)

    # load data
    dataset, output_dim = load_data(args)

    # load model (the size of MNIST image is 28 x 28)
    model = LogisticRegression(28 * 28, output_dim)

    # start training
    client = Client(args, device, dataset, model)
    client.run()

A Better User-experience with FedML MLOps (open.fedml.ai)

To reduce the difficulty and complexity of these CLI commands. We recommend you to use our MLOps (open.fedml.ai). FedML MLOps provides:

Install Client Agent and Login
Inviting Collaborators and group management
Project Management
Experiment Tracking (visualizing training results)
monitoring device status
visualizing system performance (including profiling flow chart)
distributed logging
model serving

For more details, please refer to MLOps User Guide.

# FedML Octopus Example with MNIST + Logistic Regression

# One line API

# Training Results

# Five lines of APIs