GitHub - QuickerStudio/Advanced-Distributed-Training-System

# AdvancedDistributedTrainingSystem

## Overview
**AdvancedDistributedTrainingSystem** is a high-performance distributed framework for training large-scale deep learning models. It employs advanced techniques like data/model parallelism, storage optimization with DeepSpeed's ZeRO, and communication enhancements.

## Features
- **Data Parallelism**: Utilizes PyTorch's DistributedDataParallel (DDP) for efficient data parallel training.
- **Model Parallelism**: Supports tensor parallelism and pipeline parallelism for handling larger models and speeding up training.
- **Mixed Precision Training**: Uses PyTorch's automatic mixed precision (AMP) to reduce memory usage and increase computational efficiency.
- **Storage Optimization**: Implements DeepSpeed's ZeRO (Zero Redundancy Optimizer) to minimize memory footprint and support large model training.
- **Communication Optimization**: Includes gradient compression and communication scheduling to reduce communication overhead.
- **Secure Communication**: Uses gRPC with SSL/TLS for secure and encrypted communication between distributed nodes.

## Project Structure
```plaintext
AdvancedDistributedTrainingSystem/
│
├── algorithm_library.py
│
├── data_collector.py
│
├── ai_resource_manager.py
│
├── grpc_communication.py
│
├── storage_manager.py
│
├── train.py
│
├── distributed_setup.py
│
├── model.py
│
├── optimizer.py
│
├── data_loader.py
│
├── communication.py
│
├── main.py
│
├── storage.py
│
├── protocol_buffer.proto
│
├── tests/
│   ├── test_distributed_computing.py
│   ├── test_search_engine_algorithm.py
│   ├── test_crawler_technology.py
│   ├── test_index_management.py
│   ├── test_nlp.py
│   ├── test_machine_learning.py
│   ├── test_caching_and_storage.py
│   ├── test_load_balancing.py
│   ├── test_user_behavior_analysis.py
│   ├── test_ai_resource_manager.py
│   ├── test_grpc_communication.py
│   ├── test_storage_manager.py
│   ├── test_train.py
│   ├── test_distributed_setup.py
│   ├── test_model.py
│   ├── test_optimizer.py
│   ├── test_data_loader.py
│   ├── test_communication.py
│
├── requirements.txt
└── README.md

distributed_setup.py: Sets up and cleans up the distributed training environment.
model.py: Defines the model architecture, including basic fully connected layers and pipeline parallel model.
optimizer.py: Initializes the optimizer with DeepSpeed's ZeRO for storage optimization.
data_loader.py: Creates distributed data loader for training data.
communication.py: Defines communication optimization methods, such as gradient compression.
grpc_communication.py: Implements secure communication using gRPC with SSL/TLS.
distributed.proto: Defines the gRPC service and message structures.
train.py: Main training script integrating model, data loading, optimizer, and communication optimization.
main.py: Entry point script that launches training using multiple processes.
build_exe.py: Script to build the project into an EXE file using PyInstaller.

Getting Started

Prerequisites

PyTorch
DeepSpeed
CUDA
Python 3.x
gRPC and cryptography libraries

Installation

Clone the repository:

git clone https://github.com/your_username/AdvancedDistributedTrainingSystem.git
cd AdvancedDistributedTrainingSystem

Install the required packages:

pip install torch deepspeed grpcio grpcio-tools cryptography

Running the Training

Set up environment variables for distributed training:

export MASTER_ADDR=localhost
export MASTER_PORT=12355

Generate gRPC code from the .proto file:

python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. distributed.proto

Run the main script:
```
python main.py
```

Usage Instructions

Environment Setup

Install Dependencies:

pip install requests kafka-python torch PyInstaller

Configure Kafka Cluster: Please refer to the Kafka Official Documentation for setting up the Kafka cluster.

Running the Project

Start Kafka Services:

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

Run Data Collection and Training:
```
python main.py
```
Package the Project into an EXE File:
```
python build_exe.py
```

Building the EXE File

Install PyInstaller:
```
pip install pyinstaller
```
Run the build script:
```
python build_exe.py
```

This will create a single EXE file for the project, which can be distributed and run on other machines.

Example Usage

DataCollector: The DataCollector class is responsible for distributed data collection using Kafka and APIs.

import threading
import requests
from kafka import KafkaProducer, KafkaConsumer
from kafka.errors import KafkaError
from typing import List, Dict
import json

class DataCollector:
    def __init__(self, api_key: str, kafka_servers: List[str]):
        self.api_key = api_key
        self.base_url = "https://api.example.com/data"
        self.producer = KafkaProducer(bootstrap_servers=kafka_servers, value_serializer=lambda v: json.dumps(v).encode('utf-8'))
        self.consumer = KafkaConsumer('data-topic', bootstrap_servers=kafka_servers, value_deserializer=lambda m: json.loads(m.decode('utf-8')))
        self.data_lock = threading.Lock()

    def collect_data_from_api(self, data_type: str) -> List[Dict]:
        headers = {"Authorization": f"Bearer {self.api_key}"}
        params = {"type": data_type}
        response = requests.get(self.base_url, headers=headers, params=params)
        if response.status_code == 200:
            return response.json()
        else:
            response.raise_for_status()

    def produce_data(self, data_type: str):
        data = self.collect_data_from_api(data_type)
        for item in data:
            self.producer.send('data-topic', item)

    def collect_data_threaded(self, data_types: List[str]):
        threads = []
        for data_type in data_types:
            thread = threading.Thread(target=self.produce_data, args=(data_type,))
            threads.append(thread)
            thread.start()

        for thread in threads:
            thread.join()

    def classify_data(self) -> Dict[str, List[Dict]]:
        classified_data = {"text": [], "image": [], "audio": [], "video": [], "code": []}
        for message in self.consumer:
            item = message.value
            if item["type"] == "text":
                classified_data["text"].append(item)
            elif item["type"] == "image":
                classified_data["image"].append(item)
            elif item["type"] == "audio":
                classified_data["audio"].append(item)
            elif item["type"] == "video":
                classified_data["video"].append(item)
            elif item["type"] == "code":
                classified_data["code"].append(item)
        return classified_data

    def store_data(self, classified_data: Dict):
        # Add your storage logic here (e.g., Ceph or DAOS)
        pass

# Example usage
if __name__ == "__main__":
    api_key = "your_api_key_here"
    kafka_servers = ["localhost:9092"]
    data_types = ["text", "image", "audio", "video", "code"]
    collector = DataCollector(api_key, kafka_servers)
    collector.collect_data_threaded(data_types)
    classified_data = collector.classify_data()
    collector.store_data(classified_data)

Distributed Training: The main.py script initializes the distributed training process and starts the gRPC server and client.

import torch.multiprocessing as mp
from train import train
import grpc_communication
from data_collector import DataCollector

if __name__ == '__main__':
    api_key = "your_api_key_here"
    kafka_servers = ["localhost:9092"]
    data_types = ["text", "image", "audio", "video", "code"]
    collector = DataCollector(api_key, kafka_servers)
    collector.collect_data_threaded(data_types)
    classified_data = collector.classify_data()
    collector.store_data(classified_data)

    world_size = 4
    dataset = torch.utils.data.TensorDataset(torch.randn(1000, 128), torch.randn(1000, 128))
    mp.spawn(train, args=(world_size, dataset), nprocs=world_size, join=True)
    
    # Start the gRPC server
    grpc_communication.serve()
    
    # Run the gRPC client
    grpc_communication.run_client()

Training

Setup: The distributed_setup.py script is responsible for setting up and cleaning up the distributed training environment.

import torch
import torch.distributed as dist

def setup(rank, world_size):
    dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    print(f"Setup completed for rank {rank} in world size {world_size}.")

def cleanup():
    dist.destroy_process_group()
    print("Cleanup completed.")

Model: The model.py script defines the model architecture.

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layer1 = nn.Linear(128, 256)
        self.layer2 = nn.Linear(256, 512)
        self.layer3 = nn.Linear(512, 256)
        self.layer4 = nn.Linear(256, 128)

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        return x

class MyPipeModel(nn.Module):
    def __init__(self):
        super(MyPipeModel, self).__init__()
        self.layer1 = nn.Linear(128, 256)
        self.layer2 = nn.Linear(256, 512)
        self.layer3 = nn.Linear(512, 256)
        self.layer4 = nn.Linear(256, 128)

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        return x

Optimizer: The optimizer.py script initializes the optimizer and uses DeepSpeed's ZeRO for memory optimization.

import deepspeed
import torch.optim as optim

def get_optimizer(model):
    optimizer = deepspeed.zero.Init(
        model.parameters(),
        optimizer=optim.Adam(model.parameters(), lr=0.001)
    )
    return optimizer

DataLoader: The data_loader.py script creates a distributed data loader.

import torch

def get_dataloader(dataset, batch_size, num_workers, rank, world_size):
    sampler = torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
        sampler=sampler
    )
    return dataloader

Communication: The communication.py script defines communication optimization methods, such as gradient compression.

import torch.distributed as dist

def compressed_allreduce(tensor, world_size):
    tensor_to_send = tensor / world_size
    dist.all_reduce(tensor_to_send, op=dist.ReduceOp.SUM)
    return tensor_to_send

gRPC Communication: 接下来继续。

gRPC Communication: The grpc_communication.py script defines the gRPC server and client for distributed communication.

import grpc
from concurrent import futures
import distributed_pb2
import distributed_pb2_grpc

class DataService(distributed_pb2_grpc.DataServiceServicer):
    def SendData(self, request, context):
        response = distributed_pb2.DataResponse()
        response.message = "Data received successfully"
        return response

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    distributed_pb2_grpc.add_DataServiceServicer_to_server(DataService(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    print("Server started at port 50051")
    server.wait_for_termination()

def run_client():
    with grpc.insecure_channel('localhost:50051') as channel:
        stub = distributed_pb2_grpc.DataServiceStub(channel)
        response = stub.SendData(distributed_pb2.DataRequest(data="Sample data"))
        print("Client received: " + response.message)

Training: The train.py script is responsible for training the model in a distributed manner.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from distributed_setup import setup, cleanup
from model import MyModel
from optimizer import get_optimizer
from data_loader import get_dataloader
from communication import compressed_allreduce

def train(rank, world_size, dataset):
    setup(rank, world_size)

    model = MyModel().to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = get_optimizer(model)
    criterion = nn.MSELoss()
    dataloader = get_dataloader(dataset, batch_size=32, num_workers=4, rank=rank, world_size=world_size)

    for epoch in range(10):
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(rank), targets.to(rank)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            compressed_allreduce(loss, world_size)

        print(f"Rank {rank}, Epoch {epoch}, Loss: {loss.item()}")

    cleanup()

Contribution Guide

We welcome contributions through issues and pull requests. For detailed information, please refer to the Contribution Guide.

Contact Information

If you have any questions or suggestions, please contact us through GitHub Issues: https://github.com/QuickerStudio/AdvancedDistributedTrainingSystem/issues


## Contribution Guide
We welcome contributions through issues and pull requests. For detailed information, please refer to the [Contribution Guide](CONTRIBUTING.md).

## Contributing
Contributions are welcome! Please feel free to submit a Pull Request or open an Issue.

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Acknowledgements
Special thanks to the PyTorch and DeepSpeed teams for their excellent libraries and tools.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Getting Started

Prerequisites

Installation

Running the Training

Usage Instructions

Environment Setup

Running the Project

Building the EXE File

Example Usage

Training

Contribution Guide

Contact Information

About

Releases

Packages

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.github/workflows		.github/workflows
Test		Test
README.md		README.md
ai_resource_manager.py		ai_resource_manager.py
algorithm_library.py		algorithm_library.py
build_exe.py		build_exe.py
communication.py		communication.py
data_collector.py		data_collector.py
data_loader.py		data_loader.py
distributed.proto		distributed.proto
distributed_setup.py		distributed_setup.py
grpc_communication.py		grpc_communication.py
main.py		main.py
model.py		model.py
optimizer.py		optimizer.py
protocol_buffer.proto		protocol_buffer.proto
storage.py		storage.py
train.py		train.py

QuickerStudio/Advanced-Distributed-Training-System

Folders and files

Latest commit

History

Repository files navigation

Getting Started

Prerequisites

Installation

Running the Training

Usage Instructions

Environment Setup

Running the Project

Building the EXE File

Example Usage

Training

Contribution Guide

Contact Information

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages