Skip to content

QuickerStudio/Advanced-Distributed-Training-System

Repository files navigation

# AdvancedDistributedTrainingSystem

## Overview
**AdvancedDistributedTrainingSystem** is a high-performance distributed framework for training large-scale deep learning models. It employs advanced techniques like data/model parallelism, storage optimization with DeepSpeed's ZeRO, and communication enhancements.

## Features
- **Data Parallelism**: Utilizes PyTorch's DistributedDataParallel (DDP) for efficient data parallel training.
- **Model Parallelism**: Supports tensor parallelism and pipeline parallelism for handling larger models and speeding up training.
- **Mixed Precision Training**: Uses PyTorch's automatic mixed precision (AMP) to reduce memory usage and increase computational efficiency.
- **Storage Optimization**: Implements DeepSpeed's ZeRO (Zero Redundancy Optimizer) to minimize memory footprint and support large model training.
- **Communication Optimization**: Includes gradient compression and communication scheduling to reduce communication overhead.
- **Secure Communication**: Uses gRPC with SSL/TLS for secure and encrypted communication between distributed nodes.

## Project Structure
```plaintext
AdvancedDistributedTrainingSystem/

├── algorithm_library.py

├── data_collector.py

├── ai_resource_manager.py

├── grpc_communication.py

├── storage_manager.py

├── train.py

├── distributed_setup.py

├── model.py

├── optimizer.py

├── data_loader.py

├── communication.py

├── main.py

├── storage.py

├── protocol_buffer.proto

├── tests/
│   ├── test_distributed_computing.py
│   ├── test_search_engine_algorithm.py
│   ├── test_crawler_technology.py
│   ├── test_index_management.py
│   ├── test_nlp.py
│   ├── test_machine_learning.py
│   ├── test_caching_and_storage.py
│   ├── test_load_balancing.py
│   ├── test_user_behavior_analysis.py
│   ├── test_ai_resource_manager.py
│   ├── test_grpc_communication.py
│   ├── test_storage_manager.py
│   ├── test_train.py
│   ├── test_distributed_setup.py
│   ├── test_model.py
│   ├── test_optimizer.py
│   ├── test_data_loader.py
│   ├── test_communication.py

├── requirements.txt
└── README.md
  • distributed_setup.py: Sets up and cleans up the distributed training environment.
  • model.py: Defines the model architecture, including basic fully connected layers and pipeline parallel model.
  • optimizer.py: Initializes the optimizer with DeepSpeed's ZeRO for storage optimization.
  • data_loader.py: Creates distributed data loader for training data.
  • communication.py: Defines communication optimization methods, such as gradient compression.
  • grpc_communication.py: Implements secure communication using gRPC with SSL/TLS.
  • distributed.proto: Defines the gRPC service and message structures.
  • train.py: Main training script integrating model, data loading, optimizer, and communication optimization.
  • main.py: Entry point script that launches training using multiple processes.
  • build_exe.py: Script to build the project into an EXE file using PyInstaller.

Getting Started

Prerequisites

  • PyTorch
  • DeepSpeed
  • CUDA
  • Python 3.x
  • gRPC and cryptography libraries

Installation

  1. Clone the repository:

    git clone https://github.com/your_username/AdvancedDistributedTrainingSystem.git
    cd AdvancedDistributedTrainingSystem
  2. Install the required packages:

    pip install torch deepspeed grpcio grpcio-tools cryptography

Running the Training

  1. Set up environment variables for distributed training:

    export MASTER_ADDR=localhost
    export MASTER_PORT=12355
  2. Generate gRPC code from the .proto file:

    python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. distributed.proto
  3. Run the main script:

    python main.py

Usage Instructions

Environment Setup

  1. Install Dependencies:

    pip install requests kafka-python torch PyInstaller
  2. Configure Kafka Cluster: Please refer to the Kafka Official Documentation for setting up the Kafka cluster.

Running the Project

  1. Start Kafka Services:

    bin/zookeeper-server-start.sh config/zookeeper.properties
    bin/kafka-server-start.sh config/server.properties
  2. Run Data Collection and Training:

    python main.py
  3. Package the Project into an EXE File:

    python build_exe.py

Building the EXE File

  1. Install PyInstaller:

    pip install pyinstaller
  2. Run the build script:

    python build_exe.py

This will create a single EXE file for the project, which can be distributed and run on other machines.

Example Usage

  1. DataCollector: The DataCollector class is responsible for distributed data collection using Kafka and APIs.

    import threading
    import requests
    from kafka import KafkaProducer, KafkaConsumer
    from kafka.errors import KafkaError
    from typing import List, Dict
    import json
    
    class DataCollector:
        def __init__(self, api_key: str, kafka_servers: List[str]):
            self.api_key = api_key
            self.base_url = "https://api.example.com/data"
            self.producer = KafkaProducer(bootstrap_servers=kafka_servers, value_serializer=lambda v: json.dumps(v).encode('utf-8'))
            self.consumer = KafkaConsumer('data-topic', bootstrap_servers=kafka_servers, value_deserializer=lambda m: json.loads(m.decode('utf-8')))
            self.data_lock = threading.Lock()
    
        def collect_data_from_api(self, data_type: str) -> List[Dict]:
            headers = {"Authorization": f"Bearer {self.api_key}"}
            params = {"type": data_type}
            response = requests.get(self.base_url, headers=headers, params=params)
            if response.status_code == 200:
                return response.json()
            else:
                response.raise_for_status()
    
        def produce_data(self, data_type: str):
            data = self.collect_data_from_api(data_type)
            for item in data:
                self.producer.send('data-topic', item)
    
        def collect_data_threaded(self, data_types: List[str]):
            threads = []
            for data_type in data_types:
                thread = threading.Thread(target=self.produce_data, args=(data_type,))
                threads.append(thread)
                thread.start()
    
            for thread in threads:
                thread.join()
    
        def classify_data(self) -> Dict[str, List[Dict]]:
            classified_data = {"text": [], "image": [], "audio": [], "video": [], "code": []}
            for message in self.consumer:
                item = message.value
                if item["type"] == "text":
                    classified_data["text"].append(item)
                elif item["type"] == "image":
                    classified_data["image"].append(item)
                elif item["type"] == "audio":
                    classified_data["audio"].append(item)
                elif item["type"] == "video":
                    classified_data["video"].append(item)
                elif item["type"] == "code":
                    classified_data["code"].append(item)
            return classified_data
    
        def store_data(self, classified_data: Dict):
            # Add your storage logic here (e.g., Ceph or DAOS)
            pass
    
    # Example usage
    if __name__ == "__main__":
        api_key = "your_api_key_here"
        kafka_servers = ["localhost:9092"]
        data_types = ["text", "image", "audio", "video", "code"]
        collector = DataCollector(api_key, kafka_servers)
        collector.collect_data_threaded(data_types)
        classified_data = collector.classify_data()
        collector.store_data(classified_data)
  2. Distributed Training: The main.py script initializes the distributed training process and starts the gRPC server and client.

    import torch.multiprocessing as mp
    from train import train
    import grpc_communication
    from data_collector import DataCollector
    
    if __name__ == '__main__':
        api_key = "your_api_key_here"
        kafka_servers = ["localhost:9092"]
        data_types = ["text", "image", "audio", "video", "code"]
        collector = DataCollector(api_key, kafka_servers)
        collector.collect_data_threaded(data_types)
        classified_data = collector.classify_data()
        collector.store_data(classified_data)
    
        world_size = 4
        dataset = torch.utils.data.TensorDataset(torch.randn(1000, 128), torch.randn(1000, 128))
        mp.spawn(train, args=(world_size, dataset), nprocs=world_size, join=True)
        
        # Start the gRPC server
        grpc_communication.serve()
        
        # Run the gRPC client
        grpc_communication.run_client()

Training

  1. Setup: The distributed_setup.py script is responsible for setting up and cleaning up the distributed training environment.

    import torch
    import torch.distributed as dist
    
    def setup(rank, world_size):
        dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        print(f"Setup completed for rank {rank} in world size {world_size}.")
    
    def cleanup():
        dist.destroy_process_group()
        print("Cleanup completed.")
  2. Model: The model.py script defines the model architecture.

    import torch
    import torch.nn as nn
    
    class MyModel(nn.Module):
        def __init__(self):
            super(MyModel, self).__init__()
            self.layer1 = nn.Linear(128, 256)
            self.layer2 = nn.Linear(256, 512)
            self.layer3 = nn.Linear(512, 256)
            self.layer4 = nn.Linear(256, 128)
    
        def forward(self, x):
            x = self.layer1(x)
            x = self.layer2(x)
            x = self.layer3(x)
            x = self.layer4(x)
            return x
    
    class MyPipeModel(nn.Module):
        def __init__(self):
            super(MyPipeModel, self).__init__()
            self.layer1 = nn.Linear(128, 256)
            self.layer2 = nn.Linear(256, 512)
            self.layer3 = nn.Linear(512, 256)
            self.layer4 = nn.Linear(256, 128)
    
        def forward(self, x):
            x = self.layer1(x)
            x = self.layer2(x)
            x = self.layer3(x)
            x = self.layer4(x)
            return x
  3. Optimizer: The optimizer.py script initializes the optimizer and uses DeepSpeed's ZeRO for memory optimization.

    import deepspeed
    import torch.optim as optim
    
    def get_optimizer(model):
        optimizer = deepspeed.zero.Init(
            model.parameters(),
            optimizer=optim.Adam(model.parameters(), lr=0.001)
        )
        return optimizer
  4. DataLoader: The data_loader.py script creates a distributed data loader.

    import torch
    
    def get_dataloader(dataset, batch_size, num_workers, rank, world_size):
        sampler = torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=world_size, rank=rank)
        dataloader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=batch_size,
            shuffle=False,
            num_workers=num_workers,
            sampler=sampler
        )
        return dataloader
  5. Communication: The communication.py script defines communication optimization methods, such as gradient compression.

    import torch.distributed as dist
    
    def compressed_allreduce(tensor, world_size):
        tensor_to_send = tensor / world_size
        dist.all_reduce(tensor_to_send, op=dist.ReduceOp.SUM)
        return tensor_to_send
  6. gRPC Communication: 接下来继续。

  7. gRPC Communication: The grpc_communication.py script defines the gRPC server and client for distributed communication.

    import grpc
    from concurrent import futures
    import distributed_pb2
    import distributed_pb2_grpc
    
    class DataService(distributed_pb2_grpc.DataServiceServicer):
        def SendData(self, request, context):
            response = distributed_pb2.DataResponse()
            response.message = "Data received successfully"
            return response
    
    def serve():
        server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
        distributed_pb2_grpc.add_DataServiceServicer_to_server(DataService(), server)
        server.add_insecure_port('[::]:50051')
        server.start()
        print("Server started at port 50051")
        server.wait_for_termination()
    
    def run_client():
        with grpc.insecure_channel('localhost:50051') as channel:
            stub = distributed_pb2_grpc.DataServiceStub(channel)
            response = stub.SendData(distributed_pb2.DataRequest(data="Sample data"))
            print("Client received: " + response.message)
  8. Training: The train.py script is responsible for training the model in a distributed manner.

    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP
    from distributed_setup import setup, cleanup
    from model import MyModel
    from optimizer import get_optimizer
    from data_loader import get_dataloader
    from communication import compressed_allreduce
    
    def train(rank, world_size, dataset):
        setup(rank, world_size)
    
        model = MyModel().to(rank)
        model = DDP(model, device_ids=[rank])
        optimizer = get_optimizer(model)
        criterion = nn.MSELoss()
        dataloader = get_dataloader(dataset, batch_size=32, num_workers=4, rank=rank, world_size=world_size)
    
        for epoch in range(10):
            for inputs, targets in dataloader:
                inputs, targets = inputs.to(rank), targets.to(rank)
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                loss.backward()
                optimizer.step()
                compressed_allreduce(loss, world_size)
    
            print(f"Rank {rank}, Epoch {epoch}, Loss: {loss.item()}")
    
        cleanup()

Contribution Guide

We welcome contributions through issues and pull requests. For detailed information, please refer to the Contribution Guide.

Contact Information

If you have any questions or suggestions, please contact us through GitHub Issues: https://github.com/QuickerStudio/AdvancedDistributedTrainingSystem/issues


## Contribution Guide
We welcome contributions through issues and pull requests. For detailed information, please refer to the [Contribution Guide](CONTRIBUTING.md).

## Contributing
Contributions are welcome! Please feel free to submit a Pull Request or open an Issue.

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Acknowledgements
Special thanks to the PyTorch and DeepSpeed teams for their excellent libraries and tools.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published