CodeBridge

CodeBridge is a project dedicated to optimizing large language models (LLMs) for low-resource programming languages (LRPLs) such as Cangjie. The project centers on CodeBridge, a three-stage transfer learning approach that improves code completion accuracy by transferring knowledge from high-resource programming languages (HRPLs) such as Java and Rust, and it incorporates Retrieval-Augmented Generation (RAG) at inference time to further improve performance.

Project Structure

  • .env: Environment variables for API keys.
  • dataset: Contains datasets for training and evaluation.
  • LLaMA-Factory: Framework used for fine-tuning LLMs.
  • src: Core source code directory, containing:
    • metric/: Scripts for evaluating model performance.
    • rag/: Implementations of Retrieval-Augmented Generation (RAG).
    • tree_sitter_cj/: Cangjie code parsing utilities.
    • data_cleaning.py: Scripts for preprocessing datasets.
    • inference.py: Hugging Face transformers-based inference scripts.
  • inference.ipynb: Jupyter notebook for inference using vLLM.
  • train.sh: Shell script for training the model.
  • llm.py: Interface for interacting with the model.
  • requirements.txt: Dependencies for the project.

CodeBridge: Three-Stage Fine-Tuning Process

CangjieLLM adopts CodeBridge, a three-stage training strategy that improves code completion for LRPLs through transfer learning from HRPLs.

Training Strategy

  1. Teaching Phase:

    • Dataset: Cangjie corpus (~8M tokens)
    • Epochs: 4
    • Learning Rate: 2e-5
    • Goal: Rapidly expose the model to Cangjie's syntax and semantics.
  2. Practice Phase:

    • Dataset: Java/Rust corpus (~24M tokens)
    • Epochs: 1
    • Learning Rate: 7e-6
    • Goal: Enhance structural and semantic understanding by leveraging high-resource programming languages.
  3. Correction Phase:

    • Dataset: Cangjie corpus (same as step 1)
    • Epochs: 4
    • Learning Rate: 5e-6
    • Goal: Fine-tune the model to rectify transfer-induced biases.
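
The three phases run sequentially over the same base model, each resuming from the previous checkpoint. The sketch below is illustrative only: run_finetune is a hypothetical wrapper around the actual training entry point (see train.sh and LLaMA-Factory for the real invocation), but the datasets, epochs, and learning rates mirror the phases above.

    # Illustrative schedule for the three CodeBridge phases.
    # run_finetune is a hypothetical helper; the real training is driven by
    # train.sh via LLaMA-Factory.
    STAGES = [
        {"name": "teaching",   "dataset": "cangjie",   "epochs": 4, "lr": 2e-5},
        {"name": "practice",   "dataset": "java_rust", "epochs": 1, "lr": 7e-6},
        {"name": "correction", "dataset": "cangjie",   "epochs": 4, "lr": 5e-6},
    ]

    def run_codebridge(base_model: str) -> str:
        checkpoint = base_model
        for stage in STAGES:
            # Each stage resumes from the previous stage's checkpoint.
            checkpoint = run_finetune(
                model=checkpoint,
                dataset=stage["dataset"],
                num_epochs=stage["epochs"],
                learning_rate=stage["lr"],
            )
        return checkpoint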

Dataset Preparation

  • Data Sources:
    • Cangjie dataset from Huawei repositories (Cangjie-SIG, Cangjie-TPC, HW-PLLab).
    • Java/Rust dataset from StarCoder preprocessed corpus.
  • Data Cleaning:
    • File filtering based on size, encoding, character composition, and comment removal.
    • Deduplication using a 90% similarity threshold (see the sketch after this list).
  • Data Splitting:
    • 20 projects used for evaluation (held-out test set).
    • Remaining data used for training.
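
As a rough illustration of the 90% deduplication threshold mentioned above, the sketch below filters near-duplicate files with difflib. It is a minimal, illustrative version; the actual logic in data_cleaning.py may use a different similarity measure and a more scalable approach.

    # Minimal near-duplicate filter at a 90% similarity threshold.
    # Illustrative only: pairwise difflib comparison is O(n^2); the real
    # pipeline in data_cleaning.py may use a different, faster method.
    from difflib import SequenceMatcher

    def deduplicate(files: list[str], threshold: float = 0.9) -> list[str]:
        kept: list[str] = []
        for content in files:
            is_duplicate = any(
                SequenceMatcher(None, content, other).ratio() >= threshold
                for other in kept
            )
            if not is_duplicate:
                kept.append(content)
        return kept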

Inference with RAG and Prefix Matching

For inference, Retrieval-Augmented Generation (RAG) is combined with a prefix-matching strategy to improve code completion accuracy.
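
Conceptually, the retrieval step fetches the Cangjie snippets most similar to the current context and prepends them to the completion prompt. The sketch below uses a simple token-overlap score purely for illustration; the retriever in src/rag/ may rely on embeddings or a different index.

    # Illustrative RAG prompt assembly: retrieve the k most similar snippets
    # by token overlap and prepend them to the completion context.
    # The actual retriever in src/rag/ may differ.
    def token_overlap(a: str, b: str) -> float:
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / max(1, len(ta | tb))

    def build_prompt(context: str, corpus: list[str], k: int = 3) -> str:
        ranked = sorted(corpus, key=lambda snippet: token_overlap(context, snippet),
                        reverse=True)
        retrieved = "\n\n".join(ranked[:k])
        return f"// Retrieved examples:\n{retrieved}\n\n// Complete the code:\n{context}"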

Prefix-Matching Decoding Strategy

  • If the input ends with a space → Extract the preceding non-space segment as prefix.
  • If the input ends with a symbol → Use context-based matching to determine the appropriate completion.

This method ensures that the generated output aligns more accurately with user expectations.
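
One way to picture the space-ending case is: take the trailing non-space segment of the input as the prefix and, if a candidate completion repeats that prefix, strip the overlap so the suggestion continues the user's input cleanly. The sketch below is a hedged reading of the strategy, not the exact logic in the repository.

    # Hedged sketch of the prefix-matching idea; not the exact repository logic.
    def apply_prefix_matching(user_input: str, completion: str) -> str:
        if user_input.endswith(" ") and user_input.strip():
            # Input ends with a space: take the preceding non-space segment
            # as the prefix and drop it from the completion if repeated.
            prefix = user_input.split()[-1]
            if completion.startswith(prefix):
                return completion[len(prefix):]
        # Input ends with a symbol: context-based matching is used instead,
        # which is not sketched here.
        return completion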

Experimental Setup

  • Hardware: 4x A100 GPUs (80GB)
  • Batch Size: 4
  • Training Time:
    • Teaching Phase: 36 hours
    • Practice Phase: 10 hours
    • Correction Phase: 36 hours

Metrics

The metric module evaluates performance using:

  • Exact Match Rate (EM): Percentage of generated lines that exactly match the reference.
  • Edit Similarity (ES): Normalized edit-distance similarity between the generated and reference lines.
  • Line Accuracy: Percentage of correctly generated lines within a block.
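
The two line-level metrics can be reimplemented in a few lines, as sketched below. This is a simplified illustration; the scripts in src/metric/ may normalize whitespace or tokenize differently.

    # Simplified line-level metrics; the scripts in src/metric/ may differ
    # in normalization details.
    def exact_match(pred: str, ref: str) -> bool:
        return pred.strip() == ref.strip()

    def edit_similarity(pred: str, ref: str) -> float:
        # 1 minus the normalized Levenshtein distance.
        m, n = len(pred), len(ref)
        if max(m, n) == 0:
            return 1.0
        dp = list(range(n + 1))
        for i in range(1, m + 1):
            prev, dp[0] = dp[0], i
            for j in range(1, n + 1):
                cur = dp[j]
                dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                            prev + (pred[i - 1] != ref[j - 1]))
                prev = cur
        return 1.0 - dp[n] / max(m, n)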

Results

1. Effectiveness of CodeBridge (RQ1)

| Setting | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
|---|---|---|---|
| Baseline (Untrained Model) | 35.49% | 0.6699 | 25.15% |
| Teaching Only | 44.07% | 0.7397 | 31.94% |
| No Transfer Learning (High LR) | 49.44% | 0.7645 | 30.51% |
| No Transfer Learning (Low LR) | 46.53% | 0.7568 | 30.70% |
| Transfer Learning First | 47.43% | 0.7563 | 31.91% |
| Full Three-Step Strategy | 52.35% | 0.7692 | 33.27% |

2. Impact of Training Configurations (RQ2)

This section explores how different training settings, such as transfer data volume, final-stage learning rate, and number of epochs, influence both line-level and function-level performance.

| Final Stage LR | Transfer Data Volume | Epochs | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
|---|---|---|---|---|---|
| Cosine LR | - | 8 | 45.64% | 0.7432 | 31.78% |
| 5e-6 | 1:3 (24M tokens) | 4+1+4 | 52.35% | 0.7692 | 33.27% |
| 5e-6 | 1:1 (8M tokens) | 4+1+4 | 40.49% | 0.7165 | 32.00% |
| 5e-6 | 1:5 (40M tokens) | 4+1+4 | 48.32% | 0.7659 | 32.10% |
| 1e-5 | 1:3 (24M tokens) | 4+1+4 | 47.65% | 0.7660 | 32.70% |
| 3e-6 | 1:3 (24M tokens) | 4+1+4 | 48.77% | 0.7616 | 32.57% |
| 5e-6 | 1:3 (24M tokens) | 2+1+2 | 48.55% | 0.7497 | 32.16% |
| 5e-6 | 1:3 (24M tokens) | 3+1+3 | 46.98% | 0.7536 | 32.87% |

3. Generalizability of CodeBridge (RQ3)

To assess CodeBridge's generalizability, we evaluate its effectiveness across different LLM architectures and model sizes, analyzing line-level and function-level performance.

| Model | Training Step | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
|---|---|---|---|---|
| CodeLlama-13B-Instruct | Origin | 32.21% | 0.6695 | 27.24% |
| | Step 1 | 56.60% | 0.8082 | 33.68% |
| | Step 2 | 50.56% | 0.7900 | 32.10% |
| | Step 3 | 57.94% | 0.8135 | 34.13% |
| | No Transfer | 56.82% | 0.8142 | 33.90% |
| Qwen2.5-14B-Instruct | Origin | 29.69% | 0.6342 | 20.60% |
| | Step 1 | 46.98% | 0.7473 | 25.75% |
| | Step 2 | 32.89% | 0.7034 | 21.89% |
| | Step 3 | 50.56% | 0.7717 | 27.12% |
| | No Transfer | 48.10% | 0.7446 | 26.64% |
| DeepSeek-Coder-1.3B-Instruct | Origin | 27.46% | 0.5518 | 20.80% |
| | Step 1 | 43.53% | 0.7017 | 26.39% |
| | Step 2 | 39.96% | 0.7150 | 22.83% |
| | Step 3 | 44.20% | 0.7179 | 27.28% |
| | No Transfer | 43.97% | 0.7059 | 26.69% |
| DeepSeek-Coder-6.7B-Instruct | Origin | 32.37% | 0.6147 | 23.58% |
| | Step 1 | 51.79% | 0.7683 | 30.85% |
| | Step 2 | 39.53% | 0.6546 | 24.60% |
| | Step 3 | 54.02% | 0.7807 | 31.45% |
| | No Transfer | 52.90% | 0.7764 | 30.94% |

Setup

  1. Clone the repository and install dependencies:
    pip install -r requirements.txt
    cd LLaMA-Factory
    pip install -e ".[torch,metrics]"
  2. Prepare the dataset and store dataset metadata in dataset_info.json.
  3. Modify train.sh to specify the training configuration.
  4. Run training:
    bash train.sh
  5. For inference, use inference.ipynb.
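
If you prefer plain Hugging Face transformers over the vLLM notebook, a minimal completion call looks roughly like the sketch below; the checkpoint path is a placeholder for your own fine-tuned model directory, and the generation settings are illustrative only.

    # Minimal transformers-based inference sketch; the checkpoint path is a
    # placeholder, and generation settings are illustrative only.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/finetuned-checkpoint"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

    prompt = "func add(a: Int64, b: Int64)"  # illustrative partial Cangjie snippet
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))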

Contribution

We welcome contributions! Feel free to open issues or submit pull requests.
