CodeBridge is a project dedicated to adapting large language models (LLMs) to low-resource programming languages (LRPLs) such as Cangjie. At its core is a three-stage transfer-learning approach that improves code completion accuracy by leveraging knowledge from high-resource programming languages (HRPLs) such as Java and Rust. Retrieval-Augmented Generation (RAG) is additionally applied at inference time to further boost performance.
- .env: Environment variables for API keys.
- dataset: Contains datasets for training and evaluation.
- LLaMA-Factory: Framework used for fine-tuning LLMs.
- src: Core source code directory, containing:
  - metric/: Scripts for evaluating model performance.
  - rag/: Implementations of Retrieval-Augmented Generation (RAG).
  - tree_sitter_cj/: Cangjie code parsing utilities.
- data_cleaning.py: Scripts for preprocessing datasets.
- inference.py: Hugging Face transformers-based inference scripts.
- inference.ipynb: Jupyter notebook for inference using vLLM.
- train.sh: Shell script for training the model.
- llm.py: Interface for interacting with the model.
- requirements.txt: Dependencies for the project.
CangjieLLM adopts CodeBridge, a three-stage training strategy that improves code completion for LRPLs through transfer learning from HRPLs.
- Teaching Phase:
  - Dataset: Cangjie corpus (~8M tokens)
  - Epochs: 4
  - Learning Rate: 2e-5
  - Goal: Rapidly expose the model to Cangjie's syntax and semantics.
- Practice Phase:
  - Dataset: Java/Rust corpus (~24M tokens)
  - Epochs: 1
  - Learning Rate: 7e-6
  - Goal: Enhance structural and semantic understanding by leveraging high-resource programming languages.
- Correction Phase:
  - Dataset: Cangjie corpus (same as the Teaching Phase)
  - Epochs: 4
  - Learning Rate: 5e-6
  - Goal: Fine-tune the model to rectify transfer-induced biases.
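For reference, the schedule above can be written down compactly. The sketch below is illustrative only: the corpus names are placeholders, and actual training is driven through LLaMA-Factory via train.sh.

```python
# CodeBridge three-stage schedule, written out for reference only.
# Corpus names are placeholders; the real configuration lives in train.sh.
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    corpus: str
    epochs: int
    learning_rate: float

CODEBRIDGE_SCHEDULE = [
    Phase("teaching",   "cangjie_corpus",   epochs=4, learning_rate=2e-5),  # ~8M tokens
    Phase("practice",   "java_rust_corpus", epochs=1, learning_rate=7e-6),  # ~24M tokens
    Phase("correction", "cangjie_corpus",   epochs=4, learning_rate=5e-6),  # same data as teaching
]

for phase in CODEBRIDGE_SCHEDULE:
    print(f"{phase.name}: {phase.corpus}, {phase.epochs} epoch(s), lr={phase.learning_rate}")
```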
- Data Sources:
  - Cangjie dataset from Huawei repositories (Cangjie-SIG, Cangjie-TPC, HW-PLLab).
  - Java/Rust dataset from the preprocessed StarCoder corpus.
- Data Cleaning:
  - File filtering based on size, encoding, character composition, and comment removal.
  - Deduplication using a 90% similarity threshold (a minimal sketch follows this list).
- Data Splitting:
  - 20 projects used for evaluation (held-out test set).
  - Remaining data used for training.
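As a rough illustration of the deduplication step, here is a minimal near-duplicate filter at a 90% similarity threshold. It is a sketch only; the repository's actual pipeline lives in src/data_cleaning.py and may use a different similarity measure.

```python
# Near-duplicate filtering sketch (illustrative only; the repository's
# pipeline in src/data_cleaning.py may compute similarity differently).
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Return True if the two files are at least `threshold` similar."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def deduplicate(files: list[str], threshold: float = 0.9) -> list[str]:
    """Keep only files that are not near-duplicates of an already-kept file."""
    kept: list[str] = []
    for text in files:
        if not any(is_near_duplicate(text, seen, threshold) for seen in kept):
            kept.append(text)
    return kept
```

Note that this pairwise comparison is quadratic in the number of files; large corpora typically rely on MinHash or similar approximate techniques instead.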
For inference, Retrieval-Augmented Generation (RAG) is combined with a prefix-matching strategy to improve code completion accuracy.
- If the input ends with a space → extract the preceding non-space segment as the prefix.
- If the input ends with a symbol → use context-based matching to determine the appropriate completion.
This strategy helps the generated output align more closely with user expectations.
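The following is a minimal sketch of the prefix rule described above. The helper is hypothetical; the repository's actual retrieval logic lives under src/rag/.

```python
# Hypothetical prefix-extraction helper for the RAG completion step;
# the repository's real implementation lives under src/rag/.
def extract_prefix(code: str) -> str | None:
    """Return the prefix used for retrieval, or None if context matching applies."""
    if code.endswith(" "):
        stripped = code.rstrip()
        # Input ends with a space: take the preceding non-space segment as the prefix.
        return stripped.split()[-1] if stripped else None
    # Input ends with a symbol: fall back to context-based matching (not shown here).
    return None
```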
- Hardware: 4x A100 GPUs (80GB)
- Batch Size: 4
- Training Time:
- Teaching Phase: 36 hours
- Practice Phase: 10 hours
- Correction Phase: 36 hours
The metric module evaluates performance using:
- Exact Match Rate (EM): Measures the percentage of perfect matches.
- Edit Similarity (ES): Normalized similarity based on edit distance between the prediction and the reference (higher is better).
- Line Accuracy: Percentage of correctly generated lines within a block.
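The snippets below sketch how these metrics can be computed; the repository's own scripts under src/metric/ may normalize whitespace or tokenize differently.

```python
# Illustrative metric implementations (the official scripts live in src/metric/
# and may differ in normalization details).

def exact_match(pred: str, ref: str) -> bool:
    """Exact Match (EM): prediction equals the reference after trimming whitespace."""
    return pred.strip() == ref.strip()

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def edit_similarity(pred: str, ref: str) -> float:
    """Edit Similarity (ES): 1 minus the normalized edit distance (higher is better)."""
    pred, ref = pred.strip(), ref.strip()
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))

def line_accuracy(pred_lines: list[str], ref_lines: list[str]) -> float:
    """Line Accuracy: fraction of reference lines reproduced exactly within a block."""
    matches = sum(exact_match(p, r) for p, r in zip(pred_lines, ref_lines))
    return matches / max(len(ref_lines), 1)
```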
| Setting | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
|---|---|---|---|
| Baseline (Untrained Model) | 35.49% | 0.6699 | 25.15% |
| Teaching Only | 44.07% | 0.7397 | 31.94% |
| No Transfer Learning (High LR) | 49.44% | 0.7645 | 30.51% |
| No Transfer Learning (Low LR) | 46.53% | 0.7568 | 30.70% |
| Transfer Learning First | 47.43% | 0.7563 | 31.91% |
| Full Three-Step Strategy | 52.35% | 0.7692 | 33.27% |
This section explores how different training settings, such as transfer data volume, final-stage learning rate, and number of epochs, influence both line-level and function-level performance.
| Final Stage LR | Transfer Data Volume | Epochs | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
|---|---|---|---|---|---|
| Cosine LR | - | 8 | 45.64% | 0.7432 | 31.78% |
| 5e-6 | 1:3 (24M tokens) | 4+1+4 | 52.35% | 0.7692 | 33.27% |
| 5e-6 | 1:1 (8M tokens) | 4+1+4 | 40.49% | 0.7165 | 32.00% |
| 5e-6 | 1:5 (40M tokens) | 4+1+4 | 48.32% | 0.7659 | 32.10% |
| 1e-5 | 1:3 (24M tokens) | 4+1+4 | 47.65% | 0.7660 | 32.70% |
| 3e-6 | 1:3 (24M tokens) | 4+1+4 | 48.77% | 0.7616 | 32.57% |
| 5e-6 | 1:3 (24M tokens) | 2+1+2 | 48.55% | 0.7497 | 32.16% |
| 5e-6 | 1:3 (24M tokens) | 3+1+3 | 46.98% | 0.7536 | 32.87% |
To assess CodeBridge's generalizability, we evaluate its effectiveness across different LLM architectures and model sizes, analyzing line-level and function-level performance.
| Model | Training Step | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
|---|---|---|---|---|
| CodeLlama-13B-Instruct | Origin | 32.21% | 0.6695 | 27.24% |
| | Step 1 | 56.60% | 0.8082 | 33.68% |
| | Step 2 | 50.56% | 0.7900 | 32.10% |
| | Step 3 | 57.94% | 0.8135 | 34.13% |
| | No Transfer | 56.82% | 0.8142 | 33.90% |
| Qwen2.5-14B-Instruct | Origin | 29.69% | 0.6342 | 20.60% |
| | Step 1 | 46.98% | 0.7473 | 25.75% |
| | Step 2 | 32.89% | 0.7034 | 21.89% |
| | Step 3 | 50.56% | 0.7717 | 27.12% |
| | No Transfer | 48.10% | 0.7446 | 26.64% |
| DeepSeek-Coder-1.3B-Instruct | Origin | 27.46% | 0.5518 | 20.80% |
| | Step 1 | 43.53% | 0.7017 | 26.39% |
| | Step 2 | 39.96% | 0.7150 | 22.83% |
| | Step 3 | 44.20% | 0.7179 | 27.28% |
| | No Transfer | 43.97% | 0.7059 | 26.69% |
| DeepSeek-Coder-6.7B-Instruct | Origin | 32.37% | 0.6147 | 23.58% |
| | Step 1 | 51.79% | 0.7683 | 30.85% |
| | Step 2 | 39.53% | 0.6546 | 24.60% |
| | Step 3 | 54.02% | 0.7807 | 31.45% |
| | No Transfer | 52.90% | 0.7764 | 30.94% |
- Clone the repository and install dependencies:

  ```bash
  pip install -r requirements.txt
  cd LLaMA-Factory
  pip install -e ".[torch,metrics]"
  ```

- Prepare the dataset and store the dataset metadata in `dataset_info.json`.
- Modify `train.sh` to specify the training configuration.
- Run training:

  ```bash
  bash train.sh
  ```

- For inference, use `inference.ipynb`.
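Since the notebook runs inference through vLLM, a minimal script-based equivalent looks like the sketch below. The model path, prompt, and sampling settings are placeholders, not the repository's actual configuration.

```python
# Minimal vLLM completion sketch (illustrative; checkpoint path, prompt, and
# sampling settings are placeholders rather than the project's real values).
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/finetuned-cangjie-model")  # hypothetical checkpoint path
params = SamplingParams(temperature=0.2, max_tokens=64)

prompts = ["func main() {"]  # partial Cangjie code to complete
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```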
We welcome contributions! Feel free to open issues or submit pull requests.

