CodeBridge is a project dedicated to adapting large language models (LLMs) to low-resource programming languages (LRPLs) such as Cangjie. At its core is a three-stage transfer-learning approach that improves code completion accuracy by leveraging knowledge from high-resource programming languages (HRPLs) such as Java and Rust. Retrieval-Augmented Generation (RAG) is additionally applied at inference time to further boost performance.
- .env: Environment variables for API keys.
- dataset: Contains datasets for training and evaluation.
- LLaMA-Factory: Framework used for fine-tuning LLMs.
- src: Core source code directory, containing:
  - metric/: Scripts for evaluating model performance.
  - rag/: Implementations of Retrieval-Augmented Generation (RAG).
  - tree_sitter_cj/: Cangjie code parsing utilities.
- data_cleaning.py: Scripts for preprocessing datasets.
- inference.py: Hugging Face transformers-based inference scripts.
- inference.ipynb: Jupyter notebook for inference using vLLM.
- train.sh: Shell script for training the model.
- llm.py: Interface for interacting with the model.
- requirements.txt: Dependencies for the project.
CangjieLLM adopts CodeBridge, a three-stage training strategy that improves code completion for LRPLs through transfer learning from HRPLs.
- Teaching Phase:
  - Dataset: Cangjie corpus (~8M tokens)
  - Epochs: 4
  - Learning Rate: 2e-5
  - Goal: Rapidly expose the model to Cangjie's syntax and semantics.
- Practice Phase:
  - Dataset: Java/Rust corpus (~24M tokens)
  - Epochs: 1
  - Learning Rate: 7e-6
  - Goal: Enhance structural and semantic understanding by leveraging high-resource programming languages.
- Correction Phase:
  - Dataset: Cangjie corpus (same as the Teaching Phase)
  - Epochs: 4
  - Learning Rate: 5e-6
  - Goal: Fine-tune the model to rectify transfer-induced biases.
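For reference, the schedule above can be written down compactly. The sketch below is illustrative only: the corpus names are placeholders, and actual training is driven through LLaMA-Factory via train.sh.

```python
# CodeBridge three-stage schedule, written out for reference only.
# Corpus names are placeholders; the real configuration lives in train.sh.
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    corpus: str
    epochs: int
    learning_rate: float

CODEBRIDGE_SCHEDULE = [
    Phase("teaching",   "cangjie_corpus",   epochs=4, learning_rate=2e-5),  # ~8M tokens
    Phase("practice",   "java_rust_corpus", epochs=1, learning_rate=7e-6),  # ~24M tokens
    Phase("correction", "cangjie_corpus",   epochs=4, learning_rate=5e-6),  # same data as teaching
]

for phase in CODEBRIDGE_SCHEDULE:
    print(f"{phase.name}: {phase.corpus}, {phase.epochs} epoch(s), lr={phase.learning_rate}")
```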
- Data Sources:
  - Cangjie dataset from Huawei repositories (Cangjie-SIG, Cangjie-TPC, HW-PLLab).
  - Java/Rust dataset from the preprocessed StarCoder corpus.
- Data Cleaning:
  - File filtering based on size, encoding, character composition, and comment removal.
  - Deduplication using a 90% similarity threshold (a minimal sketch follows this list).
- Data Splitting:
  - 20 projects used for evaluation (held-out test set).
  - Remaining data used for training.
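As a rough illustration of the deduplication step, here is a minimal near-duplicate filter at a 90% similarity threshold. It is a sketch only; the repository's actual pipeline lives in src/data_cleaning.py and may use a different similarity measure.

```python
# Near-duplicate filtering sketch (illustrative only; the repository's
# pipeline in src/data_cleaning.py may compute similarity differently).
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Return True if the two files are at least `threshold` similar."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def deduplicate(files: list[str], threshold: float = 0.9) -> list[str]:
    """Keep only files that are not near-duplicates of an already-kept file."""
    kept: list[str] = []
    for text in files:
        if not any(is_near_duplicate(text, seen, threshold) for seen in kept):
            kept.append(text)
    return kept
```

Note that this pairwise comparison is quadratic in the number of files; large corpora typically rely on MinHash or similar approximate techniques instead.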
For inference, Retrieval-Augmented Generation (RAG) is combined with a prefix-matching strategy to improve code completion accuracy.
- If the input ends with a space → extract the preceding non-space segment as the prefix.
- If the input ends with a symbol → use context-based matching to determine the appropriate completion.
This strategy helps the generated output align more closely with user expectations.
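The following is a minimal sketch of the prefix rule described above. The helper is hypothetical; the repository's actual retrieval logic lives under src/rag/.

```python
# Hypothetical prefix-extraction helper for the RAG completion step;
# the repository's real implementation lives under src/rag/.
def extract_prefix(code: str) -> str | None:
    """Return the prefix used for retrieval, or None if context matching applies."""
    if code.endswith(" "):
        stripped = code.rstrip()
        # Input ends with a space: take the preceding non-space segment as the prefix.
        return stripped.split()[-1] if stripped else None
    # Input ends with a symbol: fall back to context-based matching (not shown here).
    return None
```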
- Hardware: 4x A100 GPUs (80GB)
- Batch Size: 4
- Training Time:
- Teaching Phase: 36 hours
- Practice Phase: 10 hours
- Correction Phase: 36 hours
The metric module evaluates performance using:
- Exact Match Rate (EM): Measures the percentage of perfect matches.
- Edit Similarity (ES): Normalized similarity based on edit distance between the prediction and the reference (higher is better).
- Line Accuracy: Percentage of correctly generated lines within a block.
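The snippets below sketch how these metrics can be computed; the repository's own scripts under src/metric/ may normalize whitespace or tokenize differently.

```python
# Illustrative metric implementations (the official scripts live in src/metric/
# and may differ in normalization details).

def exact_match(pred: str, ref: str) -> bool:
    """Exact Match (EM): prediction equals the reference after trimming whitespace."""
    return pred.strip() == ref.strip()

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def edit_similarity(pred: str, ref: str) -> float:
    """Edit Similarity (ES): 1 minus the normalized edit distance (higher is better)."""
    pred, ref = pred.strip(), ref.strip()
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))

def line_accuracy(pred_lines: list[str], ref_lines: list[str]) -> float:
    """Line Accuracy: fraction of reference lines reproduced exactly within a block."""
    matches = sum(exact_match(p, r) for p, r in zip(pred_lines, ref_lines))
    return matches / max(len(ref_lines), 1)
```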
| Setting | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
|---|---|---|---|
| Baseline (Untrained Model) | 35.49% | 0.6699 | 25.15% |
| Teaching Only | 44.07% | 0.7397 | 31.94% |
| No Transfer Learning (High LR) | 49.44% | 0.7645 | 30.51% |
| No Transfer Learning (Low LR) | 46.53% | 0.7568 | 30.70% |
| Transfer Learning First | 47.43% | 0.7563 | 31.91% |
| Full Three-Step Strategy | 52.35% | 0.7692 | 33.27% |
This section explores how different training settings, such as transfer data volume, final-stage learning rate, and number of epochs, influence both line-level and function-level performance.
| Final Stage LR | Transfer Data Volume | Epochs | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
|---|---|---|---|---|---|
| Cosine LR | - | 8 | 45.64% | 0.7432 | 31.78% |
| 5e-6 | 1:3 (24M tokens) | 4+1+4 | 52.35% | 0.7692 | 33.27% |
| 5e-6 | 1:1 (8M tokens) | 4+1+4 | 40.49% | 0.7165 | 32.00% |
| 5e-6 | 1:5 (40M tokens) | 4+1+4 | 48.32% | 0.7659 | 32.10% |
| 1e-5 | 1:3 (24M tokens) | 4+1+4 | 47.65% | 0.7660 | 32.70% |
| 3e-6 | 1:3 (24M tokens) | 4+1+4 | 48.77% | 0.7616 | 32.57% |
| 5e-6 | 1:3 (24M tokens) | 2+1+2 | 48.55% | 0.7497 | 32.16% |
| 5e-6 | 1:3 (24M tokens) | 3+1+3 | 46.98% | 0.7536 | 32.87% |
To assess CodeBridge's generalizability, we evaluate its effectiveness across different LLM architectures and model sizes, analyzing line-level and function-level performance.
| Model | Training Step | Line-Level Exact Match Rate | Line-Level Edit Similarity | Function-Level Line Accuracy |
|---|---|---|---|---|
| CodeLlama-13B-Instruct | Origin | 32.21% | 0.6695 | 27.24% |
| | Step 1 | 56.60% | 0.8082 | 33.68% |
| | Step 2 | 50.56% | 0.7900 | 32.10% |
| | Step 3 | 57.94% | 0.8135 | 34.13% |
| | No Transfer | 56.82% | 0.8142 | 33.90% |
| Qwen2.5-14B-Instruct | Origin | 29.69% | 0.6342 | 20.60% |
| | Step 1 | 46.98% | 0.7473 | 25.75% |
| | Step 2 | 32.89% | 0.7034 | 21.89% |
| | Step 3 | 50.56% | 0.7717 | 27.12% |
| | No Transfer | 48.10% | 0.7446 | 26.64% |
| DeepSeek-Coder-1.3B-Instruct | Origin | 27.46% | 0.5518 | 20.80% |
| | Step 1 | 43.53% | 0.7017 | 26.39% |
| | Step 2 | 39.96% | 0.7150 | 22.83% |
| | Step 3 | 44.20% | 0.7179 | 27.28% |
| | No Transfer | 43.97% | 0.7059 | 26.69% |
| DeepSeek-Coder-6.7B-Instruct | Origin | 32.37% | 0.6147 | 23.58% |
| | Step 1 | 51.79% | 0.7683 | 30.85% |
| | Step 2 | 39.53% | 0.6546 | 24.60% |
| | Step 3 | 54.02% | 0.7807 | 31.45% |
| | No Transfer | 52.90% | 0.7764 | 30.94% |
- Clone the repository and install dependencies:

  ```bash
  pip install -r requirements.txt
  cd LLaMA-Factory
  pip install -e ".[torch,metrics]"
  ```

- Prepare the dataset and store the dataset metadata in `dataset_info.json`.
- Modify `train.sh` to specify the training configuration.
- Run training:

  ```bash
  bash train.sh
  ```

- For inference, use `inference.ipynb`.
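Since the notebook runs inference through vLLM, a minimal script-based equivalent looks like the sketch below. The model path, prompt, and sampling settings are placeholders, not the repository's actual configuration.

```python
# Minimal vLLM completion sketch (illustrative; checkpoint path, prompt, and
# sampling settings are placeholders rather than the project's real values).
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/finetuned-cangjie-model")  # hypothetical checkpoint path
params = SamplingParams(temperature=0.2, max_tokens=64)

prompts = ["func main() {"]  # partial Cangjie code to complete
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```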
We welcome contributions! Feel free to open issues or submit pull requests.

