πŸ’»πŸ” How Programming Concepts and Neurons Are Shared in Code Language Models

code-specific-neurons

Code and data for the How Programming Concepts and Neurons Are Shared in Code Language Models paper.

git clone https://github.com/cisnlp/code-specific-neurons.git

code-logitlens

To interpret latent embeddings, we use the logit lens. We implement our version of the logit lens in code-logitlens/compute_lens.ipynb. It uses datasets/parallel and performs a translation task from one programming language to another, setting one language as the input language and the other as the output language. For each token and layer, it records the decoded tokens along with their probabilities and ranks.
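
The recording step above can be sketched as follows. This is a minimal, self-contained illustration of the logit-lens idea (toy shapes and random matrices, not the notebook's actual model code): each layer's hidden state is decoded through the final unembedding matrix, and the top token, its probability, and the rank of a reference token are recorded.

```python
# Minimal logit-lens sketch (illustrative shapes, not the paper's exact code).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def logit_lens(hidden_states, unembed, target_id):
    """hidden_states: (num_layers, hidden_dim) for one token position.
    unembed: (hidden_dim, vocab_size) unembedding matrix.
    Returns per-layer (top_token_id, top_prob, rank_of_target)."""
    records = []
    for h in hidden_states:
        probs = softmax(h @ unembed)      # decode this layer directly
        top_id = int(probs.argmax())
        rank = int((probs > probs[target_id]).sum()) + 1  # rank 1 = most probable
        records.append((top_id, float(probs[top_id]), rank))
    return records

rng = np.random.default_rng(0)
layers = rng.normal(size=(4, 8))      # 4 layers, hidden dim 8 (toy values)
unembed = rng.normal(size=(8, 16))    # vocab of 16 tokens (toy values)
for layer, (tok, prob, rank) in enumerate(logit_lens(layers, unembed, target_id=3)):
    print(f"layer {layer}: top token {tok} (p={prob:.2f}), target rank {rank}")
```

In the real notebook the hidden states come from a transformer forward pass and the unembedding matrix is the model's own output projection; the bookkeeping is the same.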

code-mexa

To calculate cross-lingual alignment between programming languages, we use MEXA.
MEXA uses datasets/parallel to compute alignment between a pivot language and the other languages. We use the MEXA codebase and implement our code in code-mexa/compute_mexa.ipynb.
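
A rough sketch of a MEXA-style alignment score (illustrative only; see the MEXA codebase for the actual computation): given sentence embeddings for parallel snippets in the pivot language and another language, alignment can be measured as the fraction of pairs whose true translation is the nearest neighbour by cosine similarity.

```python
# Illustrative MEXA-style alignment score over parallel embeddings.
import numpy as np

def alignment_score(pivot_emb, other_emb):
    """pivot_emb, other_emb: (n, d) embeddings of n parallel snippets."""
    p = pivot_emb / np.linalg.norm(pivot_emb, axis=1, keepdims=True)
    o = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    sims = p @ o.T                       # (n, n) cosine similarities
    top1 = sims.argmax(axis=1)           # nearest other-language snippet
    return float((top1 == np.arange(len(sims))).mean())

rng = np.random.default_rng(1)
base = rng.normal(size=(10, 32))
aligned = base + 0.05 * rng.normal(size=(10, 32))  # near-identical embeddings
print(alignment_score(base, aligned))              # high for well-aligned embeddings
```

Well-aligned languages score near 1.0; unrelated embeddings score near chance (1/n).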

code-lape

To calculate language-specific neurons, we use LAPE. LAPE uses datasets/raw to identify language-specific neurons within LLMs.
We use the LAPE codebase. Most of the code remains unchanged, but we add code-lape/id-gen.ipynb, which is missing from the original codebase, and modify code-lape/identify.ipynb so that the same number of neurons is selected for each language.
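
The core idea of LAPE-style selection, simplified (an illustrative sketch; the LAPE codebase is the reference implementation): each neuron gets an activation probability per language, and neurons whose probability distribution over languages has low entropy are language-specific. The equal-per-language selection mentioned above corresponds to taking the same number of lowest-entropy neurons for each language.

```python
# Simplified LAPE-style language-specific neuron selection.
import numpy as np

def lape_entropy(act_prob):
    """act_prob: (num_neurons, num_languages), P(neuron fires | language).
    Returns per-neuron entropy of the normalized distribution over languages;
    low entropy = activation concentrated in one language."""
    dist = act_prob / act_prob.sum(axis=1, keepdims=True)
    return -(dist * np.log(dist + 1e-12)).sum(axis=1)

def select_per_language(act_prob, k):
    """Assign each neuron to its highest-probability language, then pick the
    k lowest-entropy neurons per language, so every language gets k neurons."""
    entropy = lape_entropy(act_prob)
    lang = act_prob.argmax(axis=1)
    selected = {}
    for l in range(act_prob.shape[1]):
        idx = np.where(lang == l)[0]
        selected[l] = idx[np.argsort(entropy[idx])][:k].tolist()
    return selected

rng = np.random.default_rng(2)
probs = rng.uniform(0.01, 1.0, size=(100, 3))   # 100 neurons, 3 languages (toy)
print(select_per_language(probs, k=5))
```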

datasets

keywords: We include keywords and built-ins for different programming languages in datasets/keywords. Built-ins include primitive types, macros, modules, collections, containers, and built-in functions, excluding keywords.
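
For Python, for instance, the keyword vs. built-in split described above can be reproduced with the standard library: keywords come from `keyword.kwlist`, and built-ins are the public names in the `builtins` module that are not keywords.

```python
# Python's own keyword / built-in split, mirroring the dataset's distinction.
import builtins
import keyword

keywords = set(keyword.kwlist)
built_ins = {name for name in dir(builtins)
             if not name.startswith('_') and name not in keywords}

print(sorted(keywords)[:5])                    # e.g. ['False', 'None', 'True', 'and', 'as']
print('len' in built_ins, 'for' in built_ins)  # True False
```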

parallel: We store parallel few-shot prompts in different languages in datasets/parallel/prompts. The code to generate parallel data from the MuST-CoST repository is in datasets/parallel/download_parallel.ipynb. For reproducibility, we include the results in datasets/parallel/code_snippets.zip (extract the ZIP file before use).
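
A few-shot translation prompt built from parallel snippets might look like the following. This is a hypothetical shape for illustration; the actual prompts used are stored in datasets/parallel/prompts.

```python
# Hypothetical few-shot translation prompt built from parallel code snippets.
def build_prompt(pairs, query, src='Python', tgt='C++'):
    """pairs: list of (src_snippet, tgt_snippet) few-shot examples.
    query: source-language snippet to translate."""
    parts = []
    for s, t in pairs:
        parts.append(f"{src}:\n{s}\n{tgt}:\n{t}\n")
    parts.append(f"{src}:\n{query}\n{tgt}:\n")   # model completes the translation
    return "\n".join(parts)

demo = [("print('hi')", 'std::cout << "hi";')]
print(build_prompt(demo, "x = 1 + 2"))
```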

raw: The code to generate raw data from the Codeparrot repository and Wikipedia is in datasets/raw/download_raw.ipynb.

citation

If you find our method, code, and data useful for your research, please cite:

@article{kargaran2025programming,
  title={How Programming Concepts and Neurons Are Shared in Code Language Models},
  author={Kargaran, Amir Hossein and Liu, Yihong and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  journal={arXiv preprint},
  year={2025}
}
