Code and data for the paper *How Programming Concepts and Neurons Are Shared in Code Language Models*.
```bash
git clone https://github.com/cisnlp/code-specific-neurons.git
```
To interpret latent embeddings, we use the logit lens. We implement our version of the logit lens in `code-logitlens/compute_lens.ipynb`. It uses `datasets/parallel` and performs a translation task from one programming language to another, setting one language as the input language and the other as the output language. For every token and layer, it records the decoded token along with its probability and rank.
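For intuition, here is a minimal logit-lens sketch, separate from the notebook: it decodes each layer's hidden state through the model's final norm and unembedding matrix. The model name is only an example, and the attribute paths (`model.model.norm`, `model.lm_head`) assume a Llama-style architecture; other model families expose these modules under different names.

```python
# Minimal logit-lens sketch (illustrative; see code-logitlens/compute_lens.ipynb
# for the actual implementation). Assumes a Llama-style architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-7b-hf"  # example model, not necessarily the one used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "def add(a, b):\n    return"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Decode the last position of every layer's hidden state through the LM head.
for layer, hidden in enumerate(outputs.hidden_states):
    logits = model.lm_head(model.model.norm(hidden[:, -1]))
    probs = logits.softmax(dim=-1)
    top_id = int(probs.argmax())
    print(f"layer {layer:2d}: {tokenizer.decode(top_id)!r} (p={probs[0, top_id]:.3f})")
```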
To compute cross-lingual alignment between programming languages, we use MEXA. MEXA uses `datasets/parallel` to compute the alignment between a pivot language and the other languages. We build on the MEXA codebase and implement our code in `code-mexa/compute_mexa.ipynb`.
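As a rough illustration of the underlying idea (not the MEXA implementation itself), an alignment score over embeddings of parallel snippets can be computed as the fraction of pairs that are mutual nearest neighbors. The function name and the assumption of mean-pooled per-layer embeddings are ours:

```python
# Sketch of a MEXA-style alignment score between a pivot and a target language.
import numpy as np

def alignment_score(pivot_emb: np.ndarray, target_emb: np.ndarray) -> float:
    """Row i of each (n_snippets, hidden_dim) matrix embeds the i-th parallel snippet."""
    p = pivot_emb / np.linalg.norm(pivot_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sim = p @ t.T                                    # cosine similarities, (n, n)
    n = sim.shape[0]
    row_best = sim.argmax(axis=1) == np.arange(n)    # true pair wins in its row
    col_best = sim.argmax(axis=0) == np.arange(n)    # ... and in its column
    return float(np.mean(row_best & col_best))

# e.g. alignment_score(python_embs, java_embs) with mean-pooled layer embeddings
```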
To identify language-specific neurons, we use LAPE. LAPE uses `datasets/raw` to identify language-specific neurons within LLMs. We build on the LAPE codebase; the majority of the code remains unchanged, but we add `code-lape/id-gen.ipynb`, which is missing from the original code, and modify `code-lape/identify.ipynb` to ensure the same number of neurons is selected for each language.
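To give a sense of the selection rule, here is a minimal LAPE-style sketch under our own assumptions: each neuron's per-language activation probabilities are normalized into a distribution, and low-entropy neurons (activation concentrated in one language) are the language-specific candidates. The array shapes, `top_k`, and the dummy data are illustrative only; see `code-lape/identify.ipynb` for the real logic.

```python
# LAPE-style selection sketch. act_prob[l, n] is the probability that neuron n
# fires (positive activation) on language l's raw corpus.
import numpy as np

rng = np.random.default_rng(0)
act_prob = rng.random((5, 10_000))  # dummy data: 5 languages, 10k neurons

# Normalize each neuron's probabilities into a distribution over languages and
# score by entropy: low entropy means activation concentrated in few languages.
dist = act_prob / act_prob.sum(axis=0, keepdims=True)
entropy = -(dist * np.log(dist + 1e-12)).sum(axis=0)

# Select the same number of low-entropy neurons per language, assigning each
# neuron to the language with its highest activation probability.
top_k = 100
owner = act_prob.argmax(axis=0)
per_lang = {}
for lang in range(act_prob.shape[0]):
    owned = np.flatnonzero(owner == lang)                       # neurons most active here
    per_lang[lang] = owned[np.argsort(entropy[owned])][:top_k]  # lowest entropy first
```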
- `keywords`: We include keywords and built-ins for different programming languages in `datasets/keywords`. Built-ins include primitive types, macros, modules, collections, containers, and built-in functions, excluding keywords.
- `parallel`: We store parallel few-shot prompts in different languages in `datasets/parallel/prompts`. The code to generate parallel data from the MuST-CoST repository is in `datasets/parallel/download_parallel.ipynb`. For reproducibility, we include the results in `datasets/parallel/code_snippets.zip`; extract the ZIP file before use (see the snippet after this list).
- `raw`: The code to generate raw data from the Codeparrot repository and Wikipedia is in `datasets/raw/download_raw.ipynb`.
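For example, the archive can be extracted in place with Python's standard library (the path follows the repository layout above):

```python
# Extract the bundled parallel code snippets next to the archive.
import zipfile

with zipfile.ZipFile("datasets/parallel/code_snippets.zip") as zf:
    zf.extractall("datasets/parallel/")
```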
If you find our method, code, and data useful for your research, please cite:
```bibtex
@article{kargaran2025programming,
  title={How Programming Concepts and Neurons Are Shared in Code Language Models},
  author={Kargaran, Amir Hossein and Liu, Yihong and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  journal={arXiv preprint},
  year={2025}
}
```