Convert Unicode text to LaTeX and vice versa.
- Overview
- Installation
- unicode2latex
- latex2unicode
- LaTeX Package Requirements
- Known Limitations
- Development
- Testing
- Authors
- License
- Continuous Integration
- Acknowledgements
- Contributing
- Support
unicode2latex converts Unicode characters to their LaTeX equivalents,
making it easy to prepare text with special characters for LaTeX documents.
The package also includes latex2unicode for reverse conversion.
Key features:
- Converts accented characters (é → `\'{e}` or `\acute{e}`)
- Handles Greek letters (α → `\alpha`)
- Converts math symbols (∫ → `\int`, ∞ → `\infty`)
- Expands ligatures (ﬃ → ffi)
- Converts fractions (⅖ → `\sfrac{2}{5}`)
- Handles subscripts and superscripts (x₁ → `x_{1}`, x² → `x^{2}`)
- Supports font modifiers (𝔸 → `\mathbb{A}`)
- Thread-safe for concurrent use
pip install unicode2latex
git clone https://github.com/yourusername/unicode2latex.git
cd unicode2latex
pip install .
- Linux: Full support with optional TeX Live integration
- Windows: Full support using bundled backup files
- macOS: Full support with optional TeX Live integration
The package includes backup copies of required TeX files (unicode-math-table.tex and unicode-math-xetex.sty), so it works out-of-the-box on all platforms without requiring TeX Live installation.
# Convert text directly
unicode2latex "café résumé"
# Output: caf\'{e} r\'{e}sum\'{e}
# Process a file
unicode2latex --input myfile.txt
# Read from stdin
cat myfile.txt | unicode2latex --stdin
# Use math-mode accents
unicode2latex --accent-mode=math "é"
# Output: \acute{e}
Converts both precomposed and combining accents:
| Unicode | Codepoint | LaTeX (text) | LaTeX (math) |
|---|---|---|---|
| è | U+00E8 | `` \`{e} `` | `\grave{e}` |
| é | U+00E9 | `\'{e}` | `\acute{e}` |
| ê | U+00EA | `\^{e}` | `\hat{e}` |
| ñ | U+00F1 | `\~{n}` | `\tilde{n}` |
| ü | U+00FC | `\"{u}` | `\ddot{u}` |
| ā | U+0101 | `\={a}` | `\bar{a}` |
| Ç | U+00C7 | `\c{C}` | `\c{C}` |
Works with both single codepoints (U+00C7) and combining characters (U+0043 + U+0327).
Multiple accents:
ṩ → \.{\d{s}}
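The equivalence between precomposed and combining forms can be illustrated with Python's standard `unicodedata` module. The following is a minimal sketch, not the package's implementation: the function name `accents_to_latex` and the small `TEXT_ACCENTS` table are assumptions made for illustration. It decomposes the input with NFD and wraps each base character in the accent commands, innermost first.

```python
import unicodedata

# Illustrative subset of combining marks; the real package covers far more.
TEXT_ACCENTS = {
    "\u0300": "\\`",   # combining grave
    "\u0301": "\\'",   # combining acute
    "\u0302": "\\^",   # combining circumflex
    "\u0303": "\\~",   # combining tilde
    "\u0304": "\\=",   # combining macron
    "\u0307": "\\.",   # combining dot above
    "\u0308": '\\"',   # combining diaeresis
    "\u0323": "\\d",   # combining dot below
    "\u0327": "\\c",   # combining cedilla
}


def accents_to_latex(text):
    """Hypothetical helper: wrap each base character in text-mode accents."""
    out = []
    # NFD splits precomposed characters into base + combining marks,
    # so U+00C7 and U+0043 U+0327 follow the same path.
    for ch in unicodedata.normalize("NFD", text):
        if ch in TEXT_ACCENTS and out:
            out[-1] = TEXT_ACCENTS[ch] + "{" + out[-1] + "}"
        else:
            out.append(ch)
    return "".join(out)


print(accents_to_latex("\u00c7"))    # precomposed Ç
print(accents_to_latex("C\u0327"))   # C + combining cedilla
print(accents_to_latex("\u1e69"))    # ṩ, two stacked accents
```

Because NFD emits the dot below (U+0323) before the dot above (U+0307), the nesting for ṩ comes out as `\.{\d{s}}`, matching the example above.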
| Unicode | LaTeX |
|---|---|
| α | \alpha |
| β | \beta |
| γ | \gamma |
| Δ | \Delta |
| θ | \theta |
| ϑ | \vartheta |
| π | \pi |
| Σ | \Sigma |
and so on...
| Unicode | LaTeX |
|---|---|
| ∫ | \int |
| ∞ | \infty |
| ∩ | \cap |
| ∪ | \cup |
| ⊂ | \subset |
| ∈ | \in |
| × | \times |
| ÷ | \div |
| ≤ | \leq |
| ≥ | \geq |
| ≠ | \neq |
| ⇎ | \nLeftrightarrow |
and so on...
| Unicode | LaTeX |
|---|---|
| ﬁ | fi |
| ﬂ | fl |
| ﬃ | ffi |
| ﬄ | ffl |
⅓ → \sfrac{1}{3}
⅖ → \sfrac{2}{5}
¾ → \sfrac{3}{4}
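One way to see where these fraction mappings come from is the Unicode compatibility decomposition, which Python exposes through `unicodedata.decomposition`. The sketch below is only an illustration under that assumption; `fraction_to_sfrac` is a hypothetical name, and the package may well use a lookup table instead.

```python
import unicodedata


def fraction_to_sfrac(ch):
    """Hypothetical helper: map a vulgar fraction character to \\sfrac{n}{d}."""
    parts = unicodedata.decomposition(ch).split()
    if not parts or parts[0] != "<fraction>":
        return ch  # not a fraction character, leave untouched
    # Remaining fields are hex code points: numerator, U+2044, denominator.
    expanded = "".join(chr(int(p, 16)) for p in parts[1:])
    num, slash, den = expanded.partition("\u2044")
    if not (slash and num and den):
        return ch  # e.g. U+215F has a numerator only
    return "\\sfrac{" + num + "}{" + den + "}"


for ch in "\u2153\u2156\u00be":
    print(ch, "->", fraction_to_sfrac(ch))
```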
rₐ tᵪ → r_{a} t_{\chi}
x² y³ → x^{2} y^{3}
0⁺ → 0^{+}
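Subscript and superscript characters carry `<sub>` and `<super>` tags in their Unicode decomposition, which suggests a simple way to reproduce the mappings above. This is a rough sketch using only the standard library; `scripts_to_latex` is a hypothetical name and this is not how the package is known to work. Unlike the package, it leaves the base character as-is (so ᵪ would yield `_{χ}` rather than `_{\chi}`) and does not merge consecutive scripts.

```python
import unicodedata


def scripts_to_latex(text):
    """Hypothetical helper: rewrite Unicode sub/superscripts as _{...} / ^{...}."""
    out = []
    for ch in text:
        parts = unicodedata.decomposition(ch).split()
        if len(parts) == 2 and parts[0] in ("<super>", "<sub>"):
            base = chr(int(parts[1], 16))          # e.g. "2" for U+00B2
            op = "^" if parts[0] == "<super>" else "_"
            out.append(op + "{" + base + "}")
        else:
            out.append(ch)
    return "".join(out)


print(scripts_to_latex("x\u00b2 y\u00b3"))  # x² y³
print(scripts_to_latex("0\u207a"))          # 0⁺
print(scripts_to_latex("x\u2081"))          # x₁
```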
| Unicode | LaTeX | Font |
|---|---|---|
| 𝔸 | \mathbb{A} | Blackboard bold |
| ⅅ | \symbbit{D} | Double-struck italic |
| 𝒜 | \mathcal{A} | Calligraphic |
| 𝔄 | \mathfrak{A} | Fraktur |
| 𝐀 | \mathbf{A} | Bold |
| 𝐴 | \mathit{A} | Italic |
Use --convert-quotes/--CQ and --convert-dashes/--CD to normalize typographic punctuation prior to conversion. Both flags default to off so source text remains untouched unless explicitly requested.
Quotes & primes (--convert-quotes):
| Unicode | Code point | ASCII output |
|---|---|---|
| ‘ | U+2018 | ` |
| ’ | U+2019 | ' |
| “ | U+201C | `` |
| ” | U+201D | '' |
| ‹ | U+2039 | < |
| › | U+203A | > |
| ′ | U+2032 | ' |
| ″ | U+2033 | '' |
| ‴ | U+2034 | ''' |
Dashes, hyphens & spaces (--convert-dashes):
| Unicode | Code point | ASCII output |
|---|---|---|
| ‐ | U+2010 | - |
| ‑ | U+2011 | - |
| ‒ | U+2012 | - |
| – | U+2013 | -- |
| — | U+2014 | --- |
| ― | U+2015 | --- |
| (NBSP) | U+00A0 | ~ |
| (en space) | U+2002 | space |
| (em space) | U+2003 | space |
| (thin space) | U+2009 | space |
| (narrow no-break space) | U+202F | space |
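The two tables above amount to a character-to-string translation, which can be sketched in a few lines with `str.translate`. The code below is an illustration built from those tables, not the package's code; `normalize_punctuation` and its keyword arguments are hypothetical names that mirror the two CLI flags.

```python
# Mappings copied from the two tables above (illustrative only).
QUOTES = {
    "\u2018": "`", "\u2019": "'", "\u201c": "``", "\u201d": "''",
    "\u2039": "<", "\u203a": ">",
    "\u2032": "'", "\u2033": "''", "\u2034": "'''",
}
DASHES = {
    "\u2010": "-", "\u2011": "-", "\u2012": "-",
    "\u2013": "--", "\u2014": "---", "\u2015": "---",
    "\u00a0": "~",
    "\u2002": " ", "\u2003": " ", "\u2009": " ", "\u202f": " ",
}


def normalize_punctuation(text, quotes=False, dashes=False):
    """Hypothetical helper: both options default to off, like the CLI flags."""
    table = {}
    if quotes:
        table.update(QUOTES)
    if dashes:
        table.update(DASHES)
    return text.translate(str.maketrans(table))


sample = "\u201cpages 3\u20135\u201d"
print(normalize_punctuation(sample))                            # unchanged
print(normalize_punctuation(sample, quotes=True, dashes=True))  # ``pages 3--5''
```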
unicode2latex [OPTIONS] [text ...]
Input options:
- `text`: Text to convert (positional arguments)
- `--input FILE`, `-i FILE`: Read from file
- `--stdin`, `-s`: Read from stdin
- `--input-encoding ENCODING`, `--input-enc ENCODING`: Force the codec for `--input`/`--stdin`; pass `AUTO` to sniff BOMs and fall back to heuristic detection (default: UTF-8)

Output options:
- `--output FILE`, `-o FILE`: Write to file (default: stdout)

Conversion options:
- `--accent-mode {text,math,auto}`: Accent output mode (default: text)
- `--convert-quotes`, `--CQ`: Normalize Unicode quotes/primes to ASCII equivalents (see Quote & Dash Normalization)
- `--convert-dashes`, `--CD`: Normalize Unicode dashes/non-breaking spaces to ASCII equivalents (see Quote & Dash Normalization)
- `--no-accents`: Do not convert accents
- `--no-fonts`: Do not add font modifiers
- `--prefer-unicode-math`, `-P`: Use unicode-math commands when possible

Other options:
- `--verbose`, `-v`: Verbose output
- `--help`, `-h`: Show help message
By default both unicode2latex and latex2unicode expect UTF-8 input streams. Use --input-encoding (alias --input-enc) whenever a file or pipe uses a different codec:
unicode2latex --input legacy.txt --input-encoding=cp1252
Passing --input-encoding=AUTO tells the converter to read a small probe (up to 256 KiB), look for BOM markers, and fall back to chardet for statistical guesses before streaming the rest of the data through the detected codec. AUTO works for both --input files and --stdin pipes without truncating or buffering the entire payload, making it practical for large pipelines. When the detector cannot decide, the CLI exits with a clear error so you can pick the encoding manually.
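The BOM-sniffing step described above can be sketched with the standard `codecs` module. This is a simplified illustration under stated assumptions: `sniff_bom` is a hypothetical name, the returned codec names are one reasonable choice, and the chardet fallback is deliberately left out.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(probe):
    """Hypothetical helper: return a codec name if the probe starts with a BOM."""
    for bom, codec_name in _BOMS:
        if probe.startswith(bom):
            return codec_name
    return None  # no BOM; a real detector would fall back to heuristics here


probe = "caf\u00e9".encode("utf-16")
codec_name = sniff_bom(probe)
print(codec_name, probe.decode(codec_name))
```

The `utf-8-sig`, `utf-16`, and `utf-32` codecs consume the BOM while decoding, so the decoded text does not start with a stray U+FEFF.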
Windows console note: On stock Windows the active codepage is frequently CP1252 (or another ANSI/OEM page) rather than UTF-8, so both command-line arguments and `--input`/`--stdin` data may arrive in a legacy encoding. Always pass `--input-encoding` (and consider the same value when entering strings on the command line) to avoid mojibake. Recent Windows builds support `chcp 65001`/“Beta: Use Unicode UTF-8”, but the default remains locale-specific.
Additionally, the Windows console host has long-standing issues with printing arbitrary Unicode: certain scalars render as placeholder glyphs or get replaced entirely (see Stack Overflow answer #70013690 for examples and workarounds). Prefer redirecting output to a file (--output or > file.txt) and viewing it in a UTF‑8 aware editor, or switch to Windows Terminal/PowerShell 7 with UTF‑8 enabled.
The --accent-mode option controls how accented characters are converted:
Text mode (the default) uses standard LaTeX text-mode accent commands:
unicode2latex "café"
# Output: caf\'{e}
unicode2latex --accent-mode=text "é è ê ñ ü"
# Output: \'{e} \`{e} \^{e} \~{n} \"{u}
Best for: Regular text, paragraphs, titles
Math mode uses LaTeX math-mode accent commands:
unicode2latex --accent-mode=math "café"
# Output: caf\acute{e}
unicode2latex --accent-mode=math "é è ê ñ ü"
# Output: \acute{e} \grave{e} \hat{e} \tilde{n} \ddot{u}
Best for: Mathematical expressions, equations
Math mode accent mapping:
| Text mode | Math mode |
|---|---|
| `\'` | `\acute` |
| `` \` `` | `\grave` |
| `\^` | `\hat` |
| `\~` | `\tilde` |
| `\"` | `\ddot` |
| `\=` | `\bar` |
| `\.` | `\dot` |
| `\u` | `\breve` |
| `\v` | `\check` |
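To make the mapping concrete, the table can be written as a dictionary and applied to already-converted text with a regular expression. This is a post-processing sketch for illustration only; it is not how the package switches modes internally, and `text_accents_to_math` is a hypothetical name. It handles only the simple `\X{c}` form with a single character argument.

```python
import re

# Text-mode accent command (without backslash) -> math-mode command.
TEXT_TO_MATH = {
    "'": "\\acute", "`": "\\grave", "^": "\\hat", "~": "\\tilde",
    '"': "\\ddot", "=": "\\bar", ".": "\\dot", "u": "\\breve", "v": "\\check",
}

# Matches a backslash, one accent name, then a braced single character.
_ACCENT = re.compile(
    r"\\(" + "|".join(re.escape(k) for k in TEXT_TO_MATH) + r")\{(\w)\}"
)


def text_accents_to_math(latex):
    """Hypothetical helper: rewrite \\'{e} as \\acute{e}, and so on."""
    return _ACCENT.sub(
        lambda m: TEXT_TO_MATH[m.group(1)] + "{" + m.group(2) + "}", latex
    )


print(text_accents_to_math("caf\\'{e}"))
print(text_accents_to_math("\\`{e} \\^{e} \\~{n} \\\"{u}"))
```

Commands such as `\upsilon` are left alone because the pattern requires an opening brace immediately after the accent name.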
Auto-detection of context (currently defaults to text mode):
unicode2latex --accent-mode=auto "é"
# Output: \'{e} (currently defaults to text)
Future implementation will detect mathematical vs. text context automatically.
# Simple text
unicode2latex "Hello, café!"
# Output: Hello, caf\'{e}!
# Greek letters
unicode2latex "The angle θ = π/2"
# Output: The angle \theta{} = \pi{}/2
# Math symbols
unicode2latex "∫₀^∞ e^{-x} dx"
# Output: \int{}_{0}^{\infty{}} e^{-x} dx
# Convert a file
unicode2latex --input document.txt --output document.tex
# Process stdin
cat notes.txt | unicode2latex --stdin > notes.tex
# Chain with other tools
grep "café" data.txt | unicode2latex --stdin
Windows console note: The default `cmd.exe`/PowerShell encodings are typically cp1252, so writing Unicode-rich output directly to stdout may raise `UnicodeEncodeError` or mangle characters. When running `unicode2latex` or `latex2unicode` on Windows, prefer `--output FILE` to generate a UTF-8 file (with BOM, LF line endings) and view it in an editor that understands Unicode. Both `unicode2latex` and `latex2unicode` intentionally use Unix-style `\n` line breaks for portability; convert to CRLF afterwards if a tool absolutely requires it.
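For readers who post-process results from Python, the file format described in the note (UTF-8 with BOM, LF line endings) can be reproduced with the built-in `open`. This is a small sketch of that format only; `write_utf8_bom_lf` and `demo` are hypothetical names and say nothing about how the CLI writes its own output.

```python
import codecs
import os
import tempfile


def write_utf8_bom_lf(path, text):
    """Hypothetical helper: write UTF-8 with a BOM and untranslated LF endings."""
    # "utf-8-sig" prepends the BOM; newline="\n" stops Windows from
    # turning "\n" into "\r\n" on write.
    with open(path, "w", encoding="utf-8-sig", newline="\n") as fh:
        fh.write(text)


def demo():
    """Write a sample file and return its raw bytes for inspection."""
    fd, path = tempfile.mkstemp(suffix=".tex")
    os.close(fd)
    try:
        write_utf8_bom_lf(path, "caf\\'{e}\n\\alpha\n")
        with open(path, "rb") as fh:
            return fh.read()
    finally:
        os.remove(path)


data = demo()
print(data.startswith(codecs.BOM_UTF8), b"\r" in data)
```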
# Math mode accents for equations
unicode2latex --accent-mode=math "Let á be the acceleration"
# Output: Let \acute{a} be the acceleration
# Disable accent conversion (preserve Unicode)
unicode2latex --no-accents "café"
# Output: café
# Disable font modifiers
unicode2latex --no-fonts "The set ℝ"
# Output: The set ℝ
# Prefer unicode-math package commands
unicode2latex --prefer-unicode-math "α + β"
# Output: \alpha + \beta
# Math mode with file input
unicode2latex --accent-mode=math --input equations.txt --output equations.tex
# Process multiple files
for file in *.txt; do
unicode2latex --input "$file" --output "${file%.txt}.tex"
done
# Verbose output for debugging
unicode2latex --verbose --input problematic.txt
Converts LaTeX commands back to Unicode:
# Convert Greek letters
latex2unicode --greek "\\alpha \\beta \\gamma"
# Output: α β γ
# Convert math symbols
latex2unicode --math "\\int \\infty \\cap"
# Output: ∫ ∞ ∩
# Combine both
latex2unicode --greek --math "\\alpha \\cap \\beta"
# Output: α ∩ β
Options:
- `--greek`, `-G`: Convert Greek letters to Unicode
- `--math`, `-M`: Convert math symbols to Unicode
- `--input FILE`, `-i FILE`: Read from file
- `--stdin`, `-s`: Read from stdin
- `--output FILE`, `-o FILE`: Write to file
Note: These are LaTeX packages needed to compile the generated LaTeX output, not to run unicode2latex itself. The unicode2latex Python package works standalone without any TeX installation.
If your output is pure ASCII, standard LaTeX is sufficient:
\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
caf\'{e}
\end{document}
For conversions using unicode-math commands (like \symbbit{D}), use XeLaTeX or LuaLaTeX:
\documentclass{article}
\usepackage{fontspec}
\usepackage{unicode-math}
\begin{document}
\symbbit{D}
\end{document}
Compile with:
xelatex document.tex
# or
lualatex document.tex
For \sfrac fractions, include the xfrac package:
\documentclass{article}
\usepackage{xfrac}
\begin{document}
\sfrac{2}{5}
\end{document}
Some symbols may require additional packages:
\documentclass{article}
\usepackage{amssymb} % For \nLeftrightarrow, etc.
\usepackage{amsmath} % For extended math support
\begin{document}
\nLeftrightarrow
\end{document}
The conversion of Greek letters is not currently differentiated inside and outside mathematical environments. All Greek letters are converted to their LaTeX command form (e.g., \alpha) without distinguishing between text and math mode.
Characters beyond the Basic Multilingual Plane (> U+FFFF), including emoji, are not converted and will be passed through with a warning.
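A quick pre-check for such characters can be done by comparing code points against U+FFFF. The helper below is a hypothetical sketch and is not part of the package. A pure code-point test also flags mathematical alphanumerics such as 𝔸 (U+1D538), which the font-modifier table lists as converted, so the result is best read as a list of candidates to review rather than a definitive list of failures.

```python
def non_bmp_characters(text):
    """Hypothetical helper: list (index, code point) for characters above U+FFFF."""
    return [
        (i, "U+{:04X}".format(ord(ch)))
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


print(non_bmp_characters("caf\u00e9"))         # nothing outside the BMP
print(non_bmp_characters("ok \U0001F600 ok"))  # one emoji at index 3
```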
While basic multiple accents are supported (e.g., ṩ), complex combinations of multiple accents per character may not be handled optimally.
The --accent-mode=auto option is planned but not yet implemented. It currently defaults to text mode.
# Clone the repository
git clone https://github.com/yourusername/unicode2latex.git
cd unicode2latex
# Install development dependencies
pip install -r requirements-test.txt
# Set up git hooks for pre-commit testing
git config --local core.hooksPath .githooks/
# Install the developed code, to check installation process
pip install -e .
The project includes a Makefile for common development tasks:
# Update backup TeX files from system (requires TeX Live)
make update-tex-files
# Run all tests
make test
# Run linter
make lint
# Clean build artifacts
make clean
# Build distribution packages
make build
# Install in development mode
make install
# Show all available targets
make help
The package includes backup copies of unicode-math-table.tex and unicode-math-xetex.sty in the unicode2latex/tex/ directory. These files should be updated when new versions of the unicode-math package are released:
make update-tex-files
This command:
- Locates the current system files using `kpsewhich`
- Copies them to `unicode2latex/tex/` (only if newer)
- Shows file information and version details
A GitHub Actions workflow can also automatically check for updates monthly and create a pull request if the files have changed.
unicode2latex/
├── unicode2latex/ # Main package
│ ├── u2l.py # Core conversion logic
│ ├── FakePlasTeX/ # LaTeX tokenizer
│ └── tex/ # Backup TeX files (NEW)
│ ├── unicode-math-table.tex
│ └── unicode-math-xetex.sty
├── unittests/ # Test suite (285 tests)
│ ├── test_unicode2latex.py # Unicode → LaTeX tests
│ ├── test_latex2unicode.py # LaTeX → Unicode tests
│ ├── test_accents.py # Accent handling tests
│ ├── test_fonts.py # Font modifier tests
│ ├── test_bugs.py # Bug regression tests
│ ├── test_accent_modes.py # Accent mode feature tests
│ ├── test_cli_accent_mode.py # CLI integration tests
│ ├── test_thread_safety.py # Concurrency tests
│ ├── test_bug5_fix.py # Thread safety fix tests
│ └── test_bug6_investigation.py
├── .github/workflows/ # CI/CD
│ ├── test.yaml # Tests on Ubuntu & Windows
│ └── update-tex-files.yaml # Monthly TeX file updates
├── Makefile # Development tasks (NEW)
├── CLAUDE.md # Claude Code guide (NEW)
├── BUGS.md # Known bugs and fixes
├── README.md # This file
├── pyproject.toml # Package configuration
└── setup.cfg # Setup configuration
The project includes a comprehensive test suite with 285 tests.
# Run all tests with current Python version
python3 -m unittest discover unittests -v
# Run tests across multiple Python versions with tox
tox
# Run tests in parallel (faster)
tox -p auto
Test across Python 3.8-3.13:
# Install tox
pip install tox
# Run all environments
tox
# Run specific Python version
tox -e py310
# Run with coverage
tox -e coverage
# Run linting
tox -e lint
See TESTING.md for complete testing documentation.
Test coverage by category:
- Unicode → LaTeX: 88 tests
- LaTeX → Unicode: 38 tests
- Accent handling: 40 tests (+ 35 accent mode tests)
- Font modifiers: 47 tests
- Bug regression: 29 tests
- Thread safety: 17 tests
- CLI integration: 20 tests
- Investigation: 11 tests
All 285 tests passing ✅
This software is Copyright © 2023-2025 Andrea C. G. Mennucci
See file LICENSE.txt in the code distribution.
The code is tested using GitHub Actions on both Ubuntu and Windows, for Python 3.8 up to 3.14.
Test Matrix:
- Operating Systems: Ubuntu 24.04, Windows (latest)
- Python Versions: 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
- Total: 14 test jobs (7 Python versions × 2 OS)
On Linux, tests use system TeX Live installation. On Windows, tests use the bundled backup TeX files, verifying cross-platform compatibility.
A separate GitHub Actions workflow checks for updates to unicode-math files monthly and can be triggered manually. If updates are found, it automatically creates a pull request.
The principal author used Wing IDE, the Python code editor by Wingware, to develop this project, with a license kindly donated by Wingware.
Claude Code by Anthropic was used to debug and enhance this package, including implementing thread safety, adding the accent mode feature, and creating comprehensive test coverage.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run the test suite (`python3 -m unittest discover unittests`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please ensure all tests pass and add new tests for new features.
For bugs and feature requests, please open an issue on the GitHub repository.
Made with ❤️ for the LaTeX community