unicode2latex

Convert Unicode text to LaTeX and vice versa.

Overview

unicode2latex converts Unicode characters to their LaTeX equivalents, making it easy to prepare text with special characters for LaTeX documents. The package also includes latex2unicode for reverse conversion.

Key features:

  • Converts accented characters (é → \'{e} or \acute{e})
  • Handles Greek letters (α → \alpha)
  • Converts math symbols (∫ → \int, ∞ → \infty)
  • Expands ligatures (ﬃ → ffi)
  • Converts fractions (⅖ → \sfrac{2}{5})
  • Handles subscripts and superscripts (x₁ → x_{1}, x² → x^{2})
  • Supports font modifiers (𝔸 → \mathbb{A})
  • Thread-safe for concurrent use

Installation

From PyPI (recommended)

pip install unicode2latex

From source

git clone https://github.com/mennucc/unicode2latex.git
cd unicode2latex
pip install .

Platform Support

  • Linux: Full support with optional TeX Live integration
  • Windows: Full support using bundled backup files
  • macOS: Full support with optional TeX Live integration

The package includes backup copies of required TeX files (unicode-math-table.tex and unicode-math-xetex.sty), so it works out-of-the-box on all platforms without requiring TeX Live installation.

unicode2latex Usage

Quick Start

# Convert text directly
unicode2latex "café résumé"
# Output: caf\'{e} r\'{e}sum\'{e}

# Process a file
unicode2latex --input myfile.txt

# Read from stdin
cat myfile.txt | unicode2latex --stdin

# Use math-mode accents
unicode2latex --accent-mode=math "é"
# Output: \acute{e}

Features

Accents

Converts both precomposed and combining accents:

Unicode  Codepoint  LaTeX (text)  LaTeX (math)
è        U+00E8     \`{e}         \grave{e}
é        U+00E9     \'{e}         \acute{e}
ê        U+00EA     \^{e}         \hat{e}
ñ        U+00F1     \~{n}         \tilde{n}
ü        U+00FC     \"{u}         \ddot{u}
ā        U+0101     \={a}         \bar{a}
Ç        U+00C7     \c{C}         \c{C}

Works with both single codepoints (U+00C7) and combining characters (U+0043 + U+0327).

Multiple accents:

ṩ → \.{\d{s}}
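
A minimal Python sketch of why precomposed and combining input can be treated alike: Unicode NFD normalization splits a precomposed character into its base letter followed by combining marks, and each mark can then be wrapped around what precedes it. This uses only the standard library and is an illustration, not the package's actual implementation; the `COMBINING_TO_TEXT` table and the `accents_to_latex` name are hypothetical, and the table covers only a handful of marks.

```python
import unicodedata

# Hypothetical, partial table: combining mark -> text-mode accent command.
COMBINING_TO_TEXT = {
    "\u0300": "\\`",   # combining grave
    "\u0301": "\\'",   # combining acute
    "\u0302": "\\^",   # combining circumflex
    "\u0303": "\\~",   # combining tilde
    "\u0304": "\\=",   # combining macron
    "\u0307": "\\.",   # combining dot above
    "\u0308": '\\"',   # combining diaeresis
    "\u0323": "\\d",   # combining dot below
    "\u0327": "\\c",   # combining cedilla
}


def accents_to_latex(text):
    """Wrap each known combining mark around the item that precedes it."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        cmd = COMBINING_TO_TEXT.get(ch)
        if cmd is not None and out:
            # Nest the new accent around whatever was built so far for this base.
            out[-1] = f"{cmd}{{{out[-1]}}}"
        else:
            out.append(ch)
    return "".join(out)


print(accents_to_latex("\u00c7"))    # precomposed Ç  -> \c{C}
print(accents_to_latex("C\u0327"))   # C + cedilla   -> \c{C}
print(accents_to_latex("\u1e69"))    # ṩ             -> \.{\d{s}}
```

NFD emits marks in canonical order (dot below before dot above), which is why ṩ nests as \.{\d{s}} in this sketch. Marks missing from the table are left in the output unchanged.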

Greek Letters

Unicode LaTeX
α \alpha
β \beta
γ \gamma
Δ \Delta
θ \theta
ϑ \vartheta
π \pi
Σ \Sigma

and so on...

Math Symbols

Unicode LaTeX
∫ \int
∞ \infty
∩ \cap
∪ \cup
⊂ \subset
∈ \in
× \times
÷ \div
≤ \leq
≥ \geq
≠ \neq
⇎ \nLeftrightarrow

and so on...

Ligatures

Unicode LaTeX
ﬁ fi
ﬂ fl
ﬃ ffi
ﬄ ffl
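
The expansions in this table coincide with Unicode compatibility decomposition, so the idea can be sketched with the standard library alone. The function name `expand_ligatures` is hypothetical, and restricting the range to U+FB00..U+FB06 is an assumption made here to avoid rewriting unrelated compatibility characters; the package may use its own table.

```python
import unicodedata


def expand_ligatures(text):
    """Expand Latin presentation-form ligatures (U+FB00..U+FB06) to plain letters."""
    out = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            # NFKD replaces the ligature with its constituent letters.
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


print(expand_ligatures("o\ufb03ce"))  # office
```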

Fractions

⅓ → \sfrac{1}{3}
⅖ → \sfrac{2}{5}
¾ → \sfrac{3}{4}
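
One way to derive these mappings without a hand-written table is to read the Unicode decomposition of each vulgar fraction, which lists numerator, U+2044 FRACTION SLASH, and denominator. The sketch below assumes that approach; `fraction_to_sfrac` is a hypothetical name and not part of the package's API.

```python
import unicodedata


def fraction_to_sfrac(ch):
    """Return \\sfrac{num}{den} for a vulgar-fraction character, else ch unchanged."""
    parts = unicodedata.decomposition(ch).split()
    if not parts or parts[0] != "<fraction>":
        return ch
    # Remaining fields are hex code points: numerator, U+2044, denominator.
    chars = "".join(chr(int(p, 16)) for p in parts[1:])
    num, slash, den = chars.partition("\u2044")
    if not (slash and num and den):
        return ch  # e.g. U+215F has a numerator but no denominator
    return f"\\sfrac{{{num}}}{{{den}}}"


print(fraction_to_sfrac("\u2156"))  # \sfrac{2}{5}
```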

Subscripts and Superscripts

rₐ tᵪ → r_{a} t_{\chi}
x² y³ → x^{2} y^{3}
0⁺ → 0^{+}
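
The same decomposition data marks superscript and subscript characters with `<super>` and `<sub>` tags, which suggests the following sketch. It is an assumption about how such a conversion could work, not the package's code, and `scripts_to_latex` is a hypothetical name. It does not merge adjacent scripts, and it leaves the base character as-is, so ᵪ would yield _{χ} rather than _{\chi} without a further Greek-letter pass.

```python
import unicodedata


def scripts_to_latex(text):
    """Rewrite single-character superscripts/subscripts as ^{...} / _{...}."""
    out = []
    for ch in text:
        fields = unicodedata.decomposition(ch).split()
        if len(fields) == 2 and fields[0] in ("<super>", "<sub>"):
            base = chr(int(fields[1], 16))
            op = "^" if fields[0] == "<super>" else "_"
            out.append(f"{op}{{{base}}}")
        else:
            out.append(ch)
    return "".join(out)


print(scripts_to_latex("x\u00b2 y\u00b3"))  # x^{2} y^{3}
print(scripts_to_latex("x\u2081"))          # x_{1}
```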

Font Modifiers

Unicode  LaTeX        Font
𝔸        \mathbb{A}   Blackboard bold
ⅅ        \symbbit{D}  Double-struck italic
𝒜        \mathcal{A}  Calligraphic
𝔄        \mathfrak{A} Fraktur
𝐀        \mathbf{A}   Bold
𝐴        \mathit{A}   Italic

Quote & Dash Normalization

Use --convert-quotes/--CQ and --convert-dashes/--CD to normalize typographic punctuation prior to conversion. Both flags default to off so source text remains untouched unless explicitly requested.

Quotes & primes (--convert-quotes):

Unicode  Code point  ASCII output
‘        U+2018      `
’        U+2019      '
“        U+201C      ``
”        U+201D      ''
‹        U+2039      <
›        U+203A      >
′        U+2032      '
″        U+2033      ''
‴        U+2034      '''

Dashes, hyphens & spaces (--convert-dashes):

Unicode                   Code point  ASCII output
‐                         U+2010      -
‑                         U+2011      -
‒                         U+2012      -
–                         U+2013      --
—                         U+2014      ---
―                         U+2015      ---
(NBSP)                    U+00A0      ~
(en space)                U+2002      space
(em space)                U+2003      space
(thin space)              U+2009      space
(narrow no-break space)   U+202F      space
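
The two tables above amount to a character-to-string translation, which can be sketched in Python with `str.translate`. The code points and replacements are copied from the tables; the function name `normalize_punctuation` and its keyword arguments are hypothetical stand-ins for the two flags, not the package's API.

```python
# Replacements copied from the quote and dash tables above.
QUOTES = {
    "\u2018": "`", "\u2019": "'", "\u201c": "``", "\u201d": "''",
    "\u2039": "<", "\u203a": ">",
    "\u2032": "'", "\u2033": "''", "\u2034": "'''",
}
DASHES = {
    "\u2010": "-", "\u2011": "-", "\u2012": "-",
    "\u2013": "--", "\u2014": "---", "\u2015": "---",
    "\u00a0": "~",
    "\u2002": " ", "\u2003": " ", "\u2009": " ", "\u202f": " ",
}


def normalize_punctuation(text, quotes=False, dashes=False):
    """Apply the selected tables; both default to off, like the CLI flags."""
    table = {}
    if quotes:
        table.update(QUOTES)
    if dashes:
        table.update(DASHES)
    return text.translate(str.maketrans(table))


print(normalize_punctuation("\u201cpp. 3\u20135\u201d", quotes=True, dashes=True))
# ``pp. 3--5''
```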

Command-Line Options

unicode2latex [OPTIONS] [text ...]

Input options:

  • text - Text to convert (positional arguments)
  • --input FILE, -i FILE - Read from file
  • --stdin, -s - Read from stdin
  • --input-encoding ENCODING, --input-enc ENCODING - Force the codec for --input/--stdin; pass AUTO to sniff BOMs and fall back to heuristic detection (default: UTF-8)

Output options:

  • --output FILE, -o FILE - Write to file (default: stdout)

Conversion options:

  • --accent-mode {text,math,auto} - Accent output mode (default: text)
  • --convert-quotes, --CQ - Normalize Unicode quotes/primes to ASCII equivalents (see Quote & Dash Normalization)
  • --convert-dashes, --CD - Normalize Unicode dashes/non-breaking spaces to ASCII equivalents (see Quote & Dash Normalization)
  • --no-accents - Do not convert accents
  • --no-fonts - Do not add font modifiers
  • --prefer-unicode-math, -P - Use unicode-math commands when possible

Other options:

  • --verbose, -v - Verbose output
  • --help, -h - Show help message

Input Encoding

By default both unicode2latex and latex2unicode expect UTF-8 input streams. Use --input-encoding (alias --input-enc) whenever a file or pipe uses a different codec:

unicode2latex --input legacy.txt --input-encoding=cp1252

Passing --input-encoding=AUTO tells the converter to read a small probe (up to 256 KiB), look for BOM markers, and fall back to chardet for statistical guesses before streaming the rest of the data through the detected codec. AUTO works for both --input files and --stdin pipes without truncating or buffering the entire payload, making it practical for large pipelines. When the detector cannot decide, the CLI exits with a clear error so you can pick the encoding manually.
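
The BOM-sniffing step can be sketched with the standard `codecs` module. This is a simplified illustration under stated assumptions, not the converter's code: `sniff_encoding` is a hypothetical name, and where the real tool falls back to chardet, the sketch only tests whether the probe is valid UTF-8 and otherwise gives up by returning `None`.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(probe):
    """Guess a codec name from the first bytes of a stream, or return None."""
    for bom, name in _BOMS:
        if probe.startswith(bom):
            return name
    # No BOM: accept UTF-8 if the probe decodes. An incremental decoder with
    # final=False tolerates a multi-byte sequence cut off at the probe boundary.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(probe, final=False)
    except UnicodeDecodeError:
        return None
    return "utf-8"


print(sniff_encoding(codecs.BOM_UTF8 + b"abc"))         # utf-8-sig
print(sniff_encoding("café au lait".encode("cp1252")))  # None
```

Returning `None` corresponds to the case where the detector cannot decide and the encoding has to be chosen manually.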

Windows console note: On stock Windows the active codepage is frequently CP1252 (or another ANSI/OEM page) rather than UTF-8, so both command-line arguments and --input/--stdin data may arrive in a legacy encoding. Always pass --input-encoding (and consider the same value when entering strings on the command line) to avoid mojibake. Recent Windows builds support chcp 65001/“Beta: Use Unicode UTF-8”, but the default remains locale-specific.

Additionally, the Windows console host has long-standing issues with printing arbitrary Unicode: certain scalars render as placeholder glyphs or get replaced entirely (see Stack Overflow answer #70013690 for examples and workarounds). Prefer redirecting output to a file (--output or > file.txt) and viewing it in a UTF‑8 aware editor, or switch to Windows Terminal/PowerShell 7 with UTF‑8 enabled.

Accent Mode

The --accent-mode option controls how accented characters are converted:

Text Mode (default)

Uses standard LaTeX text-mode accent commands:

unicode2latex "café"
# Output: caf\'{e}

unicode2latex --accent-mode=text "é è ê ñ ü"
# Output: \'{e} \`{e} \^{e} \~{n} \"{u}

Best for: Regular text, paragraphs, titles

Math Mode

Uses LaTeX math-mode accent commands:

unicode2latex --accent-mode=math "café"
# Output: caf\acute{e}

unicode2latex --accent-mode=math "é è ê ñ ü"
# Output: \acute{e} \grave{e} \hat{e} \tilde{n} \ddot{u}

Best for: Mathematical expressions, equations

Math mode accent mapping:

Text mode Math mode
\' \acute
\` \grave
\^ \hat
\~ \tilde
\" \ddot
\= \bar
\. \dot
\u \breve
\v \check
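
As a rough illustration of the mapping table, the following Python sketch rewrites text-mode accent commands into their math-mode counterparts with a regular expression. The pairs are copied from the table above; the names `TEXT_TO_MATH` and `text_accents_to_math` are hypothetical, and the sketch only handles the braced form `\'{e}`, which is the form shown in this README's output.

```python
import re

# Text-mode accent character -> math-mode command name (from the table above).
TEXT_TO_MATH = {
    "'": "acute", "`": "grave", "^": "hat", "~": "tilde", '"': "ddot",
    "=": "bar", ".": "dot", "u": "breve", "v": "check",
}

# Match a backslash, one accent character, then an opening brace.
_ACCENT = re.compile(r"\\([" + re.escape("".join(TEXT_TO_MATH)) + r"])\{")


def text_accents_to_math(latex):
    """Replace \\'{...}-style accents with \\acute{...}-style accents."""
    return _ACCENT.sub(lambda m: "\\" + TEXT_TO_MATH[m.group(1)] + "{", latex)


print(text_accents_to_math(r"caf\'{e}"))  # caf\acute{e}
```

Requiring the opening brace right after the accent character keeps longer commands that merely begin with `\u` or `\v`, such as `\upsilon`, from being rewritten.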

Auto Mode (planned)

Auto-detection of context (currently defaults to text mode):

unicode2latex --accent-mode=auto "é"
# Output: \'{e} (currently defaults to text)

Future implementation will detect mathematical vs. text context automatically.

Examples

Basic conversion

# Simple text
unicode2latex "Hello, café!"
# Output: Hello, caf\'{e}!

# Greek letters
unicode2latex "The angle θ = π/2"
# Output: The angle \theta{} = \pi{}/2

# Math symbols
unicode2latex "∫₀^∞ e^{-x} dx"
# Output: \int{}_{0}^{\infty{}} e^{-x} dx

File processing

# Convert a file
unicode2latex --input document.txt --output document.tex

# Process stdin
cat notes.txt | unicode2latex --stdin > notes.tex

# Chain with other tools
grep "café" data.txt | unicode2latex --stdin

Windows console note: The default cmd.exe/PowerShell encodings are typically cp1252, so writing Unicode-rich output directly to stdout may raise UnicodeEncodeError or mangle characters. When running unicode2latex or latex2unicode on Windows, prefer --output FILE to generate a UTF-8 file (with BOM, LF line endings) and view it in an editor that understands Unicode. Both unicode2latex and latex2unicode intentionally use Unix-style \n line breaks for portability; convert to CRLF afterwards if a tool absolutely requires it.
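
For scripts that post-process converter output themselves, the file layout described in this note (UTF-8 with BOM, LF line endings) can be reproduced with Python's built-in `open`. This is a generic sketch; `write_utf8_bom_lf` is a hypothetical helper and not something the package exports.

```python
def write_utf8_bom_lf(path, text):
    """Write text as UTF-8 with a BOM, keeping "\\n" line endings on every OS."""
    # encoding="utf-8-sig" prepends the BOM; newline="\n" disables the
    # platform-specific translation of "\n" to "\r\n" on Windows.
    with open(path, "w", encoding="utf-8-sig", newline="\n") as handle:
        handle.write(text)
```

Reading the file back with `encoding="utf-8-sig"` strips the BOM again.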

Advanced options

# Math mode accents for equations
unicode2latex --accent-mode=math "Let á be the acceleration"
# Output: Let \acute{a} be the acceleration

# Disable accent conversion (preserve Unicode)
unicode2latex --no-accents "café"
# Output: café

# Disable font modifiers
unicode2latex --no-fonts "The set ℝ"
# Output: The set ℝ

# Prefer unicode-math package commands
unicode2latex --prefer-unicode-math "α + β"
# Output: \alpha + \beta

Combining options

# Math mode with file input
unicode2latex --accent-mode=math --input equations.txt --output equations.tex

# Process multiple files
for file in *.txt; do
    unicode2latex --input "$file" --output "${file%.txt}.tex"
done

# Verbose output for debugging
unicode2latex --verbose --input problematic.txt

latex2unicode Usage

Converts LaTeX commands back to Unicode:

# Convert Greek letters
latex2unicode --greek "\\alpha \\beta \\gamma"
# Output: α β γ

# Convert math symbols
latex2unicode --math "\\int \\infty \\cap"
# Output: ∫ ∞ ∩

# Combine both
latex2unicode --greek --math "\\alpha \\cap \\beta"
# Output: α ∩ β

Options:

  • --greek, -G - Convert Greek letters to Unicode
  • --math, -M - Convert math symbols to Unicode
  • --input FILE, -i FILE - Read from file
  • --stdin, -s - Read from stdin
  • --output FILE, -o FILE - Write to file

LaTeX Package Requirements

Note: These are LaTeX packages needed to compile the generated LaTeX output, not to run unicode2latex itself. The unicode2latex Python package works standalone without any TeX installation.

ASCII Output

If your output is pure ASCII, standard LaTeX is sufficient:

\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
caf\'{e}
\end{document}

Unicode-Math Output

For conversions using unicode-math commands (like \symbbit{D}), use XeLaTeX or LuaLaTeX:

\documentclass{article}
\usepackage{fontspec}
\usepackage{unicode-math}
\begin{document}
\symbbit{D}
\end{document}

Compile with:

xelatex document.tex
# or
lualatex document.tex

Fractions

For \sfrac fractions, include the xfrac package:

\documentclass{article}
\usepackage{xfrac}
\begin{document}
\sfrac{2}{5}
\end{document}

Special Symbols

Some symbols may require additional packages:

\documentclass{article}
\usepackage{amssymb}    % For \nLeftrightarrow, etc.
\usepackage{amsmath}    % For extended math support
\begin{document}
\nLeftrightarrow
\end{document}

Known Limitations

Greek Letter Context

The converter does not currently distinguish Greek letters inside mathematical environments from those in running text. Every Greek letter is converted to its LaTeX command form (e.g., \alpha), regardless of whether the surrounding context is text or math mode.

Emoji and High Unicode

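Characters beyond the Basic Multilingual Plane (> U+FFFF) that have no LaTeX equivalent, such as emoji, are not converted and are passed through with a warning. Mathematical alphanumerics such as 𝔸 also lie above U+FFFF but are handled as described under Font Modifiers.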
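
To check in advance which characters of an input lie above U+FFFF, a short Python scan is enough. The helper name `find_astral` is hypothetical, and the sketch only reports code points; it does not know which of them the converter can actually map.

```python
def find_astral(text):
    """Return (character, "U+XXXX") pairs for code points above U+FFFF, sorted."""
    astral = sorted({ch for ch in text if ord(ch) > 0xFFFF})
    return [(ch, f"U+{ord(ch):04X}") for ch in astral]


print(find_astral("ok \U0001F600"))  # [('😀', 'U+1F600')]
```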

Multiple Accents

While basic multiple accents are supported (e.g., ṩ), complex combinations of multiple accents per character may not be handled optimally.

Auto-Detection

The --accent-mode=auto option is planned but not yet implemented. It currently defaults to text mode.

Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/mennucc/unicode2latex.git
cd unicode2latex

# Install development dependencies
pip install -r requirements-test.txt

# Set up git hooks for pre-commit testing
git config --local core.hooksPath .githooks/

# Install the developed code, to check installation process
pip install -e .

Makefile Targets

The project includes a Makefile for common development tasks:

# Update backup TeX files from system (requires TeX Live)
make update-tex-files

# Run all tests
make test

# Run linter
make lint

# Clean build artifacts
make clean

# Build distribution packages
make build

# Install in development mode
make install

# Show all available targets
make help

Updating Backup TeX Files

The package includes backup copies of unicode-math-table.tex and unicode-math-xetex.sty in the unicode2latex/tex/ directory. These files should be updated when new versions of the unicode-math package are released:

make update-tex-files

This command:

  • Locates the current system files using kpsewhich
  • Copies them to unicode2latex/tex/ (only if newer)
  • Shows file information and version details

A GitHub Actions workflow can also automatically check for updates monthly and create a pull request if the files have changed.

Project Structure

unicode2latex/
├── unicode2latex/              # Main package
│   ├── u2l.py                  # Core conversion logic
│   ├── FakePlasTeX/            # LaTeX tokenizer
│   └── tex/                    # Backup TeX files (NEW)
│       ├── unicode-math-table.tex
│       └── unicode-math-xetex.sty
├── unittests/                  # Test suite (285 tests)
│   ├── test_unicode2latex.py   # Unicode → LaTeX tests
│   ├── test_latex2unicode.py   # LaTeX → Unicode tests
│   ├── test_accents.py         # Accent handling tests
│   ├── test_fonts.py           # Font modifier tests
│   ├── test_bugs.py            # Bug regression tests
│   ├── test_accent_modes.py    # Accent mode feature tests
│   ├── test_cli_accent_mode.py # CLI integration tests
│   ├── test_thread_safety.py   # Concurrency tests
│   ├── test_bug5_fix.py        # Thread safety fix tests
│   └── test_bug6_investigation.py
├── .github/workflows/          # CI/CD
│   ├── test.yaml               # Tests on Ubuntu & Windows
│   └── update-tex-files.yaml   # Monthly TeX file updates
├── Makefile                    # Development tasks (NEW)
├── CLAUDE.md                   # Claude Code guide (NEW)
├── BUGS.md                     # Known bugs and fixes
├── README.md                   # This file
├── pyproject.toml              # Package configuration
└── setup.cfg                   # Setup configuration

Testing

The project includes a comprehensive test suite with 285 tests.

Quick Start

# Run all tests with current Python version
python3 -m unittest discover unittests -v

# Run tests across multiple Python versions with tox
tox

# Run tests in parallel (faster)
tox -p auto

Testing with Tox

Test across Python 3.8-3.13:

# Install tox
pip install tox

# Run all environments
tox

# Run specific Python version
tox -e py310

# Run with coverage
tox -e coverage

# Run linting
tox -e lint

See TESTING.md for complete testing documentation.

Test coverage by category:

  • Unicode → LaTeX: 88 tests
  • LaTeX → Unicode: 38 tests
  • Accent handling: 40 tests (+ 35 accent mode tests)
  • Font modifiers: 47 tests
  • Bug regression: 29 tests
  • Thread safety: 17 tests
  • CLI integration: 20 tests
  • Investigation: 11 tests

All 285 tests passing

Authors

This software is Copyright © 2023-2025 Andrea C. G. Mennucci

License

See file LICENSE.txt in the code distribution.

Continuous Integration

The code is tested using GitHub Actions on both Ubuntu and Windows, for Python 3.8 up to 3.14.

Test Matrix:

  • Operating Systems: Ubuntu 24.04, Windows (latest)
  • Python Versions: 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
  • Total: 14 test jobs (7 Python versions × 2 OS)

On Linux, tests use system TeX Live installation. On Windows, tests use the bundled backup TeX files, verifying cross-platform compatibility.

Automated Maintenance

A separate GitHub Actions workflow checks for updates to unicode-math files monthly and can be triggered manually. If updates are found, it automatically creates a pull request.

Acknowledgements

The principal author used the Python code editor Wing IDE by Wingware to develop this project, with a license kindly donated by Wingware.

Claude Code by Anthropic was used to debug and enhance this package, including implementing thread safety, adding the accent mode feature, and creating comprehensive test coverage.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run the test suite (python3 -m unittest discover unittests)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Please ensure all tests pass and add new tests for new features.

Support

For bugs and feature requests, please open an issue on the GitHub repository.


Made with ❤️ for the LaTeX community
