ferromark

License: MIT · Rust 1.85+

Markdown to HTML at 309 MiB/s. Faster than pulldown-cmark, md4c (C), and comrak. Passes all 652 CommonMark spec tests. Every GFM extension included.

Quick start

let html = ferromark::to_html("# Hello\n\n**World**");

One function call, no setup. When allocation pressure matters:

let mut buffer = Vec::new();
ferromark::to_html_into("# Reuse me", &mut buffer);
// buffer survives across calls — zero repeated allocation
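
The reuse pattern can be sketched without the crate. `render_into` below is a hypothetical stand-in with the same call shape as `to_html_into`; it is not ferromark's implementation. The explicit `clear()` is also an assumption, since this README does not say whether `to_html_into` appends to the buffer or resets it.

```rust
/// Hypothetical stand-in with the same call shape as `ferromark::to_html_into`.
/// It only wraps the input in a paragraph; it is not a Markdown parser.
fn render_into(input: &str, out: &mut Vec<u8>) {
    out.extend_from_slice(b"<p>");
    out.extend_from_slice(input.as_bytes());
    out.extend_from_slice(b"</p>\n");
}

/// Renders every document through one shared buffer and returns the total
/// number of output bytes plus the buffer's final capacity.
fn render_all(docs: &[&str]) -> (usize, usize) {
    let mut buffer: Vec<u8> = Vec::new();
    let mut total = 0;
    for doc in docs {
        // `clear` resets the length but keeps the allocation, so later
        // documents reuse the capacity grown by earlier ones.
        buffer.clear();
        render_into(doc, &mut buffer);
        total += buffer.len();
    }
    (total, buffer.capacity())
}
```

After the largest document has been rendered once, smaller documents fit into the existing allocation and trigger no further growth.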

Benchmarks

Numbers, not adjectives. Apple Silicon (M-series), February 2026. All parsers run with GFM tables, strikethrough, and task lists enabled. Output buffers reused where APIs allow. Non-PGO binaries for a fair comparison.

CommonMark 5 KB (wiki-style, mixed content with tables)

| Parser | Throughput | vs ferromark |
| --- | --- | --- |
| ferromark | 289.9 MiB/s | baseline |
| pulldown-cmark | 247.7 MiB/s | 0.85x |
| md4c (C) | 242.3 MiB/s | 0.84x |
| comrak | 73.7 MiB/s | 0.25x |

CommonMark 50 KB (same style, scaled)

| Parser | Throughput | vs ferromark |
| --- | --- | --- |
| ferromark | 309.3 MiB/s | baseline |
| pulldown-cmark | 271.7 MiB/s | 0.88x |
| md4c (C) | 247.4 MiB/s | 0.80x |
| comrak | 76.0 MiB/s | 0.25x |

Depending on document size, ferromark is 14 to 17% faster than pulldown-cmark, 20 to 25% faster than md4c, and about 4x faster than comrak.

The fixtures are synthetic wiki-style documents with paragraphs, lists, code blocks, and tables. Nothing cherry-picked. Run them yourself: cargo bench --bench comparison
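
The ratio column and the "percent faster" figures are two views of the same numbers. A small sketch of the arithmetic, using throughput values from the tables above (the function names are illustrative, not part of the benchmark harness):

```rust
/// Throughput of a parser relative to the baseline (the "vs ferromark" column).
fn relative(parser_mib_s: f64, baseline_mib_s: f64) -> f64 {
    parser_mib_s / baseline_mib_s
}

/// How much faster the baseline is than the other parser, in percent.
fn percent_faster(baseline_mib_s: f64, parser_mib_s: f64) -> f64 {
    (baseline_mib_s / parser_mib_s - 1.0) * 100.0
}

/// Rounds to two decimals, matching the precision used in the tables.
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

For pulldown-cmark, `round2(relative(247.7, 289.9))` gives 0.85 for the 5 KB fixture, while `percent_faster(289.9, 247.7)` is about 17 and `percent_faster(309.3, 271.7)` is about 14 for the 50 KB fixture.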

What you get

Full CommonMark: 652/652 spec tests pass. No filtering, no exceptions.

All five GFM extensions: Tables, strikethrough, task lists, autolink literals, disallowed raw HTML.

Beyond GFM: Footnotes, front matter extraction (---/+++), heading IDs (GitHub-compatible slugs), math spans ($/$$), highlight/mark syntax (==text==), superscript (^text^), subscript (~text~), and callouts (> [!NOTE], > [!WARNING], ...).

MDX support (opt-in via mdx feature): Segment and render .mdx files without a JavaScript toolchain. Covers 90%+ of real-world MDX patterns in Next.js, Docusaurus, and Astro.

15 feature flags to turn on exactly what you need:

allow_html · allow_link_refs · tables · strikethrough · highlight · superscript · subscript · task_lists
autolink_literals · disallowed_raw_html · footnotes · front_matter
heading_ids · math · callouts

Syntax note: ferromark uses ~~text~~ for strikethrough, ~text~ for subscript, and ^text^ for superscript. Single-tilde strikethrough is intentionally not supported.
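
The delimiter conventions in the syntax note boil down to a lookup on the delimiter character and the run length. A minimal sketch, with hypothetical names (`Style`, `classify`) that are not part of ferromark's API; a real parser also has to find a matching closer before applying a style:

```rust
/// Inline styles selected by the delimiter conventions in the syntax note.
#[derive(Debug, PartialEq)]
enum Style {
    Strikethrough, // ~~text~~
    Subscript,     // ~text~
    Superscript,   // ^text^
}

/// Maps a delimiter run (one repeated character) to the style it would open.
/// Illustrative only: runs not listed in the syntax note map to `None`.
fn classify(delimiter: char, run_len: usize) -> Option<Style> {
    match (delimiter, run_len) {
        ('~', 2) => Some(Style::Strikethrough),
        ('~', 1) => Some(Style::Subscript),
        ('^', 1) => Some(Style::Superscript),
        _ => None,
    }
}
```

Because a single tilde is reserved for subscript, `classify('~', 1)` can never yield strikethrough, which is why single-tilde strikethrough is unsupported.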

Trade-offs

ferromark is built for one job: turning Markdown into HTML as fast as possible. That focus means some things it deliberately skips:

  • No AST access. You can't walk a syntax tree or write custom renderers against parsed nodes. If you need that, pulldown-cmark's iterator model or comrak's AST are better fits.
  • No source maps. No byte-offset tracking for mapping HTML back to Markdown positions.
  • HTML only. No XML, no CommonMark round-tripping, no alternative output formats.

These aren't planned. They'd compromise the streaming architecture that makes ferromark fast.

MDX support

MDX is the standard for component-driven docs in Next.js, Docusaurus, and Astro. Processing it usually requires a full JavaScript toolchain — Node.js, acorn, babel, the works.

ferromark takes a different approach: segment .mdx files into typed blocks and render them at native speed. No JS runtime. No AST.

ferromark = { version = "0.1", features = ["mdx"] }

Render — one call, full output

render() assembles the final output automatically: Markdown segments become HTML, JSX and expressions pass through unchanged, ESM and front matter are extracted separately.

use ferromark::mdx::render;

let input = r#"import { Card } from './card'

---
title: Hello
---

# Hello World

<Card title="Example">

Markdown **inside** a component.

</Card>

{new Date().getFullYear()}
"#;

let output = render(input);
// output.body        — HTML with JSX/expressions passed through
// output.esm         — vec!["import { Card } from './card'\n"]
// output.front_matter — Some("title: Hello\n")

Use render_with_options() for custom Markdown settings (heading IDs, math, footnotes, etc.).
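
To illustrate what front matter extraction means, here is a standalone sketch that splits a `---` or `+++` fenced block from the start of a document. `split_front_matter` is a hypothetical helper, not ferromark's code, and it only handles a block on the very first line, unlike the MDX example above where the front matter follows an import.

```rust
/// Splits a leading front matter block from the rest of the document.
/// Returns `(front_matter, body)`; `front_matter` is `None` when the input
/// does not start with a `---` or `+++` fence line or the fence never closes.
fn split_front_matter(input: &str) -> (Option<&str>, &str) {
    for fence in ["---", "+++"] {
        // The opening fence must be the entire first line.
        let Some(rest) = input.strip_prefix(fence).and_then(|r| r.strip_prefix('\n')) else {
            continue;
        };
        // Scan the remaining lines for the closing fence.
        let mut offset = 0;
        for line in rest.split_inclusive('\n') {
            if line.trim_end_matches(|c| c == '\r' || c == '\n') == fence {
                let body = &rest[offset + line.len()..];
                return (Some(&rest[..offset]), body);
            }
            offset += line.len();
        }
    }
    (None, input)
}
```

Both return values borrow from the input, so nothing is copied; the front matter slice keeps its trailing newline, matching the `Some("title: Hello\n")` shape shown above.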

Component — ready-to-use JSX module

to_component() wraps the output as a complete JSX/TSX module with a named export. Works with React 19, Preact, Solid, and any JSX framework.

let output = render(input);
let tsx = output.to_component("HelloWorld");

Generated module:

import { Card } from './card'

export function HelloWorld() {
  return (
    <>
      <h1 id="hello-world">Hello World</h1>
      <Card title="Example">
        <p>Markdown <strong>inside</strong> a component.</p>
      </Card>
      {new Date().getFullYear()}
    </>
  );
}
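
The `id="hello-world"` attribute on the heading follows the GitHub slug convention. A rough approximation of that convention, assuming the usual rules (lowercase, spaces to hyphens, punctuation dropped); `slugify` is a hypothetical helper and may differ from ferromark's slugger in edge cases:

```rust
/// Approximates a GitHub-style heading slug: lowercase the text, turn spaces
/// into hyphens, keep letters, digits, `-` and `_`, and drop everything else.
/// Suffixes for duplicate headings (`-1`, `-2`, ...) are not handled here.
fn slugify(heading: &str) -> String {
    let mut slug = String::with_capacity(heading.len());
    for ch in heading.trim().chars() {
        if ch == ' ' {
            slug.push('-');
        } else if ch.is_alphanumeric() || ch == '-' || ch == '_' {
            // `to_lowercase` may yield more than one char for some letters.
            slug.extend(ch.to_lowercase());
        }
    }
    slug
}
```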

Segment — low-level control

When you need full control over each block, use segment() directly:

use ferromark::mdx::{segment, Segment};

for seg in segment(input) {
    match seg {
        Segment::Esm(s)              => { /* import/export — pass through */ }
        Segment::Markdown(s)         => { /* parse with ferromark::to_html(s) */ }
        Segment::JsxBlockOpen(s)     => { /* <Component> */ }
        Segment::JsxBlockClose(s)    => { /* </Component> */ }
        Segment::JsxBlockSelfClose(s)=> { /* <Component /> */ }
        Segment::Expression(s)       => { /* {expression} */ }
    }
}

The segmenter handles JSX attribute parsing (strings, expressions, spreads), brace-depth tracking (with string/comment/template-literal awareness), fragment syntax, member expressions (<Foo.Bar>), and multiline tags. Invalid constructs fall back to Markdown — no panics, always valid output.

Full example: cargo run --features mdx --example mdx_segment
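
The brace-depth tracking mentioned above can be sketched in a few lines. This is a simplified illustration, not the code in src/mdx/expr.rs: `expression_end` is a hypothetical name, and the sketch ignores comments and `${...}` nesting inside template literals, both of which the real segmenter is described as handling.

```rust
/// Returns the byte index just past the `}` that closes the expression
/// starting at the first byte of `input`, or `None` if the input does not
/// start with `{` or the brace never closes.
/// Braces inside quoted strings and template literals are ignored.
fn expression_end(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    if bytes.first() != Some(&b'{') {
        return None;
    }
    let mut depth = 0usize;
    let mut quote: Option<u8> = None; // the currently open quote byte, if any
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        match quote {
            Some(q) => {
                if b == b'\\' {
                    i += 1; // skip the escaped byte
                } else if b == q {
                    quote = None;
                }
            }
            None => match b {
                b'"' | b'\'' | b'`' => quote = Some(b),
                b'{' => depth += 1,
                b'}' => {
                    depth -= 1;
                    if depth == 0 {
                        return Some(i + 1);
                    }
                }
                _ => {}
            },
        }
        i += 1;
    }
    None
}
```

Returning `None` for an unclosed brace is what lets a caller fall back to treating the text as Markdown instead of failing.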

Scope and coverage

The segmenter covers the block-level MDX patterns that make up 90%+ of real-world .mdx files: imports at the top, components wrapping content, expressions between paragraphs. This is what a typical Docusaurus, Next.js, or Astro page looks like — and it works out of the box.

What the segmenter deliberately skips — and why that's fine for most use cases:

| What | Our approach | When it matters |
| --- | --- | --- |
| Inline JSX (`text <em>here</em>`) | Stays inside Markdown segments | Only if you mix JSX and prose on the same line inside a paragraph; rare in practice |
| JS validation | Heuristic detection (keyword + brace counting) instead of acorn/swc | Only if you need to report syntax errors in user-authored MDX at parse time |
| Markdown grammar | Standard CommonMark/GFM rules | Official mdxjs disables indented code and HTML syntax; relevant if your content relies on `<div>` being JSX, not HTML |
| Container nesting | `> <Component>` stays Markdown | Only if you put JSX inside blockquotes or list items; uncommon |
| TypeScript generics | `<Component<T>>` not parsed | Only relevant for TSX-heavy content pages; very rare in docs |
| Error reporting | Silent fallback to Markdown | Means broken JSX renders as text instead of failing; arguably safer for content pipelines |

The full @mdx-js/mdx compiler exists to produce a React component tree from MDX. It needs a JavaScript parser because it compiles to JSX. ferromark's segmenter exists to answer a simpler question: where does the Markdown stop and the JSX start? That question doesn't need a JS runtime.
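
A keyword-and-prefix heuristic is enough to answer that question for the common block-level cases. The sketch below is an assumption-laden illustration, not ferromark's splitter: `LineKind` and `classify_line` are hypothetical, only capitalized tag names count as components, and fragments (`<>`) and multiline tags are left out.

```rust
/// Coarse block kinds, loosely mirroring the `Segment` variants shown earlier.
#[derive(Debug, PartialEq)]
enum LineKind {
    Esm,
    JsxOpen,
    JsxClose,
    Expression,
    Markdown,
}

/// Classifies the first line of a block by its leading characters only.
/// Anything that is not clearly ESM, JSX, or an expression stays Markdown.
fn classify_line(line: &str) -> LineKind {
    // Treat only capitalized names as components, so `<div>` stays HTML.
    let starts_component =
        |rest: &str| rest.chars().next().map_or(false, |c| c.is_ascii_uppercase());

    if line.starts_with("import ") || line.starts_with("export ") {
        LineKind::Esm
    } else if let Some(rest) = line.strip_prefix("</") {
        if starts_component(rest) { LineKind::JsxClose } else { LineKind::Markdown }
    } else if let Some(rest) = line.strip_prefix('<') {
        if starts_component(rest) { LineKind::JsxOpen } else { LineKind::Markdown }
    } else if line.starts_with('{') {
        LineKind::Expression
    } else {
        LineKind::Markdown
    }
}
```

The known weakness of this kind of heuristic is prose that happens to start with a keyword, such as a paragraph beginning with "import "; a production segmenter needs extra checks for that.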

For the detailed technical spec, see src/mdx/mod.rs.

How it works

No AST. Block events stream from the scanner to the HTML writer with nothing in between.

Input bytes (&[u8])
       │
       ▼
   Block parser (line-oriented, memchr-driven)
       │ emits BlockEvent stream
       ▼
   Inline parser (mark collection → resolution → emit)
       │ emits InlineEvent stream
       ▼
   HTML writer (direct buffer writes)
       │
       ▼
   Output (Vec<u8>)
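
A toy version of this pipeline shows the shape of the idea: a line scanner emits events that carry ranges into the input, and a writer turns each event into HTML immediately. All names below (`Range`, `BlockEvent`, `scan_blocks`, `to_html_sketch`) are illustrative and do not match ferromark's internals. The sketch handles only ATX headings and one-line paragraphs, skips inline parsing and HTML escaping, and assumes inputs smaller than 4 GiB.

```rust
/// Half-open byte range into the input, stored as two u32s (8 bytes).
#[derive(Clone, Copy)]
struct Range {
    start: u32,
    end: u32,
}

/// Block events carry ranges, not copied text.
enum BlockEvent {
    Heading { level: u8, text: Range },
    Paragraph { text: Range },
}

/// Scans `input` line by line and hands each event to `emit` immediately,
/// so no tree or event list is ever materialized.
fn scan_blocks(input: &[u8], mut emit: impl FnMut(BlockEvent)) {
    let mut start = 0;
    while start < input.len() {
        // Find the line boundary (a memchr call in an optimized scanner).
        let end = input[start..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(input.len(), |p| start + p);
        let line = &input[start..end];
        let hashes = line.iter().take_while(|&&b| b == b'#').count();
        if (1..=6).contains(&hashes) && line.get(hashes) == Some(&b' ') {
            let text = Range { start: (start + hashes + 1) as u32, end: end as u32 };
            emit(BlockEvent::Heading { level: hashes as u8, text });
        } else if !line.is_empty() {
            emit(BlockEvent::Paragraph { text: Range { start: start as u32, end: end as u32 } });
        }
        start = end + 1;
    }
}

/// Writes `<tag>text</tag>\n` directly into the output buffer.
fn write_block(out: &mut Vec<u8>, tag: &[u8], text: &[u8]) {
    out.push(b'<');
    out.extend_from_slice(tag);
    out.push(b'>');
    out.extend_from_slice(text);
    out.extend_from_slice(b"</");
    out.extend_from_slice(tag);
    out.extend_from_slice(b">\n");
}

/// Connects the scanner to the writer with nothing in between.
fn to_html_sketch(input: &str) -> Vec<u8> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len() * 2);
    scan_blocks(bytes, |event| match event {
        BlockEvent::Heading { level, text } => {
            let tag = [b'h', b'0' + level];
            write_block(&mut out, &tag, &bytes[text.start as usize..text.end as usize]);
        }
        BlockEvent::Paragraph { text } => {
            write_block(&mut out, b"p", &bytes[text.start as usize..text.end as usize]);
        }
    });
    out
}
```

The only allocation is the output buffer; every event is consumed as soon as it is produced.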

What makes this fast in practice:

  • Block scanning runs on memchr for line boundaries. Container state is a compact stack, not a tree.
  • Inline parsing has three phases: collect delimiter marks, resolve precedence (code spans, math, links, emphasis, strikethrough, subscript, superscript, highlight), emit. No backtracking.
  • Emphasis resolution uses the CommonMark modulo-3 rule with a delimiter stack instead of expensive rescans.
  • SIMD scanning (NEON on ARM) detects special characters in inline content.
  • Zero-copy references: events carry Range pointers into the input, not copied strings.
  • Compact events: 24 bytes each, cache-line friendly.
  • Hot/cold annotation: #[inline] on tight loops, #[cold] on error paths, table-driven byte classification.
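
Table-driven byte classification, the last technique in the list, can be sketched as a 256-entry lookup built at compile time. The set of special bytes below is an assumption chosen for illustration, and `next_special` is a hypothetical scalar scan, not ferromark's NEON path.

```rust
/// 256-entry lookup table: `true` for bytes that may start an inline construct.
/// The byte set is an assumption chosen for illustration.
const SPECIAL: [bool; 256] = build_table(b"*_`[]!<&\\~^$=");

/// Builds the table at compile time so the hot loop is a single indexed load.
const fn build_table(specials: &[u8]) -> [bool; 256] {
    let mut table = [false; 256];
    let mut i = 0;
    while i < specials.len() {
        table[specials[i] as usize] = true;
        i += 1;
    }
    table
}

/// Index of the first special byte at or after `from`, using one table lookup
/// per byte. Returns `None` if there is none or `from` is out of bounds.
#[inline]
fn next_special(input: &[u8], from: usize) -> Option<usize> {
    input
        .get(from..)?
        .iter()
        .position(|&b| SPECIAL[b as usize])
        .map(|p| from + p)
}
```

Everything between two special bytes is plain text and can be copied to the output in one slice write.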

Design principles

  • Linear time. No regex, no backtracking, no quadratic blowup on adversarial input.
  • Low allocation pressure. Compact events, range references, reusable output buffers.
  • Operational safety. Depth and size limits guard against pathological nesting.
  • Small dependency surface. Minimal crates, straightforward integration.

Detailed parser comparison

How ferromark compares to the other three top-tier parsers across architecture, features, and output. Ratings use a 4-level heatmap focused on end-to-end Markdown-to-HTML throughput. Scoring is relative per row, so each row has at least one top mark.

Legend: 🟩 strongest   🟨 close behind   🟧 notable tradeoffs   🟥 weakest

Ferromark optimization backlog: docs/arch/ARCH-PLAN-001-performance-opportunities.md

Performance-critical architecture and memory

| Feature | ferromark | md4c | pulldown-cmark | comrak | Notes |
| --- | --- | --- | --- | --- | --- |
| Parser model (streaming, no AST) | 🟩 | 🟩 | 🟨 | 🟥 | Streaming parsers emit output as they scan, avoiding intermediate trees. ferromark and md4c stream directly; pulldown-cmark uses a pull iterator; comrak builds an AST. |
| API overhead profile | 🟩 | 🟩 | 🟨 | 🟥 | Measures overhead on straight Markdown-to-HTML throughput. md4c callbacks and ferromark streaming events are lean; pulldown-cmark pull iterators are close; comrak's AST model adds more overhead for this workload. |
| Parse/render separation | 🟨 | 🟩 | 🟩 | 🟧 | Clear separation lets renderers be swapped or tuned. md4c and pulldown-cmark separate parse and render clearly; ferromark is mostly separated; comrak leans on AST-based renderers. |
| Inline parsing pipeline | 🟩 | 🟨 | 🟨 | 🟥 | Multi-phase inline parsing (collect, resolve, emit) keeps the hot path linear. ferromark uses this approach; md4c and pulldown-cmark are optimized byte scanners; comrak does more AST bookkeeping. |
| Emphasis matching efficiency | 🟩 | 🟨 | 🟨 | 🟥 | Stack-based algorithms reduce rescans on text-heavy documents. ferromark uses modulo-3 stacks; md4c and pulldown-cmark are optimized; comrak pays AST overhead. |
| Link reference processing cost | 🟩 | 🟩 | 🟩 | 🟨 | Link labels need normalization. ferromark, md4c, and pulldown-cmark minimize allocations; comrak handles more feature paths. |
| Zero-copy text handling | 🟩 | 🟨 | 🟨 | 🟥 | Text slices that point directly into input reduce allocation and copy costs. ferromark uses ranges; md4c and pulldown-cmark borrow slices; comrak allocates AST nodes. |
| Allocation pressure (hot path) | 🟩 | 🟩 | 🟨 | 🟥 | Fewer allocations in tight loops means better CPU utilization. Streaming parsers allocate less during parse/render; AST parsers allocate many nodes. |
| Output buffer reuse | 🟩 | 🟩 | 🟨 | 🟥 | Reusing buffers avoids repeated allocations across runs. ferromark, md4c, and pulldown-cmark allow reuse; comrak allocates internally. |
| Memory locality | 🟩 | 🟩 | 🟨 | 🟥 | A small working set fits in cache. Streaming parsers keep it small; AST-based parsing expands it. |
| Cache friendliness | 🟩 | 🟩 | 🟨 | 🟥 | Linear scans and contiguous buffers work well for CPU caches. ferromark and md4c favor linear scans; pulldown-cmark is close; comrak traverses AST allocations. |
| SIMD availability | 🟩 | 🟨 | 🟩 | 🟥 | SIMD accelerates scanning for special characters. ferromark and pulldown-cmark have SIMD paths; md4c relies on C compiler optimizations; comrak is not SIMD-focused. |
| Hot-path control | 🟩 | 🟩 | 🟧 | 🟥 | Performance headroom from low-level control in inner loops. md4c (C) and ferromark use tighter tuning; pulldown-cmark is mostly safe-Rust hot loops; comrak prioritizes flexibility. |
| Dependency footprint | 🟩 | 🟩 | 🟨 | 🟥 | Fewer dependencies simplify builds. md4c and ferromark are minimal; pulldown-cmark is moderate; comrak is heavier. |
| Throughput ceiling (architectural) | 🟩 | 🟩 | 🟨 | 🟥 | Streaming architectures with fewer allocations generally allow higher throughput ceilings. ferromark and md4c lead; pulldown-cmark is close; comrak trades throughput for flexibility. |
 
Feature coverage and extensibility

| Feature | ferromark | md4c | pulldown-cmark | comrak | Notes |
| --- | --- | --- | --- | --- | --- |
| Extension breadth | 🟩 | 🟧 | 🟨 | 🟩 | comrak has the broadest catalog; ferromark implements all 5 GFM extensions plus footnotes, front matter, heading IDs, math, highlight, subscript, superscript, and callouts; pulldown-cmark and md4c support the common GFM features. |
| Spec compliance (CommonMark) | 🟩 | 🟩 | 🟨 | 🟩 | All four target CommonMark. Beyond CommonMark and GFM, ferromark, pulldown-cmark, and comrak also support footnotes, heading IDs, math spans, and callouts. |
| Extension configuration surface | 🟨 | 🟩 | 🟨 | 🟨 | Fine-grained flags let you disable features to reduce work. md4c has many flags; ferromark has 15 options; pulldown-cmark and comrak use option structs. |
| Raw HTML control | 🟩 | 🟩 | 🟧 | 🟩 | md4c and comrak expose explicit switches; ferromark provides allow_html and disallowed_raw_html; pulldown-cmark is more fixed. |
| GFM tables | 🟩 | 🟩 | 🟩 | 🟩 | All four support GFM tables. |
| Task lists, strikethrough | 🟩 | 🟨 | 🟨 | 🟩 | All four support both. |
| Footnotes | 🟩 | 🟥 | 🟨 | 🟩 | ferromark, pulldown-cmark, and comrak support footnotes; md4c does not. |
| Permissive autolinks | 🟩 | 🟩 | 🟧 | 🟨 | ferromark and md4c support GFM autolink literals (URL, www, email); comrak has relaxed autolinks; pulldown-cmark focuses on spec defaults. |
| Output safety toggles | 🟨 | 🟩 | 🟧 | 🟩 | md4c and comrak provide explicit unsafe/escape switches; ferromark provides allow_html and disallowed_raw_html; pulldown-cmark is more fixed. |
 
Rendering and output

| Feature | ferromark | md4c | pulldown-cmark | comrak | Notes |
| --- | --- | --- | --- | --- | --- |
| Output streaming | 🟩 | 🟩 | 🟨 | 🟥 | Incremental output lowers peak memory and removes extra passes. ferromark and md4c stream to buffers; pulldown-cmark streams events; comrak renders after AST work. |
| Output customization hooks | 🟧 | 🟩 | 🟨 | 🟩 | Callbacks and ASTs are great for custom rendering but add indirection. md4c callbacks and comrak AST are very flexible; pulldown-cmark iterators are easy to transform; ferromark is lower level. |
| Output formats | 🟥 | 🟧 | 🟨 | 🟩 | comrak emits HTML, XML, and CommonMark; pulldown-cmark provides HTML plus event streams; md4c has HTML and callbacks; ferromark targets HTML only. |
| Source position support | 🟥 | 🟥 | 🟩 | 🟨 | pulldown-cmark has strong source map support; comrak can emit source positions; ferromark and md4c skip this for speed. |
| Source map tooling | 🟥 | 🟥 | 🟩 | 🟨 | pulldown-cmark exposes event ranges; comrak can emit source position attributes; ferromark and md4c keep this minimal. |
| IO friendliness | 🟩 | 🟩 | 🟧 | 🟥 | md4c and ferromark stream into buffers; pulldown-cmark recommends buffered output; comrak often builds strings after AST work. |

Building

cargo build            # development
cargo build --release  # optimized (recommended for benchmarks)
cargo test             # run tests
cargo test --test commonmark_spec -- --nocapture  # CommonMark spec
cargo bench            # benchmarks

Project structure

src/
├── lib.rs          # Public API (to_html, to_html_into, parse, Options)
├── main.rs         # CLI binary
├── block/          # Block-level parser
│   ├── parser.rs   # Line-oriented block parsing
│   └── event.rs    # BlockEvent types
├── inline/         # Inline-level parser
│   ├── mod.rs      # Three-phase inline parsing
│   ├── marks.rs    # Mark collection + SIMD integration
│   ├── simd.rs     # NEON SIMD character scanning
│   ├── event.rs    # InlineEvent types
│   ├── code_span.rs
│   ├── emphasis.rs      # Modulo-3 stack optimization
│   ├── strikethrough.rs # GFM strikethrough resolution
│   ├── subscript.rs     # Subscript resolution (~text~)
│   ├── superscript.rs   # Superscript resolution (^text^)
│   ├── math.rs          # Math span resolution ($/$$ delimiters)
│   └── links.rs         # Link/image/autolink parsing
├── mdx/            # MDX segmenter + renderer (feature = "mdx")
│   ├── mod.rs      # Public API — Segment enum, segment(), render()
│   ├── render.rs   # Assembly layer: segments → HTML body + ESM + front matter
│   ├── splitter.rs # Line-based state machine
│   ├── jsx_tag.rs  # JSX tag boundary parser
│   └── expr.rs     # Expression boundary parser (brace/string/comment tracking)
├── footnote.rs     # Footnote store and rendering
├── link_ref.rs     # Link reference definitions
├── cursor.rs       # Pointer-based byte cursor
├── range.rs        # Compact u32 range type
├── render.rs       # HTML writer
├── escape.rs       # HTML escaping (memchr-optimized)
└── limits.rs       # DoS prevention constants

License

MIT -- Copyright 2026 Sebastian Software GmbH, Mainz, Germany
