Skip to content

[Jinja] Everything is an identifier #1418

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 38 commits into from
May 5, 2025
Merged

[Jinja] Everything is an identifier #1418

merged 38 commits into from
May 5, 2025

Conversation

xenova
Copy link
Contributor

@xenova xenova commented May 2, 2025

image

Turns out, ✨ anything ✨ can be an identifier, even supposed keywords. For example, the following jinja template is valid:

{% if if in in %}a{% endif %}{% set if = "a" %}{% set in = "abc" %}{% if if in in %}b{% endif %}

So, like other jinja implementations, we should rather treat all "keywords" as identifiers during lexing, and only assert semantics during parsing.

This is necessary to support the call functionality (see docs), which had conflicts with existing templates on the HF hub that happened to use call as a variable instead.

@julien-c
Copy link
Member

julien-c commented May 2, 2025

🤯

@xenova xenova marked this pull request as ready for review May 3, 2025 02:12
@xenova
Copy link
Contributor Author

xenova commented May 3, 2025

I've made a bunch of improvements, and now @huggingface/jinja supports at least the top 100,000 valid template on the HF hub (ordered by downloads). 🥳

There were some strange cases to consider (people are very creative), but I think this is really impactful for the new chat template formatting feature (cc @mishig25).

Also cc @Rocketknight1 for viz, just highlighting the kinds of features that people are using in the wild.

@julien-c
Copy link
Member

julien-c commented May 3, 2025

supports at least the top 100,000 valid template on the HF hub

this is insane 🤯

Copy link
Collaborator

@mishig25 mishig25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm !

So, like other jinja implementations, we should rather treat all "keywords" as identifiers during lexing, and only assert semantics during parsing.

do you have examples of other implementations?

@xenova
Copy link
Contributor Author

xenova commented May 5, 2025

I've now tested rendering chat templates on the top 100,000 templates, and I think I've got all the cases covered now! 🥳

Last thing: improve formatting to not require excessive bracketing (not the most important thing; might leave for a separate PR).

do you have examples of other implementations?

I'm mainly basing my implementation on the official library, and double-checking with minijinja.

@xenova
Copy link
Contributor Author

xenova commented May 5, 2025

Turns out, it was pretty simple ✅

Before:

{%- set reasoning_content = ((((((message.content).split("</think>"))[0]).rstrip("\n")).split("<think>"))[-1]).lstrip("\n") -%}

After:

{%- set reasoning_content = message.content.split("</think>")[0].rstrip("\n").split("<think>")[-1].lstrip("\n") %}

Also, I've tested that formatting doesn't affect any output for the top 100,000 templates.

Copy link
Collaborator

@mishig25 mishig25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, amazing!

@xenova
Copy link
Contributor Author

xenova commented May 5, 2025

I think this is a good checkpoint for now ✅ Will merge and put out a new release (0.5.0).

Summary:

  • Aligned usage of identifiers w/ official jinja implementation (anything can now be an identifier)

  • We now support comment parsing and formatting (meaning this won't be preprocessed away when formatting). Rendering remains unchanged of course (ignore comments).

  • Differentiate between Integer and Float types, which is something Python does, but JavaScript doesn't, so we needed to take special care here.

  • Add support for new statement types:

    • filter: {% filter %}...{% endfilter %}
    • call: {% call %}...{% endcall %}
    • (custom) generation: {% generation %}{% endgeneration %}
  • Add support for new expression types:

    • spread: {{ fn(*args) }}
    • ternary (previously we just used the if expression): {{ 1 if true else 2 }}
  • Add support for new functions:

    • replace
    • strftime_now (get current time according to a narrow set of time templates, found in the wild)
    • and many others
  • Reduce number of redundant brackets when formatting membership and property accesses.

  • Improved binary operator precedence rules and general formatting rules.

  • Fixed edge-case lexing issues

This now means we support (at least) parsing, formatting, and rendering of the top 100,000 transformers-compatible chat templates on the Hugging Face Hub 🥳

@xenova xenova merged commit b5deb41 into main May 5, 2025
5 checks passed
@xenova xenova deleted the identifiers branch May 5, 2025 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants