sverker
Contributor

@sverker sverker commented Jun 18, 2025

Problem

Before OTP 28.0 it was possible to abuse the compiled format of regular expressions, as returned by re:compile, as if it were a serialized format that could be imported into other Erlang node instances. This abuse happened to work as long as the underlying hardware architecture and PCRE version were not too incompatible. But it was unsafe, as passing an incompatible compiled regular expression to re:run could result in any kind of unpleasant behavior.

In OTP 28.0 the compiled format has changed to not expose the internals of PCRE but instead return a safe (magic) reference to the internal regex structures. A compiled regex is now safe but can only be used in the node instance that compiled it.

Solution

This PR introduces a supported, safe way to export compiled regular expressions. The exported format is self-contained and can be stored off-node or sent to other nodes. If the importing node is compatible (architecture and PCRE version), the compiled regex can be used directly with minimal overhead. If it is not compatible, the regular expression is recompiled from the original string and options, which are included in the exported format as a fallback.

Usage

% Use 'export' option to re:compile
{ok, Exported} = re:compile(RegexString, [export | OtherOptions]),

then, on a potentially different node, do

Imported = re:import(Exported),

re:run(Subject, Imported),
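
A minimal end-to-end sketch of the usage above, assuming OTP 28.1 or later. The module name `re_export_demo`, the sample regex, and the use of term_to_binary/binary_to_term to stand in for "stored off-node or sent to another node" are illustrative assumptions, not part of this PR.

```erlang
-module(re_export_demo).
-export([roundtrip/0]).

%% Requires OTP 28.1+, where the 'export' option and re:import/1 exist.
roundtrip() ->
    %% Compile straight to the exportable format.
    {ok, Exported} = re:compile(<<"^(\\d+)-(\\d+)$">>, [export]),
    %% Stand-in for storing the term off-node or sending it to another node.
    Bin = term_to_binary(Exported),
    %% Receiving side: decode the term, import it, then match as usual.
    Imported = re:import(binary_to_term(Bin)),
    re:run(<<"12-34">>, Imported, [{capture, all_but_first, binary}]).
```

Calling `re_export_demo:roundtrip()` should return `{match, [<<"12">>, <<"34">>]}`.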

Exported format

The exported format is opaque but currently looks like this:

{re_exported_pattern, HeaderBin, OrigBin, OrigOpts, EncodedBin}

  • EncodedBin - binary containing the compiled regex as encoded by pcre2_serialize_encode()
  • HeaderBin - binary with some meta information including a CRC checksum over EncodedBin
  • OrigBin - original regular expression as a binary string
  • OrigOpts - options passed to re:compile/2.

Future optimization

Users who previously generated Erlang code with compiled regular expressions as literals would now instead compile with the export option and generate re:import(Literal) instead of just the literal. If done like that, the beam loader could be optimized to detect such calls to re:import with literal arguments, evaluate the calls at load time, and replace them with the returned compiled regular expression as a literal term.

@sverker sverker requested a review from rickard-green June 18, 2025 18:33
@sverker sverker self-assigned this Jun 18, 2025
@sverker sverker added team:VM Assigned to OTP team VM enhancement labels Jun 18, 2025
@sverker
Contributor Author

sverker commented Jun 18, 2025

@josevalim What do you think about this?

Contributor

github-actions bot commented Jun 18, 2025

CT Test Results

    4 files    228 suites   1h 54m 13s ⏱️
3 729 tests 3 626 ✅ 103 💤 0 ❌
4 859 runs  4 730 ✅ 129 💤 0 ❌

Results for commit efd5ef0.



@josevalim
Contributor

I believe this is fantastic and simplifies many of the issues we had to tackle in Elixir. Thank you.

It would be fantastic if this could be used from Erlang too. Perhaps a pass in the compiler will rewrite re:compile into re:import?

Also, do you see this making it into 28.1 or would it be 29 only?

@sverker sverker added this to the OTP-28.1 milestone Jun 19, 2025
@sverker
Contributor Author

sverker commented Jun 19, 2025

The plan is to get this export/import functionality into 28.1, and then potentially do the loader optimization later, maybe already in 28.2.

@josevalim
Contributor

@sverker making it part of 28.1 would help Elixir codebases migrate to the latest OTP, so thank you.

I have one additional question: do you think it is reasonable for re:run to automatically import an exported regex? I am thinking about the multi-node scenario, where you would need to explicitly import messages across nodes (which could be arbitrarily nested), so having it just work is beneficial. Or are you worried about importing being expensive if we have to do it on every operation?

@josevalim
Contributor

I have one additional thought: what if the export is part of the existing tagged tuple? For example, you can add a new field to {re_pattern, _, _, _, _} that returns the export or the atom none. If exported, then you can transparently send it across nodes or run it locally with no performance cost. The receiving node can also run it transparently but it has the option of importing it to make sure it is optimised. What do you think would be the pros and cons of this approach?

rickard-green
rickard-green previously approved these changes Aug 2, 2025
@sverker
Contributor Author

sverker commented Aug 11, 2025

Before OTP 28.0 the "import" step was literally free but unsafe. It is now safe but not totally free. It has to:

  1. Check the CRC checksum of the imported binary.
  2. Allocate memory for the compiled regex.
  3. Do the "decoding" which seems to be basically a memory copy operation in current PCRE2.

I did some measurements, and the import seems to be at least a factor of 10 cheaper than compiling the corresponding expression. Compiling a large 20 kb regex took ~500μs while importing it took ~40μs.
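
For anyone who wants to repeat this kind of comparison, here is a rough sketch using timer:tc/1, assuming OTP 28.1 or later. The module and function names are hypothetical, and this is not the benchmark used for the numbers above. Note that the compile timing here also includes producing the exported format, and that single-shot timings are noisy, so a real measurement should loop and average.

```erlang
-module(re_import_bench).
-export([compare/1]).

%% Hypothetical helper; requires OTP 28.1+ for 'export' and re:import/1.
%% Returns wall-clock microseconds for one compile and one import.
compare(Regex) ->
    %% Time compilation (including encoding to the exported format).
    {CompileUs, {ok, Exported}} =
        timer:tc(fun() -> re:compile(Regex, [export]) end),
    %% Time importing the exported pattern back into a usable regex.
    {ImportUs, _Imported} =
        timer:tc(fun() -> re:import(Exported) end),
    #{compile_us => CompileUs, import_us => ImportUs}.
```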

Our idea was to keep the import as a separate step for performance reasons. At least to begin with. After all, the only reason to precompile regex is performance. If you don't care much about that, just send the regex across node instances uncompiled.

For example, if someone has existing generated code looking like this

choose_regex(foo) ->
    {re_pattern, ...};
choose_regex(bar) ->
    {re_pattern, ...}.

do_the_match(Subject, Mode) ->
    re:run(Subject, choose_regex(Mode)).

then the loader trick would probably not trigger as the regex argument to re:run is not a compile time literal.

If we keep the import separate, then the code generation could be changed simply by adding the export option to re:compile and re:import around the generated literals.

choose_regex(foo) ->
    re:import({re_exported_pattern, ...});
choose_regex(bar) ->
    re:import({re_exported_pattern, ...}).

do_the_match(Subject, Mode) ->
    re:run(Subject, choose_regex(Mode)).

The loader can detect the calls to re:import with literal arguments while the rest of the code can stay untouched.
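
On the generator side, the change could look roughly like the sketch below, assuming OTP 28.1 or later. The module `re_codegen_sketch` and the function `clause/2` are hypothetical names for illustration. It emits the source text of one choose_regex/1 clause with the exported pattern embedded as a literal.

```erlang
-module(re_codegen_sketch).
-export([clause/2]).

%% Hypothetical generator helper; requires OTP 28.1+ for the 'export' option.
%% Returns the source text of one choose_regex/1 clause (without the trailing
%% ';' or '.'), wrapping the exported literal in a call to re:import/1.
clause(Name, Regex) when is_atom(Name) ->
    {ok, Exported} = re:compile(Regex, [export]),
    lists:flatten(
      io_lib:format("choose_regex(~p) ->~n    re:import(~p)",
                    [Name, Exported])).
```

The caller is expected to join the clauses with `;` and terminate the last one with `.`, as in the generated code shown above.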

We can always add automatic import to re:compile and/or re:run later if we find it useful.
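
Until then, code that wants the "just works" behavior can wrap re:run itself. The sketch below is a hypothetical user-side wrapper, not part of OTP. It assumes OTP 28.1 or later and relies on the current `re_exported_pattern` tag of the opaque exported format, which may change. It also pays the import cost on every call, which is exactly the overhead discussed above.

```erlang
-module(re_auto).
-export([run/2]).

%% Hypothetical wrapper, not part of OTP. Requires OTP 28.1+.
%% Imports an exported pattern on every call before matching; anything else
%% (a string, a binary, or an already compiled pattern) goes to re:run/2 as is.
run(Subject, Pattern)
  when is_tuple(Pattern), element(1, Pattern) =:= re_exported_pattern ->
    re:run(Subject, re:import(Pattern));
run(Subject, Pattern) ->
    re:run(Subject, Pattern).
```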

@josevalim
Contributor

Got it, thank you. I think I misunderstood it initially but it is now clear to me: I need to call re:export at compile time and have re:import({re_exported_pattern, ...}) in the Erlang AST. That's what will be seen and optimized by the loader. This way, exported regexes also won't show up anywhere else in the code, because they are converted into regular ones by the loader.

@sverker
Contributor Author

sverker commented Aug 11, 2025

Yes. Except, instead of a new re:export, you call re:compile with option export. I don't remember now why we preferred an option over a separate export function. Summer vacation amnesia.

@josevalim
Contributor

Given export returns a completely different opaque type, it may be handy to gate it behind a separate function indeed. But from my side they both work the same.

@sverker sverker force-pushed the sverker/erts/pcre2-export branch from 0597625 to c77a347 Compare August 13, 2025 15:23
@sverker sverker changed the base branch from master to maint August 13, 2025 15:25
@sverker sverker added the testing currently being tested, tag is used by OTP internal CI label Aug 13, 2025
@sverker sverker requested a review from bjorng August 14, 2025 14:06
@sverker sverker force-pushed the sverker/erts/pcre2-export branch from 7cecc13 to 314caf6 Compare August 18, 2025 12:07
@sverker sverker force-pushed the sverker/erts/pcre2-export branch from 314caf6 to efd5ef0 Compare August 19, 2025 12:10
@sverker sverker merged commit f7b5667 into erlang:maint Aug 19, 2025
28 checks passed
sabiwara added a commit to sabiwara/elixir that referenced this pull request Aug 21, 2025
sabiwara added a commit to sabiwara/elixir that referenced this pull request Aug 30, 2025
sabiwara added a commit to elixir-lang/elixir that referenced this pull request Aug 30, 2025
@sabiwara

Thank you so much @sverker and everybody involved!
Successfully integrated in Elixir (PR), looking forward to 28.1's release 🚀
