[ php-wasm ] add intl extension #2187

Open: mho22 wants to merge 6 commits into trunk

mho22
Contributor

@mho22 mho22 commented Apr 16, 2025

Motivation for the change, related issues


Mostly based on @oskardydo's excellent work and pull request #2173


Implementation details

A refactoring of #2173, with several modifications:

  1. gettext extension separated from intl
  2. root/lib/data directory containing only the icudt74l.dat file, in compile/libintl and the Makefile
  3. new step before buildconf in compile/php/Dockerfile
  4. --pre-js flag adds environment variables to PHP during build [ like the ICU_DATA path ]
  5. --preload-file flag imports external files into the build. Useful when files like icudt74l.dat are necessary.
  6. ZEND_BROKEN_SPRINTF needs to be disabled in order for PHP 7.1 to build successfully.


Questions :

  • PHP 7.0 currently doesn't support intl because it needs its own older version of ICU, from a time when PHP and ICU were more tightly coupled. But I think PHP 7.0 and PHP 7.1 are not supported anymore, right ?

  • The ICU version is 74.2, version used in base pull request. Should I use a more recent or older release ?

  • The preloaded ICU data file is located in /internal/shared/preload. Since this file is mandatory in order for PHP to work properly, should I create another more specific directory ?

  • I didn't find another way to inject ICU_DATA inside the php-wasm build except with the use of --pre-js. But instead of adding yet another possibly useless file, do you think of a better way? Maybe that --pre-js flag and .env.js file can be useful in the future ?

  • It is currently available for web and node, but should I wait for the icu compression tests results or is it definitely node-only ?



Next :

  • Build the tiniest libintl files, libraries and data to make sure this can be loaded in web ?


P.S. : The first commit contains only the source files, without the builds, for readability. The next commit will add the libintl libraries and web/jspi/7.1, web/jspi/8.4, node/asyncify/7.1, node/asyncify/8.4.

Testing Instructions

Run this with php.run or php.cli

<?php

$formatter = new \NumberFormatter('en-US', \NumberFormatter::CURRENCY);

var_dump($formatter->format(100.00));

$formatter = new \NumberFormatter('fr-FR', \NumberFormatter::CURRENCY);

var_dump($formatter->format(100.00));


Should return this result :

string(7) "$100.00"
string(11) "€ 100.00"

@mho22
Contributor Author

mho22 commented Apr 16, 2025

I tried to optimize the intl builds in different ways :

But first of all, these are the sizes of the compile/libintl directories before build :

data : 30,8 Mb [ icudt74l.dat file ]
include : 4,8 Mb
lib : 8 Mb


My first attempt was to greatly decrease the icudt74l.dat file with a filter.json file :

{
	"localeFilter": {
		"filterType": "language",
		"includelist": ["en"]
	}
}

It resulted in a 12.7 MB file. Not great, not terrible.
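
A minimal sketch of generating such a filter for more than one language, assuming the same localeFilter shape as the filter.json above. buildLocaleFilter is a hypothetical helper, not part of ICU or of this PR, and the size impact of any language other than en is not measured here.

```javascript
// Hypothetical helper: builds the object that would be serialized to filter.json.
// The keys mirror the filter shown above; the function name is made up.
function buildLocaleFilter(languages) {
	if (!Array.isArray(languages) || languages.length === 0) {
		throw new Error('At least one language code is required');
	}
	return {
		localeFilter: {
			filterType: 'language',
			// Deduplicate while keeping the original order.
			includelist: [...new Set(languages)],
		},
	};
}

const filter = buildLocaleFilter(['en', 'fr', 'en']);
console.log(JSON.stringify(filter, null, '\t'));
```

The printed JSON can be saved as filter.json and passed to the ICU data build in the same way as the single-language filter.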



The second attempt was to decrease the lib directory with new flags :

RUN set -euxo pipefail && \
    mkdir -p /root/lib && \
    source /root/emsdk/emsdk_env.sh && \
    CPPFLAGS="-DUCONFIG_NO_LEGACY_CONVERSION=1 -DUCONFIG_NO_COLLATION=1 -DUCONFIGU_NO_FORMATTING=1 -DUCONFIG_NO_TRANSLITERATION=1 -DUCONFIG_NO_REGULAR_EXPRESSIONS=1" \
    emconfigure /root/icu/source/configure \
include : 4,8 Mb
lib : 6,8 Mb


RUN set -euxo pipefail && \
    mkdir -p /root/lib && \
    source /root/emsdk/emsdk_env.sh && \
    CPPFLAGS="-DUCONFIG_ONLY_COLLATION=1 -DUCONFIG_NO_LEGACY_CONVERSION=1 -DUCONFIG_NO_SERVICE=1" \
    emconfigure /root/icu/source/configure \
include : 4,8 Mb
lib : 3,9 Mb

Unfortunately PHP won't compile with the -DUCONFIG_ONLY_COLLATION flag on.

So my best optimization so far gives :

data : 12,7 Mb
include : 4,8 Mb
lib : 6,8 Mb


And here is a comparison between dependenciesTotalSize data from different php_8_4.js versions :

php 8.4 with intl

- export const dependenciesTotalSize = 16143865; 
+ export const dependenciesTotalSize = 18472927;
php 8.4 with intl -DUCONFIG_NO_LEGACY_CONVERSION=1 -DUCONFIG_NO_COLLATION=1 -DUCONFIGU_NO_FORMATTING=1 -DUCONFIG_NO_TRANSLITERATION=1 -DUCONFIG_NO_REGULAR_EXPRESSIONS=1

- export const dependenciesTotalSize = 16143865;
+ export const dependenciesTotalSize = 18135309;
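
To make the two diffs easier to compare, the increase over the 16143865-byte baseline can be computed directly. The numbers are the ones quoted above; the helper names are illustrative only.

```javascript
const baseline = 16143865; // dependenciesTotalSize without intl
const builds = {
	'intl, no extra flags': 18472927,
	'intl, with the UCONFIG flags': 18135309,
};

// Bytes added on top of the baseline build.
function deltaBytes(size) {
	return size - baseline;
}

// Bytes expressed in MiB, rounded to two decimals.
function toMiB(bytes) {
	return (bytes / (1024 * 1024)).toFixed(2);
}

for (const [label, size] of Object.entries(builds)) {
	const delta = deltaBytes(size);
	console.log(`${label}: +${delta} bytes (+${toMiB(delta)} MiB)`);
}
// intl, no extra flags: +2329062 bytes (+2.22 MiB)
// intl, with the UCONFIG flags: +1991444 bytes (+1.90 MiB)
```

With these figures the extra flags save 337618 bytes, roughly 0.32 MiB.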


Questions :

  1. Should I keep these flags ?
  2. Should I remove WITH_INTL from web ?

@adamziel
Collaborator

adamziel commented Apr 17, 2025

Should I keep these flags ?

What are the consequences of having them? Are we missing out on some languages or types of information? Or is it just more compressed? If we retain most information, yes, let's keep those flags on.

Should I remove WITH_INTL from web ?

Just to summarize the total download size impact for the JSPI build

  • php.data: ~29MB, new file, must be downloaded upfront
  • php_8_4.wasm: 15MB -> 17MB
  • php_8_4.js: More or less the same

29MB is way too large for a default download on the web, so let's leave WITH_INTL=false until we figure out how to ship extensions as dynamic libraries that can be declared in the Blueprint (e.g. XDebug). Even better if we had optimistic lazy loading without declarations.

The ICU version is 74.2, version used in base pull request. Should I use a more recent or older release ?

Would using the latest version be just a matter of changing the build configuration? If so, let's do it. However, if that would create additional compilation hurdles, let's stick with 74.2 for now. It's from December 2023 so still fairly recent.

The preloaded ICU data file is located in /internal/shared/preload. Since this file is mandatory in order for PHP to work properly, should I create another more specific directory ?

/internal/shared/preload is for PHP files that are preloaded with auto_prepend_file. Just /internal/shared should be fine.

I didn't find another way to inject ICU_DATA inside the php-wasm build except with the use of --pre-js. But instead of adding yet another possibly useless file, do you think of a better way? Maybe that --pre-js flag and .env.js file can be useful in the future?

Thinking about Node.js, a separate file seems fine. Here's a few thoughts I had:

  • Using an actual meaningful library name would be more helpful than shipping a file called php.data
  • With dynamic libraries, we'll need to separate the dependencies of every library. We may potentially ship them separately via npm packages eventually, or have a small package repository embedded in the Playground repo.
  • Do we know why it only works via --pre-js? Does the file need to be present when the first WASM function call is made, for example? If so, would creating it in the initializeRuntime() method still work?

@mho22
Contributor Author

mho22 commented Apr 18, 2025

-DUCONFIG_NO_LEGACY_CONVERSION=1 disables support for legacy encodings, like ISO-8859-1, Shift-JIS, etc.
-DUCONFIG_NO_REGULAR_EXPRESSIONS=1 disables ICU’s regex engine. But PHP itself uses PCRE extension for regex.

Adding the other flags would disable some PHP functions like collator_compare or numfmt_format, so I decided to remove them.

Here are the different sizes without and with intl.

php 8.4 without intl 

php_8_4.wasm: 16,1 Mb
php_8_4.js: 148 Kb
php 8.4 with intl without filters

data : 31,9 Mb
include : 5,1 Mb
lib : 8,5 Mb

php.data: 31,9 Mb
php_8_4.wasm: 18,5 Mb
php_8_4.js: 153 Kb
php 8.4 with intl -DUCONFIG_NO_LEGACY_CONVERSION=1 -DUCONFIG_NO_REGULAR_EXPRESSIONS=1 without filters

data : 31,9 Mb
include : 5,1 Mb
lib : 8,2 Mb

php.data: 31,9 Mb
php_8_4.wasm: 18,4 Mb
php_8_4.js: 153 Kb

These builds are made with latest ICU version 77.1. Nothing more has to be done to make this version work.



29MB is way too large for a default download on the web, so let's leave WITH_INTL=false until we figure out how to ship extensions as dynamic libraries that can be declared in the Blueprint (e.g. XDebug). Even better if we had optimistic lazy loading (#89) without declarations.

Just to be sure, should I disable intl completely or only in web :

web: {
    WITH_INTL: 'no',
},

Using an actual meaningful library name would be more helpful than shipping a file called php.data

If you add multiple --preload-file files in emcc all of these preloaded files will be stored in that php.data, that is why I didn't want to rename it intl.data for example.
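
For illustration, a sketch of how several preload entries could be assembled into emcc flags, using the source@target form that appears in the php/Dockerfile line quoted later in this thread. preloadFileFlags and the second mapping are hypothetical; per the explanation above, all entries would still end up in the same php.data file.

```javascript
// Hypothetical helper: turns { source, target } pairs into emcc --preload-file flags.
function preloadFileFlags(mappings) {
	return mappings
		.map(({ source, target }) => `--preload-file ${source}@${target}`)
		.join(' ');
}

const flags = preloadFileFlags([
	// The mapping used in this PR.
	{ source: '/root/lib/data', target: '/internal/shared' },
	// Made-up second entry, only to show that flags accumulate.
	{ source: '/root/lib/other', target: '/internal/other' },
]);

console.log(flags);
// --preload-file /root/lib/data@/internal/shared --preload-file /root/lib/other@/internal/other
```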

With dynamic libraries, we'll need to separate the dependencies of every library. We may potentially ship them separately via npm packages eventually, or have a small package repository embedded in the Playground repo.

If I understand correctly, we have two strategies here. The first is the asyncify way: we will ship extensions as dynamic libraries [ .so files, I guess ]. This means keeping asyncify. But this seems to be normal since JSPI is still experimental. The second is lazy loading with JSPI. JSPI lazy loading seems really promising and I will be glad to contribute to that, but I suppose dynamic libraries is more short term than lazy loading. Correct ?

Do we know why it only works via --pre-js? Does the file need to be present when the first WASM function call is made, for example? If so, would creating it in the initializeRuntime() method still work?

I am still investigating this, but simply setting the environment variable ENV.ICU_DATA = "/internal/shared" before the callRuntimeCallbacks(__ATINIT__) line works. That line runs a single callback : (...args) => original(...args).

node/jspi/php_8_4.js on line 346:

function initRuntime() {
    runtimeInitialized = true;
    SOCKFS.root = FS.mount(SOCKFS, {}, null);
    if (!Module["noFSInit"] && !FS.init.initialized)
      FS.init();
    FS.ignorePermissions = false;
    TTY.init();
    PIPEFS.root = FS.mount(PIPEFS, {}, null);
    ENV.ICU_DATA = "/internal/shared";  // This works
    callRuntimeCallbacks(__ATINIT__);
    // ENV.ICU_DATA = "/internal/shared"; This doesn't work
}

Putting that line in the --pre-js file makes it run before the initRuntime function, which is why it works.

This is not the initializeRuntime you were looking for I guess.
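
The ordering constraint can be modelled in a few lines. This is a simplified stand-in, not Emscripten's actual code: it assumes the ATINIT callback reads ICU_DATA once during initialization, which is consistent with the observation above but is not confirmed in this thread. createRuntime, preJs and afterInit are made-up names.

```javascript
// Simplified model of the startup order described above.
function createRuntime({ preJs, afterInit } = {}) {
	const ENV = {};
	let icuDataSeenAtInit;

	// Stand-in for the ATINIT callback that starts PHP and reads ICU_DATA once.
	const atInit = [() => { icuDataSeenAtInit = ENV.ICU_DATA; }];

	if (preJs) preJs(ENV);         // --pre-js content runs first
	atInit.forEach((cb) => cb());  // callRuntimeCallbacks(__ATINIT__)
	if (afterInit) afterInit(ENV); // too late: the value was already read

	return { icuDataSeenAtInit };
}

const early = createRuntime({ preJs: (env) => { env.ICU_DATA = '/internal/shared'; } });
const late = createRuntime({ afterInit: (env) => { env.ICU_DATA = '/internal/shared'; } });

console.log(early.icuDataSeenAtInit); // "/internal/shared"
console.log(late.icuDataSeenAtInit);  // undefined
```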

@adamziel
Collaborator

Just to be sure, should I disable intl completely or only in web :

Only in web, let's still build the Node version with intl since the bundle size doesn't matter that much there. Could we reuse the same .data file for all the PHP versions to keep the npm package size small?

@adamziel
Collaborator

If you add multiple --preload-file files in emcc all of these preloaded files will be stored in that php.data, that is why I didn't want to rename it intl.data for example

Gotcha! What would it take to still rename it, though? Would it be as simple as a string replacement in the built php.js? Or is there more to it? If it's complex, let's leave it.

This means keeping asyncify.

We'll need to keep Asyncify until WebKit (Safari, Bun) supports JSPI 😢

I suppose dynamic libraries is more short term than lazy loading. Correct ?

Yes, e.g. XDebug is a dynamic library and it's a short term priority. Lazy loading will be challenging in that we'll need to create extension stubs with the right function signatures to trick PHP into thinking it actually loaded the extension.

This is not the initializeRuntime you were looking for I guess.

I meant this one:

https://github.com/Automattic/wordpress-playground-private/blob/0d16adc6c1935037099e7d34466afd14d158be23/packages/php-wasm/universal/src/lib/php.ts#L212

But it seems to be called too late. Hm. There's always the ENV here that we can control without messing with the php.js module:

https://github.com/Automattic/wordpress-playground-private/blob/0d16adc6c1935037099e7d34466afd14d158be23/packages/php-wasm/universal/src/lib/load-php-runtime.ts#L141

Perhaps there's some elegant way of injecting that env variable from here:

https://github.com/Automattic/wordpress-playground-private/blob/0d16adc6c1935037099e7d34466afd14d158be23/packages/php-wasm/node/src/lib/load-runtime.ts#L21

Or maybe baking it into the php.js module is for the best, since it depends on the build options. Looping in @brandonpayton for thoughts

@mho22
Contributor Author

mho22 commented Apr 19, 2025

Could we reuse the same .data file for all the PHP versions to keep the npm package size small?

Yes, but there is one for asyncify and another for jspi. Should I investigate using a single file for both?



What would it take to still rename it, though?

It is as easy as it looks. What would you like to name it? Maybe intl.data?



Perhaps there's some elegant way of injecting that env variable from here:

This works as you mentioned :

const runtime = await loadNodeRuntime( '8.4', { emscriptenOptions : { ENV : { ICU_DATA : "/internal/shared" } } } );

However, it implies that the path can be changed, while in reality it's fixed at build time based on this line in php/Dockerfile:

echo -n ' --preload-file /root/lib/data@/internal/shared ' >> /root/.emcc-php-wasm-flags; \

But it is indeed way more elegant and it means we can avoid a --pre-js file during build.
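
One way to reconcile the two points above (the option is convenient, but the path is fixed at build time) could be a small wrapper that always forces ICU_DATA to the preload target. This is only a sketch: withFixedIcuData is a hypothetical helper, not an existing php-wasm API, and it assumes the options object has the emscriptenOptions.ENV shape used in the call above.

```javascript
// Matches the --preload-file target in the Dockerfile line above.
const ICU_DATA_DIR = '/internal/shared';

// Hypothetical wrapper: keeps every caller-supplied option and ENV entry,
// but always overrides ICU_DATA with the build-time path.
function withFixedIcuData(options = {}) {
	const emscriptenOptions = options.emscriptenOptions || {};
	return {
		...options,
		emscriptenOptions: {
			...emscriptenOptions,
			ENV: { ...(emscriptenOptions.ENV || {}), ICU_DATA: ICU_DATA_DIR },
		},
	};
}

const merged = withFixedIcuData({
	emscriptenOptions: { ENV: { ICU_DATA: '/somewhere/else', TZ: 'UTC' } },
});
console.log(merged.emscriptenOptions.ENV);
// { ICU_DATA: '/internal/shared', TZ: 'UTC' }
```

The result could then be passed as the second argument of loadNodeRuntime, so callers cannot point ICU at a directory that the build does not populate.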

@mho22
Contributor Author

mho22 commented Apr 20, 2025

Actually, this works as well without --pre-js and --preload-file during php build :

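import fs from 'node:fs';
import { PHP } from '@php-wasm/universal';
import { loadNodeRuntime } from '@php-wasm/node';

// Backslashes are doubled so the template literal keeps PHP's leading "\".
const script = `<?php $formatter = new \\NumberFormatter('en-US', \\NumberFormatter::CURRENCY);

var_dump($formatter->format(100.00));

$formatter = new \\NumberFormatter('fr-FR', \\NumberFormatter::CURRENCY);

var_dump($formatter->format(100.00));
`;

// ICU_DATA tells ICU which directory of the virtual filesystem holds its data file.
const php = new PHP( await loadNodeRuntime( '8.4', { emscriptenOptions : { ENV : { ICU_DATA : "/icu-data-path" } } } ) );

// Copy the ICU data file from disk into the virtual filesystem before running any intl code.
php.mkdir( '/icu-data-path' );

php.writeFile( '/icu-data-path/icudt74l.dat', fs.readFileSync( 'data/icudt74l.dat' ) );

const result = await php.run( { code : script } );

console.log( result.text );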


node --experimental-wasm-stack-switching scripts/node.js

string(7) "$100.00"
string(9) "€100.00"




import script from '../php/intl.php?raw';

...

const php = new PHP( await loadWebRuntime( '8.4', { emscriptenOptions : { ENV : { ICU_DATA : "/icu-data-path" }, ... } } ) );

fetch( 'data/icudt74l.dat' ).then( async data =>
{
    php.mkdir( '/icu-data-path' );

    php.writeFile( '/icu-data-path/icudt74l.dat', new Uint8Array( await data.arrayBuffer() ) );

    const result = await php.run( { code : script } );

    console.log( script );

    console.log( result.text );

    const phpinfo = await php.run( { code : "<?php phpinfo();" } );

    document.getElementById( 'app' ).innerHTML = phpinfo.text;
} );

web.js

[Screenshot, 2025-04-21 at 10:19:44]

So we could avoid having to add the big icudt74l.dat data file inside the builds, letting users use the data file version they want and perhaps enable WITH_INTL for web and node without extra data file ? The wasm file will still be 2Mb heavier per php version.

We should probably provide some documentation about the process if we decide to go for that solution.
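
As a starting point for such documentation, the manual steps from the two scripts above could be wrapped in one function. installIcuData is a hypothetical helper, not an existing php-wasm API; it only assumes that the php object exposes mkdir(path) and writeFile(path, bytes) as used above, and that ICU_DATA was set to the same directory when the runtime was loaded. The fake object below stands in for a real PHP instance so the sketch runs on its own.

```javascript
// Hypothetical helper: writes the ICU data file into the PHP virtual filesystem.
function installIcuData(php, bytes, { dir = '/icu-data-path', fileName = 'icudt74l.dat' } = {}) {
	if (!(bytes instanceof Uint8Array) || bytes.length === 0) {
		throw new Error('ICU data must be a non-empty Uint8Array');
	}
	php.mkdir(dir);
	const path = `${dir}/${fileName}`;
	php.writeFile(path, bytes);
	return path;
}

// In-memory stand-in for a PHP instance, for demonstration only.
const fakePhp = {
	dirs: [],
	files: new Map(),
	mkdir(path) { this.dirs.push(path); },
	writeFile(path, data) { this.files.set(path, data); },
};

const written = installIcuData(fakePhp, new Uint8Array([1, 2, 3]));
console.log(written); // "/icu-data-path/icudt74l.dat"
```

In Node the bytes could come from fs.readFileSync, and on the web from new Uint8Array(await response.arrayBuffer()), as in the scripts above.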

@brandonpayton
Member

Perhaps there's some elegant way of injecting that env variable from here:

https://github.com/Automattic/wordpress-playground-private/blob/0d16adc6c1935037099e7d34466afd14d158be23/packages/php-wasm/node/src/lib/load-runtime.ts#L21

Or maybe baking it into the php.js module is for the best, since it depends on the build options. Looping in @brandonpayton for thoughts

If we are configuring a fixed path that we completely control, it seems like it would be cleanest to just bake a global into the build. I haven't digested all the details in this PR, but adding another --pre-js file that is populated conditionally seems like a fine approach.

@adamziel
Collaborator

So we could avoid having to add the big icudt74l.dat data file inside the builds, letting users use the data file version they want and perhaps enable WITH_INTL for web and node without extra data file ? The wasm file will still be 2Mb heavier per php version.

This is great! Lovely! To confirm my understanding:

  • Every php.wasm version would be 2MB larger
  • The npm packages for php-wasm/node and php-wasm/web would both ship a single icudt74l.dat file that's under 20MB
  • We can choose whether or not to download that file. If we do, the intl extension just works. If we don't, PHP still works, but it refuses to load the intl extension.

Is that right? If yes then yes, let's build all php versions WITH_INTL. Then, separately from this PR, let's discuss the API to load the dat file on the web. In Node we can just always load it.

@mho22
Contributor Author

mho22 commented Apr 24, 2025

@adamziel That's right! I need some additional information :

  • Where should I store the dat file in php-wasm/web and php-wasm/node ?
  • Where should I load the dat file in Node ? In the php/Dockerfile build with --pre-js for the env variable and --preload-file ? Or maybe more elegantly [ I currently don't know how but there is certainly another way ]

If we don't, PHP still works, but it refuses to load the intl extension.

PHP will still work, and intl extension will be loaded, but when running intl functions without the data from the dat file, php exceptions will be thrown.

@adamziel
Collaborator

Where should I store the dat file in php-wasm/web and php-wasm/node?

For php-wasm/web, the public directory seems reasonable. For php-wasm/node, I'm not sure – feel free to propose something. The most important part is to make sure it's shipped with the built package and double-check it's being loaded. Unfortunately we don't have any post-build smoke tests.

Where should I load the dat file in Node ? In the php/Dockerfile build with --pre-js for the env variable and --preload-file ? Or maybe more elegantly [ I currently don't know how but there is certainly another way ]

I'm confused. I thought it worked as well without --pre-js and --preload-file? In which case we'd load it via fetch or node:fs somewhere around getPHPLoaderModule?

PHP will still work, and intl extension will be loaded, but when running intl functions without the data from the dat file, php exceptions will be thrown.

This is fine for v1. For v2, let's explore disabling those functions – I worry some developers might check the availability of the intl extension with a simplistic function_exists() check.

@mho22
Contributor Author

mho22 commented May 2, 2025

@adamziel I was wrong about the size of icudt74l.dat. It is not 20Mb, but 30.8Mb. And I assume this is much more than expected. I think I need to make a summary of what this pull request is trying to do :



  • We can enable intl for every Node and web PHP version from 7.2 up. Even without the ICU data file, PHP-WASM still runs.
  • PHP-WASM will return errors when trying to run related code and functions from the intl extension if there is no ICU_DATA environment variable added :
Uncaught IntlException: Constructor failed in /internal/eval.php:3

Users could add it manually in loadNodeRuntime or loadWebRuntime through emscriptenOptions OR I could add it in universal/src/lib/load-php-runtime.js on line 142:

ENV: {
	ICU_DATA : "/icu-data-path"
},


  • PHP-WASM will return errors when trying to run related code and functions from the intl extension if there is no data file available in the ICU_DATA directory mentioned in the environment variable above:
Uncaught IntlException: Constructor failed in /internal/eval.php:3

Users could add it manually after PHP-WASM is loaded this way :

const php = new PHP( await loadNodeRuntime( '8.3' ) );

php.mkdir( '/icu-data-path' );

php.writeFile( '/icu-data-path/icudt74l.dat', fs.readFileSync( 'node_modules/@php-wasm/node/shared/icudt74l.dat' ) );


OR I could do it in the code, around getPHPLoaderModule as you said, but honestly I don't know exactly where. I need access to both the PHP-WASM FS and the data file (via node:fs or fetch), but to access the FS I need php-node or php-web to be ready, and at that point they aren't. So the only way I got it working was this :

node/src/lib/load-runtime.ts

export async function loadNodeRuntime(
	phpVersion: SupportedPHPVersion,
	options: PHPLoaderOptions = {}
) {
	const emscriptenOptions: EmscriptenOptions = {...};

	const id = await loadPHPRuntime(
		await getPHPLoaderModule(phpVersion),
		await withNetworking(emscriptenOptions)
	);

	const php = new PHP( id );

	php.mkdir( '/icu-data-path' );

	php.writeFile( '/icu-data-path/icudt74l.dat', new Uint8Array( readFileSync( `${__dirname}/shared/icudt74l.dat` ) ) );

	return id;
}

And yes, this is really bad.



But now this code works without having to set an ENV variable or load a data file myself :

import { PHP } from '@php-wasm/universal';
import { loadNodeRuntime } from '@php-wasm/node';


// Backslashes are doubled so the template literal keeps PHP's leading "\".
const code = `<?php

$formatter = new \\NumberFormatter('en-US', \\NumberFormatter::CURRENCY);

var_dump($formatter->format(100.00));

$formatter = new \\NumberFormatter('fr-FR', \\NumberFormatter::CURRENCY);

var_dump($formatter->format(100.00));`;




const php = new PHP( await loadNodeRuntime( '8.3' ) );

const result = await php.run( { code : code } );

console.log( result.text );
> node --experimental-wasm-stack-switching scripts/node.js

string(7) "$100.00"
string(11) "100,00 €"
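
The lengths printed by var_dump are byte counts, not character counts, which is why the euro amounts report more bytes than they have visible characters. A quick check in JavaScript, assuming UTF-8 output and assuming the space before the euro sign is a no-break space (U+00A0), which cannot be confirmed from the pasted output:

```javascript
// Number of bytes a string occupies once encoded as UTF-8.
const utf8Length = (text) => new TextEncoder().encode(text).length;

console.log(utf8Length('$100.00'));       // 7
console.log(utf8Length('100,00\u00A0€')); // 11 (6 + 2 for U+00A0 + 3 for the euro sign)
console.log(utf8Length('€100.00'));       // 9
```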


But honestly, this is not the right solution. What do you think about it ?



I'm confused. I thought it worked as well without --pre-js and --preload-file?

Apologies for the confusion. It works as well without, I just wanted to know what was the best way for you, and it seems to be the "after runtime loaded" way.

@mho22
Contributor Author

mho22 commented May 2, 2025

@adamziel Regarding the directories, I suggest creating a shared directory in the following locations:

- dist/packages/php-wasm/web
- dist/packages/php-wasm/node
- packages/php-wasm/node
- packages/php-wasm/web/public

This setup will streamline the transfer of the icudt74l.dat file from the compile directory to the dist directory. The flow would look like this:



npm run recompile:php:node:jspi:8.3

packages/php-wasm/compile/libintl/icudt74l.dat    >>    packages/php-wasm/node/shared/icudt74l.dat

nx run php-wasm-node:build

packages/php-wasm/node/shared/icudt74l.dat    >>    dist/packages/php-wasm/node/shared/icudt74l.dat


npm run recompile:php:web:jspi:8.3

packages/php-wasm/compile/libintl/icudt74l.dat    >>    packages/php-wasm/web/public/shared/icudt74l.dat

nx run php-wasm-web:build

packages/php-wasm/web/public/shared/icudt74l.dat    >>    dist/packages/php-wasm/web/shared/icudt74l.dat


I think shared is a suitable name for a directory composed of files that are used by each php version. What do you think ?
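
The flow above can also be written down as data, which makes it easy to check both targets at once. planIcuCopies is a hypothetical helper that only computes the paths listed above; it does not touch the filesystem and is not part of the existing build scripts.

```javascript
const ICU_FILE = 'icudt74l.dat';
const SOURCE = `packages/php-wasm/compile/libintl/${ICU_FILE}`;

// Returns the two copy steps (recompile, then build) for a "node" or "web" target.
function planIcuCopies(target) {
	if (target !== 'node' && target !== 'web') {
		throw new Error(`Unknown target: ${target}`);
	}
	const packageDir = target === 'web'
		? 'packages/php-wasm/web/public/shared'
		: 'packages/php-wasm/node/shared';
	const distDir = `dist/packages/php-wasm/${target}/shared`;
	return [
		{ step: 'recompile', from: SOURCE, to: `${packageDir}/${ICU_FILE}` },
		{ step: 'build', from: `${packageDir}/${ICU_FILE}`, to: `${distDir}/${ICU_FILE}` },
	];
}

for (const target of ['node', 'web']) {
	for (const { step, from, to } of planIcuCopies(target)) {
		console.log(`${target} ${step}: ${from} >> ${to}`);
	}
}
```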

@adamziel
Collaborator

adamziel commented May 2, 2025

Shared directory sounds great! I'll address the rest on Monday, but the rule of thumb is this: we don't want to increase the minimum download size by more than 2-3 MB.
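
That rule of thumb can be checked against the sizes reported earlier in this thread. A small sketch, assuming the 3 MB upper bound and the PHP 8.4 figures posted above (16,1 MB wasm without intl, 18,5 MB with intl, 31,9 MB data file); the function names are illustrative only.

```javascript
const BUDGET_MB = 3; // upper end of the "2-3 MB" rule of thumb

// Extra megabytes every user has to download compared to the baseline build.
function addedDownloadMb({ baselineWasmMb, newWasmMb, upfrontDataMb = 0 }) {
	return newWasmMb - baselineWasmMb + upfrontDataMb;
}

function withinBudget(sizes, budgetMb = BUDGET_MB) {
	return addedDownloadMb(sizes) <= budgetMb;
}

// Data file loaded only on demand: only the wasm growth counts.
const optionalData = { baselineWasmMb: 16.1, newWasmMb: 18.5 };
// Data file bundled and downloaded upfront.
const bundledData = { ...optionalData, upfrontDataMb: 31.9 };

console.log(withinBudget(optionalData)); // true
console.log(withinBudget(bundledData));  // false
```

Under these numbers, building every version with intl fits the budget only as long as the ICU data file stays an optional download.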

Labels
None yet
Projects
Status: Inbox
Development

3 participants