@xworld21
Contributor

Use URI::file for serialising and deserialising paths written to XML for @graphic, @candidates, and the processing instructions 'searchpaths' and 'graphicspath'. It also replaces previous manual uses of URI::file.

This prevents potential issues regarding:

  • OS-specific path separators: backslashes on Win32 will never leak into XML
  • absolute paths: URI::file will add file:// when the path is absolute, including on Windows for paths starting with e.g. C: (or long paths, but LaTeXML does not support those yet)
  • spaces, commas, Unicode, etc: URLs are automatically encoded and decoded
  • lists of URLs: multiple URLs are now space separated, which is safe, since spaces within URLs will be percent-encoded
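
The points above can be seen in a small Perl sketch. It calls URI::file directly rather than the helpers in this PR; the sample paths and the explicit 'win32' / 'unix' arguments are assumptions chosen for illustration, so that the result does not depend on the host OS.

```perl
use strict;
use warnings;
use URI;
use URI::file;

# Illustrative paths only; the second argument pins the path syntax
# so the result does not depend on the machine running the sketch.
my $win = URI::file->new('C:\\docs\\my figure.png', 'win32');
print "$win\n";    # absolute path: file: scheme, forward slashes, encoded space

my $rel = URI::file->new('sub dir/fig 1.png', 'unix');
print "$rel\n";    # relative path: no scheme, spaces percent-encoded

# A list of paths serialised as one space-separated value.
my @paths  = ('sub dir/fig 1.png', '/tmp/fig 2.png');
my $joined = join(' ', map { URI::file->new($_, 'unix')->as_string } @paths);

# Splitting on whitespace is safe because no encoded URL contains a raw space.
my @back = map { URI->new($_, 'file')->file('unix') } split(' ', $joined);
print join(' | ', @back), "\n";
```

The round trip at the end is the reason a space works as the list separator: every space inside a path has already become %20 by the time the values are joined.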

Note that I have replaced the previous implementation of pathname_to_url as it didn't make sense to have two. I have adjusted CrossRef accordingly.

This supersedes #2367 and fixes #2355 (well, I haven't tried it on a Windows machine yet, but I am sure it does).

@xworld21 xworld21 marked this pull request as ready for review February 23, 2025 12:09
@dginev dginev requested review from brucemiller and dginev February 23, 2025 12:15
@xworld21 xworld21 force-pushed the pathname-url branch 4 times, most recently from 14f6bc7 to 0a600a7 Compare February 23, 2025 16:54
return unless defined $_[0];
my @urls = split(/,/, $_[0]);
my @nonempty = grep { $_ } @urls;
return map { pathname_from_url($_) } @nonempty; }
Collaborator

Apologies in advance for the naming nit-picking.

pathname_from_urls probably wants to indicate the many-to-many map. So an extra s -
pathnames_from_urls

Similarly for the _to_ variant.

Also, we probably want to decide if we want to drop the _file from _file_url, or the opposite - to use it everywhere. As indicated, "URL" alone is ambiguous. Indeed, a relative pathname on Linux such as ./subdir/a/b/c/test.png is also a valid relative URL.

So maybe all of your new methods should use fileurl/fileurls as suffixes... But maybe don't rush to take my feedback before we hear from @brucemiller.

Contributor Author

> want to decide if we want to drop the _file from _file_url, or the opposite - to use it everywhere

Additional data point: most of the time LaTeXML wants a URL path (e.g. resources, images, CrossRef). The _file_url variant is for DTD and setURI, which require actual URLs and may do strange things with absolute URL paths (not sure how, but I wouldn't risk it). The third use is for candidates and search paths in this PR, where we should stick to paths for backwards compatibility.

Contributor Author

> probably wants to indicate the many-to-many map. So an extra s -
> pathnames_from_urls

Makes sense. I was just uncertain as to whether pathname_ was meant as a namespace as opposed to an actual word.

Contributor Author

Looking back at (and rebasing) this PR, a summary of why the PR looks like this:

  • the URL without file: must remain for backward compatibility with external tools that do not expect it
  • the file: scheme can be omitted because those 'urls' are used in contexts where file: is implied (e.g. graphics candidates)
  • pathname_to_url(s?) is the version without the file: scheme, for backward compatibility with the previous pathname_to_url (which was never about producing complete, well-formed URLs!)
  • pathname_to_file_url exists only for those rare situations in which the file: scheme is actually required (DTD and document URI)

So the naming problem really comes from the original pathname_to_url not producing URLs in the first place. We could rename pathname_to_url to pathname_urlencode, which is more accurate.
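
To make the distinction concrete, here is a rough sketch of the two flavours built on URI::file. The names sketch_url_path and sketch_file_url are hypothetical stand-ins, not the functions in this PR, and the 'unix' argument and sample path are assumptions for illustration only.

```perl
use strict;
use warnings;
use URI::file;

# Hypothetical stand-in for the scheme-less flavour: a percent-encoded
# URL path, with no "file:" prefix even when the pathname is absolute.
sub sketch_url_path {
    my ($pathname) = @_;
    return URI::file->new($pathname, 'unix')->path;
}

# Hypothetical stand-in for the full flavour: always a complete file: URL.
# new_abs resolves relative pathnames against the current directory.
sub sketch_file_url {
    my ($pathname) = @_;
    return URI::file->new_abs($pathname, 'unix')->as_string;
}

print sketch_url_path('/tmp/my doc/fig.png'), "\n";
print sketch_file_url('/tmp/my doc/fig.png'), "\n";
```

The first result is only an encoded path, which is why a name like pathname_urlencode would describe it more accurately than pathname_to_url.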

Development

Successfully merging this pull request may close these issues.

Install Fails on Windows 10 via "cpan LaTeXML"
