Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MBS-11948: More strict favicon class assignment #2256

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

reosarevok
Copy link
Member

Fix MBS-11948

We have had several issues in the past of favicons being assigned incorrectly. The one triggering this issue was kenjimiura.com
being matched with the residentadvisor "ra.co" string.

There's no reason why we should stick to this naive approach to matching when we have regular expressions and URL objects, so this makes use of both to actually validate the full hostnames of URLs and, when needed, the start of their paths.

The regexes are mostly based on the existing matchup/cleanup we do. The few we don't validate/clean up at all I based on what the structure seems to be via a quick database check, although admittedly I might have missed some edge cases.

As an additional bonus, I added testing for the favicon classes, to make sure that at least all the URLs we were already testing
do match the expected favicon class (that should also serve as a reminder to change the regexes there if we change how a site is cleaned up).

AFAICT there's no reason not to just let URL() do the validation
for us here - if it errors, then it's not a valid URL.
Copy link
Contributor

@yvanzo yvanzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks very nice overall.

  • If @mwiencek confirms URL usage is appropriate here, it is probably not useful to add unit tests for the isValidURL function as its new implementation seems to be browser-dependent.
  • MBS-11948’s implementation looks nice, and I did check some favicon classes but not all of them since you added unit tests for these and any regression here would not be blocking editors. My only comments are about tests’ implementation.


export default function isUrlValid(url: string): boolean {
try {
new URL(url);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT there's no reason not to just let URL() do the validation

The previous commits 69d185a and c5278c9 don’t explain why it has been implemented like that, but IIRC @mwiencek made a recent comment to another PR about URL usage; Maybe it is unrelated or outdated though?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, when I wrote that in 2014 the URL API was brand new (if supported at all outside of Chrome) and we didn't have core-js polyfills or anything back then.

One difference that probably exists, though, is that something like new URL('/haha') is valid, whereas the current function requires a fully-specified URL. This seems like something that should be looked into, if it matters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not quite true - https://developer.mozilla.org/en-US/docs/Web/API/URL/URL "If url is a relative URL, base is required, and will be used as the base URL". Otherwise, it errors anyway.

},
{
input_url: 'https://ra.co/tracks/544258',
input_entity_type: 'recording',
expected_relationship_type: 'otherdatabases',
expected_clean_url: 'https://ra.co/tracks/544258',
only_valid_entity_types: ['recording'],
expected_favicon_class: 'residentadvisor-favicon',
},
{
input_url: 'http://www.ra.co/reviews/7636',
Copy link
Contributor

@yvanzo yvanzo Sep 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe test RA reviews’ favicon too?
Edit: Don’t make extraneous tests if setting the favicon doesn’t depend the path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think they have a different favicon for reviews? Am I missing something? :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I was initially looking for a favicon test of kenjimiura.com when I noticed you added a favicon test to all RA tests but that one; I just wondered why. Now looking at the code again (commit 86ef099 - root/static/scripts/common/constants.js - line 156), setting this favicon doesn’t depend on the path indeed. So maybe it’s not useful to have 7 favicon tests for RA URLs, having 1 might be enough?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, now I see :) I added it to all of them but the one with no valid entity types (that's what I generally did elsewhere too) just in case we changed some specific cleanup and it no longer matched the favicon, but admittedly it might be overdoing it :) I can remove it from the rest if desired.

Comment on lines 3837 to 3844
{
input_url: 'http://www.ra.co/exchange/552',
input_entity_type: 'release',
expected_relationship_type: 'shownotes',
expected_clean_url: 'https://ra.co/exchange/552',
only_valid_entity_types: ['release'],
expected_favicon_class: 'residentadvisor-favicon',
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest testing the reported URL’s favicon for that bug:

Suggested change
{
input_url: 'http://www.ra.co/exchange/552',
input_entity_type: 'release',
expected_relationship_type: 'shownotes',
expected_clean_url: 'https://ra.co/exchange/552',
only_valid_entity_types: ['release'],
expected_favicon_class: 'residentadvisor-favicon',
},
{
input_url: 'http://www.ra.co/exchange/552',
input_entity_type: 'release',
expected_relationship_type: 'shownotes',
expected_clean_url: 'https://ra.co/exchange/552',
only_valid_entity_types: ['release'],
expected_favicon_class: 'residentadvisor-favicon',
},
{
input_url: 'https://www.kenjimiura.com/',
input_entity_type: 'artist',
expected_relationship_type: undefined,
expected_clean_url: 'https://www.kenjimiura.com/',
expected_favicon_class: 'no-favicon',
},

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds sensible, will add

@@ -4848,6 +5405,11 @@ testData.forEach(function (subtest, i) {
}
tested = true;
}
if (subtest.expected_favicon_class) {
const actualFaviconClass = getFaviconClass(actualCleanUrl);
st.equal(actualFaviconClass, subtest.expected_favicon_class, 'Correct favicon');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it test the input URL too? For existing unclean URLs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting we wouldn't necessarily care if the favicon is not shown for existing URLs that need fixing anyway :) (that said, wonder if we could / should have some sort of either report or automatic script for fixing those).

if (subtest.expected_favicon_class) {
const actualFaviconClass = getFaviconClass(actualCleanUrl);
st.equal(actualFaviconClass, subtest.expected_favicon_class, 'Correct favicon');
tested = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that imply that a test can just consist of input_url and expected_favicon_class ?
If it is tested against the clean URL, expected_clean_url should be mandatory.
Maybe the documentation about test properties in dcabded commit’s message should be added to this file?

Copy link
Member Author

@reosarevok reosarevok Sep 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, IMO that should be ok, shouldn't it? It's tested against the return of cleanURL(), which is the original if there's no cleanup function, so effectively, it is tested against whatever will actually be added to the DB.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, or is your idea (seeing the miura.com example above) that there should be an expected_clean_url even if the expectation is there's no cleanup and you get the same that came in?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, IMO that should be ok, shouldn't it?

Not sure what “that” and “it” refer to.

It's tested against the return of cleanURL(), which is the original if there's no cleanup function

It is hiding the URL being actually tested for favicon class, from both the test file and the test output.

The clean URL may significantly differ from the input URL, that’s why I suggested to make expected_clean_url mandatory for use with expected_favicon_class.

so effectively, it is tested against whatever will actually be added to the DB.

So effectively, it doesn’t test against actual unclean URLs already in the DB.
Let’s use the above thread #2256 (comment) for follow-up on this point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, or is your idea (seeing the miura.com example above) that there should be an expected_clean_url even if the expectation is there's no cleanup and you get the same that came in?

Yes. (Did post the previous comment before seeing that one.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we can always do that and always have expected_clean_url - I just thought that was a bit misleading because it suggested a cleanup was being run, but maybe not :)

We have had several issues in the past of favicons being assigned
incorrectly. The one triggering this issue was kenjimiura.com
being matched with the residentadvisor "ra.co" string.
There's no reason why we should stick to this naive approach to matching
when we have regular expressions and URL objects, so this
makes use of both to actually validate the full hostnames of URLs
and, when needed, the start of their paths.
The regexes are mostly based on the existing matchup/cleanup
we do. The few we don't validate/clean up at all I based
on what the structure seems to be via a quick database check,
although admittedly I might have missed some edge cases.

As an additional bonus, I added testing for the favicon classes,
to make sure that at least all the URLs we were already testing
do match the expected favicon class (that should also serve
as a reminder to change the regexes there if we change how a site
is cleaned up).
@reosarevok
Copy link
Member Author

A note: I did not add tests for the URLs not currently being tested at all in the URLCleanup tests (a bunch of them we don't clean up at all, we just autoselect, and I think there might be one or two stores we don't even autoselect to begin with). We could do that, but then we'd need to decide for sure how to do the testing for non-cleaned URLs :)

@yvanzo
Copy link
Contributor

yvanzo commented Sep 15, 2021

After further discussion on IRC, it seems that existing match properties in URLCleanupcould be used instead of new regular expressions in FAVICON_CLASSES, so deduplicating code and displaying favicon for non-clean URLs too (as it is for now). Favicon tests would then have to run against both input_url and clean URL (expected_clean_url if available, otherwise cleanURL()).

I did investigate for potential edge cases: Found unwelcomeimages in CLEANUPS; So it seems faviconify should be optional.

It would also require to import URLCleanup.js with non-edit pages. @mwiencek: Does that seem reasonable?

@mwiencek
Copy link
Member

It would also require to import URLCleanup.js with non-edit pages. @mwiencek: Does that seem reasonable?

It's a very large file, so if we can't avoid it. :\ Can we put the match regular expressions into a separate file and import them into URLCleanup.js instead?

@yvanzo
Copy link
Contributor

yvanzo commented Sep 22, 2021

It would also require to import URLCleanup.js with non-edit pages. @mwiencek: Does that seem reasonable?

It's a very large file, so if we can't avoid it. :\ Can we put the match regular expressions into a separate file and import them into URLCleanup.js instead?

That would make URLCleanup.js more difficult to maintain because clean and validate are often closely related to the match regular expressions.

  • If URLCleanup.js builds const FAVICON_CLASSES (from CLEANUPS), exports that separately, and root/static/scripts/common/constants.js imports that only from URLCleanup.js, would it avoid loading the whole URLCleanup.js?
  • Or is there a way to automatically generate a faviconClasses.js file from URLCleanup.js at build time?

@mwiencek
Copy link
Member

If URLCleanup.js builds const FAVICON_CLASSES (from CLEANUPS), exports that separately, and root/static/scripts/common/constants.js imports that only from URLCleanup.js, would it avoid loading the whole URLCleanup.js?

It'd still load the whole file in that case.

Or is there a way to automatically generate a faviconClasses.js file from URLCleanup.js at build time?

This wouldn't be too difficult, actually, since you can access .source to rebuild the expressions.

@reosarevok reosarevok modified the milestones: 2021-10-04, 2021-10-18 Sep 27, 2021
@reosarevok reosarevok marked this pull request as draft September 27, 2021 17:20
@reosarevok reosarevok removed this from the 2021-10-25 milestone Oct 19, 2021
@reosarevok
Copy link
Member Author

We should decide on an option here. IIRC we looked into building at runtime and it was actually a lot more complicated than it sounded, and then I looked at extracting the regexes from URLCleanup with a script and it was also more complicated than needed, so that leaves two reasonable options:

a) have one external file, and use that for match in URLCleanup.
b) have an external file that is just a hand-made copy of the URLCleanup regexes, with a comment to keep them in sync.

Or, if you have a simple idea to parse the file and automaticaly extract the regexes, I'm all ears.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
QoL Non-urgent quality of life improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants