MBS-11948: More strict favicon class assignment #2256

reosarevok · 2021-09-08T12:15:30Z

Fix MBS-11948

We have had several issues in the past of favicons being assigned incorrectly. The one triggering this issue was kenjimiura.com
being matched with the residentadvisor "ra.co" string.

There's no reason why we should stick to this naive approach to matching when we have regular expressions and URL objects, so this makes use of both to actually validate the full hostnames of URLs and, when needed, the start of their paths.

The regexes are mostly based on the existing matchup/cleanup we do. The few we don't validate/clean up at all I based on what the structure seems to be via a quick database check, although admittedly I might have missed some edge cases.

As an additional bonus, I added testing for the favicon classes, to make sure that at least all the URLs we were already testing
do match the expected favicon class (that should also serve as a reminder to change the regexes there if we change how a site is cleaned up).

AFAICT there's no reason not to just let URL() do the validation for us here - if it errors, then it's not a valid URL.

yvanzo

It looks very nice overall.

If @mwiencek confirms URL usage is appropriate here, it is probably not useful to add unit tests for the isValidURL function as its new implementation seems to be browser-dependent.
MBS-11948’s implementation looks nice, and I did check some favicon classes but not all of them since you added unit tests for these and any regression here would not be blocking editors. My only comments are about tests’ implementation.

yvanzo · 2021-09-15T11:50:29Z

root/url/utility/isUrlValid.js

+
+export default function isUrlValid(url: string): boolean {
+  try {
+    new URL(url);


AFAICT there's no reason not to just let URL() do the validation

The previous commits 69d185a and c5278c9 don’t explain why it has been implemented like that, but IIRC @mwiencek made a recent comment to another PR about URL usage; Maybe it is unrelated or outdated though?

Well, when I wrote that in 2014 the URL API was brand new (if supported at all outside of Chrome) and we didn't have core-js polyfills or anything back then.

One difference that probably exists, though, is that something like new URL('/haha') is valid, whereas the current function requires a fully-specified URL. This seems like something that should be looked into, if it matters.

That's not quite true - https://developer.mozilla.org/en-US/docs/Web/API/URL/URL "If url is a relative URL, base is required, and will be used as the base URL". Otherwise, it errors anyway.

yvanzo · 2021-09-15T12:02:52Z

root/static/scripts/tests/Control/URLCleanup.js

  },
  {
                     input_url: 'https://ra.co/tracks/544258',
             input_entity_type: 'recording',
    expected_relationship_type: 'otherdatabases',
            expected_clean_url: 'https://ra.co/tracks/544258',
       only_valid_entity_types: ['recording'],
+        expected_favicon_class: 'residentadvisor-favicon',
  },
  {
                     input_url: 'http://www.ra.co/reviews/7636',


Maybe test RA reviews’ favicon too?
Edit: Don’t make extraneous tests if setting the favicon doesn’t depend the path.

I don't think they have a different favicon for reviews? Am I missing something? :)

Right. I was initially looking for a favicon test of kenjimiura.com when I noticed you added a favicon test to all RA tests but that one; I just wondered why. Now looking at the code again (commit 86ef099 - root/static/scripts/common/constants.js - line 156), setting this favicon doesn’t depend on the path indeed. So maybe it’s not useful to have 7 favicon tests for RA URLs, having 1 might be enough?

Oh, now I see :) I added it to all of them but the one with no valid entity types (that's what I generally did elsewhere too) just in case we changed some specific cleanup and it no longer matched the favicon, but admittedly it might be overdoing it :) I can remove it from the rest if desired.

yvanzo · 2021-09-15T12:07:33Z

root/static/scripts/tests/Control/URLCleanup.js

  {
                     input_url: 'http://www.ra.co/exchange/552',
             input_entity_type: 'release',
    expected_relationship_type: 'shownotes',
            expected_clean_url: 'https://ra.co/exchange/552',
       only_valid_entity_types: ['release'],
+        expected_favicon_class: 'residentadvisor-favicon',
  },


I suggest testing the reported URL’s favicon for that bug:

Suggested change

{

input_url: 'http://www.ra.co/exchange/552',

input_entity_type: 'release',

expected_relationship_type: 'shownotes',

expected_clean_url: 'https://ra.co/exchange/552',

only_valid_entity_types: ['release'],

expected_favicon_class: 'residentadvisor-favicon',

},

{

input_url: 'http://www.ra.co/exchange/552',

input_entity_type: 'release',

expected_relationship_type: 'shownotes',

expected_clean_url: 'https://ra.co/exchange/552',

only_valid_entity_types: ['release'],

expected_favicon_class: 'residentadvisor-favicon',

},

{

input_url: 'https://www.kenjimiura.com/',

input_entity_type: 'artist',

expected_relationship_type: undefined,

expected_clean_url: 'https://www.kenjimiura.com/',

expected_favicon_class: 'no-favicon',

},

Sounds sensible, will add

yvanzo · 2021-09-15T12:14:03Z

root/static/scripts/tests/Control/URLCleanup.js

@@ -4848,6 +5405,11 @@ testData.forEach(function (subtest, i) {
      }
      tested = true;
    }
+    if (subtest.expected_favicon_class) {
+      const actualFaviconClass = getFaviconClass(actualCleanUrl);
+      st.equal(actualFaviconClass, subtest.expected_favicon_class, 'Correct favicon');


Should it test the input URL too? For existing unclean URLs.

I was expecting we wouldn't necessarily care if the favicon is not shown for existing URLs that need fixing anyway :) (that said, wonder if we could / should have some sort of either report or automatic script for fixing those).

yvanzo · 2021-09-15T12:18:37Z

root/static/scripts/tests/Control/URLCleanup.js

+    if (subtest.expected_favicon_class) {
+      const actualFaviconClass = getFaviconClass(actualCleanUrl);
+      st.equal(actualFaviconClass, subtest.expected_favicon_class, 'Correct favicon');
+      tested = true;


Does that imply that a test can just consist of input_url and expected_favicon_class ?
If it is tested against the clean URL, expected_clean_url should be mandatory.
Maybe the documentation about test properties in dcabded commit’s message should be added to this file?

Yes, IMO that should be ok, shouldn't it? It's tested against the return of cleanURL(), which is the original if there's no cleanup function, so effectively, it is tested against whatever will actually be added to the DB.

Oh, or is your idea (seeing the miura.com example above) that there should be an expected_clean_url even if the expectation is there's no cleanup and you get the same that came in?

Yes, IMO that should be ok, shouldn't it?

Not sure what “that” and “it” refer to.

It's tested against the return of cleanURL(), which is the original if there's no cleanup function

It is hiding the URL being actually tested for favicon class, from both the test file and the test output.

The clean URL may significantly differ from the input URL, that’s why I suggested to make expected_clean_url mandatory for use with expected_favicon_class.

so effectively, it is tested against whatever will actually be added to the DB.

So effectively, it doesn’t test against actual unclean URLs already in the DB.
Let’s use the above thread #2256 (comment) for follow-up on this point.

Oh, or is your idea (seeing the miura.com example above) that there should be an expected_clean_url even if the expectation is there's no cleanup and you get the same that came in?

Yes. (Did post the previous comment before seeing that one.)

Ok, we can always do that and always have expected_clean_url - I just thought that was a bit misleading because it suggested a cleanup was being run, but maybe not :)

We have had several issues in the past of favicons being assigned incorrectly. The one triggering this issue was kenjimiura.com being matched with the residentadvisor "ra.co" string. There's no reason why we should stick to this naive approach to matching when we have regular expressions and URL objects, so this makes use of both to actually validate the full hostnames of URLs and, when needed, the start of their paths. The regexes are mostly based on the existing matchup/cleanup we do. The few we don't validate/clean up at all I based on what the structure seems to be via a quick database check, although admittedly I might have missed some edge cases. As an additional bonus, I added testing for the favicon classes, to make sure that at least all the URLs we were already testing do match the expected favicon class (that should also serve as a reminder to change the regexes there if we change how a site is cleaned up).

reosarevok · 2021-09-15T12:53:47Z

A note: I did not add tests for the URLs not currently being tested at all in the URLCleanup tests (a bunch of them we don't clean up at all, we just autoselect, and I think there might be one or two stores we don't even autoselect to begin with). We could do that, but then we'd need to decide for sure how to do the testing for non-cleaned URLs :)

yvanzo · 2021-09-15T14:40:32Z

After further discussion on IRC, it seems that existing match properties in URLCleanupcould be used instead of new regular expressions in FAVICON_CLASSES, so deduplicating code and displaying favicon for non-clean URLs too (as it is for now). Favicon tests would then have to run against both input_url and clean URL (expected_clean_url if available, otherwise cleanURL()).

I did investigate for potential edge cases: Found unwelcomeimages in CLEANUPS; So it seems faviconify should be optional.

It would also require to import URLCleanup.js with non-edit pages. @mwiencek: Does that seem reasonable?

mwiencek · 2021-09-22T03:36:08Z

It would also require to import URLCleanup.js with non-edit pages. @mwiencek: Does that seem reasonable?

It's a very large file, so if we can't avoid it. :\ Can we put the match regular expressions into a separate file and import them into URLCleanup.js instead?

yvanzo · 2021-09-22T05:28:48Z

It would also require to import URLCleanup.js with non-edit pages. @mwiencek: Does that seem reasonable?

It's a very large file, so if we can't avoid it. :\ Can we put the match regular expressions into a separate file and import them into URLCleanup.js instead?

That would make URLCleanup.js more difficult to maintain because clean and validate are often closely related to the match regular expressions.

If URLCleanup.js builds const FAVICON_CLASSES (from CLEANUPS), exports that separately, and root/static/scripts/common/constants.js imports that only from URLCleanup.js, would it avoid loading the whole URLCleanup.js?
Or is there a way to automatically generate a faviconClasses.js file from URLCleanup.js at build time?

mwiencek · 2021-09-22T17:03:42Z

If URLCleanup.js builds const FAVICON_CLASSES (from CLEANUPS), exports that separately, and root/static/scripts/common/constants.js imports that only from URLCleanup.js, would it avoid loading the whole URLCleanup.js?

It'd still load the whole file in that case.

Or is there a way to automatically generate a faviconClasses.js file from URLCleanup.js at build time?

This wouldn't be too difficult, actually, since you can access .source to rebuild the expressions.

reosarevok · 2021-11-24T11:10:49Z

We should decide on an option here. IIRC we looked into building at runtime and it was actually a lot more complicated than it sounded, and then I looked at extracting the regexes from URLCleanup with a script and it was also more complicated than needed, so that leaves two reasonable options:

a) have one external file, and use that for match in URLCleanup.
b) have an external file that is just a hand-made copy of the URLCleanup regexes, with a comment to keep them in sync.

Or, if you have a simple idea to parse the file and automaticaly extract the regexes, I'm all ears.

Simplify isUrlValid to use new URL rather than a document

7bfac4a

AFAICT there's no reason not to just let URL() do the validation for us here - if it errors, then it's not a valid URL.

reosarevok added the QoL Non-urgent quality of life improvements label Sep 8, 2021

This was referenced Sep 8, 2021

MBS-11941: Normalize Worldcat identities URLs #2257

Merged

MBS-11957: Clean up twitch.com to twitch.tv #2260

Merged

MBS-11959: Allow RYM links for music video recordings #2261

Merged

reosarevok added this to the 2021-10-04 milestone Sep 13, 2021

reosarevok mentioned this pull request Sep 13, 2021

MBS-11952: Autoselect Vimeo On Demand URLs correctly #2262

Merged

yvanzo reviewed Sep 15, 2021

View reviewed changes

reosarevok force-pushed the MBS-11948 branch from 86ef099 to f240b36 Compare September 15, 2021 12:52

reosarevok mentioned this pull request Sep 17, 2021

MBS-11968: Allow VIAF, DNB, LoC links for series #2268

Merged

reosarevok modified the milestones: 2021-10-04, 2021-10-18 Sep 27, 2021

reosarevok marked this pull request as draft September 27, 2021 17:20

reosarevok removed this from the 2021-10-25 milestone Oct 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MBS-11948: More strict favicon class assignment #2256

MBS-11948: More strict favicon class assignment #2256

reosarevok commented Sep 8, 2021

yvanzo left a comment

yvanzo Sep 15, 2021

mwiencek Sep 22, 2021

reosarevok Nov 24, 2021

yvanzo Sep 15, 2021 •

edited

Loading

reosarevok Sep 15, 2021

yvanzo Sep 15, 2021

reosarevok Sep 15, 2021

yvanzo Sep 15, 2021

reosarevok Sep 15, 2021

yvanzo Sep 15, 2021

reosarevok Sep 15, 2021

yvanzo Sep 15, 2021

reosarevok Sep 15, 2021 •

edited

Loading

reosarevok Sep 15, 2021

yvanzo Sep 15, 2021

yvanzo Sep 15, 2021

reosarevok Sep 15, 2021

reosarevok commented Sep 15, 2021

yvanzo commented Sep 15, 2021

mwiencek commented Sep 22, 2021

yvanzo commented Sep 22, 2021

mwiencek commented Sep 22, 2021

reosarevok commented Nov 24, 2021

MBS-11948: More strict favicon class assignment #2256

Are you sure you want to change the base?

MBS-11948: More strict favicon class assignment #2256

Conversation

reosarevok commented Sep 8, 2021

Fix MBS-11948

yvanzo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yvanzo Sep 15, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reosarevok Sep 15, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reosarevok commented Sep 15, 2021

yvanzo commented Sep 15, 2021

mwiencek commented Sep 22, 2021

yvanzo commented Sep 22, 2021

mwiencek commented Sep 22, 2021

reosarevok commented Nov 24, 2021

yvanzo Sep 15, 2021 •

edited

Loading

reosarevok Sep 15, 2021 •

edited

Loading