@JonoYang
Member

This PR adds a gibberish detector to textcode to avoid processing nonsense copyright strings detected from binaries.
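For readers unfamiliar with the approach: the rrenaud/Gibberish-Detector project referenced later in this thread scores text with a character-bigram Markov model. Below is a minimal Python sketch of that general idea. It is not the code added by this PR; the function names (`train`, `avg_log_prob`, `pick_threshold`, `is_gibberish`), the toy corpus, and the smoothing value are all assumptions made for illustration.

```python
import math

# Hypothetical sketch, not the PR's implementation.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def bigrams(text):
    """Yield consecutive character pairs, keeping only letters and spaces."""
    chars = [ch for ch in text.lower() if ch in INDEX]
    return zip(chars, chars[1:])


def train(corpus_lines, smoothing=0.01):
    """Build a table of log transition probabilities from sample text."""
    n = len(ALPHABET)
    counts = [[smoothing] * n for _ in range(n)]
    for line in corpus_lines:
        for a, b in bigrams(line):
            counts[INDEX[a]][INDEX[b]] += 1
    model = []
    for row in counts:
        total = sum(row)
        model.append([math.log(c / total) for c in row])
    return model


def avg_log_prob(text, model):
    """Average log probability per transition, or None if there are none."""
    scores = [model[INDEX[a]][INDEX[b]] for a, b in bigrams(text)]
    return sum(scores) / len(scores) if scores else None


def pick_threshold(model, good_samples, bad_samples):
    """Midpoint between the worst good score and the best bad score."""
    worst_good = min(avg_log_prob(s, model) for s in good_samples)
    best_bad = max(avg_log_prob(s, model) for s in bad_samples)
    return (worst_good + best_bad) / 2


def is_gibberish(text, model, threshold):
    score = avg_log_prob(text, model)
    # Assumption: with no transitions there is no evidence, so do not flag.
    return score is not None and score < threshold


# Toy corpus (an assumption; a real model needs far more text).
corpus = [
    "copyright the regents of the university of california",
    "all rights reserved by the original authors",
    "this software is distributed under the terms of the license",
    "written and maintained by the project contributors",
]
model = train(corpus)
good = "Copyright the project authors"
bad = "wjgdbfvhpmly"
threshold = pick_threshold(model, [good], [bad])
print(is_gibberish(good, model, threshold), is_gibberish(bad, model, threshold))
# prints: False True
```

In this sketch a fragment that yields no letter-to-letter transitions is never flagged; that is a design choice of the example, and the detector in this PR may treat short strings differently.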

@pombredanne changed the title from "2402 detect gibberish copyright" to "Detect gibberish copyright #2402" Nov 20, 2025
@pombredanne pombredanne changed the title Detect gibberish copyright #2402 Detect gibberish copyright #2402 Nov 20, 2025
@JonoYang JonoYang requested a review from pombredanne November 20, 2025 21:11
@JonoYang
Member Author

JonoYang commented Nov 20, 2025

@pombredanne after removing the tests that I linked above, only these data-driven tests still fail:

https://github.com/aboutcode-org/scancode-toolkit/blob/2402-detect-gibberish-copyright/tests/cluecode/data/copyrights/scilab-Scilab#L67

  • an instance of Scilab (c) INRIA-ENPC. was not detected
  • c) INRIA-ENPC. is identified as gibberish

https://github.com/aboutcode-org/scancode-toolkit/blob/2402-detect-gibberish-copyright/tests/cluecode/data/copyrights/misco4/linux-copyrights/Documentation/networking/arcnet-hardware.txt#L32

  • this did not detect Copyright Waterloo Microsystems Inc. 1985
  • @Copyright is identified as gibberish

https://github.com/aboutcode-org/scancode-toolkit/blob/2402-detect-gibberish-copyright/tests/cluecode/data/authors/trailing_date#L3C19-L3C59

  • Alexander Kanavin <[email protected]> was not detected
  • * : commit 3debe362faa62e5b381b880e3ba23aee07c85f6e Author: is detected as gibberish

@JonoYang JonoYang force-pushed the 2402-detect-gibberish-copyright branch from f9e70b3 to f3fd656 Compare December 19, 2025 22:49
@@ -0,0 +1,18 @@
about_resource: gibberish.py
Member

I think this is fine provenance research!

The original from @rrenaud at https://github.com/rrenaud/Gibberish-Detector references a SO answer:

"This is a nice (IMO) answer to this guys question on stackoverflow. http://stackoverflow.com/questions/6297991/is-there-any-way-to-detect-strings-like-putjbtghguhjjjanika/6298040#comment-7360747"

And the SO author is the same as the GH author: https://stackoverflow.com/users/286449/rob-neuhaus

So this settles the original license as MIT, per @rrenaud's choice.

Then we have this chain of forks and derivations to document:

It would be nice, and the right thing to do, to keep the credits for each author in this chain of forks and refinements and ... I guess we could either:

  • add that as extra doc in the ABOUT file
  • OR get the full git history for these files with some git-fu and git filter-repo?

(The license has stayed MIT all the way so this is about credits, not the license itself)

Happy to help with this, if guided as to what to do.
