[Scancode] dual license handling in the summarizer #280

dabutvin · 2018-11-13T22:19:14Z

When we gather the file.licenses.license or file.licenses.spdx_license_key data, we AND the values together by default, which is not always correct.

Given this rust crate

We list these discovered files as Apache-2.0 and MIT, but this is only because scancode found both and we AND them together.

The scancode output has other information we should probably consume, for example:

    {
      "key": "apache-2.0",
      "score": 20,
      "short_name": "Apache 2.0",
      "category": "Permissive",
      "owner": "Apache Software Foundation",
      "homepage_url": "http://www.apache.org/licenses/",
      "text_url": "http://www.apache.org/licenses/LICENSE-2.0",
      "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-2.0",
      "spdx_license_key": "Apache-2.0",
      "spdx_url": "https://spdx.org/licenses/Apache-2.0",
      "start_line": 5,
      "end_line": 5,
      "matched_rule": {
        "identifier": "mit_or_apache-2.0_1.RULE",
        "license_expression": "mit OR apache-2.0",
        "licenses": [
          "mit",
          "apache-2.0"
        ],
        "matcher": "2-aho",
        "rule_length": 4,
        "matched_length": 4,
        "match_coverage": 100,
        "rule_relevance": 20
      }
    }
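
A minimal sketch of consuming that field, assuming every entry in a file's `licenses` array has the shape shown above. The helper name `expressionsFromMatches` and the second (`mit`) sample match are made up for illustration; this is not the summarizer's actual code.

```javascript
// Hypothetical helper: collect the distinct rule-level expressions from one
// file's `licenses` array instead of reading each `spdx_license_key` alone.
function expressionsFromMatches(licenses) {
  const seen = new Set()
  for (const match of licenses || []) {
    const rule = match.matched_rule
    // Fall back to the bare key when a match carries no rule expression.
    const expression = (rule && rule.license_expression) || match.key
    if (expression) seen.add(expression)
  }
  return [...seen]
}

// Assumed sample: two matches produced by the same "mit OR apache-2.0" rule.
const sampleMatches = [
  { key: 'apache-2.0', matched_rule: { license_expression: 'mit OR apache-2.0' } },
  { key: 'mit', matched_rule: { license_expression: 'mit OR apache-2.0' } }
]

console.log(expressionsFromMatches(sampleMatches)) // [ 'mit OR apache-2.0' ]
```

Deduplicating on the rule expression keeps the OR that is lost when the two `spdx_license_key` values are joined one by one.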

cc @pombredanne

pombredanne · 2018-11-26T16:22:20Z

@dabutvin sorry for the late reply. ScanCode returns detected licenses as expressions.
You can get these in either of two ways:

- With the license_expressions attribute of a files object. This is a list of the license expression strings found in a given file. If you want the composite expression, you could wrap each one in parens and AND them together.
- Alternatively, the licenses attribute lists each license together with the license_expression attribute of its matched_rule, which is a single string.
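
The first option could be sketched roughly as follows. The helper name `combineExpressions` is hypothetical, and the input is assumed to be the `license_expressions` array of a single file object.

```javascript
// Hypothetical helper: build one composite expression for a file from its
// `license_expressions` list by wrapping each entry in parens and ANDing them.
function combineExpressions(expressions) {
  const unique = [...new Set((expressions || []).filter(Boolean))]
  if (unique.length === 0) return null
  // A single expression needs no wrapping.
  if (unique.length === 1) return unique[0]
  return unique.map(expression => `(${expression})`).join(' AND ')
}

console.log(combineExpressions(['mit OR apache-2.0', 'unicode']))
// (mit OR apache-2.0) AND (unicode)
```

Wrapping every entry is slightly noisy for single-license entries, but it guarantees that an OR inside one entry is never regrouped by the surrounding ANDs.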

In both cases the expressions are made of ScanCode license keys (which are the keys in https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses)

If you want SPDX IDs instead, each license object that has one carries an spdx_license_key attribute, so you could parse the expression and replace each key with its SPDX ID (and possibly use SPDX LicenseRef-... for license keys that are not known to SPDX).
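
A rough sketch of that replacement, under stated assumptions: the lookup table `spdxByKey` is hand-written here (in practice it would be built from the `key` and `spdx_license_key` fields of the scanned license objects), the helper name `toSpdxExpression` is hypothetical, and the plain `LicenseRef-<key>` fallback is only one possible naming scheme.

```javascript
// Assumed lookup from ScanCode key to SPDX id.
const spdxByKey = new Map([
  ['mit', 'MIT'],
  ['apache-2.0', 'Apache-2.0']
])

// Hypothetical helper: swap ScanCode keys for SPDX ids, token by token.
function toSpdxExpression(expression, lookup) {
  // Split on whitespace and parens, keeping them so the structure survives.
  return expression
    .split(/(\s+|\(|\))/)
    .map(token => {
      if (!token || /^\s+$/.test(token) || token === '(' || token === ')') return token
      if (/^(AND|OR|WITH)$/.test(token)) return token
      // Keys without an SPDX id get an assumed LicenseRef- name.
      return lookup.get(token) || `LicenseRef-${token}`
    })
    .join('')
}

console.log(toSpdxExpression('(mit OR apache-2.0) AND unicode', spdxByKey))
// (MIT OR Apache-2.0) AND LicenseRef-unicode
```

This works on tokens rather than a real parse tree, which is enough for key substitution because the operators and parentheses pass through untouched.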

Alternatively, we could update ScanCode to also return license_expressions made of SPDX keys.

FWIW, expressions are reasonably common in the jQuery world. Many Rust crates have such expressions. And there are several other cases: roughly 20% of the 7K rules in ScanCode return expressions with more than one license.

pombredanne · 2018-11-26T16:23:55Z

So in any case, ANDing license objects is unlikely to be correct, as you are missing the OR and WITH cases.

dabutvin · 2018-12-07T22:41:54Z

Thanks @pombredanne
I'm looking into fixing this now. Should I expect a significant difference in scancode 3.0?

fossygirl · 2019-02-21T23:34:35Z

@pombredanne @dabutvin Just checking on this for our upgrade to Scancode 3.0?

dabutvin · 2019-04-26T18:24:49Z

this is still an issue with the summarizer

see https://clearlydefined.io/definitions/git/github/rust-lang/regex/18a71d0a30a6dcdcd86d1af6dd9cb0688b89f2ee

the readme is correctly detecting MIT OR Apache-2.0
but our summarizer is joining them with ANDs for the declared

ignacionr · 2019-10-23T21:01:38Z

Finally I got some good input for you guys, but it will require us to decide how to handle things.

The thing is, our summarizer for Scancode looks at the files array provided by the tool, not at the (very smart) content.summary.license_expressions.
Going file by file, the system finds that there are both Apache and MIT identifiable licenses, and ANDs them together by default. As seen here.

On to the decision-making part.

Is ANDing together the different licenses found on files a good thing to do by default?
Even if it is, do we want to use the pre-summarized information returned in the content.summary of Scancode? (Count me among the ayes, since the tool looks like it's doing some very interesting magic to get to the proper connector.) Then again, if we decide to use it, should we use the most cited license type (the one with the greatest number of matching files)?

For reference, Scancode gives the following summarized counts and values for the package mentioned (the winner, with 117 files, is the combination it suggests we take into account):

[{"value":null,"count":159},{"value":"mit OR apache-2.0","count":117},{"value":"public-domain","count":16},{"value":"apache-2.0 OR mit","count":5},{"value":"apache-2.0","count":1},{"value":"mit","count":1},{"value":"mit-synopsys","count":1},{"value":"unknown","count":1}]

Still, how dangerous is it to decide by file count? What if several license expressions tie on file count?
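
To make the tie question concrete, here is a small sketch that picks by file count and returns every expression sharing the top count, so the caller can see when there is no single winner. The helper name `topExpressions` is hypothetical, and skipping only the `null` entry is an assumption about what should count.

```javascript
// The summarized counts quoted above for this package.
const summaryCounts = [
  { value: null, count: 159 },
  { value: 'mit OR apache-2.0', count: 117 },
  { value: 'public-domain', count: 16 },
  { value: 'apache-2.0 OR mit', count: 5 },
  { value: 'apache-2.0', count: 1 },
  { value: 'mit', count: 1 },
  { value: 'mit-synopsys', count: 1 },
  { value: 'unknown', count: 1 }
]

// Hypothetical helper: return all detected expressions tied for the top count.
function topExpressions(entries) {
  // Files with no detection (value null) are not candidates.
  const detected = entries.filter(entry => entry.value !== null)
  if (detected.length === 0) return []
  const best = Math.max(...detected.map(entry => entry.count))
  return detected.filter(entry => entry.count === best).map(entry => entry.value)
}

const winners = topExpressions(summaryCounts)
if (winners.length === 1) console.log(`declared candidate: ${winners[0]}`)
else console.log(`no clear winner, needs another rule: ${winners.join(', ')}`)
// declared candidate: mit OR apache-2.0
```

Note that "mit OR apache-2.0" and "apache-2.0 OR mit" are counted separately here even though they mean the same thing, so some normalization before counting may be worth considering.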

All comments appreciated.

geneh · 2019-10-25T20:26:58Z

The component's license is MIT OR Apache-2.0 and not MIT AND Apache-2.0 as initially reported for https://clearlydefined.io/definitions/crate/cratesio/-/regex/1.0.6
It is, however, MIT AND Apache-2.0 for https://clearlydefined.io/definitions/git/github/rust-lang/regex/18a71d0a30a6dcdcd86d1af6dd9cb0688b89f2ee.
scancode 3.2.2 was run for both of the components.
Any idea why the declared license is different?
As to whether it is dangerous or not, we would need to analyze tens or hundreds of sample packages to be sure. We probably do not want to invest time in that right now.

ignacionr · 2019-10-31T10:16:37Z

@geneh I am looking at the example you give with cratesio, but at first glance the harvested info is a lot more complex, and for the older version it also includes runs of earlier ScanCode versions, which may be the explanation (I will debug this to make sure that is the case).

ignacionr · 2019-11-27T09:58:28Z

@geneh what would you say if we remove the original data, and recrawl? We would lose existing information from ScanCode, but I think that is exactly what is wrong. Just an idea.

geneh · 2019-11-27T21:20:15Z

@ignacionr Do you mean recrawl all the components? Why would the outcome be any different if the scancode version is the same? Do you mean we should recompute definitions for the affected scancode results?

tmarble · 2019-12-02T21:06:41Z

I just ran scancode on the most recent version of regex and saved the result as a gist. Please note that scancode has different results for...

Cargo.toml = "license_expressions": ["mit OR apache-2.0" ]
README.md = "license_expressions": ["mit OR apache-2.0", "unicode"]

Compare this to the code snippet above from scancode.js, which calls _joinExpressions (that function always combines with AND) for all the full license files. We would thus expect the result to be "mit AND apache-2.0 AND unicode", but it is only "mit AND apache-2.0" because the Unicode license is only referenced in the README (the full Unicode license text is not in the top-level directory, but rather in regex-syntax/src/unicode_tables/LICENSE-UNICODE).

Because the resulting package "regex" contains a compilation of Unicode data and Rust source code, the likely correct SPDX expression for the ensemble is "(mit OR apache-2.0) AND unicode". NOTE: the parentheses are required because otherwise the AND takes precedence, per 4) Order of Precedence and Parentheses in the SPDX spec.
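
To illustrate the precedence point, here is a small sketch of a parser in which AND binds tighter than OR, printing the grouping explicitly. It is a simplification for illustration only: it ignores WITH, does no error handling, and the function names are made up rather than taken from any SPDX library.

```javascript
// Split an expression into parens and words (license keys and operators).
function tokenize(expression) {
  return expression.match(/\(|\)|[^\s()]+/g) || []
}

// Grammar, loosest to tightest binding:
//   or   := and ('OR' and)*
//   and  := atom ('AND' atom)*
//   atom := key | '(' or ')'
function parse(expression) {
  const tokens = tokenize(expression)
  let pos = 0

  function parseAtom() {
    const token = tokens[pos++]
    if (token === '(') {
      const inner = parseOr()
      pos++ // skip the closing ')'
      return inner
    }
    return token
  }

  function parseAnd() {
    let left = parseAtom()
    while (tokens[pos] === 'AND') {
      pos++
      left = { op: 'AND', left, right: parseAtom() }
    }
    return left
  }

  function parseOr() {
    let left = parseAnd()
    while (tokens[pos] === 'OR') {
      pos++
      left = { op: 'OR', left, right: parseAnd() }
    }
    return left
  }

  return parseOr()
}

// Print the tree with every grouping made explicit.
function render(node) {
  if (typeof node === 'string') return node
  return `(${render(node.left)} ${node.op} ${render(node.right)})`
}

console.log(render(parse('mit OR apache-2.0 AND unicode')))
// (mit OR (apache-2.0 AND unicode))
console.log(render(parse('(mit OR apache-2.0) AND unicode')))
// ((mit OR apache-2.0) AND unicode)
```

Without the parentheses the expression reads as "MIT, or else both Apache-2.0 and Unicode", which is not what the package means.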

So here are the questions:

1. Should we rely on the Scancode licenses for Cargo.toml or README.md?
2. Should we rely on the discovered full license text(s) in the top level directory?
3. Which takes priority between 1) and 2)?
4. Should the answer apply to just Cargo (Rust) or other package types?

fossygirl · 2019-12-02T21:36:13Z

@jeffmcaffer @pombredanne @iamwillbar Would love to hear your opinions on this one.

dabutvin pushed a commit to dabutvin/service that referenced this issue Dec 8, 2018

use the license_expression in scancode summarizer clearlydefined#280

28cb126

dabutvin added the bug label Apr 26, 2019

ignacionr self-assigned this Jun 24, 2019

storrisi added this to the October 2019 milestone Oct 1, 2019

jeffmendoza modified the milestones: October 2019, November 2019 Nov 4, 2019

nellshamrell added the question label Jan 20, 2021

nellshamrell removed the bug label Jan 20, 2021

capfei added the Declared License Issue label Jan 27, 2022
