[Scancode] dual license handling in the summarizer #280

Open
dabutvin opened this issue Nov 13, 2018 · 13 comments

Comments

@dabutvin
Member

When we gather the file.licenses.license or file.licenses.spdx_license_key data, we AND the values together by default, which is not always correct.

Given this rust crate

We list these discovered files as Apache-2.0 and MIT, but this is only because scancode found both and we AND them together.

The scancode output has other information we should probably consume, for example:

             {
                "key": "apache-2.0",
                "score": 20,
                "short_name": "Apache 2.0",
                "category": "Permissive",
                "owner": "Apache Software Foundation",
                "homepage_url": "http://www.apache.org/licenses/",
                "text_url": "http://www.apache.org/licenses/LICENSE-2.0",
                "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-2.0",
                "spdx_license_key": "Apache-2.0",
                "spdx_url": "https://spdx.org/licenses/Apache-2.0",
                "start_line": 5,
                "end_line": 5,
                "matched_rule": {
                  "identifier": "mit_or_apache-2.0_1.RULE",
                  "license_expression": "mit OR apache-2.0",
                  "licenses": [
                    "mit",
                    "apache-2.0"
                  ],
                  "matcher": "2-aho",
                  "rule_length": 4,
                  "matched_length": 4,
                  "match_coverage": 100,
                  "rule_relevance": 20
                }
              }
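
As a minimal sketch of consuming that extra information, the rule-level expression can be read from each match instead of ANDing the individual keys. The helper name `expressionsFromLicenses` is hypothetical, and the second entry in the sample data (the sibling `mit` match produced by the same rule) is assumed for illustration.

```javascript
// Hypothetical helper: collect the distinct matched_rule.license_expression
// values for one file, instead of treating each license key separately.
function expressionsFromLicenses(licenses) {
  const seen = new Set();
  for (const license of licenses || []) {
    const expression = license.matched_rule && license.matched_rule.license_expression;
    if (expression) seen.add(expression);
  }
  return [...seen];
}

// Trimmed-down form of the match shown above, plus an assumed sibling "mit" entry.
const sampleLicenses = [
  { key: 'apache-2.0', matched_rule: { license_expression: 'mit OR apache-2.0' } },
  { key: 'mit', matched_rule: { license_expression: 'mit OR apache-2.0' } }
];

console.log(expressionsFromLicenses(sampleLicenses)); // [ 'mit OR apache-2.0' ]
```

Both entries come from the same rule, so the file yields a single OR expression rather than two licenses joined with AND.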

cc @pombredanne

@pombredanne
Member

@dabutvin sorry for the late reply. ScanCode returns detected licenses as expressions.
You can get these in either of two ways:

  1. With the license_expressions attribute of a file object. This is a list of the license expression strings found in a given file. If you want the composite expression, you could wrap each one in parens and AND them together.

  2. Alternatively, the licenses attribute lists each license together with the license_expression attribute of its matched_rule, which is a single string.
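
The first option could be sketched as follows. The function name `joinWithAnd` is hypothetical; it assumes the input is the license_expressions list of one file and only parenthesizes expressions that contain more than one token.

```javascript
// Hypothetical helper: combine a file's license_expressions into one expression.
// Compound expressions are wrapped in parentheses so their OR/WITH operators
// survive being joined with AND.
function joinWithAnd(expressions) {
  const unique = [...new Set((expressions || []).filter(Boolean))];
  if (unique.length === 0) return null;
  if (unique.length === 1) return unique[0];
  return unique
    .map(expression => (/\s/.test(expression) ? `(${expression})` : expression))
    .join(' AND ');
}

console.log(joinWithAnd(['mit OR apache-2.0', 'unicode']));
// (mit OR apache-2.0) AND unicode
```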

In both cases the expressions are made of ScanCode license keys (which are the keys in https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses)

If you want SPDX ids instead, each license object that has one carries an spdx_license_key attribute, so you could parse the expression and then replace each key with its SPDX ID (and possibly use SPDX LicenseRef-... for license keys that are not known to SPDX).
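
A rough sketch of that key replacement, under stated assumptions: `buildSpdxMap` and `toSpdxExpression` are hypothetical names, the tokenizer is a simple regular expression rather than a full expression parser, and the `LicenseRef-<key>` fallback is only a placeholder form, since the thread does not settle on a naming scheme.

```javascript
const OPERATORS = new Set(['AND', 'OR', 'WITH']);

// Build a ScanCode key -> SPDX id lookup from the license objects of a scan.
function buildSpdxMap(licenses) {
  const map = new Map();
  for (const { key, spdx_license_key: spdx } of licenses) {
    if (key && spdx) map.set(key, spdx);
  }
  return map;
}

// Replace every ScanCode key in an expression with its SPDX id.
// Keys without an SPDX id get a placeholder LicenseRef (assumed format).
function toSpdxExpression(expression, spdxByKey) {
  const tokens = expression.match(/[()]|[^\s()]+/g) || [];
  const mapped = tokens.map(token => {
    if (token === '(' || token === ')') return token;
    if (OPERATORS.has(token.toUpperCase())) return token.toUpperCase();
    return spdxByKey.get(token) || `LicenseRef-${token}`;
  });
  // Re-join and tidy the spaces that the join puts inside parentheses.
  return mapped.join(' ').replace(/\( /g, '(').replace(/ \)/g, ')');
}

const spdxByKey = buildSpdxMap([
  { key: 'mit', spdx_license_key: 'MIT' },
  { key: 'apache-2.0', spdx_license_key: 'Apache-2.0' },
  { key: 'unicode' } // assumed here to have no SPDX id, to show the fallback
]);

console.log(toSpdxExpression('(mit OR apache-2.0) AND unicode', spdxByKey));
// (MIT OR Apache-2.0) AND LicenseRef-unicode
```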

Alternatively, we could update ScanCode to also return license_expressions made of SPDX keys.

FWIW, expressions are reasonably common in the jQuery world. Many Rust crates have such expressions. And there are several other cases: about 20% of the 7K rules in ScanCode return expressions with more than one license.

@pombredanne
Member

So in any case, ANDing license objects is unlikely to be correct, as you are missing the OR and WITH cases.

@dabutvin
Member Author

dabutvin commented Dec 7, 2018

Thanks @pombredanne
I'm looking into fixing this now. Should I expect a significant difference in scancode 3.0?

dabutvin pushed a commit to dabutvin/service that referenced this issue Dec 8, 2018
@fossygirl
Member

@pombredanne @dabutvin Just checking on this for our upgrade to Scancode 3.0?

@dabutvin
Member Author

this is still an issue with the summarizer

see https://clearlydefined.io/definitions/git/github/rust-lang/regex/18a71d0a30a6dcdcd86d1af6dd9cb0688b89f2ee

the readme is correctly detected as MIT OR Apache-2.0,
but our summarizer joins them with AND for the declared license

@dabutvin dabutvin added the bug label Apr 26, 2019
@ignacionr ignacionr self-assigned this Jun 24, 2019
@storrisi storrisi added this to the October 2019 milestone Oct 1, 2019
@ignacionr
Member

ignacionr commented Oct 23, 2019

Finally I got some good input for you guys, but it will require us to decide how to handle things.

The thing is, our summarizer for Scancode looks into the files array provided by the tool, and not at the (very smart) content.summary.license_expressions.
Going file-by-file, the system finds that there are both Apache and MIT identifiable licenses, and ANDs them together by default. As seen here.

On to the decision-making part.

  • Is ANDing together the different licenses found in files a good thing to do by default?
  • Even if it is, do we want to use the pre-summarized information returned in the content.summary of Scancode? (Count me among the ayes, since the tool looks like it is doing some very interesting magic to get to the proper connector.) Then again, if we decide to use it, should we use the most cited license type (the one with the greatest number of matching files)?

For reference, Scancode gives the following summarized counts and values for the package mentioned (the winner, with 117, is the combination it suggests we take into account):

[{"value":null,"count":159},{"value":"mit OR apache-2.0","count":117},{"value":"public-domain","count":16},{"value":"apache-2.0 OR mit","count":5},{"value":"apache-2.0","count":1},{"value":"mit","count":1},{"value":"mit-synopsys","count":1},{"value":"unknown","count":1}]

Still, how dangerous is it to decide by file count? What if several license types have the same file count?
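
To make the file-count question concrete, here is a small sketch that picks the most cited expression and refuses to choose when there is a tie. The function name `pickDominantExpression` and its return shape are hypothetical; the input is the summary list quoted above.

```javascript
// Hypothetical helper: pick the expression with the highest file count.
// Entries with a null value (files with no detected license) are ignored.
// On a tie, value is null and all tied expressions are listed in candidates.
function pickDominantExpression(counts) {
  const detected = counts.filter(entry => entry.value !== null);
  if (detected.length === 0) return { value: null, candidates: [] };
  const highest = Math.max(...detected.map(entry => entry.count));
  const candidates = detected
    .filter(entry => entry.count === highest)
    .map(entry => entry.value);
  return { value: candidates.length === 1 ? candidates[0] : null, candidates };
}

const summaryCounts = [
  { value: null, count: 159 },
  { value: 'mit OR apache-2.0', count: 117 },
  { value: 'public-domain', count: 16 },
  { value: 'apache-2.0 OR mit', count: 5 },
  { value: 'apache-2.0', count: 1 },
  { value: 'mit', count: 1 },
  { value: 'mit-synopsys', count: 1 },
  { value: 'unknown', count: 1 }
];

console.log(pickDominantExpression(summaryCounts).value); // mit OR apache-2.0
```

Note that "mit OR apache-2.0" and "apache-2.0 OR mit" are counted separately here even though OR is commutative, so a real implementation would probably want to normalize expressions before counting.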

All comments appreciated.

@geneh
Contributor

geneh commented Oct 25, 2019

The component's license is MIT OR Apache-2.0 and not MIT AND Apache-2.0 as initially reported for https://clearlydefined.io/definitions/crate/cratesio/-/regex/1.0.6
It is, however, MIT AND Apache-2.0 for https://clearlydefined.io/definitions/git/github/rust-lang/regex/18a71d0a30a6dcdcd86d1af6dd9cb0688b89f2ee.
scancode 3.2.2 was run for both of the components.
Any idea why the declared license is different?
As to whether it is dangerous or not, we would need to analyze tens or hundreds of sample packages to be sure. We probably do not want to invest time in that right now.

@ignacionr
Member

@geneh I am looking at the cratesio example you gave, but the harvested info is a lot more complex and, for the older version, also includes runs from earlier versions of ScanCode, which may be the explanation (though I will debug this to make sure that is the case).

@ignacionr
Member

@geneh what would you say if we removed the original data and recrawled? We would lose the existing information from ScanCode, but I think that is exactly what is wrong. Just an idea.

@geneh
Contributor

geneh commented Nov 27, 2019

@ignacionr Do you mean recrawl all the components? Why would the outcome be any different if the scancode version is the same? Do you mean we should recompute definitions for the affected scancode results?

@tmarble
Contributor

tmarble commented Dec 2, 2019

I just ran scancode on the most recent version of regex and saved the result as a gist. Please note that scancode has different results for...

  • Cargo.toml = "license_expressions": ["mit OR apache-2.0" ]
  • README.md = "license_expressions": ["mit OR apache-2.0", "unicode"]

Compare this to the code snippet above from scancode.js, which calls _joinExpressions (a function that always combines with AND) for all the full license files. We would therefore expect the result to be "mit AND apache-2.0 AND unicode", but it is only "mit AND apache-2.0" because the Unicode license is only referenced in the README (the full Unicode license text is not in the top-level directory, but rather in regex-syntax/src/unicode_tables/LICENSE-UNICODE).

Because the resulting package "regex" contains a compilation of Unicode data and Rust source code, the likely correct SPDX expression for the ensemble is "(mit OR apache-2.0) AND unicode". NOTE: the parentheses are required because otherwise the AND takes precedence, per 4) Order of Precedence and Parentheses in the SPDX spec.
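
The precedence point can be illustrated with a small recursive-descent sketch that makes every grouping explicit. This is not the summarizer's code: the function names are hypothetical, only AND, OR and parentheses are handled (WITH is left out), and AND is assumed to bind tighter than OR as the SPDX spec describes.

```javascript
// Parse an expression into a tree; OR binds loosest, AND binds tighter.
function parseExpression(expression) {
  const tokens = expression.match(/[()]|[^\s()]+/g) || [];
  let position = 0;
  const peek = () => (tokens[position] || '').toUpperCase();

  function parseOr() {
    let node = parseAnd();
    while (peek() === 'OR') {
      position++;
      node = { op: 'OR', left: node, right: parseAnd() };
    }
    return node;
  }

  function parseAnd() {
    let node = parseAtom();
    while (peek() === 'AND') {
      position++;
      node = { op: 'AND', left: node, right: parseAtom() };
    }
    return node;
  }

  function parseAtom() {
    const token = tokens[position++];
    if (token === undefined) throw new Error('unexpected end of expression');
    if (token === '(') {
      const node = parseOr();
      if (tokens[position++] !== ')') throw new Error('missing closing parenthesis');
      return node;
    }
    if (token === ')' || ['AND', 'OR'].includes(token.toUpperCase())) {
      throw new Error(`unexpected token: ${token}`);
    }
    return { key: token };
  }

  const tree = parseOr();
  if (position !== tokens.length) throw new Error(`unexpected token: ${tokens[position]}`);
  return tree;
}

// Render the tree with parentheses around every operator.
function explicitGrouping(node) {
  if (node.key) return node.key;
  return `(${explicitGrouping(node.left)} ${node.op} ${explicitGrouping(node.right)})`;
}

console.log(explicitGrouping(parseExpression('mit OR apache-2.0 AND unicode')));
// (mit OR (apache-2.0 AND unicode))
console.log(explicitGrouping(parseExpression('(mit OR apache-2.0) AND unicode')));
// ((mit OR apache-2.0) AND unicode)
```

Without the parentheses, the expression reads as a choice between MIT alone and Apache-2.0 together with Unicode, which is not what the package contents suggest.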

So here are the questions:

  1. Should we rely on the Scancode licenses for Cargo.toml or README.md?
  2. Should we rely on the discovered full license text(s) in the top-level directory?
  3. Which takes priority between 1) and 2)?
  4. Should the answer apply to just Cargo (Rust) or to other package types as well?

@fossygirl
Member

@jeffmcaffer @pombredanne @iamwillbar Would love to hear your opinions on this one.
