v31.0.1 #3055

pombredanne · 2022-08-18T06:36:09Z

pombredanne
Aug 18, 2022
Maintainer

This is a major release with important bug and security fixes, new and improved
features and API changes.

Note that we no longer support Python 3.6. Use Python 3.7+ instead.

Important API changes:

The data structure of the JSON output has changed for copyrights, authors
and holders. We now use a proper name for attributes and not a generic "value".
The data structure of the JSON output has changed for licenses. We now
return match details once for each matched license expression rather than
once for each license in a matched expression. There is a new top-level
"license_references" attribute that contains the data details for each
detected license only once. This data can contain the reference license text
as an option.
The data structure of the JSON output has changed for packages. We now
return "package_data" package information at the manifest file-level
rather than "packages". This has all the data attributes of a "package_data"
field plus others: "package_uuid", "package_data_files" and "files".
- There is a new top-level "packages" attribute that contains package
  instances that can aggregate data from multiple manifests.
- There is a new top-level "dependencies" attribute that contains each
  dependency instance. These can be standalone or related to a package.
  These contain a new "extra_data" object.
- There is a new resource-level attribute "for_packages" which refers to
  packages through package_uuids (pURL + uuid string).
The data structure for HTML output has been changed to include emails and
urls under the "infos" object. The HTML template displays output for holders,
authors, emails, and urls into separate tables like "licenses" and "copyrights".
The data structure for CSV output has been changed to rename the Resource
column to "path". "copyright_holder" has been renamed to "holder".
The license clarity scoring plugin has been overhauled to show new license
clarity criteria. More details of the new scoring criteria are provided below.
The functionality of the summary plugin has been improved to provide declared
origin and license information for the codebase being scanned. The previous
summary plugin functionality has been preserved in the new tallies plugin.
More details are provided below.
ScanCode has adopted the new code skeleton from https://github.com/nexB/skeleton
The key change is the location of the virtual environment. It used to be
created at the root of the scancode-toolkit directory. It is now created
under the venv subdirectory. You must be aware of this if you use ScanCode
from a git clone.
DatafileHandler.assemble(), DatafileHandler.assemble_from_many(), and
the other .assemble() methods from the other Package handlers in
packagedcode have been updated to yield Package items before Dependency or
Resource items. This is particularly important when calling the assemble()
method outside of the scancode-toolkit context, where we need to ensure that
a Package exists before we associate a Resource or Dependency with it.
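
The ordering guarantee matters to any caller that consumes assemble() output as a stream. The following is a minimal sketch of such a consumer. The Package and Dependency classes here are hypothetical stand-ins written for illustration, not the real packagedcode models, and the uid string is made up.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for illustration; not the packagedcode classes.
@dataclass
class Package:
    package_uid: str
    dependencies: list = field(default_factory=list)


@dataclass
class Dependency:
    for_package_uid: str
    purl: str


def collect(items):
    """Index Packages by uid, then attach each Dependency to its Package.

    Raises KeyError if a Dependency arrives before its Package, which is
    the situation the new yield order is meant to avoid.
    """
    packages = {}
    for item in items:
        if isinstance(item, Package):
            packages[item.package_uid] = item
        elif isinstance(item, Dependency):
            packages[item.for_package_uid].dependencies.append(item.purl)
    return packages


stream = [
    Package(package_uid="pkg:pypi/foo@1.0?uuid=1234"),
    Dependency(for_package_uid="pkg:pypi/foo@1.0?uuid=1234", purl="pkg:pypi/bar"),
]
result = collect(stream)
```

With the Package yielded first, a single pass is enough. Reversing the two items in the stream would raise a KeyError in this sketch.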

Copyright detection:

The data structure in the JSON is now using consistently named attributes as
opposed to plain values.
Several copyright detection bugs have been fixed.
French and German copyright detection is improved.
Some spurious trailing dots in holders are now stripped.
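
For consumers of older JSON scans, the attribute rename can be handled with a small adapter. This is a sketch under assumptions: it assumes the old entries used a "value" key and that the new attribute names are "copyright", "holder" and "author". Check both against an actual scan before relying on them. The sample data is made up.

```python
# Assumed mapping from each JSON list name to its new attribute name.
NAMED_KEYS = {"copyrights": "copyright", "holders": "holder", "authors": "author"}


def upgrade_entries(resource):
    """Return a copy of a resource mapping with "value" keys renamed."""
    upgraded = dict(resource)
    for list_name, key in NAMED_KEYS.items():
        upgraded[list_name] = [
            {(key if k == "value" else k): v for k, v in entry.items()}
            for entry in resource.get(list_name, [])
        ]
    return upgraded


# Hypothetical old-style resource entry.
old = {
    "path": "src/main.c",
    "holders": [{"value": "Example Corp", "start_line": 1, "end_line": 1}],
}
new = upgrade_entries(old)
```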

License detection:

There have been significant updates to licenses and license detection rules:
- 107 new licenses have been added (total is now 1954)
- 6780 new license detection rules have been added (total is now 32259)
- 6753 existing false positive license rules have been removed (see below).
- The SPDX license list has been updated to the latest v3.17
The rule attribute "only_known_words" has been renamed to "is_continuous" and its
meaning has been updated and expanded. A rule tagged as "is_continuous" can only
be matched if there are no gaps between matched words, be they stopwords, extra
unknown or known words. This improves several false positive license detections.
The processing for "is_continuous" has been merged into the "key phrases"
processing described below.
Key phrases can now be defined in a RULE text by surrounding one or more words
with double curly braces {{ and }}. When defined, a RULE will only match
when the key phrases match exactly. When all the text of a rule is a "key phrase",
this is the same as being "is_continuous".
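
To illustrate the markup only, here is a minimal sketch that extracts the spans between double curly braces from a rule text. It is not the ScanCode matcher. The helper names and the sample rule text are hypothetical.

```python
import re

# Non-greedy match of anything between {{ and }}.
KEY_PHRASE = re.compile(r"\{\{(.+?)\}\}", re.DOTALL)


def key_phrases(rule_text):
    """Return the key phrases of a rule text, with whitespace normalized."""
    return [" ".join(m.split()) for m in KEY_PHRASE.findall(rule_text)]


def is_fully_key_phrase(rule_text):
    """True when nothing but key phrases remains in the rule text."""
    remainder = KEY_PHRASE.sub("", rule_text)
    return bool(key_phrases(rule_text)) and not remainder.strip()


# Hypothetical rule text for illustration.
rule = "Licensed under the {{Apache License, Version 2.0}} (the License)"
phrases = key_phrases(rule)
```

In this sketch, a rule whose whole text sits inside one pair of braces is the case the notes describe as equivalent to "is_continuous".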
The "--unknown-licenses" option now also detects unknown licenses using a
simple and effective ngrams-based matching in areas that are not matched or
weakly matched. This helps detect things that look like a license but are not
yet known as licenses.
False positive detection of "license lists" like the lists seen in license and
package management tools has been entirely reworked. Rather than using
thousands of small false positive rules, there is a new filter to detect a
long run of license references and tags that is typical of license lists.
As a result, thousands of rules have been replaced by a simpler filter, and
the license detection is more accurate, faster and has fewer false
positives.
The new license flag "is_generic" tags licenses that are "generic" licenses
such as "other-permissive" or "other-copyleft". This is not yet
returned in the JSON API.
When scanning binary files, the detection of single word rules is filtered when
surrounded by gibberish or mixed case. For instance $#%$GpL$ is a false
positive and is no longer reported.
Several rules were tagged as is_license_notice incorrectly but were references
and have been requalified as is_license_reference. All rules made of a single
word have been requalified as is_license_reference if they were not qualified
this way.
Matches to small license rules (with small defined as under 15 words)
that are scattered over too many lines are now filtered as false matches.
Small, two-word matches that overlap the previous or next match by
the word "license" and similar words are now filtered as false matches.
The new --licenses-reference option adds a new "licenses_reference" top
level attribute to a scan when using the JSON and YAML outputs. This contains
all the details and the full text of every license seen in a file or
package license expression of a scan. This can be added after the fact
using the --from-json option.
New experimental support for non-English licenses. Use the command
./scancode --reindex-licenses-for-all-languages to index all known non-English
licenses and rules. From that point on, they will be detected. Because of this
some licenses that were not tagged with their languages are now correctly
tagged and they may not be detected unless you activate this new indexing
feature.

Package detection:

Package detection and reporting have changed significantly: a codebase-level
attribute packages now reports packages, each with one or more package_data
and the files for the package.
The specific changes made are:
- The resource level attribute packages has been renamed to package_data,
  as these are really package data that are being detected, such as manifests,
  lockfiles or other package data. This has the data attributes of a package_data
  field plus others: package_uuid, package_data_files and files.
- A new top-level attribute packages has been added which contains package
  instances created from package_data detected in the codebase.
- A new codebase level attribute dependencies has been added which contains dependency
  instances created from lockfiles detected in the codebase.
- The package attribute root_path has been deleted from package_data in favour
  of the new format where there is no root conceptually, just a list of files for each
  package.
- There is a new resource-level attribute for_packages which refers to
  packages through package_uids (pURL + uuid string). A package_adder
  function is now used to associate a Package to a Resource that is part of
  it. This gives us the flexibility to use the packagedcode Package handlers
  in other contexts where for_packages on Resource is not implemented in the
  same way as scancode-toolkit.
- The package_data attribute dependencies (which is a list of DependentPackages),
  now has a new attribute resolved_package with a package data mapping.
  Also the requirement attribute is renamed to extracted_requirement.
  There is a new extra_data to collect extra data as needed.
For Pypi packages, python_requires is treated as a package dependency.

License Clarity Scoring Update:

We are moving away from the original license clarity scoring designed for
ClearlyDefined in the license clarity score plugin. The previous license
clarity scoring logic produced a score that was misleading when it would
return a low score due to the stringent scoring criteria. We are now using
more general criteria to get a sense of what provenance information has been
provided and whether or not there is a conflict in licensing between what
licenses were declared at the top-level key files and what licenses have been
detected in the files under the top-level.
The license clarity score is a value from 0-100 calculated by combining the
weighted values determined for each of the scoring elements:
- Declared license:
  - When true, indicates that the software package licensing is documented at
    top-level or well-known locations in the software project, typically in a
    package manifest, NOTICE, LICENSE, COPYING or README file.
  - Scoring Weight = 40
- Identification precision:
  - Indicates how well the license statement(s) of the software identify known
    licenses that can be designated by precise keys (identifiers) as provided in
    a publicly available license list, such as the ScanCode LicenseDB, the SPDX
    license list, the OSI license list, or a URL pointing to a specific license
    text in a project or organization website.
  - Scoring Weight = 40
- License texts:
  - License texts are provided to support the declared license expression in
    files such as a package manifest, NOTICE, LICENSE, COPYING or README.
  - Scoring Weight = 10
- Declared copyright:
  - When true, indicates that the software package copyright is documented at
    top-level or well-known locations in the software project, typically in a
    package manifest, NOTICE, LICENSE, COPYING or README file.
  - Scoring Weight = 10
- Ambiguous compound licensing:
  - When true, indicates that the software has a license declaration that
    makes it difficult to construct a reliable license expression, such as in
    the case of multiple licenses where the conjunctive versus disjunctive
    relationship is not well defined.
  - Scoring Weight = -10
- Conflicting license categories:
  - When true, indicates that the declared license expression of the software
    is in the permissive category, but that other potentially conflicting
    categories, such as copyleft and proprietary, have been detected in lower
    level code.
  - Scoring Weight = -20
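
The weights above can be combined as in the following sketch. It assumes every scoring element is a boolean, that the weights are simply summed, and that the result is clamped to the 0-100 range. The plugin may combine them differently, so treat this as an illustration of the weights rather than the plugin's logic. The dictionary keys are chosen for illustration.

```python
# Weights taken from the list above; keys are illustrative names.
WEIGHTS = {
    "declared_license": 40,
    "identification_precision": 40,
    "has_license_text": 10,
    "declared_copyrights": 10,
    "ambiguous_compound_licensing": -10,
    "conflicting_license_categories": -20,
}


def clarity_score(elements):
    """Sum the weights of the elements that are true, clamped to 0-100."""
    total = sum(weight for name, weight in WEIGHTS.items() if elements.get(name))
    return max(0, min(100, total))


# All four positive elements hold, but conflicting categories were found.
score = clarity_score({
    "declared_license": True,
    "identification_precision": True,
    "has_license_text": True,
    "declared_copyrights": True,
    "conflicting_license_categories": True,
})
```

Under these assumptions the example yields 100 - 20 = 80.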

Summary Plugin Update:

The summary plugin's behavior has been changed. Previously, it provided a
count of the detected license expressions, copyrights, holders, authors, and
programming languages from a scan.

We have preserved this functionality by creating a new plugin called tallies.
All functionality of the previous summary plugin has been preserved in the
tallies plugin.
The new summary plugin now attempts to determine a declared license expression,
declared holder, and the primary programming language from a scan. The
updated license clarity score provides context on the quality of the license
information provided in the codebase key files.
The new summary plugin also returns lists of tallies for the other "secondary"
detected license expressions, copyright holders, and programming languages.

All summary information is provided at the codebase-level attribute named summary.
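
A consumer might read the codebase-level summary as sketched below. The field names follow the "Output version" section of these notes. The nesting of the score under license_clarity_score and all sample values are assumptions made for illustration.

```python
import json

# Hypothetical, trimmed scan output.
SAMPLE = """
{
  "summary": {
    "declared_license_expression": "apache-2.0",
    "license_clarity_score": {"score": 90},
    "declared_holder": "Example Corp",
    "primary_language": "Python"
  }
}
"""


def describe_summary(scan_json):
    """Pick the declared origin and license fields out of a scan."""
    summary = json.loads(scan_json).get("summary") or {}
    return {
        "license": summary.get("declared_license_expression"),
        "holder": summary.get("declared_holder"),
        "language": summary.get("primary_language"),
        "score": (summary.get("license_clarity_score") or {}).get("score"),
    }


info = describe_summary(SAMPLE)
```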

Outputs:

Added new outputs for the CycloneDx format.
The CLI now exposes options to produce CycloneDx BOMs in either JSON or XML format.
A new field warnings has been added to the headers of ScanCode toolkit output
that contains any warning messages that occur during a scan.
The CSV output format --csv option is now deprecated. It will be replaced by
new CSV and tabular output formats in the next ScanCode release.
See "RFC: Improve tabular output formats" (#3043) to provide input
and feedback.

Output version

Scancode Data Output Version is now 2.0.0.

Changes:

Rename resource level attribute packages to package_data.
Add top-level attribute packages.
Add top-level attribute dependencies.
Add resource-level attribute for_packages.
Remove package-data attribute root_path.
The fields of the license clarity scoring plugin have been replaced with the
following fields. An overview of the new fields can be found in the "License
Clarity Scoring Update" section above.
- score
- declared_license
- identification_precision
- has_license_text
- declared_copyrights
- conflicting_license_categories
- ambiguous_compound_licensing
The fields of the summary plugin have been replaced with the following fields.
An overview of the new fields can be found in the "Summary Plugin Update"
section above.
- declared_license_expression
- license_clarity_score
- declared_holder
- primary_language
- other_license_expressions
- other_holders
- other_languages

Documentation Update

Various documentation files have been updated to reflect API changes and
correct minor documentation issues.

Development environment and Code API changes:

The main package API function get_package_infos is deprecated, and
replaced by get_package_data.
Resource paths are always the same regardless of the strip-root or
full-root arguments.
The license cache consistency is no longer checked when you are using a git
checkout. The SCANCODE_DEV_MODE tag file has been removed entirely. Use
the --reindex-licenses option instead to rebuild the license index.
We can now regenerate test fixtures using the new SCANCODE_REGEN_TEST_FIXTURES
environment variable. There is no need to replace the regen=False with
regen=True in the code.
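
The general pattern of driving fixture regeneration from the environment can be sketched as follows. The helper name and the set of accepted values are assumptions. How the ScanCode test utilities actually interpret the variable is not stated in these notes.

```python
import os


def should_regen(environ=None):
    """Return True when fixture regeneration is requested via the environment.

    The accepted values ("1", "true", "yes") are an assumption for this sketch.
    """
    if environ is None:
        environ = os.environ
    value = environ.get("SCANCODE_REGEN_TEST_FIXTURES", "")
    return value.strip().lower() in ("1", "true", "yes")


# A test helper could then default its regen argument to this flag
# instead of requiring an edit from regen=False to regen=True.
regen = should_regen({"SCANCODE_REGEN_TEST_FIXTURES": "yes"})
```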

Miscellaneous

Added support for shortcut flags:
- -A or --about
- -q or --quiet
- -v or --verbose
- -V or --version

What's Changed

Report packages at top level with file level package_manifests by @AyanSinhaMahapatra in #2710
Updated install.rst by @beastrun12j in #2722
Omnibus fall license improvements by @pombredanne in #2706
Improve license detection by @pombredanne in #2737
api.get_licenses: clarify and improve docstring for "min_score" argument by @zacchiro in #2763
rules with "unqualified" license names are references, not notices by @petergardfjall in #2759
Fix invalid license yaml files by resolving duplicated keys by @fangxlmr in #2776
Fix azure pipeline vmimage deprecations by @AyanSinhaMahapatra in #2775
Allow license rules to require the presence of certain defining keywords by @mrombout in #2773
Add first draft ROADMAP by @pombredanne in #2736
Add CycloneDx output option by @agschrei in #2698
Remove regular expression futurewarning by @soimkim in #2788
fix docstring in debian_copyright.py by @adii21-Ux in #2786
fixes missing whitespace in prerequisites list by @altsalt in #2778
Add PackageManifest Class by @AyanSinhaMahapatra in #2748
Add new licenses and new detection rules by @pombredanne in #2765
Rename first column of csv output to "path" by @JRavi2 in #2016
Detect unknown licenses #1675 by @akugarg in #2592
Improve copyright handling #2350 by @pombredanne in #2791
Fixing OSI identifier for BSD-3-Clause; see also SPDX license metadata by @karsten-klein in #2797
Fix GPL license detection false positive #2793 by @KevinJi22 in #2799
2789 inconsistent doc html app by @kunalchhabra37 in #2795
Fixed inconsistency in --html-app FILE in cli-reference by @maynaS in #2790
Replace freenode references with libera chat by @purna135 in #2816
Adopt nexB/skeleton and bump dependencies by @pombredanne in #2818
Fix bug recognizing license as license_notice instead of license_text by @adii21-Ux in #2817
Fix incorrect license detection #2777 by @KevinJi22 in #2811
Remove skeleton from docs by @AyanSinhaMahapatra in #2830
Detect SPDX-FileContributor tags as authors by @pombredanne in #2838
New license and copyright rule by @adii21-Ux in #2837
Add key phrase tags to GPL detection rule by @pombredanne in #2821
Make --version output valid YAML for parsing #2856 by @KevinJi22 in #2858
Add Direct Note for Windows Users (New Comers) by @OsmiumOP in #2857
Fixed Typo in Documentation by @OsmiumOP in #2862
Remove version check locally by @adii21-Ux in #2860
License improvement winter 2022 by @pombredanne in #2828
Update link to documentation by @AyanSinhaMahapatra in #2867
Improve license detection by @pombredanne in #2871
Detect dependencies from build.gradle files by @pombredanne in #2822
Fix small typo inside notes snippet by @Harshil-Jani in #2829
Add Package Instances #2691 by @AyanSinhaMahapatra in #2825
Improve license clarity scoring by @pombredanne in #2875
Do not raise exception on package data mismatch #2886 by @AyanSinhaMahapatra in #2887
Release 31 by @pombredanne in #2888
Add primary license in summary by @JonoYang in #2884
Remove usage of get_terminal_size in click by @AyanSinhaMahapatra in #2916
Fix doc builds by @AyanSinhaMahapatra in #2896
Update summary plugin by @JonoYang in #2914
Shorten long file names by @pombredanne in #2918
Added new copyright test cases by @abhishak3 in #2891
Add system packages support in the new packages model by @AyanSinhaMahapatra in #2909
Fix typo in summary: ambigous->ambiguous by @pombredanne in #2922
Add system environment to scan headers by @pombredanne in #2923
Update METADATA.bzl parser by @JonoYang in #2924
Spring 2022 license updates by @pombredanne in #2921
Process single package data file correctly by @pombredanne in #2933
Fix package/dependency creation bugs by @AyanSinhaMahapatra in #2932
Populate for packages field correctly #2929 by @JonoYang in #2939
Prepare Release 31b4 by @pombredanne in #2941
Duplicated dependencies package results by @JonoYang in #2944
Prepare Release 31b4 by @pombredanne in #2947
Add link to scancode-toolkit-reference-scans by @AyanSinhaMahapatra in #2952
Modify pypi PKG-INFO parse by @AyanSinhaMahapatra in #2953
Prepare Release 31.b5 by @pombredanne in #2962
Add black and isort as testing dependencies #2969 by @johnmhoran in #2970
Rename precise_license_detection field #2967 by @JonoYang in #2968
Convert package data dict to PackageData #2971 by @JonoYang in #2973
Update extractcode --shallow option description by @lf32 in #2959
Support shortcut flags for cli by @lf32 in #2951
Consider only copyrights in summry #2972 by @JonoYang in #2974
Reimplement get installed packages by @JonoYang in #2988
Report extracted_requirement correctly by @TG1999 in #2984
Improve packagecode and other release prep by @pombredanne in #2992
Improve npm package processing by @pombredanne in #2997
Update license detection by @pombredanne in #2998
Add new license rules and license - Early summer 2022 by @pombredanne in #2999
Bump version to 31.0.0rc2 by @JonoYang in #3000
Do not fail without packages in cyclonedx #2987 by @AyanSinhaMahapatra in #3005
Fix relaunching scancode on Apple silicon using Rosetta 2 emulation #2835 by @MarcelBochtler in #3018
Clarify unknown license keys #2827 by @AyanSinhaMahapatra in #3023
Yield Packages before other yieldables #3028 by @pombredanne in #3031
Prepare Release 31.0.0rc3 by @pombredanne in #3029
Release 31 rc4 prep by @pombredanne in #3036
Add package_adder argument to assemble() #3034 by @JonoYang in #3035
Report proprietary license if key phrase #3039 by @pombredanne in #3041
Improve release scripts #3040 by @pombredanne in #3046
Update DatafileHandler default methods by @JonoYang in #3042
Prepare release 31 by @pombredanne in #3053

New Contributors

@beastrun12j made their first contribution in #2722
@zacchiro made their first contribution in #2763
@fangxlmr made their first contribution in #2776
@mrombout made their first contribution in #2773
@agschrei made their first contribution in #2698
@soimkim made their first contribution in #2788
@adii21-Ux made their first contribution in #2786
@altsalt made their first contribution in #2778
@karsten-klein made their first contribution in #2797
@KevinJi22 made their first contribution in #2799
@kunalchhabra37 made their first contribution in #2795
@maynaS made their first contribution in #2790
@purna135 made their first contribution in #2816
@OsmiumOP made their first contribution in #2857
@Harshil-Jani made their first contribution in #2829
@abhishak3 made their first contribution in #2891
@lf32 made their first contribution in #2959
@MarcelBochtler made their first contribution in #3018

Full Changelog: v30.1.0...v31.0.1

This discussion was created from the release v31.0.1.
