v31.0.1 #3055
pombredanne announced in Announcements
This is a major release with important bug and security fixes, new and improved
features and API changes.
Note that we no longer support Python 3.6. Use Python 3.7+ instead.
Important API changes:
The data structure of the JSON output has changed for copyrights, authors
and holders. We now use a proper name for attributes and not a generic "value".
The data structure of the JSON output has changed for licenses. We now
return match details once for each matched license expression rather than
once for each license in a matched expression. There is a new top-level
"license_references" attribute that contains the data details for each
detected license only once. This data can contain the reference license text
as an option.
The data structure of the JSON output has changed for packages. We now
return "package_data" package information at the manifest file-level
rather than "packages". This has all the data attributes of a "package_data"
field plus others: "package_uuid", "package_data_files" and "files".
There is a new top-level "packages" attribute that contains package
instances that can aggregate data from multiple manifests.
There is a new top-level "dependencies" attribute that contains each
dependency instance; these can be standalone or related to a package.
These contain a new "extra_data" object.
There is a new resource-level attribute "for_packages" which refers to
packages through package_uuids (pURL + uuid string).
The data structure for HTML output has been changed to include emails and
urls under the "infos" object. The HTML template displays output for holders,
authors, emails, and urls into separate tables like "licenses" and "copyrights".
The data structure for CSV output has been changed to rename the Resource
column to "path". The "copyright_holder" column has been renamed to "holder".
The license clarity scoring plugin has been overhauled to show new license
clarity criteria. More details of the new scoring criteria are provided below.
The functionality of the summary plugin has been improved to provide declared
origin and license information for the codebase being scanned. The previous
summary plugin functionality has been preserved in the new `tallies` plugin.
More details are provided below.
ScanCode has adopted the new code skeleton from https://github.com/nexB/skeleton
The key change is the location of the virtual environment. It used to be
created at the root of the scancode-toolkit directory. It is now created
under the `venv` subdirectory. You must be aware of this if you use ScanCode
from a git clone.
`DatafileHandler.assemble()`, `DatafileHandler.assemble_from_many()`, and the
other `.assemble()` methods from the other Package handlers from packagedcode
have been updated to yield Package items before Dependency or Resource items.
This is particularly important when calling the `assemble()` method outside of
the scancode-toolkit context, where we need to ensure that a Package exists
before we associate a Resource or Dependency to it.
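The ordering guarantee matters to any caller that consumes these generators directly. Here is a minimal sketch of such a consumer, assuming simplified stand-in classes: `Package`, `Dependency`, `consume` and `fake_assemble` are hypothetical names for illustration and are not the packagedcode classes or API.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    # Hypothetical stand-in for a packagedcode Package.
    package_uid: str
    dependencies: list = field(default_factory=list)


@dataclass
class Dependency:
    # Hypothetical stand-in for a packagedcode Dependency.
    purl: str
    for_package_uid: str


def consume(items):
    """Index packages and attach dependencies in a single pass.

    A single pass is enough only because packages are yielded before
    the items that refer to them.
    """
    packages = {}
    for item in items:
        if isinstance(item, Package):
            packages[item.package_uid] = item
        elif isinstance(item, Dependency):
            # A KeyError here would mean a dependency arrived too early.
            packages[item.for_package_uid].dependencies.append(item)
    return packages


def fake_assemble():
    # Mimics the documented order: Package first, then Dependency.
    uid = "pkg:pypi/demo@1.0?uuid=1234"
    yield Package(package_uid=uid)
    yield Dependency(purl="pkg:pypi/requests", for_package_uid=uid)


result = consume(fake_assemble())
```

If the order were reversed, the consumer would need to buffer dependencies until their package showed up.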
Copyright detection:
The JSON output now uses consistently named attributes as opposed to plain values.
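For consumers of older scans, the rename can be handled with a small conversion step. This is a sketch under assumptions: the attribute names `copyright`, `holder` and `author` are guesses at the "proper names" mentioned above, and `upgrade_detections` is a hypothetical helper, not part of ScanCode.

```python
# Assumed attribute name for each section; the notes only say that a
# "proper name" replaces the generic "value" key.
NAMED_KEYS = {"copyrights": "copyright", "holders": "holder", "authors": "author"}


def upgrade_detections(resource):
    """Return a copy of a file-level mapping with "value" keys renamed."""
    upgraded = dict(resource)
    for section, key in NAMED_KEYS.items():
        upgraded[section] = [
            {(key if name == "value" else name): val for name, val in entry.items()}
            for entry in resource.get(section, [])
        ]
    return upgraded


old = {
    "path": "setup.py",
    "holders": [{"value": "Example Inc.", "start_line": 1, "end_line": 1}],
}
new = upgrade_detections(old)
```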
License detection:
There have been significant updates to license detection rules and licenses:
The rule attribute "only_known_words" has been renamed to "is_continuous" and its
meaning has been updated and expanded. A rule tagged as "is_continuous" can only
be matched if there are no gaps between matched words, be they stopwords, extra
unknown or known words. This improves several false positive license detections.
The processing for "is_continuous" has been merged into the "key phrases"
processing below.
Key phrases can now be defined in a RULE text by surrounding one or more words
with double curly braces `{{` and `}}`. When defined, a RULE will only match
when the key phrases match exactly. When all the text of a rule is a "key phrase",
this is the same as being "is_continuous".
The "--unknown-licenses" option now also detects unknown licenses using a
simple and effective ngrams-based matching in areas that are not matched or
weakly matched. This helps detect things that look like a license but are not
yet known as licenses.
False positive detection of "license lists" like the lists seen in license and
package management tools has been entirely reworked. Rather than using
thousands of small false positive rules, there is a new filter to detect a
long run of license references and tags that is typical of license lists.
As a result, thousands of rules have been replaced by a simpler filter, and
the license detection is more accurate, faster and has fewer false
positives.
The new license flag "is_generic" tags licenses that are "generic" licenses
such as "other-permissive" or "other-copyleft". This is not yet
returned in the JSON API.
When scanning binary files, the detection of single word rules is filtered when
surrounded by gibberish or mixed case. For instance `$#%$GpL$` is a false
positive and is no longer reported.
Several rules were tagged as is_license_notice incorrectly but were references
and have been requalified as is_license_reference. All rules made of a single
word have been requalified as is_license_reference if they were not qualified
this way.
Matches to small license rules (with small defined as under 15 words)
that are scattered over too many lines are now filtered as false matches.
Small, two-word matches that overlap the previous or next match
by the word "license" and assimilated are now filtered as false matches.
The new --licenses-reference option adds a new "licenses_reference" top-level
attribute to a scan when using the JSON and YAML outputs. This contains
all the details and the full text of every license seen in a file or
package license expression of a scan. This can be added after the fact
using the --from-json option.
New experimental support for non-English licenses. Use the command
./scancode --reindex-licenses-for-all-languages to index all known non-English
licenses and rules. From that point on, they will be detected. Because of this
some licenses that were not tagged with their languages are now correctly
tagged and they may not be detected unless you activate this new indexing
feature.
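The key phrase markup described earlier in this section can be illustrated with a short sketch. It extracts the phrases wrapped in `{{` and `}}` from a rule text and checks that each one appears verbatim in a candidate text. This is a simplified substring check written for illustration, not ScanCode's token-based matcher; `key_phrases` and `may_match` are hypothetical helper names.

```python
import re

# Non-greedy match of anything between double curly braces.
KEY_PHRASE = re.compile(r"\{\{(.+?)\}\}", re.DOTALL)


def key_phrases(rule_text):
    """Return the phrases wrapped in double curly braces, whitespace-normalized."""
    return [" ".join(m.split()) for m in KEY_PHRASE.findall(rule_text)]


def may_match(rule_text, candidate_text):
    """Rough check: every key phrase must appear verbatim in the candidate."""
    normalized = " ".join(candidate_text.lower().split())
    return all(p.lower() in normalized for p in key_phrases(rule_text))


rule = "Licensed under the {{Apache License}} version 2.0"
phrases = key_phrases(rule)
exact = may_match(rule, "licensed under the apache  license")
gapped = may_match(rule, "Licensed under the Apache Software License")
```

The second candidate fails because an extra word sits inside the key phrase, which is the kind of gap the notes say a key phrase does not tolerate.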
Package detection:
Major changes in package detection and reporting: a codebase-level attribute
`packages` with one or more `package_data` and files for the packages is
reported. The specific changes made are:
The resource level attribute `packages` has been renamed to `package_data`,
as these are really package data that are being detected, such as manifests,
lockfiles or other package data. This has the data attributes of a
`package_data` field plus others: `package_uuid`, `package_data_files`
and `files`.
A new top-level attribute `packages` has been added which contains package
instances created from `package_data` detected in the codebase.
A new codebase level attribute `dependencies` has been added which contains
dependency instances created from lockfiles detected in the codebase.
The package attribute `root_path` has been deleted from `package_data` in favour
of the new format where there is no root conceptually, just a list of files for each
package.
There is a new resource-level attribute `for_packages` which refers to
packages through package_uids (pURL + uuid string). A `package_adder`
function is now used to associate a Package to a Resource that is part of
it. This gives us the flexibility to use the packagedcode Package handlers
in other contexts where `for_packages` on Resource is not implemented in the
same way as scancode-toolkit.
The package_data attribute `dependencies` (which is a list of DependentPackages)
now has a new attribute `resolved_package` with a package data mapping.
Also the `requirement` attribute is renamed to `extracted_requirement`.
There is a new `extra_data` to collect extra data as needed.
For Pypi packages, python_requires is treated as a package dependency.
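To show how the resource-level `for_packages` attribute ties files back to the top-level `packages`, here is a sketch that groups file paths by package identifier. The sample scan is hand-written for illustration, and the identifier key name is an assumption, since these notes spell it both "package_uuid" and "package_uid". `files_by_package` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed key name for the package identifier (see the lead-in above).
UID_KEY = "package_uid"


def files_by_package(scan):
    """Map each top-level package identifier to the paths that reference it."""
    grouped = defaultdict(list)
    for resource in scan.get("files", []):
        for uid in resource.get("for_packages", []):
            grouped[uid].append(resource["path"])
    known = {pkg[UID_KEY] for pkg in scan.get("packages", [])}
    # Keep only identifiers that match a reported package.
    return {uid: sorted(paths) for uid, paths in grouped.items() if uid in known}


# Hand-written sample data, not real ScanCode output.
scan = {
    "packages": [{UID_KEY: "pkg:npm/demo@1.0.0?uuid=abc", "name": "demo"}],
    "dependencies": [],
    "files": [
        {"path": "demo/package.json", "for_packages": ["pkg:npm/demo@1.0.0?uuid=abc"]},
        {"path": "demo/index.js", "for_packages": ["pkg:npm/demo@1.0.0?uuid=abc"]},
        {"path": "README", "for_packages": []},
    ],
}
grouped = files_by_package(scan)
```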
License Clarity Scoring Update:
We are moving away from the original license clarity scoring designed for
ClearlyDefined in the license clarity score plugin. The previous license
clarity scoring logic produced a score that was misleading when it would
return a low score due to the stringent scoring criteria. We are now using
more general criteria to get a sense of what provenance information has been
provided and whether or not there is a conflict in licensing between what
licenses were declared at the top-level key files and what licenses have been
detected in the files under the top-level.
The license clarity score is a value from 0-100 calculated by combining the
weighted values determined for each of the scoring elements:
Declared license:
top-level or well-known locations in the software project, typically in a
package manifest, NOTICE, LICENSE, COPYING or README file.
Identification precision:
licenses that can be designated by precise keys (identifiers) as provided in
a publicly available license list, such as the ScanCode LicenseDB, the SPDX
license list, the OSI license list, or a URL pointing to a specific license
text in a project or organization website.
License texts:
files such as a package manifest, NOTICE, LICENSE, COPYING or README.
Declared copyright:
top-level or well-known locations in the software project, typically in a
package manifest, NOTICE, LICENSE, COPYING or README file.
Ambiguous compound licensing:
makes it difficult to construct a reliable license expression, such as in
the case of multiple licenses where the conjunctive versus disjunctive
relationship is not well defined.
Conflicting license categories:
is in the permissive category, but that other potentially conflicting
categories, such as copyleft and proprietary, have been detected in lower
level code.
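The combination of weighted elements into a 0-100 score can be sketched as follows. The weights are illustrative placeholders and are not taken from this section; the key names simply mirror the scoring elements listed above, and `clarity_score` is a hypothetical helper, not the plugin's code.

```python
# Placeholder weights for illustration only; the actual weights are not
# given in this section. Negative weights model the two penalty elements.
WEIGHTS = {
    "declared_license": 40,
    "identification_precision": 40,
    "has_license_text": 10,
    "declared_copyrights": 10,
    "ambiguous_compound_licensing": -10,
    "conflicting_license_categories": -20,
}


def clarity_score(elements):
    """Sum the weights of the elements that are true, clamped to 0-100."""
    total = sum(weight for name, weight in WEIGHTS.items() if elements.get(name))
    return max(0, min(100, total))


full = clarity_score({
    "declared_license": True,
    "identification_precision": True,
    "has_license_text": True,
    "declared_copyrights": True,
})
conflicted = clarity_score({
    "declared_license": True,
    "conflicting_license_categories": True,
})
```

Clamping keeps the result in range even when only penalty elements are true.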
Summary Plugin Update:
The summary plugin's behavior has been changed. Previously, it provided a
count of the detected license expressions, copyrights, holders, authors, and
programming languages from a scan.
We have preserved this functionality by creating a new plugin called `tallies`.
All functionality of the previous summary plugin has been preserved in the
tallies plugin.
The new summary plugin now attempts to determine a declared license expression,
declared holder, and the primary programming language from a scan. And the
updated license clarity score provides context on the quality of the license
information provided in the codebase key files.
The new summary plugin also returns lists of tallies for the other "secondary"
detected license expressions, copyright holders, and programming languages.
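The difference between a tally and a primary value plus "other" tallies can be sketched with a counter. Picking the most frequent value as the primary one is a simplifying assumption made for this example; these notes do not describe how the plugin actually selects it. `tally` and `split_primary` are hypothetical helpers.

```python
from collections import Counter


def tally(values):
    """Count occurrences and return (value, count) pairs, most common first."""
    return Counter(v for v in values if v).most_common()


def split_primary(values):
    """Return the most frequent value and the tallies of the remaining ones."""
    counts = tally(values)
    if not counts:
        return None, []
    (primary, _), others = counts[0], counts[1:]
    return primary, [{"value": v, "count": c} for v, c in others]


# One detected language per scanned file; None means nothing was detected.
languages = ["Python", "Python", "C", None, "Python", "C", "Shell"]
primary, others = split_primary(languages)
```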
All summary information is provided at the codebase-level attribute named
`summary`.
Outputs:
Added new outputs for the CycloneDX format.
The CLI now exposes options to produce CycloneDX BOMs in either JSON or XML format.
A new field `warnings` has been added to the headers of ScanCode toolkit output
that contains any warning messages that occur during a scan.
The CSV output format --csv option is now deprecated. It will be replaced by
new CSV and tabular output formats in the next ScanCode release.
Visit RFC: Improve tabular output formats #3043 to provide inputs
and feedback.
Output version
Scancode Data Output Version is now 2.0.0.
Changes:
Rename resource level attribute `packages` to `package_data`.
Add top-level attribute `packages`.
Add top-level attribute `dependencies`.
Add resource-level attribute `for_packages`.
Remove `package-data` attribute `root_path`.
The fields of the license clarity scoring plugin have been replaced with the
following fields. An overview of the new fields can be found in the "License
Clarity Scoring Update" section above.
score
declared_license
identification_precision
has_license_text
declared_copyrights
conflicting_license_categories
ambiguous_compound_licensing
The fields of the summary plugin have been replaced with the following fields.
An overview of the new fields can be found in the "Summary Plugin Update"
section above.
declared_license_expression
license_clarity_score
declared_holder
primary_language
other_license_expressions
other_holders
other_languages
Documentation Update
correct minor documentation issues.
Development environment and Code API changes:
The main package API function `get_package_infos` is deprecated and
replaced by `get_package_data`.
The Resource paths are now always the same regardless of the strip-root or
full-root arguments.
The license cache consistency is not checked anymore when you are using a git
checkout. The SCANCODE_DEV_MODE tag file has been removed entirely. Use
instead the --reindex-licenses option to rebuild the license index.
We can now regenerate test fixtures using the new SCANCODE_REGEN_TEST_FIXTURES
environment variable. There is no need to replace the regen=False with
regen=True in the code.
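A test helper driven by this variable might look like the sketch below. Treating any non-empty value as "regenerate" is an assumption about the variable's semantics, and `regen_requested` and `check_against_fixture` are hypothetical helpers, not functions from the ScanCode test suite.

```python
import json
import os
import tempfile


def regen_requested(environ=os.environ):
    """True when the variable is set to a non-empty value (assumed semantics)."""
    return bool(environ.get("SCANCODE_REGEN_TEST_FIXTURES"))


def check_against_fixture(result, fixture_path, regen=None):
    """Compare result with the stored JSON fixture, rewriting it first if asked."""
    if regen is None:
        regen = regen_requested()
    if regen:
        with open(fixture_path, "w") as out:
            json.dump(result, out, indent=2)
    with open(fixture_path) as inp:
        expected = json.load(inp)
    assert result == expected


fixture = os.path.join(tempfile.mkdtemp(), "expected.json")
check_against_fixture({"files": []}, fixture, regen=True)   # writes the fixture
check_against_fixture({"files": []}, fixture, regen=False)  # compares against it
```

Reading the environment in one place means individual tests never need a `regen` flag edited by hand.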
Miscellaneous
`-A` or `--about`
`-q` or `--quiet`
`-v` or `--verbose`
`-V` or `--version`
can be used.
What's Changed
Report `packages` at top level with file level `package_manifests` by @AyanSinhaMahapatra in #2710
Add Package Instances #2691 by @AyanSinhaMahapatra in #2825
Do not fail without packages in cyclonedx #2987 by @AyanSinhaMahapatra in #3005
Clarify `unknown` license keys #2827 by @AyanSinhaMahapatra in #3023
Yield Packages before other yieldables #3028 by @pombredanne in #3031
Add package_adder argument to assemble() #3034 by @JonoYang in #3035
New Contributors
Full Changelog: v30.1.0...v31.0.1
This discussion was created from the release v31.0.1.