SWAN Overview
SWAN uses both source code and doc comments to extract features and classify Java methods into security-relevant method (SRM) and Common Weakness Enumeration (CWE) classes. For this task, SWAN uses several components to obtain the required JAR files, process them, and perform machine learning.
Jeka obtains the binary JAR files as well as the equivalent source JAR files that contain the doc comments. Soot then processes the binary JAR files to extract the information needed by the source-code features. The source JAR files are processed using two doclets: doc-coverage-doclet, which calculates the documentation coverage, and doc-xml-exporter-doclet, which extracts doc comments from the source code and exports them to XML files. The exported doc comments are processed using Natural Language Processing (NLP) tools, either Stanford CoreNLP or DeepLearning4J.
The information extracted using Soot and the NLP tools is used to create the feature sets on which the machine learning models are trained. The selected models can then be used to predict the classifications of methods in a provided test set.
The following sections describe the training dataset, the feature engineering approach, model selection, and the evaluation on the test set.
SWAN’s dataset consists of labelled Java methods extracted from popular Java frameworks, stored in a JSON file (swan-dataset.json). Each JSON object captures the following information: method signature (fully qualified name, return type, and parameters), method classification (SRM and CWE), Javadoc comments, the framework the method was taken from, a link to the software documentation, and notes describing the method. Below is the JSON object for one of the methods in the training dataset, String StrSubstitutor.replace(char[]):
```json
{
  "dataOut": {
    "parameters": [],
    "return": true
  },
  "link": "https:\/\/commons.apache.org\/proper\/commons-lang\/javadocs\/api-3.1\/org\/apache\/commons\/lang3\/text\/StrSubstitutor.html",
  "type": [
    "sink",
    "sanitizer"
  ],
  "cwe": [
    "CWE089",
    "CWE078",
    "CWE306",
    "CWE079",
    "CWE601"
  ],
  "javadoc": {
    "method": "Replaces all the occurrences of variables with their matching values from the resolver using the given source array as a template.",
    "class": "Substitutes variables within a string by values."
  },
  "framework": "apache",
  "discovery": "manual",
  "name": "org.apache.commons.lang3.text.StrSubstitutor.replace",
  "jar": null,
  "comment": "",
  "parameters": [
    "char[]"
  ],
  "return": "java.lang.String",
  "dataIn": {
    "parameters": [
      0
    ],
    "return": false
  }
}
```
SWAN offers two main categories of features, which can be used individually or in combination: features based on source code and features based on software documentation (doc comments). Both categories are described below.
SWAN uses Soot, a framework that analyses and transforms Java bytecode and source code. Method and class properties such as the method name, modifiers, return type, parameter list, and data-flow information are extracted with Soot and used to populate binary features. These features evaluate properties of the methods such as method and class names, access control, parameter types, return type, and data-flow. The results of evaluating the features are added to a feature vector, which is then used for training and validating the model.
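To illustrate, the sketch below shows how such binary features might be evaluated over a Soot `SootMethod`; the class name, feature methods, and keyword list are hypothetical and not taken from SWAN's actual feature set.

```java
import soot.SootMethod;
import soot.Type;

// A minimal sketch of binary source-code features evaluated with Soot.
// The feature names and keywords are illustrative assumptions.
public class MethodNameFeature {

    // Hypothetical keywords that hint at a "source"-like method.
    private static final String[] KEYWORDS = {"get", "read", "load", "request"};

    /** Name-based feature: the method name starts with one of the keywords. */
    public boolean nameMatchesKeyword(SootMethod method) {
        String name = method.getName().toLowerCase();
        for (String keyword : KEYWORDS) {
            if (name.startsWith(keyword)) {
                return true;
            }
        }
        return false;
    }

    /** Structural feature: a public method that returns a String. */
    public boolean isPublicStringReturner(SootMethod method) {
        Type returnType = method.getReturnType();
        return method.isPublic() && "java.lang.String".equals(returnType.toString());
    }
}
```

Each such feature contributes one binary entry to the method's feature vector.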
The second category of features is based on information extracted from the doc comments using Natural Language Processing. The software documentation features are further divided into two sub-categories: features based on manually extracted information and features created automatically using word embeddings.
Exported doc comments contain block and inline tags as well as HTML characters. Before this information is removed, regular expressions are used to count the occurrences of the code, link, deprecated, and see tags, as sketched below.
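The following sketch illustrates this kind of tag counting with a regular expression; the pattern and the helper class are assumptions for illustration, not SWAN's actual expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch of counting Javadoc tags in a raw doc comment.
public class TagCounter {

    // Matches both inline tags ({@code ...}, {@link ...}) and block tags
    // (@deprecated, @see) by their tag name. Illustrative pattern only.
    private static final Pattern TAG =
            Pattern.compile("@(code|link|deprecated|see)\\b");

    /** Counts occurrences of the given tag name in a doc comment. */
    public static int count(String docComment, String tagName) {
        Matcher matcher = TAG.matcher(docComment);
        int count = 0;
        while (matcher.find()) {
            if (matcher.group(1).equals(tagName)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "Replaces variables. {@code replace} {@link String} @see StrLookup";
        System.out.println(count(doc, "code")); // 1
        System.out.println(count(doc, "see"));  // 1
    }
}
```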
After removing the inline and block tags, HTML characters, code examples, and symbols, the doc comments are annotated with CoreNLP. Some of the annotators used are tokenization (tokenize), sentence splitting (ssplit), part-of-speech tagging (pos), lemmatization (lemma), dependency parsing (parse), and coreference resolution (dcoref). The annotated doc comments are evaluated using features that check for the presence of certain words, parts of speech, and other information. The results of these numeric features are used to populate the feature vector.
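A minimal sketch of configuring such a CoreNLP pipeline is shown below. The input string is illustrative, and the ner annotator is added only because CoreNLP's dcoref annotator requires it.

```java
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class DocCommentAnnotator {
    public static void main(String[] args) {
        // Configure the annotators listed above; dcoref additionally needs ner.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // A cleaned doc comment (tags and HTML already stripped).
        Annotation document = new Annotation(
                "Replaces all the occurrences of variables with their matching values.");
        pipeline.annotate(document);

        // Inspect each token and its part-of-speech tag, sentence by sentence.
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            sentence.get(CoreAnnotations.TokensAnnotation.class).forEach(token ->
                    System.out.println(token.word() + " / "
                            + token.get(CoreAnnotations.PartOfSpeechAnnotation.class)));
        }
    }
}
```

Feature implementations would then query these annotations, for example counting verbs or checking for security-related lemmas.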
Automatic features are generated from the doc comments using the Paragraph Vectors implementation in the DeepLearning4J library. After the doc comments are cleaned, the method and class documents are provided as input, and the paragraph vector model is fitted. Afterwards, the vector representation of each method is extracted as an n-dimensional vector.
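The sketch below shows roughly how this could look with DeepLearning4J's `ParagraphVectors`; the corpus and hyper-parameters are placeholders rather than SWAN's actual configuration.

```java
import java.util.Arrays;
import java.util.List;
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.nd4j.linalg.api.ndarray.INDArray;

public class DocVectors {
    public static void main(String[] args) {
        // Cleaned method/class doc comments stand in for the real corpus.
        List<String> docs = Arrays.asList(
                "Replaces all the occurrences of variables with their matching values.",
                "Substitutes variables within a string by values.");
        SentenceIterator iterator = new CollectionSentenceIterator(docs);
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();

        // Hyper-parameters here are placeholders, not SWAN's settings.
        ParagraphVectors vectors = new ParagraphVectors.Builder()
                .iterate(iterator)
                .tokenizerFactory(tokenizer)
                .layerSize(100)   // n: dimensionality of the paragraph vectors
                .epochs(10)
                .build();
        vectors.fit();

        // Infer the n-dimensional vector for a (possibly unseen) doc comment.
        INDArray vector = vectors.inferVector("Replaces variables in a template.");
        System.out.println(vector);
    }
}
```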
Model selection in SWAN can be done using either of two machine learning toolkits: WEKA and ML-Plan.
Waikato Environment for Knowledge Analysis (WEKA) is an open-source machine learning library which provides implementations of machine learning algorithms and other tools for machine learning tasks. Model selection with WEKA uses a two-phase classification approach. In the first phase, methods are classified into the SRM classes, namely: sources, sinks, sanitizers, and authentication methods (auth-safe-state, auth-unsafe-state, and auth-no-change). In the second phase, methods that were classified as any of the SRM classes are further classified into the seven CWE classes.
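As a rough illustration of one such phase, the sketch below cross-validates a WEKA classifier on a hypothetical feature file; the file name and the choice of `RandomForest` are assumptions, not SWAN's actual setup.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SrmPhase {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file holding one feature vector per method,
        // with the SRM class (source, sink, sanitizer, ...) as its label.
        Instances data = DataSource.read("srm-features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Any WEKA classifier could be plugged in; RandomForest is one example.
        RandomForest model = new RandomForest();

        // 10-fold cross-validation as a stand-in for model selection.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(model, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Train the chosen model on the full phase-one data. A second,
        // analogous phase would then classify the predicted SRMs into
        // the CWE classes using their own labelled data.
        model.buildClassifier(data);
    }
}
```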
ML-Plan is an AutoML framework which uses an efficient best-first search to identify good pipelines, evaluates nodes using random completions, and includes a strategy to avoid over-fitting. As with WEKA, ML-Plan is provided with the feature representations and selects the best model for them.