The goal of this problem set is to develop a malware clustering system suitable for sample triage. In particular, you will implement a version of the BitShred feature hashing system.
To complete the problem set, you will need to ssh to your container at $user[@]amplifier.ccs.neu.edu:$port , where $user is your gitlab username and $port is your assigned ssh port (hxxps://seclab-devel.ccs.neu.edu/snippets/6). Authentication is performed using any of your uploaded ssh public keys in gitlab.
The data set you will use to evaluate your clustering system is located on your container at /usr/local/share/samples . A JSON document at /usr/local/share/samples.json indicates the arguments you should use to execute each sample, should you choose to do so.
Note: These samples are not actual malware. It should be safe to execute them on your container using the provided arguments.
For each sample, you will need to extract a feature vector. The feature vector you use is up to you. For instance, one approach you can use is to extract system call sequences and arguments using strace and the provided sample arguments.
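As a rough illustration of the strace approach, the sketch below pulls the ordered system call names out of captured strace output. It assumes the default `name(args) = result` line format, optionally prefixed by a PID as with `strace -f`; the helper name `syscall_names` and the sample trace are hypothetical.

```python
import re

# Matches lines such as 'openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3',
# optionally prefixed by '[pid 1234] ' or '1234 ' (assumed strace -f formats).
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")

def syscall_names(trace_text):
    """Return the ordered list of system call names found in strace output."""
    names = []
    for line in trace_text.splitlines():
        match = SYSCALL_RE.match(line)
        if match:  # signal and exit lines ('--- ...', '+++ ...') do not match
            names.append(match.group(1))
    return names

# Hypothetical trace text, for illustration only.
sample_trace = '''execve("./0000", ["./0000"], 0x7ffd) = 0
brk(NULL) = 0x55d0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
--- SIGCHLD {si_signo=SIGCHLD} ---
+++ exited with 0 +++
'''
print(syscall_names(sample_trace))
```

Arguments are discarded here; a richer feature vector could keep selected arguments such as file paths.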
Next, you will need to implement feature hashing. For each sample’s features, create a fingerprint using the hashing function of your choice. For further details, refer to the lecture notes and the original paper (/assets/refs/jang2011bitshred.pdf).
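One possible shape for the fingerprint is a fixed-length bit vector in which each feature sets a single bit. The sketch below assumes string features, SHA-1 as the hash, and a 1024-bit fingerprint stored in a Python integer; all three are illustrative choices, not requirements of the assignment.

```python
import hashlib

def fingerprint(features, num_bits=1024):
    """Hash each feature to one bit position; return the bit vector as an int."""
    bits = 0
    for feature in features:
        digest = hashlib.sha1(feature.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest to pick a bit index.
        index = int.from_bytes(digest[:8], "big") % num_bits
        bits |= 1 << index
    return bits

# Order and repetition of features do not change the fingerprint.
print(fingerprint(["open,read,close"]) == fingerprint(["open,read,close"] * 2))
```

A smaller `num_bits` saves memory but makes it more likely that two distinct features collide on the same bit.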
Using the sample fingerprints, compute a distance matrix that represents the pairwise Jaccard distance for all samples.
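For bit-vector fingerprints, the Jaccard distance is one minus the ratio of shared set bits to total set bits. A minimal sketch follows, assuming fingerprints are Python integers used as bit vectors and that two empty fingerprints are defined to be at distance 0 (a convention chosen here).

```python
def jaccard_distance(a, b):
    """Jaccard distance between two fingerprints stored as integers."""
    union = bin(a | b).count("1")
    if union == 0:  # both fingerprints empty: treat as identical (assumption)
        return 0.0
    return 1.0 - bin(a & b).count("1") / union

def distance_matrix(fingerprints):
    """Symmetric matrix of pairwise Jaccard distances."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix

# Toy 4-bit fingerprints, for illustration only.
for row in distance_matrix([0b1110, 0b0111, 0b1000]):
    print(row)
```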
Using the machine learning library of your choice (or, alternatively, your own implementation), perform agglomerative hierarchical clustering on the fingerprint distance matrix. The result should be a dendrogram that indicates the sample clustering hierarchy.
Use a threshold to identify a cut in the dendrogram that represents the most likely set of sample clusters.
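If you write your own implementation, clustering and cutting can be combined: merge the two closest clusters repeatedly and stop once the closest pair is farther apart than the threshold. The sketch below assumes average linkage and a square distance matrix given as a list of lists; the function name `cut_clusters` and the toy matrix are hypothetical. Because average linkage never merges at a smaller distance than an earlier merge, stopping early yields the same clusters as cutting the full dendrogram at that height.

```python
def cut_clusters(matrix, threshold):
    """Average-linkage agglomerative clustering, stopped at a distance threshold."""
    clusters = [[i] for i in range(len(matrix))]

    def linkage_distance(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(matrix[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        distance, x, y = min(
            (linkage_distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if distance > threshold:
            break  # every remaining merge would cross the cut
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(members) for members in clusters]

toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(cut_clusters(toy, 0.5))
```

This naive loop recomputes every linkage distance on each pass, so for hundreds of samples a library implementation is likely to be much faster.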
Create a repository in gitlab at git[@]seclab-devel.ccs.neu.edu:$user/prset07.git . Commit your clustering system to clustering/ , and include an executable script at clustering/cluster that runs your system with the following command-line interface:
$ ./cluster $path_to_configuration
The configuration file should contain a set of pre-computed feature vectors for each sample on your container in the file format of your choice. These should be the original vectors, not fingerprints.
The output of your tool should be the most likely set of clusters in JSON format:
{
"clusters": [
[<sample_c1_1>, <sample_c1_2>, ...],
[<sample_c2_1>, <sample_c2_2>, ...],
// ...
]
}
For example:
{
"clusters": [
["0000", "0001", "0002", "0003"],
["0004", "0005", "0006", "0007"]
]
}
NOTE: Your tool must be executable using the above interface from a fresh git checkout of your repository to receive full credit.
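A small sketch of producing that output, assuming clusters are held internally as lists of sample indices and that `sample_names` maps an index to its sample name (both are assumptions about your internal representation):

```python
import json

def format_clusters(sample_names, index_clusters):
    """Serialize index-based clusters into the required JSON document."""
    named = [[sample_names[i] for i in members] for members in index_clusters]
    return json.dumps({"clusters": named}, indent=2)

print(format_clusters(["0000", "0001", "0002", "0003"], [[0, 1], [2, 3]]))
```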
Also, commit a README.md that describes in as much detail as possible the following:
- The features that you extract from the sample set
- The feature hashing strategy you use
- The criterion you use to choose a cluster set
For extra credit, implement co-clustering. Modify your tool’s output to the following:
{
"clusters": [
{
"samples": [<sample_c1_1>, <sample_c1_2>, ...],
"features": [
<shared_feature>,
// ...
]
}
]
}
Add to your README.md a description of your co-clustering implementation.
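One simple reading of "shared features" is the set of features present in every sample of a cluster. The sketch below implements only that interpretation, which is not necessarily the co-clustering method described in the BitShred paper; `features_by_sample` is a hypothetical mapping from sample name to its feature list.

```python
def shared_features(cluster_samples, features_by_sample):
    """Features that occur in every sample of one cluster."""
    sets = [set(features_by_sample[s]) for s in cluster_samples]
    return sorted(set.intersection(*sets)) if sets else []

def co_cluster_output(clusters, features_by_sample):
    """Build the extra-credit output structure from name-based clusters."""
    return {
        "clusters": [
            {
                "samples": list(members),
                "features": shared_features(members, features_by_sample),
            }
            for members in clusters
        ]
    }

# Toy data, for illustration only.
features = {"0000": ["a", "b", "c"], "0001": ["b", "c", "d"], "0002": ["x"]}
print(co_cluster_output([["0000", "0001"], ["0002"]], features))
```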
I used the following procedure to extract features from the malware samples:
- Extracted an strace log for every malware sample using the straceExtractor Python script
- From the resulting strace files, extracted overlapping sequences of n consecutive system calls (n-grams)
- Generated a universal feature set of 28148 entries using 5-grams (n = 5, i.e. 5 consecutive system calls)
- Finally, created a unique feature set containing 431 entries
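The n-gram step above can be sketched as a sliding window over the system call sequence. This is an illustration rather than the code in the repository; the function name `ngrams` and the sample sequence are hypothetical.

```python
def ngrams(calls, n=5):
    """Return the overlapping n-grams of a system call sequence, in order."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

# Hypothetical sequence of seven system calls.
calls = ["execve", "brk", "openat", "fstat", "mmap", "close", "brk"]
grams = ngrams(calls)
print(len(grams))       # number of overlapping 5-grams in the sequence
print(len(set(grams)))  # number of distinct 5-grams
```

A sequence shorter than n yields no n-grams.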
This feature set works well because overlapping n-grams preserve the local ordering of system calls in each trace, not just which calls occur.
I used the SHA1 hashing algorithm to hash each feature. After feature hashing, I generated a sample-by-feature matrix of dimension 824 x 431. The distance matrix containing the pairwise Jaccard distance for all samples was generated using a function in allFeatureScript. Finally, I generated the dendrogram using scipy (file named 'clusterToGenerateDendrogram').
I used the scipy library to cluster the samples. The scipy library requires a distance matrix as input. I passed the 824 x 824 pairwise distance matrix to the scipy linkage function, which returned the cluster hierarchy used to plot the dendrogram.
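A minimal sketch of this step on a toy 4 x 4 matrix is shown below; it is not the code in the repository. One detail worth checking: `scipy.cluster.hierarchy.linkage` interprets a 2-D input as observation vectors, so a square distance matrix should first be converted to condensed form with `squareform`. The average linkage method and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix with a zero diagonal.
square = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

condensed = squareform(square)                        # upper triangle as a 1-D vector
tree = linkage(condensed, method="average")           # hierarchy for the dendrogram
labels = fcluster(tree, t=0.5, criterion="distance")  # cut the tree at height 0.5
print(labels)
```

The `tree` array can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting.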
To generate the cluster JSON file, I created the cluster script, which takes a configuration file as input. The output of the cluster script groups the malware samples into 431 unique families. The file clusters.json in the repo shows this clustering.
NOTE: Some features are common to all malware samples and can be removed. A feature shared among all malware samples can safely be considered benign (like feature 0 in this case).
Please find the files and programs saved at the specified location. Thank you! :-)
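The filtering suggested in this note could look like the sketch below, which drops any feature present in every sample. It is not part of the submitted scripts; `features_by_sample` is a hypothetical mapping from sample name to feature list.

```python
def drop_universal_features(features_by_sample):
    """Remove features that appear in every sample's feature list."""
    if not features_by_sample:
        return {}
    common = set.intersection(*(set(f) for f in features_by_sample.values()))
    return {
        sample: [f for f in feats if f not in common]
        for sample, feats in features_by_sample.items()
    }

# Toy data: feature 0 is shared by all samples and is removed.
toy_features = {"0000": [0, 1, 2], "0001": [0, 2, 3], "0002": [0, 4]}
print(drop_universal_features(toy_features))
```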