Project 7

May 22, 2022

cae7f09 · May 22, 2022

Name	Last commit message	Last commit date
clustering	Initial Commit	Jul 21, 2016
ProblemStatement- prset07- Malware Classification and Triage.pdf	Initial Commit	Jul 21, 2016
README.md	Updated readme file for the projects	May 22, 2022
allFeatureScript	Initial Commit	Jul 21, 2016
clusters.json	Initial Commit	Jul 21, 2016
clustersToGenerateDendrogram	Initial Commit	Jul 21, 2016
config.json	Initial Commit	Jul 21, 2016
config.txt	Initial Commit	Jul 21, 2016
featureExtractorFromStrace	Initial Commit	Jul 21, 2016
myConfig.json	Initial Commit	Jul 21, 2016
straceExtractor	Initial Commit	Jul 21, 2016

README.md

Malware Classification and Triage Problem Set

Description

The goal of this problem set is to develop a malware clustering system suitable for sample triage. In particular, you will implement a version of the BitShred feature hashing system.

To complete the problem set, you will need to ssh to your container at $user[@]amplifier.ccs.neu.edu:$port , where $user is your gitlab username and $port is your assigned ssh port (hxxps://seclab-devel.ccs.neu.edu/snippets/6). Authentication is performed using any of your uploaded ssh public keys in gitlab.

Sample Execution

The data set you will use to evaluate your clustering system is located on your container at /usr/local/share/samples . A JSON document at /usr/local/share/samples.json indicates the arguments you should use to execute each sample, should you choose to do so.

Note: These samples are not actual malware. It should be safe to execute them on your container using the provided arguments.

Feature Extraction

For each sample, you will need to extract a feature vector. The feature vector you use is up to you. For instance, one approach you can use is to extract system call sequences and arguments using strace and the provided sample arguments.
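As a rough illustration of the strace-based approach, the sketch below pulls the ordered system call names out of strace text. The regular expression, the function name `parse_syscalls`, and the example log are assumptions made for illustration; real strace output varies with the flags used and may need extra handling.

```python
import re

# Matches the syscall name at the start of a typical strace line,
# optionally preceded by a PID (as printed when following child processes).
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def parse_syscalls(strace_text):
    """Return the ordered list of syscall names found in strace output."""
    names = []
    for line in strace_text.splitlines():
        match = SYSCALL_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


# Hypothetical log excerpt; signal and exit lines are skipped by the regex.
example = """\
execve("/bin/true", ["true"], 0x7ffd0 /* 20 vars */) = 0
brk(NULL) = 0x55d0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
--- SIGCHLD {si_signo=SIGCHLD} ---
+++ exited with 0 +++
"""
print(parse_syscalls(example))
```

This keeps only the call names; including arguments, as the text suggests, would need a more careful parser.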

Feature Hashing

Next, you will need to implement feature hashing. For each sample’s features, create a fingerprint using the hashing function of your choice. For further details, refer to the lecture notes and the original paper (/assets/refs/jang2011bitshred.pdf).

Using the sample fingerprints, compute a distance matrix that represents the pairwise Jaccard distance for all samples.
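One possible shape for these two steps is sketched below: each feature is hashed to a single bit of a fixed-size fingerprint, and the Jaccard distance is computed over the set bits. The fingerprint size of 1024 bits, the use of SHA-1, and all function names are assumptions chosen for illustration, not values taken from the problem set or the paper.

```python
import hashlib

FINGERPRINT_BITS = 1024  # assumed fingerprint size, chosen for illustration


def fingerprint(features, bits=FINGERPRINT_BITS):
    """Hash each feature string to one bit position; return the bits as an int."""
    fp = 0
    for feature in features:
        digest = hashlib.sha1(feature.encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % bits
        fp |= 1 << position
    return fp


def jaccard_distance(fp_a, fp_b):
    """1 - |A & B| / |A | B|, computed over the set bits of two fingerprints."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    intersection = bin(fp_a & fp_b).count("1")
    return 1.0 - intersection / union


def distance_matrix(fingerprints):
    """Symmetric matrix of pairwise Jaccard distances."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

Because several features can hash to the same bit, distances computed on fingerprints only approximate distances on the original feature sets; a larger fingerprint reduces collisions.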

Sample Clustering

Using the machine learning library of your choice (or, alternatively, your own implementation), perform agglomerative hierarchical clustering on the fingerprint distance matrix. The result should be a dendrogram that indicates the sample clustering hierarchy.

Use a threshold to identify a cut in the dendrogram that represents the most likely set of sample clusters.
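For readers who prefer their own implementation over a library, a naive sketch of average-linkage agglomerative clustering with a threshold cut follows. The function name and the choice of average linkage are assumptions; the loop stops merging once the closest pair of clusters is farther apart than the threshold, which for average linkage corresponds to cutting the dendrogram at that height. It is far too slow for large sample sets and is meant only to show the idea.

```python
def cluster_with_threshold(dist, threshold):
    """Average-linkage agglomerative clustering, stopped at `threshold`.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a sorted list of sample indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average distance between all cross-cluster pairs.
        total = sum(dist[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # the next merge would cross the cut
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]
```

A lower threshold yields more, tighter clusters; a higher one merges more samples together.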

Answer Submission

Create a repository in gitlab at git[@]seclab-devel.ccs.neu.edu:$user/prset07.git . Commit your clustering system to clustering/ , and include an executable script at clustering/cluster that runs your system with the following command-line interface:

$ ./cluster $path_to_configuration

The configuration file should contain a set of pre-computed feature vectors for each sample on your container in the file format of your choice. These should be the original vectors, not fingerprints.

The output of your tool should be the most likely set of clusters in JSON format:

{
 "clusters": [
 [<sample_c1_1>, <sample_c1_2>, ...],
 [<sample_c2_1>, <sample_c2_2>, ...],
 // ...
 ]
}

For example:

{
 "clusters": [
 ["0000", "0001", "0002", "0003"],
 ["0004", "0005", "0006", "0007"]
 ]
}

NOTE: Your tool must be executable using the above interface from a fresh git checkout of your repository to receive full credit.
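A minimal sketch of producing the required output document is shown below. It assumes the clustering step yields lists of sample indices and that sample names are available in a parallel list; both the function name `clusters_to_json` and that data layout are hypothetical.

```python
import json


def clusters_to_json(sample_names, clusters):
    """Map clusters of sample indices to the required JSON document."""
    named = [[sample_names[i] for i in members] for members in clusters]
    return json.dumps({"clusters": named}, indent=1)


# Hypothetical sample names and index clusters.
print(clusters_to_json(["0000", "0001", "0002", "0003"], [[0, 1], [2, 3]]))
```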

Also, commit a README.md that describes in as much detail as possible the following:

The features that you extract from the sample set
The feature hashing strategy you use
The criterion you use to choose a cluster set

Extra Credit

For extra credit, implement co-clustering. Modify your tool’s output to the following:

{
 "clusters": [
 {
 "samples": [<sample_c1_1>, <sample_c1_2>, ...],
 "features": [
 <shared_feature>,
 // ...
 ]
 }
 ]
}

Add to your README.md a description of your co-clustering implementation.
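As a simplified stand-in for co-clustering, the sketch below reports, for each cluster, the features present in every one of its samples. This plain set intersection is an assumption made for illustration and is not the co-clustering algorithm described in the BitShred paper; the function names and the `features_by_sample` mapping are hypothetical.

```python
def shared_features(cluster, features_by_sample):
    """Features present in every sample of the cluster, sorted."""
    sets = [set(features_by_sample[sample]) for sample in cluster]
    return sorted(set.intersection(*sets)) if sets else []


def co_cluster_output(clusters, features_by_sample):
    """Build the extra-credit output structure as a plain dict."""
    return {
        "clusters": [
            {
                "samples": list(cluster),
                "features": shared_features(cluster, features_by_sample),
            }
            for cluster in clusters
        ]
    }
```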

Answer/ Solution

Features extracted from the sample set

I used the following procedure to extract features from the malware samples:

Extracted an strace log for every malware sample using the straceExtractor Python script
From the resulting strace files, extracted overlapping sequences of n consecutive system calls (n-grams)
Generated a universal feature set containing 28148 entries using 5-grams (n = 5, i.e. 5 consecutive system calls)
Finally, created a unique feature set containing 431 entries

The feature set generated by this procedure captures the order of system calls, since each feature is an overlapping window over the strace output.
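The n-gram step described above can be sketched as follows. The function names and the toy traces are illustrative assumptions; this is not the code in featureExtractorFromStrace.

```python
def ngrams(calls, n=5):
    """Overlapping windows of n consecutive system calls."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def unique_features(traces, n=5):
    """Sorted list of distinct n-grams seen across all traces."""
    seen = set()
    for calls in traces:
        seen.update(ngrams(calls, n))
    return sorted(seen)


# Toy traces: lists of syscall names in call order.
trace_a = ["open", "read", "read", "write", "close", "exit"]
trace_b = ["open", "read", "read", "write", "close", "open"]
print(len(ngrams(trace_a)), len(unique_features([trace_a, trace_b])))
```

Traces shorter than n produce no n-grams, so very short runs contribute no features.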

The Feature Hashing strategy Used

I used the SHA1 hashing algorithm to hash each feature. After feature hashing, I generated a matrix of dimension 824 x 431 (824 samples by 431 unique features). The distance matrix containing the pairwise Jaccard distance for all samples was generated using a function in allFeatureScript. Finally, I generated the dendrogram using scipy (file named 'clustersToGenerateDendrogram').

The Criteria Used to Choose Cluster Set

I used the scipy library to cluster the samples. The scipy linkage function requires a distance matrix as input, so I passed it the 824 x 824 distance matrix. The result is the cluster hierarchy, which can be plotted as a dendrogram.

To generate the cluster JSON file, I created the cluster script, which takes a configuration file as input. The output of the cluster script groups the malware samples into 431 unique families. The clusters.json file in the repo shows this clustering of the malware samples into 431 families.

NOTE: Some features are common to all malware samples and can be removed. A feature shared by all malware samples can safely be considered benign (like feature 0 in this case).
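One way to apply this filtering is sketched below, assuming features are stored per sample as lists of feature IDs. The function name and the data layout are hypothetical and not taken from the repository scripts.

```python
def drop_universal_features(features_by_sample):
    """Remove features that occur in every sample."""
    sets = {sample: set(feats) for sample, feats in features_by_sample.items()}
    if not sets:
        return {}
    universal = set.intersection(*sets.values())
    return {sample: sorted(feats - universal) for sample, feats in sets.items()}
```

Dropping such features does not change which samples differ from each other, but it lowers the baseline similarity that every pair of samples would otherwise share.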

Please find the files and programs saved at the specified locations. Thank you! :-)
