`src.preprocessing`

extract_single_snapshot(df, day)

Extract and format a single snapshot from a DataFrame for a given day.

Parameters

df:pd.DataFrame

The input DataFrame containing network data.
day:int

The specific day for which to extract the snapshot.

Returns

str

A formatted string representing the snapshot for the given day.

aggregate_edges(df, gt)

Aggregate edges in a DataFrame while counting packets and adding labels.

Parameters

df:pd.DataFrame

The input DataFrame containing network data.
gt:pd.DataFrame

The ground truth labels for source IPs.

Returns

pd.DataFrame

An aggregated DataFrame with added packet count and labels for edges.
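The aggregation described above can be sketched as follows. This is an illustrative reading of the docstring, not the module's implementation: the function name `aggregate_edges_sketch`, the column names `src_ip`, `dst_port`, `packets`, and `label`, the one-row-per-packet layout, and the `"unknown"` fallback label are all assumptions.

```python
import pandas as pd

def aggregate_edges_sketch(df: pd.DataFrame, gt: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: one row per (src_ip, dst_port) edge with a packet count and label."""
    # Assumes each input row is one packet, so the group size is the packet count.
    edges = (
        df.groupby(["src_ip", "dst_port"])
        .size()
        .reset_index(name="packets")
    )
    # A left join keeps edges whose source IP has no ground-truth entry.
    edges = edges.merge(gt[["src_ip", "label"]], on="src_ip", how="left")
    # Assumed fallback label for source IPs missing from the ground truth.
    edges["label"] = edges["label"].fillna("unknown")
    return edges

packets = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "dst_port": [23, 23, 80],
})
labels = pd.DataFrame({"src_ip": ["10.0.0.1"], "label": ["mirai"]})
edges = aggregate_edges_sketch(packets, labels)
```
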

get_contacted_dst_ports(df)

Get the total number of contacted destination ports per source IP.

Parameters

df:pd.DataFrame

The input DataFrame containing network data.

Returns

pd.DataFrame

A DataFrame with the total number of contacted destination ports per source IP.
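A per-source port count of this kind is typically a grouped distinct count. The sketch below assumes columns named `src_ip` and `dst_port`, assumes that "total number" means distinct ports, and uses a hypothetical output column `n_dst_ports`; the actual function may differ on all three points.

```python
import pandas as pd

def contacted_dst_ports_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: number of distinct destination ports per source IP."""
    return (
        df.groupby("src_ip")["dst_port"]
        .nunique()  # repeated contacts to the same port count once
        .reset_index(name="n_dst_ports")
    )

traffic = pd.DataFrame({
    "src_ip": ["a", "a", "a", "b"],
    "dst_port": [22, 22, 80, 443],
})
port_counts = contacted_dst_ports_sketch(traffic)
```
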

get_stats_per_dst_port(df)

Get general statistics of packets per destination port for each source IP.

Parameters

df:pd.DataFrame

The input DataFrame containing network data.

Returns

pd.DataFrame

A DataFrame with general statistics of packets per destination port per source IP.

get_contacted_src_ips(df)

Get the total number of source IPs contacting each destination port.

Parameters

df:pd.DataFrame

The input DataFrame containing network data.

Returns

pd.DataFrame

A DataFrame with the total number of source IPs contacting each destination port.

get_stats_per_src_ip(df)

Get general statistics of packets per source IP per destination port.

Parameters

df:pd.DataFrame

The input DataFrame containing network data.

Returns

pd.DataFrame

A DataFrame with general statistics of packets per source IP per destination port.

get_contacted_dst_ips(df, dummy=False)

Get the total number of contacted darknet IPs per source IP or destination port.

Parameters

df:pd.DataFrame

The input DataFrame containing network data.
dummy:bool, optional

If True, counts contacted darknet IPs per destination port instead of per source IP. Defaults to False.

Returns

pd.DataFrame

A DataFrame with the total number of contacted darknet IPs per source IP or destination port.

get_stats_per_dst_ip(df, dummy=False)

Get general statistics of packets per destination IP per source IP or destination port.

Parameters

df:pd.DataFrame

The input DataFrame containing network data.
dummy:bool, optional

If True, calculates statistics per destination IP for each destination port instead of for each source IP. Defaults to False.

Returns

pd.DataFrame

A DataFrame with general statistics of packets per destination IP per source IP or destination port.

get_packet_statistics(df, by='src_ip')

Get general packet statistics per source IP or destination port.

Parameters

df:pd.DataFrame

The input DataFrame containing network data.
by:str, optional

The column by which to group the packet statistics ('src_ip' or 'dst_port'), by default 'src_ip'.

Returns

pd.DataFrame

A DataFrame with general packet statistics per source IP or destination port.

uniform_features(df, lookup, node_type)

Format and index a features DataFrame uniformly, based on the node lookup.

Parameters

df:pd.DataFrame

The input DataFrame containing node features.
lookup:dict

A dictionary mapping node names to IDs.
node_type:str

The type of nodes in the DataFrame (e.g., 'src_ip', 'dst_port').

Returns

pd.DataFrame

A uniformly formatted and indexed DataFrame of node features.
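One plausible reading of "uniformly formatted and indexed" is that node names are replaced by their IDs and every node in the lookup gets a row. The sketch below follows that reading only. The zero fill for missing nodes, the assumption of one feature row per node, and the name `uniform_features_sketch` are all hypothetical.

```python
import pandas as pd

def uniform_features_sketch(df: pd.DataFrame, lookup: dict, node_type: str) -> pd.DataFrame:
    """Hypothetical sketch: index features by node ID, one row per node in `lookup`."""
    out = df.copy()
    # Replace node names (e.g. IP strings) with their integer IDs.
    out[node_type] = out[node_type].map(lookup)
    # Drop rows whose node is not in the lookup, then index by ID.
    out = out.dropna(subset=[node_type])
    out[node_type] = out[node_type].astype(int)
    out = out.set_index(node_type)
    # Assumption: nodes without features get an all-zero row.
    return out.reindex(sorted(lookup.values()), fill_value=0.0)

features = pd.DataFrame({"src_ip": ["c", "a"], "n_packets": [5.0, 3.0]})
node_ids = {"a": 0, "b": 1, "c": 2}
uniform = uniform_features_sketch(features, node_ids, "src_ip")
```
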

generate_adjacency_matrices(flist, weighted=True)

Generate adjacency matrices from a list of DataFrame files.

Parameters

flist:list

A list of file paths, each containing a DataFrame of network data.
weighted:bool, optional

If True, the edges in the generated matrices will be weighted, by default True.

Returns

list

A list of torch sparse tensors representing the adjacency matrices.

drop_duplicates(x)

Remove consecutive duplicate elements from a NumPy array.

Parameters

x:numpy.ndarray

The input NumPy array from which consecutive duplicates will be removed.

Returns

numpy.ndarray

A NumPy array with consecutive duplicate elements removed.
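Removing consecutive duplicates differs from `numpy.unique`: a value may reappear later as long as it is not adjacent to itself. A minimal sketch for 1-D arrays, under a hypothetical name and not necessarily how the module implements it:

```python
import numpy as np

def drop_consecutive_duplicates(x: np.ndarray) -> np.ndarray:
    """Sketch: keep an element only if it differs from its predecessor (1-D input)."""
    x = np.asarray(x)
    if x.size == 0:
        return x
    keep = np.empty(x.shape[0], dtype=bool)
    keep[0] = True                 # the first element has no predecessor
    keep[1:] = x[1:] != x[:-1]     # compare each element with the previous one
    return x[keep]

deduped = drop_consecutive_duplicates(np.array([1, 1, 2, 2, 1, 3, 3]))
```

Here `deduped` keeps the second `1` because it is separated from the first run by `2`.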

split_array(arr, step=1000)

Split a NumPy array into smaller sub-arrays of a specified step size.

Parameters

arr:numpy.ndarray

The input NumPy array to be split.
step:int, optional

The size of each sub-array, by default 1000.

Returns

list

A list of NumPy sub-arrays obtained by splitting the input array.
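The chunking behaviour can be sketched with plain slicing. The name `split_array_sketch` is hypothetical, and the shorter final chunk is an assumption about how a remainder is handled.

```python
import numpy as np

def split_array_sketch(arr: np.ndarray, step: int = 1000) -> list:
    """Sketch: consecutive chunks of at most `step` elements; the last may be shorter."""
    return [arr[i:i + step] for i in range(0, len(arr), step)]

chunks = split_array_sketch(np.arange(7), step=3)
```
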

generate_negatives(anomaly_num, active_source, active_dest, real_edges)

Generate negative edges for self-supervised training.

Parameters

anomaly_num:int

Number of negative edges to generate.
active_source:numpy.ndarray

Array of active source nodes.
active_dest:numpy.ndarray

Array of active destination nodes.
real_edges:numpy.ndarray

Array of real edges in the graph.

Returns

torch.Tensor

A tensor containing the generated negative edges.
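Negative sampling for link prediction is commonly done by rejection: draw a random (source, destination) pair from the active nodes and keep it only if it is not a real edge. The sketch below illustrates that idea under several assumptions: `real_edges` has shape `(n, 2)`, a `seed` parameter and a retry cap are added for the example, and the result is a NumPy array rather than the documented `torch.Tensor` (it could be converted with `torch.from_numpy`). The module's sampling strategy may differ.

```python
import numpy as np

def generate_negatives_sketch(anomaly_num, active_source, active_dest, real_edges, seed=0):
    """Hypothetical sketch: rejection-sample node pairs that are not real edges."""
    rng = np.random.default_rng(seed)
    existing = {(int(s), int(d)) for s, d in real_edges}
    negatives = set()
    # Cap the number of draws so the loop ends even if too few negatives exist.
    max_tries = 100 * anomaly_num
    tries = 0
    while len(negatives) < anomaly_num and tries < max_tries:
        pair = (int(rng.choice(active_source)), int(rng.choice(active_dest)))
        if pair not in existing:
            negatives.add(pair)  # the set also removes repeated draws
        tries += 1
    return np.array(sorted(negatives), dtype=np.int64).reshape(-1, 2)

sources = np.array([0, 1, 2])
dests = np.array([10, 11, 12])
real = np.array([[0, 10], [1, 11]])
neg = generate_negatives_sketch(4, sources, dests, real)
```
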

get_self_supervised_edges(X_to_predict, cuda, ns)

Get self-supervised edges for training.

Parameters

X_to_predict:torch.Tensor

The input adjacency matrix for which self-supervised edges are generated.
cuda:bool

Indicates whether to use CUDA (GPU) for tensor operations.
ns:int

Number of negative samples to generate for each positive edge.

Returns

tuple

A tuple containing the generated negative edges tensor and the index tensor.

load_single_file(file, day)

Load and preprocess a single data file.

Parameters

file:str

The path to the data file to load.
day:int

The day associated with the loaded data.

Returns

pandas.DataFrame

A DataFrame containing the preprocessed data.

apply_packets_filter(df, min_packets)

Apply a packet count filter to a DataFrame.

Parameters

df:pandas.DataFrame

The input DataFrame containing packet data.
min_packets:int

The minimum number of packets a source IP must have to be retained.

Returns

pandas.DataFrame

A DataFrame with the packet count filter applied.
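A filter of this kind can be sketched by counting rows per source IP and keeping only the sources that reach the threshold. The column name `src_ip`, the one-row-per-packet layout, and the inclusive threshold are assumptions made for illustration.

```python
import pandas as pd

def packets_filter_sketch(df: pd.DataFrame, min_packets: int) -> pd.DataFrame:
    """Hypothetical sketch: keep rows whose source IP sent at least `min_packets` packets."""
    # Map each row to the total number of rows (packets) of its source IP.
    per_source = df["src_ip"].map(df["src_ip"].value_counts())
    return df[per_source >= min_packets]

traffic = pd.DataFrame({
    "src_ip": ["a", "a", "a", "b"],
    "dst_port": [22, 23, 80, 443],
})
kept = packets_filter_sketch(traffic, min_packets=2)
```
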

apply_port_filter(df, max_ports)

Apply a port count filter to a DataFrame.

Parameters

df:pandas.DataFrame

The input DataFrame containing packet data.
max_ports:int

The maximum number of ports to retain in the "dst_port" column.

Returns

pandas.DataFrame

A DataFrame with the port count filter applied.
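The description leaves open which ports are retained. One common choice, assumed here, is to keep the `max_ports` most frequently contacted ports and drop rows for all others. The actual function may instead relabel the remaining ports rather than drop them, so treat this only as a sketch of the top-N idea.

```python
import pandas as pd

def port_filter_sketch(df: pd.DataFrame, max_ports: int) -> pd.DataFrame:
    """Hypothetical sketch: keep rows whose dst_port is among the top `max_ports` by frequency."""
    top_ports = df["dst_port"].value_counts().nlargest(max_ports).index
    return df[df["dst_port"].isin(top_ports)]

traffic = pd.DataFrame({"dst_port": [23, 23, 23, 80, 80, 443]})
top_two = port_filter_sketch(traffic, max_ports=2)
```
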
