Latest commit

f2930b7 · Sep 6, 2023

400 lines (256 loc) · 7.84 KB

preprocessing.md

src.preprocessing

extract_single_snapshot(df, day)

source

Extract and format a single snapshot from a DataFrame for a given day.

Parameters

  • df:pd.DataFrame

    The input DataFrame containing network data.

  • day:int

    The specific day for which to extract the snapshot.

Returns

  • str

    A formatted string representing the snapshot for the given day.


aggregate_edges(df, gt)

source

Aggregate edges in a DataFrame while counting packets and adding labels.

Parameters

  • df:pd.DataFrame

    The input DataFrame containing network data.

  • gt:pd.DataFrame

    The ground truth labels for source IPs.

Returns

  • pd.DataFrame

    An aggregated DataFrame with added packet count and labels for edges.
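
As a rough illustration of what "aggregating edges" could look like, the sketch below groups rows into (source IP, destination port) edges, counts the rows in each group as packets, and left-joins labels by source IP. This is not the function's actual implementation: the name `aggregate_edges_sketch`, the output column `pkts`, and the `label` column of `gt` are assumptions made for the example.

```python
import pandas as pd


def aggregate_edges_sketch(df: pd.DataFrame, gt: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: one row per (src_ip, dst_port) edge with a packet count."""
    # Assumption: each input row is one packet, so the group size is the packet count.
    edges = (
        df.groupby(["src_ip", "dst_port"])
        .size()
        .rename("pkts")
        .reset_index()
    )
    # Assumption: gt has columns "src_ip" and "label"; unlabeled sources get NaN.
    return edges.merge(gt, on="src_ip", how="left")


toy_traffic = pd.DataFrame({
    "src_ip": ["a", "a", "a", "b"],
    "dst_port": [22, 22, 80, 22],
})
toy_labels = pd.DataFrame({"src_ip": ["a"], "label": ["scanner"]})
toy_edges = aggregate_edges_sketch(toy_traffic, toy_labels)
```

A left join keeps edges whose source IP has no ground-truth label, which seems the safer default when labels are incomplete.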


get_contacted_dst_ports(df)

source

Get the total number of contacted destination ports per source IP.

Parameters

  • df:pd.DataFrame

    The input DataFrame containing network data.

Returns

  • pd.DataFrame

    A DataFrame with the total number of contacted destination ports per source IP.
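
A minimal sketch of this kind of count, assuming the input has `src_ip` and `dst_port` columns and that "contacted ports" means distinct ports. The helper name and the `n_dst_ports` output column are hypothetical and may differ from the real function.

```python
import pandas as pd


def contacted_dst_ports_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: number of distinct destination ports per source IP."""
    return (
        df.groupby("src_ip")["dst_port"]
        .nunique()               # distinct ports, so repeated contacts count once
        .rename("n_dst_ports")   # hypothetical output column name
        .reset_index()
    )


toy = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "dst_port": [22, 80, 22, 443],
})
ports_per_source = contacted_dst_ports_sketch(toy)
```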


get_stats_per_dst_port(df)

source

Get general statistics of packets per destination port per source IP.

Parameters

  • df:pd.DataFrame

    The input DataFrame containing network data.

Returns

  • pd.DataFrame

    A DataFrame with general statistics of packets per destination port per source IP.


get_contacted_src_ips(df)

source

Get the total number of source IPs contacting each destination port.

Parameters

  • df:pd.DataFrame

    The input DataFrame containing network data.

Returns

  • pd.DataFrame

    A DataFrame with the total number of contacted source IPs per destination port.


get_stats_per_src_ip(df)

source

Get general statistics of packets per source IP per destination port.

Parameters

  • df:pd.DataFrame

    The input DataFrame containing network data.

Returns

  • pd.DataFrame

    A DataFrame with general statistics of packets per source IP per destination port.


get_contacted_dst_ips(df, dummy=False)

source

Get the total number of contacted darknet IPs per source IP or destination port.

Parameters

  • df:pd.DataFrame

    The input DataFrame containing network data.

  • dummy:bool, optional

    If True, counts contacted darknet IPs per destination port instead of per source IP. Defaults to False.

Returns

  • pd.DataFrame

    A DataFrame with the total number of contacted darknet IPs per source IP or destination port.


get_stats_per_dst_ip(df, dummy=False)

source

Get general statistics of packets per destination IP per source IP or destination port.

Parameters

  • df:pd.DataFrame

    The input DataFrame containing network data.

  • dummy:bool, optional

    If True, calculates statistics per destination IP per destination port instead of per source IP. Defaults to False.

Returns

  • pd.DataFrame

    A DataFrame with general statistics of packets per destination IP per source IP or destination port.


get_packet_statistics(df, by='src_ip')

source

Get general packet statistics per source IP or destination port.

Parameters

  • df:pd.DataFrame

    The input DataFrame containing network data.

  • by:str, optional

    The column by which to group the packet statistics ('src_ip' or 'dst_port'), by default 'src_ip'.

Returns

  • pd.DataFrame

    A DataFrame with general packet statistics per source IP or destination port.
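
The grouping behaviour described by `by` can be sketched as below. The packet-count column name `pkts`, the chosen statistics, and the helper name are assumptions for illustration; the actual function may compute a different set of statistics.

```python
import pandas as pd


def packet_statistics_sketch(df: pd.DataFrame, by: str = "src_ip",
                             pkts_col: str = "pkts") -> pd.DataFrame:
    """Illustrative only: summary statistics of packet counts per group."""
    if by not in ("src_ip", "dst_port"):
        raise ValueError("by must be 'src_ip' or 'dst_port'")
    # One row per group, one column per statistic.
    return df.groupby(by)[pkts_col].agg(["sum", "mean", "std", "min", "max"])


toy = pd.DataFrame({
    "src_ip": ["a", "a", "b"],
    "dst_port": [22, 80, 22],
    "pkts": [1, 5, 4],   # hypothetical packet-count column
})
stats_by_source = packet_statistics_sketch(toy)                # default: by="src_ip"
stats_by_port = packet_statistics_sketch(toy, by="dst_port")
```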


uniform_features(df, lookup, node_type)

source

Uniformly format and index features DataFrame based on node lookup.

Parameters

  • df:pd.DataFrame

    The input DataFrame containing node features.

  • lookup:dict

    A dictionary mapping node names to IDs.

  • node_type:str

    The type of nodes in the DataFrame (e.g., 'src_ip', 'dst_port').

Returns

  • pd.DataFrame

    A uniformly formatted and indexed DataFrame of node features.
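
One plausible reading of "uniformly format and index" is to re-index the features by node ID so that every node in `lookup` has exactly one row, even if it has no features in `df`. The sketch below follows that reading; the zero fill value, the `node_id` index name, and the helper name are assumptions, not documented behaviour.

```python
import pandas as pd


def uniform_features_sketch(df: pd.DataFrame, lookup: dict,
                            node_type: str) -> pd.DataFrame:
    """Illustrative only: one feature row per node ID in `lookup`."""
    indexed = df.set_index(node_type)
    indexed.index = indexed.index.map(lookup)      # node name -> node ID
    indexed = indexed[indexed.index.notna()]       # drop nodes missing from lookup
    # Assumption: nodes without features get all-zero rows.
    full = indexed.reindex(sorted(lookup.values()), fill_value=0)
    full.index.name = "node_id"                    # hypothetical index name
    return full


toy_features = pd.DataFrame({"src_ip": ["b", "a"], "n_pkts": [3, 7]})
toy_lookup = {"a": 0, "b": 1, "c": 2}
uniform = uniform_features_sketch(toy_features, toy_lookup, "src_ip")
```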


generate_adjacency_matrices(flist, weighted=True)

source

Generate adjacency matrices from a list of DataFrame files.

Parameters

  • flist:list

    A list of file paths, each containing a DataFrame of network data.

  • weighted:bool, optional

    If True, the edges in the generated matrices will be weighted, by default True.

Returns

  • list

    A list of torch sparse tensors representing the adjacency matrices.


drop_duplicates(x)

source

Remove consecutive duplicate elements from a NumPy array.

Parameters

  • x:numpy.ndarray

    The input NumPy array from which consecutive duplicates will be removed.

Returns

  • numpy.ndarray

    A NumPy array with consecutive duplicate elements removed.
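
Removing consecutive duplicates differs from `numpy.unique`: a value may reappear later as long as it is not adjacent to itself. A minimal sketch for 1-D arrays follows; the helper name is hypothetical and the actual implementation may differ.

```python
import numpy as np


def drop_consecutive_duplicates(x: np.ndarray) -> np.ndarray:
    """Illustrative only: keep an element if it differs from its predecessor."""
    x = np.asarray(x)
    if x.size == 0:
        return x
    # The first element is always kept; later ones only if they change the value.
    keep = np.concatenate(([True], x[1:] != x[:-1]))
    return x[keep]


deduped = drop_consecutive_duplicates(np.array([1, 1, 2, 2, 2, 1, 3, 3]))
```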


split_array(arr, step=1000)

source

Split a NumPy array into smaller sub-arrays of a specified step size.

Parameters

  • arr:numpy.ndarray

    The input NumPy array to be split.

  • step:int, optional

    The size of each sub-array, by default 1000.

Returns

  • list

    A list of NumPy sub-arrays obtained by splitting the input array.
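
A sketch of chunking by step size, assuming that the last sub-array is simply shorter when the length is not a multiple of `step`. The helper name is hypothetical.

```python
import numpy as np


def split_array_sketch(arr: np.ndarray, step: int = 1000) -> list:
    """Illustrative only: consecutive chunks of at most `step` elements."""
    if step <= 0:
        raise ValueError("step must be a positive integer")
    arr = np.asarray(arr)
    return [arr[i:i + step] for i in range(0, len(arr), step)]


chunks = split_array_sketch(np.arange(7), step=3)
```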


generate_negatives(anomaly_num, active_source, active_dest, real_edges)

source

Generate negative edges for self-supervised training.

Parameters

  • anomaly_num:int

    Number of negative edges to generate.

  • active_source:numpy.ndarray

    Array of active source nodes.

  • active_dest:numpy.ndarray

    Array of active destination nodes.

  • real_edges:numpy.ndarray

    Array of real edges in the graph.

Returns

  • torch.Tensor

    A tensor containing the generated negative edges.
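
Negative sampling of this kind is commonly done by drawing random (source, destination) pairs from the active nodes and rejecting pairs that already exist as real edges. The sketch below shows that general technique with NumPy only; it returns a NumPy array rather than a `torch.Tensor`, and the `seed` parameter, the `(n, 2)` edge layout, and the attempt cap are assumptions that may not match the actual function.

```python
import numpy as np


def generate_negatives_sketch(anomaly_num, active_source, active_dest,
                              real_edges, seed=0):
    """Illustrative only: rejection-sample edges that are absent from the graph."""
    rng = np.random.default_rng(seed)
    # Assumption: real_edges has shape (n, 2) with rows (source, destination).
    existing = {(int(s), int(d)) for s, d in real_edges}
    negatives = set()
    attempts = 0
    max_attempts = 100 * anomaly_num + 100   # guard against dense graphs
    while len(negatives) < anomaly_num and attempts < max_attempts:
        s = int(rng.choice(active_source))
        d = int(rng.choice(active_dest))
        if (s, d) not in existing:
            negatives.add((s, d))            # the set also removes repeated draws
        attempts += 1
    return np.array(sorted(negatives), dtype=np.int64).reshape(-1, 2)


sources = np.array([0, 1, 2])
dests = np.array([10, 11, 12])
real = np.array([[0, 10], [1, 11]])
neg_edges = generate_negatives_sketch(4, sources, dests, real)
```

The attempt cap means fewer than `anomaly_num` edges can be returned when almost every candidate pair is already a real edge.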


get_self_supervised_edges(X_to_predict, cuda, ns)

source

Get self-supervised edges for training.

Parameters

  • X_to_predict:torch.Tensor

    The input adjacency matrix for which self-supervised edges are generated.

  • cuda:bool

    Indicates whether to use CUDA (GPU) for tensor operations.

  • ns:int

    Number of negative samples to generate for each positive edge.

Returns

  • tuple

    A tuple containing the generated negative edges tensor and the index tensor.


load_single_file(file, day)

source

Load and preprocess a single data file.

Parameters

  • file:str

    The path to the data file to load.

  • day:int

    The day associated with the loaded data.

Returns

  • pandas.DataFrame

    A DataFrame containing the preprocessed data.


apply_packets_filter(df, min_packets)

source

Apply a packet count filter to a DataFrame.

Parameters

  • df:pandas.DataFrame

    The input DataFrame containing packet data.

  • min_packets:int

    The minimum number of packets a source IP must have to be retained.

Returns

  • pandas.DataFrame

    A DataFrame with the packet count filter applied.
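
A minimal sketch of a per-source packet filter, assuming a packet-count column named `pkts` and that the threshold applies to each source IP's total. Both are assumptions for illustration, as is the helper name.

```python
import pandas as pd


def packets_filter_sketch(df: pd.DataFrame, min_packets: int,
                          pkts_col: str = "pkts") -> pd.DataFrame:
    """Illustrative only: keep rows whose source IP sent at least `min_packets`."""
    # transform("sum") broadcasts each source's total back onto its rows.
    totals = df.groupby("src_ip")[pkts_col].transform("sum")
    return df[totals >= min_packets]


toy = pd.DataFrame({
    "src_ip": ["a", "a", "b"],
    "dst_port": [22, 80, 22],
    "pkts": [5, 10, 1],   # hypothetical packet-count column
})
kept = packets_filter_sketch(toy, min_packets=10)
```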


apply_port_filter(df, max_ports)

source

Apply a port count filter to a DataFrame.

Parameters

  • df:pandas.DataFrame

    The input DataFrame containing packet data.

  • max_ports:int

    The maximum number of ports to retain in the "dst_port" column.

Returns

  • pandas.DataFrame

    A DataFrame with the port count filter applied.
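
The description does not say how ports are ranked when only `max_ports` are retained. The sketch below assumes ranking by total packets, using a hypothetical `pkts` column, and keeps the rows for the top ports. The real function may use a different criterion, such as the number of distinct sources per port.

```python
import pandas as pd


def port_filter_sketch(df: pd.DataFrame, max_ports: int,
                       pkts_col: str = "pkts") -> pd.DataFrame:
    """Illustrative only: keep rows for the `max_ports` busiest destination ports."""
    # Assumption: "busiest" means the largest total packet count per port.
    top_ports = df.groupby("dst_port")[pkts_col].sum().nlargest(max_ports).index
    return df[df["dst_port"].isin(top_ports)]


toy = pd.DataFrame({
    "src_ip": ["a", "b", "a", "c"],
    "dst_port": [22, 22, 80, 443],
    "pkts": [10, 5, 3, 7],   # hypothetical packet-count column
})
top_two = port_filter_sketch(toy, max_ports=2)
```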