
#Parsing Root Data

Table of contents:

###New Format

  1. Intro
  2. Examples

###Old Format

  1. Titans
  2. Functions
  3. Classes
  4. Examples

##New Format

###Intro

To speed up our preprocessing and make it more flexible, we first need to parse our data out of the ROOT files and put it in pandas tables. To do this we call delphes_parse.py. We can store our pandas tables in two different formats: HDF5 or msgpack. HDF5 is the preferred format, but some machines (e.g. CSCS) do not support it when used in conjunction with PyTables. Pandas requires PyTables to write HDF5, so if we want to run our parsing and/or preprocessing steps on a machine that does not support PyTables, we need to resort to the msgpack format, which has the disadvantage of only being readable all at once, as opposed to in chunks of our choosing. During the parsing step we read whatever data we need from the ROOT files, then do track matching and compute isolation values. The old parsing format, while flexible, cannot handle these processing steps, which necessitates this new method of parsing.

###Examples

At the moment delphes_parse.py is designed to work with the file structure on the Titans machine, so it does not require a fully qualified path to each directory of ROOT files; this can easily be changed in the code if one wants to use a different file structure.

To write in HDF5 format:

python utils/delphes_parse.py qcd_lepFilter_13TeV

To write in msgpack format:

python utils/delphes_parse.py qcd_lepFilter_13TeV -m

Where qcd_lepFilter_13TeV can be replaced with the name of any other directory that contains ROOT data. The output will go in qcd_lepFilter_13TeV/pandas_h5/ or qcd_lepFilter_13TeV/pandas_msg/ depending on which format you used. The msgpack format creates two types of files: .msg and .meta. The .msg files contain the parsed tables, and the .meta files contain information about how many rows in each data table are used for each entry (how many samples per entry). The .meta files are necessary for quickly skipping over data during preprocessing.

In both HDF5 and msgpack formats you can choose the number of samples to parse using the -n flag. It is recommended that you set the -n flag; otherwise all of the data will be parsed, which could fill your disk.

python utils/delphes_parse.py qcd_lepFilter_13TeV -m -n 200000
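
Once parsed, the tables can be read back with pandas. A rough sketch (the file names here are made up for illustration; the actual names depend on what delphes_parse.py writes into pandas_h5/ or pandas_msg/, 'data' is an assumed HDF5 key, and partial reads assume the tables were written in a chunk-readable table format):

import pandas as pd

# HDF5 lets us read a slice of rows of our choosing
chunk = pd.read_hdf("qcd_lepFilter_13TeV/pandas_h5/qcd_0.h5", "data", start=0, stop=1000)

# msgpack can only be read back all at once
frame = pd.read_msgpack("qcd_lepFilter_13TeV/pandas_msg/qcd_0.msg")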

##Old Format

###Titans

For those of you with access to the CMS Titans machine, navigate here for a tutorial:

/notebooks/dweitekamp/CMS_SURF_2016/data_parse/Delphi_Parse.ipynb

###Functions

####ROOT_to_pandas

CMS_SURF_2016.utils.data_parse.ROOT_to_pandas

Extracts values from a .root file and writes them to a pandas frame. Essentially, it takes ROOT data out of its tree format and puts it in a table.
Arguments:

  • inputfilepath - The path to the .root file, held locally or in a distributed system
  • leaves - A list of the names of the leaves that need to be extracted
  • trees - A list of the trees in the .root file that need to be parsed. Reads all trees in the file if not set.
  • columns - A list of the column names for the table. Uses the leaf names if not set.
  • verbosity - 0: no print output, 1: some, 2: a lot of the table is printed

Returns:

  • frame - A Pandas data frame
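
For instance, a minimal call might look like the following (the file path and leaves are just examples; see the Examples section below for fuller usage):

from CMS_SURF_2016.utils.data_parse import ROOT_to_pandas

# Extract two leaves, letting trees and verbosity use their defaults
frame = ROOT_to_pandas("../data/ttbar_13TeV_80.root",
                       ["Particle.PT", "Particle.Eta"],
                       columns=["PT", "Eta"])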

####leaves_from_obj

CMS_SURF_2016.utils.data_parse.leaves_from_obj

Takes the name of an object and a list of properties (columns), and expands them into fully qualified leaf names. For example, leaves_from_obj("ObjectName", ["Prop1", "Prop2", "Prop3"]) -> (["ObjectName.Prop1", "ObjectName.Prop2", "ObjectName.Prop3"], ["Prop1", "Prop2", "Prop3"]). It can also take DataProcessingProcedures, which it will properly expand in the column output.
Arguments:

  • objname - The name of the object.
  • input_leaves - The names of its properties.

Returns:

  • (leaves, columns)
    leaves - A list of fully qualified leaf names
    columns - A list of column names. Expanded if a DataProcessingProcedure was given.
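
For example, a quick sketch using the same Jet columns as in the examples below:

from CMS_SURF_2016.utils.data_parse import leaves_from_obj

leaves, columns = leaves_from_obj("Jet", ["PT", "Eta", "Phi", "T"])
# leaves  -> ["Jet.PT", "Jet.Eta", "Jet.Phi", "Jet.T"]
# columns -> ["PT", "Eta", "Phi", "T"]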

###Classes

####DataProcessingProcedure

CMS_SURF_2016.utils.data_parse.DataProcessingProcedure

class DataProcessingProcedure():
    '''An object that can be passed as a leaf in ROOT_to_pandas. Instead of simply grabbing a
    leaf, it takes in data and applies a function to it. The outputs of that function are
    written to the pandas frame instead of the leaf.'''

    def __init__(self, func, input_leaves, output_names):
        '''func: a function that maps the inputs (a list or tuple) to the outputs (a list or tuple).
        input_leaves: The fully qualified names of the leaves whose values func will take as inputs.
        output_names: The names of the column headers for the outputs.'''
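
As a quick illustration of the constructor (a hypothetical procedure, not one from the tutorial below), a procedure that computes transverse momentum from Px and Py might look like:

import numpy as np
from CMS_SURF_2016.utils.data_parse import DataProcessingProcedure

# Hypothetical example: compute PT from the Particle.Px and Particle.Py leaves
pt_proc = DataProcessingProcedure(lambda inputs: [np.sqrt(inputs[0]**2 + inputs[1]**2)],
                                  ["Particle.Px", "Particle.Py"],
                                  ["PT"])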

###Examples

We can extract the data directly:

from CMS_SURF_2016.utils.data_parse import ROOT_to_pandas

particle_frame = ROOT_to_pandas("../data/ttbar_13TeV_80.root",
                                ["Particle.E", "Particle.Px", "Particle.Py", "Particle.Pz",
                                 "Particle.PID", "Particle.Charge"],
                                columns=["E", "Px", "Py", "Pz", "PID", "Charge"],
                                verbosity=1)

Or save ourselves some typing with leaves_from_obj():

from CMS_SURF_2016.utils.data_parse import leaves_from_obj
columns= ["PT", "Eta", "Phi", "T"]
leaves, columns = leaves_from_obj("Jet", columns)
jet_frame = ROOT_to_pandas("../data/ttbar_13TeV_80.root",
                             leaves,
                              columns=columns,
                              verbosity=1)

And with DataProcessingProcedures we can get really fancy:

import numpy as np
from CMS_SURF_2016.utils.data_parse import DataProcessingProcedure

#Define the speed of light C
C = np.float64(2.99792458e8)

# Define a function that converts [Energy, Eta, Phi] to [E/c, Px, Py, Pz],
#    treating the particle as massless (so |p| = E/c), with Eta the
#    pseudorapidity and Phi the azimuthal angle
def four_vec_func(inputs):
    E, Eta, Phi = inputs
    E_over_c = E/C
    pt = E_over_c/np.cosh(Eta)  # transverse momentum
    px = pt * np.cos(Phi)
    py = pt * np.sin(Phi)
    pz = pt * np.sinh(Eta)
    return [E_over_c, px, py, pz]
four_vec_inputs, dummy = leaves_from_obj("Photon", ["E", "Eta", "Phi"])

#Define a procedure that uses four_vec_func to make the conversion from
#    four_vec_inputs (i.e. [Photon.E, Photon.Eta, Photon.Phi]) to ["E/c", "Px", "Py", "Pz"]
four_vec_proc = DataProcessingProcedure(four_vec_func, four_vec_inputs, ["E/c", "Px", "Py", "Pz"])

# Define a function that takes in nothing [] and outputs [22], the PID of a photon
def PID_func(inputs):
    return [22]

#Define a procedure that uses PID_func to insert the photon PID=22 into the table
PID_proc = DataProcessingProcedure(PID_func, [], ["PID"])

#If the processing we need to do is simple we can also just use a lambda
#Define a procedure that just outputs 0 for the photon's charge
charge_proc = DataProcessingProcedure(lambda inputs: [0], [], ["Charge"])

#Pass in our column names and procedures. The procedures will be replaced by their outputs
columns=[four_vec_proc, PID_proc, charge_proc]
leaves, columns = leaves_from_obj("Photon", columns)

#Extract the table from the root file
photon_frame = ROOT_to_pandas("../data/ttbar_13TeV_80.root",
                              leaves,
                              columns=columns,
                              verbosity=1)

We can then save this for later in HDF5 format:

photon_frame.to_hdf("ttbar_13TeV_80.h5", 'data')
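
And read it back in later with pandas, using the same 'data' key we wrote it with:

import pandas as pd
photon_frame = pd.read_hdf("ttbar_13TeV_80.h5", 'data')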
