Open
Labels: documentation, enhancement
Description
Summary:
We need to enhance the current dataset preparation pipeline to support additional observation types and ensure all features are structured in a model-ready format for the OCELOT GNN model.
Current Functionality:
The current script:
- Filters satellite observations by satellite ID and time range
- Groups data into 12-hour bins
- Extracts sensor and brightness temperature (BT) features
- Normalizes features using MinMaxScaler
- Adds lat/lon metadata and converts everything to PyTorch tensors
Requested Enhancements:
1. Add More Observation Types:
- Extend `extract_features()` to include additional variables from the Zarr dataset (similar to GraphDOP). Preprocess continuous and categorical variables accordingly.
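One possible shape for the extended function is sketched below. It is only an illustration: the variable names (`bt_channel_1`, `zenith_angle`, `surface_type`), the category list, and the choice of one-hot encoding are assumptions, not the actual Zarr schema or the existing `extract_features()` signature. The input is any mapping from variable name to a 1-D array, such as arrays already read from a Zarr group.

```python
import numpy as np


def extract_features(obs, continuous_vars, categorical_vars):
    """Split observations into continuous and one-hot categorical features.

    obs: mapping from variable name to a 1-D array of equal length.
    continuous_vars: list of variable names treated as continuous.
    categorical_vars: dict mapping variable name -> list of known categories.
    Returns two float32 arrays of shape (n_obs, n_features).
    """
    continuous = np.stack(
        [np.asarray(obs[name], dtype=np.float32) for name in continuous_vars],
        axis=1,
    )

    one_hot_blocks = []
    for name, categories in categorical_vars.items():
        values = np.asarray(obs[name])
        # Compare every value against every known category to build one-hot columns.
        block = values[:, None] == np.asarray(categories)[None, :]
        one_hot_blocks.append(block.astype(np.float32))

    if one_hot_blocks:
        categorical = np.concatenate(one_hot_blocks, axis=1)
    else:
        categorical = np.empty((continuous.shape[0], 0), dtype=np.float32)
    return continuous, categorical


# Hypothetical example data; the real variable names may differ.
obs = {
    "bt_channel_1": [250.1, 260.4, 270.9],
    "zenith_angle": [10.0, 35.5, 52.0],
    "surface_type": [0, 2, 1],
}
continuous, categorical = extract_features(
    obs,
    continuous_vars=["bt_channel_1", "zenith_angle"],
    categorical_vars={"surface_type": [0, 1, 2]},
)
```

Keeping the two groups in separate arrays makes it straightforward to scale only the continuous block later and concatenate afterwards.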
2. Flexible Normalization:
- Make the normalization method configurable (e.g., MinMaxScaler, StandardScaler, or none). Apply scaling only to continuous variables and exclude categorical or geolocation fields (lat/lon) from normalization.
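A minimal sketch of a configurable scaling step follows. To keep it dependency-free it reimplements min-max and standard scaling in NumPy rather than calling scikit-learn; `MinMaxScaler` and `StandardScaler` could be swapped in behind the same switch. The function name and the `method` strings are assumptions for illustration.

```python
import numpy as np


def normalize_continuous(features, method="minmax"):
    """Scale the columns of a 2-D array.

    method: 'minmax' (map each column to [0, 1]), 'standard'
    (zero mean, unit variance), or 'none' (return an unscaled copy).
    """
    features = np.asarray(features, dtype=np.float64)
    if method == "none":
        return features.copy()
    if method == "minmax":
        offset = features.min(axis=0)
        scale = features.max(axis=0) - offset
    elif method == "standard":
        offset = features.mean(axis=0)
        scale = features.std(axis=0)
    else:
        raise ValueError(f"Unknown normalization method: {method!r}")
    # Constant columns have zero spread; use 1 to avoid division by zero,
    # which maps every value in such a column to 0.
    scale = np.where(scale == 0, 1.0, scale)
    return (features - offset) / scale


continuous = np.array([[0.0, 10.0], [5.0, 10.0], [10.0, 10.0]])
scaled = normalize_continuous(continuous, method="minmax")
```

Only the continuous block is passed through this function; categorical columns and lat/lon would be concatenated back afterwards, unscaled.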
3. Generalize Time Binning Logic:
- Allow configurable time bin sizes (e.g., 6h, 12h, 24h) rather than hardcoding 12-hour intervals.
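The binning could be parameterized roughly as below. This is a sketch under assumptions: the function name, the `bin_hours` parameter, and the use of NumPy `datetime64` timestamps are illustrative and may not match how the current script stores time.

```python
import numpy as np


def assign_time_bins(times, start, bin_hours=12):
    """Return the integer bin index of each timestamp, counted from `start`.

    times: array-like of timestamps (datetime64 or ISO 8601 strings).
    start: left edge of bin 0.
    bin_hours: bin width in whole hours, e.g. 6, 12, or 24.
    """
    if bin_hours <= 0:
        raise ValueError("bin_hours must be positive")
    times = np.asarray(times, dtype="datetime64[s]")
    start = np.datetime64(start, "s")
    # Elapsed whole seconds since the start of the first bin.
    elapsed = (times - start).astype("timedelta64[s]").astype(np.int64)
    return elapsed // (int(bin_hours) * 3600)


times = [
    "2024-01-01T00:00",
    "2024-01-01T05:59:59",
    "2024-01-01T06:00",
    "2024-01-01T13:00",
    "2024-01-02T01:00",
]
bins_6h = assign_time_bins(times, "2024-01-01", bin_hours=6)
bins_12h = assign_time_bins(times, "2024-01-01", bin_hours=12)
```

Observations can then be grouped by bin index, so changing the interval only requires changing `bin_hours`.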
4. Add Validation and Logging:
- Add logging for major steps (e.g., time bin creation, feature extraction). Include data validation (e.g., NaN checks, missing value handling) for robustness.
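A validation step with logging might look like the following sketch. The logger name, the function name, and the drop-or-fill policy are assumptions; the issue does not prescribe how missing values should be handled.

```python
import logging

import numpy as np

logger = logging.getLogger("dataset_prep")  # hypothetical logger name


def validate_features(features, name="features", fill_value=None):
    """Check a 2-D feature array for NaNs and log what was found.

    Rows containing NaN are dropped unless `fill_value` is given, in which
    case NaNs are replaced by it. Returns (clean_array, keep_mask) so that
    metadata such as lat/lon can be filtered with the same mask.
    """
    features = np.asarray(features, dtype=np.float64)
    nan_mask = np.isnan(features)
    bad_rows = nan_mask.any(axis=1)
    n_bad = int(bad_rows.sum())
    logger.info("%s: %d rows, %d containing NaN", name, features.shape[0], n_bad)

    if n_bad == 0:
        return features, np.ones(features.shape[0], dtype=bool)
    if fill_value is None:
        logger.warning("%s: dropping %d rows containing NaN", name, n_bad)
        return features[~bad_rows], ~bad_rows
    logger.warning("%s: filling NaN in %d rows with %s", name, n_bad, fill_value)
    filled = np.where(nan_mask, fill_value, features)
    return filled, np.ones(features.shape[0], dtype=bool)


raw = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
clean, keep = validate_features(raw, name="bt_features")
```

Calling `logging.basicConfig(level=logging.INFO)` once at the start of the script makes these messages visible on the console.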
Deliverables:
- Updated and modular Python code
- Preprocessing pipeline that supports additional variables and configurable normalization
- Clear documentation and comments in the code