Extend Dataset Preparation for AI-Ready Format in OCELOT #16

@azadeh-gh

Description

Summary:

We need to enhance the current dataset preparation pipeline to support additional observation types and ensure all features are structured in a model-ready format for the OCELOT GNN model.

Current Functionality:

The current script:

  • Filters satellite observations by satellite ID and time range
  • Groups data into 12-hour bins
  • Extracts sensor and brightness temperature (BT) features
  • Normalizes features using MinMaxScaler
  • Adds lat/lon metadata and converts everything to PyTorch tensors
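For reference, the current pipeline described above can be sketched roughly as follows (column names such as `sat_id`, `time`, `lat`, `lon`, and the feature list are assumptions, not the script's actual names; conversion to PyTorch tensors is noted in a comment so the sketch stays dependency-light):

```python
import numpy as np
import pandas as pd

FEATURES = ["sensor_zenith", "bt_channel_1"]  # assumed feature columns

def prepare_dataset(df, sat_id, start, end, bin_hours=12):
    """Filter, time-bin, extract, and min-max normalize observations."""
    # 1. Filter by satellite ID and time range
    df = df[(df["sat_id"] == sat_id) & (df["time"] >= start) & (df["time"] < end)]
    # 2. Assign each row to a fixed-size time bin (12 h by default)
    bin_idx = ((df["time"] - start) // pd.Timedelta(hours=bin_hours)).astype(int)
    out = {}
    for b, group in df.groupby(bin_idx):
        # 3. Extract sensor / BT features
        x = group[FEATURES].to_numpy(dtype=np.float32)
        # 4. Min-max normalize each feature column (guard against zero range)
        lo, hi = x.min(axis=0), x.max(axis=0)
        x = (x - lo) / np.where(hi > lo, hi - lo, 1.0)
        # 5. Attach lat/lon metadata unnormalized; torch.from_numpy(...)
        #    would convert these arrays to tensors in the real script
        out[b] = {"x": x, "latlon": group[["lat", "lon"]].to_numpy(dtype=np.float32)}
    return out
```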

Requested Enhancements:

1. Add More Observation Types:

  • Extend extract_features() to include additional variables from the Zarr dataset (similar to GraphDOP), preprocessing continuous and categorical variables accordingly.
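One way the extended extract_features() could look (a sketch; the variable groups below are placeholder names, and the real lists would come from the Zarr dataset's variables):

```python
import numpy as np
import pandas as pd

# Assumed variable groups; the actual lists should be read from the Zarr dataset
CONTINUOUS = ["bt_channel_1", "sensor_zenith", "solar_zenith"]
CATEGORICAL = ["sensor_id", "channel_id"]

def extract_features(df):
    """Continuous variables pass through as float32; categorical ones are one-hot encoded."""
    cont = df[CONTINUOUS].to_numpy(dtype=np.float32)
    cat = pd.get_dummies(df[CATEGORICAL].astype(str)).to_numpy(dtype=np.float32)
    return np.concatenate([cont, cat], axis=1)
```

One-hot encoding is just one reasonable choice here; learned embeddings for categorical IDs would also fit a GNN input pipeline.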

2. Flexible Normalization:

  • Make the normalization method configurable (e.g., MinMaxScaler, StandardScaler, or none). Apply scaling only to continuous variables and exclude categorical or geolocation fields (lat/lon) from normalization.
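A minimal sketch of the configurable normalization (the method names mirror scikit-learn's MinMaxScaler/StandardScaler but are implemented inline so only the selected continuous columns are scaled):

```python
import numpy as np

def normalize(x, method="minmax", continuous_cols=None):
    """Scale only the given continuous columns; 'none' leaves x untouched."""
    x = x.astype(np.float32, copy=True)
    if method == "none":
        return x
    cols = np.arange(x.shape[1]) if continuous_cols is None else np.asarray(continuous_cols)
    sub = x[:, cols]
    if method == "minmax":
        lo, hi = sub.min(axis=0), sub.max(axis=0)
        sub = (sub - lo) / np.where(hi > lo, hi - lo, 1.0)
    elif method == "standard":
        mu, sd = sub.mean(axis=0), sub.std(axis=0)
        sub = (sub - mu) / np.where(sd > 0.0, sd, 1.0)
    else:
        raise ValueError(f"unknown normalization method: {method}")
    x[:, cols] = sub  # categorical and lat/lon columns outside `cols` are untouched
    return x
```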

3. Generalize Time Binning Logic:

  • Allow configurable time bin sizes (e.g., 6h, 12h, 24h) rather than hardcoding 12-hour intervals.
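The generalization could be as small as computing an integer bin index from a configurable width (a sketch, assuming a pandas-parseable datetime column):

```python
import pandas as pd

def assign_time_bins(times, start, bin_hours=12):
    """Map each timestamp to an integer bin index of width `bin_hours` hours."""
    return ((pd.to_datetime(times) - pd.Timestamp(start))
            // pd.Timedelta(hours=bin_hours)).astype(int)
```

The same call then covers 6 h, 12 h, or 24 h binning by changing `bin_hours`.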

4. Add Validation and Logging:

  • Add logging for major steps (e.g., time bin creation, feature extraction). Include data validation (e.g., NaN checks, missing value handling) for robustness.
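The validation step might look like this (a sketch; imputing NaNs with per-column means is one of several reasonable policies, and dropping affected rows would be equally valid):

```python
import logging
import numpy as np

logger = logging.getLogger("ocelot.prep")

def validate_features(x, name="features"):
    """Log NaN counts and impute NaNs with per-column means."""
    n_nan = int(np.isnan(x).sum())
    if n_nan:
        logger.warning("%s: imputing %d NaN values with column means", name, n_nan)
        col_means = np.nanmean(x, axis=0)
        rows, cols = np.where(np.isnan(x))
        x = x.copy()  # leave the caller's array untouched
        x[rows, cols] = col_means[cols]
    else:
        logger.info("%s: no NaNs found", name)
    return x
```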

Deliverables:

  • Updated and modular Python code
  • Preprocessing pipeline that supports additional variables and configurable normalization
  • Clear documentation and comments in the code

Metadata

Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)
