Copilot AI commented Sep 20, 2025

This PR introduces additional dimensionality reduction techniques beyond UMAP to handle millions of geospatial features efficiently. The implementation addresses the need for simple, scalable methods that can process the outputRaw files produced by main.py and collapse them into meaningful clusters for analysis.

New Features

Core Dimensionality Reduction Methods

  • PCA (Principal Component Analysis): Fast linear method optimal for initial data exploration and understanding variance structure
  • t-SNE (t-Distributed Stochastic Neighbor Embedding): Non-linear technique for high-quality visualizations that reveal local structure
  • Truncated SVD: Ultra-fast linear method designed for very large datasets (millions of points)
  • ICA (Independent Component Analysis): Linear method for finding independent source signals in mixed geospatial data
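The four methods above can be sketched side by side with scikit-learn (an assumption of this example; the PR's dimensionality_reduction.py may wrap them differently, and the synthetic matrix stands in for real lat/lon/speed features):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD, FastICA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))  # stand-in for a lat/lon/speed/... feature matrix

embeddings = {
    "pca": PCA(n_components=2).fit_transform(X),
    "svd": TruncatedSVD(n_components=2).fit_transform(X),
    "ica": FastICA(n_components=2, random_state=0).fit_transform(X),
    # t-SNE is the slow one; reserve it for smaller or sampled inputs
    "tsne": TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X),
}
for name, emb in embeddings.items():
    print(name, emb.shape)
```

All four return an (n_samples, 2) embedding, so they are interchangeable downstream; only their runtime and the structure they preserve differ.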

Processing Pipeline

The new dimensionality_reduction.py script provides:

  • Direct compatibility with existing outputRaw TSV.gz files from main.py
  • Flexible column selection for mixed geospatial data (lat, lon, speed, activity, etc.)
  • Built-in standardization for proper scaling across different measurement units
  • Intelligent sampling for memory-efficient processing of large datasets
  • Integrated K-means clustering for automatic grouping of results
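The pipeline steps above can be sketched as a single function; the column names (lat, lon, Speed), the sampling strategy, and the helper name are assumptions for illustration, since the real script exposes these as CLI flags:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def reduce_and_cluster(path, columns, sample=None, n_clusters=5, seed=0):
    # outputRaw files are tab-separated and gzip-compressed
    df = pd.read_csv(path, sep="\t", compression="gzip")
    if sample is not None and len(df) > sample:
        df = df.sample(n=sample, random_state=seed)  # memory-friendly subsample
    X = StandardScaler().fit_transform(df[columns])  # scale mixed units alike
    emb = PCA(n_components=2).fit_transform(X)       # fast linear reduction
    df["pca_1"], df["pca_2"] = emb[:, 0], emb[:, 1]
    df["cluster"] = KMeans(n_clusters=n_clusters, n_init=10,
                           random_state=seed).fit_predict(emb)
    return df
```

Standardizing before reduction matters here because latitude, longitude, and speed live on very different scales; without it the largest-magnitude column dominates the components.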

Visualization Tools

  • Python plotting: plot_dim_reduction.py creates comparative visualizations with categorical coloring
  • R integration: R/dim_reduction_plot.R maintains compatibility with existing R plotting workflow
  • Multi-method comparison: Side-by-side plots showing different technique results
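A multi-method comparison plot in the spirit of plot_dim_reduction.py might look like the following matplotlib sketch (the function name and arguments are assumptions, not the script's actual API):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for batch runs
import matplotlib.pyplot as plt
import numpy as np

def compare_embeddings(embeddings, labels, out_png="dim_reduction.png"):
    """embeddings: {method_name: (n, 2) array}; labels: length-n categorical array."""
    fig, axes = plt.subplots(1, len(embeddings), figsize=(5 * len(embeddings), 4))
    axes = np.atleast_1d(axes)
    for ax, (name, emb) in zip(axes, embeddings.items()):
        for lab in np.unique(labels):          # categorical coloring per class
            m = labels == lab
            ax.scatter(emb[m, 0], emb[m, 1], s=4, label=str(lab))
        ax.set_title(name)
    axes[0].legend(markerscale=3, fontsize="small")
    fig.savefig(out_png, dpi=150)
    return fig
```

Each method gets one panel, so a PCA/SVD/t-SNE run produces a single figure that makes the trade-off between linear and non-linear structure easy to eyeball.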

Performance Optimizations

The implementation is specifically optimized for geospatial tracking data:

  • Scalable processing: Handles millions of features through efficient algorithms and optional sampling
  • Memory management: Truncated SVD for large datasets, sampling options for t-SNE
  • Speed recommendations: Linear methods (PCA, SVD) for fast exploration, t-SNE for final visualization
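The trade-off above can be demonstrated directly: run Truncated SVD over every row, then hand only a random subsample to t-SNE. The matrix sizes here are illustrative, not taken from the PR:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 8))        # "large" feature matrix

# Linear method: cheap enough to embed all rows
svd_emb = TruncatedSVD(n_components=2).fit_transform(X)

# Non-linear method: cap the input size before paying t-SNE's cost
idx = rng.choice(len(X), size=2_000, replace=False)
tsne_emb = TSNE(n_components=2, random_state=0).fit_transform(X[idx])
print(svd_emb.shape, tsne_emb.shape)
```

This mirrors the script's --sample flag: the linear embedding covers the full dataset while the expensive visualization method sees only a representative slice.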

Usage Examples

Fast exploration of large datasets:

```shell
python3 dimensionality_reduction.py \
    --input output/raw.tsv.gz \
    --output output/results.tsv.gz \
    --columns lat lon Speed \
    --methods pca svd \
    --standardize --add_clusters
```

High-quality visualization:

```shell
python3 dimensionality_reduction.py \
    --input output/raw.tsv.gz \
    --output output/tsne_viz.tsv.gz \
    --columns lat lon Speed Activity \
    --methods tsne \
    --sample 10000 \
    --standardize
```

Integration with Existing Workflow

The new tools integrate seamlessly with the existing pipeline:

  1. Generate raw data: main.py --outputRaw output/raw.tsv.gz
  2. Apply dimensionality reduction: dimensionality_reduction.py
  3. Visualize results: plot_dim_reduction.py or existing R scripts
  4. Analyze further: run downstream analyses on the enhanced dataset, which now carries the cluster labels

Documentation

  • Comprehensive guide: DIMENSIONALITY_REDUCTION.md provides detailed usage instructions, performance recommendations, and method selection guidance
  • Working examples: example_usage.py demonstrates end-to-end workflows for different use cases
  • Updated README: Integration instructions and quick start guide

This implementation provides researchers with multiple dimensionality reduction options tailored to different needs: PCA/SVD for fast exploration of massive datasets, t-SNE for publication-quality visualizations, and ICA for signal separation analysis.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • esm.ubuntu.com
    • Triggering command: /usr/lib/apt/methods/https (dns block)

If you need me to access, download, or install something from one of these locations, the firewall rules will need to be adjusted.
Copilot AI changed the title [WIP] I'm looking for other simple dimensionality reduction techniques that can collapse millions of geojson features into meaninful clusters. I would like input to be the be the outputRaw file from main.py (iDf.to_csv(args.outputRaw, sep='\t', compression='... Add comprehensive dimensionality reduction techniques for geospatial data analysis Sep 20, 2025
Copilot finished work on behalf of jlanej September 20, 2025 13:59
Copilot AI requested a review from jlanej September 20, 2025 13:59