**New file:** `DIMENSIONALITY_REDUCTION.md` (+236 lines)

# Dimensionality Reduction for CatUmap

This document describes the additional dimensionality reduction techniques available in the CatUmap repository beyond the existing UMAP implementation.

## Overview

The new dimensionality reduction tools read the `outputRaw` files generated by `main.py` and apply the techniques below to reduce millions of geospatial points to low-dimensional embeddings suitable for clustering and 2D/3D visualization.

## Available Techniques

### 1. Principal Component Analysis (PCA)
- **Type**: Linear dimensionality reduction
- **Best for**: Fast processing, interpretable components, initial data exploration
- **Speed**: Very fast, scales well to millions of points
- **Use case**: Understanding the variance structure of your data (see the sketch below)
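
As a rough illustration, the PCA step can be reproduced directly with scikit-learn (already a dependency); the file path and column names below are just the defaults used elsewhere in this document:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standalone sketch of `--methods pca --standardize`
# (illustrative, not the script's actual code).
df = pd.read_csv("output/raw.tsv.gz", sep="\t")
X = StandardScaler().fit_transform(df[["lat", "lon", "Speed"]])

pca = PCA(n_components=2)
embedding = pca.fit_transform(X)      # (n_rows, 2) array
print(pca.explained_variance_ratio_)  # variance captured per component
```

The `explained_variance_ratio_` readout is what makes PCA handy for exploring variance structure.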

### 2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
- **Type**: Non-linear dimensionality reduction
- **Best for**: High-quality visualizations, revealing local structure
- **Speed**: Slow; sample large datasets first
- **Use case**: Creating publication-quality scatter plots (sketched below)
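
A hedged sketch of the same step with scikit-learn, sampling first to keep the run time manageable (paths and columns are illustrative):

```python
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Mirrors `--methods tsne --sample 10000 --standardize` (illustrative).
# Note: .sample() raises if the file has fewer than 10,000 rows.
df = pd.read_csv("output/raw.tsv.gz", sep="\t").sample(10_000, random_state=42)
X = StandardScaler().fit_transform(df[["lat", "lon", "Speed"]])
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
```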

### 3. Truncated Singular Value Decomposition (SVD)
- **Type**: Linear dimensionality reduction
- **Best for**: Very large datasets, fast processing
- **Speed**: Fastest method available
- **Use case**: Initial exploration of massive datasets (example below)
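
For reference, a minimal sketch with scikit-learn's `TruncatedSVD`, which computes only the requested components and is what makes this method fast on large inputs:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Mirrors `--methods svd` (illustrative paths and columns).
df = pd.read_csv("output/raw.tsv.gz", sep="\t")
svd = TruncatedSVD(n_components=2, random_state=42)
embedding = svd.fit_transform(df[["lat", "lon", "Speed"]].to_numpy())
print(svd.explained_variance_ratio_)
```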

### 4. Independent Component Analysis (ICA)
- **Type**: Linear dimensionality reduction
- **Best for**: Finding independent source signals
- **Speed**: Fast
- **Use case**: When you expect independent underlying processes (snippet below)
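
And the ICA equivalent, assuming scikit-learn's `FastICA`:

```python
import pandas as pd
from sklearn.decomposition import FastICA
from sklearn.preprocessing import StandardScaler

# Mirrors `--methods ica --standardize` (illustrative).
df = pd.read_csv("output/raw.tsv.gz", sep="\t")
X = StandardScaler().fit_transform(df[["lat", "lon", "Speed"]])
embedding = FastICA(n_components=2, random_state=42).fit_transform(X)
```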

## Quick Start

### Basic Usage

```bash
# Apply PCA and SVD to lat, lon, Speed columns
python3 dimensionality_reduction.py \
--input output/raw.tsv.gz \
--output output/pca_svd_results.tsv.gz \
--columns lat lon Speed \
--methods pca svd \
--standardize

# Apply t-SNE with sampling for large datasets
python3 dimensionality_reduction.py \
--input output/raw.tsv.gz \
--output output/tsne_results.tsv.gz \
--columns lat lon Speed Accuracy \
--methods tsne \
--sample 10000 \
--standardize
```

### Generate Visualizations

```bash
# Create plots for PCA and SVD results
python3 plot_dim_reduction.py \
--input output/pca_svd_results.tsv.gz \
--output pca_svd_plots.png \
--methods pca svd \
--color_by Activity
```

### Complete Example

```bash
# Run the comprehensive example
python3 example_usage.py
```

## Command Line Options

### dimensionality_reduction.py

**Input/Output:**
- `--input`: Input TSV.gz file (the `--outputRaw` file from `main.py`)
- `--output`: Output TSV.gz file with results

**Methods:**
- `--methods`: Choose from `pca`, `tsne`, `svd`, `ica` (can specify multiple)
- `--components`: Number of dimensions to reduce to (default: 2)

**Data Processing:**
- `--columns`: Columns to use for reduction (default: lat, lon, Speed)
- `--standardize`: Standardize columns before processing (recommended)
- `--sample`: Sample N rows for faster computation

**t-SNE Parameters:**
- `--tsne_perplexity`: Perplexity parameter (default: 30)
- `--tsne_learning_rate`: Learning rate (default: 200)
- `--tsne_max_iter`: Maximum number of iterations (default: 1000); the sketch below shows the likely scikit-learn mapping
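
These flags presumably pass straight through to scikit-learn's `TSNE`; a sketch of the likely mapping (an assumption about the script's internals, not a guarantee):

```python
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,     # --components
    perplexity=30,      # --tsne_perplexity
    learning_rate=200,  # --tsne_learning_rate
    max_iter=1000,      # --tsne_max_iter (named n_iter in scikit-learn < 1.5)
)
```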

**Clustering:**
- `--add_clusters`: Add K-means clustering results
- `--n_clusters`: Number of clusters (default: 8); a conceptual sketch follows
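
Conceptually, `--add_clusters` amounts to something like the following sketch (here K-means runs on a 2-D PCA embedding purely for illustration):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv("output/raw.tsv.gz", sep="\t")
embedding = PCA(n_components=2).fit_transform(df[["lat", "lon", "Speed"]])
# n_init="auto" needs scikit-learn >= 1.2; use n_init=10 on older versions.
km = KMeans(n_clusters=8, n_init="auto", random_state=42)
df["kmeans_cluster"] = km.fit_predict(embedding)
print(df["kmeans_cluster"].value_counts())
```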

### plot_dim_reduction.py

- `--input`: Input TSV.gz file with dimensionality reduction results
- `--output`: Output plot file (.png)
- `--methods`: Methods to plot
- `--color_by`: Column to use for coloring points
- `--figsize`: Figure size (width, height); a minimal custom-plot sketch follows
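
If you want a custom plot instead, the core of what `plot_dim_reduction.py` produces is a grouped scatter; a minimal matplotlib sketch (column names assume a PCA run colored by `Activity`):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("output/pca_svd_results.tsv.gz", sep="\t")
fig, ax = plt.subplots(figsize=(8, 6))
for name, group in df.groupby("Activity"):
    ax.scatter(group["pca_0"], group["pca_1"], s=2, alpha=0.6, label=name)
ax.legend(markerscale=4)  # enlarge legend markers for tiny points
fig.savefig("pca_activity.png", dpi=300)
```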

## Performance Recommendations

### For Large Datasets (>100K points)

1. **Start with linear methods**: Use PCA or SVD first for fast exploration
2. **Sample for t-SNE**: Use `--sample 10000` or similar for t-SNE
3. **Use standardization**: Always use `--standardize` for mixed-scale data
4. **Batch processing**: Process subsets of your data separately

### Method Selection Guide

| Dataset Size | Primary Goal | Recommended Method | Notes |
|-------------|--------------|------------------|-------|
| <10K points | Visualization | t-SNE | High quality plots |
| 10K-100K | Exploration | PCA + SVD | Fast, interpretable |
| 100K-1M | Fast clustering | SVD + clustering | Very fast |
| >1M points | Initial exploration | PCA with sampling | Use sampling |

## Input Data Format

The script expects TSV.gz files with columns including:
- `lat`, `lon`: Geographic coordinates
- `Speed`: Movement speed
- `Activity`: Activity type (optional, good for coloring)
- `Name`: Entity identifier (optional, good for coloring)
- Any other numeric columns of interest (a quick sanity check is sketched below)
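
A quick way to confirm a file matches this layout before committing to a long run (illustrative):

```python
import pandas as pd

# Read only the header and a few rows; compression is inferred from .gz.
df = pd.read_csv("output/raw.tsv.gz", sep="\t", nrows=5)
print(df.columns.tolist())
assert {"lat", "lon", "Speed"}.issubset(df.columns), "missing required columns"
```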

## Output Format

The output files contain:
- All original columns
- New columns for each method: `pca_0`, `pca_1`, `tsne_0`, `tsne_1`, etc.
- Standardized columns (if `--standardize` used): `lat_standardized`, etc.
- Clustering results (if requested): `kmeans_cluster` (see the inspection snippet below)
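
To see exactly which columns a run added, compare against the input (file names match Example 3 below):

```python
import pandas as pd

raw = pd.read_csv("output/raw.tsv.gz", sep="\t", nrows=5)
out = pd.read_csv("output/comprehensive.tsv.gz", sep="\t", nrows=5)
print(sorted(set(out.columns) - set(raw.columns)))
# e.g. ['ica_0', 'ica_1', 'kmeans_cluster', 'pca_0', 'pca_1', 'svd_0', 'svd_1']
```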

## Integration with Existing Pipeline

This tool is designed to work with the existing CatUmap pipeline:

1. **Generate raw data** with `main.py --outputRaw output/raw.tsv.gz`
2. **Apply dimensionality reduction** with `dimensionality_reduction.py`
3. **Create visualizations** with `plot_dim_reduction.py` or use the existing R scripts
4. **Further analysis** in R using the existing plotting infrastructure

## Examples

### Example 1: Fast Linear Methods
```bash
python3 dimensionality_reduction.py \
--input output/raw.tsv.gz \
--output output/linear_analysis.tsv.gz \
--columns lat lon Speed Accuracy Elevation \
--methods pca svd \
--components 3 \
--standardize \
--add_clusters --n_clusters 5
```

### Example 2: High-Quality Visualization
```bash
# Sample data for t-SNE
python3 dimensionality_reduction.py \
--input output/raw.tsv.gz \
--output output/tsne_visualization.tsv.gz \
--columns lat lon Speed \
--methods tsne \
--sample 5000 \
--standardize \
--tsne_perplexity 50 \
--tsne_learning_rate 200

# Create visualization
python3 plot_dim_reduction.py \
--input output/tsne_visualization.tsv.gz \
--output tsne_activity_plot.png \
--methods tsne \
--color_by Activity
```

### Example 3: Comprehensive Analysis
```bash
# Apply multiple methods
python3 dimensionality_reduction.py \
--input output/raw.tsv.gz \
--output output/comprehensive.tsv.gz \
--columns lat lon Speed Accuracy \
--methods pca svd ica \
--standardize \
--add_clusters --n_clusters 8

# Visualize results
python3 plot_dim_reduction.py \
--input output/comprehensive.tsv.gz \
--output comprehensive_plots.png \
--methods pca svd ica \
--color_by Name
```

## Troubleshooting

### Common Issues

1. **Memory errors with t-SNE**: Use `--sample` to reduce dataset size
2. **Poor clustering results**: Try `--standardize` and experiment with different columns
3. **Plots look crowded**: Reduce point size or increase figure size
4. **Columns not found**: Check that the names passed to `--columns` match your data

### Performance Tips

1. Use SVD for initial exploration of very large datasets
2. Apply PCA first to reduce dimensions before t-SNE (sketched after this list)
3. Experiment with different column combinations
4. Use clustering results to identify interesting subgroups
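
Tip 2 in code form, as a sketch (assumes scikit-learn; three PCA components is an arbitrary illustrative choice):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("output/raw.tsv.gz", sep="\t").sample(10_000, random_state=42)
X = StandardScaler().fit_transform(df[["lat", "lon", "Speed", "Accuracy"]])
X_pca = PCA(n_components=3).fit_transform(X)  # cheap linear pre-reduction
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_pca)
```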

## Dependencies

- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- geopandas (for compatibility with main.py)

Install with:
```bash
pip install pandas numpy scikit-learn matplotlib seaborn geopandas
```
---

**New file:** `R/dim_reduction_plot.R` (+160 lines)

#!/usr/bin/env Rscript
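#
# Plots 2-D dimensionality reduction results (pca/svd/ica/tsne columns)
# produced by dimensionality_reduction.py. Example:
#   Rscript dim_reduction_plot.R -i ../output/test_all_methods.tsv.gz \
#     -o ../output/dim_reduction_plots.png -m pca,svd -c Activity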
library(data.table)
library(ggplot2)
library(optparse)
library(RColorBrewer)
library(gridExtra)

# Parse command line arguments
option_list = list(
  make_option(
    c("-i", "--input"),
    type = "character",
    default = "../output/test_all_methods.tsv.gz",
    help = "input file with dimensionality reduction results"
  ),
  make_option(
    c("-o", "--output"),
    type = "character",
    default = "../output/dim_reduction_plots.png",
    help = "output plot file"
  ),
  make_option(
    c("-m", "--methods"),
    type = "character",
    default = "pca,svd,ica,tsne",
    help = "comma-separated list of methods to plot"
  ),
  make_option(
    c("-c", "--color_by"),
    type = "character",
    default = "Activity",
    help = "column to use for coloring points"
  ),
  make_option(
    c("-a", "--alpha"),
    type = "numeric",
    default = 0.6,
    help = "point transparency"
  ),
  make_option(
    c("-s", "--point_size"),
    type = "numeric",
    default = 0.8,
    help = "point size"
  ),
  make_option(
    c("-w", "--width"),
    type = "numeric",
    default = 16,
    help = "plot width in inches"
  ),
  make_option(
    c("-H", "--height"),  # "-h" is reserved by optparse for --help
    type = "numeric",
    default = 12,
    help = "plot height in inches"
  )
)

opt_parser = OptionParser(option_list = option_list)
opt = parse_args(opt_parser)

# Load data
cat("Loading data from:", opt$input, "\n")
df <- fread(opt$input)
cat("Loaded", nrow(df), "rows and", ncol(df), "columns\n")
cat("Columns:", paste(colnames(df), collapse = ", "), "\n")

# Parse methods
methods <- strsplit(opt$methods, ",")[[1]]
methods <- trimws(methods)

# Color palette: interpolate Spectral so any number of categories gets a
# distinct color (brewer.pal alone supports at most 11)
n_groups <- length(unique(df[[opt$color_by]]))
colors <- colorRampPalette(brewer.pal(11, "Spectral"))(max(n_groups, 3))

# Create plots for each method
plots <- list()

for (method in methods) {
  x_col <- paste0(method, "_0")
  y_col <- paste0(method, "_1")

  if (x_col %in% colnames(df) && y_col %in% colnames(df)) {
    cat("Creating plot for", method, "\n")

    # aes_string() is deprecated; .data[[...]] is the tidy-eval replacement
    p <- ggplot(df, aes(x = .data[[x_col]], y = .data[[y_col]],
                        color = .data[[opt$color_by]])) +
      geom_point(alpha = opt$alpha, size = opt$point_size) +
      scale_color_manual(values = colors) +
      theme_minimal() +
      theme(
        legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        axis.title = element_text(size = 12),
        legend.title = element_text(size = 11),
        legend.text = element_text(size = 10)
      ) +
      labs(
        title = paste(toupper(method), "Dimensionality Reduction"),
        x = paste(toupper(method), "Component 1"),
        y = paste(toupper(method), "Component 2"),
        color = opt$color_by
      ) +
      guides(color = guide_legend(override.aes = list(alpha = 1, size = 3)))

    plots[[method]] <- p
  } else {
    cat("Warning: Columns", x_col, "and/or", y_col, "not found for method", method, "\n")
  }
}

# Create combined plot
if (length(plots) > 0) {
  cat("Creating combined plot with", length(plots), "methods\n")

  # arrangeGrob() builds the grid without drawing to a device, which is
  # what ggsave() needs in a non-interactive Rscript session
  if (length(plots) == 1) {
    combined_plot <- plots[[1]]
  } else if (length(plots) <= 4) {
    combined_plot <- arrangeGrob(grobs = plots, ncol = 2)
  } else {
    combined_plot <- arrangeGrob(grobs = plots, ncol = 3)
  }

  # Save plot
  cat("Saving plot to:", opt$output, "\n")
  ggsave(opt$output, combined_plot, width = opt$width, height = opt$height, dpi = 300)

  cat("Plot saved successfully!\n")
} else {
  cat("Error: No valid methods found to plot\n")
  quit(status = 1)
}

# Print summary statistics
cat("\n=== Summary Statistics ===\n")
for (method in methods) {
  x_col <- paste0(method, "_0")
  y_col <- paste0(method, "_1")

  if (x_col %in% colnames(df) && y_col %in% colnames(df)) {
    cat(sprintf("%s - Component 1: mean=%.3f, sd=%.3f\n",
                toupper(method), mean(df[[x_col]], na.rm = TRUE), sd(df[[x_col]], na.rm = TRUE)))
    cat(sprintf("%s - Component 2: mean=%.3f, sd=%.3f\n",
                toupper(method), mean(df[[y_col]], na.rm = TRUE), sd(df[[y_col]], na.rm = TRUE)))
  }
}

# If clustering results are available, show cluster summary
if ("kmeans_cluster" %in% colnames(df)) {
  cat("\n=== K-means Clustering Summary ===\n")
  cluster_counts <- table(df$kmeans_cluster)
  for (i in names(cluster_counts)) {
    cat(sprintf("Cluster %s: %d points (%.1f%%)\n",
                i, cluster_counts[i], 100 * cluster_counts[i] / nrow(df)))
  }
}

cat("\nDone!\n")