**New file:** `DIMENSIONALITY_REDUCTION.md` (+236 lines)

# Dimensionality Reduction for CatUmap

This document describes the additional dimensionality reduction techniques available in the CatUmap repository beyond the existing UMAP implementation.

## Overview

The new dimensionality reduction tools read the `outputRaw` files generated by `main.py` and apply the techniques below to reduce millions of geospatial points to low-dimensional embeddings suitable for clustering and 2D/3D visualization.

## Available Techniques

### 1. Principal Component Analysis (PCA)
- **Type**: Linear dimensionality reduction
- **Best for**: Fast processing, interpretable components, initial data exploration
- **Speed**: Very fast, scales well to millions of points
- **Use case**: Understanding the variance structure of your data (see the sketch below)
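
As a rough illustration, the PCA step can be reproduced directly with scikit-learn (already a dependency); the file path and column names below are just the defaults used elsewhere in this document:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standalone sketch of `--methods pca --standardize`
# (illustrative, not the script's actual code).
df = pd.read_csv("output/raw.tsv.gz", sep="\t")
X = StandardScaler().fit_transform(df[["lat", "lon", "Speed"]])

pca = PCA(n_components=2)
embedding = pca.fit_transform(X)      # (n_rows, 2) array
print(pca.explained_variance_ratio_)  # variance captured per component
```

The `explained_variance_ratio_` readout is what makes PCA handy for exploring variance structure.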

### 2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
- **Type**: Non-linear dimensionality reduction
- **Best for**: High-quality visualizations, revealing local structure
- **Speed**: Slow; sample large datasets first
- **Use case**: Creating publication-quality scatter plots (sketched below)
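
A hedged sketch of the same step with scikit-learn, sampling first to keep the run time manageable (paths and columns are illustrative):

```python
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Mirrors `--methods tsne --sample 10000 --standardize` (illustrative).
# Note: .sample() raises if the file has fewer than 10,000 rows.
df = pd.read_csv("output/raw.tsv.gz", sep="\t").sample(10_000, random_state=42)
X = StandardScaler().fit_transform(df[["lat", "lon", "Speed"]])
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
```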

### 3. Truncated Singular Value Decomposition (SVD)
- **Type**: Linear dimensionality reduction
- **Best for**: Very large datasets, fast processing
- **Speed**: Fastest method available
- **Use case**: Initial exploration of massive datasets (example below)
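
For reference, a minimal sketch with scikit-learn's `TruncatedSVD`, which computes only the requested components and is what makes this method fast on large inputs:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Mirrors `--methods svd` (illustrative paths and columns).
df = pd.read_csv("output/raw.tsv.gz", sep="\t")
svd = TruncatedSVD(n_components=2, random_state=42)
embedding = svd.fit_transform(df[["lat", "lon", "Speed"]].to_numpy())
print(svd.explained_variance_ratio_)
```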

### 4. Independent Component Analysis (ICA)
- **Type**: Linear dimensionality reduction
- **Best for**: Finding independent source signals
- **Speed**: Fast
- **Use case**: When you expect independent underlying processes (snippet below)
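
And the ICA equivalent, assuming scikit-learn's `FastICA`:

```python
import pandas as pd
from sklearn.decomposition import FastICA
from sklearn.preprocessing import StandardScaler

# Mirrors `--methods ica --standardize` (illustrative).
df = pd.read_csv("output/raw.tsv.gz", sep="\t")
X = StandardScaler().fit_transform(df[["lat", "lon", "Speed"]])
embedding = FastICA(n_components=2, random_state=42).fit_transform(X)
```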

## Quick Start

### Basic Usage

```bash
# Apply PCA and SVD to lat, lon, Speed columns
python3 dimensionality_reduction.py \
--input output/raw.tsv.gz \
--output output/pca_svd_results.tsv.gz \
--columns lat lon Speed \
--methods pca svd \
--standardize

# Apply t-SNE with sampling for large datasets
python3 dimensionality_reduction.py \
--input output/raw.tsv.gz \
--output output/tsne_results.tsv.gz \
--columns lat lon Speed Accuracy \
--methods tsne \
--sample 10000 \
--standardize
```

### Generate Visualizations

```bash
# Create plots for PCA and SVD results
python3 plot_dim_reduction.py \
--input output/pca_svd_results.tsv.gz \
--output pca_svd_plots.png \
--methods pca svd \
--color_by Activity
```

### Complete Example

```bash
# Run the comprehensive example
python3 example_usage.py
```

## Command Line Options

### dimensionality_reduction.py

**Input/Output:**
- `--input`: Input TSV.gz file (the `--outputRaw` file from `main.py`)
- `--output`: Output TSV.gz file with results

**Methods:**
- `--methods`: Choose from `pca`, `tsne`, `svd`, `ica` (can specify multiple)
- `--components`: Number of dimensions to reduce to (default: 2)

**Data Processing:**
- `--columns`: Columns to use for reduction (default: lat, lon, Speed)
- `--standardize`: Standardize columns before processing (recommended)
- `--sample`: Sample N rows for faster computation

**t-SNE Parameters:**
- `--tsne_perplexity`: Perplexity parameter (default: 30)
- `--tsne_learning_rate`: Learning rate (default: 200)
- `--tsne_max_iter`: Maximum number of iterations (default: 1000); the sketch below shows the likely scikit-learn mapping
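
These flags presumably pass straight through to scikit-learn's `TSNE`; a sketch of the likely mapping (an assumption about the script's internals, not a guarantee):

```python
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,     # --components
    perplexity=30,      # --tsne_perplexity
    learning_rate=200,  # --tsne_learning_rate
    max_iter=1000,      # --tsne_max_iter (named n_iter in scikit-learn < 1.5)
)
```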

**Clustering:**
- `--add_clusters`: Add K-means clustering results
- `--n_clusters`: Number of clusters (default: 8); a conceptual sketch follows
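
Conceptually, `--add_clusters` amounts to something like the following sketch (here K-means runs on a 2-D PCA embedding purely for illustration):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv("output/raw.tsv.gz", sep="\t")
embedding = PCA(n_components=2).fit_transform(df[["lat", "lon", "Speed"]])
# n_init="auto" needs scikit-learn >= 1.2; use n_init=10 on older versions.
km = KMeans(n_clusters=8, n_init="auto", random_state=42)
df["kmeans_cluster"] = km.fit_predict(embedding)
print(df["kmeans_cluster"].value_counts())
```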

### plot_dim_reduction.py

- `--input`: Input TSV.gz file with dimensionality reduction results
- `--output`: Output plot file (.png)
- `--methods`: Methods to plot
- `--color_by`: Column to use for coloring points
- `--figsize`: Figure size (width, height); a minimal custom-plot sketch follows
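
If you want a custom plot instead, the core of what `plot_dim_reduction.py` produces is a grouped scatter; a minimal matplotlib sketch (column names assume a PCA run colored by `Activity`):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("output/pca_svd_results.tsv.gz", sep="\t")
fig, ax = plt.subplots(figsize=(8, 6))
for name, group in df.groupby("Activity"):
    ax.scatter(group["pca_0"], group["pca_1"], s=2, alpha=0.6, label=name)
ax.legend(markerscale=4)  # enlarge legend markers for tiny points
fig.savefig("pca_activity.png", dpi=300)
```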

## Performance Recommendations

### For Large Datasets (>100K points)

1. **Start with linear methods**: Use PCA or SVD first for fast exploration
2. **Sample for t-SNE**: Use `--sample 10000` or similar for t-SNE
3. **Use standardization**: Always use `--standardize` for mixed-scale data
4. **Batch processing**: Process subsets of your data separately

### Method Selection Guide

| Dataset Size | Primary Goal | Recommended Method | Notes |
|-------------|--------------|------------------|-------|
| <10K points | Visualization | t-SNE | High quality plots |
| 10K-100K | Exploration | PCA + SVD | Fast, interpretable |
| 100K-1M | Fast clustering | SVD + clustering | Very fast |
| >1M points | Initial exploration | PCA with sampling | Use sampling |

## Input Data Format

The script expects TSV.gz files with columns including:
- `lat`, `lon`: Geographic coordinates
- `Speed`: Movement speed
- `Activity`: Activity type (optional, good for coloring)
- `Name`: Entity identifier (optional, good for coloring)
- Any other numeric columns of interest (a quick sanity check is sketched below)
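
A quick way to confirm a file matches this layout before committing to a long run (illustrative):

```python
import pandas as pd

# Read only the header and a few rows; compression is inferred from .gz.
df = pd.read_csv("output/raw.tsv.gz", sep="\t", nrows=5)
print(df.columns.tolist())
assert {"lat", "lon", "Speed"}.issubset(df.columns), "missing required columns"
```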

## Output Format

The output files contain:
- All original columns
- New columns for each method: `pca_0`, `pca_1`, `tsne_0`, `tsne_1`, etc.
- Standardized columns (if `--standardize` used): `lat_standardized`, etc.
- Clustering results (if requested): `kmeans_cluster` (see the inspection snippet below)
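
To see exactly which columns a run added, compare against the input (file names match Example 3 below):

```python
import pandas as pd

raw = pd.read_csv("output/raw.tsv.gz", sep="\t", nrows=5)
out = pd.read_csv("output/comprehensive.tsv.gz", sep="\t", nrows=5)
print(sorted(set(out.columns) - set(raw.columns)))
# e.g. ['ica_0', 'ica_1', 'kmeans_cluster', 'pca_0', 'pca_1', 'svd_0', 'svd_1']
```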

## Integration with Existing Pipeline

This tool is designed to work with the existing CatUmap pipeline:

1. **Generate raw data** with `main.py --outputRaw output/raw.tsv.gz`
2. **Apply dimensionality reduction** with `dimensionality_reduction.py`
3. **Create visualizations** with `plot_dim_reduction.py` or use the existing R scripts
4. **Further analysis** in R using the existing plotting infrastructure

## Examples

### Example 1: Fast Linear Methods
```bash
python3 dimensionality_reduction.py \
--input output/raw.tsv.gz \
--output output/linear_analysis.tsv.gz \
--columns lat lon Speed Accuracy Elevation \
--methods pca svd \
--components 3 \
--standardize \
--add_clusters --n_clusters 5
```

### Example 2: High-Quality Visualization
```bash
# Sample data for t-SNE
python3 dimensionality_reduction.py \
--input output/raw.tsv.gz \
--output output/tsne_visualization.tsv.gz \
--columns lat lon Speed \
--methods tsne \
--sample 5000 \
--standardize \
--tsne_perplexity 50 \
--tsne_learning_rate 200

# Create visualization
python3 plot_dim_reduction.py \
--input output/tsne_visualization.tsv.gz \
--output tsne_activity_plot.png \
--methods tsne \
--color_by Activity
```

### Example 3: Comprehensive Analysis
```bash
# Apply multiple methods
python3 dimensionality_reduction.py \
--input output/raw.tsv.gz \
--output output/comprehensive.tsv.gz \
--columns lat lon Speed Accuracy \
--methods pca svd ica \
--standardize \
--add_clusters --n_clusters 8

# Visualize results
python3 plot_dim_reduction.py \
--input output/comprehensive.tsv.gz \
--output comprehensive_plots.png \
--methods pca svd ica \
--color_by Name
```

## Troubleshooting

### Common Issues

1. **Memory errors with t-SNE**: Use `--sample` to reduce dataset size
2. **Poor clustering results**: Try `--standardize` and experiment with different columns
3. **Plots look crowded**: Reduce point size or increase figure size
4. **Columns not found**: Check that the names passed to `--columns` match your data

### Performance Tips

1. Use SVD for initial exploration of very large datasets
2. Apply PCA first to reduce dimensions before t-SNE (sketched after this list)
3. Experiment with different column combinations
4. Use clustering results to identify interesting subgroups
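
Tip 2 in code form, as a sketch (assumes scikit-learn; three PCA components is an arbitrary illustrative choice):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("output/raw.tsv.gz", sep="\t").sample(10_000, random_state=42)
X = StandardScaler().fit_transform(df[["lat", "lon", "Speed", "Accuracy"]])
X_pca = PCA(n_components=3).fit_transform(X)  # cheap linear pre-reduction
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_pca)
```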

## Dependencies

- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- geopandas (for compatibility with main.py)

Install with:
```bash
pip install pandas numpy scikit-learn matplotlib seaborn geopandas
```
---

**New file:** `R/dim_reduction_plot.R` (+160 lines)

#!/usr/bin/env Rscript
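#
# Plots 2-D dimensionality reduction results (pca/svd/ica/tsne columns)
# produced by dimensionality_reduction.py. Example:
#   Rscript dim_reduction_plot.R -i ../output/test_all_methods.tsv.gz \
#     -o ../output/dim_reduction_plots.png -m pca,svd -c Activity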
library(data.table)
library(ggplot2)
library(optparse)
library(RColorBrewer)
library(gridExtra)

# Parse command line arguments
option_list = list(
  make_option(
    c("-i", "--input"),
    type = "character",
    default = "../output/test_all_methods.tsv.gz",
    help = "input file with dimensionality reduction results"
  ),
  make_option(
    c("-o", "--output"),
    type = "character",
    default = "../output/dim_reduction_plots.png",
    help = "output plot file"
  ),
  make_option(
    c("-m", "--methods"),
    type = "character",
    default = "pca,svd,ica,tsne",
    help = "comma-separated list of methods to plot"
  ),
  make_option(
    c("-c", "--color_by"),
    type = "character",
    default = "Activity",
    help = "column to use for coloring points"
  ),
  make_option(
    c("-a", "--alpha"),
    type = "numeric",
    default = 0.6,
    help = "point transparency"
  ),
  make_option(
    c("-s", "--point_size"),
    type = "numeric",
    default = 0.8,
    help = "point size"
  ),
  make_option(
    c("-w", "--width"),
    type = "numeric",
    default = 16,
    help = "plot width in inches"
  ),
  make_option(
    c("-H", "--height"),  # "-h" is reserved by optparse for --help
    type = "numeric",
    default = 12,
    help = "plot height in inches"
  )
)

opt_parser = OptionParser(option_list = option_list)
opt = parse_args(opt_parser)

# Load data
cat("Loading data from:", opt$input, "\n")
df <- fread(opt$input)
cat("Loaded", nrow(df), "rows and", ncol(df), "columns\n")
cat("Columns:", paste(colnames(df), collapse = ", "), "\n")

# Parse methods
methods <- strsplit(opt$methods, ",")[[1]]
methods <- trimws(methods)

# Color palette: interpolate Spectral so any number of categories gets a
# distinct color (brewer.pal alone supports at most 11)
n_groups <- length(unique(df[[opt$color_by]]))
colors <- colorRampPalette(brewer.pal(11, "Spectral"))(max(n_groups, 3))

# Create plots for each method
plots <- list()

for (method in methods) {
  x_col <- paste0(method, "_0")
  y_col <- paste0(method, "_1")

  if (x_col %in% colnames(df) && y_col %in% colnames(df)) {
    cat("Creating plot for", method, "\n")

    # aes_string() is deprecated; .data[[...]] is the tidy-eval replacement
    p <- ggplot(df, aes(x = .data[[x_col]], y = .data[[y_col]],
                        color = .data[[opt$color_by]])) +
      geom_point(alpha = opt$alpha, size = opt$point_size) +
      scale_color_manual(values = colors) +
      theme_minimal() +
      theme(
        legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        axis.title = element_text(size = 12),
        legend.title = element_text(size = 11),
        legend.text = element_text(size = 10)
      ) +
      labs(
        title = paste(toupper(method), "Dimensionality Reduction"),
        x = paste(toupper(method), "Component 1"),
        y = paste(toupper(method), "Component 2"),
        color = opt$color_by
      ) +
      guides(color = guide_legend(override.aes = list(alpha = 1, size = 3)))

    plots[[method]] <- p
  } else {
    cat("Warning: Columns", x_col, "and/or", y_col, "not found for method", method, "\n")
  }
}

# Create combined plot
if (length(plots) > 0) {
  cat("Creating combined plot with", length(plots), "methods\n")

  # arrangeGrob() builds the grid without drawing to a device, which is
  # what ggsave() needs in a non-interactive Rscript session
  if (length(plots) == 1) {
    combined_plot <- plots[[1]]
  } else if (length(plots) <= 4) {
    combined_plot <- arrangeGrob(grobs = plots, ncol = 2)
  } else {
    combined_plot <- arrangeGrob(grobs = plots, ncol = 3)
  }

  # Save plot
  cat("Saving plot to:", opt$output, "\n")
  ggsave(opt$output, combined_plot, width = opt$width, height = opt$height, dpi = 300)

  cat("Plot saved successfully!\n")
} else {
  cat("Error: No valid methods found to plot\n")
  quit(status = 1)
}

# Print summary statistics
cat("\n=== Summary Statistics ===\n")
for (method in methods) {
  x_col <- paste0(method, "_0")
  y_col <- paste0(method, "_1")

  if (x_col %in% colnames(df) && y_col %in% colnames(df)) {
    cat(sprintf("%s - Component 1: mean=%.3f, sd=%.3f\n",
                toupper(method), mean(df[[x_col]], na.rm = TRUE), sd(df[[x_col]], na.rm = TRUE)))
    cat(sprintf("%s - Component 2: mean=%.3f, sd=%.3f\n",
                toupper(method), mean(df[[y_col]], na.rm = TRUE), sd(df[[y_col]], na.rm = TRUE)))
  }
}

# If clustering results are available, show cluster summary
if ("kmeans_cluster" %in% colnames(df)) {
  cat("\n=== K-means Clustering Summary ===\n")
  cluster_counts <- table(df$kmeans_cluster)
  for (i in names(cluster_counts)) {
    cat(sprintf("Cluster %s: %d points (%.1f%%)\n",
                i, cluster_counts[i], 100 * cluster_counts[i] / nrow(df)))
  }
}

cat("\nDone!\n")