GoogleCloudPlatform
diff --git a/projects/hadoop-to-lakehouse-migration-demo/docs/.pages b/projects/hadoop-to-lakehouse-migration-demo/docs/.pages
Lines changed: 5 additions & 0 deletions
diff --git a/projects/hadoop-to-lakehouse-migration-demo/docs/README.md b/projects/hadoop-to-lakehouse-migration-demo/docs/README.md
Lines changed: 55 additions & 0 deletions
diff --git a/projects/hadoop-to-lakehouse-migration-demo/docs/reference_architecture.md b/projects/hadoop-to-lakehouse-migration-demo/docs/reference_architecture.md
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,5 @@
+title: "Hadoop-to-Google Cloud Data Platform Journey"
+nav:
+  - "Overview": "README.md"
+  - "Reference Architecture": "reference_architecture.md"
+  - "User Journey": "user_journey.md"
@@ -0,0 +1,55 @@
+# Hadoop to Google Cloud Data Lakehouse Migration Demo
+
+## Overview
+
+This project demonstrates a complete, end-to-end migration of a legacy Hadoop
+data platform to a modern Data Lakehouse on Google Cloud Platform. It simulates
+a source environment with a Managed Spark cluster running Hive and HDFS, and
+provides all the tools and automation to migrate data and workloads to a modern
+target environment leveraging Cloud Storage, Managed Spark Metastore, Managed
+Spark Serverless, Apache Iceberg, and BigQuery.
+
+## Intent
+
+The intent of this demo is to showcase:
+
+1.  **Simulated Legacy Environment**: A realistic starting point with a
+    non-cloud-integrated Hadoop cluster.
+1.  **Assessment**: Using metadata extraction tools to understand the source
+    schema.
+1.  **Cloud-Native Transfer**: Utilizing Storage Transfer Service for efficient
+    data movement.
+1.  **Modern Lakehouse Format**: Converting data to Apache Iceberg for
+    transactional capabilities and performance.
+1.  **SQL Translation**: Modernizing queries from HiveQL to GoogleSQL.
+
+## Target Audience
+
+- **Data Engineers** looking for practical examples of Hadoop-to-Google Cloud
+  migrations.
+- **Solution Architects** designing modern data platform architectures on Google
+  Cloud.
+- **Decision Makers** who want to see the value of modernizing their legacy data
+  lakes.
+
+## Getting Started
+
+To get started with the demo, please follow the step-by-step guide in the
+[User Journey](user_journey.md).
+
+!!! important "Prerequisites"
+    Before you begin, make sure to review the prerequisites section in the
+    [User Journey](user_journey.md). You will need **two** Google Cloud projects
+    with billing enabled and appropriate permissions.
+
+## Reference Architecture
+
+You can view the reference architecture diagram and component descriptions in
+[reference_architecture.md](reference_architecture.md).
+
+## Project Structure
+
+- `terraform/`: Infrastructure as Code for source and target environments.
+- `scripts/`: Automation scripts for loading data, running jobs, and
+  orchestration.
+- `docs/`: Comprehensive documentation including user journey and reference
+  architecture.
@@ -0,0 +1,123 @@
+# Reference Architecture: Hadoop to Google Cloud Data Lakehouse Migration
+
+This document presents the reference architecture for migrating from a legacy
+Hadoop environment to a modern Data Lakehouse on Google Cloud, as implemented in
+this demo.
+
+## Architecture Diagram
+
+```mermaid
+graph LR
+    subgraph source_env ["Source Environment"]
+        hive[(Hive)]
+        onprem_spark[Spark]
+        hdfs[(HDFS)]
+    end
+
+    subgraph migration_tools ["Migration Tools"]
+        bts[BigQuery Translation Service]
+        ma[Migration Assessment]
+        md[dwh-migration-dumper]
+        sts[Storage Transfer Service]
+    end
+
+    subgraph target_env ["Target Environment"]
+        bz[(Bronze Zone / BQ + Cloud Storage)]
+        bronze_spark["Managed Spark"]
+        sz[(Silver Zone / BQ + Cloud Storage)]
+        silver_spark["Managed Spark"]
+        gz[(Gold Zone / BQ + Cloud Storage)]
+        lakehouse[(Lakehouse Runtime Catalog)]
+        bigquery[(BigQuery)]
+        managed_spark[Managed Spark]
+    end
+
+    subgraph serving ["Serving Layer"]
+        bi[Analytics & BI]
+        ml["Machine Learning & Notebooks"]
+        agents[Agents]
+        operational[Operational Data]
+    end
+
+    %% Flow
+    hive --> bts
+    hive --> md
+    hive --> ma
+    hdfs --> ma
+    hdfs --> sts
+    onprem_spark --> ma
+
+    bz <--> lakehouse
+    sz <--> lakehouse
+    gz <--> lakehouse
+
+    sts --> bz
+    bz --> bronze_spark
+    bronze_spark --> sz
+    sz --> silver_spark
+    silver_spark --> gz
+
+    gz --> managed_spark
+    gz --> bigquery
+
+    bigquery --> bi
+    bigquery --> operational
+
+    managed_spark --> ml
+    managed_spark --> agents
+```
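
The data flow in the diagram can also be checked programmatically. The sketch below transcribes the one-way arrows into a Python dictionary and uses a breadth-first search to list the hops between two nodes. The node ids come from the diagram. The `EDGES` name and the `find_path` helper are illustrative and not part of the demo. Leaving out the two-way catalog links is an assumption that they represent metadata registration rather than data movement.

```python
from collections import deque

# One-way arrows transcribed from the Mermaid diagram.
# Assumption: the bz/sz/gz <--> lakehouse links are catalog registration,
# not data movement, so they are omitted here.
EDGES = {
    "hive": ["bts", "md", "ma"],
    "hdfs": ["ma", "sts"],
    "onprem_spark": ["ma"],
    "sts": ["bz"],
    "bz": ["bronze_spark"],
    "bronze_spark": ["sz"],
    "sz": ["silver_spark"],
    "silver_spark": ["gz"],
    "gz": ["managed_spark", "bigquery"],
    "bigquery": ["bi", "operational"],
    "managed_spark": ["ml", "agents"],
}


def find_path(start, goal):
    """Breadth-first search: return the shortest node path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(find_path("hdfs", "bi"))
print(find_path("hive", "bi"))
```

The first call lists every hop from HDFS to the BI tools. The second returns `None`: in the diagram as drawn, the Hive-side tools (translation, dumper, assessment) have no outgoing arrows into the target environment, so only the HDFS data path reaches the serving layer.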
+
+## Component Descriptions
+
+### Source Environment
+
+- **Hadoop Cluster**: Simulated by a standalone cluster of "Managed Service for
+  Apache Spark" (formerly Dataproc).
+    - **HDFS**: Distributed file system storing the raw data.
+    - **Hive Metastore**: Manages metadata for Hive tables.
+
+### Migration Tools
+
+- **migration-assessment-tool**: Assesses usage in Hadoop and predicts usage on
+  Google Cloud services.
+- **dwh-migration-dumper**: Extracts DDL and metadata from the source Hive
+  system for assessment.
+- **Storage Transfer Service (STS)**: Cloud-native service used to transfer data
+  from HDFS to Google Cloud Storage.
+- **BigQuery Translation Service**: Translates legacy HiveQL queries to modern
+  GoogleSQL for BigQuery.
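
To give a feel for one small part of HiveQL-to-GoogleSQL translation, here is a minimal sketch that maps a few Hive primitive column types to their GoogleSQL counterparts. It is a hand-written illustration and does not call or reproduce the BigQuery Translation Service. The `HIVE_TO_GOOGLESQL` table and the `translate_columns` helper are hypothetical names, and the mapping covers only a handful of simple types.

```python
# Assumed mapping for a few common Hive primitive types.
# Illustrative only: real translation also covers functions, syntax
# differences, and complex types, none of which are handled here.
HIVE_TO_GOOGLESQL = {
    "TINYINT": "INT64",
    "SMALLINT": "INT64",
    "INT": "INT64",
    "BIGINT": "INT64",
    "FLOAT": "FLOAT64",
    "DOUBLE": "FLOAT64",
    "BOOLEAN": "BOOL",
    "STRING": "STRING",
    "BINARY": "BYTES",
    "DATE": "DATE",
}


def translate_columns(columns):
    """Map (name, hive_type) pairs to (name, googlesql_type) pairs.

    Raises ValueError for any type missing from the table, so gaps are
    surfaced instead of silently passed through.
    """
    translated = []
    for name, hive_type in columns:
        key = hive_type.strip().upper()
        if key not in HIVE_TO_GOOGLESQL:
            raise ValueError(f"no mapping for Hive type {hive_type!r}")
        translated.append((name, HIVE_TO_GOOGLESQL[key]))
    return translated


print(translate_columns([("id", "bigint"), ("price", "double"), ("active", "boolean")]))
```

Failing loudly on unmapped types is a deliberate choice for a sketch like this: anything the table does not cover should go through the actual translation tooling rather than be guessed.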
+
+### Target Environment
+
+The target environment is divided into the classic three-tier "Medallion"
+architecture:
+
+- **Bronze Zone**: Used for storing raw data, as is.
+- **Silver Zone**: Used for storing processed data, cleaned and verified.
+- **Gold Zone**: Used for storing enriched data, ready for analytics and
+  serving.
+
+Each zone is composed of:
+
+- **Cloud Storage**: Serves as the storage layer, holding everything from CSVs
+  and binary files to Parquet files organized in the Apache Iceberg table format.
+- **Lakehouse Runtime Iceberg Catalog**: Managed service for managing datasets
+  in the Apache Iceberg format.
+- **Managed Service for Apache Spark Serverless**: Runs Spark jobs to convert
+  and process data from one zone to the next.
+- **BigQuery**: Enables querying Iceberg tables directly in Cloud Storage with
+  BigQuery performance and security.
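
The zone-to-zone processing can be pictured with a small, framework-free sketch. In the demo this work is done by Spark jobs over Iceberg tables. The plain-Python version below only illustrates what "raw", "cleaned and verified", and "enriched" mean in practice. The sample rows, field names, and the `to_silver` / `to_gold` helpers are all hypothetical.

```python
# Hypothetical raw rows as they might land in the Bronze Zone: strings, as-is.
bronze = [
    {"order_id": "1", "region": "emea", "amount": "10.50"},
    {"order_id": "2", "region": "EMEA", "amount": "n/a"},    # unparseable amount
    {"order_id": "3", "region": "apac", "amount": "4.00"},
    {"order_id": "1", "region": "emea", "amount": "10.50"},  # duplicate order
]


def to_silver(rows):
    """Clean and verify: cast types, normalize values, drop bad or duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # skip duplicates of an order already kept
        seen.add(order_id)
        cleaned.append(
            {"order_id": order_id, "region": row["region"].upper(), "amount": amount}
        )
    return cleaned


def to_gold(rows):
    """Enrich for serving: here, the total amount per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Bronze keeps every row untouched, Silver keeps only the two valid, de-duplicated orders, and Gold holds the per-region totals that a BI tool would query.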
+
+### Serving Layer
+
+The serving layer delivers data to end users. It is composed of:
+
+- **BI & Analytics**: Tools for data visualization and analysis.
+- **Machine Learning Processes and Notebooks**: Tools for data scientists to
+  develop ML models.
+- **Agents**: Deployed agents that need grounding in enterprise data.
+- **Operational Data**: Data used by operational, user-facing systems.