Commit ab2e1d4

Eyal Ben Ivri authored and copybara-github committed
feat: Initial commit for Hadoop-to-Lakehouse demo project. Ready for review. Follow README and docs/user_journey.md.

Added fixes from CL comments. Post-big CL fixes.

Change-Id: If2bdde3d77b00950c1edf3fc69846421fdf96d00
GitOrigin-RevId: dbf0be786186ff81644d2f8f7a0486502010e3fc
1 parent 201c4dd commit ab2e1d4

41 files changed

Lines changed: 3107 additions & 0 deletions

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
title: "Hadoop-to-Google Cloud Data Platform Journey"
nav:
- "Overview": "README.md"
- "Reference Architecture": "reference_architecture.md"
- "User Journey": "user_journey.md"
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
# Hadoop to Google Cloud Data Lakehouse Migration Demo

## Overview

This project demonstrates an end-to-end migration of a legacy Hadoop data
platform to a modern Data Lakehouse on Google Cloud. It simulates a source
environment with a Managed Spark cluster running Hive and HDFS, and provides
the tools and automation to migrate data and workloads to a target environment
built on Cloud Storage, Managed Spark Metastore, Managed Spark Serverless,
Apache Iceberg, and BigQuery.

## Intent

The intent of this demo is to showcase:

1. **Simulated Legacy Environment**: A realistic starting point with a
   non-cloud-integrated Hadoop cluster.
1. **Assessment**: Using metadata extraction tools to understand the source
   schema.
1. **Cloud-Native Transfer**: Utilizing Storage Transfer Service for efficient
   data movement.
1. **Modern Lakehouse Format**: Converting data to Apache Iceberg for
   transactional capabilities and performance.
1. **SQL Translation**: Modernizing queries from HiveQL to GoogleSQL.

## Target Audience

- **Data Engineers** looking for practical examples of Hadoop-to-Google Cloud
  migrations.
- **Solution Architects** designing modern data platform architectures on Google
  Cloud.
- **Decision Makers** who want to see the value of modernizing their legacy data
  lakes.

## Getting Started

To get started with the demo, follow the step-by-step guide in the
[User Journey](user_journey.md).

!!! important "Prerequisites"
    Before you begin, review the prerequisites section in the
    [User Journey](user_journey.md). You will need **two** Google Cloud
    projects with billing enabled and appropriate permissions.

## Reference Architecture

You can view the reference architecture diagram and component descriptions in
[reference_architecture.md](reference_architecture.md).

## Project Structure

- `terraform/`: Infrastructure as Code for source and target environments.
- `scripts/`: Automation scripts for loading data, running jobs, and
  orchestration.
- `docs/`: Comprehensive documentation including user journey and reference
  architecture.
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
# Reference Architecture: Hadoop to Google Cloud Data Lakehouse Migration

This document presents the reference architecture for migrating from a legacy
Hadoop environment to a modern Data Lakehouse on Google Cloud, as implemented in
this demo.

## Architecture Diagram

```mermaid
graph LR
    subgraph source_env ["Source Environment"]
        hive[(Hive)]
        onprem_spark[Spark]
        hdfs[(HDFS)]
    end

    subgraph migration_tools ["Migration Tools"]
        bts[BigQuery Translation Service]
        ma[Migration Assessment]
        md[dwh-migration-dumper]
        sts[Storage Transfer Service]
    end

    subgraph target_env ["Target Environment"]
        bz[(Bronze Zone / BQ + Cloud Storage)]
        bronze_spark["Managed Spark"]
        sz[(Silver Zone / BQ + Cloud Storage)]
        silver_spark["Managed Spark"]
        gz[(Gold Zone / BQ + Cloud Storage)]
        lakehouse[(Lakehouse Runtime Catalog)]
        bigquery[(BigQuery)]
        managed_spark[Managed Spark]
    end

    subgraph serving ["Serving Layer"]
        bi[Analytics & BI]
        ml["Machine Learning & Notebooks"]
        agents[Agents]
        operational[Operational Data]
    end

    %% Flow
    hive --> bts
    hive --> md
    hive --> ma
    hdfs --> ma
    hdfs --> sts
    onprem_spark --> ma

    bz <--> lakehouse
    sz <--> lakehouse
    gz <--> lakehouse

    sts --> bz
    bz --> bronze_spark
    bronze_spark --> sz
    sz --> silver_spark
    silver_spark --> gz

    gz --> managed_spark
    gz --> bigquery

    bigquery --> bi
    bigquery --> operational

    managed_spark --> ml
    managed_spark --> agents
```
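
To make the flow in the diagram easier to trace, the sketch below copies its one-directional edges into a small Python graph and finds the shortest path between two nodes with a breadth-first search. The node identifiers come from the diagram; the `shortest_path` helper is hypothetical and not part of the demo. The bidirectional catalog links (`bz`/`sz`/`gz <--> lakehouse`) are left out on the assumption that they represent metadata registration rather than data movement.

```python
from collections import deque

# One-directional edges copied from the "Flow" section of the diagram.
# Assumption: the <--> catalog links are metadata-only and are omitted here.
EDGES = [
    ("hive", "bts"), ("hive", "md"), ("hive", "ma"),
    ("hdfs", "ma"), ("hdfs", "sts"), ("onprem_spark", "ma"),
    ("sts", "bz"), ("bz", "bronze_spark"), ("bronze_spark", "sz"),
    ("sz", "silver_spark"), ("silver_spark", "gz"),
    ("gz", "managed_spark"), ("gz", "bigquery"),
    ("bigquery", "bi"), ("bigquery", "operational"),
    ("managed_spark", "ml"), ("managed_spark", "agents"),
]


def shortest_path(start, goal):
    """Return the shortest list of nodes from start to goal, or None."""
    graph = {}
    for src, dst in EDGES:
        graph.setdefault(src, []).append(dst)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    print(" -> ".join(shortest_path("hdfs", "bi")))
```

Tracing from `hdfs` to `bi` walks through every zone in order, while `shortest_path("hive", "bi")` returns `None`: in the diagram, Hive feeds only the assessment, dumper, and translation tools, and the data itself travels from HDFS through Storage Transfer Service.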

## Component Descriptions

### Source Environment

- **Hadoop Cluster**: Simulated by a standalone "Managed Service for Apache
  Spark" (formerly Dataproc) cluster.
- **HDFS**: Distributed file system storing the raw data.
- **Hive Metastore**: Manages metadata for Hive tables.

### Migration Tools

- **migration-assessment-tool**: Assesses usage in Hadoop and predicts usage on
  Google Cloud services.
- **dwh-migration-dumper**: Extracts DDL and metadata from the source Hive
  system for assessment.
- **Storage Transfer Service (STS)**: Cloud-native service used to transfer data
  from HDFS to Google Cloud Storage.
- **BigQuery Translation Service**: Translates legacy HiveQL queries to modern
  GoogleSQL for BigQuery.
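
As a rough illustration of what the HDFS-to-Cloud Storage transfer implies for paths, the sketch below maps an HDFS URI to a `gs://` URI under a zone prefix. It does not call Storage Transfer Service. The function name, the bucket name, and the `bronze` prefix are all hypothetical; the demo's actual layout may differ.

```python
from urllib.parse import urlparse


def hdfs_to_gcs_uri(hdfs_uri, bucket, prefix="bronze"):
    """Map an hdfs:// URI to a gs:// URI, keeping the HDFS directory layout.

    Hypothetical helper: the bucket and prefix are illustrative assumptions,
    not values taken from the demo.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {hdfs_uri!r}")

    # Drop the NameNode host and port; keep only the file system path.
    path = parsed.path.lstrip("/")
    parts = [part for part in (prefix.strip("/"), path) if part]
    return f"gs://{bucket}/" + "/".join(parts)


if __name__ == "__main__":
    print(hdfs_to_gcs_uri(
        "hdfs://namenode:8020/user/hive/warehouse/sales.db/orders",
        bucket="example-lakehouse-bucket",
    ))
```

Keeping the source directory layout under the prefix is one simple choice; it makes it easy to match a transferred object back to its original HDFS location during validation.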

### Target Environment

The target environment follows the classic three-tier "Medallion" architecture:

- **Bronze Zone**: Stores raw data, as is.
- **Silver Zone**: Stores processed data that has been cleaned and verified.
- **Gold Zone**: Stores enriched data, ready for analytics and serving.
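
The three zones can be pictured with a toy, in-memory example. The sketch below is plain Python rather than the Spark jobs the demo runs, and the `country`/`amount` fields, the cleaning rules, and the aggregation are assumptions chosen only to show the Bronze to Silver to Gold progression.

```python
def to_silver(bronze_rows):
    """Clean and verify raw rows: normalize country codes, drop invalid rows."""
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # amount is missing or not numeric
        country = (row.get("country") or "").strip().upper()
        if not country:
            continue  # country is missing
        silver.append({"country": country, "amount": amount})
    return silver


def to_gold(silver_rows):
    """Enrich for serving: total amount per country."""
    totals = {}
    for row in silver_rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


# Bronze: raw data, stored as is (hypothetical sample rows).
bronze = [
    {"country": " us ", "amount": "10.5"},
    {"country": "DE", "amount": "n/a"},
    {"country": "us", "amount": "4.5"},
    {"country": "", "amount": "3"},
]

if __name__ == "__main__":
    silver = to_silver(bronze)
    print(silver)
    print(to_gold(silver))
```

Only the two valid `us` rows survive cleaning, so the gold output holds a single total for `US`.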

Each zone comprises:

- **Cloud Storage**: Serves as the storage layer, holding everything from CSVs
  and binary files to Parquet files managed as Apache Iceberg tables.
- **Lakehouse Runtime Iceberg Catalog**: Managed service for managing datasets
  in the Apache Iceberg format.
- **Managed Service for Apache Spark Serverless**: Runs Spark jobs that convert
  and process data from one zone to the next.
- **BigQuery**: Enables querying Iceberg tables directly in Cloud Storage with
  BigQuery performance and security.
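
For orientation, the sketch below assembles the Spark properties that Apache Iceberg documents for registering a catalog in a Spark session. It is an assumption-laden stand-in: the catalog settings for the Lakehouse Runtime Iceberg Catalog are not shown in this document, so the example uses Iceberg's generic file-system (`hadoop`) catalog type, and the catalog name and warehouse URI are hypothetical.

```python
def iceberg_spark_conf(catalog_name, warehouse_uri):
    """Build Spark properties for a file-system-backed Iceberg catalog.

    Assumption: uses Iceberg's generic "hadoop" catalog type as a stand-in;
    the demo's managed catalog would need its own catalog settings.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg SQL extensions such as MERGE INTO and CALL procedures.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog under the chosen name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse_uri,
    }


if __name__ == "__main__":
    conf = iceberg_spark_conf("lakehouse", "gs://example-lakehouse-bucket/silver")
    for key, value in conf.items():
        print(f"{key}={value}")
```

Each pair could be passed to a Spark job as a `--conf key=value` argument or set on a `SparkSession` builder; tables would then be addressed as `lakehouse.<namespace>.<table>`.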

### Serving Layer

The serving layer delivers data to end users. It comprises:

- **BI & Analytics**: Tools for data visualization and analysis.
- **Machine Learning & Notebooks**: Tools for data scientists to develop ML
  models.
- **Agents**: Deployed agents that need grounding in enterprise data.
- **Operational Data**: Data used by operational, user-facing systems.
