-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Welcome to the Immigration-Data-Engineering wiki! In this project, we will conduct a study on Immigration Data in the united states with regards to World Temperature Data, U.S. City Demographic Data, and Airport Code Data. That is to answer questions such as, what is the percentage of people that immigrate to the U.S from countries with cooler weather?, etc..
The goal is to create ETL pipelines that extract each dataset from the source, apply several levels of transformation, then make it ready to be loaded into an analytics database (data warehouse) with a new consolidated data model that serves the analysis and study needs.
Here is an outline about the datasets:
- I94 Immigration Data: This data comes from the US National Tourism and Trade Office. This is where the data comes from. There's a sample file so you can take a look at the data in csv format before reading it all in.
- World Temperature Data: This dataset came from Kaggle. You can read more about it here.
- U.S. City Demographic Data: This data comes from OpenSoft. You can read more about it here.
- Airport Code Table: This is a simple table of airport codes and corresponding cities. It comes from here.
The project follows the following steps:
- Step 1: Scope the Project and Gather Data
- Step 2: Explore and Assess the Data
- Step 3: Define the Data Model
- Step 4: Run ETL to Model the Data
- Step 5: Complete Project Write Up
Using the mentioned datasets, we have mapped out a conceptual data model concerning immigration data and its different dimensions as follows:
The data model here represents immigration instances with regard to airport data, demographic data, temperature data, and others. The model is designed as a Star schema in the data warehouse, where the desired immigration facts and info are in the fact table, and other perspective information and features that include an object in the fact table are dimension tables.
The star schema is suitable for the analytics purposes of immigration study, since it connects relevant information within a proper number of aggregations and links (fewer joins), and decreases the records duplication at an adequate level with respect to schema complexity.
Fact table:
- Immigration table
Dimension tables:
- Migrant
- Status
- Visa
- Demographics
- Airport
- Country Temperature
- Date