Commit ce7cf2d

add data readme
1 parent 031b589 commit ce7cf2d

8 files changed, +140 -0 lines changed

data/README.md

+64
@@ -0,0 +1,64 @@
# Data Workflow Process

This document outlines the standard data workflow, which takes data from the "Raw" folder through the "Staging", "Processed", and "Interim" folders to the "Final" folder. The process ensures that data is properly cleaned, transformed, and prepared for analysis.

## 1. Raw Data

- **Folder:** Raw
- **Purpose:** The "Raw" folder contains the original, untouched data as received from external sources or collected internally. This data serves as the source of truth.

## 2. Data Ingestion

- **Task:** Copy data from the "Raw" folder to the "Staging" folder for initial ingestion and preparation, leaving the originals in "Raw" unmodified.
- **Folder:** Staging
- **Purpose:** The "Staging" folder is used to prepare data for further processing.
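
The ingestion step could be sketched as follows. This is a minimal illustration, not the project's actual tooling: the function name `ingest_to_staging` is hypothetical, and it copies files rather than moving them so that the "Raw" folder stays untouched, as data/raw/README.md requires. Skipping `README.md` files is also an assumption.

```python
import shutil
from pathlib import Path


def ingest_to_staging(raw_dir: Path, staging_dir: Path) -> list[Path]:
    """Copy every data file under raw_dir into staging_dir.

    Files are copied, never moved, so raw_dir remains the source of truth.
    Returns the list of files created in staging_dir.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(raw_dir.rglob("*")):
        if not src.is_file():
            continue
        if src.name == "README.md":
            continue  # assumption: folder READMEs are documentation, not data
        dest = staging_dir / src.relative_to(raw_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        copied.append(dest)
    return copied
```

A call such as `ingest_to_staging(Path("data/raw"), Path("data/staging"))` would mirror the subfolder layout of "Raw" inside "Staging".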

## 3. Data Cleaning and Transformation

- **Task:** Clean the data: handle missing values and outliers, and convert data types.
- **Folder:** Staging
- **Purpose:** Prepare data in the "Staging" folder for analysis by ensuring it is accurate and consistent.
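
As an illustration of the three cleaning tasks named above, here is a small sketch that works on rows as read by `csv.DictReader`. The column name `value` and the bounds `lower` and `upper` are placeholders chosen for the example, and dropping bad rows is only one possible policy (imputing or flagging them are alternatives).

```python
def clean_rows(rows, value_field="value", lower=0.0, upper=1000.0):
    """Return cleaned copies of rows; the input rows are not modified.

    - missing values: rows with an empty value_field are dropped
    - type conversion: value_field is converted from str to float
    - outliers: rows outside [lower, upper] are dropped
    """
    cleaned = []
    for row in rows:
        raw_value = (row.get(value_field) or "").strip()
        if not raw_value:
            continue  # missing value
        try:
            value = float(raw_value)
        except ValueError:
            continue  # not a number, cannot be converted
        if not lower <= value <= upper:
            continue  # outside the accepted range (this also rejects NaN)
        cleaned.append({**row, value_field: value})
    return cleaned
```

The bounds would normally come from knowledge of the dataset rather than fixed defaults.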

## 4. Data Exploration

- **Task:** Explore the data in the "Staging" folder to understand its characteristics and identify potential insights.
- **Folder:** Staging
- **Purpose:** Gain insights into the data, which will inform further data processing steps.

## 5. Intermediate Data Storage (Processed)

- **Task:** Move cleaned and explored data from the "Staging" folder to the "Processed" folder.
- **Folder:** Processed
- **Purpose:** The "Processed" folder is used to store data that has undergone initial cleaning and exploration.

## 6. Additional Data Transformation

- **Task:** Perform additional data transformations, such as feature engineering or aggregations, on data in the "Processed" folder.
- **Folder:** Processed
- **Purpose:** Create datasets in the "Processed" folder that are tailored for specific analysis goals.

## 7. Data Quality Assurance

- **Task:** Run thorough quality checks and validation on the data.
- **Folder:** Processed
- **Purpose:** Verify that data in the "Processed" folder is accurate and reliable for analysis.
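
The workflow does not specify which quality checks to run. As one hedged example, the sketch below checks two common properties, required fields being present and a key column being unique, and returns a list of human-readable problems. The function name `validate_rows` and the field names passed to it are hypothetical.

```python
def validate_rows(rows, required_fields, key_field):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    seen_keys = set()
    for index, row in enumerate(rows):
        # Completeness: every required field must be present and non-empty.
        for field in required_fields:
            if row.get(field) in (None, ""):
                problems.append(f"row {index}: missing {field}")
        # Uniqueness: the key must not repeat across rows.
        key = row.get(key_field)
        if key in seen_keys:
            problems.append(f"row {index}: duplicate {key_field}={key!r}")
        seen_keys.add(key)
    return problems
```

Returning the problems instead of raising on the first one makes it easier to record the full result of a quality check alongside the dataset.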

## 8. Intermediate Data Storage (Interim)

- **Task:** Move data from the "Processed" folder to the "Interim" folder as needed for specific analysis steps.
- **Folder:** Interim
- **Purpose:** The "Interim" folder is used to store intermediate datasets that are essential for specific analysis steps.

## 9. Final Data Storage

- **Task:** Move the final analysis-ready data from the "Interim" folder to the "Final" folder.
- **Folder:** Final
- **Purpose:** The "Final" folder contains cleaned and processed data that is ready for analysis, reporting, or sharing with stakeholders.

## 10. Documentation

- **Task:** Document all data processing steps, transformations, and any relevant metadata.
- **Folder:** docs
- **Purpose:** Maintain clear documentation to ensure reproducibility and transparency in the data analysis process.

This workflow provides a structured approach to preparing data for analysis, ensuring that data is accurate, cleaned, and transformed appropriately before it reaches its final state in the "Final" folder.

data/external/README.md

+11
@@ -0,0 +1,11 @@
### Folder: External

**Purpose:**

> This folder is intended for storing data from external sources. Data placed here should be in its raw, unprocessed form, as received from external providers. It serves as our source repository for external data.

**Usage:**

> - Data in this folder should not be modified directly.
> - When new external data arrives, place it here for further processing.
> - Document the data source and any relevant details in the metadata file.

data/final/README.md

+11
@@ -0,0 +1,11 @@
### Folder: Final

**Purpose:**

> The "Final" folder contains cleaned and processed data that is ready for analysis, reporting, or sharing with stakeholders. This represents the final state of our data for specific projects.

**Usage:**

> - Data here is considered ready for analysis.
> - Ensure that all necessary data transformations and cleaning have been completed before moving data to this folder.
> - Document any transformations or preprocessing steps applied to the data.

data/interim/README.md

+11
@@ -0,0 +1,11 @@
### Folder: Interim

**Purpose:**

> The "Interim" folder is used for storing intermediate datasets generated during the data preparation process. These datasets might not be the final data but are crucial for specific analysis steps or for tracking data transformations.

**Usage:**

> - Data here is not considered final but may be required for ongoing analysis.
> - Document the purpose of each interim dataset and its relationship to the overall analysis process.
> - Regularly review and clean up outdated or unused interim data.

data/lookup/README.md

+11
@@ -0,0 +1,11 @@
### Folder: Lookup

**Purpose:**

> This folder contains reference data and lookup tables. Lookup tables are used to map values from one dataset to another, often for data enrichment or creating meaningful labels.

**Usage:**

> - Store all reference data, such as code lists, dictionaries, or mappings, in this folder.
> - Ensure that lookup tables are well-documented, specifying the keys and values they contain.
> - Keep these tables up to date if any changes occur.
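
To illustrate how a lookup table maps values for enrichment, here is a minimal sketch using only the standard library. The CSV layout (a key column and a label column), the function names, and the sample country codes are all made up for the example.

```python
import csv
import io


def load_lookup(csv_text, key_field, value_field):
    """Build a dict mapping key_field to value_field from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_field]: row[value_field] for row in reader}


def enrich(rows, lookup, source_field, target_field, default=None):
    """Return copies of rows with target_field set from the lookup table.

    Values missing from the lookup get `default`, so unmapped codes stay visible.
    """
    return [
        {**row, target_field: lookup.get(row[source_field], default)}
        for row in rows
    ]


# Hypothetical lookup table: country code -> country name.
countries = load_lookup("code,label\nDE,Germany\nFR,France\n", "code", "label")
labelled = enrich([{"country": "DE"}, {"country": "XX"}], countries, "country", "country_name")
```

Keeping unmapped values as `None` instead of dropping the row makes gaps in the lookup table easy to spot when it needs updating.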

data/processed/README.md

+10
@@ -0,0 +1,10 @@
### Folder: Processed

**Purpose:**

> The "Processed" folder is for datasets that have gone through some initial processing but are not yet finalized. It might contain partially cleaned or transformed data.

**Usage:**

> - Use this folder for data that is being actively worked on but is not yet ready for analysis.
> - Document the processing steps that have been applied and any remaining tasks.

data/raw/README.md

+11
@@ -0,0 +1,11 @@
### Folder: Raw

**Purpose:**

> The "Raw" folder stores the original, untouched data as it was received or collected from its source. It serves as the source of truth and can be used to recreate all other data processing steps.

**Usage:**

> - Data in this folder should not be modified or overwritten.
> - Consider creating subfolders for different data sources or projects if needed.
> - Document the source and collection date of the raw data.
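
The rule that raw data must not be modified can be checked mechanically. One possible approach, sketched below, is to record a SHA-256 checksum per file when data arrives and compare against it later. The function names and the idea of keeping such a manifest are assumptions, not something this repository prescribes.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to raw_dir, POSIX style) to its checksum."""
    return {
        path.relative_to(raw_dir).as_posix(): sha256_of(path)
        for path in sorted(raw_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(raw_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return files that were added, removed, or modified since the manifest was built."""
    current = build_manifest(raw_dir)
    return sorted(
        name
        for name in set(current) | set(manifest)
        if current.get(name) != manifest.get(name)
    )
```

An empty result from `changed_files` indicates that the folder still matches the recorded state.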

data/staging/README.md

+11
@@ -0,0 +1,11 @@
### Folder: Staging

**Purpose:**

> The "Staging" folder is an intermediary step before data is moved to other folders like "Interim" or "Final." It's used for data that is in the process of being ingested or prepared for analysis.

**Usage:**

> - Data here is temporary and will be moved to other folders once it's ready.
> - Keep this folder organized and regularly clean up data that has been processed and moved to other folders.
> - Document the staging process for data ingestion and preparation.
