This project is about Big Data ETL processing on Spark. I have various MapReduce jobs spread across different projects, all essentially performing ETL processing. My plan is to consolidate them under one project and port them to Spark, adding new Spark ETL jobs as I go along.
The project is meant for large-volume ETL processing. Unlike many other ETL tools, it will not provide a visual tool for manual data manipulation. It will use traditional ETL processing logic, supplemented with machine learning where necessary.
Here are the planned features for now.
- Field level validation with regular expressions and custom Groovy logic (see the validation sketch after this list)
- Inter-field or record level validation with custom Groovy logic
- Isolation of invalid records and merge back after correction
  - Field level
  - Record level
  - Various statistical and proximity based algorithms
- Isolation of records with missing fields
- Replacement of missing field values through imputation (see the imputation sketch after this list)
- Normalizing structured text fields according to country
- Various free form and structured text matching algorithms
- Various dedup or record linkage algorithms
- Various statistics for feature attributes
- Assigning scores to feature variables, indicative of each feature's effectiveness in predicting the response variable; can be used for feature reduction (see the feature scoring sketch after this list)
- For batch ETL processing, HDFS or file system for data input and output
- For realtime ETL processing, Kafka for data input and HDFS or file system for data output (see the streaming sketch after this list)
- Supported record formats
  - Flat record
  - JSON
  - XML
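
Below is a minimal sketch of regex based field level validation with isolation of invalid records, written against the Spark Dataset API. The field positions, regex rules, and input/output paths are illustrative assumptions rather than the project's actual configuration, and the custom Groovy rule hook is omitted.

```scala
import org.apache.spark.sql.SparkSession

object FieldValidationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("fieldValidation").getOrCreate()
    import spark.implicits._

    // hypothetical per field regex rules, keyed by field position
    val rules = Map(0 -> "\\d{5}".r, 2 -> "[A-Z]{2}".r)

    val lines = spark.read.textFile("hdfs:///input/customers.csv")

    // a record is valid only if every ruled field matches its pattern
    val tagged = lines.map { line =>
      val fields = line.split(",", -1)
      val valid = rules.forall { case (idx, re) =>
        idx < fields.length && re.pattern.matcher(fields(idx)).matches()
      }
      (line, valid)
    }.cache()

    // valid records continue through the pipeline; invalid ones are isolated
    // so they can be corrected and merged back later
    tagged.filter(_._2).map(_._1).write.text("hdfs:///output/valid")
    tagged.filter(!_._2).map(_._1).write.text("hdfs:///output/invalid")
    spark.stop()
  }
}
```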
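
For missing field handling, here is a minimal sketch of mean imputation with the DataFrame API. The column names, paths, and the choice of the mean (rather than the median or a model based estimate) are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ImputationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("imputation").getOrCreate()

    val raw = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("hdfs:///input/orders.csv")
    // cast once so the merged output has a consistent column type
    val df = raw.withColumn("amount", col("amount").cast("double"))

    // isolate records with the field missing, mirroring the feature above
    val missing = df.filter(col("amount").isNull)
    val present = df.filter(col("amount").isNotNull)

    // impute the missing values with the mean of the observed values
    val meanAmount = present.agg(avg("amount")).first().getDouble(0)
    val imputed = missing.withColumn("amount", lit(meanAmount))

    // merge the corrected records back with the clean ones
    present.union(imputed).write.parquet("hdfs:///output/orders_imputed")
    spark.stop()
  }
}
```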
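
The feature scoring idea can be sketched by ranking features on the strength of their association with the response variable; here Pearson correlation stands in for whatever scoring measure ends up being used. The column names, the numeric 0/1 response, and the cutoff are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object FeatureScoringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("featureScoring").getOrCreate()

    val df = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("hdfs:///input/training.csv")

    val features = Seq("age", "income", "tenure")  // hypothetical numeric feature columns
    val response = "churned"                       // hypothetical numeric 0/1 response column

    // score each feature by its absolute correlation with the response
    val scores = features.map { f =>
      (f, math.abs(df.stat.corr(f, response)))
    }.sortBy(-_._2)

    scores.foreach { case (f, s) => println(f"$f%-10s $s%.3f") }

    // keep only features scoring above a chosen cutoff, for feature reduction
    val selected = scores.filter(_._2 > 0.1).map(_._1)
    println("selected features: " + selected.mkString(", "))
    spark.stop()
  }
}
```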
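
Finally, a minimal sketch of the realtime path: Kafka input and HDFS output via Spark Structured Streaming. The broker list, topic, and paths are assumptions, and the project could equally use the older DStream based Kafka integration.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("realtimeEtl").getOrCreate()

    // read raw records from a Kafka topic (requires the spark-sql-kafka package)
    val records = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "etl-input")
      .load()
      .selectExpr("CAST(value AS STRING) AS record")

    // transformation and validation steps would go here; write results to HDFS,
    // checkpointing for fault tolerance
    val query = records.writeStream
      .format("text")
      .option("path", "hdfs:///output/stream")
      .option("checkpointLocation", "hdfs:///checkpoints/etl")
      .start()

    query.awaitTermination()
  }
}
```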