This project is about Big Data ETL processing on Spark. I have various MapReduce jobs spread across different projects, all essentially performing ETL processing. My plan is to consolidate them under one project and port them to Spark, adding new Spark ETL jobs as I go along.
The project is meant for large-volume ETL processing. Unlike many other ETL tools, it will not provide a visual tool for manual data manipulation. It will use traditional ETL processing logic, supplemented with machine learning where necessary.
Here are the planned features for now.
- Field level validation with regular expressions and custom Groovy logic (see the validation sketch after this list)
- Inter-field or record level validation with custom Groovy logic
- Isolation of invalid records and merge back after correction
  - Field level
  - Record level
  - Various statistical and proximity based algorithms
- Isolation of records with missing fields
- Replacement of missing field values through imputation (see the imputation sketch after this list)
- Normalizing structured text fields according to country
- Various free form and structured text matching algorithms
- Various dedup or record linkage algorithms
- Various statistics for feature attributes
- Assigning scores to feature variables, indicative of each feature's effectiveness in predicting the response variable; can be used for feature reduction (see the feature scoring sketch after this list)
- For batch ETL processing, HDFS or file system for data input and output
- For realtime ETL processing, Kafka for data input and HDFS or file system for data output (see the streaming sketch after this list)
- Supported record formats
  - Flat record
  - JSON
  - XML
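
Below is a minimal sketch of regex based field level validation with isolation of invalid records, written against the Spark Dataset API. The field positions, regex rules, and input/output paths are illustrative assumptions rather than the project's actual configuration, and the custom Groovy rule hook is omitted.

```scala
import org.apache.spark.sql.SparkSession

object FieldValidationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("fieldValidation").getOrCreate()
    import spark.implicits._

    // hypothetical per field regex rules, keyed by field position
    val rules = Map(0 -> "\\d{5}".r, 2 -> "[A-Z]{2}".r)

    val lines = spark.read.textFile("hdfs:///input/customers.csv")

    // a record is valid only if every ruled field matches its pattern
    val tagged = lines.map { line =>
      val fields = line.split(",", -1)
      val valid = rules.forall { case (idx, re) =>
        idx < fields.length && re.pattern.matcher(fields(idx)).matches()
      }
      (line, valid)
    }.cache()

    // valid records continue through the pipeline; invalid ones are isolated
    // so they can be corrected and merged back later
    tagged.filter(_._2).map(_._1).write.text("hdfs:///output/valid")
    tagged.filter(!_._2).map(_._1).write.text("hdfs:///output/invalid")
    spark.stop()
  }
}
```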
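
For missing field handling, here is a minimal sketch of mean imputation with the DataFrame API. The column names, paths, and the choice of the mean (rather than the median or a model based estimate) are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ImputationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("imputation").getOrCreate()

    val raw = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("hdfs:///input/orders.csv")
    // cast once so the merged output has a consistent column type
    val df = raw.withColumn("amount", col("amount").cast("double"))

    // isolate records with the field missing, mirroring the feature above
    val missing = df.filter(col("amount").isNull)
    val present = df.filter(col("amount").isNotNull)

    // impute the missing values with the mean of the observed values
    val meanAmount = present.agg(avg("amount")).first().getDouble(0)
    val imputed = missing.withColumn("amount", lit(meanAmount))

    // merge the corrected records back with the clean ones
    present.union(imputed).write.parquet("hdfs:///output/orders_imputed")
    spark.stop()
  }
}
```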
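
The feature scoring idea can be sketched by ranking features on the strength of their association with the response variable; here Pearson correlation stands in for whatever scoring measure ends up being used. The column names, the numeric 0/1 response, and the cutoff are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object FeatureScoringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("featureScoring").getOrCreate()

    val df = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("hdfs:///input/training.csv")

    val features = Seq("age", "income", "tenure")  // hypothetical numeric feature columns
    val response = "churned"                       // hypothetical numeric 0/1 response column

    // score each feature by its absolute correlation with the response
    val scores = features.map { f =>
      (f, math.abs(df.stat.corr(f, response)))
    }.sortBy(-_._2)

    scores.foreach { case (f, s) => println(f"$f%-10s $s%.3f") }

    // keep only features scoring above a chosen cutoff, for feature reduction
    val selected = scores.filter(_._2 > 0.1).map(_._1)
    println("selected features: " + selected.mkString(", "))
    spark.stop()
  }
}
```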
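
Finally, a minimal sketch of the realtime path: Kafka input and HDFS output via Spark Structured Streaming. The broker list, topic, and paths are assumptions, and the project could equally use the older DStream based Kafka integration.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("realtimeEtl").getOrCreate()

    // read raw records from a Kafka topic (requires the spark-sql-kafka package)
    val records = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "etl-input")
      .load()
      .selectExpr("CAST(value AS STRING) AS record")

    // transformation and validation steps would go here; write results to HDFS,
    // checkpointing for fault tolerance
    val query = records.writeStream
      .format("text")
      .option("path", "hdfs:///output/stream")
      .option("checkpointLocation", "hdfs:///checkpoints/etl")
      .start()

    query.awaitTermination()
  }
}
```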