A collection of data analysis projects demonstrating end-to-end analytics capabilities and business impact.
- About me
- Languages and Tools
- Skills
- Awesome Data Analysis
- Building Startup Analytics
- Deep Sales Analysis of Olist E-Commerce
- WWI Data Pipeline and Dashboard
- Frameon - Advanced Analytics for Pandas
- License
- I hold a university degree in a technical field.
- I specialize in data analysis, with a focus on enabling informed decision-making.
- By extracting insights from complex datasets, I help organizations make data-driven decisions that drive business growth and improvement.
- Programming Languages: Python, SQL (PostgreSQL, MySQL, ClickHouse), NoSQL (MongoDB).
- Data Analysis & Visualization:
- Libraries: Pandas, NumPy, SciPy, Statsmodels, Pingouin, Plotly, Matplotlib, Seaborn.
- Tools & Frameworks: Dash, Power BI, Tableau, Redash, DataLens, Superset.
- Big Data & Distributed Computing: Apache Spark, Apache Airflow.
- Machine Learning & AI: Scikit-learn, MLlib.
- Time Series Forecasting: Facebook Prophet, Uber Orbit.
- Natural Language Processing: NLTK, SpaCy, TextBlob.
- Web Scraping: BeautifulSoup, Selenium, Scrapy.
- DevOps: Linux, Git, Docker.
- IDEs: VS Code, Google Colab, Jupyter Notebook, Zeppelin, PyCharm.
- Deep data analysis:
- Preprocessing and cleaning data, and identifying patterns through visualization to support decision-making.
- Writing complex SQL queries:
- Working with nested queries, window functions, and CASE and WITH statements for data extraction and analysis (see the SQL sketch after this list).
- Understanding product strategy:
- Knowledge of product development and improvement principles, including analyzing user needs and formulating recommendations for product growth.
- Product metrics analysis:
- LTV, RR, CR, ARPU, ARPPU, MAU, DAU, and other key performance indicators.
- Conducting A/B testing:
- Analyzing results using statistical methods to evaluate the effectiveness of changes.
- Cohort analysis and RFM segmentation:
- Identifying user behavior patterns to optimize marketing strategies.
- End-to-End Data Pipelines:
- Building automated ETL processes from databases to dashboards with Airflow orchestration.
- Data visualization and dashboard development:
- Creating interactive reports in Tableau, Redash, Power BI, and other tools for presenting analytics.
- Web scraping:
- Extracting data from websites using tools such as BeautifulSoup, Scrapy, and Selenium for information gathering and analysis.
- Working with big data:
- Experience with tools and technologies for processing large volumes of data (e.g., Hadoop, Spark).
- Machine Learning Applications:
- Building and applying machine learning models for forecasting, classification, and clustering to uncover deeper insights and support decision-making.
- Business and Metric Forecasting:
- Building and interpreting time series forecasts for key business metrics with libraries such as Uber Orbit and Facebook Prophet to support strategic planning and goal-setting.
- Working with APIs:
- Integrating and extracting data from various sources via APIs.
- Process Automation:
- Automating data workflows and routine tasks using Linux scripting, Apache Airflow, and other DevOps tools.
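A minimal, self-contained sketch of the SQL patterns mentioned above (a WITH CTE, a window function, and CASE), run here against an in-memory SQLite database with made-up data so it executes anywhere; the same constructs carry over to PostgreSQL, MySQL, and ClickHouse:

```python
import sqlite3

# Tiny in-memory orders table with made-up rows (requires SQLite >= 3.25
# for window function support, which ships with modern Python builds).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INT, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, '2024-01-05', 120.0), (1, '2024-02-10', 80.0),
        (2, '2024-01-20', 200.0), (2, '2024-03-02', 40.0),
        (3, '2024-02-14', 60.0);
""")

# WITH defines a CTE, ROW_NUMBER() is a window function partitioned per
# user, and CASE buckets each first order by size.
query = """
    WITH ranked AS (
        SELECT user_id, order_date, amount,
               ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY order_date) AS rn
        FROM orders
    )
    SELECT user_id, order_date, amount,
           CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END AS order_size
    FROM ranked
    WHERE rn = 1  -- first order per user
"""
for row in conn.execute(query):
    print(row)
```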
500+ curated resources for data analysis and data science: tools, libraries, roadmaps, cheatsheets, and interview guides.
Key Methods:
- Knowledge Management & Information Architecture:
- Systematic content curation, resource classification, and learning path development
- Research & Critical Thinking:
- Technical content evaluation, accuracy validation, and relevance assessment
- Content Strategy & Curation:
- Quality control implementation, information synthesis, and accessibility optimization
Project Description:
- A curated knowledge hub demonstrating a systematic approach to data analysis, reflecting expertise in structuring complex information and evaluating technical content.
Project Goal:
- To create a comprehensive, well-organized resource collection that facilitates learning and professional development in data analysis and data science.
Key Achievements:
- Systematized 500+ resources into logical learning paths and competency areas
- Implemented rigorous quality control by selecting materials based on accuracy and relevance
- Optimized information architecture for quick navigation and knowledge discovery
- Enhanced accessibility through web version development
- Synthesized fragmented knowledge into a unified, actionable framework
Business Impact:
- Established a trusted reference platform that shortens the learning curve for data professionals and demonstrates expertise in information architecture and knowledge management.
Building an analytics process for a startup: infrastructure, dashboards, A/B testing, forecasting, automated reports, and anomaly detection.
Stack:
- Data & DB: Python, Pandas, ClickHouse
- Viz & BI: Superset, Yandex DataLens, Plotly
- ML & Stats: Statsmodels, SciPy, Pingouin, Uber Orbit
- Automation: Apache Airflow, Telegram API
Key Methods:
- Data Infrastructure Design:
- Star schema modeling, ETL pipeline development, and data quality validation
- Product Analytics:
- Retention analysis, cohort analysis, and engagement metrics tracking
- Business Intelligence:
- Real-time dashboard design, KPI definition, and self-service reporting implementation
- Statistical Hypothesis Testing:
- A/A and A/B test analysis, sample size calculation, and statistical power analysis
- Time Series Forecasting:
- Bayesian structural models, trend/seasonality decomposition, and model validation
- Anomaly Detection:
- MAD-based outlier detection, alert threshold optimization, and real-time monitoring (see the sketch after this list)
- Automation Engineering:
- DAG orchestration, API integration, and scheduled reporting systems
- Monte Carlo Simulation:
- Statistical power estimation and sample size determination through simulation (see the sketch after this list)
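Two of the methods above, sketched with made-up numbers rather than the project's actual data. First, Monte Carlo power estimation for an A/B test: simulate many experiments at a candidate sample size and measure how often a real uplift is detected (the function and its parameters are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def estimated_power(n, p_control=0.10, p_variant=0.12, alpha=0.05, sims=2000):
    """Estimate A/B test power for n users per group via simulation."""
    rejections = 0
    for _ in range(sims):
        control = rng.binomial(n, p_control)
        variant = rng.binomial(n, p_variant)
        # 2x2 contingency test on converted vs. not converted per group.
        table = [[control, n - control], [variant, n - variant]]
        _, p_value, _, _ = stats.chi2_contingency(table)
        rejections += p_value < alpha
    return rejections / sims

# Roughly 0.7 for a 10% -> 12% conversion lift at 3,000 users per arm.
print(estimated_power(3000))
```

And a sketch of MAD-based outlier detection, the robust alternative to plain z-scores used for alerting (the threshold and the metric values are hypothetical):

```python
import numpy as np

def mad_outliers(values, threshold=3.0):
    """Flag points whose MAD-based robust z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    # 0.6745 makes the score comparable to a standard z-score under
    # normality; guard against MAD == 0 on constant data.
    robust_z = 0.6745 * (values - median) / (mad if mad else 1.0)
    return np.abs(robust_z) > threshold

daily_active_users = [1010, 998, 1005, 1021, 430, 1012]  # made-up metric
print(mad_outliers(daily_active_users))  # only the 430 dip is flagged
```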
Project Description:
- This project demonstrates the implementation of a complete product analytics system for an early-stage startup that has developed an application merging a messenger with a personalized news feed.
- In this ecosystem, users can browse and interact with posts (views, likes) while simultaneously communicating with each other through direct messages.
- The core challenge was to build the entire analytical infrastructure from scratch to understand user behavior across both features and enable data-driven decision-making.
Project Goal:
- To build a complete analytics infrastructure from scratch, enabling data-driven product decisions through automated reporting, experimentation, and monitoring.
Key Achievements:
- Built scalable data infrastructure with optimized analytical database in ClickHouse
- Designed interactive dashboards for real-time monitoring of user engagement and retention
- Implemented rigorous A/B testing pipeline with statistical validation framework
- Developed forecasting models for server load prediction and capacity planning
- Created automated reporting system with daily Telegram delivery to stakeholders
- Established real-time anomaly detection for proactive issue resolution
Business Impact:
- Enabled data-driven product decisions and reduced manual reporting overhead through a comprehensive analytics ecosystem.
Comprehensive analysis of Brazilian e-commerce data, uncovering key insights and actionable business recommendations.
Stack:
- Data Analysis: Python, Pandas, NumPy
- Visualization: Plotly, Tableau
- Statistics & ML: Statsmodels, SciPy, Scikit-learn, Pingouin
- NLP & Text Processing: NLTK, TextBlob
Key Methods:
- Exploratory Data Analysis (EDA):
- Statistical summaries, missing value analysis, and outlier detection
- Data Preprocessing:
- Feature engineering, missing value handling, and creation of new metrics and dimensions
- Time Series Analysis:
- Revenue/order trends, seasonality decomposition
- RFM Segmentation:
- Customer value segmentation by Recency, Frequency, and Monetary value (see the sketch after this list)
- Clustering:
- Scikit-learn-based customer behavior segmentation
- Geospatial Analysis:
- Sales heatmaps and delivery performance by region
- NLP Sentiment Analysis:
- Review text processing with NLTK and TextBlob (see the sketch after this list)
- Statistical Testing:
- Correlation analysis and hypothesis testing
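An illustrative RFM scoring sketch on made-up orders; the project's actual binning may differ (e.g. quintiles rather than the terciles used here):

```python
import pandas as pd

# Made-up order log: one row per order.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4],
    "order_date": pd.to_datetime([
        "2018-01-05", "2018-06-01", "2018-03-10",
        "2018-05-20", "2018-06-15", "2018-07-01", "2018-02-02"]),
    "amount": [50, 70, 200, 30, 45, 60, 110],
})

snapshot = orders["order_date"].max() + pd.Timedelta(days=1)
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

# Score each dimension 1-3 with quantile bins; recency is inverted because
# a smaller gap since the last order is better. Ranking frequency first
# avoids duplicate bin edges when many customers share an order count.
rfm["R"] = pd.qcut(rfm["recency"], 3, labels=[3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 3, labels=[1, 2, 3]).astype(int)
rfm["segment"] = rfm[["R", "F", "M"]].astype(str).agg("".join, axis=1)
print(rfm)
```

And a sentiment scoring sketch; note that Olist reviews are in Portuguese, so the real pipeline needs translation or language-specific tooling before TextBlob's English-only default analyzer applies:

```python
from textblob import TextBlob

# Hypothetical review, already translated to English.
review = TextBlob("Excellent product, but the delivery was very late.")
print(review.sentiment.polarity)      # overall tone in [-1, 1]
print(review.sentiment.subjectivity)  # 0 = objective, 1 = subjective
```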
Project Description:
- Comprehensive analysis of Brazilian e-commerce platform Olist, identifying growth opportunities and operational improvements through data-driven insights.
Project Goal:
- To perform a deep-dive analysis identifying growth opportunities, operational improvements, and customer behavior patterns.
Key Achievements:
- Conducted time-series analysis of sales dynamics, seasonality, and trend decomposition
- Implemented anomaly detection in orders, payments, and delivery times
- Developed customer profiling through RFM segmentation and clustering analysis
- Performed cohort analysis to track customer retention and lifetime value (LTV)
- Processed customer reviews using NLP for sentiment analysis and insights
- Validated business hypotheses through statistical testing
- Delivered strategic recommendations for logistics optimization and sales growth
Business Impact:
- Provided data-backed insights to optimize logistics, enhance customer retention strategies, and drive revenue growth through targeted improvements.
End-to-end data pipeline and interactive dashboard for Wide World Importers.
Stack:
- Data & Databases: Python, SQL, PostgreSQL, SQLAlchemy, dblink
- Analytics & BI: Yandex DataLens
- Automation: Airflow
Key Methods:
- Database Management:
- PostgreSQL with OLTP-to-OLAP transformation
- ETL Pipeline Development:
- Automated data extraction, transformation, and loading processes
- Data Warehouse Design:
- Star schema implementation for analytical queries
- SQL Optimization:
- Complex queries, materialized views, and index optimization
- Business Intelligence:
- Interactive dashboard development in Yandex DataLens
- Automation:
- Airflow DAG design for daily data pipeline execution (see the sketch after this list)
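A minimal sketch of the daily DAG's shape, with task bodies stubbed out; the dag_id, task names, and the Airflow 2.4+ `schedule` argument are assumptions for illustration, not the project's actual code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders(**context):
    """Pull the previous day's OLTP data (stub)."""
    ...

def load_star_schema(**context):
    """Transform and upsert facts/dimensions into the data mart (stub)."""
    ...

with DAG(
    dag_id="wwi_daily_etl",            # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # `schedule_interval` on Airflow < 2.4
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders",
                             python_callable=extract_orders)
    load = PythonOperator(task_id="load_star_schema",
                          python_callable=load_star_schema)
    extract >> load  # load only runs after a successful extract
```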
Project Description:
- End-to-end data pipeline and business intelligence solution for global distributor Wide World Importers.
Project Goal:
- To transform siloed operational data into a unified business intelligence platform supporting sales, procurement, and logistics decisions.
Key Achievements:
- Built automated ETL pipeline transforming OLTP data into optimized star schema data mart
- Designed and implemented interactive dashboard for sales, logistics, and customer analytics
- Developed daily automated data updates with Airflow DAG orchestration
- Created comprehensive business intelligence platform with specialized dashboards
- Enabled cross-departmental data-driven decision making
Business Impact:
- Reduced manual reporting time and provided a single source of truth for business performance monitoring across departments.
Frameon extends pandas DataFrame with analysis methods while keeping all original functionality intact.
Stack:
- Core Technologies: Python, Pandas, NumPy
- Statistics & ML: Statsmodels, Scikit-learn, SciPy, Pingouin
- Visualization: Plotly
- NLP & Text: TextBlob
- Documentation: Sphinx
Key Methods:
- Package Development:
- End-to-end Python package creation and distribution workflow
- Software Engineering:
- Object-oriented programming and pandas extension development (see the sketch after this list)
- Testing & Quality:
- Automated testing with GitHub Actions and code quality enforcement
- Documentation:
- Comprehensive documentation generation with Sphinx
- Visualization:
- Automated chart generation and interactive plotting
- Machine Learning:
- Feature analysis and model evaluation techniques
- Text Processing:
- NLP methods for text analysis and sentiment detection
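The extension mechanism itself is pandas' registered-accessor API; below is a minimal sketch of it (the accessor name and method are invented for illustration and are not Frameon's actual interface):

```python
import pandas as pd

# A registered accessor adds a namespace of new methods to every DataFrame
# without touching any native pandas functionality.
@pd.api.extensions.register_dataframe_accessor("explore")  # hypothetical name
class ExploreAccessor:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def missing_report(self) -> pd.Series:
        """Share of missing values per column, sorted descending."""
        return self._df.isna().mean().sort_values(ascending=False)

df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})
print(df.explore.missing_report())  # new method in its own namespace
print(df.describe())                # native pandas still works unchanged
```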
Project Description:
- A powerful pandas extension that enhances DataFrames with production-ready analytics while maintaining native functionality.
Project Goal:
- To create a comprehensive analytics toolkit that streamlines exploratory analysis and statistical workflows within the pandas ecosystem.
Key Achievements:
- Seamlessly integrates exploratory analysis, statistical testing, and visualization into pandas workflows
- Provides instant insights through automated data profiling and quality checks
- Enables cohort analysis with flexible periodization and metric customization
- Offers built-in statistical methods (bootstrap, effect sizes, group comparisons)
- Generates interactive visualizations with single-command access
- Supports both DataFrame-level and column-specific analysis
- Maintains full backward compatibility with native pandas functionality
Business Impact:
- Accelerates data analysis workflows and standardizes analytical methodologies across teams and projects.
This project is shared under the MIT License.