Big Data essentials: Hadoop, MapReduce, Spark. Explore tutorials and demos in Jupyter notebooks—most are self-contained and live, ready to run with a click.

groda/big_data

big_data

Big Data for beginners

Explore a variety of tutorials and interactive demonstrations focused on Big Data technologies like Hadoop, Spark, and more, primarily presented in the format of Jupyter notebooks. Most notebooks are self-contained, with instructions for installing all required services. They can be run on Google Colab or in a virtual Ubuntu machine/container.

Setting Up Hadoop: Single-Node Configuration

Running Apache Spark in Standalone Mode

MapReduce Tutorials

PySpark Tutorials

  • PySpark_On_Google_Colab.ipynb: Explore the inner workings of PySpark on Google Colab.
  • PySpark_miscellanea.ipynb: Tips, tricks, and insights related to PySpark.
  • getting_started_with_mrjob.ipynb: Getting started with mrjob. This demo uses the mrjob Python framework to develop and run scalable data processing jobs with both MapReduce and Spark across different execution backends, ending with a hybrid approach that runs Spark on YARN.
  • demoSparkSQLPython.ipynb: A hands-on demo of PySpark SQL fundamentals: how to create DataFrames, register temporary views, and query data using SQL syntax.
  • ngrams_with_pyspark.ipynb: A basic example of n-gram extraction with PySpark.
  • generate_data_with_Faker.ipynb: Fake It Till You Make It: Generate Test Data with Faker. Create customizable fake data for testing and development using the Faker library. Useful for populating databases, simulating user activity, or prototyping applications without relying on real data.
  • Encoding+dataframe+columns.ipynb: DataFrame column encoding with PySpark and the Parquet format.
  • Apache_Sedona_with_PySpark.ipynb: Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending Apache Spark for advanced geospatial analytics. This notebook runs a basic example with PySpark on Google Colab.
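
As a rough illustration of what n-gram extraction (the topic of ngrams_with_pyspark.ipynb) involves, here is a minimal sketch in plain Python. It is not taken from the notebook and does not use PySpark; the function names `ngrams` and `count_ngrams`, the whitespace tokenization, and the sample sentences are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token windows of `tokens`, in order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A list shorter than n yields an empty range, hence no n-grams.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines: Iterable[str], n: int) -> Counter:
    """Count n-grams over lowercased, whitespace-split lines.

    N-grams never span two lines: each line is windowed on its own.
    """
    counts: Counter = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    return counts


# Hypothetical sample input, used only for illustration.
sample = ["to be or not to be", "to be is to do"]
bigrams = count_ngrams(sample, 2)
print(bigrams.most_common(1))  # → [(('to', 'be'), 3)]
```

In a Spark job, the same per-line windowing could plausibly be applied with `RDD.flatMap` followed by a count-by-key step, and Spark ML also ships an `NGram` feature transformer; which approach the notebook takes is not stated here.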

Miscellaneous Tutorials

Virtualization and Cloud Automation

Big Data Learning Pathways

About this repository

Notebooks Testing and CI

Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. The log file for successful executions is named action_log.txt (see also: Google Colab vs. GitHub Ubuntu Runner).

Current status:

  • Run Notebooks on Ubuntu
  • Run One Notebook on Ubuntu

The GitHub workflow is a starting point for what is known as Continuous Integration (CI) in DevOps/Platform Engineering circles.

💡 Keep Learning, Keep Sharing

💌 Be kind, share what you learn, and help others take their first steps too.
If these tutorials helped you, consider giving the repo a ⭐ — it really encourages me to keep creating and improving!
