cutting back content in data-advanced
aeturrell committed Aug 16, 2022
1 parent fbb69a3 commit c2fa840
Showing 2 changed files with 19 additions and 108 deletions.
127 changes: 19 additions & 108 deletions data-advanced.ipynb
@@ -401,6 +401,8 @@
"\n",
"[**Pydantic**](https://pydantic-docs.helpmanual.io/) has some of the same features as **pandera** but it piggybacks on the ability of Python 3.5+ to have 'typed' variables (if you're not sure what that is, it's a way to declare a variable has a particular data type from inception) and it is really focused around the validation of objects (created from classes) rather than dataframes.\n",
"\n",
"If you've used the [SQLModel](https://sqlmodel.tiangolo.com/) package for writing SQL queries, you may be interested to know that every SQLModel call is also a Pydantic model.\n",
"\n",
"Here's an example of a Pydantic schema that also implements a class:"
]
},
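{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the idea (the `Employee` model and its fields below are made up purely for illustration), a Pydantic model declares typed fields and raises a `ValidationError` when incoming data don't match those types:\n",
"\n",
"```python\n",
"from pydantic import BaseModel, ValidationError\n",
"\n",
"\n",
"class Employee(BaseModel):\n",
"    name: str\n",
"    age: int\n",
"    wage: float\n",
"\n",
"\n",
"# Valid data are parsed into an Employee object\n",
"print(Employee(name=\"Ada\", age=36, wage=30.5))\n",
"\n",
"# Invalid data raise a ValidationError explaining what went wrong\n",
"try:\n",
"    Employee(name=\"Ada\", age=\"not a number\", wage=30.5)\n",
"except ValidationError as e:\n",
"    print(e)\n",
"```"
]
},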
@@ -549,134 +551,38 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Orchestration\n",
"\n",
"Data orchestration is the automation of data-driven processes from end-to-end, including preparing data, making decisions based on that data, and taking actions based on those decisions. Think of a data pipeline in which you extract data (perhaps from files), transform it somehow, and load it into where you want to put it (downstream in your research process, or perhaps into an app). \n",
"\n",
"There are some truly amazing tools out there to help you do this on a production scale. Perhaps the best known and most suited for production are AirBnB's [**Airflow**](http://airflow.incubator.apache.org/) and Spotify's [**Luigi**](https://github.com/spotify/luigi). Airflow in particular is widely used in the tech industry, and doesn't just schedule data processes in Python: it can run processes in pretty much whatever you like. Both of these tools try to solve the 'plumbing' associated with long-running batch processes on data: chaining tasks, automating them, dealing with failures, and scheduling. Both Luigi and Airflow have fancy interfaces to show you what's going on with your tasks. \n",
"\n",
"Data orchestration tools typically have a directed acyclic graph, a DAG, at the heart of how tasks are strung together. This defines how different tasks depend on each other in order, with one task following from the previous one (or perhaps following from multiple previous tasks). It's an automation dream.\n",
"\n",
"However, for a research project, it's hard to recommend these two tools as they're just a bit too powerful; Airflow in particular can do just about anything but has a steep learning curve. So, instead, to show the power of data orchestration we'll use a more lightweight but also very powerful library: [**dagster**](https://dagster.io/), which bills itself as 'a data orchestrator for machine learning, analytics, and ETL [extract, transform, and load]'. Some of the key features are being able to implement components in various tools, such as Pandas and SQL, define pipelines as DAGs, and to test the same setup on your machine that you then deploy to cloud. Like the other tools, it has a nice visual interface to show what's happening. Necessarily, we'll only be seeing a very brief introduction to it here.\n",
"\n",
"**Dagster** works by using *function decorators* (the `@` symbol) to define the 'solids' that are the nodes in **dagster**'s DAG. The `@pipeline` defines the set of functions that are composed in the pipeline. Here's an example of defining a simple pipeline that produces the same phrase with different capitalisations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dagster import execute_pipeline, pipeline, solid\n",
"\n",
"\n",
"@solid\n",
"def get_name(_):\n",
" return \"dagster\"\n",
"\n",
"\n",
"@solid\n",
"def hello(context, name: str):\n",
" return \"hello, {name}!\".format(name=name)\n",
"\n",
"\n",
"@solid\n",
"def leading_caps(context, phrase: str):\n",
" return phrase.title()\n",
"## Fake and Synthetic Data\n",
"\n",
"Fake data can be very useful for testing pipelines and trying out analyses before you have the *real* data. The closer fake data is to real data, the more likely that you're going to fully prepare yourself for the real data.\n",
"\n",
"@solid\n",
"def all_upper(context, phrase: str):\n",
" return phrase.upper()\n",
"\n",
"\n",
"@solid\n",
"def display(context, phrase_to_display: str):\n",
" print(phrase_to_display)\n",
"\n",
"\n",
"@pipeline\n",
"def hello_pipeline():\n",
" hello_text = hello(get_name())\n",
" # Alias distinguishes between the same fn called twice with different params\n",
" display_leading = display.alias(\"leading\")\n",
" display_upper = display.alias(\"upper\")\n",
" display_leading(leading_caps(hello_text))\n",
" display_upper(all_upper(hello_text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is purposefully both over-engineered and very simple (Dagster pipelines can have a *lot* more complexity of this), but we're just seeing the most basic principles of the library here.\n",
"*Fake* data are data generated according to a schema that bear no statistical relationship to the real data. There are powerful tools for generating fake data. One of the most popular and fast libraries for doing so is [Mimesis](https://mimesis.name/).\n",
"\n",
"With a pipeline defined, we can now execute it in three different ways: in a python script, using the command line, or via graphical user interface. Let's see the script method first:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"outputPrepend"
]
},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
" execute_pipeline(hello_pipeline)"
"*Synthetic* data take this one step further: they do capture some of the statistical properties of the underlying data and are generated from the underlying data. Again, they can be useful for trying out data before using the real thing—especially if the real data are highly confidential. A useful package for comparing real and synthetic data is [SynthGuage](https://datasciencecampus.github.io/synthgauge)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We get a lot of output about how the pipeline executed but it ends with `HELLO, DAGSTER!` and `Hello, Dagster!`, which is the result of our DAG.\n",
"## Data Orchestration\n",
"\n",
"The second way of running it is by using the command line. If this was in a script called `hello_dagster.py` then the command to run it would be:\n",
"Data orchestration is the automation of data-driven processes from end-to-end, including preparing data, making decisions based on that data, and taking actions based on those decisions. Think of a data pipeline in which you extract data (perhaps from files), transform it somehow, and load it into where you want to put it (downstream in your research process, or perhaps into an app). \n",
"\n",
"```\n",
"dagster pipeline execute -f hello_dagster.py\n",
"```\n",
"Going into details of data orchestration is outside of the scope of this book, so we won't be seeing a practical example, but we think it's important enough to mention it and to point you to more resources.\n",
"\n",
"Finally, there's a nice GUI (graphical user interface) that you can use to run jobs too. This is launched from the command line via `dagit -f hello_dagster.py`. Once you have opened the link using a browser, navigate to 'Playground' and use the 'Launch Execution' button to set off a run. When you open the GUI, you'll see the DAG set out for you like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# TODO update\n",
"from IPython.display import Image\n",
"There are some truly amazing tools out there to help you do this on a production scale. Perhaps the best known and most suited for production are AirBnB's [**Airflow**](http://airflow.incubator.apache.org/) and Spotify's [**Luigi**](https://github.com/spotify/luigi). Airflow in particular is widely used in the tech industry, and doesn't just schedule data processes in Python: it can run processes in pretty much whatever you like. Both of these tools try to solve the 'plumbing' associated with long-running batch processes on data: chaining tasks, automating them, dealing with failures, and scheduling. Both Luigi and Airflow have fancy interfaces to show you what's going on with your tasks. \n",
"\n",
"Image(os.path.join(\"img\", \"data_dag_gui.png\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a lot of benefits to thinking about your data processing in terms of pipelines and DAGs. You can see the last time any particular node executed and, if node execution fails, you can see why, where, and how. The GUI has tons of extra functionality too; for instance, you can track what’s produced by your pipelines with the 'Asset Manager', so you can understand how your data was generated and trace issues when it doesn’t look how you expect. There's also optional typing of data types on input and output which helps you to catch bugs in data like you'd catch bugs in code (in the above, we declared all of the input types to be strings but didn't declare what the output types were so they appeared as 'any').\n",
"Data orchestration tools typically have a directed acyclic graph, a DAG, at the heart of how tasks are strung together. This defines how different tasks depend on each other in order, with one task following from the previous one (or perhaps following from multiple previous tasks). It's an automation dream.\n",
"\n",
"This has barely scratched the surface, but hopefully it's given you an insight into what data orchestration tools do and how they might be helpful for complex data workflows."
"However, for a research project, it's hard to recommend these two tools as they're just a bit too powerful; Airflow in particular can do just about anything but has a steep learning curve. So, instead, to show the power of data orchestration we'll use a more lightweight but also very powerful library: [**dagster**](https://dagster.io/), which bills itself as 'a data orchestrator for machine learning, analytics, and ETL [extract, transform, and load]'. Some of the key features are being able to implement components in various tools, such as Pandas and SQL, define pipelines as DAGs, and to test the same setup on your machine that you then deploy to cloud. Like the other tools, it has a nice visual interface to show what's happening."
]
}
],
"metadata": {
"celltoolbar": "Tags",
"interpreter": {
"hash": "671f4d32165728098ed6607f79d86bfe6b725b450a30021a55936f1af379a247"
},
"kernelspec": {
"display_name": "Python 3.8.12 64-bit ('codeforecon': conda)",
"display_name": "Python 3.8.12 ('codeforecon')",
"language": "python",
"name": "python3"
},
"language_info": {
@@ -690,6 +596,11 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
},
"vscode": {
"interpreter": {
"hash": "c4570b151692b3082981c89d172815ada9960dee4eb0bedb37dc10c95601d3bd"
}
}
},
"nbformat": 4,
Binary file removed img/data_dag_gui.png
Binary file not shown.
