Airflow 2 to 3 auto migration rules #41641

Lee-W · 2024-08-21T11:26:52Z

Description

Why

As we're introducing breaking changes to the main branch, we should start recording the changes that migration tools could handle, to help our users migrate from Airflow 2 to 3.

The breaking changes can be found at https://github.com/apache/airflow/pulls?q=is%3Apr+label%3Aairflow3.0%3Abreaking.

Rules

airflow.sensors.external_task.ExternalTaskSensorLink → airflow.sensors.external_task.ExternalDagLink
- Remove deprecated ExternalTaskSensorLink #41391 (comment)
airflow.models.taskmixin.TaskMixin → airflow.models.taskmixin.DependencyMixin
- Remove deprecated TaskMixin class #41394
airflow.contrib.*
- Remove contrib #41366
airflow.models.ImportError → airflow.models.errors.ParseImportError
- Remove deprecated ImportError from airflow.models #41367
Deprecated imports from various places
- Remove support for deprecated imports like operators/hooks/sensors #41368
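
The rules above lend themselves to a simple lookup table. As a minimal sketch (not the tool this issue will eventually produce), the following uses the standard-library ast module to flag imports covered by a few of the rules listed here. The names RENAMED, REMOVED_PREFIXES, and find_deprecated_imports are hypothetical, and only a subset of the rules is included.

```python
import ast

# Old fully qualified name -> replacement, copied from a subset of the Rules list.
RENAMED = {
    "airflow.sensors.external_task.ExternalTaskSensorLink":
        "airflow.sensors.external_task.ExternalDagLink",
    "airflow.models.ImportError": "airflow.models.errors.ParseImportError",
}

# Module prefixes that were removed outright (no single replacement to suggest).
REMOVED_PREFIXES = ("airflow.contrib.",)


def find_deprecated_imports(source):
    """Return (line, old_name, new_name_or_None) for each matching import."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for alias in node.names:
                full = f"{node.module}.{alias.name}"
                if full in RENAMED:
                    findings.append((node.lineno, full, RENAMED[full]))
                elif full.startswith(REMOVED_PREFIXES):
                    findings.append((node.lineno, full, None))
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if (alias.name + ".").startswith(REMOVED_PREFIXES):
                    findings.append((node.lineno, alias.name, None))
    # Report in file order.
    return sorted(findings, key=lambda item: item[0])
```

A checker like this only reports; turning each finding into an automatic rewrite is a separate step.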

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct


Lee-W · 2024-08-21T11:30:06Z

The Rules section is currently just an example of how these changes can be recorded. I will check the existing breaking changes and update the rules. It would be great if folks could help update this list when they know of other breaking changes.

potiuk · 2024-08-21T13:48:22Z

I pinned the issue - this way it will show up at the top of the "Issues" list in the repo.

potiuk · 2024-08-21T13:49:05Z

Lee-W · 2024-08-21T14:50:44Z

Thanks!

eladkal · 2024-08-24T17:43:33Z

We can just go over all the significant newsfragments and either create a rule for each one or note why it doesn't require one.

kaxil · 2024-10-24T15:37:17Z

We should add something for the public API change too. API v1 won't work anymore. Those endpoints are being changed as part of AIP-84 to a new FastAPI-based app. GitHub project for it: https://github.com/orgs/apache/projects/414

pierrejeambrun · 2024-10-25T12:21:27Z

Issue for grouping the REST API breaking changes: #43378

tirkarthi · 2024-10-27T06:58:02Z

I have started prototyping a small package based on LibCST to build a Python 2to3-like tool for Airflow 2 to 3 that does simple and straightforward replacements. My main motivation was that a lot of users of our Airflow instance use schedule_interval in Airflow 2, which was deprecated and is renamed to schedule in Airflow 3. Updating thousands of dags manually would be a lot of work, and some automation could help. This could also help with changes to import statements, e.g. for the Task SDK, from airflow import DAG needs to be updated to from airflow.sdk import DAG. Something like this could eventually become part of the Airflow CLI so that users can run airflow migrate /airflow/dags for migration, or it could serve as a starting point for migration. It can update the file in place or show a diff. Currently it makes the following changes:

Dags

schedule_interval -> schedule
timetable -> schedule
concurrency -> max_active_tasks
Removal of unused full_filepath parameter
default_view (tree -> grid)

Operators

task_concurrency -> max_active_tis_per_dag
trigger_rule (none_failed_or_skipped -> none_failed_min_one_success)
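
To make the keyword renames from the Dags list concrete, here is a minimal sketch. It is not the LibCST codemod from the prototype: it swaps in the standard-library ast module (Python 3.9+, where keyword nodes carry positions) to locate the arguments and then patches the source text in place, so the rest of the formatting is untouched. The names DAG_KWARG_RENAMES and rename_dag_kwargs are hypothetical, and matching calls by the bare name DAG or dag is a simplifying assumption.

```python
import ast

# Keyword renames taken from the "Dags" list above.
DAG_KWARG_RENAMES = {
    "schedule_interval": "schedule",
    "timetable": "schedule",
    "concurrency": "max_active_tasks",
}


def _callee_name(func):
    """Return the trailing name of a call target, e.g. DAG or models.DAG -> 'DAG'."""
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def rename_dag_kwargs(source):
    """Rename deprecated keyword arguments in DAG(...) and @dag(...) calls."""
    lines = source.splitlines(keepends=True)
    edits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and _callee_name(node.func) in {"DAG", "dag"}:
            for kw in node.keywords:
                if kw.arg in DAG_KWARG_RENAMES:
                    edits.append((kw.lineno, kw.col_offset, kw.arg))
    # Apply right-to-left so earlier column offsets on the same line stay valid.
    for lineno, col, old in sorted(edits, reverse=True):
        raw = lines[lineno - 1].encode("utf-8")  # col_offset counts UTF-8 bytes
        new = DAG_KWARG_RENAMES[old].encode("utf-8")
        lines[lineno - 1] = (raw[:col] + new + raw[col + len(old):]).decode("utf-8")
    return "".join(lines)
```

Value rewrites such as default_view or trigger_rule would need to inspect the argument value as well, which this sketch does not attempt.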

Sample file

import datetime

from airflow import DAG
from airflow.decorators import dag, task
from airflow.operators.empty import EmptyOperator
from airflow.timetables.events import EventsTimetable


with DAG(
    dag_id="my_dag_name",
    default_view="tree",
    start_date=datetime.datetime(2021, 1, 1),
    schedule_interval="@daily",
    concurrency=2,
):
    op = EmptyOperator(
        task_id="task", task_concurrency=1, trigger_rule="none_failed_or_skipped"
    )


@dag(
    default_view="graph",
    start_date=datetime.datetime(2021, 1, 1),
    schedule_interval=EventsTimetable(event_dates=[datetime.datetime(2022, 4, 5)]),
    max_active_tasks=2,
    full_filepath="/tmp/test_dag.py"
)
def my_decorated_dag():
    op = EmptyOperator(task_id="task")


my_decorated_dag()

Sample usage

python -m libcst.tool codemod dag_fixer.DagFixerCommand -u 1 tests/test_dag.py
Calculating full-repo metadata...
Executing codemod...
reformatted -

All done! ✨ 🍰 ✨
1 file reformatted.
--- /home/karthikeyan/stuff/python/libcst-tut/tests/test_dag.py
+++ /home/karthikeyan/stuff/python/libcst-tut/tests/test_dag.py
@@ -10,6 +10,6 @@
     dag_id="my_dag_name",
-    default_view="tree",
+    default_view="grid",
     start_date=datetime.datetime(2021, 1, 1),
-    schedule_interval="@daily",
-    concurrency=2,
+    schedule="@daily",
+    max_active_tasks=2,
 ):
@@ -23,5 +23,4 @@
     start_date=datetime.datetime(2021, 1, 1),
-    schedule_interval=EventsTimetable(event_dates=[datetime.datetime(2022, 4, 5)]),
+    schedule=EventsTimetable(event_dates=[datetime.datetime(2022, 4, 5)]),
     max_active_tasks=2,
-    full_filepath="/tmp/test_dag.py"
 )
Finished codemodding 1 files!
 - Transformed 1 files successfully.
 - Skipped 0 files.
 - Failed to codemod 0 files.
 - 0 warnings were generated.

Repo : https://github.com/tirkarthi/Airflow-2to3

potiuk · 2024-10-27T12:47:28Z

NICE! @tirkarthi -> you should start a thread about it on the devlist and propose adding it to the repo. The sooner we start working on it and let people test it, the better it will be. And we can already start adding not only the newsfragments but also rules to the migration tools (cc: @vikramkoka @kaxil ) - we can even think about keeping a database of old-way dags, running such a migration tool on them, and letting the Airflow 3 scheduler process them (and maybe even execute them) as part of our CI. Making it part of our CI pipeline would tremendously help with maintaining and updating such a tool.

potiuk · 2024-10-27T12:54:00Z

BTW, I like a lot how simple it is with libCST - we previously used quite a bit more complex tool from Facebook that allowed refactoring at scale in parallel (https://github.com/facebookincubator/Bowler), but it was rather brittle to develop rules for, and it had some weird problems and missing features. One thing that was very useful is that it had nice "parallelism" features, which allowed refactoring 1000s of files in seconds (but also made it difficult to debug).

I think if we get it working with libCST - it will be way more generic and maintainable, also we can easily add parallelism later on when/if we see it is slow.

potiuk · 2024-10-27T13:03:55Z

One small watchout though - such a tool should have a way to isolate rules, so that they are not in a single big method: some abstraction that will allow us to easily develop and selectively apply (or skip) different rules - see https://github.com/apache/airflow/tree/v1-10-test/airflow/upgrade where we have documentation and information about the upgrade check we did for the Airflow 1 -> 2 migration.
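
One way to picture such an abstraction is a small rule registry. This is only a sketch under stated assumptions: the rule names, the register decorator, and run_rules are all hypothetical, and the regex-based bodies are stand-ins for real CST transforms so that the example stays self-contained.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    description: str
    apply: Callable[[str], str]  # takes source text, returns rewritten source text


RULES: Dict[str, Rule] = {}


def register(name, description):
    """Decorator that adds a rewrite function to the registry under a name."""
    def decorator(func):
        RULES[name] = Rule(name, description, func)
        return func
    return decorator


@register("dag-schedule-arg", "schedule_interval -> schedule")
def _schedule(source):
    # Toy stand-in: matches the keyword followed by a single '='.
    return re.sub(r"\bschedule_interval(?=\s*=[^=])", "schedule", source)


@register("dag-concurrency", "concurrency -> max_active_tasks")
def _concurrency(source):
    return re.sub(r"\bconcurrency(?=\s*=[^=])", "max_active_tasks", source)


def run_rules(source, select: Optional[Iterable[str]] = None, skip: Iterable[str] = ()):
    """Apply the selected rules in order, leaving out any listed in skip."""
    skipped = set(skip)
    names = list(RULES) if select is None else list(select)
    for name in names:
        if name not in skipped:
            source = RULES[name].apply(source)
    return source
```

Because each rule is an independent function, it can be tested on its own and switched on or off per run, for example run_rules(source, skip={"dag-concurrency"}).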

Also, we have to discuss whether it should be a separate repo or whether it should live in Airflow's monorepo. Both have pros and cons - in 1.10 we chose to keep it in the 1.10 branch of Airflow, because it imported some of the Airflow code and that was easier, but we could likely create a new repo for it, add CI there, and keep it there.

We even have this archived repo https://github.com/apache/airflow-upgrade-check which we never used and then archived; we could re-open it. We also have a package in PyPI - https://pypi.org/project/apache-airflow-upgrade-check/ - and we could release new upgrade check versions (2.* ?) with "apache-airflow>=2.11.0" as a dependency.

All that should likely be discussed at devlist :)

tirkarthi · 2024-10-27T13:25:34Z

Thanks @potiuk for the details. I will start a discussion about this on the devlist and continue there. Bowler looks interesting. Using libcst.tool from the CLI parallelizes the process. Right now this needs python -m libcst.tool to execute it as a codemod. Initially I had designed them as a standalone Transformer for each category (dag, operator), where the updated AST from one transformer can be passed to another. The codemod looked like the recommended abstraction for running it, so I changed it that way, only to find later that the CLI accepts only one codemod at a time. I need to check how composable they are.

python -m libcst.tool codemod --help | grep -i -A 1 'jobs JOBS'
  -j JOBS, --jobs JOBS  Number of jobs to use when processing files. Defaults to number of cores

time python -m libcst.tool codemod dag_fixer.DagFixerCommand -u 1 ~/airflow/dags > /dev/null 2>&1
python -m libcst.tool codemod dag_fixer.DagFixerCommand -u 1 ~/airflow/dags >  6.95s user 0.61s system 410% cpu 1.843 total

# Single core
time python -m libcst.tool codemod dag_fixer.DagFixerCommand -u 1 -j 1 ~/airflow/dags > /dev/null 2>&1
python -m libcst.tool codemod dag_fixer.DagFixerCommand -u 1 -j 1  > /dev/nul  4.66s user 0.38s system 99% cpu 5.035 total

# 4 core
python -m libcst.tool codemod dag_fixer.DagFixerCommand -u 1 -j 4 ~/airflow/dags > /dev/null 2>&1
python -m libcst.tool codemod dag_fixer.DagFixerCommand -u 1 -j 4  > /dev/nul  5.45s user 0.54s system 253% cpu 2.358 total

potiuk · 2024-10-27T20:09:16Z

Bowler looks interesting.

Don't be deceived by it :).

It was helpful for the Providers' migration at some point in time, but it had many rough edges: debugging a problem was a nightmare until we learned how to do it properly, and it had some annoying limitations. You had to learn completely new, non-standard abstractions (an SQLAlchemy-like DSL to perform modifications), which did not cover all the refactorings we wanted to do. We had to really dig deep into the code and find workarounds for things we wanted to do when the authors of Bowler had not thought about them. And sometimes those were nasty workarounds.

from bowler import Query

query = (
    Query("<paths to modify>")  # placeholder: the file or directory paths to rewrite
    .select_function("old_name")
    .rename("new_name")
    .diff(interactive=True)
)

An example I remember of the above is that we could not easily rename some of the object types because it was not "foreseen" (can't remember exactly) - we had a few surprises there.

Also, Bowler seems to have been unmaintained for > 3 years, which means it's unlikely to handle some constructs used in Airflow on Python 3.9+.

What I like about libcst is that it is a really "low-level" interface that you program in Python rather than in an abstract DSL - similar to "ast". You write actual Python code to do what you want rather than rely on incomplete abstractions, even if you have to copy & paste rename code between different "rules" (for example), which you can then abstract away as a "common" util if you need to, so no big deal.

potiuk · 2024-10-27T20:15:47Z

BTW, Codemod .... has also been unmaintained for 5 years. Not that this is a disqualification - but they list python2 as their dependency ... so .....

Lee-W · 2024-10-28T01:22:07Z

I tried using libcst in Airflow as a tiny POC for this issue, in airflow/scripts/ci/pre_commit/check_deferrable_default.py (line 34 in 5b7977a: import libcst as cst). It mostly works great, except for its speed. I was also thinking about whether to add these migration rules to the ruff Airflow linter, but I have not yet explored much on the Rust/ruff side.

potiuk · 2024-10-28T09:42:16Z

👀 👀 rust project :) ...

Me ❤️ it (but I doubt we want to invest in it as it might be difficult to maintain, unless we find quite a few committers who are somewhat ruff proficient to at least be able to review the code). But it's tempting, I must admit.

But to be honest - while I'd love to finally get a serious Rust project, I think it's not worth it. We are talking about a one-time migration: even for 10,000 dags it will take single minutes at most, and with Rust we could maybe get that under one minute - so not a big gain for a lot of pain :) . Or at least this is what my intuition tells me.

I think parallelism will do the job nicely. My intuition tells me (but this is just intuition and an understanding of some limits and the speed of certain operations) that we will get from multiple tens of minutes (when running such a migration sequentially) to single minutes when we allow the migration to run in parallel using multiple processors and processes - even with Python and libcst. This task is really suitable for such parallelisation because each file is a complete, independent task that can be run in complete isolation from all other tasks - so spawning multiple parallel interpreters, ideally forking them right after all the imports and common code are loaded so that they use shared memory for those, should do the job nicely (at least intuitively).
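
The per-file independence described above maps directly onto a process pool. The following is a rough sketch, assuming a hypothetical transform function as a stand-in for the real per-file rewrite; migrate_file and migrate_tree are illustrative names, and nothing here is taken from the prototype.

```python
import os
import re
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def transform(source):
    """Hypothetical stand-in for the real per-file rewrite."""
    return re.sub(r"\bschedule_interval(?=\s*=[^=])", "schedule", source)


def migrate_file(path):
    """Rewrite one file in place; return True if it changed."""
    file_path = Path(path)
    original = file_path.read_text(encoding="utf-8")
    updated = transform(original)
    if updated != original:
        file_path.write_text(updated, encoding="utf-8")
        return True
    return False


def migrate_tree(root, jobs=0):
    """Migrate every .py file under root; jobs=0 means one worker per core."""
    files = sorted(str(p) for p in Path(root).rglob("*.py"))
    jobs = jobs or os.cpu_count() or 1
    if jobs == 1:
        # Sequential path, handy for debugging a single rule.
        results = [migrate_file(f) for f in files]
    else:
        # Each file is an independent task, so workers need no coordination.
        with ProcessPoolExecutor(max_workers=jobs) as pool:
            results = list(pool.map(migrate_file, files, chunksize=16))
    return sum(results)
```

Keeping a jobs=1 path is a deliberate choice here: it avoids the debugging difficulty mentioned earlier in the thread for parallel refactoring tools.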

Using Rust for that might be classic premature optimisation - we likely won't need it :). But it would be worth making some calculations and getting some "numbers" for a big installation - i.e. how many dags of what size are out there, and how long it would take to parse them all with libcst and write them back (even unmodified or with a simple modification). I presume that parsing and writing back will be the bulk of the job - and modifications will add very little overhead as they will be mostly operating on in memory data structures.
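
Getting such numbers could start with a tiny benchmark. This sketch times a parse-and-write-back round trip over a directory using the standard-library ast module (Python 3.9+ for ast.unparse) purely as a stand-in; it says nothing about libcst's actual speed, and time_round_trip is a hypothetical name.

```python
import ast
import time
from pathlib import Path


def time_round_trip(root):
    """Parse and unparse every .py file under root.

    Returns (file_count, total_characters, elapsed_seconds).
    """
    count = 0
    total_chars = 0
    start = time.perf_counter()
    for path in Path(root).rglob("*.py"):
        source = path.read_text(encoding="utf-8")
        ast.unparse(ast.parse(source))  # result discarded; only the cost matters
        count += 1
        total_chars += len(source)
    return count, total_chars, time.perf_counter() - start
```

Swapping the round-trip line for the parser under evaluation would give a like-for-like comparison on the same set of dags.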

Lee-W · 2024-10-30T01:16:07Z

Me ❤️ it (but I doubt we want to invest in it as it might be difficult to maintain, unless we find quite a few committers who are somewhat ruff proficient to at least be able to review the code). But it's tempting, I must admit.

But to be honest - while I'd love to finally get a serious Rust project, I think it's not worth it. We are talking about a one-time migration: even for 10,000 dags it will take single minutes at most, and with Rust we could maybe get that under one minute - so not a big gain for a lot of pain :) . Or at least this is what my intuition tells me.

Yep, totally agree. I just wanted to raise this idea, which might be interesting. 👀

I presume that parsing and writing back will be the bulk of the job - and modifications will add very little overhead as they will be mostly operating on in memory data structures.

Yep, I think you're right. My previous default deferrable script took around 10 sec to process ~400 operators. Using ast for the check took around 1 sec.

Lee-W added this to the Airflow 3.0.0 milestone Aug 21, 2024

Lee-W self-assigned this Aug 21, 2024

Lee-W mentioned this issue Aug 21, 2024

Unify DAG schedule args and change default to None #41453

Merged

potiuk pinned this issue Aug 21, 2024

potiuk mentioned this issue Oct 27, 2024

refactor(utils/decorators): rewrite remove task decorator to use ast #43383

Open

2 tasks

potiuk mentioned this issue Oct 29, 2024

Explore and add static checks for DAGs for early detection of common issues #43176

Open

2 tasks
