Danielsola/se 187 experiment with auto cache versioning v2 #2750

Draft: wants to merge 2 commits into master

@dansola (Contributor) commented Sep 12, 2024

Why are the changes needed?

Automatic cache versioning might make UX easier in that cache versions don't need to be bumped manually each time a task is changed. Forgetting to manually change the cache version could result in confusing behavior as Flyte will hit the cache and not execute new code.

This code was thrown together to prototype the idea; it's a bit messy and could be optimized a lot. I would like to get thoughts and opinions from the community here. There are a lot of considerations, including non-Python dependencies, but this might be a starting point for scoping out this feature.

What changes were proposed in this pull request?

Added a new argument to the task decorator, which is an AutoCacheConfig:

@dataclass
class AutoCacheConfig:
    check_task: bool = False
    check_packages: bool = False
    check_modules: bool = False
    image_spec: Optional[ImageSpec] = None

This lets you determine what is checked when automatically generating a cache version for a task.

check_task -> this just checks the code contents of the task ignoring any comments or formatting changes

check_packages -> this traverses the imports in the task and checks the package version of any externally imported packages

check_modules -> this traverses the imports in the task and checks the contents of any modules that were imported and are defined in the current repo

image_spec -> this checks the tag of an ImageSpec
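
To make the role of the four options concrete, here is a minimal sketch of how per-check hashes could be folded into a single cache version. It is not the code in this PR: `SketchConfig`, `combine_version`, and the `image_tag` string (standing in for the ImageSpec tag) are hypothetical names, and the component hashes are assumed to be computed elsewhere.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for AutoCacheConfig (not the PR's class)."""

    check_task: bool = False
    check_packages: bool = False
    check_modules: bool = False
    image_tag: Optional[str] = None  # assumption: a plain string instead of an ImageSpec


def combine_version(config: SketchConfig, task_hash: str, packages_hash: str, modules_hash: str) -> str:
    """Fold only the enabled components into one SHA-256 hex digest."""
    parts = []
    if config.check_task:
        parts.append(f"task:{task_hash}")
    if config.check_packages:
        parts.append(f"packages:{packages_hash}")
    if config.check_modules:
        parts.append(f"modules:{modules_hash}")
    if config.image_tag is not None:
        parts.append(f"image:{config.image_tag}")
    # Disabled components never reach the digest, so changes to them are ignored.
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```

With this shape, a config that enables nothing always yields the same version, and enabling one check makes the version sensitive to that component only.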

How was this patch tested?

Test case has this structure:

.
├── __init__.py
├── example_wf.py
├── something_to_import.py
└── something_to_import_2.py

# example_wf.py

from flytekit import task, workflow
from flytekit.auto_cache.auto_cache import AutoCacheConfig

auto_cache_config = AutoCacheConfig(<some values here>)

@task(auto_cache_config=AutoCacheConfig(check_task=True, check_modules=True, check_packages=True))
def some_task():
    import pandas as pd
    from something_to_import import my_function

    # some random comment
    my_function()
    print(pd.DataFrame())


@workflow
def wf():
    some_task()


if __name__ == "__main__":
    wf()

# something_to_import.py

from something_to_import_2 import VAL, my_other_function


def my_function():
    print("Hello", VAL)
    my_other_function()

# something_to_import_2.py

import numpy as np

VAL = 2

def my_other_function():
    print(np.array([1,2,3]))

Some examples of the cache version under different conditions:

  1. With auto_cache_config = AutoCacheConfig():

No matter the change to any module, the cache version doesn't change.

  2. With auto_cache_config = AutoCacheConfig(check_task=True):

No matter the change to something_to_import.py or something_to_import_2.py, the cache version doesn't change. The following changes to the task affect the cache version:

@task(auto_cache_config=auto_cache_config)
def some_task():
    import pandas as pd
    from something_to_import import my_function

    # some random comment
    my_function()
    print(pd.DataFrame())

config version = 346f4b191fa0471e397496636b1b5df04cbbb3332eb9847edc9aa5b435883109

@task(auto_cache_config=auto_cache_config)
def some_task():
    import pandas as pd
    from something_to_import import my_function

    # some random comment with formatting change
    my_function()

    print(pd.DataFrame())

config version = 346f4b191fa0471e397496636b1b5df04cbbb3332eb9847edc9aa5b435883109 (same as original)

@task(auto_cache_config=auto_cache_config)
def some_task():
    import pandas as pd
    from something_to_import import my_function

    # some random comment
    my_function()
    print(pd.DataFrame())
    print("new code")

config version = 5b6d0622517b6d68bd162bc49ac0eeede688579eaa8b1c5e9230e7d06d14ad73 (different than original)

The above behavior will be consistent no matter the other arguments to AutoCacheConfig.
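
One way to get this comment- and formatting-insensitive behavior is to hash the parsed AST rather than the raw source text. The sketch below is an assumption about the approach, not the PR's implementation; `hash_source` is a hypothetical helper, and for a real task the source string could come from `inspect.getsource`.

```python
import ast
import hashlib
import textwrap


def hash_source(source: str) -> str:
    """Hash the AST of `source`, so comments and blank lines do not matter."""
    tree = ast.parse(textwrap.dedent(source))
    # ast.dump omits line/column attributes by default, and the parser has
    # already discarded comments, so only the code structure is hashed.
    normalized = ast.dump(tree)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


original = """
def some_task():
    # some random comment
    my_function()
    print("done")
"""

reformatted = """
def some_task():
    # some random comment with formatting change
    my_function()

    print("done")
"""

# Appending a statement changes the AST, so the hash changes too.
changed = original + '    print("new code")\n'

print(hash_source(original) == hash_source(reformatted))  # True
print(hash_source(original) == hash_source(changed))  # False
```

Two caveats with this sketch: docstrings are part of the AST, so editing one changes the hash, and the `ast.dump` format is not guaranteed to be identical across Python versions, so the same code may hash differently under a different interpreter.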

  3. With auto_cache_config = AutoCacheConfig(check_modules=True):

Changes to something_to_import.py and something_to_import_2.py will affect the cache version (formatting does have an effect right now).

# something_to_import.py

from something_to_import_2 import VAL, my_other_function


def my_function():
    print("Hello", VAL)
    my_other_function()

# something_to_import_2.py

import numpy as np

VAL = 2

def my_other_function():
    print(np.array([1,2,3]))

config version = 30ad245f8785b25188f303a199d5155c552ab4b08274c8f9dc858dab12f04513

# something_to_import.py

from something_to_import_2 import VAL, my_other_function


def my_function():
    print("New Print Statement", VAL)
    my_other_function()

config version = e07532a55732b01cd71821d6731c5596660d88ce40ecb68505a17c8c5fc30275 (different than original)

# something_to_import_2.py

import numpy as np

VAL = 10000  # new number

def my_other_function():
    print(np.array([1,2,3]))

config version = d6747baae014c66c14d320e8d02d374409668df25da97df79c144f5687a055b6 (different than original)

The above behavior will be consistent no matter the other arguments to AutoCacheConfig.
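
A rough sketch of how such a transitive module check could work: follow absolute imports that resolve to files inside the repo, then hash the raw bytes of every file reached. Hashing raw bytes is what makes formatting changes count here. `local_imports` and `hash_imported_modules` are hypothetical names, and this simplified version only resolves single-file modules (no packages, no relative imports).

```python
import ast
import hashlib
from pathlib import Path


def local_imports(path: Path, root: Path) -> list[Path]:
    """Return files under `root` that `path` imports with absolute imports."""
    tree = ast.parse(path.read_text())
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            candidate = root / (name.replace(".", "/") + ".py")
            if candidate.is_file():  # external packages such as numpy are skipped
                found.append(candidate)
    return found


def hash_imported_modules(entry: Path, root: Path) -> str:
    """Hash every repo-local module reachable from `entry` (entry itself excluded)."""
    root = root.resolve()
    seen = set()
    stack = local_imports(entry, root)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(local_imports(current, root))
    digest = hashlib.sha256()
    for path in sorted(seen):  # sorted so traversal order cannot change the hash
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Called as `hash_imported_modules(Path("example_wf.py"), Path("."))` on the layout above, this would pick up both something_to_import.py and something_to_import_2.py, since the second is imported by the first.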

  4. With auto_cache_config = AutoCacheConfig(check_packages=True):

Changes to imported packages affect the cache version:

pip freeze | grep -E 'pandas|numpy'
numpy==1.26.2
pandas==2.1.2

config version = 0fc6300d289f6cfbc66179dc07c48fe5ede106f9e29455a83733892e2b7981cc

pip install pandas==2.2.2
pip freeze | grep -E 'pandas|numpy'
numpy==1.26.2
pandas==2.2.2

config version = e3d453686c0ea4f60a1839a026f4613b5946f8518ec8e5c8ac7c219d727bc369 (different than original)

pip install pandas==2.1.2
pip install numpy==1.26.3
pip freeze | grep -E 'pandas|numpy'
numpy==1.26.3
pandas==2.1.2

config version = 0fc6300d289f6cfbc66179dc07c48fe5ede106f9e29455a83733892e2b7981cc (different than original)

The above behavior will be consistent no matter the other arguments to AutoCacheConfig.
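
For illustration, a package check along these lines could collect top-level import names with `ast`, look up installed versions through `importlib.metadata`, and hash the sorted `name==version` pairs. The helper names below are hypothetical, and the sketch assumes the import name matches the distribution name, which is true for pandas and numpy but not for every package.

```python
import ast
import hashlib
from importlib import metadata


def imported_top_level_names(source: str) -> set:
    """Collect the top-level names of all absolute imports in `source`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def installed_versions(names) -> dict:
    """Map each name to its installed version, skipping names with no distribution."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # standard library or repo-local module
    return versions


def hash_versions(versions: dict) -> str:
    """Hash sorted name==version pairs so dict ordering cannot affect the result."""
    pinned = "\n".join(f"{name}=={version}" for name, version in sorted(versions.items()))
    return hashlib.sha256(pinned.encode("utf-8")).hexdigest()
```

Because only the name and version strings are hashed, reinstalling the original version of a package brings the hash back to its original value.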

  5. With auto_cache_config = AutoCacheConfig(image_spec=image_spec):

image_spec = ImageSpec(
    name="test-image",
    packages=["pandas", "numpy"],
    registry="ghcr.io/dansola",
)

auto_cache_config = AutoCacheConfig(image_spec=image_spec)

config version = 04f01bb5e35b47eee2a928c125f1ccb8ab65ca17a61859b38b64e5bfde17e823

image_spec = ImageSpec(
    name="test-image",
    packages=["pandas==2.2.2", "numpy"], # pin pandas version
    registry="ghcr.io/dansola",
)

auto_cache_config = AutoCacheConfig(image_spec=image_spec)

config version = 336a800c4d1c16726b6e1de10c28e60264d1c58d587302c3a5f2b918da53f3b0 (different than original)

The above behavior will be consistent no matter the other arguments to AutoCacheConfig.

See the auto_cache_examples dir for the examples used here.

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@pingsutw (Member) left a comment

Is there any overhead when hashing all the modules? If so, I think we need to:

  1. Move hash_module_recursive, hash_set, and other methods out of AutoCache and add an @lru_cache to each function, so we won't hash the same module twice if there are multiple tasks in the workflow.
  2. Only calculate the hash at compilation time.
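
A minimal sketch of the first point, assuming the hashing helpers become module-level functions keyed by a hashable argument such as a path string. `hash_file` is a hypothetical name, not one of the PR's functions.

```python
import hashlib
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def hash_file(path: str) -> str:
    """Hash a file's bytes once per process; repeat calls hit the cache."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```

If several tasks import the same module, only the first call reads the file. The trade-off is that a cached value goes stale if the file changes while the process is running, which should be acceptable when hashes are only computed at compilation time.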

If I install flytekit from a local directory (pip install -e .), will it generate a new hash whenever I change the flytekit code?

imported_module = importlib.import_module(alias.name)
imported_modules.append(imported_module)

def visit_ImportFrom(self, node):

Suggested change
- def visit_ImportFrom(self, node):
+ def visit_import_from(self, node):
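
One thing worth checking before applying this rename: `ast.NodeVisitor.visit` looks up its handler as `"visit_" + node.__class__.__name__`, so the method name has to match the node class name exactly. The small sketch below (class names are made up for the comparison) shows that a snake_case handler is never called for `ImportFrom` nodes.

```python
import ast


class CamelVisitor(ast.NodeVisitor):
    """Handler named after the node class, as NodeVisitor expects."""

    def __init__(self):
        self.modules = []

    def visit_ImportFrom(self, node):
        self.modules.append(node.module)


class SnakeVisitor(ast.NodeVisitor):
    """Same handler with a snake_case name; NodeVisitor does not dispatch to it."""

    def __init__(self):
        self.modules = []

    def visit_import_from(self, node):
        self.modules.append(node.module)


tree = ast.parse("from something_to_import import my_function")

camel = CamelVisitor()
camel.visit(tree)
print(camel.modules)  # ['something_to_import']

snake = SnakeVisitor()
snake.visit(tree)
print(snake.modules)  # []
```

If the linter is the motivation for the rename, keeping the name and silencing the naming rule for that method may be the safer option.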

from flytekit.auto_cache.auto_cache import AutoCacheConfig


@task(auto_cache_config=AutoCacheConfig(check_task=True, check_modules=True, check_packages=True))

I'd add a default AutoCacheConfig to the task and set all the parameters to true, so users don't need to specify it by default. However, I'm not sure if it's too aggressive.

@eapolinario (Collaborator)

This is pretty cool. There's enough interest in a feature like this that we have at least 3 different PRs tackling this problem at different levels (see the linked PRs in the original issue).

@dansola, do you want to take a stab at writing an RFC before deciding on one approach?
