
[WIP] Benchmarking custom models #92

Closed

@prateekdesai04 (Contributor) commented Feb 5, 2025

Issue #, if available:

Description of changes:

The current version has warm pools implemented, so if max_concurrent_jobs is 30 and 90 jobs are launched, the remaining 60 will be queued and SageMaker instances will be reused.
This has been tested on D244_F3_C1530_30 over all available folds [0, 1, 2], for the LightGBM_c1_BAG_L1_Reproduced_AWS config.
max_concurrent_jobs should be less than the account limit, which is 34 for now.
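
A minimal sketch of the queueing idea described above, assuming new jobs are held back while the number of InProgress SageMaker training jobs is at the limit; `launch_one_job` and the polling loop are hypothetical and not the PR's actual implementation:

```python
import time
import boto3

sagemaker = boto3.client("sagemaker")

def count_running_jobs(name_prefix: str) -> int:
    """Count InProgress training jobs whose names start with the experiment prefix."""
    resp = sagemaker.list_training_jobs(
        NameContains=name_prefix, StatusEquals="InProgress", MaxResults=100
    )
    return len(resp["TrainingJobSummaries"])

def launch_with_limit(tasks, launch_one_job, name_prefix: str, max_concurrent_jobs: int = 30):
    """Hold back launches so at most max_concurrent_jobs run at once; warm pools reuse the instances."""
    for task in tasks:
        while count_running_jobs(name_prefix) >= max_concurrent_jobs:
            time.sleep(30)  # wait for a slot to free up
        launch_one_job(task)  # hypothetical per-task launch function
```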

To set up and run:

  1. Copy the tabflow folder into a parent directory. NOTE: this parent directory must also contain the tabrepo, autogluon-benchmark, and autogluon-bench folders; make sure all three are installed before installing tabflow.
  2. If you change anything in autogluon or tabrepo, you will need to re-build the image: navigate to the parent folder, then tabflow/docker, and run ./build_docker.sh {ecr_repo_name} {tag} {source_account} {target_account} {region} (AWS credentials required).
  3. In your IDE, make the necessary changes inside launch_jobs.py, e.g. enter the Docker image URI you just pushed to ECR under DOCKER_IMAGE_ALIASES; see the sketch after this list. (I plan to make these args in future edits.)
  4. Assuming you are in the parent folder, run pip install tabflow.
  5. Input your AWS credentials.
  6. Read the Example below on how to run.
  7. If you want to import any new model, import it in evaluate.py.
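
For step 3, a minimal sketch of the intended edit in launch_jobs.py, assuming DOCKER_IMAGE_ALIASES is a plain dict mapping a short alias to an ECR image URI (the alias and URI below are placeholders, not real values):

```python
# launch_jobs.py (sketch): map a short alias to the image URI pushed to ECR in step 2.
# "my-image" and the URI are placeholders; substitute your own account, region, repo name and tag.
DOCKER_IMAGE_ALIASES = {
    "my-image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/{ecr_repo_name}:{tag}",
}
```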

Example:

To run one or several datasets over certain folds (datasets and folds are space-separated):
tabflow --datasets Australian --folds 0 1 --methods_file ~/method_configs.yaml --s3_bucket test-bucket --experiment_name test-experiment --max-concurrent-jobs 30 --wait

To run all datasets in a context over all folds for that context:
tabflow --datasets run_all --folds -1 --methods_file ~/method_configs.yaml --s3_bucket test-bucket --experiment_name test-experiment --max-concurrent-jobs 30 --wait

Note:

  1. For new experiment_names, caching won't come into play (a sketch of the caching check follows below)
  2. max_concurrent_jobs must always be less than your account limit; expect failures otherwise
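
A minimal sketch of the caching idea in note 1, assuming results are cached in S3 under a key derived from the experiment name, dataset, fold, and method; the key layout and helper below are hypothetical, not tabflow's actual scheme:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def is_cached(bucket: str, experiment: str, dataset: str, fold: int, method: str) -> bool:
    """Return True if a results.pkl already exists for this task (hypothetical key layout)."""
    key = f"{experiment}/{dataset}/{fold}/{method}/results.pkl"
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise

# A new experiment_name changes the key prefix, so nothing is found and every job runs.
```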

To Do (mostly in priority order):

  1. Add requirements.txt or pyproject.toml [x]
  2. Handle more than 100 ListTrainingJobs results using exponential back-off to avoid throttling [x]
  3. Do a code clean-up and modularize everything [x] (this is also incremental, based on feedback)
  4. Log every task; fetch logs from SageMaker and save them to S3 along with results.pkl [x]
  5. Multi-threading for instantaneous job launch [WIP] (see the sketch after this list)
  6. Clean up Docker instructions, add wait flags and misc. items [x]
  7. Get results from S3, store from local to S3, and other convenience functions
  8. Give args for Dockerfile name, build, etc.; add a Docker building step to the pipeline
  9. Change the date-time format of the experiment; currently it is not in sorted order in S3 (if required, not necessary)
  10. Adopt the model register implementation from tabrepo when available [x]
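
For item 5, a minimal sketch of the multi-threaded launch idea, assuming a per-task launch function; `launch_one_job` is hypothetical and not tabflow's actual API. A thread pool lets job submissions overlap instead of being issued strictly one at a time:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def launch_all(tasks, launch_one_job, max_workers: int = 8):
    """Submit SageMaker training jobs from a thread pool so launches overlap.

    `tasks` is an iterable of (dataset, fold, method) tuples and `launch_one_job`
    is a hypothetical callable that creates one training job and returns its name.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(launch_one_job, *task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results[task] = future.result()
            except Exception as exc:  # keep launching the rest even if one submission fails
                results[task] = exc
    return results
```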

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
