Make it easier to work with datasets locally #2

Open
wmeints opened this issue Oct 1, 2020 · 1 comment

Comments

@wmeints
Member

wmeints commented Oct 1, 2020

Currently, we force the user to use a dataset on Azure. This isn't ideal for debugging. I think we can introduce a switch based on context so we have a local dataset when training locally and a remote dataset when training on Azure ML.

@EmielStoelinga

To make this work, I first started working with a ScriptRunConfig object instead of an SKLearn estimator, so that I can pass my own virtualenv to the script. I have also suggested this improvement in #12. Furthermore, I introduced the possibility to set the compute_target argument passed to the ScriptRunConfig object to 'local'.
I saved the dataset locally and introduced a parameter in tasks/train_model.py named data_folder, which refers to the location of the dataset on my local machine. This parameter is passed to the training script through the arguments parameter of the ScriptRunConfig object. When training is not performed locally, the dataset is passed using dataset.as_mount().

I realize that my solution is not a complete fit for the current template, but we could use it as an example for further development. Please see my code below:

import os

import click
from azureml.core import (ComputeTarget, Dataset, Environment, Experiment,
                          ScriptRunConfig, Workspace)


@click.command()
@click.option(
    '--experiment_name',
    help='The experiment for which to execute the run'
)
@click.option('--environment', help='The remote environment to use')
@click.option('--dataset_name',
              help='The name of the input dataset to use for training')
@click.option('--script', help='Name of the training script in /my_project')
@click.option('--curated_environment', help='Curated environment to train on',
              default='AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu',
              required=False)
@click.option('--data_folder',
              help='Data folder (specify when training locally)')
def main(experiment_name, environment, dataset_name, script,
         curated_environment, data_folder):
    ws = Workspace.from_config()
    experiment = Experiment(ws, experiment_name)

    args = ['--batch-size', 128,
            '--n-epochs', 50]

    if environment == 'local':
        # Run on the local machine with a user-managed virtualenv instead of
        # a curated Azure ML environment. The interpreter path below assumes
        # a Windows-style venv layout.
        compute_target = environment
        tf_env = Environment('venv')
        tf_env.python.user_managed_dependencies = True
        tf_env.python.interpreter_path = os.path.join(
            os.getcwd(),
            'venv\\Scripts\\python.exe'
        )
        args.append('--data-folder')
        args.append(data_folder)
    else:
        # Only resolve the registered dataset for remote runs; local runs
        # read their data from --data_folder instead.
        dataset = Dataset.get_by_name(ws, dataset_name)
        compute_target = ComputeTarget(ws, environment)
        tf_env = Environment.get(workspace=ws, name=curated_environment)
        args.append('--data-folder')
        args.append(dataset.as_mount())

    src = ScriptRunConfig(source_directory='my_project',
                          script=script,
                          arguments=args,
                          compute_target=compute_target,
                          environment=tf_env)

    run = experiment.submit(src)
    run.wait_for_completion(show_output=True)


if __name__ == '__main__':
    main()
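
On the receiving side, the training script can treat --data-folder as a plain directory path in both cases, because Azure ML substitutes dataset.as_mount() with the actual mount path at run time. A minimal sketch of what the argument handling in tasks/train_model.py could look like; only the --data-folder flag is taken from the submit script above, the other flags and their defaults are assumptions:

import argparse

# Hypothetical argument handling for tasks/train_model.py; --data-folder
# matches the flag appended by the submit script above.
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', dest='data_folder',
                    help='Directory containing the training data')
parser.add_argument('--batch-size', dest='batch_size', type=int, default=128)
parser.add_argument('--n-epochs', dest='n_epochs', type=int, default=50)
args = parser.parse_args()

# Locally this is the folder given on the command line; on Azure ML it is
# the path where dataset.as_mount() was mounted.
print(f'Loading training data from {args.data_folder}')

With this in place, the same training script runs unchanged whether it reads from a local folder or from a mounted Azure ML dataset.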
