Make it easier to work with datasets locally #2

Open
wmeints opened this issue Oct 1, 2020 · 1 comment

Comments

@wmeints
Member

wmeints commented Oct 1, 2020

Currently, we force the user to use a dataset on Azure. This isn't ideal for debugging. I think we can introduce a switch based on context so we have a local dataset when training locally and a remote dataset when training on Azure ML.

@EmielStoelinga

To make this work, I first started working with a ScriptRunConfig object instead of an SKLearn estimator, so that I can pass my own virtualenv to the script. I have also suggested this improvement in #12. Furthermore, I introduced the possibility to set the compute_target argument passed to the ScriptRunConfig object to 'local'.
I saved the dataset locally and introduced a parameter in tasks/train_model.py named data_folder, which refers to the location of the dataset on my local machine. This parameter is passed to the training script through the arguments parameter of the ScriptRunConfig object. When training is not performed locally, the dataset is passed using dataset.as_mount().

I realize that my solution is not a complete fit for the current template, but we could use it as an example for further development. Please see my code below:

import os

import click
from azureml.core import (ComputeTarget, Dataset, Environment, Experiment,
                          ScriptRunConfig, Workspace)


@click.command()
@click.option(
    '--experiment_name',
    help='The experiment for which to execute the run'
)
@click.option('--environment', help='The remote environment to use')
@click.option('--dataset_name',
              help='The name of the input dataset to use for training')
@click.option('--script', help='Name of the training script in /my_project')
@click.option('--curated_environment', help='Curated environment to train on',
              default='AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu',
              required=False)
@click.option('--data_folder',
              help='Data folder (specify when training locally)')
def main(experiment_name, environment, dataset_name, script,
         curated_environment, data_folder):
    ws = Workspace.from_config()
    experiment = Experiment(ws, experiment_name)

    args = ['--batch-size', 128,
            '--n-epochs', 50]

    if environment == 'local':
        # Run on the local machine with a user-managed virtualenv instead of
        # a curated Azure ML environment. The interpreter path below assumes
        # a Windows-style venv layout.
        compute_target = environment
        tf_env = Environment('venv')
        tf_env.python.user_managed_dependencies = True
        tf_env.python.interpreter_path = os.path.join(
            os.getcwd(),
            'venv\\Scripts\\python.exe'
        )
        args.append('--data-folder')
        args.append(data_folder)
    else:
        # Only resolve the registered dataset for remote runs; local runs
        # read their data from --data_folder instead.
        dataset = Dataset.get_by_name(ws, dataset_name)
        compute_target = ComputeTarget(ws, environment)
        tf_env = Environment.get(workspace=ws, name=curated_environment)
        args.append('--data-folder')
        args.append(dataset.as_mount())

    src = ScriptRunConfig(source_directory='my_project',
                          script=script,
                          arguments=args,
                          compute_target=compute_target,
                          environment=tf_env)

    run = experiment.submit(src)
    run.wait_for_completion(show_output=True)


if __name__ == '__main__':
    main()
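
On the receiving side, the training script can treat --data-folder as a plain directory path in both cases, because Azure ML substitutes dataset.as_mount() with the actual mount path at run time. A minimal sketch of what the argument handling in tasks/train_model.py could look like; only the --data-folder flag is taken from the submit script above, the other flags and their defaults are assumptions:

import argparse

# Hypothetical argument handling for tasks/train_model.py; --data-folder
# matches the flag appended by the submit script above.
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', dest='data_folder',
                    help='Directory containing the training data')
parser.add_argument('--batch-size', dest='batch_size', type=int, default=128)
parser.add_argument('--n-epochs', dest='n_epochs', type=int, default=50)
args = parser.parse_args()

# Locally this is the folder given on the command line; on Azure ML it is
# the path where dataset.as_mount() was mounted.
print(f'Loading training data from {args.data_folder}')

With this in place, the same training script runs unchanged whether it reads from a local folder or from a mounted Azure ML dataset.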
