Skip to content
This repository has been archived by the owner on May 31, 2024. It is now read-only.

feat: Toil engine for CWL #317

Conversation

adamnovak
Copy link
Contributor

Issue #, if available: N/A

Description of Changes

This PR contains almost the minimum changes necessary to add Toil as an engine to run CWL.

It also adds the ability to override the Docker image names used when deploying a context (because this is necessary to actually use the Toil engine with Docker images with names in one's own namespace), and the ability to pass only relevant Docker images to the CDK (because otherwise the CDK commands grow with each engine and each new engine has to touch the part of the tests that counts how long the commands are).

Description of how you validated changes

I can't actually run this as submitted, because it doesn't have #231.

You can't actually read the logs of the successful workflow out, because pulling logs over WES is a separate feature (partly blocked by #314), and I'm going to include what I have for that in a separate PR.

But I tested commit 822e51a and I was able to run the included hello world workflow.

Checklist

  • If this change would make any existing documentation invalid, I have included those updates within this PR
  • I have added unit tests that prove my fix is effective or that my feature works
  • I have linted my code before raising the PR
  • Title of this Pull Request follows Conventional Commits standard: https://www.conventionalcommits.org/en/v1.0.0/

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@adamnovak
Copy link
Contributor Author

Looks like my attempt to resolve the merge conflict on Github broke the tests; I will have to pull down the branch and fix the merge.

Any advice for how to deal with the CDK call argument count tests? My changes conflict with any changes in main, and the argument counts in main change quite often.

@@ -0,0 +1,20 @@
---
name: Demo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use DemoCwl or similar so that the deployed artifacts have naming that is easier to follow in testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change this to be unique.

jobQueueArn: string;
// This is actually a pattern that matches all ARNs for potentially relevant
// definitions, since Toil makes its own definitions.
toilJobArn: string;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toilJobDefinitionArnPattern would be a more meaningful name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will rename this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, a similar change may want to be made to cromwellJobArn, which is also a pattern for engine-managed job definitions:

const cromwellJobArn = Arn.format(
{
account: Aws.ACCOUNT_ID,
region: Aws.REGION,
partition: Aws.PARTITION,
resource: "job-definition/*",
service: "batch",
},
scope as Stack
);
. I don't think that would make sense to do in this PR.

new PolicyStatement({
effect: Effect.ALLOW,
actions: ["batch:DescribeJobDefinitions", "batch:ListJobs", "batch:DescribeJobs", "batch:DescribeJobQueues", "batch:DescribeComputeEnvironments"],
resources: ["*"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of these can have a more limited scope. For example, you know the job queue so possibly batch:DescribeJobQueues can be limited to that arn. You may also be able to limit batch:DescribeJobDefinitions to the toilJobDeinitionArnPattern. You may have to split the policies to do this but in general the number of actions with resources: ["*"] should be as small as possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a copy-paste of the policy that Cromwell is using:

new PolicyStatement({
effect: Effect.ALLOW,
actions: ["batch:DescribeJobDefinitions", "batch:ListJobs", "batch:DescribeJobs", "batch:DescribeJobQueues", "batch:DescribeComputeEnvironments"],
resources: ["*"],
}),

Testing any change here is going to take something like an hour of turnaround time to tear down and redeploy the stack, so I'm very reluctant to undertake myself the project of determining what the minimum permission set to use Batch actually is.

I think instead of copy-pasting the code, I will try and base this off the Cromwell policy more directly, so that when the Cromwell permission set gets tightened, Toil will also be fixed.

new PolicyStatement({
effect: Effect.ALLOW,
actions: ["sdb:*"],
resources: ["*"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to know this arn and restrict to it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ARNs that Toil needs follow a pattern that I will restrict to.

statements: [
new PolicyStatement({
effect: Effect.ALLOW,
actions: ["sdb:*"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more comments on this later but why not use DynamoDB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now Toil has code that requires SimpleDB. Now that S3 is strongly consistent, we have a project to remove the SimpleDB dependency and just use S3, but it is not yet complete, so currently Toil needs SimpleDB to work.

statements: [
new PolicyStatement({
effect: Effect.ALLOW,
actions: ["s3:*"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is far too permissive. It could be used to do literally anything to a customers buckets. This must be limited to a pattern or single bucket.

Copy link
Contributor Author

@adamnovak adamnovak Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Toil is going to make a buckets and a SimpleDB domains with names starting with toil-cwl- for each workflow, for scratch. (I think we might also use toil- if the workflow isn't a CWL workflow.) It also I believe uses a toil SimpleDB domain.

We probably should be able to grant permissions for just those prefixes.

This will cause trouble if the user wants to specify their own --jobStore and tell Toil to use a different bucket, but we don't necessarily have to support that.


return {
...commonBatchProps,
// We only use one Batch from the stack for the Toil jobs. The server
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only use one Batch compute environment and queue


export interface ToilEngineConstructProps extends EngineOptions {
/**
* AWS Batch JobQueue to use for running workflows.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it used for running workflows or workflow tasks (or both)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for the tasks; I can note that.


export class ToilEngineConstruct extends EngineConstruct {
public readonly engine: SecureService;
public readonly adapterRole: IRole;
Copy link
Contributor

@markjschreiber markjschreiber Mar 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Toil can speak WES without needing to use a Lambda why is there an adapterRole and adapterLogGroup? Perhaps these need renaming?

Is adapterRole used at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left adapterRole and adapterLogGroup in under the impression that the type system required them to be there. Toil doesn't use them, and on looking at this again it seems like adapterRole (now?) never makes it out to the base engine interface.

But EngineOutputs always has an adapterLogGroup right now, and the base EngineConstruct defines a method that has to return an EngineOutputs. I think to get rid of it I would need to not make ToilEngineConstruct an EngineConstruct, or else remove the field from the base EngineOutputs and duplicate it and any code that uses it in all the other engines. Or else add a new intermediate AdaptedEngineOutputs and AdaptedEngineConstruct and change the other engines to use those.

TOIL_AWS_BATCH_JOB_ROLE_ARN: this.jobRole.roleArn,
});

// TODO: Move log group creation into service construct and make it a property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything blocking this from happening right now? If not either move it or remove the TODO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment came along with the other code I duplicated from the Cromwell engine.

I could probably jut drop it here. There won't be an adapter log group to move if I manage to defeat the type system and get rid of it.

// TODO: Move log group creation into service construct and make it a property
this.engine = this.getEngineServiceDefinition(props.vpc, engineContainer, this.engineLogGroup);
// This is unused because we have no adapter, but a log group is required.
this.adapterLogGroup = new LogGroup(this, "AdapterLogGroup");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to think about the adapterLogGroup would be the LogGroup for WES logs. The name reflects the notion that all of our engines don't understand WES natively. Is there a way that WES related logging from Toil can go here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would involve finding some kind of Python logging module backend that could send logs to a CloudWatch log group instead of standard error, and adding it to Toil, and configuring it to pick up messages from WES-related modules, I think. It seems like making the notion of an adapter optional would be a lot easier.

protected getOutputs(): EngineOutputs {
return {
accessLogGroup: this.apiProxy.accessLogGroup,
adapterLogGroup: this.adapterLogGroup,
Copy link
Contributor

@markjschreiber markjschreiber Mar 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you cannot make an adapterLogGroup from Toil then you could make this field of EngineOutputs optional or alternatively make a super class that doesn't have this property?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I suppose fields can be optional, maybe that is the right approach to take.

// their Docker images.
engine := m.contextEnv.EngineName
// Each engine has its own section in imageRefs, and for now we assume they
// all care about a WES adapter image.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should assume this. Toils is an example of an engine with no WES Adapter image so rather than faking it we should handle the edge case where there is no WES adapter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make it support passing the WES adapter for the engines that need it only.

Copy link
Contributor

@markjschreiber markjschreiber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Some comments on improvements that could be made. Most important is paring back the s3:* on * permission. This won't fly with security or with account admins who need to install AGC.

We will also need some basic additions to the github pages docs. This really just means adding a toil.md file in site/content/en/docs/Workflow engines. You can follow the basic pattern of the other engine pages.

@@ -46,7 +46,7 @@ func TestManager_Deploy(t *testing.T) {
mockClients.ssmMock.EXPECT().GetCustomTags().Return(testTags)
mockClients.ecrClientMock.EXPECT().VerifyImageExists(environment.CommonImages["NEXTFLOW"]).Return(nil)
clearContext := mockClients.cdkMock.EXPECT().ClearContext(filepath.Join(testHomeDir, ".agc/cdk/apps/context")).Return(nil)
mockClients.cdkMock.EXPECT().DeployApp(filepath.Join(testHomeDir, ".agc/cdk/apps/context"), gomock.Len(42), testContextName3).After(clearContext).Return(mockClients.progressStream1, nil)
mockClients.cdkMock.EXPECT().DeployApp(filepath.Join(testHomeDir, ".agc/cdk/apps/context"), gomock.Len(30), testContextName3).After(clearContext).Return(mockClients.progressStream1, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this a constant so you don't have to change it multiple times

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that.

@@ -125,7 +126,7 @@ func (o *initProjectOpts) validateProject() error {
func BuildProjectInitCommand() *cobra.Command {
vars := initProjectVars{}
cmd := &cobra.Command{
Use: "init project_name --workflow-type {wdl|nextflow|snakemake}",
Use: "init project_name --workflow-type {cwl|wdl|nextflow|snakemake}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor ux nit, consider putting in alphabetic order

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that.

# Add rabbitmq repository
ADD rabbitmq.repo /etc/yum.repos.d/rabbitmq.repo

# Sadly pre-importing keys doesn't seem to save any time whan we use yum later, so don't so it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/whan/when/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix this.

@@ -0,0 +1,48 @@
## Toil AWS Mirror

A Toil mono-container WES server for use with Amazon AGC.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our marketing people prefer we use "Amazon Genomics CLI" in full in all user facing docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll call it that.

To push this to an Amazon ECR repo, where AGC can get at it, you can do something like:

```bash
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 318423852362.dkr.ecr.us-west-2.amazonaws.com
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recommend changing --region us-west-2 to --region <your-deployment-region> and remove account numbers and replace with either <your-account-number> or 123456789012

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do this, although then I won't be able to copy-paste this and execute it anymore.

```bash
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 318423852362.dkr.ecr.us-west-2.amazonaws.com
docker build -t adamnovak/toil-agc .
docker tag adamnovak/toil-agc:latest 318423852362.dkr.ecr.us-west-2.amazonaws.com/adamnovak/toil-agc:latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace account numbers and regions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do this.

@adamnovak adamnovak force-pushed the toil-integration-only-2-no-features-squashed branch from 33d9f46 to 216499e Compare March 30, 2022 20:41
@adamnovak adamnovak force-pushed the toil-integration-only-2-no-features-squashed branch 2 times, most recently from c35cd8a to 4a90087 Compare April 26, 2022 21:27
@adamnovak
Copy link
Contributor Author

@markjschreiber I've added some Toil documentation like you asked for. I generated the documentation and it looks right to me.

@adamnovak
Copy link
Contributor Author

As discussed at the meeting today, the AGC project is unwilling to take this without locking down the IAM actions granted to the Toil engine, beyond just restricting access to certain buckets/domains.

I've added what I think is going to be close to the least-privilege set of permissions. I got these by looking at the full lists of possible permissions, and keeping the ones that appeared to correspond to Boto3 Client calls we make, or to Boto3 Resource objects or collections we access or things we do to those objects and collections.

I still need to:

  1. Test to make sure that these permissions are sufficient, at least for an example workflow.
  2. Somehow test to see if they are minimal, or at least close enough to minimal to be getting on with.

I don't think we'll be able to guarantee actual minimality with a reasonable time investment. We'd need to put together a workflow set that exercises every Boto3 call in Toil's job store, and try it once per permission, with that permission left out. Or we'd need to inventory every call into Boto3, trace them through the Boto3 Resource code to the API calls that will be made (accounting for any laziness), and then determine what IAM actions those API calls need allowed, either from the documentation or by experiment.

For example, Toil at one point says:

bucket.objects.all().delete()

Presumably we need s3:DeleteObject for that, and something for listing the bucket objects. We'd need to figure out if the permission needed for listing here is s3:ListBucket or s3:ListObjects if we really wanted a minimal set of permissions, and the answer (or indeed the distinction) doesn't appear to be documented.

It was mentioned that CloudWatch can log API operations, but I'm not sure that's the same as logging IAM action checks hit.

@adamnovak adamnovak force-pushed the toil-integration-only-2-no-features-squashed branch from 5e276c4 to c017154 Compare May 3, 2022 15:59
@adamnovak
Copy link
Contributor Author

OK, I had to make some changes to the permission set I initially proposed, but I've now tested this and it can successfully run the example nontrivial CWL workflow.

@markjschreiber
Copy link
Contributor

A rebased version of this PR has been merged to HEAD.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants