From 12976d4753a6c270371b9db36126897ce94b5f4e Mon Sep 17 00:00:00 2001 From: Mackenzie-OO7 Date: Tue, 31 Jan 2023 13:20:13 +0100 Subject: [PATCH 1/3] add cwl best practices with docker --- src/topics/best-practices.md | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/src/topics/best-practices.md b/src/topics/best-practices.md index 4c2e18ab..e6c998d0 100644 --- a/src/topics/best-practices.md +++ b/src/topics/best-practices.md @@ -96,6 +96,28 @@ all are required. - Software containers should be made to be conformant to the ["Recommendations for the packaging and containerizing of bioinformatics software"][containers] (also useful to other disciplines). +The following are a set of recommended good practices to keep in mind when running CWL workflows within Docker: + +- Make sure you are using the latest version of both CWL and Docker, as this will ensure that you have access to the latest features and bug + fixes. + +- It is good practice to keep your Dockerfiles in Git, just like your workflow definitions, because they are also scripts and should be + managed and tracked with version control. + +- When creating a Dockerfile, it is important to specify the exact version of the software you want to install and the base image you + want to use. This helps ensure that your Docker image builds are consistent and reproducible. Additionally, when using the `FROM` command, specify a tag for the base image, otherwise it will default to "latest" which can change at any time. + +- To ensure that the user specified in the Dockerfile is actually used to run the tool, it is best to avoid using the `USER` + instruction in the Dockerfile. This is because cwltool will override the `USER` instruction and match the user instead, which means that the user specified in the `USER` instruction may not be the user that is actually used to run the tool. 
+ +- Keep your container images as small as possible, this speeds up the download time and consumes less storage space. Also, when using bioinformatics tools, reference data should be supplied externally (as workflow inputs), rather than including it in the container image. This way, it is easier to update the reference data without the need to rebuild the Docker image. + +- Avoid using the `ENTRYPOINT` command in your Dockerfile because it changes the command line that runs inside the container. + This can cause confusion when the command line that supplied to the container and the command that actually runs are different. + +- Docker has a feature that can save you time during development by reusing a previous command and its base layer, instead of running it + again. However, this can also cause problems if a file being downloaded changes, but the command remains the same. In that case, the cached version of the file will be used instead of the updated one. To avoid this, use the `--no-cache` option to force Docker to re-run the steps. + [containers]: https://doi.org/10.12688/f1000research.15140.1 [apache-license]: https://spdx.org/licenses/Apache-2.0.html [license-example]: https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/workflows/emg-assembly.cwl#L200 @@ -112,4 +134,3 @@ all are required. 
% % - Writing CWL workflows (include existing docs from https://github.com/common-workflow-library/cwl-patterns/blob/main/README.md) % - FAIR best practices with CWL -% - Docker best practices with CWL - https://github.com/common-workflow-language/common-workflow-language/issues/347 From dd25b6b9e198097462e4e06369dfbfebbf044c7a Mon Sep 17 00:00:00 2001 From: Mackenzie-OO7 Date: Tue, 31 Jan 2023 13:43:19 +0100 Subject: [PATCH 2/3] add cwl best practices with docker --- src/topics/best-practices.md | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/src/topics/best-practices.md b/src/topics/best-practices.md index e6c998d0..24c77336 100644 --- a/src/topics/best-practices.md +++ b/src/topics/best-practices.md @@ -98,25 +98,19 @@ all are required. The following are a set of recommended good practices to keep in mind when running CWL workflows within Docker: -- Make sure you are using the latest version of both CWL and Docker, as this will ensure that you have access to the latest features and bug - fixes. +- Make sure you are using the latest version of both CWL and Docker, as this will ensure that you have access to the latest features and bug fixes. -- It is good practice to keep your Dockerfiles in Git, just like your workflow definitions, because they are also scripts and should be - managed and tracked with version control. +- It is good practice to keep your Dockerfiles in Git, just like your workflow definitions, because they are also scripts and should be managed and tracked with version control. -- When creating a Dockerfile, it is important to specify the exact version of the software you want to install and the base image you - want to use. This helps ensure that your Docker image builds are consistent and reproducible. Additionally, when using the `FROM` command, specify a tag for the base image, otherwise it will default to "latest" which can change at any time. 
+- When creating a Dockerfile, it is important to specify the exact version of the software you want to install and the base image you want to use. This helps ensure that your Docker image builds are consistent and reproducible. Additionally, when using the `FROM` command, specify a tag for the base image, otherwise it will default to "latest" which can change at any time. -- To ensure that the user specified in the Dockerfile is actually used to run the tool, it is best to avoid using the `USER` - instruction in the Dockerfile. This is because cwltool will override the `USER` instruction and match the user instead, which means that the user specified in the `USER` instruction may not be the user that is actually used to run the tool. +- To ensure that the user specified in the Dockerfile is actually used to run the tool, it is best to avoid using the `USER` instruction in the Dockerfile. This is because cwltool will override the `USER` instruction and match the user instead, which means that the user specified in the `USER` instruction may not be the user that is actually used to run the tool. - Keep your container images as small as possible, this speeds up the download time and consumes less storage space. Also, when using bioinformatics tools, reference data should be supplied externally (as workflow inputs), rather than including it in the container image. This way, it is easier to update the reference data without the need to rebuild the Docker image. -- Avoid using the `ENTRYPOINT` command in your Dockerfile because it changes the command line that runs inside the container. - This can cause confusion when the command line that supplied to the container and the command that actually runs are different. +- Avoid using the `ENTRYPOINT` command in your Dockerfile because it changes the command line that runs inside the container. This can cause confusion when the command line that supplied to the container and the command that actually runs are different. 
-- Docker has a feature that can save you time during development by reusing a previous command and its base layer, instead of running it - again. However, this can also cause problems if a file being downloaded changes, but the command remains the same. In that case, the cached version of the file will be used instead of the updated one. To avoid this, use the `--no-cache` option to force Docker to re-run the steps. +- Docker has a feature that can save you time during development by reusing a previous command and its base layer, instead of running it again. However, this can also cause problems if a file being downloaded changes, but the command remains the same. In that case, the cached version of the file will be used instead of the updated one. To avoid this, use the `--no-cache` option to force Docker to re-run the steps. [containers]: https://doi.org/10.12688/f1000research.15140.1 [apache-license]: https://spdx.org/licenses/Apache-2.0.html From c425ed9552638ab5022be543cff98bf434128014 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Levai=20Mackenzie=20=C3=81gb=C3=A0r=C3=A0?= <97461848+Mackenzie-OO7@users.noreply.github.com> Date: Sat, 4 Mar 2023 03:17:23 +0100 Subject: [PATCH 3/3] update cwl best practices with docker --- src/topics/best-practices.md | 59 ++++++++++++++++++++++++++++-------- 1 file changed, 46 insertions(+), 13 deletions(-) diff --git a/src/topics/best-practices.md b/src/topics/best-practices.md index de842d8d..e7b88faf 100644 --- a/src/topics/best-practices.md +++ b/src/topics/best-practices.md @@ -98,19 +98,52 @@ all are required. The following are a set of recommended good practices to keep in mind when running CWL workflows within Docker: -- Make sure you are using the latest version of both CWL and Docker, as this will ensure that you have access to the latest features and bug fixes. 
- -- It is good practice to keep your Dockerfiles in Git, just like your workflow definitions, because they are also scripts and should be managed and tracked with version control. - -- When creating a Dockerfile, it is important to specify the exact version of the software you want to install and the base image you want to use. This helps ensure that your Docker image builds are consistent and reproducible. Additionally, when using the `FROM` command, specify a tag for the base image, otherwise it will default to "latest" which can change at any time. - -- To ensure that the user specified in the Dockerfile is actually used to run the tool, it is best to avoid using the `USER` instruction in the Dockerfile. This is because cwltool will override the `USER` instruction and match the user instead, which means that the user specified in the `USER` instruction may not be the user that is actually used to run the tool. - -- Keep your container images as small as possible, this speeds up the download time and consumes less storage space. Also, when using bioinformatics tools, reference data should be supplied externally (as workflow inputs), rather than including it in the container image. This way, it is easier to update the reference data without the need to rebuild the Docker image. - -- Avoid using the `ENTRYPOINT` command in your Dockerfile because it changes the command line that runs inside the container. This can cause confusion when the command line that supplied to the container and the command that actually runs are different. - -- Docker has a feature that can save you time during development by reusing a previous command and its base layer, instead of running it again. However, this can also cause problems if a file being downloaded changes, but the command remains the same. In that case, the cached version of the file will be used instead of the updated one. To avoid this, use the `--no-cache` option to force Docker to re-run the steps. 
+- Make sure you are using the latest version of both CWL and Docker, + as this will ensure that you have access to the latest features and bug fixes. + +- Use meaningful tags on your own Docker image + so you can tell versions of your Docker image apart as it is updated over time. + Tags can reflect the version of the underlying software, + or a version you assign to the Dockerfile itself. + They can be manually assigned version numbers (e.g. 1.0, 1.1, 1.2, 2.0), + timestamps (e.g. YYYYMMDD like 20220126) or the hash of a git commit. + +- It is good practice to keep your Dockerfiles in Git, just like your workflow definitions, + because they are also scripts and should be managed and tracked with version control. + +- When creating a Dockerfile, it is important to specify the exact version + of the software you want to install and the base image you want to use. + This helps ensure that your Docker image builds are consistent and reproducible. + Additionally, when using the `FROM` instruction, specify a tag for the base image; + otherwise it defaults to `latest`, which can change at any time. + +- Do not rely on the `USER` instruction in the Dockerfile + to control which user runs the tool. + By default, cwltool passes the current user ID to `docker run --user`, + which overrides `USER`, so the user specified in the Dockerfile + may not be the user that is actually used to run the tool. + If the tool must run as the image's user, use the `--no-match-user` cwltool flag + to disable passing the current user ID to `docker run --user`. + +- Keep your container images as small as possible; + this speeds up downloads and consumes less storage space. + Also, when using bioinformatics tools, reference data should be supplied externally + (as workflow inputs), rather than included in the container image. + This way, it is easier to update the reference data without the need to rebuild the Docker image. 
+ +- Avoid using the `ENTRYPOINT` instruction in your Dockerfile + because it changes the command line that runs inside the container. + This can cause confusion when the command line that is supplied to the container + and the command that actually runs are different. + +- Docker's build cache can save you time during development by + reusing the result of a previous command and its base layer, instead of running it again. + However, this can also cause problems if a file being downloaded changes + but the command remains the same. In that case, the cached version of the file will be used + instead of the updated one. To avoid this, pass the `--no-cache` option to `docker build` to force Docker to re-run the steps. + + To learn more about creating workflows with Docker, + see this [tutorial](https://doc.arvados.org/rnaseq-cwl-training/08-supplement-docker/index.html). [containers]: https://doi.org/10.12688/f1000research.15140.1 [apache-license]: https://spdx.org/licenses/Apache-2.0.html
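
The recommendations added by this patch series could be combined into a Dockerfile along the following lines. This is only a minimal sketch: the base image (`python:3.11-slim`), the package (`numpy`, standing in for the tool you actually need), its version, and the image name in the comments are assumptions chosen for illustration and do not come from the patch.

```dockerfile
# Pin the base image to an explicit tag rather than the implicit "latest".
# (python:3.11-slim is an assumed example base image.)
FROM python:3.11-slim

# Pin the exact version of the software so rebuilds are reproducible.
# numpy==1.26.4 is a stand-in for the tool your workflow needs;
# --no-cache-dir keeps pip's download cache out of the image to keep it small.
RUN pip install --no-cache-dir numpy==1.26.4

# Intentionally no USER instruction: by default cwltool passes the current
# user ID to `docker run --user`, which would override it anyway.

# Intentionally no ENTRYPOINT instruction: the command line supplied to the
# container should be the command that actually runs.

# Intentionally no reference data: supply it as a workflow input instead.

# Hypothetical build command with a meaningful tag, bypassing the build cache:
#   docker build --no-cache -t example/mytool:1.0 .
```

Keeping the file free of `USER` and `ENTRYPOINT` means the container behaves the same whether or not cwltool's default user matching is in effect, and the explicit tags on both the base image and the built image make it possible to tell builds apart later.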