Create workflow_overview.md #83

Open · wants to merge 1 commit into `master`
57 changes: 57 additions & 0 deletions workflow_overview.md

# 📄 General Overview of Project Workflow
**Review comment:**

I appreciate that you noticed and kept the emojis here! There's a small glitch with Sphinx: all the top-level pages need to have their emojis wrapped like this to get them to align properly.

Suggested change:
```diff
- # 📄 General Overview of Project Workflow
+ # <span>📄</span> General Overview of Project Workflow
```


Once your [onboarding](https://intranet.neuro.polymtl.ca/onboarding/README.html) is complete, you will be ready to tackle your project!
**Review comment from @kousu (Dec 8, 2022):**

I try to use relative links everywhere, because the domain this is on is not promised to last forever. So

Suggested change:
```diff
- Once your [onboarding](https://intranet.neuro.polymtl.ca/onboarding/README.html) is complete, you will be ready to tackle your project!
+ Once your [onboarding](onboarding) is complete, you will be ready to tackle your project!
```

or maybe that doesn't work, tbh I forget how sphinx handles folder links. Maybe better to be explicit:

Suggested change:
```diff
- Once your [onboarding](https://intranet.neuro.polymtl.ca/onboarding/README.html) is complete, you will be ready to tackle your project!
+ Once your [onboarding](onboarding/README.md) is complete, you will be ready to tackle your project!
```

Note that the build process -- sphinx -- is smart enough to map .md to .html in the final build, but by keeping the link as a .md, if you've done the relative links right, GitHub's markdown renderer will render the whole site as a fully functional backup. Take a gander at https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/4f322012f35de2f170111e325110431bda94a7d6/workflow_overview.md: you will see that in this version this link is broken. Even once this is published to the live site, I'd still count that link as broken in spirit 😲 because it'd go to an external site, and so the docs wouldn't be self-contained.

You can double-check how your work will render in the end, with our theme and all the links fixed up properly. It's a bit more work, but you can just check out this branch on your computer and follow these instructions to build the docs:

```
pip install .[sphinx]
make html
```

They will end up in `_build/html/`.

(These are the same commands that GitHub uses to publish the live copy of the site.)


## 🖥️ Setting up 🖥️

**Step 1.**
* Make sure that your VPN connection is established or that you are connected to the Polytechnique wifi.

**Step 2.**
* Log in to one of the available [Neuropoly compute nodes](https://intranet.neuro.polymtl.ca/computing-resources/neuropoly/README.html):
```
ssh <POLYGRAMES_USERNAME>@<STATION>.neuro.polymtl.ca
```
**Review comment on lines +8 to +15:**

This duplicates the instructions at:

> ### SSH (command line)
> Once the VPN connection is established, connect via ssh using the `STATION` you want:
> ```bash
> ssh <POLYGRAMES_USERNAME>@<STATION>.neuro.polymtl.ca
> ```

I think those instructions could be clearer -- it would be nice if that page were broken up -- but in any case it's best not to duplicate the work.


**Step 3.**
* Create your project working directory:
```
cd data_nvme_<POLYGRAMES_USERNAME>
```

**Review comment:**

Most systems don't have `/mnt/nvme` or `~/data_nvme_$USER`. That's really just a hack we added for joplin at some point, and it's already documented here:

> | **Hostname** | `joplin.neuro.polymtl.ca` |
> For fast I/O, use the NVMe hard drive, which is automatically available: `~/data_nvme_$USER`

I would say there's no need to specify this. Anywhere someone has access that has enough space is fine, and if you just don't mention it then most people will by default end up working in their home directories, which should work perfectly well on most systems, at least to start out. The big gotcha is that combining duke and git is a bad idea, but I think documenting that is probably out of scope of this.

```
mkdir <PROJECT_NAME>
cd <PROJECT_NAME>
```
**Review comment on lines +21 to +22:**

We can help people out by being explicit here:

Suggested change:
```diff
  mkdir <PROJECT_NAME>
  cd <PROJECT_NAME>
+ git init
```


**Step 4. Developing version-controlled software**
* Ideally, you are working on code in a GitHub repository (either a branch of an existing repo, or a new one that you created).
* After adding your NeuroPoly workstation [SSH key to your GitHub account](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account?platform=linux), you are ready to make a local clone of that remote repository:
```
cd data_nvme_<POLYGRAMES_USERNAME>/<PROJECT_NAME>
git clone -b "<YOUR_WORKING_BRANCH>" [email protected]:<REPOSITORY>.git
```

**Review comment:**

I think we want everything we do to be tracked by git by default. Things that should not be under git, like private data or test files, can either go in `/tmp` or can be added to `.gitignore` explicitly. That is to say, just have `<PROJECT_NAME>/`, not `<PROJECT_NAME>/<REPOSITORY>/`.

Then, we can either `git init` locally and `git push`, or do the reverse and click the New Repo button on GitHub (which runs `git init` on their side) and then `git clone`. Either way works, and GitHub gives helpful pointers to guide people through either workflow; once you've done either two or three times they become obvious, and most developers don't bother with the pointers. (A sketch of the init-and-push route follows below.)

We already have a page that's supposed to cover this, and while it needs a lot of love at the moment, it'd be better to work these tips into it than repeat them here. Can we arrange it so this section is just a link to that page? Maybe that page needs to get some edits in this PR as well, to bring it up to speed with our current practices and/or to slim out the advice we don't use.
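For illustration, a minimal sketch of the init-locally-then-push workflow mentioned above (the GitHub username and repository name are placeholders):

```bash
# inside the project directory created in Step 3
git init
git add .
git commit -m "Initial commit"
# after creating an empty repo on GitHub:
git remote add origin [email protected]:<YOUR_GITHUB_USERNAME>/<REPOSITORY>.git
git push -u origin master
```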


**Step 5. The data**
* It is critical to make sure that you know what data you are working with.
* Ideally, it should be in [BIDS](https://bids-specification.readthedocs.io/en/stable/) format on the [`data.neuro`](https://intranet.neuro.polymtl.ca/data/git-datasets.html) storage node: `data.neuro:datasets/<PROJECT_DATASET>`.
**Review comment from @kousu (Dec 8, 2022):**

Ditto with the relative links. Also I'd shorten the BIDS link to be safer against link rot -- their domain will change less frequently than their subfolders, hopefully.

Suggested change:
```diff
- * Ideally, it should be in [BIDS](https://bids-specification.readthedocs.io/en/stable/) format on the [`data.neuro`](https://intranet.neuro.polymtl.ca/data/git-datasets.html) storage node: `data.neuro:datasets/<PROJECT_DATASET>`.
+ Ideally, it should be in [BIDS](https://bids-specification.readthedocs.io) format. We have many of these on the private [`data`](data/git-datasets.md) server.
```

**Review comment from @kousu (Dec 9, 2022):**

We should also think about the workflow for combining datasets. In principle it's easy to include multiple datasets -- each of them is just a different folder, and we can do that with git submodules or almost any other tool. We should think about how to leave the door open to this and avoid tricking people into thinking that there's a 1-to-1 mapping between analyses and datasets, while remembering that the 1-to-1 situation is the common case. A sketch of what that might look like follows.
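For instance, a minimal sketch of combining two datasets as submodules (the dataset names are placeholders):

```bash
# each dataset becomes its own folder inside the project repo
git submodule add [email protected]:datasets/<DATASET_ONE>
git submodule add [email protected]:datasets/<DATASET_TWO>
git commit -m "Add project datasets as submodules"
```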

* Thanks to `git annex`, the following command will copy the directory structure and some small files of your dataset on `data.neuro`:
**Review comment from @kousu (Dec 8, 2022):**

This line doesn't use git annex, it's just git. You're right that this will copy the structure and small files, and that's because we store small files in git, because that's what it's better at.

Of course, git annex is important too. You mention it below, but I think it makes more sense to have the third line here: `cd <PROJECT_DATASET> && git annex get`.

There is also the subtlety to consider that you can also request specific subsets of the images, like `git annex get sub-001 sub-002 derivatives/sub-001`. But for an overview I would stick with the most basic command (sketched below).

So can this become a link to the relevant existing pages instead?
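As a sketch, the basic pattern described above (the dataset path is a placeholder):

```bash
cd <PROJECT_DATASET>
git annex get .                                    # download everything
git annex get sub-001 sub-002 derivatives/sub-001  # or just a subset
```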

```
cd data_nvme_<POLYGRAMES_USERNAME>/<PROJECT_NAME>
git clone [email protected]:datasets/<PROJECT_DATASET>
```

**Review comment:**

This is the most interesting part to me. This is the part where we might address neuropoly/data-management#136.

One option here would be:

Suggested change:
```diff
- git clone [email protected]:datasets/<PROJECT_DATASET>
+ git submodule add [email protected]:datasets/<PROJECT_DATASET>
```

This would be the Datalad YODA recommendation. They don't use the word "submodule" on that page but it is what they have in mind.

We can also do:

Suggested change:
```diff
- git clone [email protected]:datasets/<PROJECT_DATASET>
+ git submodule add -b v1.0.3 [email protected]:datasets/<PROJECT_DATASET>
```

to pick out a specific older version (in this case `v1.0.3`).
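And if the pin needs to move later, a sketch of bumping the submodule to a newer dataset tag (the tag name is hypothetical):

```bash
cd <PROJECT_DATASET>
git fetch --tags
git checkout v1.0.4   # hypothetical newer tag
cd ..
git add <PROJECT_DATASET>
git commit -m "Update dataset to v1.0.4"
```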

Then when you `git push` the code to GitHub, it looks like this:

*[Screenshot (2022-12-08): the kousu/proj1 repository page on GitHub]*

The dataset folder is unclickable because it's tagged as a submodule reference, and moreover it's a submodule that is on the private, intentionally-inaccessible, storage node:

*[Screenshot (2022-12-08): the submodule entry on GitHub, shown as an unclickable reference]*

but if I'm on one of our internal processing nodes with permission to the datasets, I can reproduce my entire project with `git clone --recurse-submodules`:

```
p115628@bireli:~/src$ git clone --recurse-submodules https://github.com/kousu/proj1
Cloning into 'proj1'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 6 (delta 0), pack-reused 0
Receiving objects: 100% (6/6), done.
Submodule 'canproco' ([email protected]:datasets/canproco) registered for path 'canproco'
Cloning into '/home/GRAMES.POLYMTL.CA/p115628/src/proj1/canproco'...
remote: Enumerating objects: 51098, done.
remote: Counting objects: 100% (51098/51098), done.
remote: Compressing objects: 100% (27474/27474), done.
remote: Total 51098 (delta 22242), reused 39149 (delta 14297), pack-reused 0
Receiving objects: 100% (51098/51098), 4.44 MiB | 11.12 MiB/s, done.
Resolving deltas: 100% (22242/22242), done.
Submodule path 'canproco': checked out '42d424b0b56b9269fe0c5058130f5e3bc7c9e941'
p115628@bireli:~/src$ cd proj1/
p115628@bireli:~/src/proj1$ ls canproco/
dataset_description.json  sub-cal135  sub-cal192  sub-edm086  sub-edm183  sub-mon109  sub-mon174  sub-tor031  sub-tor103  sub-van111  sub-van176
derivatives               sub-cal136  sub-cal194  sub-edm087  sub-edm184  sub-mon111  sub-mon175  sub-tor032  sub-tor106  sub-van112  sub-van177
participants.json         sub-cal137  sub-cal195  sub-edm088  sub-mon001  sub-mon113  sub-mon176  sub-tor033  sub-tor107  sub-van116  sub-van178
participants.tsv          sub-cal138  sub-cal197  sub-edm089  sub-mon002  sub-mon118  sub-mon180  sub-tor035  sub-tor109  sub-van123  sub-van180
sub-cal056                sub-cal140  sub-cal198  sub-edm094  sub-mon003  sub-mon119  sub-mon181  sub-tor036  sub-tor110  sub-van124  sub-van181
sub-cal072                sub-cal142  sub-cal199  sub-edm095  sub-mon004  sub-mon121  sub-mon183  sub-tor037  sub-tor112  sub-van125  sub-van182
sub-cal073                sub-cal143  sub-cal200  sub-edm098  sub-mon005  sub-mon124  sub-mon185  sub-tor038  sub-tor114  sub-van129  sub-van183
sub-cal078                sub-cal144  sub-cal201  sub-edm105  sub-mon006  sub-mon125  sub-mon186  sub-tor039  sub-tor115  sub-van131  sub-van184
sub-cal080                sub-cal145  sub-cal202  sub-edm107  sub-mon007  sub-mon126  sub-mon187  sub-tor040  sub-tor118  sub-van133  sub-van185
sub-cal083                sub-cal146  sub-cal206  sub-edm113  sub-mon009  sub-mon129  sub-mon189  sub-tor041  sub-tor121  sub-van134  sub-van186
sub-cal084                sub-cal149  sub-cal207  sub-edm114  sub-mon010  sub-mon131  sub-mon190  sub-tor043  sub-tor123  sub-van135  sub-van189
sub-cal085                sub-cal150  sub-cal209  sub-edm118  sub-mon011  sub-mon132  sub-mon191  sub-tor044  sub-tor124  sub-van136  sub-van191
sub-cal088                sub-cal151  sub-cal210  sub-edm123  sub-mon013  sub-mon133  sub-mon192  sub-tor049  sub-tor125  sub-van137  sub-van192
```

which is pretty cool! And a lot more reliable than most other reproduction methods, which say "go dig up this DOI and try to find the matching dataset on Zenodo, and then the paper, and some code that went with the paper on some obscure university FTP site" (and don't say, but should, "and make sure you're running such and such a version of such and such an OS and using such and such a version of python and using such and such a version of nvidia's GPU hardware"...)

**Review comment:**

@sandrinebedard gave me a tour of how she handled https://github.com/sct-pipeline/ukbiobank-spinalcord-csa last year.

In short, she wrote a detailed process guide in:

https://github.com/sct-pipeline/ukbiobank-spinalcord-csa/blob/2f2cec3b91294153a635a8f725c0fbc749d172ca/README.md?plain=1#L42-L43

https://github.com/sct-pipeline/ukbiobank-spinalcord-csa/blob/2f2cec3b91294153a635a8f725c0fbc749d172ca/README.md?plain=1#L94-L100

https://github.com/sct-pipeline/ukbiobank-spinalcord-csa/blob/2f2cec3b91294153a635a8f725c0fbc749d172ca/README.md?plain=1#L111-L120

While explaining, she realized some parts were left undocumented: she set up a conda environment to avoid surprise breakage (which may cause different results, since `requirements.txt` doesn't specify versions), and there is a git tag named `r20210928` attached to the precise dataset she analysed:

```
p115628@joplin:~/datasets/uk-biobank-processed$ git remote -v
origin  [email protected]:datasets/uk-biobank-processed (fetch)
origin  [email protected]:datasets/uk-biobank-processed (push)
p115628@joplin:~/datasets/uk-biobank-processed$ git log HEAD~..
commit 7ed28cd0aacaab0ce1570bccdba3bff495b5f496 (HEAD -> master, tag: show, tag: r20210928, origin/master, origin/HEAD)
Author: Sandrine Bedard <[email protected]>
Date:   Sun Sep 5 17:49:04 2021 -0400

    change suffix in derivatives from labels-manual to labels-disc-manual
```

but she hasn't documented that that's the tag to go with the project. Otherwise her procedure fits option 3: write down all the steps and expect that people will read all your docs and follow all your instructions.

Maybe you can already tell, but I'm leaning against option 3. I think we need something more automated, because if there's one thing I know about human-computer interaction it's that people don't read instructions. I don't know if `git submodule` (option 1) is the right answer, but I think we need something automated. And we will know we have it when we can run a weekly cronjob on each of our papers that reproduces the final paper, including all figures and statistical analyses, from just the source code and access to the data server. Something like the sketch below.
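Purely as an illustration of that cronjob idea, a hypothetical sketch (every name here -- the repo, the dataset folder, the `make` target -- is made up):

```bash
#!/bin/sh
# hypothetical weekly job: rebuild a paper from scratch on an internal node
set -e
git clone --recurse-submodules [email protected]:<LAB>/<PAPER_REPO>.git
cd <PAPER_REPO>/<PROJECT_DATASET>
git annex get .        # fetch the imaging data from the internal server
cd ..
make paper             # hypothetical target that regenerates figures and stats
```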

**Review comment:**

Oh, perfect. I just poked around a bit and noticed that you were trying to reproduce this project: https://github.com/NadiaBlostein/Chemical_Structure_Reconstruction

The very first thing it says, pretty much, is:

> 1. Download the Mordred compound sets (1 through 3) and place them in a `data/` folder

I have no idea where to get the Mordred compound sets (1 through 3) or what version (if any) of them I'm supposed to get, and I don't know if they assume the `data/` folder is supposed to be in the same folder as the code, or if they mean the current working directory (reading it in the Unix convention, I would take it as the current working directory!), which isn't necessarily the same.

And:

> 1. Run the main function in `src/thresholding.py`.

There is no main function in that file. As a programmer I can read the code and interpret that they must mean "run `python src/thresholding.py`", but when working with a large series of projects it would be impossible to track down and correct every small gap like that.

Their repo also depends on pandas and some other third-party Python packages, but doesn't declare a `setup.py` or `requirements.txt` or a conda environment or anything. It doesn't even say what version of Python it was written against. All of that is key to write down for reproducibility.

Those are some prime examples of the mistakes I want us to be able to avoid.
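For what it's worth, a quick sketch of capturing those versions (run inside the project's own environment):

```bash
python --version                 # record this in the README
pip freeze > requirements.txt    # pin the exact package versions in use
```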


## 🌊 Workflow 🌊

### ⌨️ Code
Any changes you make to the code should be added in small commits and pushed to your GitHub branch.
**Review comment:**

Good advice! But again, better kept in the existing documentation.


### 💿 Data
* If you need to access your data files directly, you can use `git annex` to download the larger files to the [Neuropoly computer](https://intranet.neuro.polymtl.ca/computing-resources/neuropoly/README.html) you are working from:
```
cd data_nvme_<POLYGRAMES_USERNAME>/<PROJECT_NAME>/<PROJECT_DATASET>
```

**Review comment:** Ditto on `data_nvme_$USER`.

```

git annex get .
```
* However, in order to save space, make sure to "undownload" those big files once you are done working with them with:
```
git annex drop .
```
**Review comment on lines +53 to +56:**

This is a good tip, but I think it would fit better as a new subsection over in https://intranet.neuro.polymtl.ca/data/git-datasets.html#drop.

* Any data derivatives that you output should be added to `data.neuro:datasets/<PROJECT_DATASET>` according to the [BIDS](https://bids-specification.readthedocs.io/en/stable/) data standard! More documentation on how to version control your data on `data.neuro` can be found [here](https://intranet.neuro.polymtl.ca/data/git-datasets.html#update).
**Review comment:**

Ditto on the absolute link: better to make it relative.