
data format #1

Open · freeman-lab opened this issue Aug 1, 2016 · 7 comments

freeman-lab (Member) commented Aug 1, 2016

Opening a discussion for how to format both the input data and the results / submissions.

According to @philippberens, the raw data will be both calcium fluorescence and spike rates, each sampled at 100 Hz.

formatting the raw data

The raw data are basically just time series, continuous valued (for fluorescence) and possibly sparse (for spike rates). The key thing here is that the format is generic and easy to load in multiple environments. I kinda prefer csv files for simplicity, so long as they don't get too large. Then, for each dataset, we provide either a single csv file or two csv files, depending on whether it's training or testing. And we include example scripts in this repo for loading the data in Python, Matlab, and other languages.

training / testing

How many datasets / neurons do we have? If it's less than 10-20, it might be easiest to just treat each neuron as a separate "dataset", and pair them up so we have e.g. 00.00 and 00.00.test, then 00.01, 01.00, etc., where the first number is the source lab and the second number is the neuron.

formatting the results

Using JSON here is useful because it can easily be read and written in multiple environments (for comparison to ground truth), and is easily handled for web submissions. It's worked well so far in neurofinder for representing spatial regions.

The results are likely to be sparse in time, so one option would be a structure like this:

[
  {
    "dataset": "00.00.test",
    "time": [0, 10, 14, 100, ...],
    "rate": [1, 2, 1, 1.5, ...]
  },
...
]

For each dataset we basically have a sparse array: we store the times of all detected events and the corresponding numerical values. For algorithms that return binary events, we could assume that if no rate is specified, all values are 1.
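
A minimal sketch of how a scorer could read such a file back into dense arrays (the filename and trace length here are illustrative, and integer sample indices on the 100 Hz grid are assumed):

import json

import numpy as np

def densify(entry, n_samples):
    """Expand one sparse {time, rate} entry into a dense rate array."""
    dense = np.zeros(n_samples)
    times = np.asarray(entry["time"], dtype=int)
    # If no "rate" field is given, treat the entries as binary events with rate 1.
    rates = np.asarray(entry.get("rate", np.ones(len(times))))
    dense[times] = rates
    return dense

with open("submission.json") as f:  # illustrative filename
    submission = json.load(f)

rates = {entry["dataset"]: densify(entry, n_samples=10000) for entry in submission}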

philippberens (Collaborator) commented

@freeman-lab, I just shared a dropbox folder with all the data with you.

The datasets are numbered as in table 1 in Theis et al. 2016. For basic visualization, run the gettingstarted-notebook (which we prepared for a CRCNS submission of the training data).

I would go with the preprocessed versions, which are resampled to 100 Hz. I would rather store all data in dense arrays, as the calcium is dense anyway and the data isn't that large.

We have a total of 72 cells, with multiple segments for some of them, so 90 traces in total. Of these we allocated 32 to the test set. This allocation makes sure that multiple segments of the same cell end up entirely in either the training or the testing set, and that all previously shared data is in the training set.

The naming of the files is fine with me; I would use the first number for the dataset (referring to Table 1 in Theis et al. 2016) and the second for the cell/trace.

The JSON file format is fine with me. Do I read the documentation correctly that you basically supply a dict (or list) and then use dump, roughly like the sketch below? (https://simplejson.readthedocs.io/en/latest/)
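
A minimal sketch of that pattern (the standard-library json module has the same dump/load interface; names here are illustrative):

import json  # simplejson works identically here

# A submission is just a list of per-dataset dicts, as proposed above.
results = [
    {"dataset": "00.00.test", "time": [0, 10, 14, 100], "rate": [1, 2, 1, 1.5]},
]

with open("submission.json", "w") as f:
    json.dump(results, f)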

philippberens (Collaborator) commented

Ah, one more thing. For some of the datasets, we don't have exact spike times, so just providing spike times unfortunately is not an option.

freeman-lab (Member, Author) commented Aug 3, 2016

@philippberens great! I went through the datasets and example scripts, this is really, really well put together!

Given that the spike times won't be enough on their own, and that we'll use dense arrays, I now think CSV makes more sense for submissions than JSON. (My only preference for CSV over mat / pickle, btw, is that we can use one language-agnostic format instead of supporting two.)

And given that the size is small, we can just put all cells for each dataset in a single file. Can definitely use the same numbering as Table 1 from the paper.

In that case the full set should look like

1.train.calcium.csv
1.train.spikes.csv
2.train.calcium.csv
2.train.spikes.csv
...
1.test.calcium.csv
2.test.calcium.csv
...

where each csv file has a header row with the cell numbers (for that dataset), and the rows are the values at each time point (calcium or spikes, depending on the file), e.g.

1,2,3
3.123,0.151,8.123
1.972,0.195,8.519
1.412,5.012,4.123
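
For reference, a minimal sketch of loading this layout in Python (assuming pandas, which reads the header row as column names and fills any shorter columns with trailing NaNs):

import pandas as pd

calcium = pd.read_csv("1.train.calcium.csv")
spikes = pd.read_csv("1.train.spikes.csv")

# Each column is one cell's trace; drop the NaN padding of shorter traces.
trace = calcium["1"].dropna().values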

For submissions, we can just require that people make five spikes.csv files, one for each testing dataset, and submit all five at once. So a complete submission would basically be

1.test.spikes.csv
2.test.spikes.csv
3.test.spikes.csv
4.test.spikes.csv
5.test.spikes.csv
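
And a quick sanity check a submitter could run before uploading (a sketch, assuming predictions live on the same 100 Hz grid as the released calcium files):

import pandas as pd

# Compare each predicted spikes file against its calcium counterpart.
for i in range(1, 6):
    calcium = pd.read_csv(f"{i}.test.calcium.csv")
    spikes = pd.read_csv(f"{i}.test.spikes.csv")
    assert list(spikes.columns) == list(calcium.columns), f"dataset {i}: column mismatch"
    assert len(spikes) == len(calcium), f"dataset {i}: row-count mismatch"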

philippberens (Collaborator) commented

Fine with me; the columns will then have different lengths, but I guess that doesn't matter.
Do you want to convert the data or should I, @freeman-lab?

freeman-lab (Member, Author) commented

@philippberens I'm on it! Just about done actually; will post links here for you to download and check.

freeman-lab (Member, Author) commented

@philippberens ok, want to grab this and make sure it looks right?

https://s3.amazonaws.com/neuro.datasets/challenges/spikefinder/spikefinder.train.zip

While doing this I updated README.md and added an example.py script for loading in Python, which should be included in the zip, and is also in this repository. I confirmed that loading and plotting the first couple neurons from dataset 1 looked identical to what I got running your notebook.

I also have a spikefinder.test with just the calcium test data, and a spikefinder.test.private with the spike test data, which we'll of course keep private!

philippberens (Collaborator) commented

I agree; I checked a couple of examples as well and it looks good.
