
Conversation

@johnmarktaylor91
Collaborator

I have added a function, sigmak_from_measurements, to noise.py. The relevant considerations were as follows:

  1. noise.py seemed like the right home for this function, but I'm open to putting it elsewhere. One concern is whether adding more sigma_k functionality could cause confusion: across most of the toolbox, "noise" refers to the channel-by-channel precision matrix, but sigma_k also measures noise in the data (albeit between conditions rather than channels). Is this a nomenclature issue we wish to resolve, or do folks think it'll be fine?

  2. I used the formula from Diedrichsen et al. (2016), specifically the one that assumes a constant sigma_k across partitions. There is a more complex formula for the case where sigma_k varies across partitions; that may be challenging to implement, first because it requires an estimate of the temporal autocorrelation of the BOLD signal, and second because I think other functionality in the toolbox assumes an unchanging sigma_k. That said, if we did add a function for it, it could sit nicely in parallel with cov_from_measurements and cov_from_residuals for computing the voxel covariance. (A rough sketch of how a constant-sigma_k estimator could look follows this list.)

  3. One complication is that not every partition necessarily contains every condition. To deal with this, I compute a separate sigma_k for each partition (leaving NaN entries for condition pairs that don't appear in it), take a nansum over the individual sigma_k matrices, and divide each entry by M-1, where M is the number of partitions in which both conditions of that pair are present. I think this is the kosher way of handling it, though let me know if I am wrong.

  4. Another wrinkle is cases where a condition appears more than once in a partition. To handle such cases, currently I first compute the mean pattern of that condition for the partition before computing the sigma_k for that partition. This method is definitely wrong, since this will obviously reduce the variance. Is there a kosher way to do this (indeed, does it even make sense)?

  5. The function takes an obs_descriptor and a cv_descriptor. Currently the toolbox uses the variable names obs_desc and cv_descriptor, i.e. "desc" in one and "descriptor" in the other. Minor point, but any preferences for which to go with? These are user-facing argument names, so they won't be invisible to users.

  6. I've run simulations, and the code at least seems to work for the case where there is no more than one repeat of a condition per cross-validation fold (that is, setting aside the wrinkle in 4): it reproduces the ground-truth sigma_k. It also still works when not every condition appears in every partition. Will we want unit tests for this, and if so, what tests do we want?
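
For concreteness, here is a minimal numpy sketch of the estimator described in points 2-4. The function name, argument layout, and the per-channel normalisation are assumptions made for illustration only; this is not the actual sigmak_from_measurements interface.

```python
import numpy as np

def sigmak_sketch(measurements, conditions, partitions):
    """Rough sketch of a constant-sigma_k estimator (illustrative only).

    measurements : (n_obs, n_channels) array of response patterns
    conditions   : (n_obs,) condition label per observation
    partitions   : (n_obs,) partition / fold label per observation
    """
    conds = np.unique(conditions)
    parts = np.unique(partitions)
    n_cond, n_chan = len(conds), measurements.shape[1]

    # One pattern per (partition, condition); repeats within a partition are
    # averaged, and conditions missing from a partition stay NaN.
    patterns = np.full((len(parts), n_cond, n_chan), np.nan)
    for i, p in enumerate(parts):
        for j, c in enumerate(conds):
            rows = (partitions == p) & (conditions == c)
            if rows.any():
                patterns[i, j] = measurements[rows].mean(axis=0)

    # Deviations of each partition's patterns from the condition means
    # taken across partitions.
    deviations = patterns - np.nanmean(patterns, axis=0, keepdims=True)

    # Per-partition condition-by-condition (co)variance, averaged over
    # channels; pairs involving a missing condition are set back to NaN.
    dev0 = np.nan_to_num(deviations)
    per_part = np.einsum('mkp,mlp->mkl', dev0, dev0) / n_chan
    missing = np.isnan(deviations).any(axis=2)              # (M, K)
    pair_missing = missing[:, :, None] | missing[:, None, :]
    per_part[pair_missing] = np.nan

    # Sum over partitions, then divide each entry by M - 1, where M is the
    # number of partitions in which both conditions of the pair are present
    # (the maximum guards against division by zero for degenerate pairs).
    m_counts = (~pair_missing).sum(axis=0)
    return np.nansum(per_part, axis=0) / np.maximum(m_counts - 1, 1)
```

When every condition appears exactly once in every partition, the NaN handling is inert and this reduces to an ordinary sample covariance across partitions, averaged over channels.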

@JasperVanDenBosch
Contributor

  1. I think this fits well in noise.py. Function naming and docstrings should clarify the distinction.
  2. Best to go for this implementation now, and can add more later.
  3. Seems reasonable. Should we raise a Warning in such cases?
  4. (same as 3.)
  5. I vote for the non-abbreviated variant.
  6. If you still have the simulation code; I'd say make a test for each of these two cases. Generate the data (with dimensions as small as makes sense), call sigmak_from_measurements, sample some values and test that they are as expected.
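
Along these lines, one possible shape for the first of the two tests, written against the sigmak_sketch stand-in from the earlier sketch because the exact sigmak_from_measurements call signature isn't shown in this thread; the dimensions, seed, and tolerance below are arbitrary:

```python
import numpy as np

def test_sigmak_recovery():
    # Simulate data whose between-partition noise covariance is a known
    # sigma_k, then check that the estimate recovers it reasonably well.
    rng = np.random.default_rng(0)
    n_cond, n_part, n_chan = 4, 200, 30
    true_sigma_k = np.diag([1.0, 2.0, 0.5, 1.5])
    chol = np.linalg.cholesky(true_sigma_k)

    signal = rng.standard_normal((n_cond, n_chan))
    measurements, conditions, partitions = [], [], []
    for p in range(n_part):
        # Noise is correlated across conditions according to true_sigma_k.
        noise = chol @ rng.standard_normal((n_cond, n_chan))
        measurements.append(signal + noise)
        conditions.append(np.arange(n_cond))
        partitions.append(np.full(n_cond, p))

    est = sigmak_sketch(np.concatenate(measurements),
                        np.concatenate(conditions),
                        np.concatenate(partitions))

    assert est.shape == (n_cond, n_cond)
    assert np.allclose(est, true_sigma_k, atol=0.3)
```

The second case (conditions missing from some partitions) could reuse the same generator and simply drop a few rows before the call.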

@johnmarktaylor91
Collaborator Author

Regarding #3 (what to do with partitions that don't contain a given condition pair): I think this is not an actual problem; any partitions that don't contain a given pair of conditions simply won't count towards the sigma_k tally for that condition pair. Unless others object I'm fine with no warning here.

Regarding #4 (what if a partition has repeats of a condition): Here I really can't think of a sensible way to do it (or again, whether this even makes sense). A warning sounds reasonable, but I'm not sure what the behavior should be in such cases: should we omit such partitions from the sigma_k calculation (as with #3), or should we do some kludge (e.g. average the repeats for a condition as described above)? Leaning towards the former just to avoid adding bias to the calculation.

Regarding #6: sounds good, will do.

@HeikoSchuett
Contributor

Hi @johnmarktaylor91

I am with you both on all points, so I won't repeat them. This seems well placed where it is now and like a good addition to have. Once you add the basic tests, I think we can just merge this one.

On 4.: I think resolving this properly would be a major step, and you would need many cases where conditions are repeated within the same fold to do it successfully. The reason is that we are currently mixing the (co-)variance between folds and the (co-)variance among measurements within the same fold. If we see at most one observation per condition and fold, there is no way to separate these two types of variability. Once we have multiple observations in a fold, this has a strong influence, though: if all the variability were between folds, for example, all these measurements would be the same; if all of it were noise within the folds, they would be as different from each other as the ones in separate folds. As we do not deal with the two types of noise anywhere else, I don't think we want to do things like that here.

So, as my conclusion: averaging the patterns within each fold actually appears to be the right thing to do, to match our later analyses like crossvalidated distances, where we average per fold too. We cannot really take a more complicated model of the noise into account anyway, and this procedure should be good enough for all cases where the number of repetitions per condition per fold does not vary too much.
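
A toy numpy illustration of the point about the two sources of variability; all numbers here are arbitrary:

```python
import numpy as np

# With R repeats per condition and fold, the variance of the fold means is
# var_between + var_within / R.  With R = 1 the two components only ever
# appear as their sum, so they cannot be separated; averaging repeats keeps
# the fold-level quantity that the later crossvalidated analyses use.
rng = np.random.default_rng(1)
var_between, var_within, n_folds = 1.0, 0.5, 200_000
for n_repeats in (1, 2, 5):
    fold_effect = rng.normal(0.0, np.sqrt(var_between), n_folds)
    repeats = fold_effect[:, None] + rng.normal(0.0, np.sqrt(var_within),
                                                (n_folds, n_repeats))
    fold_means = repeats.mean(axis=1)
    # Empirical variance of the fold means vs. the theoretical mixture.
    print(n_repeats, fold_means.var(ddof=1), var_between + var_within / n_repeats)
```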

@johnmarktaylor91
Collaborator Author

Hi Heiko, thanks so much for the careful breakdown! I'll go with averaging condition repetitions within folds then and add the tests.

@johnmarktaylor91
Collaborator Author

Okay, I made the final tweaks and added a new test class, TestSigmaK, to test_data.py. It tests the output shape, and also how precisely the function recovers the ground-truth sigma_k from simulated data, both in the case where every condition appears once in every partition and in the case where some conditions are missing from some partitions. The tests pass on my workstation. Let me know if anything else is needed before closing out this PR.

@JasperVanDenBosch
Contributor

It looks like the test fails sporadically (outliers?). Maybe we could compare, say, the median or the 90th percentile of the difference between the true and estimated values. I can set this up on Wednesday.
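
One way such a robust check could look; the names, seed, and thresholds below are placeholders, not the actual test values:

```python
import numpy as np

# Placeholder example of a robust comparison between true and estimated sigma_k.
true_sigma_k = np.eye(4)
estimated_sigma_k = true_sigma_k + np.random.default_rng(2).normal(0.0, 0.02, (4, 4))
err = np.abs(estimated_sigma_k - true_sigma_k)
assert np.median(err) < 0.05          # typical entries are close to the truth
assert np.percentile(err, 90) < 0.15  # only a small tail may deviate more
```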

@johnmarktaylor91
Collaborator Author

The line of code that randomly discards trials (for testing that the estimator still works when not every condition is in every fold) didn't use a fixed random seed. I've now added one for reproducibility, so the test should pass every time.
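
For reference, a seeded trial-dropping step along these lines could look like the sketch below; the variable names and values are made up here:

```python
import numpy as np

rng = np.random.default_rng(42)           # fixed seed, so the same trials are dropped every run
n_obs, drop_fraction = 120, 0.2
keep = rng.random(n_obs) > drop_fraction  # boolean mask selecting the trials to retain
```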
