Added function for computing sigma_k from dataset #460
Conversation
Regarding #3 (what to do with partitions that don't contain a given condition pair): I think this is not an actual problem; any partitions that don't contain a given pair of conditions simply won't count towards the sigma_k tally for that condition pair. Unless others object, I'm fine with no warning here.

Regarding #4 (what if a partition has repeats of a condition): here I really can't think of a sensible way to do it (or, again, whether this even makes sense). A warning sounds reasonable, but I'm not sure what the behavior should be in such cases: should we omit such partitions from the sigma_k calculation (as with #3), or should we do some kludge (e.g. average the repeats for a condition as described above)? Leaning towards the former, just to avoid adding bias to the calculation.

Regarding #6: sounds good, will do.
I am with you both on all points, so I won't repeat them. This seems well placed where it is now and like a good addition to have. Once you add the basic tests, I think we can just merge this one.

On 3.: I think resolving this properly would be a major step, and you would need many cases where conditions are repeated within the same fold to do it successfully. The reason is that we are currently mixing the (co-)variance between the folds and the (co-)variance among measurements within the same fold. If we see at most one observation per condition and fold, there is no way to separate these two types of variability. Once we have multiple observations in a fold, this has a strong influence though: if all the variability were between folds, for example, all these measurements should be the same; if all of it were noise within the folds, they would be as different from each other as the ones in separate folds. As we do not deal with the two types of noise anywhere else, I don't think we want to do things like that here.

So, as my conclusion: averaging the patterns within each fold actually appears to be the right thing to do, as it matches our later analyses, like crossvalidated distances, where we also average per fold. We cannot really take a more complicated model of the noise into account anyway, and this procedure should be good enough for all cases where the number of repetitions per condition in a fold does not vary too much.
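To restate the two variance sources in one line (my own gloss on the argument above, not part of the original comment): if a condition is measured $r$ times within a fold, the variance of its fold-average pattern is roughly

$$
\mathrm{Var}\left(\bar{b}_{\text{fold}}\right) = \sigma^2_{\text{between}} + \frac{\sigma^2_{\text{within}}}{r},
$$

so with $r = 1$ in every fold the two terms only ever appear as a sum and cannot be separated; only folds with $r > 1$ make them separately identifiable.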
Hi Heiko, thanks so much for the careful breakdown! I'll go with averaging condition repetitions within folds then, and add the tests.
Okay, I made the final tweaks and added a new test class, `TestSigmaK`, to `test_data.py`. It tests the output shape, and also how precisely the function recovers the ground-truth sigma_k from simulated data, both in the case where every condition appears once in every partition and in the case where a condition sometimes doesn't appear in a partition. The tests pass on my workstation. Let me know if anything else is needed before closing out this PR.
It looks like the test fails sporadically (outliers?). Maybe we could compare, say, the median or the 90th percentile of the difference between the true and the estimated values. I can set this up on Wednesday.
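Something along these lines is presumably what is meant; the helper name, quantile, and tolerance below are invented for illustration, not taken from the test suite:

```python
import numpy as np

def assert_sigma_k_close(true_sigma_k, est_sigma_k, quantile=0.9, tol=0.1):
    """Robust comparison: require that most entries are close, so that a few
    outlier entries do not make the test fail sporadically."""
    diff = np.abs(np.asarray(est_sigma_k) - np.asarray(true_sigma_k))
    assert np.quantile(diff, quantile) < tol
```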
The line of code that randomly discards trials (for testing whether the function still works when not every condition is in every fold) didn't have a fixed random seed. I've now added one for reproducibility, so the test should pass every time.
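For illustration only (the labels and the discard fraction are made up, not the test's actual code), the fix amounts to seeding the generator that decides which trials to drop:

```python
import numpy as np

# illustrative simulated trial labels, standing in for the real test data
n_trials = 120
conditions = np.tile(np.arange(6), 20)
partitions = np.repeat(np.arange(4), 30)

rng = np.random.default_rng(0)        # fixed seed: the same trials are dropped on every run
keep = rng.random(n_trials) > 0.2     # randomly discard roughly 20% of trials
conditions, partitions = conditions[keep], partitions[keep]
```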
I have added a function, `sigmak_from_measurements`, to `noise.py`. The relevant considerations were as follows:
1. `noise.py` seemed like the right home for this function, but I'm open to putting it elsewhere. I'm not sure whether adding more sigma_k functionality there could cause confusion: across most of the toolbox, "noise" refers to the channel-by-channel precision matrix, whereas sigma_k also measures noise in the data, albeit between conditions rather than channels. Is this a nomenclature issue we wish to resolve, or do folks think it'll be fine?
2. I used the formula from Diedrichsen et al. (2016), specifically the one that assumes a constant sigma_k across partitions. There is a more complex formula for the case where sigma_k varies across partitions; this may be challenging to implement, first because it requires an estimate of the temporal autocorrelation of the BOLD signal, and second because I think other functionality in the toolbox assumes an unchanging sigma_k. That said, if we did add a function for this, it could sit nicely in parallel to `cov_from_measurements` and `cov_from_residuals` for computing the voxel covariance.
3. One complication is that not every partition necessarily contains every condition. To deal with this, I compute a separate sigma_k for each partition (leaving NaN entries for condition pairs that don't appear), take a nansum over the individual sigma_ks, and divide each entry by M-1, where M is the number of partitions in which both conditions are present. I think this is the kosher way of handling it, though let me know if I am wrong (a rough sketch of this aggregation appears after this list).
4. Another wrinkle is the case where a condition appears more than once in a partition. To handle this, I currently compute the mean pattern of that condition within the partition before computing that partition's sigma_k. This method is definitely wrong, since it will obviously reduce the variance. Is there a kosher way to do this (indeed, does it even make sense)? The averaging step is also sketched after this list.
5. The function takes an `obs_descriptor` and a `cv_descriptor`. Currently, the toolbox notation uses the variable names `obs_desc` and `cv_descriptor`, so "desc" in one and "descriptor" in the other. Minor point, but any preferences for which to go with? They are user-facing argument names, so they won't be invisible to users.
6. I've done simulations, and the code at least seems to work in the case where there is no more than one repeat of a condition per crossvalidation fold (that is, avoiding the wrinkle in 4): the code reproduces the ground-truth sigma_k. It also still works in cases where not every condition appears in every partition. Will we want unit tests for this, and if so, what tests do we want?
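A hedged sketch of the averaging step from point 4 (hypothetical names; the PR's actual implementation may well differ): each partition is collapsed to one pattern per condition, averaging repeats and leaving NaN where a condition is absent.

```python
import numpy as np

def fold_means_with_averaging(patterns, conditions, partitions, cond_labels):
    """Collapse each partition to one pattern per condition.

    patterns:   (n_obs, n_channels) measured patterns
    conditions: (n_obs,) condition label of each observation
    partitions: (n_obs,) partition (fold) label of each observation
    Returns an (n_partitions, n_conditions, n_channels) array with NaN rows
    for conditions that do not appear in a partition.
    """
    part_labels = np.unique(partitions)
    out = np.full((len(part_labels), len(cond_labels), patterns.shape[1]), np.nan)
    for pi, part in enumerate(part_labels):
        for ci, cond in enumerate(cond_labels):
            sel = (partitions == part) & (conditions == cond)
            if sel.any():
                # repeats of a condition within a fold are simply averaged
                out[pi, ci] = patterns[sel].mean(axis=0)
    return out
```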
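And a minimal sketch of the aggregation from point 3, operating on the per-partition condition means produced above; the function name and the exact normalization follow my reading of the description and are not taken from the PR's code:

```python
import numpy as np

def aggregate_sigma_k(fold_means):
    """Combine per-partition sigma_k estimates, ignoring missing conditions.

    fold_means: (n_partitions, n_conditions, n_channels), with NaN rows where
    a condition is absent from a partition.
    """
    n_channels = fold_means.shape[2]

    # deviation of each partition's patterns from the across-partition mean
    grand_mean = np.nanmean(fold_means, axis=0)
    dev = fold_means - grand_mean

    # per-partition sigma_k: channel-averaged outer products of deviations;
    # pairs involving a missing condition stay NaN for that partition
    per_partition = np.einsum('mik,mjk->mij', dev, dev) / n_channels

    # NaN-aware sum over partitions, divided entrywise by M - 1, where M is
    # the number of partitions in which both conditions of a pair appear
    # (edge cases with M <= 1 are not handled in this sketch)
    m = (~np.isnan(per_partition)).sum(axis=0)
    return np.nansum(per_partition, axis=0) / (m - 1)
```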
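For point 6, the kind of simulation described could look roughly like this (sizes and seed are invented; `aggregate_sigma_k` is the illustrative sketch above, not the actual `sigmak_from_measurements`): draw partition patterns with a known condition covariance and check that the estimate recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cond, n_chan, n_part = 5, 200, 50

# a known ground-truth sigma_k (condition-by-condition noise covariance)
a = rng.standard_normal((n_cond, n_cond))
true_sigma_k = a @ a.T / n_cond

# each partition = fixed true patterns + noise with covariance sigma_k across conditions
signal = rng.standard_normal((n_cond, n_chan))
chol = np.linalg.cholesky(true_sigma_k)
fold_means = np.stack([signal + chol @ rng.standard_normal((n_cond, n_chan))
                       for _ in range(n_part)])

est = aggregate_sigma_k(fold_means)       # aggregation sketch from above
print(np.abs(est - true_sigma_k).max())   # should shrink as n_part and n_chan grow
```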