
Review of codebase and docs - Probabilities and Encodings - Datseris #213

Merged: 36 commits into main on Dec 25, 2022

Conversation

@Datseris (Member) commented Dec 22, 2022

This is a review of all probabilities estimators except the Transfer Operator for v2. Once this PR is merged, all probabilities estimators except the Transfer Operator are "set in stone" for v2 and will not get any further changes. So it is important, @kahaaga, that once we merge this we both agree that it is "done" and commit to this doneness, so that we can actually get some v2 out at some point during our lifespans. Because if we don't commit, I am sure I will keep finding things to change every week, given my history with the repo...

Alright, here I outline the major changes I made:

  • Improved (in my opinion) headings and outline in the docs
  • Simplified and shortened several docstrings, mainly for the estimators that use encodings
  • Removed the entropy! method completely
  • Added an Encodings documentation page
  • Adjusted the symbolic encoding: it now operates on the elements (the vectors from which to estimate the permutations) and in general obeys the encoding API as declared
  • Complete re-write of the symbolic permutation estimators. It is amazing how little code they occupy now. They are all in one file, and they have the ordinal pattern encoding as a field. They share almost all their lines of code, apart from the 4 lines that define the weights of the permutations. The simplification is insane!!!
  • The spatial symbolic estimator also has the encoding as a field now, similarly to the others
  • Similar rewrite of the Dispersion estimator and the Gaussian encoding. The estimator now has the encoding as its field
  • Fixed all tests

@Datseris (Member Author) commented Dec 22, 2022

The name sortperm is used in the docstring of the OrdinalPattern encoding, but is not found anywhere in the docs. I'll probably re-think how all of this is exposed in this PR.

@codecov bot commented Dec 22, 2022

Codecov Report

Merging #213 (7f77993) into main (3de9294) will decrease coverage by 0.16%.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   84.63%   84.47%   -0.17%     
==========================================
  Files          49       45       -4     
  Lines        1191     1140      -51     
==========================================
- Hits         1008      963      -45     
+ Misses        183      177       -6     
Impacted Files Coverage Δ
src/encoding/rectangular_binning.jl 88.52% <ø> (ø)
src/entropy.jl 79.31% <ø> (+11.66%) ⬆️
src/probabilities.jl 80.76% <ø> (ø)
.../probabilities_estimators/dispersion/dispersion.jl 89.28% <ø> (ø)
...abilities_estimators/histograms/value_histogram.jl 87.50% <ø> (ø)
src/probabilities_estimators/spatial/utils.jl 94.11% <ø> (ø)
src/encoding/ordinal_pattern.jl 80.00% <75.00%> (-6.05%) ⬇️
src/encoding/gaussian_cdf.jl 61.53% <80.00%> (+4.39%) ⬆️
.../spatial_permutation/SpatialSymbolicPermutation.jl 70.27% <81.25%> (-0.19%) ⬇️
...mators/permutation_ordinal/symbolic_permutation.jl 94.00% <94.00%> (ø)
... and 3 more


@kahaaga (Member) commented Dec 22, 2022

The name sortperm is used in the docstring of the OrdinalPattern encoding, but is not found anywhere in the docs. I'll probably re-think how all of this is exposed in this PR.

Maybe just say Base.sortperm, and assume that a user would understand to look at Julia's Base docs?

@Datseris (Member Author)

aaaaaaaaaaaaaaaaah haha my bad. I didn't think about it and thought it was the internal function that does the Lehmer code thingy...

@Datseris (Member Author) commented Dec 22, 2022

@kahaaga I would like to remove entropy!(s, e::Entropy, p::ProbEst, x) from the API. It just calls probabilities!(s, ...) and then calls the normal entropy function. It doesn't serve any actual optimization purpose. A user that cares about getting this small optimization can live with calling probabilities! and explicitly passing the result to entropy.
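For illustration, a minimal sketch of what such a user would write instead, assuming the probabilities! signature above (the estimator and entropy type here are illustrative):

# Hypothetical usage, for some input timeseries x:
est = SymbolicPermutation(; m = 3)   # illustrative estimator
s = zeros(Int, length(x))            # pre-allocated symbol vector
probs = probabilities!(s, est, x)    # in-place symbolization, as before
h = entropy(Shannon(), probs)        # then the normal entropy call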

@Datseris (Member Author)

I would also remove encodings_from_permutations; it is an internal function without purpose, as it just forwards to encodings_from_permutations!. Only encodings_from_permutations! should exist. This is rather simple to remove.

@Datseris (Member Author) commented Dec 22, 2022

@kahaaga the OrdinalPatternEncoding still does not match the declared API. The API says that encode takes in an element of the input data. Here it should take in an SVector, an element of the input dataset (or delay-embedded timeseries). But that's not what it takes in. I want to change this to do what the API says, which is the most useful for a user, who shouldn't have to care about creating the intermediate sortperm step. Which is also super expensive.

EDIT: In fact, I am convinced that once the encoding does what it's supposed to do, we will save a lot of lines of source code, just like we do in the histogram encoding. In essence, applying the estimator SymbolicPermutation will be a straightforward call of map(x -> encode(c, x), input_dataset), with c an instance of OrdinalPatternEncoding.


Here is my plan for how to change it. First, the entirety of the common.jl file is deleted. Then, perm = @MVector zeros(Int, m) becomes a field of OrdinalPatternEncoding. Then, we define:

function encode(encoding::OrdinalPatternEncoding, x::SVector{m}) where {m}
    # x is already an element of the input dataset
    perm = sortperm!(encoding.dummyperm, x, lt = encoding.lt)
    # Compute the Lehmer code of the permutation (factorial number system)
    n = 0
    for i = 1:m-1
        for j = i+1:m
            n += perm[i] > perm[j] ? 1 : 0
        end
        n = (m-i)*n
    end
    # The Lehmer code actually results in 0 being an encoded symbol. Shift by 1, so that
    # encodings are positive integers.
    return n + 1
end

and BOOM, we saved 200 lines of code.


Hence, like in ValueHistogram, the source code of the estimator becomes: (1) initialize the encoding and make it your only field; (2) call encode, mapping over the elements of the input; (3) call probabilities(encoded_result).
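For illustration, a minimal sketch of that three-step structure (type and function names follow the discussion above; exact signatures are illustrative):

# (1) The encoding is the estimator's only field.
struct SymbolicPermutation{E} <: ProbabilitiesEstimator
    encoding::E
end

function probabilities(est::SymbolicPermutation, x)
    # (2) Encode each element (state vector) of the input...
    πs = map(χ -> encode(est.encoding, χ), x)
    # (3) ...and count the relative frequencies of the encoded integers.
    return probabilities(πs)
end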

@kahaaga (Member) commented Dec 22, 2022

I would like to remove entropy!(s, e::Entropy, p::ProbEst, x) from the API. It just calls probabilities!(s, ...) and then calls the normal entropy function. It doesn't serve any actual optimization purpose. A user that cares about getting this small optimization can live with calling probabilities! and explicitly passing the result to entropy.

Ok with me! We just need to remember to remove all references to entropy! from all docstrings.

I would also remove encodings_from_permutations; it is an internal function without purpose, as it just forwards to encodings_from_permutations!. Only encodings_from_permutations! should exist. This is rather simple to remove.

Also fine with me.

...
Here is my plan for how to change it. First, the entirety of the common.jl file is deleted. Then, perm = @MVector zeros(Int, m) becomes a field of OrdinalPatternEncoding. Then, we define:

Excellent. A much better solution. Go ahead!

@Datseris (Member Author) commented Dec 22, 2022

@kahaaga have a look at the new SymbolicPermutation.jl and ordinal_pattern files in the latest commit. I think the simplification is massive. The only thing left now is to think about how to incorporate the "weighted" and "amplitude-aware" versions into the picture.

Maybe the loop

    # TODO: The following loop can probably be parallelized!
    @inbounds for (i, χ) in enumerate(x)
        πs[i] = encode(est.encoding, χ)
    end

is altered to add weights to πs? Like, something as simple as πs[i] = encode(est.encoding, χ) * permweight(est, χ)? Or, if all of x is required, precompute the weights as pweights = permweights(est, x) and then multiply πs[i] with pweights[i]?

EDIT: It seems to me that the weight of any estimator is obtained from the data point itself and does not need additional data points. So, each data point χ generates its own weight without needing the remaining data points in the Dataset. So this is super easy: I just make a function permweight!(π, χ, est) that adds weight to π = πs[i] according to the estimator.
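For illustration, a sketch of such per-point weight functions, under the assumption stated above that each weight depends only on the state vector χ itself (permweight is an illustrative name, not final API; the weighted variant is assumed to weight each vector by its variance):

using Statistics: var

# Hypothetical per-point weights, depending only on the state vector χ:
permweight(est::SymbolicPermutation, χ) = 1.0  # unweighted: all points count equally
permweight(est::SymbolicWeightedPermutation, χ) = var(χ; corrected = false)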

@kahaaga mentioned this pull request on Dec 22, 2022
@kahaaga (Member) commented Dec 22, 2022

@kahaaga have a look at the new SymbolicPermutation.jl and ordinal_pattern files in the latest commit. I think the simplification is massive. The only thing left now is to think about how to incorporate the "weighted" and "amplitude-aware" versions into the picture.

Maybe the loop

    # TODO: The following loop can probably be parallelized!
    @inbounds for (i, χ) in enumerate(x)
        πs[i] = encode(est.encoding, χ)
    end

is altered to add weights to πs? Like, something as simple as πs[i] = encode(est.encoding, χ) * permweight(est, χ)? Or, if all of x is required, precompute the weights as pweights = permweights(est, x) and then multiply πs[i] with pweights[i]?

EDIT: It seems to me that the weight of any estimator is obtained from the data point itself and does not need additional data points. So, each data point χ generates its own weight without needing the remaining data points in the Dataset. So this is super easy: I just make a function permweight!(π, χ, est) that adds weight to π = πs[i] according to the estimator.

Yes, the point of the weighted and amplitude-adjusted variants is that they use the same encoded symbols, but weight each symbol differently according to some property of the state vector from which the symbol was constructed.

Now, the weights are computed using the permutation_weights method (one in SymbolicAmplitudeAwarePermutation.jl and one in SymbolicWeightedPermutation.jl). These weights are then combined with the encoded symbols using the symprobs function in the utils.jl file.

I just make a function permweight!(π, χ, est) that adds weight to π = πs[i] according to the estimator.

Notice that there is sorting going on in symprobs. Any new method must respect the same ordering, so that there is correspondence between the outcomes and the probabilities returned by probabilities_and_outcomes(::SymbolicWeightedPermutation, x).
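For illustration, a sketch of such a combination step (not the package's actual symprobs), showing how sorting keeps the outcomes and probabilities aligned:

# Sum the weight of each distinct symbol, then sort by symbol so that the
# i-th probability corresponds to the i-th (sorted) outcome.
function weighted_probs(πs::AbstractVector{Int}, wts::AbstractVector{<:Real})
    acc = Dict{Int, Float64}()
    for (π, w) in zip(πs, wts)
        acc[π] = get(acc, π, 0.0) + w
    end
    outcomes = sort!(collect(keys(acc)))
    probs = [acc[π] for π in outcomes]
    return probs ./ sum(probs), outcomes
end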

@Datseris (Member Author)

What's the point of the in-place method if the weights are recomputed in every call to the in-place method? One would need an in-place method with pre-allocated weights and symbol vector...? E.g., I see that the current SymbolicWeightedPermutation doesn't even define an in-place method for probabilities!.

@kahaaga (Member) commented Dec 23, 2022

EDIT: No, I guess it does? I am confused by all the broadcasting that happens on y in the encode of the Gaussian encoding. It's a real number, right?

Yes, the value returned from encode(::GaussianCDFEncoding) should be a real number. The broadcasting is actually not needed anymore; it is a remnant from when we demanded encoding of the entire input timeseries at once. Since we now pre-compute the empirical mean and standard deviation for the Gaussian CDF, encode can operate on one element at a time, so we don't need elementwise operations in the last step.

The question is: should we be strict (only allow scalar input, encode(::GaussianCDFEncoding, x::Real)), or should we keep it as is, which would also allow encode(::GaussianCDFEncoding, x::AbstractVector) (which would give elementwise mappings to the CDF)?
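For reference, a sketch of the strict scalar variant, assuming the encoding stores the pre-computed empirical mean μ, standard deviation σ, and the number of bins c (the field names and binning step are illustrative, not the package's exact implementation):

using SpecialFunctions: erf

function encode(encoding::GaussianCDFEncoding, x::Real)
    # Map x through the Gaussian CDF with the pre-computed mean and std...
    y = 0.5 * (1 + erf((x - encoding.μ) / (encoding.σ * sqrt(2))))
    # ...then discretize the unit interval into c equal-width bins.
    return min(floor(Int, y * encoding.c) + 1, encoding.c)
end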

@Datseris (Member Author)

The question is: should we be strict (only allow scalar input, encode(::GaussianCDFEncoding, x::Real)), or should we keep it as is, which would also allow encode(::GaussianCDFEncoding, x::AbstractVector) (which would give elementwise mappings to the CDF)?

No, just call map or broadcast manually with the dot syntax.
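For illustration, with a strict scalar method, elementwise encoding over a timeseries x is just:

symbols = encode.(Ref(encoding), x)           # manual broadcasting
# or equivalently
symbols = map(xᵢ -> encode(encoding, xᵢ), x)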

@kahaaga (Member) commented Dec 23, 2022

No, just call map or broadcast manually with the dot syntax.

I'm not sure I follow. Am I right that you want to be strict about requiring scalar inputs, and require the user to broadcast/map manually? :D

@Datseris (Member Author)

Either all methods have a folder, or no methods have a folder

Why? This sounds like an arbitrary requirement. My suggestion is that things should be in folders only if they require more than one file, which is unlikely for a probability estimator, generally speaking.

@kahaaga (Member) commented Dec 23, 2022

Why? This sounds like an arbitrary requirement. My suggestion is that things should be in folders only if they require more than one file, which is unlikely for a probability estimator, generally speaking.

It's just a preference I have, but I have no problem with not sticking to it. Your version is also a natural choice. Feel free to drop the folders if you like.

@kahaaga (Member) commented Dec 23, 2022

@Datseris Are you fixing the remaining tests (some fail after the OrdinalPatternEncoding refactoring), or shall I do it when reviewing the PR?

@Datseris (Member Author)

I haven't finished yet; I still need to go through the Transfer Operator, and this will take a lot of time.

@kahaaga (Member) commented Dec 23, 2022

I haven't finished yet; I still need to go through the Transfer Operator, and this will take a lot of time.

As part of #55, I'll have to re-think the whole transfer operator estimator API anyway, in light of all our recent updates. So I suggest we don't dive too deeply into it now, beyond cleaning up the documentation and the source code.

However, I would be very hesitant to make any fundamental changes to the structure of the source code now (i.e. removing/adding existing functions/types), because this code is directly used in scripts related to publications that are already out, and in other materials I use, which I won't have time to re-do for a while.

@Datseris (Member Author)

As part of #55, I'll have to re-think the whole transfer operator estimator API anyway, in light of all our recent updates. So I suggest we don't dive too deeply into it now, beyond cleaning up the documentation and the source code.

Okay, then I won't do TraOp at the moment. Well, then this PR is finished from my end and should be reviewed. Minor bugfixes for the tests can be corrected over time, but shouldn't hold back the review.

However, I would be very hesitant to make any fundamental changes to the structure of the source code now (i.e. removing/adding existing functions/types), because this code is directly used in scripts related to publications that are already out, and in other materials I use, which I won't have time to re-do for a while.

But the publications use Entropies v1.2, which is already out; not only that, but the modified code will be published under a package with a different name! So why would this matter for your existing publications...?

@Datseris (Member Author)

As you saw, many of the old estimators such as the ValueHistogram or symbolic stuff saw a complete rewrite that led to much simpler code. They are so easy to follow now. Probably worth doing the same with TE.

@kahaaga (Member) commented Dec 23, 2022

But the publications use Entropies v1.2, which is already out; not only that, but the modified code will be published under a package with a different name! So why would this matter for your existing publications...?

The scripts mostly investigate the transfer operator in the context of causal inference. I still want users to be able to use the existing material, and also be able to use the latest stuff from CausalityTools v2 when it is released, without having to check out different versions (and remember differences in syntax) and manually use manifests. From experience, that will quickly become a mess, unless you know very well what you're doing.

As you saw, many of the old estimators such as the ValueHistogram or symbolic stuff saw a complete rewrite that led to much simpler code. They are so easy to follow now. Probably worth doing the same with TE.

Yes, we definitely should do something similar. But I'll think about that once I have time to address #55.

Well, then this PR is finished from my end and should be reviewed.

Ok, excellent. Thanks! I'll have a look at this as soon as possible and follow up with a last review of my own, but probably not until early next week. Some holiday days are incoming now. Merry Christmas, @Datseris!

@Datseris (Member Author)

Sounds good to me, happy holidays!

@Datseris changed the title from "Review of codebase and docs - Datseris" to "Review of codebase and docs - Datseris - Probabilities only" on Dec 24, 2022
@Datseris (Member Author)

This is done now and all tests pass. I am re-writing the top-level comment to outline the changes. I don't remember all of them; there were some small changes regarding clarity here and there. I won't write them.

@Datseris changed the title from "Review of codebase and docs - Datseris - Probabilities only" to "Review of codebase and docs - Probabilities and Encodings - Datseris" on Dec 24, 2022
@kahaaga (Member) commented Dec 25, 2022

This is a review of all probabilities estimators except the Transfer Operator for v2. Once this PR is merged, all probabilities estimators except the Transfer Operator are "set in stone" for v2 and will not get any further changes. So it is important, @kahaaga, that once we merge this we both agree that it is "done" and commit to this doneness, so that we can actually get some v2 out at some point during our lifespans. Because if we don't commit, I am sure I will keep finding things to change every week, given my history with the repo...

Alright, here I outline the major changes I made:

  • Improved (in my opinion) headings and outline in the docs
  • Simplified and shortened several docstrings, mainly for the estimators that use encodings
  • Removed the entropy! method completely
  • Added an Encodings documentation page
  • Adjusted the symbolic encoding: it now operates on the elements (the vectors from which to estimate the permutations) and in general obeys the encoding API as declared
  • Complete re-write of the symbolic permutation estimators. It is amazing how little code they occupy now. They are all in one file, and they have the ordinal pattern encoding as a field. They share almost all their lines of code, apart from the 4 lines that define the weights of the permutations. The simplification is insane!!!
  • The spatial symbolic estimator also has the encoding as a field now, similarly to the others
  • Similar rewrite of the Dispersion estimator and the Gaussian encoding. The estimator now has the encoding as its field
  • Fixed all tests

@Datseris I've now reviewed this PR in detail, and I must say your changes look very, very good! The simplification of the symbolic estimators is ridiculous :D

I only had some minor docstring changes (typos, slight clarifications/reformulations).

So it is important, @kahaaga, that once we merge this we both agree that it is "done" and commit to this doneness, so that we can actually get some v2 out at some point during our lifespans. Because if we don't commit, I am sure I will keep finding things to change every week, given my history with the repo...

The existing probabilities estimators are now mature enough that I think we can guarantee their v2-stability. There comes a point where further optimisation at the expense of delayed publication becomes a bit silly. I don't think we're beyond the threshold yet, but we're approaching it :D

The only thing I'm a bit unsure of is the multiscale API, which may change. I still have to test whether it makes sense for upstream methods. However, there is no reason to delay Entropies v2 because of that. We can always release a minor breaking version later if necessary.

That said, I'm now merging, so we can get on with the rest of the review, and I can get on with solidifying the probabilities-based estimators in CausalityTools.

@kahaaga merged commit cbb3319 into main on Dec 25, 2022
@kahaaga deleted the datseris_review branch on December 25, 2022 19:29
@Datseris (Member Author)

We can always release a minor breaking version later if necessary.

There is no such thing as a minor breaking version. If you think multiscale isn't ready, it shouldn't go into v2.

@kahaaga (Member) commented Dec 25, 2022

There is no such thing as a minor breaking version. If you think multiscale isn't ready, it shouldn't go into v2.

For the applications in this package, it is ready. I can always take care of any future issues with a simple wrapper upstream.

@Datseris (Member Author)

Well, in any case, I haven't reviewed that part yet, so we'll see soon how it is. I'm now going to make my entropies review PR.
