
Review of codebase and docs - Probabilities and Encodings - Datseris #213

Merged: 36 commits into main on Dec 25, 2022

Conversation

@Datseris (Member) commented Dec 22, 2022

This is a review of all probabilities estimators except the Transfer Operator for v2. Once this PR is merged, all probabilities estimators except the Transfer Operator are "set in stone" for v2 and will not get any further changes. So it is important, @kahaaga, that once we merge this we both agree that it is "done" and commit to this doneness, so that we can actually get some v2 out at some point during our lifespans. Because if we don't commit, I am sure I will keep finding things to change every week, given my history with the repo...

Alright, here I outline the major changes I made:

  • Improved (in my opinion) headings and outline in the docs
  • Simplified and shortened several docstrings, mainly for the estimators that use encodings
  • Removed the entropy! method completely
  • Added an Encodings documentation page
  • Adjusted the symbolic encoding: it now operates on the elements (the vectors from which to estimate the permutations) and in general obeys the encoding API as declared
  • Complete re-write of the symbolic permutation estimators. It is amazing how little code they occupy now. They are all in one file, and they have the ordinal pattern encoding as a field. They share almost all their lines of code, apart from the 4 lines that define the weights of the permutations. The simplification is insane!!!
  • The spatial symbolic estimator also has the encoding as a field now, similarly to the others
  • Similar rewrite of the Dispersion estimator and the Gaussian encoding. The estimator now has the encoding as its field
  • Fixed all tests

@Datseris (Member Author) commented Dec 22, 2022

The name sortperm is used in the docstring of the OrdinalPattern encoding, but is not found anywhere in the docs. I'll probably re-think how all of this is exposed in this PR.

@codecov bot commented Dec 22, 2022

Codecov Report

Merging #213 (7f77993) into main (3de9294) will decrease coverage by 0.16%.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   84.63%   84.47%   -0.17%     
==========================================
  Files          49       45       -4     
  Lines        1191     1140      -51     
==========================================
- Hits         1008      963      -45     
+ Misses        183      177       -6     
Impacted Files Coverage Δ
src/encoding/rectangular_binning.jl 88.52% <ø> (ø)
src/entropy.jl 79.31% <ø> (+11.66%) ⬆️
src/probabilities.jl 80.76% <ø> (ø)
.../probabilities_estimators/dispersion/dispersion.jl 89.28% <ø> (ø)
...abilities_estimators/histograms/value_histogram.jl 87.50% <ø> (ø)
src/probabilities_estimators/spatial/utils.jl 94.11% <ø> (ø)
src/encoding/ordinal_pattern.jl 80.00% <75.00%> (-6.05%) ⬇️
src/encoding/gaussian_cdf.jl 61.53% <80.00%> (+4.39%) ⬆️
.../spatial_permutation/SpatialSymbolicPermutation.jl 70.27% <81.25%> (-0.19%) ⬇️
...mators/permutation_ordinal/symbolic_permutation.jl 94.00% <94.00%> (ø)
... and 3 more


@kahaaga (Member) commented Dec 22, 2022

The name sortperm is used in the docstring of the OrdinalPattern encoding, but is not found anywhere in the docs. I'll probably re-think how all of this is exposed in this PR.

Maybe just say Base.sortperm, and assume that a user would understand to look at Julia's Base docs?

@Datseris (Member Author)

aaaaaaaaaaaaaaaaah haha my bad. I didn't think about it and thought it was the internal function that does the Lehmer code thingy...

@Datseris (Member Author) commented Dec 22, 2022

@kahaaga I would like to remove entropy!(s, e::Entropy, p::ProbEst, x) from the API. It just calls probabilities!(s, ...) and then calls the normal entropy function. It doesn't serve any actual optimization purpose. A user that cares about getting this small optimization can live with calling probabilities! and explicitly passing the result to entropy.
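For illustration, a minimal sketch of what such a user would write instead, assuming the probabilities! signature above (the estimator and entropy type here are illustrative):

# Hypothetical usage, for some input timeseries x:
est = SymbolicPermutation(; m = 3)   # illustrative estimator
s = zeros(Int, length(x))            # pre-allocated symbol vector
probs = probabilities!(s, est, x)    # in-place symbolization, as before
h = entropy(Shannon(), probs)        # then the normal entropy call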

@Datseris (Member Author)

I would also remove encodings_from_permutations; it is an internal function without purpose, as it just forwards to encodings_from_permutations!. Only encodings_from_permutations! should exist. This is rather simple to remove.

@Datseris (Member Author) commented Dec 22, 2022

@kahaaga the OrdinalPatternEncoding still does not match the declared API. The API says that encode takes in an element of the input data. Here it should take in an SVector, an element of the input dataset (or delay-embedded timeseries). But that's not what it takes in. I want to change this to do what the API says, which is the most useful for a user, who shouldn't have to care about creating the intermediate sortperm step. Which is also super expensive.

EDIT: In fact, I am convinced that once the encoding does what it's supposed to do, we will save a lot of lines of source code, just like we do in the histogram encoding. In essence, applying the estimator SymbolicPermutation will be a straightforward call of map(x -> encode(c, x), input_dataset), with c an instance of OrdinalPatternEncoding.


Here is my plan for how to change it. First, the entirety of the common.jl file is deleted. Then, perm = @MVector zeros(Int, m) becomes a field of OrdinalPatternEncoding. Then, we define:

function encode(encoding::OrdinalPatternEncoding, x::SVector{m}) where {m}
    # x is already an element of the input dataset
    perm = sortperm!(encoding.dummyperm, x, lt = encoding.lt)
    # Compute the Lehmer code of the permutation (factorial number system)
    n = 0
    for i = 1:m-1
        for j = i+1:m
            n += perm[i] > perm[j] ? 1 : 0
        end
        n = (m-i)*n
    end
    # The Lehmer code actually results in 0 being an encoded symbol. Shift by 1, so that
    # encodings are positive integers.
    return n + 1
end

and BOOM, we saved 200 lines of code.


Hence, like in ValueHistogram, the source code of the estimator becomes: (1) initialize the encoding and make it your only field; (2) call encode, mapping over the elements of the input; (3) call probabilities(encoded_result).
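For illustration, a minimal sketch of that three-step structure (type and function names follow the discussion above; exact signatures are illustrative):

# (1) The encoding is the estimator's only field.
struct SymbolicPermutation{E} <: ProbabilitiesEstimator
    encoding::E
end

function probabilities(est::SymbolicPermutation, x)
    # (2) Encode each element (state vector) of the input...
    πs = map(χ -> encode(est.encoding, χ), x)
    # (3) ...and count the relative frequencies of the encoded integers.
    return probabilities(πs)
end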

@kahaaga (Member) commented Dec 22, 2022

I would like to remove entropy!(s, e::Entropy, p::ProbEst, x) from the API. It just calls probabilities!(s, ...) and then calls the normal entropy function. It doesn't serve any actual optimization purpose. A user that cares about getting this small optimization can live with calling probabilities! and explicitly passing the result to entropy.

Ok with me! We just need to remember to remove all references to entropy! from all docstrings.

I would also remove encodings_from_permutations; it is an internal function without purpose, as it just forwards to encodings_from_permutations!. Only encodings_from_permutations! should exist. This is rather simple to remove.

Also fine with me.

...
Here is my plan for how to change it. First, the entirety of the common.jl file is deleted. Then, perm = @MVector zeros(Int, m) becomes a field of OrdinalPatternEncoding. Then, we define:

Excellent. A much better solution. Go ahead!

@Datseris (Member Author) commented Dec 22, 2022

@kahaaga have a look at the new SymbolicPermutation.jl and ordinal_pattern files in the latest commit. I think the simplification is massive. The only thing left now is to think about how to incorporate the "weighted" and "amplitude-aware" versions into the picture.

Maybe the loop

    # TODO: The following loop can probably be parallelized!
    @inbounds for (i, χ) in enumerate(x)
        πs[i] = encode(est.encoding, χ)
    end

is altered to add weights to πs? Like, something as simple as πs[i] = encode(est.encoding, χ) * permweight(est, χ)? Or, if all of x is required, precompute the weights as pweights = permweights(est, x) and then multiply πs[i] with pweights[i]?

EDIT: It seems to me that the weight of any estimator is obtained from the data point itself and does not need additional data points. So, each data point χ generates its own weight without needing the remaining data points in the Dataset. So this is super easy: I just make a function permweight!(π, χ, est) that adds weight to π = πs[i] according to the estimator.
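For illustration, a sketch of such per-point weight functions, under the assumption stated above that each weight depends only on the state vector χ itself (permweight is an illustrative name, not final API; the weighted variant is assumed to weight each vector by its variance):

using Statistics: var

# Hypothetical per-point weights, depending only on the state vector χ:
permweight(est::SymbolicPermutation, χ) = 1.0  # unweighted: all points count equally
permweight(est::SymbolicWeightedPermutation, χ) = var(χ; corrected = false)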

@kahaaga mentioned this pull request on Dec 22, 2022
@kahaaga (Member) commented Dec 22, 2022

@kahaaga have a look at the new SymbolicPermutation.jl and ordinal_pattern files in the latest commit. I think the simplification is massive. The only thing left now is to think about how to incorporate the "weighted" and "amplitude-aware" versions into the picture.

Maybe the loop

    # TODO: The following loop can probably be parallelized!
    @inbounds for (i, χ) in enumerate(x)
        πs[i] = encode(est.encoding, χ)
    end

is altered to add weights to πs? Like, something as simple as πs[i] = encode(est.encoding, χ) * permweight(est, χ)? Or, if all of x is required, precompute the weights as pweights = permweights(est, x) and then multiply πs[i] with pweights[i]?

EDIT: It seems to me that the weight of any estimator is obtained from the data point itself and does not need additional data points. So, each data point χ generates its own weight without needing the remaining data points in the Dataset. So this is super easy: I just make a function permweight!(π, χ, est) that adds weight to π = πs[i] according to the estimator.

Yes, the point of the weighted and amplitude-adjusted variants is that they use the same encoded symbols, but weight each symbol differently according to some property of the state vector from which the symbol was constructed.

Now, the weights are computed using the permutation_weights method (one in SymbolicAmplitudeAwarePermutation.jl and one in SymbolicWeightedPermutation.jl). These weights are then combined with the encoded symbols using the symprobs function in the utils.jl file.

I just make a function permweight!(π, χ, est) that adds weight to π = πs[i] according to the estimator.

Notice that there is sorting going on in symprobs. Any new method must respect the same ordering, so that there is correspondence between the outcomes and the probabilities returned by probabilities_and_outcomes(::SymbolicWeightedPermutation, x).
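For illustration, a sketch of such a combination step (not the package's actual symprobs), showing how sorting keeps the outcomes and probabilities aligned:

# Sum the weight of each distinct symbol, then sort by symbol so that the
# i-th probability corresponds to the i-th (sorted) outcome.
function weighted_probs(πs::AbstractVector{Int}, wts::AbstractVector{<:Real})
    acc = Dict{Int, Float64}()
    for (π, w) in zip(πs, wts)
        acc[π] = get(acc, π, 0.0) + w
    end
    outcomes = sort!(collect(keys(acc)))
    probs = [acc[π] for π in outcomes]
    return probs ./ sum(probs), outcomes
end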

@Datseris (Member Author)

What's the point of the in-place method if the weights are recomputed in every call to the in-place method? One would need an in-place method with pre-allocated weights and symbol vector...? E.g., I see that the current SymbolicWeightedPermutation doesn't even define an in-place method for probabilities!.

@kahaaga (Member) commented Dec 23, 2022

EDIT: No, I guess it does? I am confused by all the broadcasting that happens on y in the encode of the Gaussian encoding. It's a real number, right?

Yes, the value returned from encode(::GaussianCDFEncoding) should be a real number. The broadcasting is actually not needed anymore; it is a remnant from when we demanded encoding of the entire input timeseries at once. Since we now pre-compute the empirical mean and standard deviation for the Gaussian CDF, encode can operate on one element at a time, so we don't need elementwise operations in the last step.

The question is: should we be strict (only allow scalar input, encode(::GaussianCDFEncoding, x::Real)), or should we keep it as is, which would also allow encode(::GaussianCDFEncoding, x::AbstractVector) (which would give elementwise mappings to the CDF)?
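For reference, a sketch of the strict scalar variant, assuming the encoding stores the pre-computed empirical mean μ, standard deviation σ, and the number of bins c (the field names and binning step are illustrative, not the package's exact implementation):

using SpecialFunctions: erf

function encode(encoding::GaussianCDFEncoding, x::Real)
    # Map x through the Gaussian CDF with the pre-computed mean and std...
    y = 0.5 * (1 + erf((x - encoding.μ) / (encoding.σ * sqrt(2))))
    # ...then discretize the unit interval into c equal-width bins.
    return min(floor(Int, y * encoding.c) + 1, encoding.c)
end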

@Datseris (Member Author)

The question is: should we be strict (only allow scalar input, encode(::GaussianCDFEncoding, x::Real)), or should we keep it as is, which would also allow encode(::GaussianCDFEncoding, x::AbstractVector) (which would give elementwise mappings to the CDF)?

No, just call map or broadcast manually with the dot syntax.
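For illustration, with a strict scalar method, elementwise encoding over a timeseries x is just:

symbols = encode.(Ref(encoding), x)           # manual broadcasting
# or equivalently
symbols = map(xᵢ -> encode(encoding, xᵢ), x)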

@kahaaga (Member) commented Dec 23, 2022

No, just call map or broadcast manually with the dot syntax.

I'm not sure I follow. Am I right that you want to be strict about requiring scalar inputs, and require the user to broadcast/map manually? :D

@Datseris (Member Author)

Either all methods have a folder, or no methods have a folder

Why? This sounds like an arbitrary requirement. My suggestion is that things should be in folders only if they require more than one file, which is unlikely for a probability estimator, generally speaking.

@kahaaga (Member) commented Dec 23, 2022

Why? This sounds like an arbitrary requirement. My suggestion is that things should be in folders only if they require more than one file, which is unlikely for a probability estimator, generally speaking.

It's just a preference I have, but I have no problem with not sticking to it. Your version is also a natural choice. Feel free to drop the folders if you like.

@kahaaga (Member) commented Dec 23, 2022

@Datseris Are you fixing the remaining tests (some fail after the OrdinalPatternEncoding refactoring), or shall I do it when reviewing the PR?

@Datseris (Member Author)

I haven't finished yet; I still need to go through the Transfer Operator, and this will take a lot of time.

@kahaaga (Member) commented Dec 23, 2022

I haven't finished yet; I still need to go through the Transfer Operator, and this will take a lot of time.

As part of #55, I'll have to re-think the whole transfer operator estimator API anyway, in light of all our recent updates. So I suggest we don't dive too deeply into it now, beyond cleaning up the documentation and the source code.

However, I would be very hesitant to make any fundamental changes to the structure of the source code now (i.e. removing/adding existing functions/types), because this code is directly used in scripts related to publications that are already out, and in other materials I use, which I won't have time to re-do for a while.

@Datseris (Member Author)

As part of #55, I'll have to re-think the whole transfer operator estimator API anyway, in light of all our recent updates. So I suggest we don't dive too deeply into it now, beyond cleaning up the documentation and the source code.

Okay, then I won't do TraOp at the moment. Well, then this PR is finished from my end and should be reviewed. Minor bugfixes for the tests can be corrected over time, but shouldn't hold back the review.

However, I would be very hesitant to make any fundamental changes to the structure of the source code now (i.e. removing/adding existing functions/types), because this code is directly used in scripts related to publications that are already out, and in other materials I use, which I won't have time to re-do for a while.

But the publications use Entropies v1.2, which is already out; not only that, but the modified code will be published under a package with a different name! So why would this matter for your existing publications...?

@Datseris (Member Author)

As you saw, many of the old estimators such as the ValueHistogram or symbolic stuff saw a complete rewrite that led to much simpler code. They are so easy to follow now. Probably worth doing the same with TE.

@kahaaga (Member) commented Dec 23, 2022

But the publications use Entropies v1.2, which is already out; not only that, but the modified code will be published under a package with a different name! So why would this matter for your existing publications...?

The scripts mostly investigate the transfer operator in the context of causal inference. I still want users to be able to use the existing material, and also be able to use the latest stuff from CausalityTools v2 when it is released, without having to check out different versions (and remember differences in syntax) and manually use manifests. From experience, that will quickly become a mess, unless you know very well what you're doing.

As you saw, many of the old estimators such as the ValueHistogram or symbolic stuff saw a complete rewrite that led to much simpler code. They are so easy to follow now. Probably worth doing the same with TE.

Yes, we definitely should do something similar. But I'll think about that once I have time to address #55.

Well, then this PR is finished from my end and should be reviewed.

Ok, excellent. Thanks! I'll have a look at this as soon as possible and follow up with a last review of my own, but probably not until early next week. Some holiday days are incoming now. Merry Christmas, @Datseris!

@Datseris (Member Author)

Sounds good to me, happy holidays!

@Datseris changed the title from "Review of codebase and docs - Datseris" to "Review of codebase and docs - Datseris - Probabilities only" on Dec 24, 2022
@Datseris (Member Author)

This is done now and all tests pass. I am re-writing the top-level comment to outline the changes. I don't remember all of them; there were some small changes regarding clarity here and there. I won't write them.

@Datseris changed the title from "Review of codebase and docs - Datseris - Probabilities only" to "Review of codebase and docs - Probabilities and Encodings - Datseris" on Dec 24, 2022
@kahaaga (Member) commented Dec 25, 2022

This is a review of all probabilities estimators except the Transfer Operator for v2. Once this PR is merged, all probabilities estimators except the Transfer Operator are "set in stone" for v2 and will not get any further changes. So it is important, @kahaaga, that once we merge this we both agree that it is "done" and commit to this doneness, so that we can actually get some v2 out at some point during our lifespans. Because if we don't commit, I am sure I will keep finding things to change every week, given my history with the repo...

Alright, here I outline the major changes I made:

  • Improved (in my opinion) headings and outline in the docs
  • Simplified and shortened several docstrings, mainly for the estimators that use encodings
  • Removed the entropy! method completely
  • Added an Encodings documentation page
  • Adjusted the symbolic encoding: it now operates on the elements (the vectors from which to estimate the permutations) and in general obeys the encoding API as declared
  • Complete re-write of the symbolic permutation estimators. It is amazing how little code they occupy now. They are all in one file, and they have the ordinal pattern encoding as a field. They share almost all their lines of code, apart from the 4 lines that define the weights of the permutations. The simplification is insane!!!
  • The spatial symbolic estimator also has the encoding as a field now, similarly to the others
  • Similar rewrite of the Dispersion estimator and the Gaussian encoding. The estimator now has the encoding as its field
  • Fixed all tests

@Datseris I've now reviewed this PR in detail, and I must say your changes look very, very good! The simplification of the symbolic estimators is ridiculous :D

I only had some minor docstring changes (typos, slight clarifications/reformulations).

So it is important, @kahaaga, that once we merge this we both agree that it is "done" and commit to this doneness, so that we can actually get some v2 out at some point during our lifespans. Because if we don't commit, I am sure I will keep finding things to change every week, given my history with the repo...

The existing probabilities estimators are now mature enough that I think we can guarantee their v2-stability. There comes a point where further optimisation at the expense of delayed publication becomes a bit silly. I don't think we're beyond the threshold yet, but we're approaching it :D

The only thing I'm a bit unsure of is the multiscale API, which may change. I still have to test whether it makes sense for upstream methods. However, there is no reason to delay Entropies v2 because of that. We can always release a minor breaking version later if necessary.

That said, I'm now merging, so we can get on with the rest of the review, and I can get on with solidifying the probabilities-based estimators in CausalityTools.

@kahaaga merged commit cbb3319 into main on Dec 25, 2022
@kahaaga deleted the datseris_review branch on December 25, 2022 19:29
@Datseris (Member Author)

We can always release a minor breaking version later if necessary.

There is no such thing as a minor breaking version. If you think multiscale isn't ready, it shouldn't go into v2.

@kahaaga (Member) commented Dec 25, 2022

There is no such thing as a minor breaking version. If you think multiscale isn't ready, it shouldn't go into v2.

For the applications in this package, it is ready. I can always take care of any future issues with a simple wrapper upstream.

@Datseris (Member Author)

Well, in any case, I haven't reviewed that part yet, so we'll see soon how it is. I'm now going to make my entropies review PR.
