Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add proptable #19

Merged
merged 7 commits into from
Dec 22, 2017
Merged

add proptable #19

merged 7 commits into from
Dec 22, 2017

Conversation

bkamins
Copy link
Collaborator

@bkamins bkamins commented Dec 19, 2017

following #8.

@nalimilan
Copy link
Owner

Thanks. Looks good, I just wonder about the name dims. Are there other examples of functions which take a keyword argument indicating a region like that? It would be nice to follow a general convention. Also, the plural is weird when you pass a single dimension (which is the most common case), so maybe just dim? I wish we could make it a positional argument...

@bkamins
Copy link
Collaborator Author

bkamins commented Dec 19, 2017

An attempt for positional argument. Loos better?

@nalimilan
Copy link
Owner

Clever! It's tempting, but it's also very non-standard so I'm really hesitant. Can you have a look at what similar functions in Base or in packages do in similar cases?

@bkamins bkamins force-pushed the proptable branch 2 times, most recently from 4eda4f9 to 7c63df4 Compare December 19, 2017 19:14
@bkamins
Copy link
Collaborator Author

bkamins commented Dec 19, 2017

sum, maximum, minimum, all, any, prod: call it dims in documentation;
mean, reducedim, mapreducedim call it region in documentation.

In all cases those are positional arguments. I have not found the use of Vararg the way I proposed.

Finally the option is to define proptable to already work on frequency table (or any other AbstractArray), so one would write something like:

proptable(freqtable(x, y), 1)

then the definition is even simpler:

proptable(tbl::AbstractArray) = tbl / sum(tbl)
proptable(tbl::AbstractArray, dims) = tbl ./ sum(tbl, dims)

This is the way R works. Actually, when I wanted proptable I almost never also needed freqtable in R - that is why I proposed to squash it into one function, but we can split it.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.5%) to 96.053% when pulling 4eda4f9 on bkamins:proptable into 0d70f4a on nalimilan:master.

@coveralls
Copy link

coveralls commented Dec 20, 2017

Coverage Status

Coverage increased (+0.5%) to 96.053% when pulling 6268227 on bkamins:proptable into 0d70f4a on nalimilan:master.

@nalimilan
Copy link
Owner

Yeah, implementation-wise it's simpler to require proptable(freqtable(...), dims), but I have the same experience as you regarding the fact that you very often (though not always) call both at the same time. I think it would still make sense to have that function, since it happens that you have a table of counts an want the proportions. Maybe that function could just be called prop, as I'm not sure what "table" adds here.

It's still tempting to add a direct way to compute a table of proportions, either via a dedicated function or via a keyword argument. Or maybe we should wait until we add a more general convenience function, like Stata's tabulate or R's gmodels::CrossTable.

@bkamins
Copy link
Collaborator Author

bkamins commented Dec 20, 2017

The way to go probably depends on the way you want FreqTables to be developed (actually if you would outline what else you would see in this package I could contribute - I work with such type of analysis a lot).

For me things like tabulate or CrossTable or SAS PROC FREQ seem too heavy for this package.
I guess you do not want to put statistical tests etc. into this package as there are other packages for this - right?
Also I prefer UNIX-like approach of design based on simple functions doing one thing.

Given the above - if we named the function prop (or maybe props would be better?) I would like it as an adequate solution.
After thinking about it this .table part in R was bugging me with prop.table the most 😄.
If we went this way the question is how do you feel that the type of tbl should be restricted - any AbstractArray as above or something more specific, like AbstractArray{<:Integer}?

@nalimilan
Copy link
Owner

The way to go probably depends on the way you want FreqTables to be developed (actually if you would outline what else you would see in this package I could contribute - I work with such type of analysis a lot).

For me things like tabulate or CrossTable or SAS PROC FREQ seem too heavy for this package.
I guess you do not want to put statistical tests etc. into this package as there are other packages for this - right?
Also I prefer UNIX-like approach of design based on simple functions doing one thing.

The scope of the package isn't completely defined. Originally I hoped that the functions would be included in StatsBase once we agreed on a standard for named arrays/axis arrays, but that hasn't happened (yet?). Anyway I wouldn't say that these would be too heavy: by what definition are they "heavy" for a package dedicated to frequency tables? :-) As for tests, I'd rather put them in HypothesisTests, at least if they can support NamedArray inputs. We really need an equivalent of the simple chisq.test R function somewhere.

Given the above - if we named the function prop (or maybe props would be better?) I would like it as an adequate solution.
After thinking about it this .table part in R was bugging me with prop.table the most 😄.
If we went this way the question is how do you feel that the type of tbl should be restricted - any AbstractArray as above or something more specific, like AbstractArray{<:Integer}?

Let's go with prop then, anyway that's a good first step even if we decide to add the other approach too. I'm not a fan of the plural form, but if you have arguments for it... I guess it should accept any AbstractArray{<:Number}?

@bkamins
Copy link
Collaborator Author

bkamins commented Dec 21, 2017

I have commited prop. I was afraid of such a short name for a function working on a type from Base, but I have checked and there should not be much confusion (in the past prop was used to indicate property sometimes).

As for other things - I put chisq.test on my to do 😄.

README.md Outdated
@@ -44,6 +53,15 @@ b │ 3 2
c │ 2 3
d │ 2 3

julia> prop(tbl2, 1)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior (inherited from sum) is the opposite of R's: the passed dimensions are those to collapse, while in R they are ones to retain. I wonder whether it's appropriate for computing proportions: it's probably more natural to think "I want to compute proportion by rows" than "I want to divide each row by the column sums", isn't it? After all, we say "row profiles/percents" and "column profiles/percents".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it, initially I wanted to be consistent with sum.
But let us call the argument margin then and make it work like in R.

src/freqtable.jl Outdated
@@ -131,3 +131,6 @@ function freqtable(d::AbstractDataFrame, x::Symbol...; args...)
setdimnames!(a, x)
a
end

prop(tbl::AbstractArray{<:Number}) = tbl / sum(tbl)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a docstring?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to add a docstring for freqtable and prop in a separate PR when we have settled the implementation.

src/freqtable.jl Outdated
prop(tbl::AbstractArray{<:Number}) = tbl / sum(tbl)
prop(tbl::AbstractArray{<:Number}, dims) = tbl ./ sum(tbl, dims)

function prop(tbl::AbstractArray{<:Number,N}, margin) where N
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use margin::Integer...? Also, what's minimum?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer... - good idea, changing
minimum - removed, I wanted to check for negative entries but decided that it is not needed

src/freqtable.jl Outdated
throw(ArgumentError("invalid margin argument; expected integers in a valid dimension range"))
end
minimum
tbl ./ sum(tbl, tuple(setdiff(1:N, margin)...))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that inferrable? It would be nice to add @inferred to tests. If not, I wonder whether we could figure out a solution based on dispatch.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is; added check to tests

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

src/freqtable.jl Outdated
"""
prop(tbl, [margin])

Create table of proportions from a frequency table `tbl` with margins generated for
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily a frequency table: any numeric table will do.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/freqtable.jl Outdated
**Examples**

```julia
prop([1 2; 3 4])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use jldoctest and put the full REPL output.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

src/freqtable.jl Outdated
prop([1 2; 3 4], (1,2))
tbl = prop(rand(5, 5, 5, 5), (1, 2))
sumtbl = sum(tbl, (3,4))
all(x -> x ≈ 1.0, sumtbl)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to show that, just print the output of sum.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK given the change above

src/freqtable.jl Outdated
Calculating `sum` over the result of `prop` over dimensions that are complement of `margin`
produces `AbstractArray` containing only approximately `1.0`, see last example below.

**Arguments**
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring recommendations in the manual say that we should avoid using a separate section for arguments then possible. Here, it would be clearer to mention the expected types of arguments in the signature, and explain what they do directly in the description.

It would also be be nice to be less technical and explain in very simple terms how to compute row and column proportions. No need to mention "approximately" nor duplicated entries.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although margin allows to many types to specify it in the signature. I left it in the description.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry - given your comment below I can put it into the signature :)

@coveralls
Copy link

Coverage Status

Coverage increased (+0.3%) to 95.89% when pulling a9070ae on bkamins:proptable into 0d70f4a on nalimilan:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.3%) to 95.89% when pulling a9070ae on bkamins:proptable into 0d70f4a on nalimilan:master.

@@ -26,7 +59,10 @@ tab =freqtable(x, y,
3.0 2.0
1.5 1.0]
@test names(tab) == [["a", "b", "c", "d"], ["C", "D"]]

@test prop(tab) == [4 6; 2 3; 6 4; 3 2] / 30.0
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't hurt to have @inferred here too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

src/freqtable.jl Outdated
1.0 1.0
1.0 1.0

julia> tbl = prop(rand(2, 2, 2, 2), 1, 2); sum(tbl, (3,4))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you show this with a simple 3D array, e.g. reshape(1:12, (2, 2, 3)), and print the result of prop first? I think that's very helpful to understand what this does.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

0.075 0.05 0.05 0.075;
0.05 0.075 0.075 0.05;
0.05 0.075 0.075 0.05]
pt = @inferred prop(tab, 2)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this test fails. Maybe inference doesn't work after all?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I tested it on Vector and it works. It fails NamedArray - I will investigate what can be done.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, weird. Looks like the problem might be in sum(a::NamedArray, dim):

julia> @code_warntype sum(NamedArray([1 2; 3 4]), 1)
Variables:
  #self# <optimized out>
  a::NamedArrays.NamedArray{Int64,2,Array{Int64,2},Tuple{DataStructures.OrderedDict{String,Int64},DataStructures.OrderedDict{String,Int64}}}
  d::Int64

Body:
  begin 
      return $(Expr(:invoke, MethodInstance for sum(::NamedArrays.NamedArray{Int64,2,Array{Int64,2},Tuple{DataStructures.OrderedDict{String,Int64},DataStructures.OrderedDict{String,Int64}}}, ::Tuple{Int64}), :(NamedArrays.sum), :(a), :((Core.tuple)(d)::Tuple{Int64})))
  end::Any

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is - I have patched it here, but see the comment below.

@bkamins
Copy link
Collaborator Author

bkamins commented Dec 21, 2017

In order to ensure type stability I had to use internals of NamedArrays that are not exported.
In general constructors of NamedArray are not type stable:

julia> @code_warntype NamedArray([1,2,3])
Variables:
  #self#::Type{NamedArrays.NamedArray}
  array::Array{Int64,1}

Body:
  begin
      SSAValue(0) = (Core.tuple)((Base.arraysize)(array::Array{Int64,1}, 1)::Int64)::Tuple{Int64}
      return $(Expr(:invoke, MethodInstance for NamedArrays.NamedArray(::Array{Int64,1}, ::Array{Array{String,1},1}, ::Array{Symbol,1}), :(#self#), :(array), :($(Expr(:invoke, MethodInstance for collect(::Base.Generator{Tuple{Int64},NamedArrays.#defaultnames}), :(Base.collect), :($(Expr(:new, Base.Generator{Tuple{Int64},NamedArrays.#defaultnames}, :($(QuoteNode(NamedArrays.defaultnames))), SSAValue(0))))))), :($(Expr(:invoke, MethodInstance for collect(::Base.Generator{UnitRange{Int64},NamedArrays.#defaultdimname}), :(Base.collect), :($(Expr(:new, Base.Generator{UnitRange{Int64},NamedArrays.#defaultdimname}, :($(QuoteNode(NamedArrays.defaultdimname))), :($(Expr(:new, UnitRange{Int64}, 1, :((Base.select_value)((Base.sle_int)(1, 1)::Bool, 1, (Base.sub_int)(1, 1)::Int64)::Int64))))))))))))
  end::Any

@coveralls
Copy link

Coverage Status

Coverage increased (+0.3%) to 95.89% when pulling a660551 on bkamins:proptable into 0d70f4a on nalimilan:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.3%) to 95.89% when pulling a660551 on bkamins:proptable into 0d70f4a on nalimilan:master.

@nalimilan
Copy link
Owner

OK. Do you think there would be a way to make these constructors type-stable in NamedArrays itself? That would be useful in general.

@bkamins
Copy link
Collaborator Author

bkamins commented Dec 22, 2017

If it does not get fixed earlier I put it to my TODO, but I have some other things I would like to patch before :).

Anyway - even if NamedArrays are fixed the code I propose should be optimal (unless internal representation of NamedArrays is changed, because then it will be broken).

@bkamins
Copy link
Collaborator Author

bkamins commented Dec 22, 2017

BTW: do you know why in CI NamedArrays fail to precompile on 0.7?

@nalimilan
Copy link
Owner

Let's go with that, a two-line function isn't too bad as a temporary solution. But we really need to fix these constructors, as performance suffers a lot.

BTW: do you know why in CI NamedArrays fail to precompile on 0.7?

ERROR: LoadError: LoadError: UndefVarError: writedlm not defined

0.7 is going to require lots of small changes.

@nalimilan nalimilan merged commit 0392f03 into nalimilan:master Dec 22, 2017
@nalimilan
Copy link
Owner

I've just tagged a release: JuliaLang/METADATA.jl#12595

nalimilan pushed a commit that referenced this pull request Apr 9, 2019
@bkamins bkamins mentioned this pull request Nov 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants