Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial work on handling missing values #196

Merged
merged 8 commits into from
Sep 25, 2019
Merged

Conversation

palday
Copy link
Member

@palday palday commented Sep 24, 2019

Because StatsModels.ModelFrame is never called directly, StatsModels.missing_omit is never invoked either. If you extract the fixed effects as well as the blocking variables, then you can use that to create a FormulaTerm that missing_omit() understands and not lose any columns you need in the next step.

The extraction used here is rather kludgy at the moment.

@kleinschmidt
Copy link
Member

I think that if you create the right methods for StatsModels.termvars(::RandomEffectTerm) you can do it without the kludge (and just call missing_omit directly on the formula after apply_schema).

@palday
Copy link
Member Author

palday commented Sep 24, 2019

Thanks, that's the type of pointer I need.

@palday
Copy link
Member Author

palday commented Sep 24, 2019

So it would be nice to define the usual StatsModels methods for RandomEffectTerm (and I'm going to do that), but it turns out not to be strictly necessary for models that aren't doing anything fancy in their formulas. StatsModels.termvars works on the original formula, which parses out the grouping operator as an unspecified function of the grouping variable.

@codecov-io
Copy link

codecov-io commented Sep 24, 2019

Codecov Report

Merging #196 into master will decrease coverage by 0.06%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #196      +/-   ##
==========================================
- Coverage   94.55%   94.49%   -0.07%     
==========================================
  Files          18       18              
  Lines        1158     1162       +4     
==========================================
+ Hits         1095     1098       +3     
- Misses         63       64       +1
Impacted Files Coverage Δ
src/linearmixedmodel.jl 92.15% <100%> (+0.02%) ⬆️
src/randomeffectsterm.jl 92% <75%> (-3.46%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 01a9444...827c27a. Read the comment docs.

@@ -6,6 +6,10 @@ end
Base.show(io::IO, t::RandomEffectsTerm) = print(io, "($(t.lhs) | $(t.rhs))")
StatsModels.is_matrix_term(::Type{RandomEffectsTerm}) = false

function StatsModels.termvars(t::RandomEffectsTerm)
vcat(StatsModels.termvars(t.lhs), StatsModels.termvars(t.rhs))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to say you might want to use union (like StatsModels does) but there shouldn't be any duplication between the lhs and rhs so there's really no need.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, duplication on the lhs and rhs would be very bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be worth checking that in the constructor actually...

@palday
Copy link
Member Author

palday commented Sep 24, 2019

@kleinschmidt are there any other important Terms methods that other StatsMods functionality depends on? It should be fairly straightforward to implement as many as possible while I'm thinking about this.

@kleinschmidt
Copy link
Member

Not that I can think of off the top of my head; most of them have frankly terrible, confusing names (I can say that because I named them) and are "internal" anyway and can be implemented as needed

@kleinschmidt
Copy link
Member

Actually, better add tests for this too.

@palday
Copy link
Member Author

palday commented Sep 24, 2019

Working on tests for all this led me to something interesting -- missing values can lead to continuous predictors being converted into categorical. This seems to be unrelated to RandomEffectsTerm:

slp = deepcopy(dat[:sleepstudy])
slp[!,:U] = Array{Union{Missing, Float64},1}(slp[!,:U])
slp[1,:U] = missing

julia> apply_schema(@formula(Y ~ 1 + U), schema(@formula(Y ~ 1 + U), dat[:sleepstudy]))
FormulaTerm
Response:
  Y(continuous)
Predictors:
  1
  U(continuous)

julia> apply_schema(@formula(Y ~ 1 + U), schema(@formula(Y ~ 1 + U), slp))
FormulaTerm
Response:
  Y(continuous)
Predictors:
  1
  U(DummyCoding:109)

It seems the union datatype and not the presence of any missings is responsible for getting a continuous variable converted to categorical:

julia> apply_schema(@formula(Y ~ 1 + U), schema(@formula(Y ~ 1 + U), dropmissing(slp,disallowmissing=false)))
FormulaTerm
Response:
  Y(continuous)
Predictors:
  1
  U(DummyCoding:109)

Or using an example from the StatsMods documentation:

julia> concrete_term(term(:a), [1, 2, 3], nothing)
a(continuous)

julia> concrete_term(term(:a), [1, 2, missing], nothing)
a(DummyCoding:21)

I suspect this hasn't come up in StatsMods because missings are dropped before schema are applied and here I wanted to do it afterwards because the RandomEffecttTerms are created via apply_schema(). I think though that the missings should propagate here as they would for most numerical computations. This would help e.g. in building models with a built-in imputation step. This is drifting a bit from what I was trying to do here; if it's a legitimate interest, I'll re-raise the issue over at StatsMods.

@kleinschmidt
Copy link
Member

There's an open issue for this actually, JuliaStats/StatsModels.jl#145. We've discussed treating unions missing with continuous things as continuous, which I think is reasonable, but there are some tweaks necessary to get it to work. If you want to work up a PR there I'll happily review it, or otherwise it's at the top of my list for later this week when I'll have some Julia cycles...

@kleinschmidt
Copy link
Member

In general, we need a better way to handle missings in StatsModels. Things like lead/lag also produce them and it's not possible to remove them by pre-processing the data there.

@palday palday changed the title [WIP] Initial work on handling missing values Initial work on handling missing values Sep 25, 2019
@palday
Copy link
Member Author

palday commented Sep 25, 2019

Travis failed because of an issue in downloading a package, now everything looks good. Some parts will work better once some of the changes I'm making in JuliaStats/StatsModels.jl#153 land.

@palday palday merged commit d40f9ce into JuliaStats:master Sep 25, 2019
@palday palday deleted the missing branch September 25, 2019 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants