Remove (and enforce?) duplicate priors #640

Open
ashiklom opened this issue Feb 22, 2019 · 1 comment


@ashiklom
Member

There must be exactly one prior per variable per PFT. However, at least in the BETY dump used by default on Docker, there are at least 4 PFT-variable combinations with multiple priors:

library(dplyr)

# `con` is assumed to be an open DBI connection to BETY.
# Join each prior to the PFTs it is assigned to.
priors_pfts <- tbl(con, "priors") %>%
  inner_join(tbl(con, "pfts_priors"),
             by = c("id" = "prior_id"),
             suffix = c(".priors", ".pfts_priors")) %>%
  collect()

# Count priors per PFT-variable combination and keep the ones with more than one.
priors_pfts %>%
  group_by(pft_id, variable_id) %>%
  count() %>%
  filter(n > 1)

# # A tibble: 4 x 3
# # Groups:   pft_id, variable_id [4]
#   pft_id          variable_id         n
#   <S3: integer64> <S3: integer64> <int>
# 1         16               7          2
# 2 1000000122              41          2
# 3 1000000130      1000000012          2
# 4 1000000130      1000000015          2

First, and easiest, we need to remove the problematic records from the current BETY dump (@robkooper?). Here are the duplicate records, grouped by PFT and variable, and sorted (within group) by age (oldest updated first).

# A tibble: 8 x 7
# Groups:   variable_id, pft_id [4]
      pft_id variable_id  id.priors id.pfts_priors distn parama paramb
1         16           7        189            597 unif       0    100
2         16           7        197     1000001394 gamma    1.5    0.4
3 1000000122          41        139     1000001711 unif       0     10
4 1000000122          41 1000000321     1000001698 unif   0.001      1
5 1000000130  1000000012 1000000123     1000002030 norm    6600     10
6 1000000130  1000000012 1000000167     1000002020 norm    3800     10
7 1000000130  1000000015 1000000310     1000002044 norm    0.86  0.025
8 1000000130  1000000015 1000000311     1000002024 norm     1.1  0.025
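
One way to do the removal, once we decide which prior of each pair to keep (a curation call I am not making here), is to delete the redundant rows from pfts_priors by their id. This is only a sketch: it assumes an RPostgres-backed DBI connection con, and the helper name is made up.

library(DBI)

# Drop redundant PFT-prior links by their pfts_priors id.
# `pfts_priors_ids` should hold one id.pfts_priors value per duplicate pair
# (the link we decided to discard); DBI binds and runs the DELETE once per id.
drop_duplicate_links <- function(con, pfts_priors_ids) {
  dbExecute(
    con,
    "DELETE FROM pfts_priors WHERE id = $1",
    params = list(pfts_priors_ids)
  )
}

# e.g. drop_duplicate_links(con, ids_to_drop)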

More generally, we should figure out a way to make sure these records don't creep into the database again. Unfortunately, it looks like there is no way to impose a PostgreSQL constraint across tables, which is what would be required here: the rule is one prior per PFT per variable, but variable_id lives in priors while the PFT link lives in pfts_priors, so a unique index on either table alone can't express it. Perhaps a short-term solution is to periodically run some external validation of BETY (possibly based on my dplyr code above) that makes sure duplicates don't exist in this table.
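
Here is a minimal sketch of what that periodic check could look like, based on the dplyr code above (the function name check_duplicate_priors is made up; con is an open DBI connection to BETY). Run from a cron job or a CI step, it would fail loudly as soon as a new duplicate appears.

library(dplyr)

# Error if any PFT has more than one prior on the same variable.
check_duplicate_priors <- function(con) {
  dups <- tbl(con, "priors") %>%
    inner_join(tbl(con, "pfts_priors"),
               by = c("id" = "prior_id"),
               suffix = c(".priors", ".pfts_priors")) %>%
    count(pft_id, variable_id) %>%
    filter(n > 1) %>%
    collect()
  if (nrow(dups) > 0) {
    stop(nrow(dups), " PFT-variable combinations have multiple priors")
  }
  invisible(dups)
}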

Related to #185.

@dlebauer
Member

I think we've solved this limitation in the past with trigger functions, e.g. https://github.com/PecanProject/bety/search?q=trigger&type=Issues
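
For illustration, something along these lines could work. This is a sketch only, not the actual bety migration, and the function and trigger names are made up: a trigger that rejects a pfts_priors row when the PFT already has a prior on the same variable.

library(DBI)

# Sketch of a trigger enforcing "one prior per variable per PFT".
# The incoming prior's variable is looked up in priors; if the PFT already
# has another prior on that variable, the insert/update is rejected.
dbExecute(con, "
CREATE OR REPLACE FUNCTION forbid_duplicate_priors() RETURNS trigger AS $$
DECLARE
  new_variable_id bigint;
BEGIN
  SELECT variable_id INTO new_variable_id FROM priors WHERE id = NEW.prior_id;
  IF EXISTS (
    SELECT 1
    FROM pfts_priors pp
    JOIN priors p ON p.id = pp.prior_id
    WHERE pp.pft_id = NEW.pft_id
      AND p.variable_id = new_variable_id
      AND pp.id IS DISTINCT FROM NEW.id
  ) THEN
    RAISE EXCEPTION 'PFT % already has a prior for variable %',
      NEW.pft_id, new_variable_id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
")

dbExecute(con, "
CREATE TRIGGER no_duplicate_priors
  BEFORE INSERT OR UPDATE ON pfts_priors
  FOR EACH ROW EXECUTE PROCEDURE forbid_duplicate_priors();
")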
