Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

traits value constraints: needed cleanup and decisions #6

Open
20 tasks
dlebauer opened this issue Sep 18, 2018 · 0 comments
Open
20 tasks

traits value constraints: needed cleanup and decisions #6

dlebauer opened this issue Sep 18, 2018 · 0 comments

Comments

@dlebauer
Copy link
Member

@gsrohde commented on Tue Feb 17 2015

date, dateloc, time, timeloc, date_year, date_month, date_day, time_hour, time_minute

  • Decide what constraints to use
  • Clean up data if needed
  • Add to migration

mean

  • Clean up NULLs
  • Add not-null constraint to migration

n

  • Decide if n can be 1; change 1's to NULL if not. Similarly for 0.
  • Add CHECK (n >= 2) or CHECK (n >= 1) or CHECK (n >= 0)

stat, statname, n consistency

  • Decide on consistency constraints
  • Fix violations
  • Add needed constraints

cultivar_id, specie_id consistency

  • Fix cases where cultivar_id = 0
  • Fix cases where specie_id is NULL but cultivar_id is not
  • Fix inconsistent species-cultivar references
  • Figure out how to write a consistency constraint
  • Add it to migration

checked

  • Decide how to handle NULLs
  • Add not null constraint

access_level

  • Decide how to handle cases where access_level = 0
  • Change column type to the new DOMAIN level_of_access

Other NULLs

  • Clean up NULLs in key columns (or rethink whether they need to be non-null)

Details

date, dateloc, time, timeloc, date_year, date_month, date_day

See discussion in GH #239 and in section 3.26 of https://www.overleaf.com/2086241dwjyrd#/5297403/.

UPDATE 9/15/2015

There are serious issues with the way we use date and time fields in the traits and yields tables. I've made a separate issue for this.

mean

29 rows have NULLs in the mean column. Should these be deleted? 18 of these rows have non-null stat values.

n

See discussion in GH #231.

stat, statname, n consistency

See discussion in GH #231.

cultivar_id, specie_id consistency

This is discussed in the context of foreign-key constraints. See GH # 175.

To see inconsistencies in human-readable form, use

SELECT t_sp.scientificname AS "species referred to by traits table", c_sp.scientificname AS "species matching cultivar", c.name FROM traits t JOIN cultivars c ON t.cultivar_id = c.id JOIN species t_sp ON t_sp.id = t.specie_id JOIN species c_sp ON c.specie_id = c_sp.id WHERE t.specie_id != c.specie_id;

checked

checked is NULL in 143 rows.

access_level

access_level = 0 in 2195 rows.

Other NULLs

This supplements the information discussed in GH issues #175 (about the foreign keys in the traits table) and #204 (about the candidate key for the traits table).

As noted in issue #204, NULLs are rampant in this table, even in columns that make up the proposed key. To get an idea of the scope of the problem, run

SELECT COUNT(*) AS "total number of rows", SUM((site_id IS NULL)::int) AS "site_id NULLs", SUM((specie_id IS NULL)::int) AS "species_id NULLs", SUM((citation_id IS NULL)::int) AS "citation_id NULLs", SUM((cultivar_id IS NULL)::int) AS "cultivar_id NULLs", SUM((treatment_id IS NULL)::int) AS "treatment_id NULLs", SUM((date IS NULL)::int) AS "date NULLs", SUM((time IS NULL)::int) AS "time NULLs", SUM((variable_id IS NULL)::int) AS "variable_id NULLs", SUM((entity_id IS NULL)::int) AS "entity_id NULLs", SUM((method_id IS NULL)::int) AS "method_id NULLs", SUM((date_year IS NULL)::int) AS "date_year NULLs", SUM((date_month IS NULL)::int) AS "date_month NULLs", SUM((date_day IS NULL)::int) AS "date_day NULLs", SUM((time_hour IS NULL)::int) AS "time_hour NULLs", SUM((time_minute)::int) AS "time_minute" FROM traits;

The result is

-[ RECORD 1 ]--------+------
total number of rows | 32027
site_id NULLs        | 8364
species_id NULLs     | 79
citation_id NULLs    | 30
cultivar_id NULLs    | 23571
treatment_id NULLs   | 8120
date NULLs           | 13694
time NULLs           | 5290
variable_id NULLs    | 1
entity_id NULLs      | 28374
method_id NULLs      | 29176
date_year NULLs      | 11022
date_month NULLs     | 13559
date_day NULLs       | 15354
time_hour NULLs      | 29120
time_minute          | 15893

(In psql, to see results in this format, use the "\x" command before running the query.)

To get a better idea of how many rows contain no date information or no time information, run

SELECT COUNT(*) AS "total number of rows", SUM((date IS NULL AND date_year IS NULL AND date_month IS NULL AND date_day IS NULL)::int) AS "rows with no date info", SUM((time IS NULL AND time_hour IS NULL AND time_minute IS NULL)::int) AS "rows with no time into" FROM traits;

This yields

-[ RECORD 1 ]----------+------
total number of rows   | 32027
rows with no date info | 9787
rows with no time into | 4988
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant