You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If so, clean up violators [UNCHECKED THIS 9/11/2015: 16 rows still have null citation_id]
Decide on consistency constraint
not necessary / applicable (see notes added below)
Clean up data if needed
Add to migration
date, dateloc
Decide what constraints to use
dateloc is defined in the data entry documentation. I think integer between 0-100 is sufficient. [COMMENT 9/11/2015: The data entry documentation only specifies meanings for the following values: 5, 6, 7, 8, 9, 95, 96, 97. Does it really make sense to allow anything between 0 and 100? There are currently many outliers. See comment below.]
Clean up data if needed
Add to migration
mgmttype
Decide on what constraints to use
Clean up data if needed [UNCHECKED THIS 9/11/2015: many rows have values for this that aren't values the Rails app offers. See comment below.]
Expand table in Rails app if needed (#341)
Add constraints to migration
level
Decide if any meaningful constraints are possible
NULL or numeric
AND non-negative? Even if this isn't made a database constraint, we can validate this easily in Rails.
units
Decide if any meaningful constraints are possible
could require fertilizer* to be kg ha-1 and fertilizer*rate to be kg ha-1 y-1 or g m-2 y-1
Clean up data if needed [UNCHECKED THIS 9/15/2015: THERE CERTAINLY IS MUCH CLEANUP THAT CAN AND SHOULD BE DONE. FOR EXAMPLE, ENTRIES LIKE g/m-2 ARE CERTAINLY INCORRECT. AT LEAST WE KNOW THIS WAS PROBABLY MEANT TO BE g m-2 or g/m2. OTHER ENTRIES ARE MUCH MORE AMBIGUOUS (e.g. g/m-2, g/m-1).
Add constraints to migration if needed
If meaningful constraints aren't feasible, using an autocompletion list containing the list of existing values (but allowing others) would at least help limit the number of variants. Decide if this should be done, and if so, implement.
Details
citation_id
A management is related to one or more citations indirectly via the managements_treatments and citations_managements tables but also directly via the citation_id column. It seems like some kind of consistency constraint might be in order.
SELECTCOUNT(*) FROM managements m WHERE citation_id NOT IN (SELECT citation_id FROM citations_treatments ct JOIN managements_treatments mt USING (treatment_id) WHEREmt.management_id=m.id);
shows 1262 violations of what seems like a reasonable consistency constraint. For a better picture of the two ways of associating with citations, try
SELECTm.citation_id, ARRAY_AGG(ct.citation_idORDER BYct.citation_id) AS
citation_list FROM managements m JOIN managements_treatments mt ONmt.management_id=m.idJOIN citations_treatments ct USING (treatment_id)
GROUP BYm.id, m.citation_idORDER BYm.citation_id;
Another check that might be reasonable is that if date is like '0001-%', then dateloc is 95, 96, or ???
dateloc should be constrained to the defined values--we should define a domain for this since it is used elsewhere, and a comment on the domain should tell what the values mean.
It might also be useful to define a function that combines date with dateloc and outputs a string showing the significant portions of the date (given the dateloc value) in human readable form. For example INPUT date = '0001-04-21', dateloc = 96 might yield OUTPUT 'April'.
Since dates are fundamentally ambiguous when dateloc isn't given ('01' can mean either "unspecified" or "January" when it appears in the month portion of the date), it seems reasonable to require dateloc to be non-null.
mgmttype
The Rails model for managments contains a list of 29 values for mgmttype that the user can choose from when creating a new management. But the query
SELECTCOUNT(*) FROM managements WHERE mgmttype NOT IN ( 'burned', 'coppice', 'cultivated', 'cultivated or grazed', 'fertilization_Ca', 'fertilization_K', 'fertilization_N', 'fertilization_P', 'fertilization_other', 'fungicide', 'grazed', 'harvest', 'herbicide', 'irrigation', 'light', 'pesticide', 'planting (plants / m2)', 'row spacing', 'seeding', 'tillage','warming_soil','warming_air','initiation of natural succession','major storm','root exclusion', 'trenching', 'CO2 fumigation', 'soil disturbance', 'rain exclusion');
shows that 1592 rows contain types not in the list. At least some of these are variants of the "official" values--for example "fertilizer_N" in place of "fertilization_N". These should probably be changed to match the "official list". Other values not like anything in the list may mean we need to expand the list. If this list will change often, perhaps we need a "management_types" table. See discussion at https://www.overleaf.com/2086241dwjyrd#/5297403/.
To see the unsanctioned values that occur in this column, use
SELECT DISTINCT mgmttype FROM managements WHERE mgmttype NOT IN ( 'burned', 'coppice', 'cultivated', 'cultivated or grazed', 'fertilization_Ca', 'fertilization_K', 'fertilization_N', 'fertilization_P', 'fertilization_other', 'fungicide', 'grazed', 'harvest', 'herbicide', 'irrigation', 'light', 'pesticide', 'planting (plants / m2)', 'row spacing', 'seeding', 'tillage','warming_soil','warming_air','initiation of natural succession','major storm','root exclusion', 'trenching', 'CO2 fumigation', 'soil disturbance', 'rain exclusion');
units
Since we have a finite set of values for mgmttype (at least we do in the Rails interface), it seems like we could also limit units to some finite list of values. Currently, there are many variants of what certainly is the same unit; for example, "plant/m2", "plant/ m2", "plants m-2", "plants/m2", and "plants/ m2" all occur.
Ideally, the units would be constrained to be applicable to the mgmttype value. An SQL function could facilitate this. The question is, can will any restriction we impose be too restrictive to account for future possibilites?
ebi_production=# select dateloc, count(*) from managements group by dateloc order by dateloc;
dateloc | count
---------+-------1.00 | 4<-- doesn't make sense for dates without time2.00 | 1<-- doesn't make sense for dates without time4.00 | 114<-- doesn't make sense for dates without time4.50 | 2<-- doesn't make sense for dates without time5.00 | 5385.20 | 1<-- the Data Entry docs don't specify what fractional values mean5.50 | 241<-- the Data Entry docs don't specify what fractional values mean6.00 | 15206.50 | 56<-- the Data Entry docs don't specify what fractional values mean7.00 | 1497.50 | 72<-- the Data Entry docs don't specify what fractional values mean8.00 | 9508.50 | 2<-- the Data Entry docs don't specify what fractional values mean9.00 | 5995.00 | 16
| 1108<-- this is NULL
(16 rows)
@gsrohde commented on Mon Feb 16 2015
citation_id
citation_id
]date, dateloc
mgmttype
level
units
kg ha-1
and fertilizer*rate to bekg ha-1 y-1
org m-2 y-1
g/m-2
ARE CERTAINLY INCORRECT. AT LEAST WE KNOW THIS WAS PROBABLY MEANT TO BEg m-2
org/m2
. OTHER ENTRIES ARE MUCH MORE AMBIGUOUS (e.g.g/m-2, g/m-1
).Details
citation_id
A management is related to one or more citations indirectly via the managements_treatments and citations_managements tables but also directly via the citation_id column. It seems like some kind of consistency constraint might be in order.
shows 1262 violations of what seems like a reasonable consistency constraint. For a better picture of the two ways of associating with citations, try
response (by @dlebauer)
date
When a year is not specified, a date beginning with '0001-' is used. So to impose a range restriction on date, the following might be reasonable:
Another check that might be reasonable is that if date is like '0001-%', then dateloc is 95, 96, or ???
dateloc should be constrained to the defined values--we should define a domain for this since it is used elsewhere, and a comment on the domain should tell what the values mean.
It might also be useful to define a function that combines date with dateloc and outputs a string showing the significant portions of the date (given the dateloc value) in human readable form. For example INPUT date = '0001-04-21', dateloc = 96 might yield OUTPUT 'April'.
Since dates are fundamentally ambiguous when dateloc isn't given ('01' can mean either "unspecified" or "January" when it appears in the month portion of the date), it seems reasonable to require dateloc to be non-null.
mgmttype
The Rails model for managments contains a list of 29 values for mgmttype that the user can choose from when creating a new management. But the query
shows that 1592 rows contain types not in the list. At least some of these are variants of the "official" values--for example "fertilizer_N" in place of "fertilization_N". These should probably be changed to match the "official list". Other values not like anything in the list may mean we need to expand the list. If this list will change often, perhaps we need a "management_types" table. See discussion at https://www.overleaf.com/2086241dwjyrd#/5297403/.
To see the unsanctioned values that occur in this column, use
units
Since we have a finite set of values for mgmttype (at least we do in the Rails interface), it seems like we could also limit units to some finite list of values. Currently, there are many variants of what certainly is the same unit; for example, "plant/m2", "plant/ m2", "plants m-2", "plants/m2", and "plants/ m2" all occur.
Ideally, the units would be constrained to be applicable to the mgmttype value. An SQL function could facilitate this. The question is, can will any restriction we impose be too restrictive to account for future possibilites?
@dlebauer commented on Fri Aug 07 2015
Cleaning up managements - went through and checked lots of units in g m-2 that should've been kg ha-1 https://gist.github.com/dlebauer/2fa2fa9bedf7fc3d5ae7
@dlebauer commented on Fri Aug 07 2015
@gsrohde did we clean up orphaned managements and treatments in #171? There appear to still be some, depending on how I write my queries (#340)
@gsrohde commented on Fri Sep 11 2015
Current dateloc values (as of 2015-09-11):
@gsrohde commented on Fri Sep 11 2015
These are the values for
mgmttype
that exist in the database but aren't selectable in on the New/Editing Management pages:The bulk of these have the value
planting
, which could perhaps be folded in to the valueplanting (plants / m2)
.The text was updated successfully, but these errors were encountered: