
ForestGEO App Specification

Justin Pacholec edited this page May 19, 2023 · 10 revisions

Summary

ForestGEO, an initiative led by the Smithsonian Institution, studies forest diversity via 70+ research sites. Each research site conducts censuses of every tree within the plot at that location. Once forest data is collected from a census, it can take over a year to prepare that data for analysis. This delay is largely due to a patchwork of highly manual processes and legacy software that ForestGEO relies on to enter data into a digital format and validate the data. Furthermore, a handful of database managers are responsible for maintaining this legacy process because errors requiring expert support frequently arise. As a result, ForestGEO would like to make their data collection system more reliable, automated, and easy to use.

Architecture Proposal

(Architecture diagram)

The ForestGEO App is a web application developed using React, an open-source, front-end JavaScript library. This web application will communicate with Azure cloud services, which will in turn communicate with the Smithsonian Digital Repository, a SQL database that is hosted on-premises.

There are two main Azure cloud services that the ForestGEO App will leverage: Azure Static Web Apps and Azure SQL DB (we recommend Azure SQL Serverless). The Azure Static Web App orchestrates all backend processing. When a user enters the URL for the ForestGEO App in their browser, the Azure Static Web App will return all content necessary to render the page. The Azure Static Web App will handle authentication/authorization, computationally expensive queries such as data validations, and syncing with the digital repository and SQL DB. The Azure SQL DB will be used to store census data and to run stored procedures validating census data.

For example: suppose that a user drops in a CSV file with fresh data from their second census. At an earlier time, the Azure Static Web App would have queried the on-premises repository for published data from the first census and cached said data in the SQL DB. Data from the dropped CSV file is sent to the Azure Static Web App, which has authenticated the user and has authorized the user to read and write data to their site’s database on the SQL DB. The Azure Static Web App runs a suite of validations against the second census data, either through code running directly within the Azure Static Web App or by invoking stored procedures on the SQL DB. For instance, one validation may check to see if a tree in the second census has shrunk compared to the previous census. Validated data are stored alongside pertinent error codes within the SQL DB. Finally, the validated data are returned to the client app, which interprets error codes and displays inline error messages to the user.
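The shrinkage validation mentioned above could look something like the following minimal sketch. The interfaces, tolerance, and error-code format are illustrative assumptions, not the actual implementation (which the spec envisions as stored procedures or backend code):

```typescript
// Hypothetical shrinkage check: flags stems whose DBH decreased between
// censuses beyond a small measurement-error tolerance. Field names, the
// tolerance value, and the error-code string are assumptions.
interface StemMeasurement {
  tag: string;
  dbh: number; // diameter at breast height, in mm
}

function checkShrinkage(
  previous: StemMeasurement[],
  current: StemMeasurement[],
  toleranceMm = 5
): string[] {
  // Index the previous census by tag for O(1) lookup.
  const prevByTag = new Map(
    previous.map((s): [string, number] => [s.tag, s.dbh])
  );
  const errors: string[] = [];
  for (const stem of current) {
    const prevDbh = prevByTag.get(stem.tag);
    // Only flag stems measured in both censuses that shrank past tolerance.
    if (prevDbh !== undefined && prevDbh - stem.dbh > toleranceMm) {
      errors.push(`SHRINKAGE:${stem.tag}`);
    }
  }
  return errors;
}
```

In the proposed architecture, the same comparison would run server side against the cached first-census data in the Azure SQL DB, with the resulting error codes stored alongside the validated rows.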

Front End

  • Web Application
  • Typescript, CSS, HTML
  • React – Open-source JavaScript library

Back End

  • Microsoft Azure Cloud Services
  • Azure Static Web Apps
  • Azure SQL
  • The Smithsonian Digital Repository: an on-premises SQL server

User Flow

  1. Collect data in the field via tablet (ArcGIS Pro or Fast Field Forms, etc.) OR via paper field form and double data entry
  2. Export CSV(s) of data
  3. Wrangle data to fit designated format (established by database manager and already in use)
  4. Upload file(s) to ForestGEO App
  5. Get immediate error report which both specifies the location of the error and its type
  6. If necessary, check data in the field, and make corrections to your spreadsheet(s) using your chosen data entry method
  7. Re-upload corrected spreadsheet(s)
  8. If there are no more errors, submit data for upload to ForestGEO database
  9. Rinse and repeat as new data are collected
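Steps 4 and 5 above can be sketched as a validation pass that turns uploaded rows into an error report carrying both the location and the type of each problem. Column names and error-type strings here are illustrative assumptions, not the app's actual schema:

```typescript
// Sketch of the "immediate error report" in step 5: each finding records
// where the problem is (row and column) and what kind it is.
interface ReportedError {
  row: number;    // 1-based row in the uploaded CSV
  column: string; // field containing the problem
  type: string;   // e.g. "MISSING_VALUE", "OUT_OF_RANGE" (assumed names)
}

function validateRows(rows: Record<string, string>[]): ReportedError[] {
  const report: ReportedError[] = [];
  rows.forEach((r, i) => {
    // Every record needs a tree tag.
    if (!r["tag"]) {
      report.push({ row: i + 1, column: "tag", type: "MISSING_VALUE" });
    }
    // DBH must be a non-negative number.
    const dbh = Number(r["dbh"]);
    if (Number.isNaN(dbh) || dbh < 0) {
      report.push({ row: i + 1, column: "dbh", type: "OUT_OF_RANGE" });
    }
  });
  return report;
}
```

A report in this shape can be rendered inline next to the offending cells, or printed for step 6's field checks.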

Customer Overview

We are working with Forest Global Earth Observatory (ForestGEO), an initiative led by the Smithsonian Institution in partnership with dozens of research institutions around the world to study forest function and diversity globally. The intensity and scale of ForestGEO is unprecedented in forest science. There are 73 forest research sites in 28 countries participating in ForestGEO, monitoring approximately 7 million trees and nearly 12,000 species. Data from ForestGEO has been leveraged to publish over 1,100 scientific articles, train hundreds of early-career scientists and students, and contribute to forest policy and management.

ForestGEO research sites take measurements from the ground, unlike initiatives that monitor forest health via satellite. Satellite data can cover a greater surface area, but it cannot match the fidelity and accuracy of ground-based measurement that are so critical to forest science research, so ForestGEO's methods remain essential.

Personas

Note: Some sites do not have all personas present (ex: Lambir and Palanan)

Director

  • Runs the program across all sites and oversees fundraising
  • Wishes for all research sites to migrate to tablet data entry in the next few years

Database Manager

  • Curate datasets and put them up on the server

  • Train PIs and their field crew in data entry, validation, and how to extract information from database

    • How to use SQL to access relational databases
    • How to use the existing data entry software, CTFSWeb, to enter, screen, and upload data
  • Perform administrative roles for CTFSWeb

  • Conflict resolution and error checking for submitted data

    • For many of the research sites, manually runs SQL scripts to generate error reports (sorted by record rather than by error type), corrects data, and repeats process until all errors resolved
    • Writes new SQL scripts that aren't already hosted on CTFSWeb
  • Example walkthrough: Laura is a database manager. When a research site has completed all data entry for a given census, the lead technician contacts her via email so that she can complete the data screening process and generate meaningful reports of errors organized by tag (the unique identification number given to each tree) and quadrat (see appendix Fig 7 for plot diagram). Because there are so many errors, Laura feeds these error reports to the lead technician in batches, so as not to overwhelm the research site and to make it more likely that they make corrections. The lead technician checks for data entry errors and makes corrections before emailing these corrections back to Laura, who updates the database. Once she has finished entering a set of corrections, new errors arise as a result of them. Laura again contacts the lead technician over email with the updated error report, and the process repeats until there are no more errors to correct. This process can sometimes take months, and Laura is likely doing it for multiple research sites at a time.

Principal Investigator (PI)

  • Head scientist(s) at a research site, sometimes assisted by lead technician
  • Usually "approves" report of post-census summary statistics

Lead Technician

  • Usually does majority of data wrangling and handling (prior to uploading into CTFSWeb for sites using tablets)
  • In the field, measuring during a census
  • Depending on the site, there are one or two lead technicians who train the field crew
  • Point person; buffer between PI and field crew
  • High turnover role, so knowledge gets lost; many times, technicians aren't permanent and may only be there for a census before moving on
  • Example walkthrough: Rodolfo is a lead technician at a research site. When his site has completed a census, he reaches out to the database manager Laura for error reports. He sends members of the field crew out to make corrections, and then sends lists of corrections to Laura to input into the database. He would like to be able to make these changes himself, but he would need to know SQL to write the scripts to make changes on the temporary tables.

Field Crew

  • Performs data collection via paper forms or tablets, depending on site
  • Enters data using CTFSWeb single or double data entry OR enters data into Excel or Access and sends flat files to database managers
  • Example walkthrough: Mica is a member of the field crew. During a census, they are responsible for collecting data in the field during the day and then returning to the field station at night to enter this new data into CTFSWeb. Sometimes Mica forgets to input the right date or mistypes a number, but they're unable to make edits once they submit a data entry form. So, they make note of these mistakes and contact database manager Laura to make changes.
  • Note: At BCI and some African and Asian sites, there are one or two people dedicated to data entry only who start about 3 months after field work has started.

Goals

  • As a database manager, application users can now run their own validations and correct their own errors, so I can better use my technical skills to fulfill my original role responsibilities. I'm also able to easily maintain the application by adding new screening scripts, so that it doesn't become outdated.
  • As a principal investigator, the software is easy to install and intuitive to use, so I can easily onboard new lead technicians, field crew members, and data entry personnel. And, I have the data ready for analysis by my team and me.
  • As a lead technician, I'm able to generate printable error reports immediately after data entry, before too much time has passed, so that I can correct my mistakes in the field the next day and repeat the process.
  • As a director, previously manual data validation is automatic, reducing the time between data entry and data use from months to seconds, so that census data is available for analysis sooner. And, for those datasets that I manage, they are made publicly accessible in close to real time.

Capabilities of Current Solution: CTFSWeb

CTFSWeb was built by a single developer several years ago who has since left ForestGEO. This is the application used by database managers to screen data and by many sites to perform data entry.

  1. Log in with user and site credentials
  2. Viewing and downloading data reports
    1. Creating census-specific, filtered data reports
      1. Filtering for only certain taxa (e.g., "Faramea" species)
      2. Filtering only certain areas of the plot (e.g., quadrat 01-01)
      3. Filtering only certain trees (e.g., all trees with dbh ≥ 10 cm)
    2. Downloading complete data in one file
  3. Configuring data entry forms
    1. Allows for the addition/removal/reordering of columns to accommodate specific site needs
      1. Ex: Some sites opt to hide past censuses' DBH measurements while others opt to include them
      2. The ability to flag certain trees that need special attention (e.g., collect sample for identification)
  4. Data entry
    1. System relies on data entry forms being configured first
    2. Either create new census or work on existing census
    3. Data entered by quadrat into the following forms (see appendix Fig 1, 2 & 3):
      1. New Plants Form
      2. Old Tree Form
      3. Multiple Stems Form
      4. Big Trees Form
      5. Problems Form
    4. Cannot make edits after submission
    5. Keyboard shortcuts for fast work
  5. Data screening
    1. Single Data Entry or Double Data Entry
    2. Data put into temporary tables
    3. Pre-uploading screening (See appendix Fig 4)
    4. "Finishing Up" (browse or download)
      1. Includes creating the ViewFullTable and ViewTaxonomy tables where all the relevant data is put back into Excel-like tables for easy query
    5. Post-uploading screening (See appendix Fig 5)
    6. PI checks results and either sends field people out to check errors in plot or waits until next census/marks as a first draft of database (more likely)
      1. CTFSWeb gives separate lists of each type of error, i.e., one for wrong species, one for wrong DBHs, and one for duplicate tags. But having three or four separate lists isn't helpful to a field person, who needs the errors organized by tag, so database managers use SQL to reformat/manipulate the error reports to be ready for printing.
      2. CTFSWeb doesn't allow users to go back and edit temporary tables or database, so they send corrections to database managers to enter.
    7. See appendix for documentation of current data validations
    8. Uploading data into MySQL database
    9. Generating customizable field form PDFs for print use

Pain Points

  1. Data entry process is time-consuming and has additional steps where errors can be created
    1. Paper forms: Collecting census data on paper and then digitizing it introduces two points of potential error rather than the one that electronic entry would have: a record can be mis-recorded on paper in the field and/or mistyped when transferring data to digital format. Double data entry minimizes these transcription errors.
    2. Tablets: There is no way for people entering data to convert from spreadsheet to the format used by the legacy application, so they email files to database managers, who manually run SQL scripts.
    3. The current "save" process is convoluted. Data that has been submitted but hasn't undergone screening remains in temporary tables; however, the people entering data cannot edit data after submitting. They are therefore incentivized to enter all census data in one batch run rather than screening individual quadrats' worth of data while it's still reasonable to correct errors.
  2. Database manager role is overloaded with manual data validation. Ideally, PIs and field crews enter data, screen for errors, and generate a post-validation report. In practice, much of this responsibility has fallen to the database managers because the entire validation process isn't accessible through the existing software's user interface and instead requires the technical skills to manually run SQL scripts. This impacts the existing flow in a few ways:
    1. Single point of failure. If the database managers were to retire or be replaced, the entire process would break down, because they hold all of the understanding of how the current system functions.
    2. Scalability. The database managers already handle error reports for many research sites (3-6 annually). Additional sites are added from time to time, and each new site is additional work for the database managers.
    3. Overcapacity. Time spent acting as a helpdesk for research sites is time not spent fulfilling the database managers' original roles.
  3. Technical concerns
    1. Not maintainable. Database managers don't know how to edit the existing PHP code where screening scripts are hosted. Every time they want to make a change to the validation process, they must write new SQL scripts, which can only be run manually.
    2. Installation. The installation process is cumbersome and confusing. Getting a site on-boarded requires reading a 40+ page instruction manual, attending training workshops, and/or getting individual help from database managers.
    3. Existing validation is not always robust. Current screening scripts match by tag number or by order of data entry to locate errors. However, if the field crew inputs tag numbers incorrectly, the scripts struggle to detect errors. Additionally, the database managers have a wish list of validations they'd like the software to perform that they don't yet have scripts written for.
  4. Externalities outside ForestGEO's control
    1. Lack of internet connectivity. Some research sites are remote or located in countries facing civil/political conflicts that reduce access to the internet. Some sites can only reach the internet by traveling for hours or may only have internet once per week in an internet café.
    2. Lack of funding. ForestGEO has limited resources for building and maintaining a new solution themselves; today the project is managed by volunteers and a temporary intern.
    3. High turnover rate at research sites. A PI or field worker might know enough SQL from database workshops to run validations on their own, but people come and go and there are always new staff in need of training.

Requirements

*P0 = Essential product feature, P1 = Important, but the product can work without these, P2 = Nice to have

MVP (Minimum Viable Product)

Client-agnostic drag-and-drop data validation platform

1a (P0) User can upload census data via CSV file. Notes: File should meet certain formatting requirements.

1b (P0) User can generate an error report following CSV upload but prior to Smithsonian Digital Repository upload. Notes: Specifies error type and location. User can download and/or print the error report for field use.

1c (P0) User can view previously uploaded file list.

1d (P0) After user corrects errors and re-uploads CSV, the older version of the data is replaced with the revised version. Notes: Requires user confirmation. Version control: soft delete (mark as deleted but retain data until a certain point).

1e (P0) User can generate report of post-census summary statistics, which relies on historical data.

1f (P0) User credentials required before form upload.

1g (P0) Application supports direct integration with the Smithsonian Digital Repository (on-premises SQL server). Notes:
  • User can click a button to get historical data or click a button to send data.
  • Need backend API to store "temporary" data (cloud rather than local); can't connect directly to Smithsonian Digital Repository.
  • When user uploads a CSV, what's the scope of it? Is it one quadrat or the whole census?
  • Data redundancy: Is this process serving as a backup?
  • Can run validations on backend data storage for speed.
  • Should some/all validation be done client side, esp. for offline users? Challenging b/c there are multiple devices per site.

1h (P0) Administrators can add and edit screening/validation scripts. Notes: Keeps the application maintainable long term. Config-driven design.

1i (P1) Application supports streamlined process for conversion of post-census datasets to standard R tables used for analysis.

1j (P1) User can create census-specific, filterable data reports for viewing/downloading.

1k (P2) User can reconfigure error report structure. Notes: User can organize reports either by type of error or by tag.
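Requirement 1h's config-driven design could be sketched as a rule registry, where adding a screening script means registering data plus a function rather than editing application code. All names here are hypothetical:

```typescript
// Sketch of a config-driven validation suite: each rule is a named
// predicate, and administrators extend the suite by registering new
// rules. Row fields and rule names are illustrative assumptions.
type Row = Record<string, string>;
type Rule = { name: string; check: (row: Row) => boolean }; // true = passes

const rules: Rule[] = [
  { name: "tag-present", check: (r) => r.tag !== "" },
  { name: "dbh-positive", check: (r) => Number(r.dbh) > 0 },
];

// An administrator adds a new screening without touching the code above:
rules.push({ name: "quadrat-present", check: (r) => r.quadratName !== "" });

// Run every registered rule against a row; return the names that failed.
function failedRules(row: Row): string[] {
  return rules.filter((rule) => !rule.check(row)).map((rule) => rule.name);
}
```

The same registry shape works whether rules run in the Static Web App backend or are translated into stored procedures on the SQL DB.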

MVP+

Excel-like data entry interface with real time, in-line error validations

2a (P0) User can enter data offline. Notes: Internet access is limited at some sites.

2b (P0) Application automatically saves a copy of the form in its current state. Notes: Version control: the database manager wants research sites to be able to revert to a previous version in the event that something goes wrong.

2c (P0) Data entered has local persistence (application stores the state as data in the user's device storage). Notes: The data entry person would like to be able to make changes to forms after saving/submitting.

2d (P0) User can configure columns on each form (addition, removal, reordering of columns). Notes: Not all sites collect data in the same order or on the same variables.

2e (P0) User(s) can practice double data entry (entering the same data twice, comparing versions, and selecting the most accurate record) to minimize transcription errors. Notes: Read more about double data entry as a means for quality control here.

2f (P0) Users can complete manual data entry quickly using keyboard shortcuts.

2g (P0) User can generate customizable field form PDFs for print use. Notes: This step is done once at the beginning of a census for each form. Would like to be able to fit two tables of data side by side on a single sheet to conserve paper.

2h (P1) User can enter location data and specify conversion parameters in order to standardize location data. Notes: Location data are collected on different scales (UTM, global, local, 20x20m, 10x10m, or 5x5m coordinates) depending on site. Makes sure that there's a location for every new tag and informs of errors.
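Requirements 2a-2c (offline entry with automatic saves and local persistence) could be sketched as follows. The storage interface is injected so the same code can back onto window.localStorage in the browser; the key format and row shapes are assumptions:

```typescript
// Sketch of draft persistence for offline data entry: form state is
// serialized to device storage after every edit, so work survives
// offline sessions and can be reopened for correction later.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Save the current rows of a form under a per-form key (key name assumed).
function saveDraft(store: KeyValueStore, formId: string, rows: object[]): void {
  store.setItem(`draft:${formId}`, JSON.stringify(rows));
}

// Load a previously saved draft, or an empty form if none exists.
function loadDraft(store: KeyValueStore, formId: string): object[] {
  const raw = store.getItem(`draft:${formId}`);
  return raw === null ? [] : JSON.parse(raw);
}
```

In the browser, `saveDraft(window.localStorage, ...)` would satisfy 2c directly; keeping timestamped keys instead of a single key would give the revert-to-previous-version behavior asked for in 2b.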

Lo-Fi UI Mockups

Note: Not all features of the new ForestGEO App user experience are included in these mockups.

MVP

Figure 1.1 Drop box for CSV files containing census data that is either new or revised.

Figure 1.2 CSV-uploaded census data displayed with inline error messages, for print or download.


Figure 1.3 Previously uploaded file list, with indication of validation status and ability to edit or delete files.


MVP+

Note: This iteration includes a stepper bar which contextualizes the user's location in the entire census data management process.

Figure 2.1 Site-specific form configuration with column reordering, addition, and/or removal.


Figure 2.2 Previously entered data list, separated by form and quadrat, and options for user to upload CSV file or manually enter data.


Figure 2.3 Excel-like manual data entry with real-time, in-line error validations and double data entry option accessed via "Compare Versions" button.


Figure 2.4 Comparing two versions of the same data, entered using the double data entry method, and selecting the most correct record before merging into one file.


Solutions Previously Explored

  • Updating legacy CTFSWeb application rather than rebuilding. CTFSWeb was designed to run on a local machine for one research site. In order to leverage the cloud, support multiple sites, and make other much-needed enhancements to the user experience, we would need to change nearly every piece of the CTFSWeb source code. Therefore, our assessment is that creating a new application will overall involve less friction than updating CTFSWeb to meet the needs of users. We've also consulted with the original developers of CTFSWeb, and they agree with our assessment.
  • Automated data entry using Azure Forms Recognizer. In Summer 2020, Microsoft Global Hackathon volunteers spent a couple days exploring using machine learning to convert paper field forms to digital data. This investigation did not yield a viable prototype, since ForestGEO paper forms were found to have inconsistent formatting and sometimes multiple languages.
  • Progressive Web Application. In Fall 2021, Microsoft Global Hackathon volunteers including software engineers, data scientists, and a program manager worked to improve upon the progress made in the previous hackathon on the single data entry webpage. The team successfully built a progressive web application (PWA) that functions both online and offline. They also documented existing pre- and post-uploading screenings and determined whether they require internet to run (Appendix Fig 4 & 5).

Out of Scope

  1. Modifying the schema of the final data in the Smithsonian Digital Repository.
  2. Managing supplementary data processes (wood density, phenology, rainfall, etc.) not directly related to five-year, full census.

Appendix

Figure A.1 Function of data entry forms

Explains the functions and use cases for each data entry form used during a census.

Data Entry Form Description
New Plants Form • Enter data for recruits ≥ 1cm
• Only form needed for first time census
Old Tree Form • Enter new data for existing trees and stems
Multiple Stems Form • Enter data for new stems ≥ 1cm growing from existing tree
• Tree tags should already exist in Old Tree Form
Big Trees Form • Enter data if big trees are measured at a different time than the rest of the trees
• Rarely used, according to Suzanne
Problems Form • Rarely, if ever, used except at BCI b/c field forms for old trees have 2 columns and no place to enter comments
• Paper form used by field supervisors to correct problems in the field
• Data entry technicians have to refer to this form and when necessary, input observations into the field comments
• Not many notes relative to number of trees

Figure A.2 Data entry form

Example configuration of an Old Tree Form in CTFSWeb. Someone entering data can use specific keycodes to move from one field to the next and/or add new fields.


Figure A.3 Column names for data entry forms

List of column names available for use during configuration of data entry forms. Artifact from CTFSWeb that should be streamlined into new ForestGEO App.

COLUMNS COLUMN NAME (USE THIS NAME WHILE ADDING COLUMNS)
TreeID treeID
Tag, Tree Tag tag
StemID stemID
Stem Tag, Multi Stem stemTag
QuadratID quadratID
Quadrat Name quadratName
Subquadrat subquadrat
SpeciesID speciesID
Mnemonic, Species species
SubSpeciesID subSpeciesID
PlotID plotID
CensusID censusID
PlotCensusNumber plotCensusNumber
Primary Stem primaryStem
X x
Y y
DBH, Diameter dbh
OldDBH, Old Diameter olddbh
HOM, Height hom
OldHOM, Old Height oldhom
Codes codes
Old Codes oldcodes
Comments comments
Exact Date exactDate
Errors errors
Old Trees oldTrees
Big Trees bigTrees
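For illustration, the column names above could map onto a typed record in the new app. Which fields are optional is an assumption, since sites configure their own forms; only a subset of the columns is shown:

```typescript
// Partial typed row built from the Figure A.3 column names. Optionality
// and numeric types are assumptions for this sketch.
interface CensusRow {
  tag: string;          // tree tag
  stemTag?: string;     // multi-stem tag
  quadratName: string;
  subquadrat?: string;
  species?: string;     // mnemonic
  x?: number;           // coordinates within quadrat
  y?: number;
  dbh?: number;         // diameter
  hom?: number;         // height of measure
  codes?: string;
  comments?: string;
  exactDate?: string;
}

// Example row as a data entry form might produce it.
const exampleRow: CensusRow = {
  tag: "000123",
  quadratName: "01-01",
  dbh: 152,
};
```

Typed rows like this would let the form configuration in requirement 2d add or remove columns while keeping validation code aware of each field's meaning.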

Figure A.4 Pre-upload screening

Explanation of each validation function performed during pre-upload screening and whether it can be run locally or requires internet connection to the cloud. Written by data science volunteers from Microsoft during internal hackathon.

Validation function Notes Location required to run validation
 makeErrorBlank Sets Errors as empty cell when errors are null or none. Local
 checkDuplTagout Check for duplicate Tag+StemTag (unique combination) in another database.
Uses ViewFullTable
Cloud
 checkDuplTagin Check for duplicate Tag +StemTag (unique combination) within temporary database. Local
 screenXY Checks if the x and y-coordinates are within plot dimensions
"SELECT TempID FROM ".$file." as a,Site as b WHERE a.PlotID=b.PlotID and a.X NOT BETWEEN 0 AND b.QDimX AND Errors<>'NONE'"
Local
checkDuplStemin Check for duplicate Tree and Stem tags within temporary database. Does not include dead stems that couldn't be tagged (StemTag=NULL)
"SELECT Tag, StemTag, COUNT(Tag) as cnt FROM ".$fileName." GROUP BY Tag, StemTag HAVING cnt > 1"
Local
screenDiam Check for diameter range (d < 0 or d > max possible raises an error).
"SELECT TempID FROM ".$fileName." WHERE DBH<>0 AND (DBH>".$maxDBH." OR DBH<".$minDBH.") AND Errors<>'NONE'"
Minimum DBH for a stem is often 1.0, but this depends on whether site uses mm or cm.
Local
screenCodes Check for invalid codes and for Trees with both a D code and diameter measurement
Get codes from TSMAttributes table and put in 1-D array
Check if there are duplicate codes
Flag if code does not appear in TSMAttributes table
Flag if tree is dead and there is a dbh
Local
 screenMainStems Assumes there are main stem codes. Check if there is more than one main stem, if there is none, if main stem died, etc. for multiple-stemmed trees
Get all the codes
Get all the multiple-stemmed trees
Screen for more than one main stem
Screen if no main stem
Screen if main stem died and there are other live stems
Screen if main stem does not have largest dbh
Local
checkSpecies Check that species codes in filein are found in fileout
"SELECT TempID,Mnemonic FROM ".$filein." WHERE Mnemonic IS NULL or Mnemonic NOT IN (SELECT DISTINCT Mnemonic FROM ".$fileout.") AND Errors<>'NONE'"
Cloud
checkQuadrat Check for quadrats not matching another file, i.e. tempNewPlants. Is there an easier way??? Remember that there may be more than one record with the same tag in tempNewPlants
"SELECT DISTINCT Tag, QuadratName FROM ".$fileout." WHERE PlotID=".$plotID
Cloud
checkTagnotout Check for Tree tag NOT in another database
"SELECT TempID,Tag FROM ".$filein." WHERE Tag NOT IN (SELECT DISTINCT TAG FROM ".$fileout." WHERE PlotID=".$plotID.") AND Errors<>'NONE'"
Cloud
checkQuadName Check that QuadratName in filein are found in fileout Cloud
screenDuplQuad This public function checks to see if all the stems of a tree have the same quadrat Cloud
screenDuplSpec This public function checks to see if all the stems of a tree have the same species code Local
createTempSpecies Creates a table if species doesn't exist Local
screenSpecies Check whether genus name not found in Genus table
Needs Genus cache
Local
cntSpeciesrecords Counts the number of species records Local
startCensusDate Screens for dates entered, should not be earlier than the user defined start date.
Sets the class variables PlotID, PlotCensusNumber and returns StartDate
Local
screenDuplQuadratInfo Checks for duplicated quadrats Local
screenQuadratInfo Check for date ranges
"SELECT TempID FROM TempPersonnelInfo WHERE DateWorked NOT BETWEEN '".$start."' AND '".$end."' AND Errors<>'NONE'"
Local
checkGrowth Check for extreme growth - for Old Trees form
Flags stems where absolute annual growth > $annualGrowth mm
Check for shrinkage in recensuses - only for Old Trees form
Flags records with annual relative growth <= -0.05 in stems with previous dbh>=50mm and no current stem lost code
Local
checkTagInOldTrees Check that all Tree tags in the Multiple Stems file appear in tempOldTrees
"SELECT TempID FROM ".$tempMultiName." WHERE Tag NOT IN (SELECT DISTINCT TAG FROM ".$tempOldName." UNION SELECT DISTINCT TAG FROM ".$tempOldName."Error) AND Errors<>'NONE'"
Local
checkDuplStemout Check for duplicate Tree and Stem tags in another database
"SELECT TempID,Tag,StemTag FROM ".$filein." WHERE (Tag,StemTag) IN (SELECT DISTINCT TAG,StemTag FROM ".$fileout." WHERE PlotID = ".$plotID." UNION SELECT DISTINCT TAG,StemTag FROM ".$fileout."Error where PlotID = ".$plotID.") AND Errors<>'NONE'"
Cloud
screenMainStems2 Check main stems for multiple-stemmed trees in 2 files: Old Trees and Multiple Stems
Get all the codes
Get all the multiple-stemmed trees in first file
Get all the tree tags from multiple-stemmed file
Screen for more than one main stem
Screen if no main stem
Screen if main stem died and there are other live stems
Screen if main stem does not have largest dbh
Cloud
  ifTableExist Check if a table exists
  censusStartDate Check for valid census using plotID and CensusNumber
Helper Functions
startTransaction Starts db transaction
endTransaction Ends db transaction
_getCodes Gets codes
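As an illustration of how a screening like checkDuplTagin might be ported out of PHP/SQL into the new app, here is a sketch of the same duplicate Tag+StemTag check; the record shape and key format are assumptions:

```typescript
// Sketch mirroring checkDuplTagin from Figure A.4: find duplicate
// Tag+StemTag combinations within one batch, the in-memory equivalent of
// GROUP BY Tag, StemTag HAVING cnt > 1.
interface TaggedStem {
  tag: string;
  stemTag: string;
}

function checkDuplTagin(stems: TaggedStem[]): string[] {
  // Count occurrences of each Tag+StemTag combination.
  const counts = new Map<string, number>();
  for (const s of stems) {
    const key = `${s.tag}+${s.stemTag}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Keep only combinations that appear more than once.
  return Array.from(counts.entries())
    .filter(([, n]) => n > 1)
    .map(([key]) => key);
}
```

The cloud-side variant (checkDuplTagout) differs only in where the comparison set comes from: the ViewFullTable in the repository rather than the current batch.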

Figure A.5 Report of post-census summary statistics

Explanation of each summary statistic function performed after a census and whether it requires current data or previous censuses to run. Written by data science volunteers from Microsoft during internal hackathon.

Validation Notes Location required to run validation
-- A. COUNT THE NUMBER OF RECORDS BY QUADRAT --
-- This is useful for finding quadrats with no stems (not listed) or quadrats that have too many stems

#1 -- Counting all stem records, dead or alive
SELECT PlotID,CensusID,QuadratName,COUNT(*) FROM ViewFullTable GROUP BY PlotID,CensusID,QuadratName;

#2 -- Counting all live stem records only
SELECT PlotID,CensusID,QuadratName,COUNT(*) FROM ViewFullTable WHERE status<>'dead' AND status<>'stem dead' AND status<>'missing' GROUP BY PlotID,CensusID,QuadratName;

#3 -- Counting number of trees
SELECT PlotID,CensusID,QuadratName,COUNT(DISTINCT TreeID) FROM ViewFullTable GROUP BY PlotID,CensusID,QuadratName;
Number of alive trees per quadrat.
These queries could be simplified.
Would #1 and #3 be the same?
This seems to be gathering stats, not performing a validation.
Batch validation on current data.
No validation criteria specified here.
-- B. COUNT THE NUMBER OF DEAD STEMS AND MISSING STEMS BY CENSUS --
SELECT CensusID,status,COUNT(*) FROM ViewFullTable WHERE status IN ('dead','stem dead','missing') GROUP BY CensusID,status;
Counts the number of dead trees.
This should equal the total count above minus the number alive; no need to recompute it.
Batch validation on current data.
No validation criteria specified here.
-- C. CHECK WHETHER THERE ARE ANY TREES LOCATED OUTSIDE THE LIMITS OF THE PLOT OR ANY TREES WITH MISSING COORDINATES --
SELECT * FROM ViewFullTable WHERE X IS NULL OR Y IS NULL;
-- For a 1000 by 500m plot
SELECT * FROM ViewFullTable WHERE PX>=@site_xdim OR PY>=@site_ydim;
Check coordinates are valid. Simple type validation, could be in javascript.
-- D. CHECK WHETHER THERE ARE ANY DUPLICATE TAGS --
SELECT Tag,COUNT(*) FROM Tree GROUP BY Tag HAVING COUNT(*)>1;
-- If StemTags are unique --
SELECT StemTag,COUNT(*) FROM Stem GROUP BY StemTag HAVING COUNT(*)>1;
-- If StemTags are not unique, i.e. starts with 1 for every tree --
SELECT TreeID,StemTag,COUNT(*) FROM Stem GROUP BY TreeID,StemTag HAVING COUNT(*)>1;
Checks for duplicates within the current census. Batch validation on current data.
-- E. DISPLAY THE RANGE OF THE DATES OF THE CENSUSES --
SELECT CensusID,MIN(ExactDate) AS DateMin, MAX(ExactDate) AS DateMax
FROM ViewFullTable
GROUP BY CensusID;
Checks that the date range is valid. Simple type validation; could be done in JavaScript.
-- F. DISPLAY THE LARGEST DBH AND HEIGHT OF MEASURE FOR EACH SPECIES --
-- Can check whether any shrub species have too large of a DBH, for example
SELECT Mnemonic, MAX(DBH), MAX(HOM)
FROM ViewFullTable
GROUP BY Mnemonic;
Checks that the maximum height and DBH are valid. Simple type validation; could be done in JavaScript.
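If per-species limits were available to the client, this check could run in JavaScript as well. A minimal sketch, assuming a hypothetical `speciesLimits` lookup table keyed by mnemonic (not part of the current schema):

```javascript
// speciesLimits maps a species mnemonic to its maximum plausible DBH,
// e.g. so that a shrub species cannot have a tree-sized DBH.
function checkSpeciesDbh(record, speciesLimits) {
  const limit = speciesLimits[record.mnemonic];
  if (limit != null && record.dbh > limit) {
    return ["DBH_EXCEEDS_SPECIES_MAX"];
  }
  return [];
}
```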
-- G. CHECK WHETHER ANY TREES HAVE BOTH A DEAD AND ALIVE STATUS IN ONE CENSUS --
SELECT * FROM ViewFullTable
WHERE status='dead' AND (CensusID,TreeID) IN (SELECT CensusID,TreeID FROM ViewFullTable WHERE status<>'dead' AND status<>'stem dead');
Checks for trees with both a dead and an alive status in one census. Are we expecting duplicate tree IDs, or would finding duplicate tree IDs already be an error? Maybe not needed.
-- H. CHECK WHETHER ALL TREES FROM ONE CENSUS ARE ACCOUNTED FOR IN THE NEXT CENSUS --
SELECT * FROM ViewFullTable
WHERE CensusID=@census1 AND StemID NOT IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census2);
Checks for stems from the previous census that are missing in the current census. Validation requiring the previous census.
-- I. CHECK WHETHER ANY TREES THAT WERE DEAD IN ONE CENSUS TURN OUT ALIVE IN THE NEXT CENSUS --
SELECT * FROM ViewFullTable
WHERE CensusID=@census1 AND
Status IN ('dead','stem dead') AND
StemID IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census2 AND status<>'dead' AND status<>'stem dead');
Checks for invalid alive trees (those that were dead in the previous census). Validation requiring the previous census.
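Given snapshots of both censuses, this cross-census check can also be expressed as a simple set operation. A minimal sketch, assuming illustrative `stemId`/`status` fields (the function name is hypothetical):

```javascript
// Statuses treated as dead, following the SQL queries above.
const DEAD = new Set(["dead", "stem dead"]);

// Given stem records from the previous and current census, return the
// StemIDs that were dead before but are recorded alive now (query I).
function resurrectedStems(prevCensus, currCensus) {
  const deadBefore = new Set(
    prevCensus.filter(s => DEAD.has(s.status)).map(s => s.stemId));
  return currCensus
    .filter(s => !DEAD.has(s.status) && deadBefore.has(s.stemId))
    .map(s => s.stemId);
}
```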
-- J. CHECK WHETHER THERE ARE ANY TREES WITH EXTREME GROWTH RATES --
-- There are several ways of calculating growth rates, the following are some examples --
-- The following uses the temporary table TempOldTrees and the view table ViewFullTable and compares the dbh from the 2nd and 3rd census

-- To find stems where the growth rate is too high, listing stems where absolute annual growth > 65 mm.
-- NOTE: Growth rates should be calculated only when the HOM is the same in both censuses and both dbhs >= 10 (if in mm).

SELECT A.Tag,A.StemTag,StemID,A.mnemonic,A.QuadratName,B.DBH AS DBH2, A.DBH AS DBH3
FROM TempOldTrees A JOIN ViewFullTable B USING (StemID)
WHERE B.CensusID=@census1 AND A.DBH>=10 AND B.DBH>=10 AND A.HOM=B.HOM
AND (A.DBH-B.DBH)/((to_days(A.ExactDate)-to_days(B.ExactDate))/365.25) > 65;

-- To find stems where the growth rate is too negative:
-- List stems where annual relative growth <= -0.05 for stems where the previous dbh >= 50 mm AND there are no broken-stem codes in the 3rd census

SELECT A.Tag,A.StemTag,StemID,A.mnemonic,A.QuadratName,B.DBH AS DBH2, A.DBH AS DBH3,A.codes
FROM TempOldTrees A JOIN ViewFullTable B USING (StemID)
WHERE B.CensusID=@census1 AND A.DBH>=10 AND B.DBH>=50
AND ((A.DBH-B.DBH)/B.DBH)/((to_days(A.ExactDate)-to_days(B.ExactDate))/365.25) <=-0.05;
Check for extreme growth rates. Validation requiring previous census.
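The growth-rate rule above could equally be computed per stem in application code. A minimal sketch of the absolute-growth case, assuming illustrative records with `dbh`, `hom`, and a JavaScript `Date` (function names are hypothetical):

```javascript
// Annual absolute growth in mm/yr between two measurements of the same
// stem, following the formula in query J above. Returns null when the
// rule says growth should not be computed: different HOM, or either
// DBH below 10 mm.
function annualGrowth(prev, curr) {
  if (prev.hom !== curr.hom || prev.dbh < 10 || curr.dbh < 10) return null;
  const years = (curr.date - prev.date) / (365.25 * 24 * 3600 * 1000);
  return (curr.dbh - prev.dbh) / years;
}

// Flag extreme growth exactly as in query J: absolute growth > 65 mm/yr.
function growthTooHigh(prev, curr) {
  const g = annualGrowth(prev, curr);
  return g !== null && g > 65;
}
```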
-- COUNT NUMBER OF RECRUITS PER QUADRAT PER CENSUS --
-- Check what quadrats do not have recruits or too few recruits --

SELECT QuadratName,COUNT(*) FROM ViewFullTable
WHERE DBH>=10 AND CensusID=@census2 AND StemID NOT IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census1)
GROUP BY QuadratName
ORDER BY QuadratName;

-- Check number of recruits where dbh is between 10 and 20. A site was found to have too few small recruits in one census --

SELECT COUNT(*) FROM ViewFullTable
WHERE DBH>=10 AND DBH<20 AND CensusID=@census2 AND StemID NOT IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census1);
Checks the number of recruits (stems that appear in the new census but not in the previous one).
Is the DBH check required?
Could this check be done by just comparing total counts in the new census vs. old + alive?
Validation requiring the previous census.
Possible simplification: run as a batch validation requiring only the current census.
-- COUNT NUMBER OF DEAD STEMS PER QUADRAT OR PER SPECIES --

-- Count number of dead stems per quadrat --
SELECT QuadratName,COUNT(*) FROM ViewFullTable
WHERE Status IN ('dead','stem dead') AND CensusID=@census2 AND StemID IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census1 AND Status NOT IN ('dead','stem dead','missing'))
GROUP BY QuadratName
ORDER BY QuadratName;
-- Count number of dead stems per species in descending order of number of stems --
SELECT Mnemonic,COUNT(*) FROM ViewFullTable
WHERE Status IN ('dead','stem dead') AND CensusID=@census2 AND StemID IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census1 AND Status NOT IN ('dead','stem dead','missing'))
GROUP BY Mnemonic
ORDER BY COUNT(*) DESC;
Seems like summary statistics, not validation criteria.
Checks the number of newly dead trees in this census vs. the previous one.
Analysis requiring the previous census.

Figure A.6 Asks

High level summary of requests made by ForestGEO about usage capabilities of the new application.

| Usage scenario | Details | Concern |
| --- | --- | --- |
| Data entry can be performed offline. | Ideally, screening can also run offline. Given the number of remote sites without internet connectivity, a method to pre-validate using offline scripts hosted locally would be helpful. Post-validation scripts should ideally also run offline, but that would require constant backups of the previous censuses. | Post-validation scripts might not be able to run offline and may take too much compute power or technical know-how to run. |
| More insightful validation scripts | Shameema and Suzanne have to write additional scripts to do error checking because their existing validation scripts don't give accessible results. | We know where the scripts are located on the site, but have not investigated whether we can rework them to run more quickly. |
| The validation process allows for editing and addition/removal of new validations. | Shameema and Suzanne know SQL well, but they don't know PHP well enough to edit the currently-hosted scripts. They must run manual scripts. | |
| Faster data entry for paper sites | For paper sites especially, having to re-type paper entries into a database leaves the site more prone to mistakes and slows down the process. | Their current system functions today, and there's limited room for improvement given the offline requirements. |
| PIs and field people take on the validation role | | Multiple challenges here: not sure whether field people would use a new solution even if it were more accessible, because they don't see screening as part of the job description. |
| Faster and easier installation | | This is a one-time operation, and might be improved upon incidentally if we improve the validation process. |
| Faster data entry for online sites | | Customer has an existing solution that's highly manual, but does function. |

Figure A.7 Plot diagram

Simple illustration of the plot setup.
