Perform ETL into the OpenAQ Low Cost Sensor Database
- yarn cdk deploy: deploy this stack to your default AWS account/region
- yarn cdk diff: compare deployed stack with current state
- yarn cdk synth: emits the synthesized CloudFormation template
JavaScript documentation can be generated by running the following:
yarn doc
Tests can be run using the following:
yarn test
Configuration for the ingestion is provided via environment variables.
BUCKET
: The bucket to which the ingested data should be written. Required.
SOURCE
: The data source to ingest. Required.
LCS_API
: The API used when fetching supported measurands. Default: 'https://api.openaq.org'
STACK
: The stack with which the ingested data should be associated. This is mainly used to apply a prefix to data uploaded to S3 in order to separate it from production data. Default: 'local'
SECRET_STACK
: The stack with which the used Secrets are associated. At times, a developer may want to use credentials relating to a different stack (e.g. a developer testing the script wants output data uploaded to the local stack but wants to use the production stack's secrets). Default: the value of the STACK env variable
VERBOSE
: Enable verbose logging. Default: disabled
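The defaults described above can be sketched as a small loader. This is only an illustration and not the project's actual code: the function name loadConfig and the returned property names are hypothetical, and treating any non-empty VERBOSE value as "enabled" is an assumption.

```javascript
// Hypothetical helper: not part of the project, shown only to illustrate
// how the documented environment variables and their defaults fit together.
function loadConfig(env = process.env) {
  // BUCKET and SOURCE have no defaults, so fail early if either is missing.
  for (const name of ["BUCKET", "SOURCE"]) {
    if (!env[name]) {
      throw new Error(`Missing required environment variable: ${name}`);
    }
  }

  const stack = env.STACK || "local";

  return {
    bucket: env.BUCKET,
    source: env.SOURCE,
    lcsApi: env.LCS_API || "https://api.openaq.org",
    stack,
    // SECRET_STACK falls back to the value of STACK.
    secretStack: env.SECRET_STACK || stack,
    // Assumption: any non-empty value enables verbose logging.
    verbose: Boolean(env.VERBOSE),
  };
}
```

Passing a plain object instead of process.env makes it easy to check the fallback behaviour without touching the real environment.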
To run the ingestion script locally (useful for testing without deploying), see the following example:
LCS_API=https://api.openaq.org \
STACK=my-dev-stack \
SECRET_STACK=my-prod-stack \
BUCKET=openaq-fetches \
VERBOSE=1 \
SOURCE=habitatmap \
node fetcher/index.js
Data sources can be configured by adding a config file and a corresponding provider script. The two sections below outline what is necessary to create a new source.
The first step for a new source is to add a JSON config file to the fetcher/sources directory.
{
  "schema": "v1",
  "provider": "example",
  "frequency": "hour",
  "meta": {}
}
Attribute | Note |
---|---|
provider | Unique provider name |
frequency | day, hour, or minute |
The config file can contain any properties that should be configurable via the provider script. The table above, however, outlines the attributes that are required.
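As a rough illustration of the required attributes, a config object could be checked before use. The function validateSourceConfig below is hypothetical and is not part of the fetcher. It only encodes the two requirements listed in the table.

```javascript
// Allowed values for the "frequency" attribute, as listed in the table above.
const FREQUENCIES = ["day", "hour", "minute"];

// Hypothetical validator: returns a list of problems (empty when the config is valid).
function validateSourceConfig(config) {
  const problems = [];
  if (typeof config.provider !== "string" || config.provider.length === 0) {
    problems.push("provider must be a non-empty string");
  }
  if (!FREQUENCIES.includes(config.frequency)) {
    problems.push(`frequency must be one of: ${FREQUENCIES.join(", ")}`);
  }
  return problems;
}

// Usage with the example config shown earlier.
const exampleProblems = validateSourceConfig({
  schema: "v1",
  provider: "example",
  frequency: "hour",
  meta: {},
});
console.log(exampleProblems.length === 0 ? "config ok" : exampleProblems.join("; "));
```

Any additional provider-specific properties pass through untouched, since only the required attributes are inspected.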
The second step is to add a new provider script to the fetcher/providers directory.
The script here should expose a function named processor. This function should pass SensorSystem & Measures objects to the Providers class.
The script below is a basic example of a new source:
const Providers = require("../providers");
const { Sensor, SensorNode, SensorSystem } = require("../station");
const { Measures, FixedMeasure, MobileMeasure } = require("../measure");

async function processor(source_name, source) {
  // Get Locations/Sensor Systems via http/s3 etc.
  // (get_locations is a placeholder for the provider's own fetching logic)
  const locs = await get_locations();

  // Map locations into SensorNodes
  const station = new SensorNode();
  await Providers.put_stations(source_name, [station]);

  // Use one collection type or the other, depending on the source
  const fixed_measures = new Measures(FixedMeasure);
  // or
  const mobile_measures = new Measures(MobileMeasure);

  fixed_measures.push(
    new FixedMeasure({
      sensor_id: "PurpleAir-123",
      measure: 123,
      timestamp: Math.floor(Date.now() / 1000), // UNIX timestamp in seconds
    })
  );

  await Providers.put_measures(source_name, fixed_measures);
}

module.exports = { processor };
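Most providers will need to convert their own reading format into the sensor_id / measure / timestamp fields used in the script above. The sketch below shows one way to do that conversion. The input shape ({ id, value, time } with an ISO 8601 time string) and the function name toMeasurePayload are assumptions made for illustration; real providers will differ.

```javascript
// Hypothetical raw reading shape: { id, value, time },
// where `time` is an ISO 8601 string such as "2021-01-01T00:00:00Z".
function toMeasurePayload(reading) {
  const millis = Date.parse(reading.time);
  if (Number.isNaN(millis)) {
    throw new Error(`Unparseable timestamp: ${reading.time}`);
  }
  return {
    sensor_id: String(reading.id),
    measure: Number(reading.value),
    // Convert milliseconds since the epoch to a UNIX timestamp in seconds.
    timestamp: Math.floor(millis / 1000),
  };
}
```

The returned object has the same fields that the example script passes to the FixedMeasure constructor.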
For data providers that require credentials, credentials should be stored in AWS Secrets Manager with an ID composed of the stack name and provider name, such as :stackName/:providerName.
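A minimal sketch of how such an ID could be composed, assuming the stack portion follows the SECRET_STACK / STACK fallback described in the configuration section. The helper name secretId is hypothetical and not taken from the project.

```javascript
// Hypothetical helper: builds a ":stackName/:providerName" secret ID.
// Assumption: SECRET_STACK takes precedence, then STACK, then "local".
function secretId(providerName, env = process.env) {
  const stack = env.SECRET_STACK || env.STACK || "local";
  return `${stack}/${providerName}`;
}

console.log(
  secretId("habitatmap", { STACK: "my-dev-stack", SECRET_STACK: "my-prod-stack" })
);
// prints "my-prod-stack/habitatmap"
```

The resulting string is what would be supplied as the SecretId when requesting the secret value from AWS Secrets Manager.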
Some providers (e.g. CMU, Clarity) require us to read data from Google services (e.g. Drive, Sheets). To do this, the organization hosting the data should do the following:
- create a project & enable access to the required APIs
- create a service account
- generate service account keys
The generated key should look something like the following and be stored in its entirety within AWS Secrets Manager.
{
  "type": "service_account",
  "project_id": "project-id",
  "private_key_id": "key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
  "client_email": "service-account-email",
  "client_id": "client-id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/service-account-email"
}
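Because the key is stored as a single JSON document, a provider script would typically parse the secret string and confirm it looks like a service account key before using it. The sketch below is one possible check, not the project's code: parseServiceAccountKey is a hypothetical name, and the choice of which fields to treat as mandatory is an assumption.

```javascript
// Assumption: these fields are the minimum needed to authenticate.
const REQUIRED_KEY_FIELDS = ["type", "project_id", "private_key", "client_email", "token_uri"];

// Hypothetical helper: parses the secret string and performs basic sanity checks.
function parseServiceAccountKey(secretString) {
  const key = JSON.parse(secretString); // throws SyntaxError on malformed JSON

  const missing = REQUIRED_KEY_FIELDS.filter((field) => !key[field]);
  if (missing.length > 0) {
    throw new Error(`Service account key is missing: ${missing.join(", ")}`);
  }
  if (key.type !== "service_account") {
    throw new Error(`Unexpected key type: ${key.type}`);
  }
  return key;
}
```

Failing early here gives a clearer error than an authentication failure later in the run.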