The purpose of this lab is to implement a Business Intelligence (BI) system using AWS analytics services.
You'll learn the concepts of lambda architecture and the actual deployment process through the example of building a serverless business intelligence system using Amazon Kinesis, S3, Athena, OpenSearch Service, and QuickSight.
Through this lab, you will set up a Data Collection -> Store -> Analysis/Processing -> Visualization pipeline.
- Read this in other languages: English, Korean (한국어)
- Solutions Architecture Overview
- Lab setup
- [Step-1a] Create Kinesis Data Streams to receive input data
- [Step-1b] Create Kinesis Data Firehose to store data in S3
- [Step-1c] Verify data pipeline operation
- [Step-1d] Analyze data using Athena
- [Step-1e] Data visualization with QuickSight
- (Optional)[Step-1f] Combine small files stored in S3 into large files using AWS Lambda Function
- [Step-2a] Create Amazon OpenSearch Service for Real-Time Data Analysis
- [Step-2b] Ingest real-time data into OpenSearch using AWS Lambda Functions
- [Step-2c] Data visualization with Kibana
- Recap and Review
- Resources
- Reference
- Deployment by AWS CDK
[Top]
Before starting the lab, create and configure the EC2 instance and the IAM user you need.
[Top]
Select Kinesis from the list of services on the AWS Management Console.
- Make sure the Kinesis Data Streams radio button is selected and click the Create data stream button.
- Enter the desired name as the Data stream name (e.g. retail-trans).
- Choose either the On-demand or Provisioned capacity mode.
With the On-demand mode, you can create the data stream right away.
With the Provisioned mode, you must specify the number of shards you need. For this lab, enter 1 in Number of open shards under Data stream capacity.
- Click the Create data stream button and wait for the status of the created Kinesis stream to become Active.
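The same choice between capacity modes can be expressed in code. The following is a minimal sketch, assuming you use boto3 instead of the console; the helper names `build_create_stream_params` and `create_stream` are hypothetical and not part of the lab's scripts. It only shows how the On-demand and Provisioned modes differ in the parameters passed to the Kinesis `CreateStream` API.

```python
from typing import Any, Dict, Optional


def build_create_stream_params(
    stream_name: str,
    capacity_mode: str = "ON_DEMAND",
    shard_count: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the Kinesis CreateStream API call."""
    if capacity_mode not in ("ON_DEMAND", "PROVISIONED"):
        raise ValueError(f"unknown capacity mode: {capacity_mode}")
    params: Dict[str, Any] = {
        "StreamName": stream_name,
        "StreamModeDetails": {"StreamMode": capacity_mode},
    }
    if capacity_mode == "PROVISIONED":
        # Provisioned streams need an explicit shard count; the lab uses 1.
        params["ShardCount"] = shard_count if shard_count is not None else 1
    return params


def create_stream(region_name: str, **params: Any) -> None:
    """Create the stream and wait until it is active.

    Requires boto3 and AWS credentials, so it is not called in this sketch.
    """
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("kinesis", region_name=region_name)
    client.create_stream(**params)
    client.get_waiter("stream_exists").wait(StreamName=params["StreamName"])
```

For example, `create_stream("us-west-2", **build_create_stream_params("retail-trans", "PROVISIONED", 1))` would mirror the console steps above for the Provisioned mode.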
[Top]
Kinesis Data Firehose collects streaming data in real time and batches it for loading into a storage location such as Amazon S3, Amazon Redshift, or OpenSearch Service.
- If you are on the Kinesis Data Streams page from the previous step, select Delivery streams from the left sidebar. If you are starting from the Kinesis landing page, select the Kinesis Data Firehose radio button and click the Create delivery stream button.
- (Step 1: Name and source) For Delivery stream name, enter retail-trans.
- Under Choose a source, select the Kinesis Data Stream radio button and choose the retail-trans stream that you created earlier from the dropdown list. Click Next. If you do not see your data stream listed, make sure you are in the Oregon region and that the data stream from the previous step is in the Active state.
- (Step 2: Process records) For Transform source records with AWS Lambda and Convert record format, leave both at Disabled and click Next.
- (Step 3: Choose a destination) Select Amazon S3 as Destination and click Create new to create a new S3 bucket. S3 bucket names are globally unique, so choose a bucket name that is unique to you. You can call it aws-analytics-immersion-day-xxxxxxxx, where xxxxxxxx is a series of random numbers or characters of your choice. You can use something like your name or your favorite number.
- Under S3 Prefix, copy and paste the following text exactly as shown.
json-data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
At this point, you may see the message "You can't include expressions in the prefix unless you also specify an error prefix." Ignore it; it goes away once you enter the error prefix in the next step.
- Under S3 error prefix, copy and paste the following text exactly as shown.
error-json/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/!{firehose:error-output-type}
⚠️ The S3 prefix and S3 error prefix patterns must not contain a newline (\n) character. If you copied the example patterns and pasted them into the S3 prefix or S3 error prefix, remove any trailing line breaks.
After entering the S3 prefix and S3 error prefix, click Next. (cf. Custom Prefixes for Amazon S3 Objects)
- (Step 4: Configure settings) Set buffer size to 1 MB and buffer interval to 60 seconds in S3 buffer conditions. Leave everything else as default.
- Under Permissions IAM role, select Create or update IAM role and click the Next button.
- (Step 5: Review) If there are no errors after checking the information entered in Review, click the Create delivery stream button to complete the Firehose creation.
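To see what the S3 prefix from Step 3 turns into, the sketch below expands the `!{timestamp:...}` expressions for a given arrival time. It is only an illustration written for this lab's four tokens (`yyyy`, `MM`, `dd`, `HH`), not Firehose's own implementation, and `expand_timestamp_prefix` is a hypothetical helper name. Firehose evaluates these expressions in UTC, which the sketch mirrors.

```python
import re
from datetime import datetime, timezone

# Java-style date tokens used in the lab's prefix, mapped to strftime codes.
_TOKEN_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}
_TIMESTAMP_EXPR = re.compile(r"!\{timestamp:([A-Za-z]+)\}")


def expand_timestamp_prefix(prefix: str, arrival_time: datetime) -> str:
    """Expand !{timestamp:...} expressions for a timezone-aware arrival time."""
    if "\n" in prefix or "\r" in prefix:
        # Same rule as the warning above: no line breaks in the pattern.
        raise ValueError("prefix must not contain line breaks")
    utc_time = arrival_time.astimezone(timezone.utc)

    def substitute(match: re.Match) -> str:
        token = match.group(1)
        if token not in _TOKEN_TO_STRFTIME:
            raise ValueError(f"unsupported timestamp token: {token}")
        return utc_time.strftime(_TOKEN_TO_STRFTIME[token])

    # Other namespaces such as !{firehose:...} are left untouched.
    return _TIMESTAMP_EXPR.sub(substitute, prefix)
```

A record arriving at 09:30 UTC on 7 March 2024 would therefore land under `json-data/year=2024/month=03/day=07/hour=09/`, which is the partition layout that Athena reads in the later steps.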
[Top]
In this step, we will generate sample data and verify that it is processed and stored along the following path: Kinesis Data Streams -> Kinesis Data Firehose -> S3.
- Connect via SSH to the EC2 instance you created earlier. You can go to the AWS Console and click the Connect button on the instance details page, or SSH from your local machine command line using the key pair you downloaded.
- Run the gen_kinesis_data.py script on the EC2 instance by entering the following command:
python3 gen_kinesis_data.py \
  --region-name us-west-2 \
  --service-name kinesis \
  --stream-name retail-trans
If you would like to know more about the usage of this command, you can type
python3 gen_kinesis_data.py --help
- Verify that data is generated every second. Let it run for a few minutes, then terminate the script by pressing Ctrl+C.
- Go to the S3 service and open the bucket you created earlier. You can see that the original data has been delivered by Kinesis Data Firehose to S3 and stored in a folder structure by year, month, day, and hour.
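For orientation, here is a rough sketch of the kind of record such a generator could send. It is not the code of gen_kinesis_data.py: the helper names and all field values are made up for illustration, and only the field names follow the columns used elsewhere in this lab. Ending each record with a newline and using the invoice number as the partition key are assumptions of this sketch.

```python
import json
import random
from datetime import datetime
from typing import Any, Dict


def make_transaction(rng: random.Random, now: datetime) -> Dict[str, Any]:
    """Return one made-up retail transaction (all values are illustrative)."""
    return {
        "Invoice": str(rng.randint(100000, 999999)),
        "StockCode": str(rng.randint(10000, 99999)),
        "Description": "SAMPLE ITEM",
        "Quantity": rng.randint(1, 12),
        "InvoiceDate": now.strftime("%Y-%m-%d %H:%M:%S"),
        "Price": round(rng.uniform(0.5, 20.0), 2),
        "Customer_ID": str(rng.randint(10000, 19999)),
        "Country": "United Kingdom",
    }


def to_put_record_args(stream_name: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments for the Kinesis PutRecord API call."""
    return {
        "StreamName": stream_name,
        # One JSON document per line, so that records concatenated by
        # Firehose into a single S3 object stay individually parseable.
        "Data": (json.dumps(record) + "\n").encode("utf-8"),
        # Assumption: spread records across shards by invoice number.
        "PartitionKey": record["Invoice"],
    }
```

With boto3 installed and credentials configured, the result could be passed to `boto3.client("kinesis").put_record(**args)`.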
[Top]
Using Amazon Athena, you can create tables based on data stored in S3, query those tables using SQL, and view query results.
First, create a database to query the data.
- Go to Athena from the list of services on the AWS Management console.
- The first time you visit the Athena console, you will be taken to the Get Started page. Click the Get Started button to open the query editor.
- If this is your first time using Athena, you first need to set an S3 location to save Athena's query results. Click the set up a query result location in Amazon S3 box.
In this lab, we will create a new folder in the same S3 bucket you created in the [Step-1b] Create Kinesis Data Firehose to store data in S3 section.
For example, set your query location as
s3://aws-analytics-immersion-day-xxxxxxxx/athena-query-results/
(xxxxxxxx is the unique string you gave to your S3 bucket).
If this is not your first visit, the Athena Query Editor opens directly.
- You can see a query window with sample queries in the Athena Query Editor. You can start typing your SQL query anywhere in this window.
- Create a new database called mydatabase. Enter the following statement in the query window and click the Run Query button.
CREATE DATABASE IF NOT EXISTS mydatabase
- Confirm that the dropdown list under the Database section on the left panel has been updated with a new database called mydatabase. If you do not see it, make sure the Data source is set to AwsDataCatalog.
- Make sure that mydatabase is selected in Database, and click the + button above the query window to open a new query.
- Copy the following query into the query editor window, replace the xxxxxxxx in the last line under LOCATION with the unique string of your S3 bucket, and click the Run Query button to create a new table.
CREATE EXTERNAL TABLE IF NOT EXISTS `mydatabase.retail_trans_json`(
  `invoice` string COMMENT 'Invoice number',
  `stockcode` string COMMENT 'Product (item) code',
  `description` string COMMENT 'Product (item) name',
  `quantity` int COMMENT 'The quantities of each product (item) per transaction',
  `invoicedate` timestamp COMMENT 'Invoice date and time',
  `price` float COMMENT 'Unit price',
  `customer_id` string COMMENT 'Customer number',
  `country` string COMMENT 'Country name')
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int,
  `hour` int)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://aws-analytics-immersion-day-xxxxxxxx/json-data'
If the query is successful, a table named retail_trans_json is created and displayed on the left panel under the Tables section.
If you get an error, check that (a) you have updated the LOCATION to the correct S3 bucket name, (b) you have mydatabase selected under the Database dropdown, and (c) you have AwsDataCatalog selected as the Data source.
- After creating the table, click the + button to create a new query. Run the following query to load the partition data.
MSCK REPAIR TABLE mydatabase.retail_trans_json
You can list all the partitions in the Athena table in unsorted order by running the following query.
SHOW PARTITIONS mydatabase.retail_trans_json
- Click the + button to open a new query tab. Enter the following SQL statement to query 10 transactions from the table and click Run Query.
SELECT * FROM retail_trans_json LIMIT 10
The result is returned in the following format:
You can experiment with writing different SQL statements to query, filter, sort the data based on different parameters. You have now learned how Amazon Athena allows querying data in Amazon S3 easily without requiring any database servers.
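Because the table is partitioned by year, month, day, and hour, filtering on those columns lets Athena scan only the matching S3 folders. The sketch below builds such a query and the arguments for the Athena `StartQueryExecution` API. The helper names are hypothetical, and the boto3 call itself is kept in a separate function because it needs credentials.

```python
from typing import Any, Dict


def build_hourly_query(
    table: str, year: int, month: int, day: int, hour: int, limit: int = 10
) -> str:
    """Build a SELECT that reads a single hourly partition."""
    # int() keeps anything but whole numbers out of the SQL text.
    return (
        f"SELECT * FROM {table} "
        f"WHERE year = {int(year)} AND month = {int(month)} "
        f"AND day = {int(day)} AND hour = {int(hour)} "
        f"LIMIT {int(limit)}"
    )


def build_start_query_args(
    query: str, database: str, output_location: str
) -> Dict[str, Any]:
    """Build keyword arguments for Athena's StartQueryExecution."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(region_name: str, **args: Any) -> str:
    """Submit the query and return its execution id (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builders work without boto3

    client = boto3.client("athena", region_name=region_name)
    return client.start_query_execution(**args)["QueryExecutionId"]
```

The output location would be the query result folder you configured earlier, for example `s3://aws-analytics-immersion-day-xxxxxxxx/athena-query-results/`.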
[Top]
In this section, we will use Amazon QuickSight to visualize the data that was collected by Kinesis, stored in S3, and analyzed using Athena previously.
- Go to QuickSight Console.
- Click the Sign up for QuickSight button to sign up for QuickSight.
- Select Standard Edition and click the Continue button.
- Specify a QuickSight account name. This name should be unique to you, so use the unique string in the account name similar to how you did for the S3 bucket name earlier. Enter your personal email address under Notification email address.
- QuickSight needs access to S3 to be able to read data. Check the Amazon S3 box, and select the aws-analytics-immersion-day-xxxxxxxx bucket from the list. Click Finish.
- After the account is created, click the Go to Amazon QuickSight button. Confirm that you are in the US West (Oregon) region. Click on the account name on the top right corner and select US West (Oregon) if it is not already set to Oregon. Click the New Analysis button and click on New dataset on the next screen.
- Click Athena and enter retail-quicksight in the Data source name in the pop-up window. Click Validate connection and wait for it to change to Validated, then click the Create data source button.
- On the Choose your table screen, select Catalog AwsDataCatalog, Database mydatabase, and Tables retail_trans_json. Click the Select button.
- On the Finish dataset creation screen, choose Directly query your data and click the Visualize button.
- Let's visualize Quantity and Price by InvoiceDate. Select vertical bar chart from the Visual types box on the bottom left. In the field wells, drag invoicedate from the left panel into X axis, and drag price and quantity into Value. The chart is then populated with the data.
- Let's share the dashboard we just created with other users. Click on the account name on the top right corner and select Manage QuickSight.
- Click the + button on the right side, and enter the email address of the person with whom you want to share the visualization. Click the Invite button and close the popup window.
- Users you invite will receive an invitation email. They can click the button in it to accept the invitation.
- Return to the QuickSight home screen, select your analysis, and click Share > Share analysis from the upper right corner.
- Select BI_user01 and click the Share button.
- Users receive an email notification and can check the analysis results by clicking Click to View.
[Top]
When real-time incoming data is stored in S3 using Kinesis Data Firehose, many small files are created. To improve the query performance of Amazon Athena, it is recommended to combine small files into one large file. To run this task periodically, we are going to create an AWS Lambda function that executes Athena's Create Table As Select (CTAS) query.
- Access the Athena Console and go to the Athena Query Editor.
- Select mydatabase from DATABASE and navigate to New Query.
- Enter the following CREATE TABLE statement in the query window and select Run Query.
In this exercise, we will convert the JSON-format data of the retail_trans_json table into Parquet format and store it in a table called ctas_retail_trans_parquet.
The data in the ctas_retail_trans_parquet table will be saved in the location s3://aws-analytics-immersion-day-xxxxxxxx/parquet-retail-trans of the S3 bucket created earlier.
CREATE EXTERNAL TABLE `mydatabase.ctas_retail_trans_parquet`(
  `invoice` string COMMENT 'Invoice number',
  `stockcode` string COMMENT 'Product (item) code',
  `description` string COMMENT 'Product (item) name',
  `quantity` int COMMENT 'The quantities of each product (item) per transaction',
  `invoicedate` timestamp COMMENT 'Invoice date and time',
  `price` float COMMENT 'Unit price',
  `customer_id` string COMMENT 'Customer number',
  `country` string COMMENT 'Country name')
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int,
  `hour` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://aws-analytics-immersion-day-xxxxxxxx/parquet-retail-trans'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'parquet.compression'='SNAPPY')
;
- Open the AWS Lambda Console.
- Select Create a function.
- Enter MergeSmallFiles for Function name.
- Select Python 3.11 in Runtime.
- Select Create a function.
- Select Add trigger in the Designer tab.
- Select CloudWatch Events/EventBridge in Select a trigger of Trigger configuration. Select Create a new rule in Rule and enter an appropriate rule name (e.g., MergeSmallFilesEvent) in Rule name. Select Schedule expression as the rule type, and enter cron(0/5 * * * ? *) in the schedule expression to run the task every 5 minutes. EventBridge cron expressions take six fields; rate(5 minutes) is an equivalent alternative.
- In Trigger configuration, click [Add].
- Copy and paste the code from the athena_ctas.py file into the code editor of the Function code. Click Deploy.
- Click [Add environment variables] to register the following environment variables.
OLD_DATABASE=<source database>
OLD_TABLE_NAME=<source table>
NEW_DATABASE=<destination database>
NEW_TABLE_NAME=<destination table>
WORK_GROUP=<athena workgroup>
OLD_TABLE_LOCATION_PREFIX=<s3 location prefix of source table>
OUTPUT_PREFIX=<destination s3 prefix>
STAGING_OUTPUT_PREFIX=<staging s3 prefix used by athena>
COLUMN_NAMES=<columns of source table excluding partition keys>
For example, set Environment variables as follows:
OLD_DATABASE=mydatabase
OLD_TABLE_NAME=retail_trans_json
NEW_DATABASE=mydatabase
NEW_TABLE_NAME=ctas_retail_trans_parquet
WORK_GROUP=primary
OLD_TABLE_LOCATION_PREFIX=s3://aws-analytics-immersion-day-xxxxxxxx/json-data
OUTPUT_PREFIX=s3://aws-analytics-immersion-day-xxxxxxxx/parquet-retail-trans
STAGING_OUTPUT_PREFIX=s3://aws-analytics-immersion-day-xxxxxxxx/tmp
COLUMN_NAMES=invoice,stockcode,description,quantity,invoicedate,price,customer_id,country
- To add the IAM policy required to execute Athena queries, click View the MergeSmallFiles-role-XXXXXXXX role on the IAM console. in the Execution role and modify the IAM Role.
- After clicking the Attach policies button in the Permissions tab of the IAM Role, add AmazonAthenaFullAccess and AmazonS3FullAccess in order.
- Select Edit in Basic settings. Adjust Memory and Timeout appropriately. In this lab, we set Timeout to 5 min.
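To make the idea of merging small files with CTAS more concrete, the sketch below builds one CTAS statement that rewrites a single hourly partition of the JSON table as Snappy-compressed Parquet. It is not the code of athena_ctas.py: the helper names, the temporary table name, and the output folder layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence


def previous_hour(now: datetime) -> datetime:
    """Return the start of the hour before `now`."""
    return (now - timedelta(hours=1)).replace(minute=0, second=0, microsecond=0)


def build_ctas_query(
    old_table: str,
    tmp_table: str,
    output_prefix: str,
    columns: Sequence[str],
    hour_start: datetime,
) -> str:
    """Build a CTAS statement that rewrites one hourly partition as Parquet."""
    y, m, d, h = hour_start.year, hour_start.month, hour_start.day, hour_start.hour
    # Assumption: the Parquet output mirrors the year/month/day/hour layout
    # of the source prefix. Athena requires this location to be empty.
    location = (
        f"{output_prefix.rstrip('/')}/year={y}/month={m:02d}/day={d:02d}/hour={h:02d}/"
    )
    return (
        f"CREATE TABLE {tmp_table}\n"
        f"WITH (\n"
        f"  external_location = '{location}',\n"
        f"  format = 'PARQUET',\n"
        f"  parquet_compression = 'SNAPPY'\n"
        f") AS\n"
        f"SELECT {', '.join(columns)}\n"
        f"FROM {old_table}\n"
        f"WHERE year = {y} AND month = {m} AND day = {d} AND hour = {h}"
    )
```

In a scheduled function, `previous_hour(datetime.now(timezone.utc))` would select the partition to merge, and the resulting statement would be submitted to Athena in the work group given by WORK_GROUP.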
[Top]
An OpenSearch cluster is created to store and analyze data in real time. An OpenSearch Service domain is synonymous with an OpenSearch cluster. Domains are clusters with the settings, instance types, instance counts, and storage resources that you specify.
- In the AWS Management Console, choose Amazon OpenSearch Service under Analytics.
- Choose Create a new domain.
- Provide a name for the domain. The examples in this tutorial use the name retail.
- Ignore the Custom endpoint setting.
- For the deployment type, choose Production.
- For Version, choose the latest version. For more information about the versions, see Supported OpenSearch Versions.
- Under Data nodes, change the instance type to t3.small.search and keep the default value of three nodes.
- Under Network, choose VPC access (recommended). Choose the appropriate VPC and subnet. Select the es-cluster-sg security group created in the preparation step as Security Groups.
- In the fine-grained access control settings, choose Create master user. Provide a username and password.
- For now, ignore the SAML authentication and Amazon Cognito authentication sections.
- For Access policy, choose Only use fine-grained access control.
- Ignore the rest of the settings and choose Create. New domains typically take 15–30 minutes to initialize, but can take longer depending on the configuration.
[Top]
You can index data into Amazon OpenSearch Service in real time using a Lambda function. In this lab, you will create a Lambda function using the AWS Lambda console.
- Open the AWS Lambda Console.
- Enter the Layers menu and select Create layer.
- Enter es-lib for the Name.
- Select Upload a file from Amazon S3 and enter the S3 link URL of the compressed library code file. For how to create es-lib.zip, refer to Example of creating a Python package to register in AWS Lambda Layer.
- Select Python 3.11 from Compatible runtimes.
- Open the AWS Lambda Console.
- Select Create a function.
- Enter UpsertToES for Function name.
- Select Python 3.11 in Runtime.
- Select Create a function.
- In the Designer tab, choose Add a layer at Layers.
- Select Custom layers in the Choose a layer section, and choose the Name and Version of the previously created layer.
- Click Add.
- Select UpsertToES in the Designer tab to return to Function code and Configuration.
- Copy and paste the code from the upsert_to_es.py file into the code editor of the Function code. Click Deploy.
- In Environment variables, click Edit.
- Click Add environment variables to register the following environment variables.
ES_HOST=<opensearch service domain>
ES_INDEX=<opensearch index name>
ES_TYPE=<opensearch type name>
REQUIRED_FIELDS=<columns to be used as primary key>
REGION_NAME=<region-name>
DATE_TYPE_FIELDS=<columns of which data type is either date or timestamp>
For example, set Environment variables as follows:
ES_HOST=vpc-retail-xkl5jpog76d5abzhg4kyfilymq.us-west-2.es.amazonaws.com
ES_INDEX=retail
ES_TYPE=trans
REQUIRED_FIELDS=Invoice,StockCode,Customer_ID
REGION_NAME=us-west-2
DATE_TYPE_FIELDS=InvoiceDate
- Click Save.
- To execute the Lambda function in the VPC and read data from Kinesis Data Streams, you need to add the required IAM policies to the execution role of the Lambda function. Click View the UpsertToES-role-XXXXXXXX role on the IAM console. to edit the IAM Role.
- After clicking the Attach policies button in the Permissions tab of the IAM Role, add AWSLambdaVPCAccessExecutionRole and AmazonKinesisReadOnlyAccess in order.
- Add the following policy statements to a customer inline policy (e.g., UpsertToESDefaultPolicyXXXXX). The following IAM policy enables the Lambda function to ingest data into the retail index in the OpenSearch Service domain.
{
  "Action": [
    "es:DescribeElasticsearchDomain",
    "es:DescribeElasticsearchDomainConfig",
    "es:DescribeElasticsearchDomains",
    "es:ESHttpPost",
    "es:ESHttpPut"
  ],
  "Resource": [
    "arn:aws:es:region:account-id:domain/retail",
    "arn:aws:es:region:account-id:domain/retail/*"
  ],
  "Effect": "Allow"
},
{
  "Action": "es:ESHttpGet",
  "Resource": [
    "arn:aws:es:region:account-id:domain/retail",
    "arn:aws:es:region:account-id:domain/retail/_all/_settings",
    "arn:aws:es:region:account-id:domain/retail/_cluster/stats",
    "arn:aws:es:region:account-id:domain/retail/_nodes",
    "arn:aws:es:region:account-id:domain/retail/_nodes/*/stats",
    "arn:aws:es:region:account-id:domain/retail/_nodes/stats",
    "arn:aws:es:region:account-id:domain/retail/_stats",
    "arn:aws:es:region:account-id:domain/retail/retail*/_mapping/trans",
    "arn:aws:es:region:account-id:domain/retail/retail*/_stats"
  ],
  "Effect": "Allow"
}
- Click the Edit button in the VPC category to go to the Edit VPC screen. Select Custom VPC for VPC connection. Choose the VPC and subnets where you created the domain for the OpenSearch service, and choose the security groups that are allowed access to the OpenSearch service domain.
- Select Edit in Basic settings. Adjust Memory and Timeout appropriately. In this lab, we set Timeout to 5 min.
- Go back to the Designer tab and select Add trigger.
- Select Kinesis from Select a trigger in the Trigger configuration.
- Select the Kinesis Data Stream (retail-trans) created earlier in Kinesis stream.
- Click Add.
The Lambda function uses its execution role to sign HTTP requests (Signature Version 4) before sending the data to the Amazon OpenSearch Service endpoint.
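The following sketch illustrates what a function triggered by Kinesis has to do before it can send anything to OpenSearch: decode the base64-encoded records in the event and turn them into bulk upsert actions. It is not the code of upsert_to_es.py. The helper names are hypothetical, and deriving the document id from a hash of the REQUIRED_FIELDS values is an assumption about how such a function could make repeated deliveries idempotent.

```python
import base64
import hashlib
import json
from typing import Any, Dict, List, Sequence


def decode_kinesis_records(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the base64 payloads of a Kinesis-triggered Lambda event."""
    docs = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        docs.append(json.loads(payload))
    return docs


def make_doc_id(doc: Dict[str, Any], required_fields: Sequence[str]) -> str:
    """Derive a stable document id from the key fields (an assumption)."""
    key = ":".join(str(doc[field]) for field in required_fields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def build_bulk_body(
    docs: Sequence[Dict[str, Any]], index: str, required_fields: Sequence[str]
) -> str:
    """Build a newline-delimited bulk request body with upsert actions."""
    lines = []
    for doc in docs:
        # Skip documents that lack any of the key fields.
        if any(doc.get(field) in (None, "") for field in required_fields):
            continue
        action = {"update": {"_index": index, "_id": make_doc_id(doc, required_fields)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
    # The bulk API expects a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting body would then be sent to the domain's `_bulk` endpoint with a signed request. Because the id depends only on the key fields, a record delivered twice updates the same document instead of creating a duplicate.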
You manage Amazon OpenSearch Service fine-grained access control permissions using roles, users, and mappings. This section describes how to create roles and set permissions for the lambda function.
Complete the following steps:
- The Amazon OpenSearch cluster is provisioned in a VPC. Hence, the Amazon OpenSearch endpoint and the Kibana endpoint are not available over the internet. To access the endpoints, we have to create an SSH tunnel and do local port forwarding.
- Option 1) Using SSH tunneling
  - Set up the SSH configuration.
  For Windows, refer to the Windows SSH tunnel instructions listed under Resources.
  For Mac/Linux, add the following SSH tunnel configuration to the SSH config file on your local PC.
  # OpenSearch Tunnel
  Host estunnel
    HostName <EC2 Public IP of Bastion Host>
    User ec2-user
    IdentitiesOnly yes
    IdentityFile ~/.ssh/analytics-hol.pem
    LocalForward 9200 <OpenSearch Endpoint>:443
  - For EC2 Public IP of Bastion Host, use the public IP of the EC2 instance created in the Lab setup step.
  - ex)
  ~$ ls -1 .ssh/
  analytics-hol.pem
  config
  id_rsa
  ~$ tail .ssh/config
  # OpenSearch Tunnel
  Host estunnel
    HostName 214.132.71.219
    User ec2-user
    IdentitiesOnly yes
    IdentityFile ~/.ssh/analytics-hol.pem
    LocalForward 9200 vpc-retail-qvwlxanar255vswqna37p2l2cy.us-west-2.es.amazonaws.com:443
  ~$
  - Run ssh -N estunnel in Terminal.
- Option 2) Connect using the EC2 Instance Connect CLI
  - Install the EC2 Instance Connect CLI.
  sudo pip install ec2instanceconnectcli
  - Run
  mssh ec2-user@{bastion-ec2-instance-id} -N -L 9200:{opensearch-endpoint}:443
  - ex)
  $ mssh ec2-user@i-0203f0d6f37ccbe5b -N -L 9200:vpc-retail-qvwlxanar255vswqna37p2l2cy.us-west-2.es.amazonaws.com:443
- Connect to https://localhost:9200/_dashboards/app/login? in a web browser.
- Enter the master user name and password that you set up when you created the Amazon OpenSearch Service endpoint. The user name and password are stored in AWS Secrets Manager under a name such as OpenSearchMasterUserSecret1-xxxxxxxxxxxx.
- In the Welcome screen, click the toolbar icon to the left of the Home button. Choose Security.
- Under Security, choose Roles.
- Choose Create role.
- Name your role; for example, firehose_role.
- For cluster permissions, add cluster_composite_ops and cluster_monitor.
- Under Index permissions, choose Index Patterns and enter index-name*; for example, retail*.
- Under Permissions, add three action groups: crud, create_index, and manage.
In the next step, you map the IAM role that the Lambda function uses to the role you just created.
- Choose the Mapped users tab.
- Choose Manage mapping.
- Under Backend roles, enter the IAM ARN of the role the Lambda function uses:
arn:aws:iam::123456789012:role/UpsertToESServiceRole709-xxxxxxxxxxxx
- Choose Map.
Note: After mapping the OpenSearch role for the Lambda function, you should no longer see a data delivery failure from the Lambda function such as the following:
[ERROR] AuthorizationException: AuthorizationException(403, 'security_exception', 'no permissions for [cluster:monitor/main] and User [name=arn:aws:iam::123456789012:role/UpsertToESServiceRole709-G1RQVRG80CQY, backend_roles=[arn:aws:iam::123456789012:role/UpsertToESServiceRole709-G1RQVRG80CQY], requestedTenant=null]')
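The role and mapping created in the console above correspond to two small JSON documents in the OpenSearch Security plugin's REST API (`_plugins/_security/api/roles/<role>` and `_plugins/_security/api/rolesmapping/<role>`). The sketch below only builds those request bodies so you can see how the console fields fit together; the helper names are hypothetical, and sending the requests (for example through the SSH tunnel with the master user) is left out.

```python
from typing import Any, Dict, Sequence


def build_role_body(
    index_pattern: str,
    cluster_permissions: Sequence[str],
    index_action_groups: Sequence[str],
) -> Dict[str, Any]:
    """Build the request body that defines a fine-grained access control role."""
    return {
        "cluster_permissions": list(cluster_permissions),
        "index_permissions": [
            {
                "index_patterns": [index_pattern],
                "allowed_actions": list(index_action_groups),
            }
        ],
    }


def build_role_mapping_body(iam_role_arn: str) -> Dict[str, Any]:
    """Build the request body that maps an IAM role to the role as a backend role."""
    if not iam_role_arn.startswith("arn:aws:iam::"):
        raise ValueError("expected an IAM role ARN")
    return {"backend_roles": [iam_role_arn]}
```

With the values from the steps above, the role body would carry `cluster_composite_ops` and `cluster_monitor` as cluster permissions and `crud`, `create_index`, and `manage` for the `retail*` index pattern.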
[Top]
Visualize the data collected in Amazon OpenSearch Service using Kibana.
- The Amazon OpenSearch cluster is provisioned in a VPC. Hence, the Amazon OpenSearch endpoint and the Kibana endpoint are not available over the internet. To access the endpoints, we have to create an SSH tunnel and do local port forwarding.
- Option 1) Using SSH tunneling
  - Set up the SSH configuration.
  For Windows, refer to the Windows SSH tunnel instructions listed under Resources.
  For Mac/Linux, add the following SSH tunnel configuration to the SSH config file on your local PC.
  # OpenSearch Tunnel
  Host estunnel
    HostName <EC2 Public IP of Bastion Host>
    User ec2-user
    IdentitiesOnly yes
    IdentityFile ~/.ssh/analytics-hol.pem
    LocalForward 9200 <OpenSearch Endpoint>:443
  - For EC2 Public IP of Bastion Host, use the public IP of the EC2 instance created in the Lab setup step.
  - ex)
  ~$ ls -1 .ssh/
  analytics-hol.pem
  config
  id_rsa
  ~$ tail .ssh/config
  # OpenSearch Tunnel
  Host estunnel
    HostName 214.132.71.219
    User ec2-user
    IdentitiesOnly yes
    IdentityFile ~/.ssh/analytics-hol.pem
    LocalForward 9200 vpc-retail-qvwlxanar255vswqna37p2l2cy.us-west-2.es.amazonaws.com:443
  ~$
  - Run ssh -N estunnel in Terminal.
- Option 2) Connect using the EC2 Instance Connect CLI
  - Install the EC2 Instance Connect CLI.
  sudo pip install ec2instanceconnectcli
  - Run
  mssh ec2-user@{bastion-ec2-instance-id} -N -L 9200:{opensearch-endpoint}:443
  - ex)
  $ mssh ec2-user@i-0203f0d6f37ccbe5b -N -L 9200:vpc-retail-qvwlxanar255vswqna37p2l2cy.us-west-2.es.amazonaws.com:443
- Connect to https://localhost:9200/_dashboards/app/login? in a web browser.
- Enter the master user name and password that you set up when you created the Amazon OpenSearch Service endpoint. The user name and password of the master user are stored in AWS Secrets Manager under a name such as OpenSearchMasterUserSecret1-xxxxxxxxxxxx.
- In the Welcome screen, click the toolbar icon to the left of the Home button. Choose Stack Management.
- (Management / Create index pattern) In Step 1 of 2: Define index pattern of Create index pattern, enter retail* in Index pattern.
- (Management / Create index pattern) Choose > Next step.
- (Management / Create index pattern) Select InvoiceDate for the Time Filter field name in Step 2 of 2: Configure settings of Create index pattern.
- (Management / Create index pattern) Click Create index pattern.
- (Management / Advanced Settings) After selecting Advanced Settings from the left sidebar menu, set Timezone for date formatting to Etc/UTC. Since the log creation time of the test data is based on UTC, Kibana's timezone is also set to UTC.
- (Discover) After creating the index pattern, select Discover to check the data collected in OpenSearch.
- (Discover) Let's visualize Quantity by InvoiceDate. Select InvoiceDate from Available fields on the left, and click Visualize at the bottom.
- (Visualize) After selecting Y-Axis in Metrics on the Data tab, apply Sum for Aggregation and Quantity for Field.
- (Visualize) Click Save in the upper left corner, enter a name for the graph, and then click Confirm Save.
- (Dashboards) Click the Dashboard icon on the left and click the Create new dashboard button.
[Top]
Through this lab, we have built a Business Intelligence system with a Lambda Architecture that consists of real-time data processing and batch data processing layers.
[Top]
- slide: AWS Analytics Immersion Day - Build BI System from Scratch
- data source: Online Retail II Data Set
[Top]
- Amazon Simple Storage Service (Amazon S3)
- Amazon Athena
- Amazon OpenSearch Service
- AWS Lambda
- Amazon Kinesis Data Firehose
- Amazon Kinesis Data Streams
- Amazon QuickSight
- AWS Lambda Layers
- Example of creating a Python package to register with an AWS Lambda layer: elasticsearch
⚠️ You should create the Python package on Amazon Linux; otherwise, create it using a simulated Lambda environment with Docker.
[ec2-user@ip-172-31-6-207 ~] $ python3 -m venv es-lib
[ec2-user@ip-172-31-6-207 ~] $ cd es-lib
[ec2-user@ip-172-31-6-207 ~] $ source bin/activate
(es-lib) $ mkdir -p python_modules
(es-lib) $ pip install opensearch-py==2.0.1 requests==2.31.0 requests-aws4auth==1.1.2 -t python_modules
(es-lib) $ mv python_modules python
(es-lib) $ zip -r es-lib.zip python/
(es-lib) $ aws s3 mb s3://my-bucket-for-lambda-layer-packages
(es-lib) $ aws s3 cp es-lib.zip s3://my-bucket-for-lambda-layer-packages/var/
(es-lib) $ deactivate
- How to create a Lambda layer using a simulated Lambda environment with Docker
$ cat <<EOF > requirements.txt
> opensearch-py==2.0.1
> requests==2.31.0
> requests-aws4auth==1.1.2
> EOF
$ docker run -v "$PWD":/var/task "public.ecr.aws/sam/build-python3.11" /bin/sh -c "pip install -r requirements.txt -t python/lib/python3.11/site-packages/; exit"
$ zip -r es-lib.zip python > /dev/null
$ aws s3 mb s3://my-bucket-for-lambda-layer-packages
$ aws s3 cp es-lib.zip s3://my-bucket-for-lambda-layer-packages/var/
- Windows SSH / Tunnel for Kibana Instructions - Amazon Elasticsearch Service
- Use an SSH Tunnel to access Kibana within an AWS VPC with PuTTy on Windows
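Both packaging recipes above end with a zip file whose contents live under a top-level python/ folder, which is where the Lambda Python runtime looks for layer libraries. As a quick sanity check before uploading es-lib.zip, a small script along the following lines can confirm that layout. The function name `layer_zip_problems` is hypothetical and the check is deliberately minimal.

```python
import zipfile
from typing import List


def layer_zip_problems(zip_path: str) -> List[str]:
    """Return a list of layout problems found in a Python Lambda layer zip."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    problems = []
    if not names:
        problems.append("archive is empty")
    # Every entry should sit under the top-level python/ directory.
    outside = [name for name in names if not name.startswith("python/")]
    if outside:
        problems.append(f"{len(outside)} entries are outside python/, e.g. {outside[0]}")
    return problems
```

An empty list means the archive has the expected layout; for example, a zip built from the python_modules folder before it is renamed to python would be reported as a problem.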
[Top]
- Top 10 Performance Tuning Tips for Amazon Athena
- Extract, Transform and Load data into S3 data lake using CTAS and INSERT INTO statements in Amazon Athena
- Query Amazon S3 analytics data with Amazon Athena
- Elasticsearch tutorial: a quick start guide
- Run a petabyte scale cluster in Amazon Elasticsearch Service
- Analyze user behavior using Amazon Elasticsearch Service, Amazon Kinesis Data Firehose and Kibana
- Introduction to Messaging for Modern Cloud Architecture
- Understanding the Different Ways to Invoke Lambda Functions
- Amazon Kinesis Data Firehose custom prefixes for Amazon S3 objects
- Amazon Kinesis Firehose Data Transformation with AWS Lambda
- Under the hood: Scaling your Kinesis data streams
- Scale Amazon Kinesis Data Streams with AWS Application Auto Scaling
- 10 visualizations to try in Amazon QuickSight with sample data
- Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight
- Advanced analytics with table calculations in Amazon QuickSight
- Optimize downstream data processing with Amazon Kinesis Data Firehose and Amazon EMR running Apache Spark
- Serverless Scaling for Ingesting, Aggregating, and Visualizing Apache Logs with Amazon Kinesis Firehose, AWS Lambda, and Amazon Elasticsearch Service
- Analyze Apache Parquet optimized data using Amazon Kinesis Data Firehose, Amazon Athena, and Amazon Redshift
- Our data lake story: How Woot.com built a serverless data lake on AWS
- Securing your bastion hosts with Amazon EC2 Instance Connect
$ # (1) Create a new ssh key.
$ ssh-keygen -t rsa -f my_rsa_key
$ # (2) Push your SSH public key to the instance.
$ aws ec2-instance-connect send-ssh-public-key \
    --instance-id $BASTION_INSTANCE \
    --availability-zone $DEPLOY_AZ \
    --instance-os-user ec2-user \
    --ssh-public-key file:///path/to/my_rsa_key.pub
$ # (3) Connect to the instance using your private key.
$ ssh -i /path/to/my_rsa_key ec2-user@$BASTION_DNS_NAME
- Connect using the EC2 Instance Connect CLI
$ sudo pip install ec2instanceconnectcli
$ mssh ec2-user@i-001234a4bf70dec41EXAMPLE # ec2-instance-id
[Top]
This section describes how to deploy the lab using the AWS CDK.
- Install the AWS CDK Toolkit.
npm install -g aws-cdk
- Verify that cdk is installed properly by running the following command:
cdk --version
ex)
$ cdk --version
2.41.0 (build 56ba2ab)
Useful commands:
- cdk ls: list all stacks in the app
- cdk synth: emit the synthesized CloudFormation template
- cdk deploy: deploy this stack to your default AWS account/region
- cdk diff: compare the deployed stack with the current state
- cdk docs: open the CDK documentation
[Top]
When you deploy with the CDK, the components 1(a), 1(b), 1(c), 1(f), 2(a), and 2(b) in the architecture diagram are created automatically.
- Refer to Getting Started With the AWS CDK to install cdk. Create an IAM User to be used when running cdk and register it in ~/.aws/config. (cf. Creating an IAM User)
For example, after creating an IAM User called cdk_user, add it to ~/.aws/config as shown below.
$ cat ~/.aws/config
[profile cdk_user]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
region=us-west-2
- Create a Python package to register in the Lambda layer and store it in an S3 bucket. For example, create an S3 bucket named lambda-layer-resources so that you can save the elasticsearch package to register in the Lambda layer as follows.
$ aws s3 ls s3://lambda-layer-resources/var/
2019-10-25 08:38:50          0
2019-10-25 08:40:28    1294387 es-lib.zip
- After downloading the source code from git, set the environment variable S3_BUCKET_LAMBDA_LAYER_LIB to the name of the S3 bucket where the package to be registered in the Lambda layer is stored. Then deploy using the cdk deploy command.
$ git clone https://github.com/aws-samples/aws-analytics-immersion-day.git
$ cd aws-analytics-immersion-day
$ python3 -m venv .env
$ source .env/bin/activate
(.env) $ pip install -r requirements.txt
(.env) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
(.env) $ export CDK_DEFAULT_REGION=us-west-2
(.env) $ cdk bootstrap aws://${CDK_DEFAULT_ACCOUNT}/${CDK_DEFAULT_REGION}
(.env) $ export S3_BUCKET_LAMBDA_LAYER_LIB=lambda-layer-resources
(.env) $ cdk --profile cdk_user deploy --require-approval never --all
✅ The cdk bootstrap ... command needs to be executed only once, the first time, to deploy the CDK toolkit stack. For subsequent deployments, you only need to run the cdk deploy command, without deploying the CDK toolkit stack again.
(.env) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
(.env) $ export CDK_DEFAULT_REGION=us-west-2
(.env) $ export S3_BUCKET_LAMBDA_LAYER_LIB=lambda-layer-resources
(.env) $ cdk --profile cdk_user deploy --require-approval never --all
-
Enable the Lambda function to ingest records into Amazon OpenSearch.
To delete the deployed application, execute the cdk destroy
command as follows.
(.env) $ cdk --profile cdk_user destroy --force --all
[Top]
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.