3 changes: 3 additions & 0 deletions docs/lab_1_serverless_etl.md
@@ -9,6 +9,7 @@ In this lab you will use infrastructure-as-code tooling to deploy a serverless E
After deploying the architecture you will use drivers to publish and consume JSON and CSV files through the architecture. These drivers will run throughout the labs.

## Objectives

- Observe the architecture and assess the application's steady state
- Review the custom code in the AWS Lambda function
- Determine the service level objectives you will use to measure your steady state
@@ -59,6 +60,8 @@ After deploying the architecture you will use drivers to publish and consume JSO

> **Note:** If you get a "botocore.exceptions.NoRegionError: You must specify a region." error message when executing the driver programs, you will need to configure your AWS CLI with `aws configure`.

+> **Note:** If you start a new shell, remember to run `pyenv shell` before executing the Python scripts.

1. Revisit some of the previous consoles for AWS Lambda, SQS, SNS, DynamoDB, and Amazon S3.

You'll start to see the S3 bucket populated with files, items being stored into DynamoDB, and metrics generated by the Lambda function for every execution. Take a moment and review some of the information these consoles make available to you.
18 changes: 9 additions & 9 deletions docs/lab_2_inject_fault.md
@@ -53,18 +53,18 @@ If you review the [code](../src/lambda.js) for your AWS Lambda function you will

1. Perform a successful test

-Before we start injecting failures lets ensure your function is testing normally. Click the `Test` button with your new test event defined. The results should result in a `Succeeded` status.
+Before we start injecting failures lets ensure your function is testing normally. Click the `Test` button with your new test event defined. The test should result in a `Succeeded` status.

1. Configure Failure-Lambda for latency injection

Modify the parameter store value you found earlier to have the following value:

```json
{
-"isEnabled": true,
-"failureMode": "latency",
-"rate": 1,
-"minLatency": 1000,
+"isEnabled": true,
+"failureMode": "latency",
+"rate": 1,
+"minLatency": 1000,
"maxLatency": 5000
}
```
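One way to read the fields in this value: with `rate` set to 1 every invocation is affected, and the added delay falls somewhere between `minLatency` and `maxLatency` milliseconds. The sketch below only illustrates that reading. It is not failure-lambda's own code: `shouldInject`, `pickLatencyMs` and `withInjectedLatency` are hypothetical helpers, and the uniform spread between the two bounds is an assumption.

```javascript
// Illustrative only: hypothetical helpers, not the failure-lambda implementation.
const latencyConfig = {
  isEnabled: true,
  failureMode: "latency",
  rate: 1,
  minLatency: 1000,
  maxLatency: 5000,
};

// Decide whether this invocation gets a fault. `rand` is injectable for testing.
function shouldInject(cfg, rand = Math.random) {
  return cfg.isEnabled === true && rand() < cfg.rate;
}

// Pick a delay in [minLatency, maxLatency), assuming a uniform distribution.
function pickLatencyMs(cfg, rand = Math.random) {
  return Math.floor(cfg.minLatency + rand() * (cfg.maxLatency - cfg.minLatency));
}

// Sleep for the chosen delay, then run the real handler logic.
async function withInjectedLatency(cfg, handler) {
  if (cfg.failureMode === "latency" && shouldInject(cfg)) {
    await new Promise((resolve) => setTimeout(resolve, pickLatencyMs(cfg)));
  }
  return handler();
}
```

Lowering `rate` to 0.5 in this sketch would delay roughly half of the invocations, which is the kind of partial disruption the later experiments rely on.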
@@ -81,9 +81,9 @@ If you review the [code](../src/lambda.js) for your AWS Lambda function you will

```json
{
-"isEnabled": true,
-"failureMode": "blacklist",
-"rate": 1,
+"isEnabled": true,
+"failureMode": "blacklist",
+"rate": 1,
"blacklist": ["dynamodb.*.amazonaws.com"]
}
```
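The `blacklist` entry reads like a regular expression applied to the host a request is sent to. Assuming that interpretation, the pattern would catch any regional DynamoDB endpoint while leaving other services alone. The snippet is a sketch of the matching idea only; `isBlocked` is a hypothetical helper, not part of the library.

```javascript
// Illustrative only: treats each blacklist entry as a regular expression (an assumption).
const blacklist = ["dynamodb.*.amazonaws.com"];

// Returns true when any pattern matches the target host.
function isBlocked(host, patterns) {
  return patterns.some((pattern) => new RegExp(pattern).test(host));
}

console.log(isBlocked("dynamodb.us-east-1.amazonaws.com", blacklist)); // true
console.log(isBlocked("s3.us-east-1.amazonaws.com", blacklist)); // false
```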
@@ -108,4 +108,4 @@ If you review the [code](../src/lambda.js) for your AWS Lambda function you will

In this lab you learned about the Failure-Lambda NodeJS library and how it can be used to inject artificial failures and disruption into your Lambda functions.

-In [the next lab](lab_3_chaos_experiment.md) you will craft your first chaos experiment which will use the failure-lambda library to perturb your ETL architecture and observe the system's ability to perform in turbulent conditions.
+In [the next lab](lab_3_chaos_experiment.md) you will craft your first chaos experiment which will use the failure-lambda library to perturb your ETL architecture and observe the system's ability to perform in turbulent conditions.
10 changes: 6 additions & 4 deletions docs/lab_3_chaos_experiment.md
@@ -8,7 +8,7 @@ In this lab you will learn about [Chaos-Toolkit](https://chaostoolkit.org/) and

The [Chaos Toolkit](https://chaostoolkit.org/) aims to be the simplest and easiest way to explore building your own Chaos Engineering Experiments. It also aims to define a vendor and technology independent way of specifying Chaos Engineering experiments by providing an Open API.

-It uses a declaritive and extensible format for specifying and scripting chaos experiments. This allows you to automate chaos engineering and incorporate experiments into your CI/CD pipelines.
+It uses a declarative and extensible format for specifying and scripting chaos experiments. This allows you to automate chaos engineering and incorporate experiments into your CI/CD pipelines.

The toolkit also has been [extended](https://chaostoolkit.org/extensions) to allow it to support, out of the box, the ability to interact with major cloud computing providers, Kubernetes, Spring and Spring Boot, and many others.

@@ -33,7 +33,7 @@ Take a moment and consider the many ways that your ETL architecture could go wro
$ source aws_resource_names.sh
```

-## Define the experiement
+## Define the experiment

1. Create your experiment's skeleton

@@ -146,6 +146,8 @@ Take a moment and consider the many ways that your ETL architecture could go wro
]
```

+> **Note**: You may need to adjust the `date` commands here on some platforms. Please see the FAQ in the main README file.

1. Evaluate the steady state

You now have the beginnings of your experiment. Execute Chaos Toolkit with your definition and watch its output as it assesses the steady state of your application.
@@ -164,7 +166,7 @@ Take a moment and consider the many ways that your ETL architecture could go wro

The [method section](https://docs.chaostoolkit.org/reference/api/experiment/#method) of an experiment defines the step(s) to take in order to introduce turbulence into the system. The method section is a list of actions and probes which you define.

-Lets now introduce a minor latency of 3 to 5 seconds to the Lambda function.
+Let's now introduce a minor latency of 3 to 5 seconds to the Lambda function.

Update your experiment definition with the following action. It will modify the configuration parameter for the failure-lambda library causing the Lambda function to, 50% of the time, take 3 to 5 seconds longer to execute. After modifying the Lambda function's configuration the system will pause for 5 minutes before re-evaluating the steady state of the application.

@@ -183,7 +185,7 @@ Take a moment and consider the many ways that your ETL architecture could go wro
}
}
],
```
```

1. Experiment responsibly

15 changes: 8 additions & 7 deletions docs/lab_4_chaos_experiment_2.md
@@ -2,13 +2,13 @@

## Objective

-In this lab you will experiment with a different failure mode that could effect your application. The application relies on the DynamoDB service, lets inject intermittent connectivity to the service and observe how the overall application responds.
+In this lab you will experiment with a different failure mode that could affect your application. The application relies on the DynamoDB service, so let's inject intermittent connectivity to the service and observe how the overall application responds.

## Service Availability

-In the last lab you created your first chaos experiment using the Chaos Toolkit. In this lab you will explore a different failure mode and examine how the application performs. The failure mode we want to explore in this lab is the waivering availability of a dependency. We cannot temporarily disrupt the DynamoDB service however we can temporarily disrupt connectivity to the service.
+In the last lab you created your first chaos experiment using the Chaos Toolkit. In this lab you will explore a different failure mode and examine how the application performs. The failure mode we want to explore in this lab is the wavering availability of a dependency. We cannot temporarily disrupt the DynamoDB service however we can temporarily disrupt connectivity to the service.

-To simulate a service disruption we will again use the failure-lambda library's `blacklist` feature to block access to the DyanmoDB API some percentage of the time.
+To simulate a service disruption we will again use the failure-lambda library's `blacklist` feature to block access to the DynamoDB API some percentage of the time.

## The Next Experiment

@@ -113,6 +113,8 @@ To simulate a service disruption we will again use the failure-lambda library's
}
```

+> **Note**: You may need to adjust the `date` commands here on some platforms. Please see the FAQ in the main README file.

Everything in this template is the same as last time: you have the same steady state definition and the same rollback. The title and description, however, are updated to reflect the nature of the experiment.

1. Actions
@@ -166,7 +168,7 @@ To simulate a service disruption we will again use the failure-lambda library's

1. Behavior explained

-Hopefully its clear that in order for the `Percent in Flight` to be a negative number the number of messages flowing out of the pipeline are greater than the messages flowing in. This suggests that the architecture is processing messages multiple times, causing duplication.
+Hopefully it's clear that in order for the `Percent in Flight` to be a negative number the number of messages flowing out of the pipeline are greater than the messages flowing in. This suggests that the architecture is processing messages multiple times, causing duplication.
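To make the arithmetic concrete, here is a small sketch of how such a metric could go negative. The exact formula behind `Percent in Flight` is not shown in this section, so the definition below (inbound minus outbound, as a percentage of inbound) is an assumption, and `percentInFlight` is a hypothetical helper.

```javascript
// Assumed definition: share of inbound messages not yet seen leaving the pipeline.
function percentInFlight(messagesIn, messagesOut) {
  if (messagesIn === 0) return 0; // nothing published yet, nothing in flight
  return ((messagesIn - messagesOut) * 100) / messagesIn;
}

console.log(percentInFlight(100, 95)); // 5: a few messages still being processed
console.log(percentInFlight(100, 120)); // -20: more came out than went in, so some were duplicated
```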

If you visit the Monitoring tab of the Lambda function and scroll down to the list of the most expensive invocations these will likely be one of the executions that had difficulty connecting to DynamoDB. To review the log entries copy the RequestID and click the LogStream link for the request. On the CloudWatch Logs console, in the Filter Events search field paste the RequestID in quotes to view only those log entries that relate to the execution. Along with the normal execution messages you should see messages such as the following which show the Lambda was unable to connect to DynamoDB:

@@ -180,7 +182,7 @@ To simulate a service disruption we will again use the failure-lambda library's

Looking at the source code you will notice that, around line 41, there is a call to DynamoDB which tries to check for a prior record of the message having been processed. This is an asynchronous call and so, while NodeJS waits for DynamoDB to respond, it continues executing, making additional calls to DynamoDB and Amazon S3. As a result, even though DynamoDB may be having issues the Lambda itself still writes output to Amazon S3.

-To correct this we can instruct NodeJS to wait for the call to DynamoDB to return, this will prevent any further processing until connectivity to DynamoDB has been confirmed. Update the source code to add the `await` modifier to the initial call to DynamoDB:
+To correct this we can instruct NodeJS to wait for the call to DynamoDB to return. This will prevent any further processing until connectivity to DynamoDB has been confirmed. Update the source code to add the `await` modifier to the initial call to DynamoDB:

```javascript
var ddbData = await ddb.get (params).promise ();
@@ -192,7 +194,6 @@ To simulate a service disruption we will again use the failure-lambda library's

Now re-run your Chaos experiment and notice that the experiment still fails but now it fails because the error rate is unacceptably high. How could you improve the architecture to better account for this situation?
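The difference the `await` makes can be reproduced without any AWS calls. The sketch below is a simplified stand-in for the Lambda handler, not the workshop's actual code: `fakeDdbGet` simulates a DynamoDB lookup that fails, and the `writes` array stands in for output written to Amazon S3. All names here are hypothetical.

```javascript
// Stand-in for ddb.get(params).promise() when DynamoDB is unreachable (simulated failure).
function fakeDdbGet() {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error("simulated DynamoDB connectivity failure")), 10)
  );
}

// Without await: the lookup is started, but its outcome never gates the write.
async function handlerWithoutAwait(writes) {
  const pending = fakeDdbGet();
  pending.catch(() => {}); // ignore the failure so the sketch has no unhandled rejection
  writes.push("s3:putObject"); // still runs even though DynamoDB is failing
  return "done";
}

// With await: a failed lookup throws before any output is written.
async function handlerWithAwait(writes) {
  const ddbData = await fakeDdbGet();
  writes.push("s3:putObject"); // only reached when the lookup succeeds
  return ddbData;
}
```

Calling `handlerWithoutAwait` records a write despite the failed lookup, while `handlerWithAwait` rejects and records nothing. That matches the behavior change seen in the experiment: duplicates stop, and the failures surface as errors instead.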


## Summary

-You have now concluded this workshop. You have used Chaos-Toolkit and failure-lambda to develop and execute chaos experiments on a serverless architecture on AWS. There are many more experiments which can be performed on this architecture to improve it, but how will you now use this informaiton to improve your own serverless architecture?
+You have now concluded this workshop. You have used Chaos-Toolkit and failure-lambda to develop and execute chaos experiments on a serverless architecture on AWS. There are many more experiments which can be performed on this architecture to improve it, but how will you now use this information to improve your own serverless architecture?
2 changes: 1 addition & 1 deletion terraform/application.tf
@@ -116,7 +116,7 @@ resource "aws_lambda_function" "chaos_lambda" {
memory_size = 128
role = aws_iam_role.chaos_lambda_role.arn
runtime = "nodejs12.x"
-timeout = 3
+timeout = 120

environment {
variables = {