Feature: gateway timeouts #78

Open: wants to merge 5 commits into main
6 changes: 6 additions & 0 deletions .github/config/en-custom.txt
@@ -908,5 +908,11 @@ SecOps
kube
workspace's
Authorizer
idleConnection
timeoutPolicy
httpproxy
apis
backendrequest
sigs
Aditi
Twilio
350 changes: 350 additions & 0 deletions resources/2025-01-gateway-timeouts.md
@@ -0,0 +1,350 @@
# Adding Configurable Gateway Timeouts to Radius

* **Author**: Nick Beenham (@superbeeny)

## Overview

<!--
Provide a succinct high-level description of the component or feature and
where/how it fits in the big picture. The overview should be one to three
paragraphs long and should be understandable by someone outside the Radius
team. Do not provide the design details in this, section - there is a
dedicated section for that later in the document.
-->
This feature lets the user configure a timeout on the gateway for an application. The timeout specifies how long the gateway waits for a response from the application before giving up. This is useful when the application may take a long time to respond, or when the user wants to ensure that the application responds within a certain time frame.

> **Contributor:** What is the main difference between health/readiness probes and this? Is this for an endpoint that would take a long time to process something?
>
> **Author:** Yes, my explicit use case is for LLM responses.
>
> **Contributor:** Are the timeouts configurable per route?
>
> **Author:** Yes.


## Terms and definitions

<!--
Include any terms, definitions, or acronyms that are used in
this design document to assist the reader. They may or may not
be part of the user-facing experience once implemented, and can
be specific to this design context.
-->
Route timeout: The amount of time the gateway waits for a response from the application before timing out. See the Envoy route timeout [specification](https://www.envoyproxy.io/docs/envoy/v1.14.2/api-v2/api/v2/route/route_components.proto#envoy-api-field-route-routeaction-timeout).

## Objectives

<!--
Describe goals/non-goals and user-scenario of this feature to understand
the end-user goals.
* If the feature shares the same objectives of the existing design, link
to the existing doc and section rather than repeat the same context.
* If the feature has a scenario, UX, or other product feature design doc,
link it here and summarize the important parts.
-->

> **Issue Reference:** [#8221](https://github.com/radius-project/radius/issues/8221)

### Goals

<!--
Describe goals to define why we are doing this work, how we will make
priority decisions, and how we will determine success.
-->
The goal of this feature is to allow the user to configure a timeout on the gateway for an application within the application's Bicep file.

### Non goals

> **Contributor:** Could you please update this section to include https://github.com/radius-project/design-notes/pull/78/files#r1941756983?


<!--
Describe non-goals to identify something that we won’t be focusing on
immediately. We won’t be expending any effort on these matters. If there
will be follow-ups after this work, list them here. If there are things
we plan to do in the future, but are out of scope of this design, list
them here. Provide a brief explanation on why this is a non-goal.
-->
None

> **Contributor:** Is it out of scope to provide more httpproxy punch-through options? Is this pattern extensible to other routing configuration users may want to provide in the future?
>
> **Author:** Will add a note to the doc to limit scope to requirements.


### User scenarios (optional)

<!--
Describe the user scenarios for this design. Ensure that you define the
roles and personas in these user scenarios when it requires API design.
If you have an existing issue that describes the user scenarios, please
link to that issue instead.
-->

#### User story 1
As a user, I want to be able to configure a timeout on the gateway for an application so that I can specify how long the gateway should wait for a response from the application before timing out.


## User Experience (if applicable)
<!--
If the change impacts the user experience, provide expected interaction
flow we aim to achieve through this proposal.

When users interact with Radius through the CLI, include sample
input commands and their corresponding output. Include a bicep/helm code
sample, if this proposal involves updates to that experience.
-->

**Sample Input:**
<!--
Provide a sample CLI command input and/or bicep/helm code.
-->
```bicep
resource gateway 'Applications.Core/gateways@2023-10-01-preview' = {
  name: 'demo-gateway'
  properties: {
    application: app.id
    hostname: {
      fullyQualifiedHostname: 'demo.somedomain.com'
    }
    routes: [
      {
        path: '/'
        destination: 'http://${ui.name}:3000'
      }
      {
        // The timeout policy applies to the /api route only.
        timeoutPolicy: {
          response: '30s'
          idle: '5m'
          idleConnection: '1h'
        }
        path: '/api'
        destination: 'http://${api.name}'
      }
    ]
    tls: {
      certificateFrom: grimmCertificateStore.id
      minimumProtocolVersion: '1.2'
    }
  }
}
```

> **Contributor** (on the /api route, originally submitted with timeoutPolicy wrapped in an extra pair of braces): The syntax is kind of confusing. If I'm reading it right, this timeoutPolicy is applied to the /api route. Why is timeoutPolicy not present at the top level along with path and destination?
>
> **Author:** This is a typo that I will fix.
>
> **Contributor** (on timeoutPolicy): Is there a way (or a need) to apply global timeout policies for all routes?
>
> **Author:** Not necessary for this PR but worth some further investigation.
>
> **Contributor** (on lines +99 to +109): Curious, why do we need the outer brace? Suggested change: drop the outer brace so that timeoutPolicy sits alongside path and destination, as in the sample above.

**Sample Output:**
<!--
Provide a sample output for the inputs provided above.
-->



## Design

### High Level Design
<!--
High level overview of the data flow and key components.

Provide a high-level description, using diagrams as appropriate, and top-level
explanations to convey the architectural/design overview. Don’t go into a lot
of details yet but provide enough information about the relationship between
these components and other components. Call out or highlight new components
that are not part of this feature (dependencies). This diagram generally
treats the components as black boxes. Provide a pointer to a more detailed
design document, if one exists.
-->

This feature requires updates to the gateway TypeSpec, the versioned data model, and the render functions. The gateway TypeSpec will be updated to include a timeoutPolicy object, which allows the user to configure the timeout on the gateway for an application. The versioned data model will be updated to include the timeoutPolicy object in the gateway resource, and the render functions will be updated to render it.

### Architecture Diagram
<!--
Provide a diagram of the system architecture, illustrating how different
components interact with each other in the context of this proposal.

Include separate high level architecture diagram and component specific diagrams, wherever appropriate.
-->
![Architecture Diagram](./2025-01-gateway-timeouts/radius-timeout-arch.png)

### Detailed Design

<!--
This section should be detailed and thorough enough that another developer
could implement your design and provide enough detail to get a high confidence
estimate of the cost to implement the feature but isn’t as detailed as the
code. Be sure to also consider testability in your design.

For each change, give each "change" in the proposal its own section and
describe it in enough detail that someone else could implement it. Cover
ALL of the important decisions like names. Your goal is to get an agreement
to proceed with coding and PRs.

If there are alternatives you are considering please include that in the open
questions section. If the product has a layered architecture, it's good to
align these sections with the product's layers. This will help readers use
their current understanding to understand your ideas.

Discuss the rationale behind architectural choices and alternative options
considered during the design process.
-->

The solution updates the specification for the gateway resource in the Bicep file to include a timeoutPolicy object. This object contains the following properties:

* **request:** The amount of time the gateway should wait for a response from the application before timing out.
* **backendrequest:** The amount of time the gateway should wait for a response from the application before timing out when the connection is idle.

> **Contributor** (on lines +175 to +176): Couple of questions:
>
> * What is the difference between the two?
> * The property names are different in the subsequent sections of this doc.

These properties map to the timeouts field of the Gateway API HTTPRoute: https://gateway-api.sigs.k8s.io/api-types/httproute/#timeouts-optional
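
As a rough illustration of how the two properties relate, the Go sketch below parses both values and rejects a backendrequest that exceeds request, which is the constraint stated in the API design section. The type `timeoutPolicy` and the functions `parseTimeoutPolicy` and `parseDuration` are hypothetical names, not part of Radius. Go's `time.ParseDuration` is used as a stand-in for whatever duration format the final schema accepts, so treat the accepted syntax as an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// timeoutPolicy is a hypothetical parsed form of the proposed timeoutPolicy
// object. A zero duration means the property was not set.
type timeoutPolicy struct {
	Request        time.Duration
	BackendRequest time.Duration
}

// parseDuration parses one property value and rejects negative durations.
func parseDuration(field, value string) (time.Duration, error) {
	d, err := time.ParseDuration(value)
	if err != nil {
		return 0, fmt.Errorf("%s: invalid duration %q: %w", field, value, err)
	}
	if d < 0 {
		return 0, fmt.Errorf("%s: duration %q must not be negative", field, value)
	}
	return d, nil
}

// parseTimeoutPolicy validates both properties. Empty strings are treated as
// "not set". When both are set, backendrequest must not exceed request.
func parseTimeoutPolicy(request, backendRequest string) (timeoutPolicy, error) {
	var p timeoutPolicy
	var err error
	if request != "" {
		if p.Request, err = parseDuration("request", request); err != nil {
			return timeoutPolicy{}, err
		}
	}
	if backendRequest != "" {
		if p.BackendRequest, err = parseDuration("backendrequest", backendRequest); err != nil {
			return timeoutPolicy{}, err
		}
	}
	if p.Request > 0 && p.BackendRequest > p.Request {
		return timeoutPolicy{}, errors.New("backendrequest must not exceed request")
	}
	return p, nil
}

func main() {
	p, err := parseTimeoutPolicy("30s", "10s")
	fmt.Println(p.Request, p.BackendRequest, err)

	_, err = parseTimeoutPolicy("10s", "30s")
	fmt.Println(err)
}
```

Validating at this layer would let a misconfigured route fail at deployment time rather than surface later as unexpected gateway behavior.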

#### Advantages (of each option considered)
<!--
Describe what's good about this plan relative to other options.
Provides better user experience? Does it feel easy to implement?
Provides flexibility for future work?
-->
The advantage of this approach is that it is simple and easy to understand. It allows the user to configure the timeout on the gateway for an application on a route-by-route basis within the application's Bicep file.

The proposed changes are additive: they should not be breaking changes and should not require any changes to existing applications.

#### Disadvantages (of each option considered)

> **Contributor:** Any downsides of this approach long term?

<!--
Describe what's not ideal about this plan. Does it lock us into a
particular design for future changes or is it flexible if we were to
pivot in the future. This is a good place to cover risks.
-->

#### Proposed Option
<!--
Describe the recommended option and provide reasoning behind it.
-->

### API design (if applicable)

<!--
Include if applicable – any design that changes our public REST API, CLI
arguments/commands, or Go APIs for shared components should provide this
section. Write N/A here if not applicable.
- Describe the REST APIs in detail for new resource types or updates to
existing resource types. E.g. API Path and Sample request and response.
- Describe new commands in the CLI or changes to existing CLI commands.
- Describe the new or modified Go APIs for any shared components.
-->
Updates to the gateway TypeSpec to allow for timeout configuration:
```diff
 @doc("Route attached to Gateway")
 model GatewayRoute {
   @doc("The path to match the incoming request path on. Ex - /myservice.")
   path?: string;

   @doc("The URL or id of the service to route to. Ex - 'http://myservice'.")
   destination?: string;

   @doc("Optionally update the prefix when sending the request to the service. Ex - replacePrefix: '/' and path: '/myservice' will transform '/myservice/myroute' to '/myroute'")
   replacePrefix?: string;

   @doc("Enables websocket support for the route. Defaults to false.")
   enableWebsockets?: boolean;

+  @doc("The timeout policy for the route.")
+  timeoutPolicy?: GatewayRouteTimeoutPolicy;
 }

+@doc("Gateway route timeout policy")
+model GatewayRouteTimeoutPolicy {
+  @doc("The response timeout in duration for the route. Defaults to 15 seconds.")
+  response?: string;
+
+  @doc("The backend timeout in duration for the route. Cannot be more than the request timeout.")
+  backendrequest?: string;
+}
```

> **Contributor** (on lines +235 to +243): This doesn't match the schema above in the user experience section, which has idle and idleConnection instead of backendrequest. Is one of them the final version? It would also be helpful to know why.

### Implementation Details
<!--
High level description of updates to each component. Provide information on
the specific sub-components that will be updated, for example, controller, processor, renderer,
recipe engine, driver, to name a few.
-->
The renderer for the gateway resource will be updated to render the timeoutPolicy object in the gateway resource.

The versioned datamodel will be updated to include the timeoutPolicy object in the gateway resource.
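
The renderer change can be pictured with the minimal Go sketch below. It assumes the rendered route carries a timeout policy with the same three fields as the Bicep sample (response, idle, idleConnection). The types `GatewayRouteTimeoutPolicy`, `proxyTimeoutPolicy`, and `proxyRoute` and the functions `renderTimeoutPolicy` and `renderRoute` are hypothetical, simplified stand-ins and not the actual Radius or proxy types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GatewayRouteTimeoutPolicy is a hypothetical data model type whose field
// names follow the Bicep sample in this document.
type GatewayRouteTimeoutPolicy struct {
	Response       string
	Idle           string
	IdleConnection string
}

// proxyTimeoutPolicy is a simplified, assumed shape for the timeout policy
// on a rendered route. Empty fields are omitted from the output.
type proxyTimeoutPolicy struct {
	Response       string `json:"response,omitempty"`
	Idle           string `json:"idle,omitempty"`
	IdleConnection string `json:"idleConnection,omitempty"`
}

// proxyRoute is a simplified, assumed shape for a rendered route.
type proxyRoute struct {
	Prefix        string              `json:"prefix"`
	TimeoutPolicy *proxyTimeoutPolicy `json:"timeoutPolicy,omitempty"`
}

// renderTimeoutPolicy copies the configured values into the rendered shape.
// It returns nil when no policy is configured, so routes without a
// timeoutPolicy render exactly as they did before.
func renderTimeoutPolicy(p *GatewayRouteTimeoutPolicy) *proxyTimeoutPolicy {
	if p == nil {
		return nil
	}
	return &proxyTimeoutPolicy{
		Response:       p.Response,
		Idle:           p.Idle,
		IdleConnection: p.IdleConnection,
	}
}

// renderRoute renders a single route to JSON for inspection.
func renderRoute(path string, policy *GatewayRouteTimeoutPolicy) (string, error) {
	route := proxyRoute{Prefix: path, TimeoutPolicy: renderTimeoutPolicy(policy)}
	out, err := json.Marshal(route)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := renderRoute("/api", &GatewayRouteTimeoutPolicy{
		Response:       "30s",
		Idle:           "5m",
		IdleConnection: "1h",
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out)
}
```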


#### Core RP (if applicable)

### Error Handling
<!--
Describe the error scenarios that may occur and the corresponding recovery/error handling and user experience.
-->
Error handling is covered within the functions and Radius errors are used where appropriate.

## Test plan

<!--
Include the test plan to validate the features including the areas that
need functional tests.

Describe any functionality that will create new testing challenges:
- New dependencies
- External assets that tests need to access
- Features that do I/O or change OS state and are thus hard to unit test
-->
The existing gateway tests will be updated to include tests for the new timeoutPolicy object. This will include tests to ensure that the gateway times out correctly when the response time exceeds the configured timeout.
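
One possible shape for such a timeout test is sketched below in Go. It does not exercise a real Radius gateway: the standard library's `http.TimeoutHandler` stands in for the gateway and an in-process handler stands in for the application, so the status code returned on timeout (503 here) may differ from what the real gateway returns. The names `slowBackend`, `statusThroughGateway`, `gatewayTimeout`, and `slowDelay` are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Illustrative timings only; real tests would use the configured policy.
const (
	gatewayTimeout = 50 * time.Millisecond
	slowDelay      = 500 * time.Millisecond
)

// slowBackend simulates an application that takes delay to respond and
// stops early if the request is cancelled.
func slowBackend(delay time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay):
			fmt.Fprint(w, "ok")
		case <-r.Context().Done():
		}
	})
}

// statusThroughGateway sends one request through a stand-in gateway that
// enforces timeout and returns the HTTP status code the client observes.
func statusThroughGateway(backendDelay, timeout time.Duration) (int, error) {
	gateway := httptest.NewServer(
		http.TimeoutHandler(slowBackend(backendDelay), timeout, "gateway timeout"),
	)
	defer gateway.Close()

	resp, err := http.Get(gateway.URL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	fast, err := statusThroughGateway(0, gatewayTimeout)
	fmt.Println("fast backend:", fast, err)

	slow, err := statusThroughGateway(slowDelay, gatewayTimeout)
	fmt.Println("slow backend:", slow, err)
}
```

A functional test against a deployed gateway would follow the same pattern: one request that completes within the configured timeout and one that deliberately exceeds it.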

## Security

<!--
Describe any changes to the existing security model of Radius or security
challenges of the features. For each challenge describe the security threat
and its mitigation with this design.

Examples include:
- Authentication
- Storing secrets and credentials
- Using cryptography

If this feature has no new challenges or changes to the security model
then describe how the feature will use existing security features of Radius.
-->
There are no changes to current security policies.

## Compatibility (optional)

<!--
Describe potential compatibility issues with other components, such as
incompatibility with older CLIs, and include any breaking changes to
behaviors or APIs.
-->
There should be no compatibility issues with existing applications.
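
The reasoning behind this is that timeoutPolicy is an optional property, so payloads written before the change still decode. The Go sketch below illustrates the idea with hypothetical `Route` and `TimeoutPolicy` types that mirror the optional fields in the TypeSpec above. These are not the generated Radius models, and `decodeRoute` is an illustrative helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TimeoutPolicy mirrors the optional properties of the proposed
// GatewayRouteTimeoutPolicy model (hypothetical Go shape).
type TimeoutPolicy struct {
	Response       *string `json:"response,omitempty"`
	BackendRequest *string `json:"backendrequest,omitempty"`
}

// Route mirrors a subset of GatewayRoute. TimeoutPolicy is a pointer so
// that "not set" is distinguishable from an empty policy.
type Route struct {
	Path          *string        `json:"path,omitempty"`
	Destination   *string        `json:"destination,omitempty"`
	TimeoutPolicy *TimeoutPolicy `json:"timeoutPolicy,omitempty"`
}

// decodeRoute decodes a JSON route payload.
func decodeRoute(payload string) (Route, error) {
	var r Route
	err := json.Unmarshal([]byte(payload), &r)
	return r, err
}

func main() {
	// A payload written before the change has no timeoutPolicy.
	legacy, err := decodeRoute(`{"path":"/","destination":"http://ui:3000"}`)
	fmt.Println(legacy.TimeoutPolicy == nil, err)

	// A payload written after the change carries the new object.
	updated, err := decodeRoute(`{"path":"/api","timeoutPolicy":{"response":"30s"}}`)
	if err == nil && updated.TimeoutPolicy != nil && updated.TimeoutPolicy.Response != nil {
		fmt.Println(*updated.TimeoutPolicy.Response)
	}
}
```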

## Monitoring and Logging

<!--
Include the list of instrumentation such as metric, log, and trace to
diagnose this new feature. It also describes how to troubleshoot this feature
with the instrumentation.
-->
No additional monitoring or logging is required for this feature.

## Development plan

<!--
Describe how you will deliver your features. This includes aligning work items
to features, scenarios, or requirements, defining what deliverable will be
checked in at each point in the product and estimating the cost of each work
item. Don’t forget to include the Unit Test and functional test in your
estimates.
-->
Work will be completed in steps:
1. Update the gateway typespec to include the timeoutPolicy object.
2. Update the render functions to render the timeoutPolicy object in the gateway resource.
3. Update the versioned datamodel to include the timeoutPolicy object in the gateway resource.
4. Update the gateway tests to include tests for the new timeoutPolicy object.
5. Update the functional tests to include tests for the new timeoutPolicy object.
6. Update the documentation to include the new timeoutPolicy object.

## Open Questions

<!--
Describe (Q&A format) the important unknowns or things you're not sure about.
Use the discussion to answer these with experts after people digest the
overall design.
-->

## Alternatives considered

<!--
Describe the alternative designs that were considered or should be considered.
Give a justification for why alternative approaches should be rejected if
possible.
-->

## Design Review Notes

<!--
Update this section with the decisions made during the design review meeting. This should be updated before the design is merged.
-->

> **Contributor:** In this scenario, deployment to Kubernetes is handled by Application RP instead of DE.
