20 changes: 11 additions & 9 deletions packages/api/main.go

```diff
@@ -45,9 +45,9 @@ const (
 	maxMultipartMemory = 1 << 27 // 128 MiB
 	maxUploadLimit     = 1 << 28 // 256 MiB
 
-	maxReadHeaderTimeout = 60 * time.Second
-	maxReadTimeout       = 75 * time.Second
-	maxWriteTimeout      = 75 * time.Second
+	maxReadTimeout  = 75 * time.Second
+	maxWriteTimeout = 75 * time.Second
+	idleTimeout     = 620 * time.Second
 
 	defaultPort = 80
 )
@@ -171,12 +171,14 @@ func NewGinServer(ctx context.Context, logger *zap.Logger, apiStore *handlers.AP
 	r.MaxMultipartMemory = maxMultipartMemory
 
 	s := &http.Server{
-		Handler:           r,
-		Addr:              fmt.Sprintf("0.0.0.0:%d", port),
-		ReadHeaderTimeout: maxReadHeaderTimeout,
-		ReadTimeout:       maxReadTimeout,
-		WriteTimeout:      maxWriteTimeout,
-		BaseContext:       func(net.Listener) context.Context { return ctx },
+		Handler: r,
+		Addr:    fmt.Sprintf("0.0.0.0:%d", port),
+		// Configure timeouts to be greater than the proxy timeouts.
+		// https://github.com/golang/go/issues/47007
+		ReadTimeout:  maxReadTimeout,
+		WriteTimeout: maxWriteTimeout,
+		IdleTimeout:  idleTimeout,
+		BaseContext:  func(net.Listener) context.Context { return ctx },
 	}
 
 	return s
```
5 changes: 3 additions & 2 deletions packages/cluster/network/main.tf

```diff
@@ -310,8 +310,9 @@ resource "google_compute_url_map" "orch_map" {
 
 ### IPv4 block ###
 resource "google_compute_target_https_proxy" "default" {
-  name    = "${var.prefix}https-proxy"
-  url_map = google_compute_url_map.orch_map.self_link
+  name                        = "${var.prefix}https-proxy"
+  url_map                     = google_compute_url_map.orch_map.self_link
+  http_keep_alive_timeout_sec = 540
```
**Member:**

Do we need to change this? Can this be changed without any disruption?

**ValentaTomas (Member, Author), May 7, 2025:**
Based on this, the LB's client-facing side has a default timeout of 610 s (which can be configured), but the backend side has a non-configurable timeout of 600 s. I don't know why these are the defaults, given the issue this can cause. The reasons for setting the timeouts this way (rising as you go upstream) are here. This applies to both API and sandbox traffic.

This could be even lower (300 s might be OK too), but 540 s seems reasonable.

As for reloading without disruption: it should just be part of the Envoy config, so most likely yes, but I have not tested this.

**Member:**

I think there's a misunderstanding: I think the default Load Balancer timeout is intentionally higher than the one to the service, so it can close the connection to the backend gracefully while maximizing the connection length to the client; this should reduce latency.

Time →──────────────────────────────────────────────────────────────────────────────▶
  • Client–GFE Idle Keep-Alive Timeout = 610 s
  • GFE–Backend Idle Keep-Alive Timeout = 600 s (fixed)
  • Your Server Idle Keep-Alive Timeout = 620 s (must exceed the GFE timeouts)

─── At 0 s ────────────────────────────────────────────────────────────────────────  
   Client opens a TCP connection to GFE  
   GFE opens a TCP connection to your server  
   (Both sides handshake, HTTP keep-alive enabled)

─── Idle, no requests for 600 s ─────────────────────────────────────────────────  
   At **600 s**, GFE notices its **backend socket** has been idle for 600 s → GFE sends a **FIN** to your server and closes that backend TCP connection.  
   Your server (with 620 s idle config) has not yet closed its side, so it does a **graceful close** when it sees GFE’s FIN.

─── Idle from 600 s to 605 s ──────────────────────────────────────────────────  
   Client ↔ GFE connection still open (it only times out at 610 s)  
   Backend connection is now closed on both ends (GFE closed it at 600 s; server followed suit)

─── At **605 s**, client issues a new HTTP request over the **existing** TCP connection to GFE  
   1. **GFE receives** the request on the client socket (still alive).  
   2. GFE looks for an **idle backend socket** to forward on—but it closed that at 600 s.  
   3. GFE immediately **opens a brand-new TCP connection** to your server (SYN → SYN-ACK → ACK).  
   4. GFE **forwards** the client’s request over that new connection.  
   5. Your server processes and replies; GFE relays the response back to the client.  

─── Connection now active again; both sides may keep it alive for their configured idle timeouts ────▶


**ValentaTomas (Member, Author), May 8, 2025:**
You might be right; the fact that they set it up this way by default is at least suspicious.

**ValentaTomas (Member, Author), May 8, 2025:**
Let's scrap this for now.

**ValentaTomas (Member, Author), May 8, 2025:**
I'm thinking about whether we should configure our proxies to also have (620s-610s, 630s-620s) timeouts.

**ValentaTomas (Member, Author), May 8, 2025:**

I'm really wondering whether the LB's 610 s / 600 s (fixed) timeouts have any advantage, because the client's (-> LB) timeout should already be < 610 s, so the connection would still be closed from the client side at that point.
I'm also wondering whether there is a point in making the "client" part of each proxy's timeout lower than the "server" part: wouldn't it work equally well if the "client" part timeout was slightly higher (or the same, it doesn't matter in the end), as long as the next server's timeout was still bigger?

**Member:**

There are two idle timeouts, downstream and upstream, and the rule (downstream > upstream) mainly applies within the same service.

If you close the downstream connection first, a new downstream request can try to reuse the TCP connection even though it's already in the CLOSE_WAIT state, because the transport doesn't check the connection state. Doing this results in an error.

For different services, you're trying to prevent both sides from sending a FIN at the same time; it's good practice to add a cushion towards upstream so you don't have to recreate these connections. Theoretically you could have it the other way around.

So the important idle timeout is the one between server and client within the same service. I committed it with a comment.


```diff
   certificate_map = "//certificatemanager.googleapis.com/${google_certificate_manager_certificate_map.certificate_map.id}"
 }
```