
I get the error unavailable (14): Transport became inactive with a large number (>20) of unary requests #2202

Open
sergeiromanchuk opened this issue Feb 27, 2025 · 19 comments
Labels
status/triage Collecting information required to triage the issue. version/v1 Relates to v1

Comments

@sergeiromanchuk

sergeiromanchuk commented Feb 27, 2025

Describe the bug

I have a table view where each cell sends two gRPC requests using async let. If I scroll through the table slowly, I almost never encounter any problems, but if I start scrolling quickly, every request I send fails with the error unavailable (14): Transport became inactive.

This usually happens when the number of streams in use exceeds 20.

channel log
[{Default}] connectionUtilizationChanged(id:streamsUsed:streamCapacity:)::ChannelProvider.swift::191:19: [API v2] → ChannelProvider::connectionUtilizationChanged(id:streamsUsed:streamCapacity:). Connection ID: ObjectIdentifier(0x000000011ee58900). Stream capacity: 2147483647. Streams used: 21

channel error
[{Default}] connectionClosed(id:error:)::ChannelProvider.swift::205:19: [API v2] → ChannelProvider::connectionClosed(id:error:). Connection ID: ObjectIdentifier(0x000000011ee58900). Error: Optional(NIOCore.ChannelError.ioOnClosedChannel)

request error
[{Default}] xxx()::XXXService.swift::67:23: [API] → XXXService::xxx(). XXX service completed with result: success(unavailable (14): Transport became inactive)

trace
2025-02-27T09:45:36+0100 trace [CallOptions] : call_state=closed grpc.conn.addr_local=10.179.0.18 grpc.conn.addr_remote=10.229.0.160 grpc_request_id=5BFD6881-842F-42AB-95F1-AC0E322817FE [GRPC] failing buffered writes

SPM:
Protobuf 1.25.2
gRPC 1.65.1
gRPC-Swift 1.21.0

Platform: iOS

Expected behaviour

I expect the channel to handle this number of requests without issue, since it is only 20-30 requests.

Additional information

Options for each unary call:

static var unaryCallOptions: CallOptions {
    let timeLimit: TimeLimit = .timeout(.seconds(5))
    #if DEBUG
    var logger = Logger(label: "[\(Self.self)]")
    logger.logLevel = .trace
    return CallOptions(timeLimit: timeLimit, logger: logger)
    #else
    return CallOptions(timeLimit: timeLimit)
    #endif
}

Keepalive

let keepalive = ClientConnectionKeepalive(
    interval: .minutes(5),
    timeout: .seconds(30),
    permitWithoutCalls: true
)

GRPCChannelPool configuration:

let channel = try GRPCChannelPool.with(
    target: .hostAndPort(
        configuration.host,
        configuration.port
    ),
    transportSecurity: .tls(
        configuration.tlsConfiguration
    ),
    eventLoopGroup: configuration.eventLoopGroup,
    { [weak self] channelConfiguration in
        channelConfiguration.keepalive = configuration.keepalive
        channelConfiguration.delegate = self
        channelConfiguration.connectionBackoff = ConnectionBackoff(
            initialBackoff: 1,
            maximumBackoff: 15,
            multiplier: 1.5,
            minimumConnectionTimeout: 15
        )
        #if DEBUG
        var logger = Logger(label: "[Channel gRPC]")
        logger.logLevel = .trace
        channelConfiguration.backgroundActivityLogger = logger
        #endif
    }
)

Request example:

public func getData(
    for id: UUID,
    options: CallOptions
) async throws -> Data {
    let client = try dependency.getClient()
    async let task1 = client.getX1(
        X1Request(uids: [id]),
        callOptions: options
    )
    async let task2 = client.getX2(
        X2Request(uids: [id]),
        callOptions: options
    )
    let (response1, response2) = try await (task1, task2) // ! error !
    return ...
}

Generated code:

public func getX(
  _ request: Frontend_PpX_V1_XRequest,
  callOptions: CallOptions? = nil
) async throws -> Frontend_PpX_V1_XResponse {
  return try await self.performAsyncUnaryCall(
    path: Frontend_PpX_V1_XServiceClientMetadata.Methods.getX.path,
    request: request,
    callOptions: callOptions ?? self.defaultCallOptions,
    interceptors: self.interceptors?.makeXInterceptors() ?? []
  )
}

public func performAsyncUnaryCall<Request: Message & Sendable, Response: Message & Sendable>(
  path: String,
  request: Request,
  callOptions: CallOptions? = nil,
  interceptors: [ClientInterceptor<Request, Response>] = [],
  responseType: Response.Type = Response.self
) async throws -> Response {
  let call = self.channel.makeAsyncUnaryCall(
    path: path,
    request: request,
    callOptions: callOptions ?? self.defaultCallOptions,
    interceptors: interceptors
  )

  return try await withTaskCancellationHandler {
    try await call.response
  } onCancel: {
    call.cancel()
  }
}
@glbrntt glbrntt added status/triage Collecting information required to triage the issue. version/v1 Relates to v1 labels Feb 27, 2025
@glbrntt
Collaborator

glbrntt commented Feb 27, 2025

unavailable (14): Transport became inactive means that the connection was unexpectedly dropped (which is backed up by NIOCore.ChannelError.ioOnClosedChannel).

If the client closed the connection because of a local error condition we should see that. So the working hypothesis is that the server dropped the connection. It's unclear why that's the case.

Could you provide the complete trace logs for when this happens? It might shed some light on this.

@sergeiromanchuk
Author

sergeiromanchuk commented Feb 27, 2025

@glbrntt Are these logs of any value?

2025-02-27T09:45:36+0100 trace [CallOptions] : call_state=closed grpc.conn.addr_local=10.179.0.18 grpc.conn.addr_remote=10.229.0.160 grpc_request_id=5BFD6881-842F-42AB-95F1-AC0E322817FE [GRPC] failing buffered writes

@glbrntt
Collaborator

glbrntt commented Feb 27, 2025

No, that will be a result of the connection being dropped, not the reason it drops.

@sergeiromanchuk
Author

sergeiromanchuk commented Feb 27, 2025

GRPCTraceLogs.txt

cc: @glbrntt

@sergeiromanchuk
Author

In my code, I do the following

task?.cancel()
task = Task {
    let data = try await service.getData(...)
    // handle data
}

I noticed that the error appears when I cancel the task that executes the request. By removing the cancellation of the task, I found that the number of executed requests goes up to 1000+ with no problem. Why is that?

cc: @glbrntt

@sergeiromanchuk
Author

I also found that if a request hits its timeout, the channel goes down as well.

LogsWithGRPCPool.txt.zip

@sergeiromanchuk
Author

NoSuchStream logs

While debugging, I also found that gRPC calls channelRead when it receives a response and tries to look up the corresponding stream, which was previously closed and removed. It then raises a NoSuchStream error.



2025-02-27T14:45:58+0100 error [Channel gRPC] : error=NoSuchStream(streamID: HTTP2StreamID(233), location: "NIOHTTP2/ConnectionStreamsState.swift:305") grpc.conn.addr_local=10.179.0.62 grpc.conn.addr_remote=10.229.2.186 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 [GRPC] grpc client error
2025-02-27T14:45:58+0100 error [Channel gRPC] : error=ioOnClosedChannel grpc.conn.addr_local=10.179.0.62 grpc.conn.addr_remote=10.229.2.186 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 [GRPC] grpc client error
2025-02-27T14:45:58+0100 debug [Channel gRPC] : connectivity_state=ready grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 [GRPC] deactivating connection
2025-02-27T14:45:58+0100 debug [Channel gRPC] : delay=0 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 [GRPC] scheduling connection attempt
2025-02-27T14:45:58+0100 debug [Channel gRPC] : grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 new_state=transientFailure old_state=ready [GRPC] connectivity state change
2025-02-27T14:45:58+0100 error [CallOptions] : error=unavailable (14): Transport became inactive grpc.conn.addr_local=10.179.0.62 grpc.conn.addr_remote=10.229.2.186 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 grpc_request_id=BEE94883-7EFB-4287-8F56-60F6BB3EFF76 [GRPC] grpc client error
2025-02-27T14:45:58+0100 trace [CallOptions] : call_state=closed grpc.conn.addr_local=10.179.0.62 grpc.conn.addr_remote=10.229.2.186 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 grpc_request_id=BEE94883-7EFB-4287-8F56-60F6BB3EFF76 [GRPC] failing buffered writes
2025-02-27T14:45:58+0100 error [CallOptions] : error=unavailable (14): Transport became inactive grpc.conn.addr_local=10.179.0.62 grpc.conn.addr_remote=10.229.2.186 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 grpc_request_id=F0BE824A-04A4-4074-9A5B-43700DF739EB [GRPC] grpc client error
2025-02-27T14:45:58+0100 trace [CallOptions] : call_state=closed grpc.conn.addr_local=10.179.0.62 grpc.conn.addr_remote=10.229.2.186 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 grpc_request_id=F0BE824A-04A4-4074-9A5B-43700DF739EB [GRPC] failing buffered writes
2025-02-27T14:45:58+0100 error [CallOptions] : error=unavailable (14): Transport became inactive grpc.conn.addr_local=10.179.0.62 grpc.conn.addr_remote=10.229.2.186 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 grpc_request_id=95EFBB4C-09FF-44BF-8DB5-D8E11ADD7262 [GRPC] grpc client error
2025-02-27T14:45:58+0100 trace [CallOptions] : call_state=closed grpc.conn.addr_local=10.179.0.62 grpc.conn.addr_remote=10.229.2.186 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 grpc_request_id=95EFBB4C-09FF-44BF-8DB5-D8E11ADD7262 [GRPC] failing buffered writes
2025-02-27T14:45:58+0100 error [CallOptions] : error=unavailable (14): Transport became inactive grpc.conn.addr_local=10.179.0.62 grpc.conn.addr_remote=10.229.2.186 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 grpc_request_id=95F10C40-D09F-493F-87A0-2C322C25653E [GRPC] grpc client error
2025-02-27T14:45:58+0100 trace [CallOptions] : call_state=closed grpc.conn.addr_local=10.179.0.62 grpc.conn.addr_remote=10.229.2.186 grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/2 grpc_request_id=95F10C40-D09F-493F-87A0-2C322C25653E [GRPC] failing buffered writes
2025-02-27T14:45:58+0100 debug [Channel gRPC] : connectivity_state=transientFailure grpc_connection_id=7F6EA520-3D88-4156-9712-30DC60993782/3 [GRPC] vending multiplexer future

@sergeiromanchuk
Author

https://stackoverflow.com/questions/69422236/grpc-swift-idle-stream-cause-next-rpc-call-to-fail-with-unavailable-14-due-to

@sergeiromanchuk
Author

It looks like the main problem is that a particular stream cannot be found in the list of streams, after which the channel is closed. If I load data into a cell and cancel the tasks that are no longer needed, I always catch the error on stream 29.

(lldb) p underlyingError
(NIOHTTP2.NIOHTTP2Errors.NoSuchStream) {
  streamID = (networkStreamID = 221)
  location = "NIOHTTP2/ConnectionStreamsState.swift:305"
}

Why is that happening?

cc: @glbrntt

@sergeiromanchuk
Author

Found related issue: #1421

@sergeiromanchuk
Author

sergeiromanchuk commented Feb 27, 2025

I tried creating one channel per request, and it did help. However, it's unworkable because the loading time becomes incredibly long. I really need a solution that lets the channel survive a large number of requests. I believe this behavior is indeed a bug.

I also haven't found any way to reduce the channel's connection time to the server, so the question is still relevant.

I also can't use the new version of the library (2.0.0) because it only supports iOS 18, while our project supports iOS 15.

cc: @glbrntt

@sergeiromanchuk
Author

Basically, I understand it like this, and I'm almost sure it's true. We cancel a stream, and it gets added to recentlyResetStreams. This list has a limit of 32 streams; when it fills up, the oldest entry is evicted and the new one is added.

If we scroll quickly, this reset list fills up very fast, and streams get evicted from it just as quickly. Then packets arrive for old streams/requests and we try to process them. We check whether the corresponding stream exists in activeStreams: it doesn't. Then we check recentlyResetStreams, but the stream isn't there either, because the list filled up with new entries and our stream was evicted before the packet arrived.

As a result, we return a noSuchStream error and the channel closes. That's it.

Oh, and this limit of 32 is hardcoded, so we can't change it. The TODO for making it configurable has been there since 2019, so I guess it's about time. 😅
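
Sketched as a simplified model of that eviction behaviour (a hypothetical illustration, not the real NIOHTTP2 code, which lives in ConnectionStreamsState.swift):

// Simplified, hypothetical model of the bounded reset-stream history described
// above; NIOHTTP2's real implementation differs in detail.
struct RecentlyResetStreams {
    // NIOHTTP2 hard-codes a comparable limit (32) for its reset-stream history.
    private let maxSize = 32
    private var streamIDs: [Int] = []

    // Remember a stream we just reset (e.g. because its RPC was cancelled).
    mutating func recordReset(of streamID: Int) {
        if streamIDs.count == maxSize {
            streamIDs.removeFirst() // the oldest entry is evicted once the list is full
        }
        streamIDs.append(streamID)
    }

    func contains(_ streamID: Int) -> Bool {
        streamIDs.contains(streamID)
    }
}

// When a frame arrives for a stream that is neither active nor in this list,
// the connection treats it as NoSuchStream and tears the channel down, which
// is what surfaces to callers as "Transport became inactive".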



@sergeiromanchuk
Author

What do you think?

cc: @jackystd @glbrntt

@glbrntt
Collaborator

glbrntt commented Feb 28, 2025

Hey @sergeiromanchuk -- thanks for providing all of this info and good job on diagnosing the issue!

I think allowing the max reset streams limit to be configurable and then exposing that config in gRPC is reasonable. I'll open a separate issue in https://github.com/apple/swift-nio-http2 to track doing that. We can track the gRPC changes in this issue.

In the meantime you can work around this by not cancelling the streams.
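
For example, something along these lines (a rough sketch, not grpc-swift API; it just keeps your own Task cancellation from reaching the RPC):

// Wrap any async RPC so that cancelling the caller does not cancel the stream.
func shieldedFromCancellation<T: Sendable>(
    _ rpc: @escaping @Sendable () async throws -> T
) async throws -> T {
    // An unstructured Task does not inherit the caller's cancellation, so the
    // generated client's call.cancel() is never invoked and no RST_STREAM is sent.
    let task = Task { try await rpc() }
    // If the caller is cancelled while awaiting, the result is simply discarded
    // by the caller, but the RPC itself runs to completion.
    return try await task.value
}

// Usage, with the getData(for:options:) from your request example:
// let data = try await shieldedFromCancellation {
//     try await service.getData(for: id, options: options)
// }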

@sergeiromanchuk
Author

Thanks for the reply @glbrntt. Unfortunately, it's not enough to simply stop cancelling streams; I'll also have to limit the number of calls, because the channel also drops with an error if a request completes with a timeout. I think that is a separate issue.
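
Roughly, the limiting would look like this (a sketch with a hypothetical fetch closure standing in for getData(for:options:), not my exact production code):

import Foundation

// Cap the number of in-flight RPCs; results come back in completion order.
func loadAll(
    ids: [UUID],
    maxConcurrent: Int = 4,
    fetch: @escaping @Sendable (UUID) async throws -> Data
) async throws -> [Data] {
    try await withThrowingTaskGroup(of: Data.self) { group in
        var results: [Data] = []
        var iterator = ids.makeIterator()

        // Seed the group with at most maxConcurrent requests.
        for _ in 0..<maxConcurrent {
            guard let id = iterator.next() else { break }
            group.addTask { try await fetch(id) }
        }
        // Start the next request only when one finishes, so no more than
        // maxConcurrent streams are ever open at once.
        while let data = try await group.next() {
            results.append(data)
            if let id = iterator.next() {
                group.addTask { try await fetch(id) }
            }
        }
        return results
    }
}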

@glbrntt
Collaborator

glbrntt commented Feb 28, 2025

Here's the HTTP/2 issue: apple/swift-nio-http2#493

@glbrntt
Collaborator

glbrntt commented Feb 28, 2025

I'll also have to limit the number of calls, because the channel also drops with an error if a request completes with a timeout.

Can you elaborate on this one? An RPC timing out won't close the connection, but an RPC may time out because of a connection timeout.
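
If it helps to tell the two apart on the client, something like this should work (a rough sketch; failed async calls typically surface a GRPCStatus, and getData here stands in for the hypothetical call from your example):

import Foundation
import GRPC

// Distinguish an RPC deadline from a dropped transport on the client side.
func fetchOrNil(_ getData: () async throws -> Data) async throws -> Data? {
    do {
        return try await getData()
    } catch let status as GRPCStatus where status.code == .deadlineExceeded {
        // The RPC hit its own time limit; the connection itself stays up.
        return nil
    } catch let status as GRPCStatus where status.code == .unavailable {
        // The transport went away ("Transport became inactive"); the pool
        // will schedule a reconnect.
        return nil
    }
}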

@sergeiromanchuk
Author

When I sent a lot of tasks (~1000) without cancelling them, the connection would process them all for about 3 minutes and they would fail with a timeout error. After all the streams were processed, the connection would also close with an ioOnClosedChannel error.

@sergeiromanchuk
Author

sergeiromanchuk commented Feb 28, 2025

@glbrntt I don't cancel unary calls, and I limit parallel execution to 4 tasks. However, the channel still closes with an error. It seems the problem goes much deeper. Please examine the logs around the ioOnClosedChannel entries. (It is not a noSuchStream issue in this case.)

Logs1.txt
Logs2.txt
