A98: Channelz v2
----
* Author(s): ctiller
* Approver: markdroth
* Status: Draft
* Implemented in: <language, ...>
* Last updated: 2025/05/06
* Discussion at: https://groups.google.com/g/grpc-io/c/XrOzA4akIHo

## Abstract

Add a generalized debug interface for gRPC services.

## Background

In A14 we added channelz.
This protocol mixes some lightweight monitoring with some tracing and debuggability features.
It suffers from being relatively rigid in both the topology it presents and the set of node types available.
This update improves the protocol's flexibility.


### Related Proposals:
* A14 - channelz.

## Proposal

A new debug service proto will be added, `channelz/v2/channelz.proto` in the `grpc.channelz.v2` namespace.

### Entities

The fundamental building block of the protocol is a queryable entity.
An entity describes one live object inside of a gRPC implementation.
This might be a call, a channel, a transport, or a TCP socket. It might also describe a resource quota or a watched configuration file.

An entity is described as follows:

```
message Entity {
  // The identifier for this entity.
  int64 id = 1;
  // The kind of this entity.
  string kind = 2;
  // Parents for this entity.
  repeated int64 parents = 3;
  // Has this entity been orphaned?
  bool orphaned = 4;
  // Instantaneous data for this entity.
  repeated google.protobuf.Any data = 5;
  // Historical trace information for the entity.
  repeated TraceEvent trace = 6;
}
```

Entities have state and configuration, and tend to be relatively long-lived, such that querying them makes sense.
This protocol is allowed to return entities whose underlying gRPC object has already been deleted.

**id**: An entity is identified by an id.
These ids are allocated sequentially per the rationale in A14, and implementations should use the same id space for debug entities and channelz objects.

**kind**: An entity has a kind. This is a string descriptor of the general category of object that this entity describes.
We use strings here rather than enums so that implementations are free to extend the kind space with their own objects.
Common entity kinds will see some level of standardization across stacks, but we expect many kinds to be specific per gRPC implementation.
Initially, implementations are expected to match kinds to channelz object types:
kind `"channel"` -> `channelz.v1.Channel`, `"subchannel"` -> `channelz.v1.Subchannel`, `"server"` -> `channelz.v1.Server`, `"socket"` -> `channelz.v1.Socket`.

**parents**: It's useful to be able to associate entities in parent/child hierarchies.
For example, a channel has many subchannel children.
Channelz listed specific kinds of children in its various node types - this tracking (and the need to produce it when sending an object) has caused contention issues in implementations in the past.
Instead, this protocol lists only parents, as that set is far more stable.
When the list of children of an entity (optionally filtered to a particular kind) is desired - a common need - it can be obtained via a separate paginated service call.
Multiple parents are allowed - to handle at least the case of C++ subchannels being owned by multiple channels.

**orphaned**: If the gRPC object that this entity represents has been deleted, then this field MUST be set to true to indicate that the data is potentially stale.

**data**: This is a list of protobuf Any objects describing the current state of this entity.
Implementations may define their own protobufs to be exposed here, and common sets will be standardized separately.

**trace**: Finally, an entity may supply a small summary of its history as a trace.

Each event is defined as:

```
message TraceEvent {
  // High level description of the event.
  string description = 1;
  // When this event occurred.
  google.protobuf.Timestamp timestamp = 2;
  // Any additional supporting data.
  repeated google.protobuf.Any data = 3;
  // Any referenced entities.
  repeated int64 referenced_entities = 4;
}
```

Traces are made available to all entities (in contrast to channelz, which selected which node types had traces and which did not) - though an implementation need not populate a trace for every entity.
Implementations may limit the memory used per entity, or impose an overall system limit on the amount of trace data collected.

The entity trace is a historical snapshot of important events in the entity's history; it is not intended to be a high-fidelity log of every event that occurred.
It's recommended that implementations limit this trace to high-value historical data (e.g. a channel disconnected for this reason), and additionally provide somewhat higher fidelity for currently ongoing operations.
For in-depth examination of all events on an entity the QueryTrace API provides live tracing of entities.

Also note that the notion of severity has been removed from the protocol (in contrast to channelz), since in practice it has not been a useful field.
When converting this protocol to channelz, all trace events should be reported as CT_INFO.

Again, there is a facility for implementations to provide their own additional information in the **data** field.
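
To make the shape of an entity concrete, below is a minimal C++ sketch that populates one using the generated protobuf API. The generated header paths, the `grpc::channelz::v2` C++ namespace, and the use of the `ChannelConnectivityState` message (from the Well Known Data section below) as a `data` payload are assumptions for illustration, not requirements of this proposal.

```
#include <google/protobuf/util/time_util.h>

#include "channelz/v2/channelz.pb.h"         // assumed generated header path
#include "channelz/v2/well_known_data.pb.h"  // assumed generated header path

// Build a hypothetical "subchannel" entity owned by channel 7.
grpc::channelz::v2::Entity MakeExampleEntity() {
  grpc::channelz::v2::Entity entity;
  entity.set_id(42);
  entity.set_kind("subchannel");
  entity.add_parents(7);       // owned by the channel with id 7
  entity.set_orphaned(false);  // the underlying object is still alive

  // Attach instantaneous state as an Any payload.
  grpc::channelz::v2::ChannelConnectivityState state;
  state.set_state(grpc::channelz::v2::ChannelConnectivityState::READY);
  entity.add_data()->PackFrom(state);

  // Record one historical trace event that references a socket entity.
  auto* event = entity.add_trace();
  event->set_description("connection established");
  *event->mutable_timestamp() = google::protobuf::util::TimeUtil::GetCurrentTime();
  event->add_referenced_entities(43);  // id of the socket entity
  return entity;
}
```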

### Queries

Queries are made available via the `Channelz` service:

```
service Channelz {
  // Gets all entities of a given kind, optionally with a given parent.
  rpc QueryEntities(QueryEntitiesRequest) returns (QueryEntitiesResponse);
  // Gets information for a specific entity.
  rpc GetEntity(GetEntityRequest) returns (GetEntityResponse);
  // Query a named trace from an entity.
  // These queries stream live information from the system, and run for as
  // long as the query remains open.
  rpc QueryTrace(QueryTraceRequest) returns (stream QueryTraceResponse);
}
```

**QueryEntities** allows the full database of entities to be queried for entries that match a set of criteria.
The allowed criteria may be extended over time with additional gRFCs.

```
message QueryEntitiesRequest {
  // The kind of entities to query.
  // If this is set to empty then all kinds will be queried.
  string kind = 1;
  // Filter the entities so that only children of this parent are returned.
  // If this is 0 then no parent filter is applied.
  int64 parent = 2;
}

message QueryEntitiesResponse {
  // List of entities that match the query.
  repeated Entity entities = 1;
}
```
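
As a usage sketch, a client could list all subchannels owned by a particular channel as shown below. The generated C++ stub, the header path, and the `localhost:50051` debug address are assumptions for illustration only.

```
#include <iostream>
#include <memory>

#include <grpcpp/grpcpp.h>

#include "channelz/v2/channelz.grpc.pb.h"  // assumed generated header path

// List all "subchannel" entities whose parent is the given channel id.
void ListSubchannels(int64_t channel_id) {
  auto channel = grpc::CreateChannel("localhost:50051",
                                     grpc::InsecureChannelCredentials());
  auto stub = grpc::channelz::v2::Channelz::NewStub(channel);

  grpc::channelz::v2::QueryEntitiesRequest request;
  request.set_kind("subchannel");
  request.set_parent(channel_id);

  grpc::channelz::v2::QueryEntitiesResponse response;
  grpc::ClientContext context;
  grpc::Status status = stub->QueryEntities(&context, request, &response);
  if (!status.ok()) {
    std::cerr << "QueryEntities failed: " << status.error_message() << "\n";
    return;
  }
  for (const auto& entity : response.entities()) {
    std::cout << "subchannel " << entity.id()
              << (entity.orphaned() ? " (orphaned)" : "") << "\n";
  }
}
```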

**GetEntity** allows polling a single entity's state.

```
message GetEntityRequest {
  // The identifier of the entity to get.
  int64 id = 1;
}

message GetEntityResponse {
  // The Entity that corresponds to the requested id. This field
  // should be set.
  Entity entity = 1;
}
```
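
A corresponding sketch for polling a single entity, reusing the includes and the assumed stub type from the previous sketch:

```
// Print the kind and liveness of one entity, given its id.
// Includes as in the previous sketch; `stub` is a
// grpc::channelz::v2::Channelz::Stub (an assumed generated type).
void PrintEntity(grpc::channelz::v2::Channelz::Stub& stub, int64_t id) {
  grpc::channelz::v2::GetEntityRequest request;
  request.set_id(id);

  grpc::channelz::v2::GetEntityResponse response;
  grpc::ClientContext context;
  if (stub.GetEntity(&context, request, &response).ok()) {
    const auto& entity = response.entity();
    std::cout << entity.kind() << " " << entity.id()
              << (entity.orphaned() ? " (orphaned)" : " (live)") << "\n";
  }
}
```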

Finally, **QueryTrace** allows rich queries of live trace data to be pulled from an instance.

```
message QueryTraceRequest {
  // The identifier of the entity to query.
  int64 id = 1;
  // The name of the trace to query.
  string name = 2;
  // Implementation defined query arguments.
  repeated google.protobuf.Any args = 4;
}

message QueryTraceResponse {
  // The events in the trace.
  // If multiple events occurred between the last message in the stream being
  // sent and this one being sent, this can contain more than one event.
  repeated TraceEvent events = 1;
  // Number of events matched by the trace.
  // This may be higher than the number returned if memory limits were exceeded.
  int64 num_events_matched = 2;
}
```

The idea here is that these traces will be very detailed - beyond what can feasibly be stored in the historical traces attached to each entity.
So instead of storing this data in historical traces *in case* it is needed, we query and collect it on demand for small windows of time.
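
For illustration, a client might stream such a trace for a bounded window as sketched below. The trace name `"transport_frames"` is purely hypothetical (trace names are implementation defined), the generated stub names are assumed as in the earlier sketches, and the RPC deadline is what bounds how long the trace runs.

```
#include <chrono>
#include <iostream>

#include <grpcpp/grpcpp.h>

#include "channelz/v2/channelz.grpc.pb.h"  // assumed generated header path

// Stream a hypothetical named trace from entity `id` for ten seconds.
void WatchTrace(grpc::channelz::v2::Channelz::Stub& stub, int64_t id) {
  grpc::ClientContext context;
  context.set_deadline(std::chrono::system_clock::now() +
                       std::chrono::seconds(10));

  grpc::channelz::v2::QueryTraceRequest request;
  request.set_id(id);
  request.set_name("transport_frames");  // hypothetical, implementation defined

  auto reader = stub.QueryTrace(&context, request);
  grpc::channelz::v2::QueryTraceResponse response;
  while (reader->Read(&response)) {
    for (const auto& event : response.events()) {
      std::cout << event.description() << "\n";
    }
  }
  // The stream ends with DEADLINE_EXCEEDED once the window elapses.
  grpc::Status status = reader->Finish();
}
```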

### Well Known Data

A separate `well_known_data.proto` file will be maintained, describing debug state that is common across implementations.

The initial state will be:

```
// Channel connectivity state - attached to kind "channel" and "subchannel".
// These come from the specified states in this document:
// https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md
message ChannelConnectivityState {
  enum State {
    UNKNOWN = 0;
    IDLE = 1;
    CONNECTING = 2;
    READY = 3;
    TRANSIENT_FAILURE = 4;
    SHUTDOWN = 5;
  }
  State state = 1;
}

// Channel target information. Attached to kind "channel" and "subchannel".
message ChannelTarget {
  // The target this channel originally tried to connect to. May be absent.
  string target = 1;
}
```
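
A consumer that understands this well known data can extract it from an entity's `data` field with standard `Any` unpacking, sketched below; the C++ namespace and header path for `well_known_data.proto` are assumed.

```
#include <iostream>

#include "channelz/v2/channelz.pb.h"         // assumed generated header path
#include "channelz/v2/well_known_data.pb.h"  // assumed generated header path

// Print the connectivity state of a channel or subchannel entity, if reported.
void PrintConnectivityState(const grpc::channelz::v2::Entity& entity) {
  for (const auto& any : entity.data()) {
    grpc::channelz::v2::ChannelConnectivityState state;
    if (any.Is<grpc::channelz::v2::ChannelConnectivityState>() &&
        any.UnpackTo(&state)) {
      std::cout << grpc::channelz::v2::ChannelConnectivityState::State_Name(
                       state.state())
                << "\n";
      return;
    }
  }
  std::cout << "no connectivity state reported\n";
}
```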

## Rationale

The new interface is heavily inspired by channelz, but removes rigid node associations and predefined data sets.
Much of the infrastructure required for channelz can be shared with the debug interface, and indeed one can be implemented atop the other with good fidelity (in either direction).

Node ids are consistent between the protocols so that one can take a node id from the debug interface and query channelz data directly (and vice versa).
This will allow consumers of the protocols to gradually transition from one to the other.
It also allows implementations to reuse the bookkeeping already in place for channelz, without adding another kind of bookkeeping to the mix.

Why not just extend channelz?

There are some fundamental data model issues in that protocol that I'd like to address going forward:

* chaotic-good (a transport currently implemented and in production in C++) has multiple TCP sockets per transport. This might be representable as a subchannel with multiple sockets on the client side (though we've already represented the HTTP/2 transport as a socket there), but the server side makes no allowance for properly describing the object hierarchy.
* Further, the set of node types ("kinds" herein) and their relationships are pre-baked into the protocol, and as we evolve and improve gRPC, new node types are needed and new relationships will be added or removed. We should not need to update channelz each time this happens.

Specific metrics have not been carried forward from channelz as required parts of the protocol.
Users that need call or message counts from the system are encouraged to use the telemetry features of gRPC.
Implementations, however, are encouraged to publish whatever metrics they have available at query time in a `data` protobuf.

## Implementation

ctiller will implement this for C++. Other languages may pick this up as needed.