Refactor RoutingManager #741

Open
wants to merge 1 commit into
base: main

@maswin maswin commented Jul 31, 2025

Description

This commit addresses several items:

  1. QueryCountBasedRouter was broken by a bug introduced in commit 3430f35. After that change, provideBackendConfiguration was used to get the cluster instead of provideClusterForRoutingGroup, but that method was not overridden in the QueryCountBasedRouter class, so QueryCountBasedRouter was failing. Fixed it.
  2. While trying to create a new RoutingManager, it was hard to tell which methods needed to be overridden, because the class was populated with a lot of internal methods. I extracted the methods into an interface and turned the existing class into BaseRoutingManager. The selectBackend method can be overridden to change how the cluster is selected.
  3. Modified QueryCountBasedRouter to use ConcurrentHashMap instead of @GuardedBy("this"), which locks the entire object. This reduces lock contention.
  4. Fixed related tests
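
To make item 3 concrete, here is a minimal sketch of the general technique: replacing an object-wide monitor with per-key atomic updates on a ConcurrentHashMap. It is not the code from this PR; the class and method names (UserQueryCounts, increment, decrement) are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical per-user query counter; names are illustrative, not from the PR.
final class UserQueryCounts
{
    private final ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();

    // merge() is atomic per key, so callers do not need a synchronized block
    // around the whole object to get a correct count.
    int increment(String user)
    {
        return counts.merge(user, 1, Integer::sum);
    }

    // Decrement atomically; returning null from the remapping function removes the entry.
    void decrement(String user)
    {
        counts.computeIfPresent(user, (key, value) -> value > 1 ? value - 1 : null);
    }

    int get(String user)
    {
        return counts.getOrDefault(user, 0);
    }
}

final class UserQueryCountsDemo
{
    public static void main(String[] args)
            throws InterruptedException
    {
        UserQueryCounts counts = new UserQueryCounts();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    counts.increment("alice");
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        // 4 threads x 1000 increments each
        System.out.println(counts.get("alice")); // prints 4000
    }
}
```

Updates for different users usually proceed without contending with each other, whereas a method synchronized on "this" serializes every caller.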

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required, with the following suggested text:

* Fix some things.

@cla-bot cla-bot bot added the cla-signed label Jul 31, 2025

     * request object. Default implementation comes here.
     */
-public abstract class RoutingManager
+public interface RoutingManager

Can we have Javadoc for this interface and its methods, since this becomes our primary interface for all RoutingManager implementations?

Member

It's a good idea to have an interface to enforce the rules.

Comment on lines 90 to 92
    CacheBuilder.newBuilder()
            .maximumSize(10000)
            .expireAfterAccess(30, TimeUnit.MINUTES)

nit: you can extract this into a separate builder and reuse it 3 times while building the caches.

For example:

    private final CacheBuilder<Object, Object> builder = CacheBuilder.newBuilder()
            .maximumSize(10000)
            .expireAfterAccess(30, TimeUnit.MINUTES);

    queryIdBackendCache = builder.build(
            new CacheLoader<>()
            {
                @Override
                public String load(String queryId)
                {
                    return findBackendForUnknownQueryId(queryId);
                }
            });
    queryIdRoutingGroupCache = builder.build(
            new CacheLoader<>()
            {
                @Override
                public String load(String queryId)
                {
                    return findRoutingGroupForUnknownQueryId(queryId);
                }
            });
    queryIdExternalUrlCache = builder.build(
            new CacheLoader<>()
            {
                @Override
                public String load(String queryId)
                {
                    return findExternalUrlForUnknownQueryId(queryId);
                }
            });

Member Author

Done

Comment on lines 300 to 303
    TrinoStatus status = backendToStatus.get(backendId);
    if (status == null) {
        return true;
    }

Suggested change
-    TrinoStatus status = backendToStatus.get(backendId);
-    if (status == null) {
-        return true;
-    }
+    TrinoStatus status = backendToStatus.getOrDefault(backendId, TrinoStatus.UNKNOWN);
+    return status != TrinoStatus.HEALTHY;

Member Author

Done

Comment on lines 127 to 111
    public ProxyBackendConfiguration provideDefaultBackendConfiguration(String user)
    {
        List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveDefaultBackends();
        backends.removeIf(backend -> isBackendNotHealthy(backend.getName()));
        return selectBackend(backends, user).orElseThrow(() -> new IllegalStateException("Number of active backends found zero"));
    }

    /**
     * Performs routing to a given cluster group. This falls back to a default backend, if no scheduled
     * backend is found.
     */
    @Override
    public ProxyBackendConfiguration provideBackendConfiguration(String routingGroup, String user)
    {
        List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveBackends(routingGroup);
        backends.removeIf(backend -> isBackendNotHealthy(backend.getName()));
        return selectBackend(backends, user).orElseGet(() -> provideDefaultBackendConfiguration(user));
    }

Since we are removing the unhealthy backends from the candidate list, you can use a stream filter with a lambda to drop the unwanted entries. It also simplifies the predicate function that identifies healthy backends.

For example:

Suggested change
-public ProxyBackendConfiguration provideDefaultBackendConfiguration(String user)
-{
-    List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveDefaultBackends();
-    backends.removeIf(backend -> isBackendNotHealthy(backend.getName()));
-    return selectBackend(backends, user).orElseThrow(() -> new IllegalStateException("Number of active backends found zero"));
-}
-/**
- * Performs routing to a given cluster group. This falls back to a default backend, if no scheduled
- * backend is found.
- */
-@Override
-public ProxyBackendConfiguration provideBackendConfiguration(String routingGroup, String user)
-{
-    List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveBackends(routingGroup);
-    backends.removeIf(backend -> isBackendNotHealthy(backend.getName()));
-    return selectBackend(backends, user).orElseGet(() -> provideDefaultBackendConfiguration(user));
-}
+public ProxyBackendConfiguration provideDefaultBackendConfiguration(String user) {
+    var backends = gatewayBackendManager.getActiveDefaultBackends()
+            .stream()
+            .filter(backend -> isBackendHealthy(backend.getName()))
+            .toList();
+    return selectBackend(backends, user).orElseThrow(() -> new IllegalStateException("Number of active backends found zero"));
+}
+@Override
+public ProxyBackendConfiguration provideBackendConfiguration(String routingGroup, String user) {
+    var backends = gatewayBackendManager.getActiveBackends(routingGroup)
+            .stream()
+            .filter(backend -> isBackendHealthy(backend.getName()))
+            .toList();
+    return selectBackend(backends, user).orElseGet(() -> provideDefaultBackendConfiguration(user));
+}
+private boolean isBackendHealthy(String backendId) {
+    TrinoStatus status = backendToStatus.getOrDefault(backendId, TrinoStatus.UNKNOWN);
+    return status == TrinoStatus.HEALTHY;
+}

Member Author

Done
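
For readers following the thread, the filtering approach suggested above can be tried in isolation. This is a minimal sketch under stated assumptions: the Status enum stands in for the gateway's TrinoStatus, backends are plain strings instead of ProxyBackendConfiguration, and HealthFilter, setStatus, and healthyOnly are hypothetical names. It uses Stream.toList(), as the suggestion does, which needs Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the gateway's TrinoStatus; only the values mentioned in this thread.
enum Status { HEALTHY, UNHEALTHY, UNKNOWN }

// Hypothetical helper class, not part of the PR.
final class HealthFilter
{
    private final Map<String, Status> backendToStatus = new ConcurrentHashMap<>();

    void setStatus(String backendId, Status status)
    {
        backendToStatus.put(backendId, status);
    }

    // A backend with no recorded status defaults to UNKNOWN and so is not healthy.
    boolean isBackendHealthy(String backendId)
    {
        return backendToStatus.getOrDefault(backendId, Status.UNKNOWN) == Status.HEALTHY;
    }

    // Builds a new list and leaves the input untouched, unlike removeIf.
    List<String> healthyOnly(List<String> backends)
    {
        return backends.stream()
                .filter(this::isBackendHealthy)
                .toList();
    }
}

final class HealthFilterDemo
{
    public static void main(String[] args)
    {
        HealthFilter filter = new HealthFilter();
        filter.setStatus("trino-1", Status.HEALTHY);
        filter.setStatus("trino-2", Status.UNHEALTHY);

        List<String> active = new ArrayList<>(List.of("trino-1", "trino-2", "trino-3"));
        System.out.println(filter.healthyOnly(active)); // prints [trino-1]
        System.out.println(active.size());              // prints 3, the input is unchanged
    }
}
```

One design difference worth noting: removeIf mutates the list in place, so it only works if the caller receives a mutable list, while the stream version makes no such assumption and returns an unmodifiable result.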

* This class performs health check, stats counts for each backend and provides a backend given
* request object. Default implementation comes here.
*/
public abstract class BaseRoutingManager

Suggestion: IMO, for ease of reading, you may want to group methods by visibility and order them:

1/ constructor
2/ abstract methods
3/ public methods
4/ protected and package-private methods
5/ private methods

I see the abstract methods as the interface for subclasses, so placing them near the top makes them easier for future maintainers to discover.

Member Author

Done

return routingGroup;
}

protected void updateBackEndHealth(List<ClusterStats> stats)

Can we delete this method?

It gets confusing because we have 3 methods that update the state in the cache.
You could move this method's logic into the interface method public void updateClusterStats. That would simplify things and avoid overloading the interface method.

Member Author

Done

Comment on lines +255 to +228
    if (entry.getValue().isDone()) {
        int responseCode = entry.getValue().get();
        if (responseCode == 200) {
            log.info("Found query [%s] on backend [%s]", queryId, entry.getKey());
            setBackendForQueryId(queryId, entry.getKey());
            return entry.getKey();
        }
    }

Q: Curious, why did you add the condition check for future.isDone()?

Member Author

Most of this code already exists - https://github.com/trinodb/trino-gateway/blob/main/gateway-ha/src/main/java/io/trino/gateway/ha/router/RoutingManager.java
I renamed the file from RoutingManager to BaseRoutingManager and created an interface file named RoutingManager. So git, instead of marking it as a file rename, treated everything in the file as new changes.
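
As background for the isDone() question: Future.get() blocks until the task completes, while isDone() returns immediately. The sketch below shows the general pattern only; it assumes a hypothetical firstCompletedOk helper and hard-coded response codes instead of the gateway's real HTTP probes, and it is not the code from the linked file.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class QueryLookup
{
    private QueryLookup() {}

    // Hypothetical helper: returns the first backend whose probe has already
    // completed with HTTP 200. Probes still in flight are skipped, not awaited.
    static Optional<String> firstCompletedOk(Map<String, Future<Integer>> responses)
    {
        for (Map.Entry<String, Future<Integer>> entry : responses.entrySet()) {
            Future<Integer> response = entry.getValue();
            if (!response.isDone()) {
                continue; // not finished yet: skip instead of blocking on get()
            }
            try {
                // The future is done, so get() returns (or throws) without waiting.
                if (response.get() == 200) {
                    return Optional.of(entry.getKey());
                }
            }
            catch (ExecutionException | CancellationException e) {
                // Probe failed or was cancelled; try the next backend.
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
        return Optional.empty();
    }
}

final class QueryLookupDemo
{
    public static void main(String[] args)
    {
        Map<String, Future<Integer>> responses = new LinkedHashMap<>();
        responses.put("trino-1", new CompletableFuture<>());               // still pending
        responses.put("trino-2", CompletableFuture.completedFuture(404));  // done, not found
        responses.put("trino-3", CompletableFuture.completedFuture(200));  // done, found
        System.out.println(QueryLookup.firstCompletedOk(responses)); // prints Optional[trino-3]
    }
}
```

A consequence of this pattern is that a probe that has not finished when the loop runs is ignored. Whether the original code waits for the probes before reaching the loop is not visible in the excerpt quoted above.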

Member

@vishalya vishalya left a comment

Can we separate this PR into 2 different ones?
(1) Just fix the bug and add the minimal interface needed, and the fixed tests.
(2) Rest of the refactoring

@maswin maswin force-pushed the routing branch 2 times, most recently from 8853b9b to 7cdbcbf on August 15, 2025 19:47
maswin commented Aug 15, 2025

> Can we separate this PR into 2 different ones? (1) Just fix the bug and add the minimal interface needed, and the fixed tests. (2) Rest of the refactoring

I tried splitting it into 2 commits, but it gets complicated because the refactoring itself took care of the bug.
The changes are relatively small, but git for some reason treats BaseRoutingManager as a new file rather than a rename of the previous RoutingManager class (probably because there is a new interface with that name).

@vishalya
Member

I'm concerned that having a default implementation in the BaseRoutingManager base class makes our routing logic brittle. We saw this when the interface change broke the query-count-based router.

A better approach would be to move the default logic into its own class, say DefaultRoutingManager. Then, concrete classes could use composition (a has-a relationship) to include this default behavior instead of inheriting it directly. This will decouple our concrete routers from the base class implementation, preventing similar breaks in the future.
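
The composition idea in the paragraph above could look roughly like the following. This is a sketch under stated assumptions, not a proposal for the actual class layout: Router, DefaultRouter, and PinnedUserRouter are hypothetical names, and backends are plain strings instead of ProxyBackendConfiguration.

```java
import java.util.List;
import java.util.Optional;

// All names are hypothetical; backends are plain strings to keep the sketch small.
interface Router
{
    Optional<String> selectBackend(List<String> backends, String user);
}

// The default behavior lives in its own concrete class (here: pick the first backend).
final class DefaultRouter
        implements Router
{
    @Override
    public Optional<String> selectBackend(List<String> backends, String user)
    {
        return backends.stream().findFirst();
    }
}

// A specialized router holds a Router (has-a) and delegates to it when it has
// no opinion, instead of inheriting the default from a base class.
final class PinnedUserRouter
        implements Router
{
    private final Router fallback;
    private final String pinnedUser;
    private final String pinnedBackend;

    PinnedUserRouter(Router fallback, String pinnedUser, String pinnedBackend)
    {
        this.fallback = fallback;
        this.pinnedUser = pinnedUser;
        this.pinnedBackend = pinnedBackend;
    }

    @Override
    public Optional<String> selectBackend(List<String> backends, String user)
    {
        if (user.equals(pinnedUser) && backends.contains(pinnedBackend)) {
            return Optional.of(pinnedBackend);
        }
        return fallback.selectBackend(backends, user);
    }
}

final class RouterDemo
{
    public static void main(String[] args)
    {
        Router router = new PinnedUserRouter(new DefaultRouter(), "etl", "trino-2");
        List<String> backends = List.of("trino-1", "trino-2");
        System.out.println(router.selectBackend(backends, "etl"));   // prints Optional[trino-2]
        System.out.println(router.selectBackend(backends, "alice")); // prints Optional[trino-1]
    }
}
```

With this shape, a change to the default implementation cannot silently alter which method a specialized router overrides, because the specialized router implements the interface directly and the compiler enforces every method.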

On a related note, could you detail what new runtime tests are being added to catch these kinds of integration failures?

maswin commented Aug 16, 2025

> I'm concerned that having a default implementation in the BaseRoutingManager base class makes our routing logic brittle. We saw this when the interface change broke the query-count-based router.
>
> A better approach would be to move the default logic into its own class, say DefaultRoutingManager. Then, concrete classes could use composition (a has-a relationship) to include this default behavior instead of inheriting it directly. This will decouple our concrete routers from the base class implementation, preventing similar breaks in the future.
>
> On a related note, could you detail what new runtime tests are being added to catch these kinds of integration failures?

Composition might make things very complicated. An interface with an abstract base implementation is a common pattern, which should be OK. For instance, in Trino there is a TrinoCatalog interface with an AbstractTrinoCatalog implementation that has the common methods implemented.

The primary problem I see is that the interface is bloated and can be made leaner. There should be only 4 methods:

void updateBackEndHealth(String backendId, TrinoStatus value); // When user marks a backend unhealthy
void updateClusterStats(List<ClusterStats> stats); // Update based on JMX metrics
ProxyBackendConfiguration getBackendConfiguration(String routingGroup, String user); // Get for the first time and if not in local cache
ProxyBackendConfiguration setBackendConfiguration(String routingGroup, String user); // Set if not in local cache

Maintaining 3 separate caches and setting each one separately makes no sense, as they all point to the same data. One cache holding all the data together should be enough, with just one method to get and one to set the backend configuration.

This should make the interface less confusing to override and implement.

If this new lean interface sounds good I can make the changes.
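
The single-cache idea above could be sketched as follows. Everything here is an assumption made for illustration: QueryRouting and QueryRoutingStore are hypothetical names, the record fields mirror what the three existing caches hold (backend, routing group, external URL), and a plain ConcurrentHashMap stands in for a size- and time-bounded cache. Records need Java 16 or later.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical value type bundling what the three per-query caches hold today.
record QueryRouting(String backend, String routingGroup, String externalUrl) {}

// Hypothetical single store keyed by query id; a plain map stands in for a bounded cache.
final class QueryRoutingStore
{
    private final Map<String, QueryRouting> byQueryId = new ConcurrentHashMap<>();

    // One write keeps backend, routing group, and external URL consistent with each other.
    void set(String queryId, QueryRouting routing)
    {
        byQueryId.put(queryId, routing);
    }

    Optional<QueryRouting> get(String queryId)
    {
        return Optional.ofNullable(byQueryId.get(queryId));
    }
}

final class QueryRoutingStoreDemo
{
    public static void main(String[] args)
    {
        QueryRoutingStore store = new QueryRoutingStore();
        // The query id, group name, and URL are made-up sample values.
        store.set("q1", new QueryRouting("trino-1", "adhoc", "https://trino-1.example.com"));

        System.out.println(store.get("q1").map(QueryRouting::backend).orElse("unknown")); // prints trino-1
        System.out.println(store.get("q2").isPresent());                                  // prints false
    }
}
```

Because all three values are written in one call, a reader can never observe a backend from one update paired with a routing group from another.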
