WhiteListRoundRobinPolicy - handle nodes with same ip #1253

rukai · 2025-08-20T05:43:23Z

Motivation

When using cqlsh on a Cassandra cluster where every node has the same IP but a different port, running SELECT * FROM system.peers_v2; will result in one of the peers being swapped for one of the local nodes.
I tracked this issue down to the python-driver's WhiteListRoundRobinPolicy, which is used by cqlsh.

Reproduction

I can reproduce the issue with this docker-compose.yaml, which creates a cluster of nodes accessible on 127.0.0.1 on ports 9042, 9043 and 9044:

services:
  cassandra-one:
    image: &image shotover/cassandra-test:5.0-rc1-r3
    environment: &environment
      CASSANDRA_SEEDS: "cassandra-one,cassandra-two,cassandra-three"
      CASSANDRA_CLUSTER_NAME: TestCluster
      CASSANDRA_DC: datacenter1
      CASSANDRA_RACK: rack1
      CASSANDRA_ENDPOINT_SNITCH: GossipingPropertyFileSnitch
      CASSANDRA_INITIAL_TOKENS: -1838347210670429836,-2934389110905368125,-4713023411728955254,-5691168864245069329,-7310192159942112627,-747050099978217576,-8900196712456011265,1537594777415527418,2609095393755560231,3626946798497987246,4444618731110338041,5520374612335917580,6256290305046811221,7335663112494412879,8579183118175004851,97326547512944180
      CASSANDRA_NATIVE_TRANSPORT_PORT: 9042
      CASSANDRA_BROADCAST_RPC_ADDRESS: "127.0.0.1"
      MAX_HEAP_SIZE: "400M"
      MIN_HEAP_SIZE: "400M"
      HEAP_NEWSIZE: "48M"
    ports:
      - "9042:9042"
    volumes:
      &volumes
      - type: tmpfs
        target: /var/lib/cassandra
  cassandra-two:
    image: *image
    ports:
      - "9043:9043"
    environment:
      <<: *environment
      CASSANDRA_NATIVE_TRANSPORT_PORT: 9043
      CASSANDRA_INITIAL_TOKENS: -2006460884048279486,-3596465436562178124,-387437588351236189,-4563829679640713622,-5807349685321305596,-6886722492768907253,-7622638185479800894,-8698394066705380434,2369342164988465014,3465384065223403303,4556681175915615562,5401057823406777320,590707864164877886,6841326053309360558,7912826669649393371,8930678074391820386
    volumes: *volumes
  cassandra-three:
    image: *image
    ports:
      - "9044:9044"
    environment:
      <<: *environment
      CASSANDRA_NATIVE_TRANSPORT_PORT: 9044
      CASSANDRA_INITIAL_TOKENS: -2141366384311565814,-3731370936825464452,-4698735179903999950,-522343088614522517,-5942255185584591924,-7021627993032193581,-7757543685743087222,-8833299566968666762,2234436664725178686,3330478564960116975,4421775675652329234,455802363901591558,5266152323143490992,6706420553046074230,7777921169386107043,8795772574128534058
    volumes: *volumes

And this python-driver sample:

from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy
from cassandra.query import dict_factory

def main():
    hostname = "127.0.0.1"
    port = 9042
    # The whitelist is given the address only; all three nodes share this address.
    conn = Cluster(contact_points=[hostname], port=port, cql_version=None,
                   auth_provider=None,
                   ssl_options=None,
                   load_balancing_policy=WhiteListRoundRobinPolicy([hostname]))

    session = conn.connect()
    session.row_factory = dict_factory
    # Run the same query repeatedly so the policy cycles through its hosts.
    for i in range(10):
        print("Attempt #" + str(i))
        for row in session.execute("select * from system.peers_v2"):
            print("peer:", row["native_address"] + ":" + str(row["native_port"]))
        print("")

    conn.shutdown()

if __name__ == "__main__":
    main()

You will observe that each attempt prints a different list of peers, which should not be possible.

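The problem is that WhiteListRoundRobinPolicy checks only the hostname/address, so if all nodes share the same address but use different ports, they all pass the whitelist. The policy then round-robins queries across every node, and each node reports its own view of its peers.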
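
The difference between an address-only check and an address-plus-port check can be illustrated with a small standalone sketch. It does not use the driver at all: the Endpoint tuple and both filter functions are hypothetical names chosen for illustration, and they only mimic the behavior described above.

```python
from collections import namedtuple

# Hypothetical stand-in for a node's contact information; not a driver class.
Endpoint = namedtuple("Endpoint", ["address", "port"])

# The three nodes from the docker-compose setup above.
NODES = [
    Endpoint("127.0.0.1", 9042),
    Endpoint("127.0.0.1", 9043),
    Endpoint("127.0.0.1", 9044),
]

def filter_by_address(nodes, allowed_addresses):
    """Address-only whitelist: the port is ignored entirely."""
    allowed = set(allowed_addresses)
    return [n for n in nodes if n.address in allowed]

def filter_by_address_and_port(nodes, allowed_endpoints):
    """Whitelist keyed on (address, port) pairs."""
    allowed = set(allowed_endpoints)
    return [n for n in nodes if (n.address, n.port) in allowed]

address_only = filter_by_address(NODES, ["127.0.0.1"])
address_and_port = filter_by_address_and_port(NODES, [("127.0.0.1", 9042)])

print(len(address_only))      # every node passes the address-only check
print(len(address_and_port))  # only the node on port 9042 passes
```

With the address-only filter, a round-robin over the result would visit all three nodes, which matches the changing peer lists seen in the reproduction.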

The fix

The first commit moves the endpoints into their own file. This was needed to avoid a cyclic dependency issue.

The second commit then alters the WhiteListRoundRobinPolicy constructor so that the user can optionally specify a port along with the hostname, which makes the whitelist more specific.
The change is completely backwards compatible with the existing interface, although the implementation had to become a little more complicated to support the new functionality in a backwards compatible way.
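
As a rough idea of what a backwards compatible interface of this kind could look like, here is a minimal sketch. It is not the code from this PR: the normalize_hosts and is_allowed helpers, and the choice of (host, port) tuples as the way to pass a port, are assumptions made purely for illustration. Plain strings keep the old meaning (any port matches), while tuples narrow the match to a single port.

```python
def normalize_hosts(hosts):
    """Convert a mixed list of "host" strings and (host, port) tuples into
    (host, port_or_None) pairs. A port of None means any port is accepted.

    Hypothetical helper for illustration; not part of the driver.
    """
    normalized = []
    for entry in hosts:
        if isinstance(entry, str):
            # Legacy form: hostname only, so no port restriction.
            normalized.append((entry, None))
        else:
            host, port = entry
            normalized.append((host, int(port)))
    return normalized

def is_allowed(address, port, allowed):
    """Return True if (address, port) matches any whitelist entry."""
    for allowed_host, allowed_port in allowed:
        if address == allowed_host and allowed_port in (None, port):
            return True
    return False

legacy = normalize_hosts(["127.0.0.1"])             # old-style whitelist
specific = normalize_hosts([("127.0.0.1", 9042)])   # whitelist with a port

print(is_allowed("127.0.0.1", 9043, legacy))    # old behavior: port ignored
print(is_allowed("127.0.0.1", 9043, specific))  # port-specific: rejected
```

Accepting both forms in one list means existing callers that pass only hostnames are unaffected, at the cost of a type check per entry.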

I'm open to any alternatives to the constructor interface I used here.

I didn't really understand the purpose of the self._allowed_hosts = tuple(hosts) line. It seems to be used for some sort of serialization, so I left it as is. Let me know if it requires any changes.

Process

It seems like a Jira ticket isn't strictly needed for small changes, but let me know if you need a ticket for this PR to be merged. If so, let me know how I should go about getting access; I tried and couldn't figure it out.

rukai added 2 commits August 20, 2025 14:41

Split endpoints into their own file to avoid cyclic imports in the next commit

476deae

WhiteListRoundRobinPolicy - support nodes with the same IP but different ports

d2d125c

rukai changed the title ~~Whitelist round robin policy handle nodes with same ip~~ WhiteListRoundRobinPolicy - handle nodes with same ip Aug 21, 2025
