@rukai rukai commented Aug 20, 2025

Motivation

When using cqlsh on a Cassandra cluster where every node has the same IP address but a different port, running SELECT * FROM system.peers_v2; will result in one of the peers being swapped for one of the local nodes.
I tracked this issue down to the python-driver's WhiteListRoundRobinPolicy, which is used by cqlsh.

Reproduction

I can reproduce the issue with this docker-compose.yaml, which creates a cluster of nodes accessible on 127.0.0.1 on ports 9042, 9043 and 9044:

services:
  cassandra-one:
    image: &image shotover/cassandra-test:5.0-rc1-r3
    environment: &environment
      CASSANDRA_SEEDS: "cassandra-one,cassandra-two,cassandra-three"
      CASSANDRA_CLUSTER_NAME: TestCluster
      CASSANDRA_DC: datacenter1
      CASSANDRA_RACK: rack1
      CASSANDRA_ENDPOINT_SNITCH: GossipingPropertyFileSnitch
      CASSANDRA_INITIAL_TOKENS: -1838347210670429836,-2934389110905368125,-4713023411728955254,-5691168864245069329,-7310192159942112627,-747050099978217576,-8900196712456011265,1537594777415527418,2609095393755560231,3626946798497987246,4444618731110338041,5520374612335917580,6256290305046811221,7335663112494412879,8579183118175004851,97326547512944180
      CASSANDRA_NATIVE_TRANSPORT_PORT: 9042
      CASSANDRA_BROADCAST_RPC_ADDRESS: "127.0.0.1"
      MAX_HEAP_SIZE: "400M"
      MIN_HEAP_SIZE: "400M"
      HEAP_NEWSIZE: "48M"
    ports:
      - "9042:9042"
    volumes:
      &volumes
      - type: tmpfs
        target: /var/lib/cassandra
  cassandra-two:
    image: *image
    ports:
      - "9043:9043"
    environment:
      <<: *environment
      CASSANDRA_NATIVE_TRANSPORT_PORT: 9043
      CASSANDRA_INITIAL_TOKENS: -2006460884048279486,-3596465436562178124,-387437588351236189,-4563829679640713622,-5807349685321305596,-6886722492768907253,-7622638185479800894,-8698394066705380434,2369342164988465014,3465384065223403303,4556681175915615562,5401057823406777320,590707864164877886,6841326053309360558,7912826669649393371,8930678074391820386
    volumes: *volumes
  cassandra-three:
    image: *image
    ports:
      - "9044:9044"
    environment:
      <<: *environment
      CASSANDRA_NATIVE_TRANSPORT_PORT: 9044
      CASSANDRA_INITIAL_TOKENS: -2141366384311565814,-3731370936825464452,-4698735179903999950,-522343088614522517,-5942255185584591924,-7021627993032193581,-7757543685743087222,-8833299566968666762,2234436664725178686,3330478564960116975,4421775675652329234,455802363901591558,5266152323143490992,6706420553046074230,7777921169386107043,8795772574128534058
    volumes: *volumes

And this python-driver sample:

from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy
from cassandra.query import dict_factory


def main():
    hostname = "127.0.0.1"
    port = 9042
    # Whitelist only the address, the same way cqlsh sets up its connection.
    cluster = Cluster(
        contact_points=[hostname],
        port=port,
        cql_version=None,
        auth_provider=None,
        ssl_options=None,
        load_balancing_policy=WhiteListRoundRobinPolicy([hostname]),
    )

    session = cluster.connect()
    session.row_factory = dict_factory
    # Run the same query repeatedly; the peer list should be identical every time.
    for i in range(10):
        print("Attempt #" + str(i))
        for row in session.execute("select * from system.peers_v2"):
            print("peer:", str(row["native_address"]) + ":" + str(row["native_port"]))
        print("")
    cluster.shutdown()


main()

You will observe that the list of peers changes between attempts, which should not be possible when every query is meant to go to the single whitelisted node.

The problem is that the WhiteListRoundRobinPolicy checks only the hostname/address, so if all nodes have the same address but different ports, they all pass the whitelist and queries are round-robined across all of them.
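To illustrate the mismatch, here is a minimal standalone sketch. It does not use the driver's code: the Endpoint type and both filter functions are hypothetical names made up for this example. It only shows why matching on address alone cannot tell the three nodes apart, while matching on (address, port) can.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    # Hypothetical stand-in for a node's native address and port.
    address: str
    port: int


# The three nodes from the docker-compose file above.
nodes = [
    Endpoint("127.0.0.1", 9042),
    Endpoint("127.0.0.1", 9043),
    Endpoint("127.0.0.1", 9044),
]


def allowed_by_address(whitelist, candidates):
    """Keep every node whose address is whitelisted, ignoring the port."""
    allowed = set(whitelist)
    return [n for n in candidates if n.address in allowed]


def allowed_by_address_and_port(whitelist, candidates):
    """Keep only nodes whose (address, port) pair is whitelisted."""
    allowed = set(whitelist)
    return [n for n in candidates if (n.address, n.port) in allowed]


address_only = allowed_by_address(["127.0.0.1"], nodes)
with_port = allowed_by_address_and_port([("127.0.0.1", 9042)], nodes)

print(len(address_only), "nodes pass the address-only check")
print(len(with_port), "node passes the address-and-port check")
```

With the address-only check, all three nodes are eligible for round robin, which matches the behaviour seen in the reproduction.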

The fix

The first commit moves the endpoints into their own file. This was needed to avoid a cyclic dependency issue.

The second commit alters the WhiteListRoundRobinPolicy constructor so that the user can optionally specify a port along with the hostname, which makes the whitelist more specific.
The change is completely backwards compatible with the existing interface, although I had to make the implementation a little complicated to support the new functionality without breaking existing callers.
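As a rough idea of what a backwards-compatible whitelist of this kind could look like, here is a standalone sketch. It is not the code from this PR: normalize_whitelist and is_allowed are hypothetical helpers, the sketch assumes entries are either a plain string or an (address, port) pair, and it skips the hostname resolution a real policy would need.

```python
def normalize_whitelist(hosts):
    """Split whitelist entries into address-only and (address, port) sets.

    A plain string keeps the old meaning (any port on that address);
    an (address, port) pair restricts the match to that exact port.
    """
    any_port = set()
    exact = set()
    for entry in hosts:
        if isinstance(entry, str):
            any_port.add(entry)
        else:
            address, port = entry
            exact.add((address, int(port)))
    return any_port, exact


def is_allowed(address, port, any_port, exact):
    """Return True if the node matches either kind of whitelist entry."""
    return address in any_port or (address, port) in exact


# Old-style call: address only, so every port on 127.0.0.1 is allowed.
old_any, old_exact = normalize_whitelist(["127.0.0.1"])
# New-style call: only the node listening on port 9042 is allowed.
new_any, new_exact = normalize_whitelist([("127.0.0.1", 9042)])

print(is_allowed("127.0.0.1", 9043, old_any, old_exact))
print(is_allowed("127.0.0.1", 9043, new_any, new_exact))
```

Accepting both shapes in the same list keeps existing callers working unchanged, at the cost of a type check per entry.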

I'm open to any alternatives to the constructor interface I used here.

I didn't really understand the purpose of the self._allowed_hosts = tuple(hosts) line. It seems to be used for some sort of serialization, so I left it as is. Let me know if it requires any changes.

Process

It seems like a Jira ticket isn't strictly needed for small changes, but let me know if you need a ticket for this PR to be merged. If so, let me know how I should go about getting access; I tried and couldn't figure it out.

@rukai rukai changed the title from "Whitelist round robin policy handle nodes with same ip" to "WhiteListRoundRobinPolicy - handle nodes with same ip" Aug 21, 2025