Copilot AI commented Sep 18, 2025

The autoscaler failed with a KeyError when the PBS commands qmgr list sched and qmgr list server returned hostnames in different formats. The scheduler host came back as a fully qualified domain name (e.g., headnode.internal.cloudapp.net), while the server host dictionary was keyed by short hostnames (e.g., headnode).

The error manifested as:

KeyError: 'headnode.internal.cloudapp.net'
  File "pbspro/scheduler.py", line 84, in read_schedulers
    server_dict = server_dicts_by_host[hostname]

Root Cause:
The read_schedulers() function used the raw hostname values from the PBS output both to create and to look up entries in the server dictionary. When PBS returned mixed hostname formats (FQDN for schedulers, short names for servers), the lookup failed.
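The mismatch can be reproduced in isolation. The snippet below is a minimal sketch with hypothetical values modeled on the error above; it does not call PBS or the autoscaler, and the dictionary contents are assumptions for illustration only.

```python
# Hypothetical data modeled on the error in this PR; not real PBS output.
# The server dictionary is keyed by the short hostname...
server_dicts_by_host = {"headnode": {"server_host": "headnode"}}
# ...while the scheduler reports its host as an FQDN.
scheduler_host = "headnode.internal.cloudapp.net"

try:
    server_dict = server_dicts_by_host[scheduler_host]
    missing_key = None
except KeyError as e:
    # The FQDN is not a key in the dictionary, so the lookup fails.
    missing_key = e.args[0]

print(missing_key)
```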

Solution:
Modified the hostname handling in read_schedulers() to consistently use short hostnames for both dictionary creation and lookup:

  1. Server dictionary creation - Extract short hostname when creating the server lookup dictionary:

    server_dicts_by_host = partition_single(server_dicts, lambda s: s["server_host"].split(".")[0])

  2. Scheduler lookup - Extract short hostname from scheduler host before lookup:

    short_hostname = hostname.split(".")[0]
    server_dict = server_dicts_by_host[short_hostname]

Both sides of the lookup now use the same hostname format regardless of whether PBS returns FQDNs or short hostnames. Deployments that already use short hostnames are unaffected, because splitting a short hostname on "." returns it unchanged.
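The two steps above can be sketched as a self-contained example. Note that partition_single_sketch is a hypothetical stand-in written for this illustration, not the repository's partition_single helper, and its duplicate-key behavior is an assumption; the short_hostname function and the sample data are likewise illustrative.

```python
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def short_hostname(host: str) -> str:
    """Return the part of a hostname before the first dot."""
    return host.split(".")[0]


def partition_single_sketch(items: Iterable[T], key_func: Callable[[T], str]) -> Dict[str, T]:
    """Hypothetical stand-in for partition_single: map each key to exactly one item."""
    result: Dict[str, T] = {}
    for item in items:
        key = key_func(item)
        if key in result:
            # Assumption: duplicate keys are treated as an error in this sketch.
            raise ValueError(f"duplicate key: {key}")
        result[key] = item
    return result


# Hypothetical server entry reported with a short hostname.
server_dicts = [{"server_host": "headnode"}]

# Step 1: key the dictionary by short hostname.
by_host = partition_single_sketch(server_dicts, lambda s: short_hostname(s["server_host"]))

# Step 2: normalize the scheduler host the same way before the lookup.
found = by_host[short_hostname("headnode.internal.cloudapp.net")]
```

One consequence of keying by short hostname is that two servers sharing a short name in different domains would map to the same key; in this sketch that case raises a ValueError rather than silently overwriting an entry.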

Testing:
Added unit tests covering:

  • Original failing scenario (FQDN scheduler vs short server hostnames)
  • Backward compatibility (both hostnames already short)
  • Mixed hostname formats across multiple schedulers
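As an illustration of the three scenarios, the tests might look roughly like the sketch below. These are not the tests added in this PR; find_server is a hypothetical helper that mirrors the normalization described above, and all hostnames are made up.

```python
def short_hostname(host):
    """Return the part of a hostname before the first dot."""
    return host.split(".")[0]


def find_server(server_dicts, scheduler_host):
    """Hypothetical helper: look up a server entry by normalized scheduler host."""
    by_host = {short_hostname(s["server_host"]): s for s in server_dicts}
    return by_host[short_hostname(scheduler_host)]


def test_fqdn_scheduler_short_server():
    # Original failing scenario: FQDN scheduler host, short server host.
    servers = [{"server_host": "headnode"}]
    result = find_server(servers, "headnode.internal.cloudapp.net")
    assert result["server_host"] == "headnode"


def test_both_short():
    # Backward compatibility: both hostnames are already short.
    servers = [{"server_host": "headnode"}]
    assert find_server(servers, "headnode")["server_host"] == "headnode"


def test_mixed_formats_multiple_schedulers():
    # Mixed formats across several hosts, in both directions.
    servers = [
        {"server_host": "head1"},
        {"server_host": "head2.internal.cloudapp.net"},
    ]
    assert find_server(servers, "head1.internal.cloudapp.net") is servers[0]
    assert find_server(servers, "head2") is servers[1]
```

Written as plain functions with bare assertions, these can be collected by pytest or called directly.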

Fixes #85.



Copilot AI changed the title from "[WIP] Autoscaler is failing with KeyError: 'headnode.internal.cloudapp.net'" to "Fix KeyError when PBS commands return FQDN hostnames in autoscaler" Sep 18, 2025
Copilot AI requested a review from xpillons September 18, 2025 09:33
@xpillons xpillons reopened this Sep 18, 2025
