platforms: platform and host selection methods and intelligent fallback #3827
@wxtim and I had a talk through the "intelligent fallback" part of this issue, which has raised some questions... The basic premise of this logic is that if a host goes down, rather than just failing, Cylc can retry the operation on another host. So how do we know if a host is down? There are two options:

However, not all issues are comms based. For example, what if the platform is not accepting new jobs, say because the queue is full or closed? This seems like the sort of thing Cylc should be able to handle gracefully. In this case there is no point trying another host within the platform; however, it may be worth trying another platform within the group. Should we:

Anything I've missed out, @wxtim?
Hmm, complicated 😬 I have too many questions about this - might be a good one to discuss at the next meeting?
I think that is a fair summary of what we discussed. I have just spent some time looking at the code that was bothering me; it's still not completely clear how job submission might work, but I think ultimately it still comes down to a question of how we can tell a submit failure we want to retry with different platform settings from one we should allow to stand.
I vote for "Only handle SSH failure". |
OK, I vote (2, 3): don't perform a trial connection, detect 255 error codes.
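A minimal sketch of what "detect 255 error codes" could look like in practice, assuming a plain `subprocess` call; `run_on_host` and the choice of exception are illustrative, not existing Cylc code. The key point is that `ssh` itself exits with status 255 when it cannot reach the host, which separates comms failures from failures of the remote command:

```python
import subprocess

def run_on_host(host, cmd):
    """Run ``cmd`` on ``host`` over SSH, raising on a comms (255) failure."""
    proc = subprocess.run(
        ['ssh', host] + cmd,
        capture_output=True,
        text=True,
    )
    if proc.returncode == 255:
        # ssh could not reach the host - a candidate for host/platform fallback
        raise ConnectionError(f'ssh to {host} failed: {proc.stderr.strip()}')
    # any other return code came from the remote command itself
    return proc.returncode, proc.stdout
```

(One caveat of this approach: a remote command that itself exits with 255 would be indistinguishable from an SSH failure.)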
I've got a proof of concept branch where I've hacked the ... I had a discussion with @oliver-sanders (feel free to edit this post to make it more accurately reflect our discussion) yesterday, where I talked about the fact that centralising the logic looks a little tricky because you need to store state information, but also to housekeep it. This discussion generated multiple approaches:
We have to preserve the state of the selection process somewhere, either globally accessible (i.e. via a unique id) or locally scoped via some other mechanism. The state is effectively composed of a list of platforms which have been tried and a list of hosts within each platform which have been tried. The easiest way to preserve the state is probably just to use a generator (as generators hold their state within their scope until destroyed). Take for example this reference implementation:

```python
def select_generator(platform_group):
    platforms = platform_group['platforms']
    while platforms:
        platform = select(platforms, platform_group['method'])
        hosts = platform['hosts']
        while hosts:
            host = select(hosts, platform['method'])
            yield host
            hosts.remove(host)
        platforms.remove(platform)
```

The question is how to hook that up to the call/callback framework into which it must fit. Simplified version:

```python
def controller():
    proc_pool.call(
        call_remote_init,
        callback_remote_init
    )

def call_remote_init(*args):
    pass

def callback_remote_init(*args):
    pass
```

Using global state storage it would look something like:

```python
def call_remote_init(id, *args):
    # note the store is some session/globally scoped object
    if id:
        # retrieve the state from the store
        gen = store[id]
    else:
        id = uuid()  # selection id
        # put the state into the store
        store[id] = select_generator(platform_group)
        gen = store[id]
    try:
        host = next(gen)
    except StopIteration:
        # no available hosts - remote-init failure
        pass
    # ...

def callback_remote_init(id, *args):
    if returned_a_255_error_code:
        call_remote_init(id, *args)
        return
    else:
        del store[id]
    # ...
```

This would do the job, however, the pattern would have to be reproduced for each call/callback pattern (remote_init, job_submission) and kinda feels messy. The main drawback is that this state store must be housekept (the `del store[id]` in the callback) to avoid leaking entries.

If this code were all async we would not need the state store:

```python
async def remote_init(*args):
    for host in select_generator(platform_group):
        result = await proc_pool.call(...)
        if returned_a_255_error_code:
            continue
        else:
            break
    else:
        # no hosts - remote-init failure
        pass
```

Nice and clean, but re-writing the subprocess pool is a bit much right now. So how do we bridge a call/callback pattern to an async pattern?

```python
async def remote_init(*args):
    for host in select_generator(platform_group):
        # pass a future object into the call/callback
        future = asyncio.Future()
        call_remote_init(future, *args)
        await future
        if returned_a_255_error_code:
            continue
        else:
            break
    else:
        # no hosts - remote-init failure
        pass

def call_remote_init(future, *args):
    pass

def callback_remote_init(future, *args):
    # in the callback mark the future as done, this returns control to remote_init
    future.set_result(None)
```

I'm not sure how to approach this; pros and cons. The async approach might be nice, however, the async code doesn't currently reach down very far from the Scheduler so it would involve adding a lot of `async` plumbing.
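For what it's worth, here is a self-contained, runnable sketch of the future-bridging idea (not Cylc code; all names are invented). A plain call/callback API is driven from a coroutine by handing it an `asyncio` future and awaiting it; the callback's `set_result` hands control back to the coroutine:

```python
import asyncio

def call_echo(loop, future, host):
    """Stand-in for a call/callback style API such as a subprocess pool."""
    # Simulate the callback firing later with an exit code:
    # 255 for the "bad" host (SSH failure), 0 otherwise.
    ret_code = 255 if host == 'bad_host' else 0
    loop.call_later(0.1, callback_echo, future, ret_code)

def callback_echo(future, ret_code):
    # Marking the future done returns control to the awaiting coroutine.
    future.set_result(ret_code)

async def remote_init(hosts):
    loop = asyncio.get_running_loop()
    for host in hosts:
        future = loop.create_future()
        call_echo(loop, future, host)
        ret_code = await future
        if ret_code == 255:
            continue  # comms failure - try the next host
        return host
    raise Exception('no hosts available')

print(asyncio.run(remote_init(['bad_host', 'good_host'])))  # -> good_host
```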
On 22/04 @oliver-sanders said:
But after a discussion between @dpmatthews and me this morning, the following is proposed:
Scenarios Considered.
Can you see any issues, @oliver-sanders?
Additional question: If ...
Failed remote init or file install should log the full error, I should think, so the user can see what's gone wrong and either fix it or alert system admins. I don't think we could handle this kind of error automatically?
Follow on from #3686
Selection Methods
This should involve the creation of interfaces (preferably implemented as Python entry-points to permit in-house extension) for:
We will need to allow these to be configured separately, e.g.:
The following selection methods would be desirable but only the interface(s) need(s) to be implemented to close this issue:
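Regarding the entry-point suggestion above, here is a hedged sketch of how a selection method interface might be registered and discovered; the group name `cylc.host_select`, the function names, and the packaging snippet in the comments are all invented for illustration, not an existing Cylc interface:

```python
# A site package could register its own methods in its setup.cfg, e.g.:
#
#   [options.entry_points]
#   cylc.host_select =
#       random = my_package.selectors:random_select
#
from random import choice
from importlib.metadata import entry_points  # Python 3.8+

def random_select(hosts):
    """The simplest possible selection method: pick a host at random."""
    return choice(hosts)

def get_selector(method):
    """Load the selection method registered under the (hypothetical) group."""
    # entry_points(group=...) requires Python 3.10+; on 3.8/3.9 use
    # entry_points().get('cylc.host_select', []) instead.
    for entry_point in entry_points(group='cylc.host_select'):
        if entry_point.name == method:
            return entry_point.load()
    raise ValueError(f'no such selection method: {method}')
```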
Intelligent Fallback
In the event that a host is not available (e.g. for job submission) Cylc will need to pick an alternative host from the specified platform or platform group.
Here is a purely illustrative example to explain what is meant by this:
This functionality will be required in a lot of different places (e.g. remote-init, job submission, job polling) so it would make sense to centralise it.
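To illustrate what centralising this could mean, here is a hedged sketch of a single helper through which any host-facing operation (remote-init, job submission, polling) could be routed; the data structures and names echo the reference implementation in the discussion above and are illustrative only:

```python
def with_fallback(operation, platform_group):
    """Run ``operation(host)`` on the first host in the group that is reachable.

    ``operation`` is any callable taking a host name; a ``ConnectionError``
    (e.g. an ssh 255 exit status) triggers fallback to the next host and,
    once a platform is exhausted, to the next platform in the group.
    """
    for platform in list(platform_group['platforms']):
        for host in list(platform['hosts']):
            try:
                return operation(host)
            except ConnectionError:
                continue  # comms failure - try the next host
        # all hosts in this platform failed - fall back to the next platform
    raise Exception('no hosts available in this platform group')
```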
Reloading The Global Config
Issue #3762 will see the global config reloaded at a set interval for the lifetime of the scheduler. Any selection logic should be robust to this: the list of platforms in a group and the hosts in a platform are volatile and may change as sys admins move workload around a system.
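One way to stay robust to such reloads, sketched under the assumption of a hypothetical `get_platform_group` accessor that always returns the current (possibly just-reloaded) configuration: re-read the definition on every attempt and only remember which hosts have already been tried, rather than caching the host list when selection starts:

```python
def select_host(group_name, tried=None):
    """Return the next untried host, re-reading the config on every call."""
    tried = set() if tried is None else tried
    # hypothetical accessor - returns the current global config definition,
    # which may have changed since the previous attempt
    platform_group = get_platform_group(group_name)
    for platform in platform_group['platforms']:
        for host in platform['hosts']:
            if host not in tried:
                tried.add(host)
                return host
    raise Exception(f'no untried hosts left in group {group_name}')
```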