[FLINK-36015] [runtime] Align rescale parameters
ztison authored and XComp committed Sep 19, 2024
1 parent ffd0522 commit 6ab3c3d
Showing 15 changed files with 284 additions and 172 deletions.
34 changes: 17 additions & 17 deletions docs/layouts/shortcodes/generated/all_jobmanager_section.html
@@ -8,42 +8,42 @@
 </tr>
 </thead>
 <tbody>
+<tr>
+<td><h5>jobmanager.adaptive-scheduler.executing.cooldown-after-rescaling</h5></td>
+<td style="word-wrap: break-word;">30 s</td>
+<td>Duration</td>
+<td>Determines the minimum time between scaling operations.</td>
+</tr>
 <tr>
 <td><h5>jobmanager.adaptive-scheduler.executing.resource-stabilization-timeout</h5></td>
 <td style="word-wrap: break-word;">1 min</td>
 <td>Duration</td>
 <td>Defines the duration the JobManager delays the scaling operation after a resource change if only sufficient resources are available. The scaling operation is performed immediately if the resources have changed and the desired resources are available. The timeout begins as soon as either the available resources or the job's resource requirements are changed.<br />The resource requirements of a running job can be changed using the <a href="{{.Site.BaseURL}}{{.Site.LanguagePrefix}}/docs/ops/rest_api/#jobs-jobid-resource-requirements-1">REST API endpoint</a>.</td>
 </tr>
 <tr>
-<td><h5>jobmanager.adaptive-scheduler.max-delay-for-scale-trigger</h5></td>
+<td><h5>jobmanager.adaptive-scheduler.rescale-trigger.max-checkpoint-failures</h5></td>
+<td style="word-wrap: break-word;">2</td>
+<td>Integer</td>
+<td>The number of consecutive failed checkpoints that will trigger rescaling even in the absence of a completed checkpoint.</td>
+</tr>
+<tr>
+<td><h5>jobmanager.adaptive-scheduler.rescale-trigger.max-delay</h5></td>
 <td style="word-wrap: break-word;">(none)</td>
 <td>Duration</td>
-<td>The maximum time the JobManager will wait with evaluating previously observed events for rescaling (default: 0ms if checkpointing is disabled and the checkpointing interval multiplied by the by-1-incremented parameter value of jobmanager.adaptive-scheduler.scale-on-failed-checkpoints-count if checkpointing is enabled).</td>
+<td>The maximum time the JobManager will wait with evaluating previously observed events for rescaling (default: 0ms if checkpointing is disabled and the checkpointing interval multiplied by the by-1-incremented parameter value of jobmanager.adaptive-scheduler.rescale-trigger.max-checkpoint-failures if checkpointing is enabled).</td>
 </tr>
 <tr>
-<td><h5>jobmanager.adaptive-scheduler.resource-stabilization-timeout</h5></td>
+<td><h5>jobmanager.adaptive-scheduler.submission.resource-stabilization-timeout</h5></td>
 <td style="word-wrap: break-word;">10 s</td>
 <td>Duration</td>
-<td>The resource stabilization timeout defines the time the JobManager will wait if fewer than the desired but sufficient resources are available. The timeout starts once sufficient resources for running the job are available. Once this timeout has passed, the job will start executing with the available resources.<br />If <code class="highlighter-rouge">scheduler-mode</code> is configured to <code class="highlighter-rouge">REACTIVE</code>, this configuration value will default to 0, so that jobs are starting immediately with the available resources.</td>
+<td>The resource stabilization timeout defines the time the JobManager will wait if fewer than the desired but sufficient resources are available during job submission. The timeout starts once sufficient resources for running the job are available. Once this timeout has passed, the job will start executing with the available resources.<br />If <code class="highlighter-rouge">scheduler-mode</code> is configured to <code class="highlighter-rouge">REACTIVE</code>, this configuration value will default to 0, so that jobs are starting immediately with the available resources.</td>
 </tr>
 <tr>
-<td><h5>jobmanager.adaptive-scheduler.resource-wait-timeout</h5></td>
+<td><h5>jobmanager.adaptive-scheduler.submission.resource-wait-timeout</h5></td>
 <td style="word-wrap: break-word;">5 min</td>
 <td>Duration</td>
 <td>The maximum time the JobManager will wait to acquire all required resources after a job submission or restart. Once elapsed it will try to run the job with a lower parallelism, or fail if the minimum amount of resources could not be acquired.<br />Increasing this value will make the cluster more resilient against temporary resources shortages (e.g., there is more time for a failed TaskManager to be restarted).<br />Setting a negative duration will disable the resource timeout: The JobManager will wait indefinitely for resources to appear.<br />If <code class="highlighter-rouge">scheduler-mode</code> is configured to <code class="highlighter-rouge">REACTIVE</code>, this configuration value will default to a negative value to disable the resource timeout.</td>
 </tr>
-<tr>
-<td><h5>jobmanager.adaptive-scheduler.scale-on-failed-checkpoints-count</h5></td>
-<td style="word-wrap: break-word;">2</td>
-<td>Integer</td>
-<td>The number of consecutive failed checkpoints that will trigger rescaling even in the absence of a completed checkpoint.</td>
-</tr>
-<tr>
-<td><h5>jobmanager.adaptive-scheduler.scaling-interval.min</h5></td>
-<td style="word-wrap: break-word;">30 s</td>
-<td>Duration</td>
-<td>Determines the minimum time between scaling operations.</td>
-</tr>
 <tr>
 <td><h5>jobmanager.archive.fs.dir</h5></td>
 <td style="word-wrap: break-word;">(none)</td>
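The renamed rows above pair each old key with its replacement. A minimal Python sketch of that mapping, assuming a flat dict of string-valued settings, could rewrite an existing configuration. The mapping is read off this diff; the helper `migrate_keys` and its precedence rule (an explicitly set new key wins over the old one) are hypothetical, and whether Flink still accepts the old keys as deprecated fallbacks is not shown here.

```python
# Old key -> new key, as read off the renamed rows in this commit.
RENAMED_KEYS = {
    "jobmanager.adaptive-scheduler.scaling-interval.min":
        "jobmanager.adaptive-scheduler.executing.cooldown-after-rescaling",
    "jobmanager.adaptive-scheduler.max-delay-for-scale-trigger":
        "jobmanager.adaptive-scheduler.rescale-trigger.max-delay",
    "jobmanager.adaptive-scheduler.scale-on-failed-checkpoints-count":
        "jobmanager.adaptive-scheduler.rescale-trigger.max-checkpoint-failures",
    "jobmanager.adaptive-scheduler.resource-stabilization-timeout":
        "jobmanager.adaptive-scheduler.submission.resource-stabilization-timeout",
    "jobmanager.adaptive-scheduler.resource-wait-timeout":
        "jobmanager.adaptive-scheduler.submission.resource-wait-timeout",
}


def migrate_keys(conf: dict[str, str]) -> dict[str, str]:
    """Return a copy of conf with the old adaptive-scheduler keys renamed.

    Hypothetical policy: if both the old and the new key are set,
    the new key's value is kept and the old entry is dropped.
    """
    migrated: dict[str, str] = {}
    for key, value in conf.items():
        new_key = RENAMED_KEYS.get(key)
        if new_key is None:
            # Unrelated or already-new key: copy unchanged.
            migrated[key] = value
        elif new_key not in conf:
            # Old key only: carry its value over under the new name.
            migrated[new_key] = value
        # Otherwise the explicitly set new key takes precedence.
    return migrated
```

Keys that are not in the mapping pass through untouched, so the helper can be applied to a whole configuration rather than only to the adaptive-scheduler section.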
34 changes: 17 additions & 17 deletions docs/layouts/shortcodes/generated/expert_scheduling_section.html
@@ -86,42 +86,42 @@
 <td>MemorySize</td>
 <td>The size of the write buffer of JobEventStore. The content will be flushed to external file system once the buffer is full</td>
 </tr>
+<tr>
+<td><h5>jobmanager.adaptive-scheduler.executing.cooldown-after-rescaling</h5></td>
+<td style="word-wrap: break-word;">30 s</td>
+<td>Duration</td>
+<td>Determines the minimum time between scaling operations.</td>
+</tr>
 <tr>
 <td><h5>jobmanager.adaptive-scheduler.executing.resource-stabilization-timeout</h5></td>
 <td style="word-wrap: break-word;">1 min</td>
 <td>Duration</td>
 <td>Defines the duration the JobManager delays the scaling operation after a resource change if only sufficient resources are available. The scaling operation is performed immediately if the resources have changed and the desired resources are available. The timeout begins as soon as either the available resources or the job's resource requirements are changed.<br />The resource requirements of a running job can be changed using the <a href="{{.Site.BaseURL}}{{.Site.LanguagePrefix}}/docs/ops/rest_api/#jobs-jobid-resource-requirements-1">REST API endpoint</a>.</td>
 </tr>
 <tr>
-<td><h5>jobmanager.adaptive-scheduler.max-delay-for-scale-trigger</h5></td>
+<td><h5>jobmanager.adaptive-scheduler.rescale-trigger.max-checkpoint-failures</h5></td>
+<td style="word-wrap: break-word;">2</td>
+<td>Integer</td>
+<td>The number of consecutive failed checkpoints that will trigger rescaling even in the absence of a completed checkpoint.</td>
+</tr>
+<tr>
+<td><h5>jobmanager.adaptive-scheduler.rescale-trigger.max-delay</h5></td>
 <td style="word-wrap: break-word;">(none)</td>
 <td>Duration</td>
-<td>The maximum time the JobManager will wait with evaluating previously observed events for rescaling (default: 0ms if checkpointing is disabled and the checkpointing interval multiplied by the by-1-incremented parameter value of jobmanager.adaptive-scheduler.scale-on-failed-checkpoints-count if checkpointing is enabled).</td>
+<td>The maximum time the JobManager will wait with evaluating previously observed events for rescaling (default: 0ms if checkpointing is disabled and the checkpointing interval multiplied by the by-1-incremented parameter value of jobmanager.adaptive-scheduler.rescale-trigger.max-checkpoint-failures if checkpointing is enabled).</td>
 </tr>
 <tr>
-<td><h5>jobmanager.adaptive-scheduler.resource-stabilization-timeout</h5></td>
+<td><h5>jobmanager.adaptive-scheduler.submission.resource-stabilization-timeout</h5></td>
 <td style="word-wrap: break-word;">10 s</td>
 <td>Duration</td>
-<td>The resource stabilization timeout defines the time the JobManager will wait if fewer than the desired but sufficient resources are available. The timeout starts once sufficient resources for running the job are available. Once this timeout has passed, the job will start executing with the available resources.<br />If <code class="highlighter-rouge">scheduler-mode</code> is configured to <code class="highlighter-rouge">REACTIVE</code>, this configuration value will default to 0, so that jobs are starting immediately with the available resources.</td>
+<td>The resource stabilization timeout defines the time the JobManager will wait if fewer than the desired but sufficient resources are available during job submission. The timeout starts once sufficient resources for running the job are available. Once this timeout has passed, the job will start executing with the available resources.<br />If <code class="highlighter-rouge">scheduler-mode</code> is configured to <code class="highlighter-rouge">REACTIVE</code>, this configuration value will default to 0, so that jobs are starting immediately with the available resources.</td>
 </tr>
 <tr>
-<td><h5>jobmanager.adaptive-scheduler.resource-wait-timeout</h5></td>
+<td><h5>jobmanager.adaptive-scheduler.submission.resource-wait-timeout</h5></td>
 <td style="word-wrap: break-word;">5 min</td>
 <td>Duration</td>
 <td>The maximum time the JobManager will wait to acquire all required resources after a job submission or restart. Once elapsed it will try to run the job with a lower parallelism, or fail if the minimum amount of resources could not be acquired.<br />Increasing this value will make the cluster more resilient against temporary resources shortages (e.g., there is more time for a failed TaskManager to be restarted).<br />Setting a negative duration will disable the resource timeout: The JobManager will wait indefinitely for resources to appear.<br />If <code class="highlighter-rouge">scheduler-mode</code> is configured to <code class="highlighter-rouge">REACTIVE</code>, this configuration value will default to a negative value to disable the resource timeout.</td>
 </tr>
-<tr>
-<td><h5>jobmanager.adaptive-scheduler.scale-on-failed-checkpoints-count</h5></td>
-<td style="word-wrap: break-word;">2</td>
-<td>Integer</td>
-<td>The number of consecutive failed checkpoints that will trigger rescaling even in the absence of a completed checkpoint.</td>
-</tr>
-<tr>
-<td><h5>jobmanager.adaptive-scheduler.scaling-interval.min</h5></td>
-<td style="word-wrap: break-word;">30 s</td>
-<td>Duration</td>
-<td>Determines the minimum time between scaling operations.</td>
-</tr>
 <tr>
 <td><h5>jobmanager.partition.hybrid.partition-data-consume-constraint</h5></td>
 <td style="word-wrap: break-word;">(none)</td>
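The two `executing.*` options in the tables above describe when a running job may be rescaled: a minimum time between scaling operations, and a stabilization timeout that only applies when sufficient (but not desired) resources are available. The Python sketch below is a simplified model of that decision, not Flink's AdaptiveScheduler code. `ExecutingState` and `should_rescale` are hypothetical names; the sketch assumes it is evaluated after a resource or requirement change has already been observed, and it ignores the checkpoint-based `rescale-trigger.*` options entirely.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ExecutingState:
    """Hypothetical snapshot of what the JobManager observes while executing."""

    since_last_rescale: timedelta     # time since the previous scaling operation
    since_resource_change: timedelta  # time since resources or requirements changed
    desired_available: bool           # the desired resources are available
    sufficient_available: bool        # at least sufficient resources are available


def should_rescale(
    state: ExecutingState,
    # Defaults taken from the table: cooldown-after-rescaling = 30 s,
    # executing.resource-stabilization-timeout = 1 min.
    cooldown: timedelta = timedelta(seconds=30),
    stabilization_timeout: timedelta = timedelta(minutes=1),
) -> bool:
    """Simplified model: decide whether a rescale may happen now."""
    # Enforce the minimum time between scaling operations.
    if state.since_last_rescale < cooldown:
        return False
    # Desired resources are available: rescale immediately.
    if state.desired_available:
        return True
    # Only sufficient resources: wait until the stabilization timeout has elapsed.
    if state.sufficient_available:
        return state.since_resource_change >= stabilization_timeout
    # Not even sufficient resources: keep running as is.
    return False
```

For example, two minutes after the last rescale, with only sufficient resources and a resource change 30 s ago, `should_rescale` returns `False`; once that change is a full minute old, it returns `True`.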
34 changes: 17 additions & 17 deletions docs/layouts/shortcodes/generated/job_manager_configuration.html
@@ -8,42 +8,42 @@
 </tr>
 </thead>
 <tbody>
+<tr>
+<td><h5>jobmanager.adaptive-scheduler.executing.cooldown-after-rescaling</h5></td>
+<td style="word-wrap: break-word;">30 s</td>
+<td>Duration</td>
+<td>Determines the minimum time between scaling operations.</td>
+</tr>
 <tr>
 <td><h5>jobmanager.adaptive-scheduler.executing.resource-stabilization-timeout</h5></td>
 <td style="word-wrap: break-word;">1 min</td>
 <td>Duration</td>
 <td>Defines the duration the JobManager delays the scaling operation after a resource change if only sufficient resources are available. The scaling operation is performed immediately if the resources have changed and the desired resources are available. The timeout begins as soon as either the available resources or the job's resource requirements are changed.<br />The resource requirements of a running job can be changed using the <a href="{{.Site.BaseURL}}{{.Site.LanguagePrefix}}/docs/ops/rest_api/#jobs-jobid-resource-requirements-1">REST API endpoint</a>.</td>
 </tr>
 <tr>
-<td><h5>jobmanager.adaptive-scheduler.max-delay-for-scale-trigger</h5></td>
+<td><h5>jobmanager.adaptive-scheduler.rescale-trigger.max-checkpoint-failures</h5></td>
+<td style="word-wrap: break-word;">2</td>
+<td>Integer</td>
+<td>The number of consecutive failed checkpoints that will trigger rescaling even in the absence of a completed checkpoint.</td>
+</tr>
+<tr>
+<td><h5>jobmanager.adaptive-scheduler.rescale-trigger.max-delay</h5></td>
 <td style="word-wrap: break-word;">(none)</td>
 <td>Duration</td>
-<td>The maximum time the JobManager will wait with evaluating previously observed events for rescaling (default: 0ms if checkpointing is disabled and the checkpointing interval multiplied by the by-1-incremented parameter value of jobmanager.adaptive-scheduler.scale-on-failed-checkpoints-count if checkpointing is enabled).</td>
+<td>The maximum time the JobManager will wait with evaluating previously observed events for rescaling (default: 0ms if checkpointing is disabled and the checkpointing interval multiplied by the by-1-incremented parameter value of jobmanager.adaptive-scheduler.rescale-trigger.max-checkpoint-failures if checkpointing is enabled).</td>
 </tr>
 <tr>
-<td><h5>jobmanager.adaptive-scheduler.resource-stabilization-timeout</h5></td>
+<td><h5>jobmanager.adaptive-scheduler.submission.resource-stabilization-timeout</h5></td>
 <td style="word-wrap: break-word;">10 s</td>
 <td>Duration</td>
-<td>The resource stabilization timeout defines the time the JobManager will wait if fewer than the desired but sufficient resources are available. The timeout starts once sufficient resources for running the job are available. Once this timeout has passed, the job will start executing with the available resources.<br />If <code class="highlighter-rouge">scheduler-mode</code> is configured to <code class="highlighter-rouge">REACTIVE</code>, this configuration value will default to 0, so that jobs are starting immediately with the available resources.</td>
+<td>The resource stabilization timeout defines the time the JobManager will wait if fewer than the desired but sufficient resources are available during job submission. The timeout starts once sufficient resources for running the job are available. Once this timeout has passed, the job will start executing with the available resources.<br />If <code class="highlighter-rouge">scheduler-mode</code> is configured to <code class="highlighter-rouge">REACTIVE</code>, this configuration value will default to 0, so that jobs are starting immediately with the available resources.</td>
 </tr>
 <tr>
-<td><h5>jobmanager.adaptive-scheduler.resource-wait-timeout</h5></td>
+<td><h5>jobmanager.adaptive-scheduler.submission.resource-wait-timeout</h5></td>
 <td style="word-wrap: break-word;">5 min</td>
 <td>Duration</td>
 <td>The maximum time the JobManager will wait to acquire all required resources after a job submission or restart. Once elapsed it will try to run the job with a lower parallelism, or fail if the minimum amount of resources could not be acquired.<br />Increasing this value will make the cluster more resilient against temporary resources shortages (e.g., there is more time for a failed TaskManager to be restarted).<br />Setting a negative duration will disable the resource timeout: The JobManager will wait indefinitely for resources to appear.<br />If <code class="highlighter-rouge">scheduler-mode</code> is configured to <code class="highlighter-rouge">REACTIVE</code>, this configuration value will default to a negative value to disable the resource timeout.</td>
 </tr>
-<tr>
-<td><h5>jobmanager.adaptive-scheduler.scale-on-failed-checkpoints-count</h5></td>
-<td style="word-wrap: break-word;">2</td>
-<td>Integer</td>
-<td>The number of consecutive failed checkpoints that will trigger rescaling even in the absence of a completed checkpoint.</td>
-</tr>
-<tr>
-<td><h5>jobmanager.adaptive-scheduler.scaling-interval.min</h5></td>
-<td style="word-wrap: break-word;">30 s</td>
-<td>Duration</td>
-<td>Determines the minimum time between scaling operations.</td>
-</tr>
 <tr>
 <td><h5>jobmanager.archive.fs.dir</h5></td>
 <td style="word-wrap: break-word;">(none)</td>
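The description of `jobmanager.adaptive-scheduler.rescale-trigger.max-delay` defines its default with a small formula: 0 ms when checkpointing is disabled, otherwise the checkpointing interval multiplied by (`rescale-trigger.max-checkpoint-failures` + 1). With a 10 s checkpointing interval and the default of 2 failures, that gives 10 s × 3 = 30 s. The sketch below restates that formula in Python; the function name and the use of `None` to represent disabled checkpointing are assumptions made for illustration.

```python
from datetime import timedelta
from typing import Optional


def default_rescale_trigger_max_delay(
    checkpoint_interval: Optional[timedelta],
    max_checkpoint_failures: int = 2,  # default of rescale-trigger.max-checkpoint-failures
) -> timedelta:
    """Default of rescale-trigger.max-delay, per the option description.

    checkpoint_interval is None when checkpointing is disabled (an assumption
    of this sketch about how "disabled" is represented).
    """
    if checkpoint_interval is None:
        return timedelta(0)
    # Checkpointing interval multiplied by the failure count incremented by one.
    return checkpoint_interval * (max_checkpoint_failures + 1)


if __name__ == "__main__":
    print(default_rescale_trigger_max_delay(None))                   # 0:00:00
    print(default_rescale_trigger_max_delay(timedelta(seconds=10)))  # 0:00:30
```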