[ray,trainer] feat: add master port range configuration for port range#5201
[ray,trainer] feat: add master port range configuration for port range#5201RobotGF wants to merge 2 commits intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a feature to configure a range of ports for the master process. While the intent is good, the implementation has a few critical issues that prevent it from working correctly. There's a logic error in port selection that could lead to using an invalid port. Additionally, the new configuration parameter is not correctly passed through all the necessary components, rendering the feature ineffective. I've provided detailed comments on these points.
| else: | ||
| port = master_port_range[0] | ||
| while port < master_port_range[1]: | ||
| try: | ||
| with socket.socket() as s: | ||
| s.bind(("", port)) | ||
| break | ||
| except OSError: | ||
| port += 1 # Increment port number if already in use | ||
| logger.info("Port %d is already in use, trying port %d", port - 1, port) |
There was a problem hiding this comment.
If all ports within the specified master_port_range are in use, the current implementation will fall through the loop and return addr, str(port) will be called with port equal to master_port_range[1]. This port is outside the specified range [start, end) and is not guaranteed to be free. This can lead to unexpected behavior or failures. The function should instead raise an exception if no free port can be found within the given range.
| else: | |
| port = master_port_range[0] | |
| while port < master_port_range[1]: | |
| try: | |
| with socket.socket() as s: | |
| s.bind(("", port)) | |
| break | |
| except OSError: | |
| port += 1 # Increment port number if already in use | |
| logger.info("Port %d is already in use, trying port %d", port - 1, port) | |
| else: | |
| port = master_port_range[0] | |
| while port < master_port_range[1]: | |
| try: | |
| with socket.socket() as s: | |
| s.bind(('', port)) | |
| break | |
| except OSError: | |
| port += 1 # Increment port number if already in use | |
| logger.info("Port %d is already in use, trying port %d", port - 1, port) | |
| else: | |
| raise RuntimeError(f"Could not find a free port in range {master_port_range}") |
| """ | ||
| self.resource_pool = resource_pool | ||
|
|
||
| self.master_port_range = master_port_range |
There was a problem hiding this comment.
While master_port_range is correctly stored as an instance attribute here, it is not used later in this method. The call to self._get_master_addr_port at line 555 needs to be updated to pass self.master_port_range to make this feature functional. This is one of several places where the new parameter needs to be plumbed through.
| OmegaConf.select(self.config.global_profiler.global_tool_config.nsys, "worker_nsight_options") | ||
| ) | ||
| wg_kwargs["device_name"] = self.device_name | ||
| wg_kwargs["master_port_range"] = self.config.trainer.get("master_port_range", None) |
There was a problem hiding this comment.
The master_port_range is being added to wg_kwargs, but the RayWorkerGroup constructor does not handle this argument. It will be ignored, and the port range configuration will have no effect. You need to update RayWorkerGroup.__init__ to accept master_port_range from **kwargs and then pass it down to _init_with_resource_pool.
| # mode: "auto", "enable", or "disable" | ||
| use_legacy_worker_impl: auto | ||
|
|
||
| #master port range for ray to find a free port |
There was a problem hiding this comment.
Remove this, we can't setup it in the global config
| OmegaConf.select(self.config.global_profiler.global_tool_config.nsys, "worker_nsight_options") | ||
| ) | ||
| wg_kwargs["device_name"] = self.device_name | ||
| wg_kwargs["master_port_range"] = self.config.trainer.get("master_port_range", None) |
There was a problem hiding this comment.
We don't need to pass the config to the trainer, just provide the API so that it is configurable based on actual use case.
…cation
What does this PR do?
support train master port range configuration
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,veomni,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.