Search before asking
Version
megatron: main branch at commit "e8749f88691cac1eeefd11b6b68cb8a6557356d5", with some parallel_state code changed (see the patch below)
mbridge:0.15.1
megatron patch:
parallel_state_patch.patch
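I am running Qwen3-8B with one training process and one inference process using the code below. The run fails even after I fixed some bugs in awex (e.g. #96).
According to the traceback, the awex sglang converter ("_convert_layer_norm_param" in awex/converter/sglang_converter.py) does not recognize the parameter name "self_attn.q_norm.weight".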
awex_example.py
error log:
  File "/inspire/hdd/project/qianghuaxuexi/public/wty/sglang/python/sglang/srt/managers/scheduler.py", line 2713, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/inspire/hdd/global_user/wangtongyu-25057/miniconda3/envs/py12_dev/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/inspire/hdd/project/qianghuaxuexi/public/wty/sglang/python/sglang/srt/managers/scheduler.py", line 1009, in event_loop_overlap
    self.process_input_requests(recv_reqs)
  File "/inspire/hdd/project/qianghuaxuexi/public/wty/sglang/python/sglang/srt/managers/scheduler.py", line 1187, in process_input_requests
    output = self._request_dispatcher(recv_req)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/inspire/hdd/project/qianghuaxuexi/public/wty/sglang/python/sglang/utils.py", line 507, in __call__
    return fn(obj)
           ^^^^^^^
  File "/inspire/hdd/project/qianghuaxuexi/public/wty/sglang/python/sglang/srt/managers/scheduler.py", line 2479, in execute_task_in_model_worker
    result = self.model_worker.execute_task_in_model_worker(task_spec)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/inspire/hdd/project/qianghuaxuexi/public/wty/sglang/python/sglang/srt/managers/tp_worker.py", line 106, in execute_task_in_model_worker
    return self.model_runner.execute_task_in_model_worker(task_spec, models=models)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/inspire/hdd/project/qianghuaxuexi/public/wty/sglang/python/sglang/srt/model_executor/model_runner.py", line 931, in execute_task_in_model_worker
    return task_func(**kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/inspire/hdd/project/qianghuaxuexi/public/wty/asystem-awex/awex/meta/infer_meta_resolver.py", line 169, in _get_model_param_info
    for hf_name, hf_param in sglang_to_hf_weight_converter.convert_param(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/inspire/hdd/global_user/wangtongyu-25057/miniconda3/envs/py12_dev/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/inspire/hdd/project/qianghuaxuexi/public/wty/asystem-awex/awex/converter/sglang_converter.py", line 294, in convert_param
    converted_params = self._convert_layer_norm_param(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/inspire/hdd/project/qianghuaxuexi/public/wty/asystem-awex/awex/converter/sglang_converter.py", line 255, in _convert_layer_norm_param
    raise NotImplementedError(f"Unsupported layer norm parameter name: {name}")
NotImplementedError: Unsupported layer norm parameter name: self_attn.q_norm.weight
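To illustrate the kind of handling that seems to be missing, here is a minimal sketch of a name mapper that accepts the Qwen3 per-head norms next to the usual layer norms. It is not awex's actual code: the function name `convert_layer_norm_name`, the `KNOWN_NORM_SUFFIXES` tuple, and the `layer_idx` argument are hypothetical, and it assumes that the q_norm/k_norm weights need no reshaping or splitting and map one-to-one to the Hugging Face names `model.layers.{i}.self_attn.q_norm.weight` and `model.layers.{i}.self_attn.k_norm.weight`.

```python
# Hypothetical sketch only; the real awex converter may be structured differently.

# Per-layer norm parameter suffixes this sketch accepts. The last two are the
# Qwen3 per-head query/key norms that trigger the error in the log above.
KNOWN_NORM_SUFFIXES = (
    "input_layernorm.weight",
    "post_attention_layernorm.weight",
    "self_attn.q_norm.weight",
    "self_attn.k_norm.weight",
)


def convert_layer_norm_name(name: str, layer_idx: int) -> str:
    """Map a per-layer norm parameter suffix to a Hugging Face style name.

    Assumption: norm weights are passed through unchanged, so only the
    name needs to be rebuilt with the layer prefix.
    """
    if name not in KNOWN_NORM_SUFFIXES:
        # Same message format as the error in the traceback.
        raise NotImplementedError(f"Unsupported layer norm parameter name: {name}")
    return f"model.layers.{layer_idx}.{name}"


if __name__ == "__main__":
    print(convert_layer_norm_name("self_attn.q_norm.weight", 0))
```

If the real converter follows a similar dispatch on the parameter suffix, adding the two `self_attn.*_norm.weight` entries may be enough, but I have not verified this against the awex source.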
Component(s)
Framework
Minimal reproduce step
megatron: main branch at commit "e8749f88691cac1eeefd11b6b68cb8a6557356d5", with some parallel_state code changed (see the patch below)
mbridge:0.15.1
megatron patch:
parallel_state_patch.patch
code:
awex_example.py
What did you expect to see?
The example runs to completion without errors.
What did you see instead?
"NotImplementedError: Unsupported layer norm parameter name: self_attn.q_norm.weight"
Anything Else?
No response
Are you willing to submit a PR?