Open
Description
I ran test_bd_serving.py on dInfer a8b4a06 and got the traceback below. The steps to reproduce are listed first. @zheng-da
# dllm-dinfer:v-260106 (de602d2bc8f9) (26.4GB)
docker run -it --gpus='"device=4"' --entrypoint=/bin/bash -v /bigdata/shared/models/huggingface/LLaDA2.0-mini--572899f-C8:/model de602d2bc8f9
sed -i 's#/mnt/infra/dulun.dl/models/dllm-mini/block-diffusion-sft-2k-v2-full-bd/LLaDA2-mini-preview-ep4-v0#/model#g' /code/dInfer/tests/test_bd_serving.py # replace model_path with the mounted /model
sed -i 's#import pytest##g' /code/dInfer/tests/test_bd_serving.py # pytest is not used
sed -i 's# model = init_sglang_dist()# #g' /code/dInfer/tests/test_bd_serving.py # the global model is already initialized on line 97, so the call on line 196 should be removed
date && python3 /code/dInfer/tests/test_bd_serving.py && date
INFO 01-08 01:23:35 [__init__.py:216] Automatically detected platform cuda.
WARNING:sglang.srt.layers.moe.utils:MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/spawn.py", line 131, in _main
prepare(preparation_data)
File "/usr/lib/python3.12/multiprocessing/spawn.py", line 246, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.12/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen runpy>", line 287, in run_path
File "<frozen runpy>", line 98, in _run_module_code
File "<frozen runpy>", line 88, in _run_code
File "/code/dInfer/tests/test_bd_serving.py", line 97, in <module>
model = init_sglang_dist()
^^^^^^^^^^^^^^^^^^
File "/code/dInfer/tests/test_bd_serving.py", line 69, in init_sglang_dist
distributed.init_distributed_environment(1, 0, 'env://', 0, 'nccl')
File "/usr/local/lib/python3.12/dist-packages/sglang/srt/distributed/parallel_state.py", line 1408, in init_distributed_environment
torch.distributed.init_process_group(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 1757, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/rendezvous.py", line 278, in _env_rendezvous_handler
store = _create_c10d_store(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/rendezvous.py", line 198, in _create_c10d_store
return TCPStore(
^^^^^^^^^
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 40399, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
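One possible reading of the traceback (my assumption, not confirmed by the maintainers): the frames pass through multiprocessing/spawn.py and runpy.run_path before reaching line 97, which suggests that a spawned child process re-imports test_bd_serving.py and runs the module-level init_sglang_dist() again while the parent still holds port 40399. The sketch below only reproduces that kind of failure with plain sockets and shows the usual `if __name__ == "__main__":` guard. `second_listener_errno` and `init_once` are hypothetical names for illustration, not part of dInfer or sglang.

```python
import errno
import socket
from typing import Optional


def second_listener_errno() -> Optional[int]:
    """Bind two listening sockets to one port; return the second attempt's errno."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            # Same port again, as a re-run of the init code would attempt.
            second.bind(("127.0.0.1", port))
            second.listen()
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


def init_once() -> str:
    """Hypothetical stand-in for the test's module-level init_sglang_dist() call."""
    return "initialized"


if __name__ == "__main__":
    # Under the spawn start method a child imports this file as "__mp_main__",
    # so code under this guard runs in the parent process only.
    model = init_once()
    print(errno.errorcode.get(second_listener_errno()))
```

If this reading is right, moving the module-level `model = init_sglang_dist()` in the test under such a guard might avoid the second TCPStore bind, but I have not verified that against the dInfer code.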