
Distributed multi-node deployment: problem specifying GPUs at launch #2710

Open
syd1997 opened this issue Dec 26, 2024 · 11 comments
Labels: gpu
Milestone: v1.x

syd1997 commented Dec 26, 2024

cuda 12.2
python3.10
transformers 4.47.0

xinference version :1.0.1

# start supervisor

conda activate XXX
export XINFERENCE_HOME=/data/xinference
nohup xinference-supervisor -H $IP_ADDR

# start worker
conda activate XXX
export XINFERENCE_HOME=/data/xinference
export XINFERENCE_ENDPOINT=$IP_ADDR
nohup xinference-worker -e "$IP_ADDR:$PORT" -H $IP_ADDR
I have three nodes with 4, 2, and 4 GPUs respectively, and they were started in that order (4-GPU, 2-GPU, 4-GPU).
When I launch three different models: the first model is assigned 4 GPUs and starts normally, but when launching the second model I can select at most 2 GPUs. Only after the 2-GPU node is fully occupied can I select 4 GPUs for the third model and have it run on the other 4-GPU node.
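The behavior described above is consistent with a scheduler that offers GPUs from one candidate worker at a time, in registration order, rather than from the worker with the most free GPUs. The following toy model of such a policy is purely hypothetical (it is not xinference's actual scheduler code); it only illustrates how the reported 2-GPU cap on the second launch could arise:

```python
# Hypothetical sketch of a per-worker, registration-order GPU offer policy.
# NOT xinference's real scheduler; it only reproduces the reported symptom.

def selectable_gpus(workers):
    """Return (worker_name, free_gpus) for the first worker, in
    registration order, that still has any free GPUs."""
    for name, free in workers:
        if free > 0:
            return name, free
    return None, 0

# Nodes registered in start order: 4-GPU, 2-GPU, 4-GPU.
workers = [["node1", 4], ["node2", 2], ["node3", 4]]

# Model 1 takes all 4 GPUs on node1.
workers[0][1] = 0

# Model 2: only node2's 2 GPUs are offered, although node3 has 4 free.
print(selectable_gpus(workers))  # ('node2', 2)

# Only once node2 is occupied does model 3 get node3's 4 GPUs.
workers[1][1] = 0
print(selectable_gpus(workers))  # ('node3', 4)
```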

@XprobeBot XprobeBot added the gpu label Dec 26, 2024
@XprobeBot XprobeBot added this to the v1.x milestone Dec 26, 2024
syd1997 (author) commented Dec 26, 2024

All models were launched from the web UI.

qinxuye (contributor) commented Dec 26, 2024

A more reliable approach is probably to pin the model to a node via worker_ip.

syd1997 (author) commented Dec 26, 2024

Has anyone else run into this? Is there a workaround, and will this be improved in the future? Thanks!

github-actions bot commented Jan 2, 2025

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Jan 2, 2025
FrankCreen commented

> A more reliable approach is probably to pin the model to a node via worker_ip.

Has this been resolved? I deployed with v1.1.1 and keep hitting a model-path-not-found error, while single-machine deployment works fine.
[error screenshot]

qinxuye (contributor) commented Jan 8, 2025

You need to confirm that the path exists on the machine you specified.

FrankCreen commented

> You need to confirm that the path exists on the machine you specified.

We start everything with Docker, and we have confirmed the path is correct.

We also tested v0.16.3: single-machine deployment works fine, but cluster mode fails.

master:
docker run -itd --shm-size=12g --ulimit memlock=-1 \
-v /data/work/models/base_models:/models \
-e XINFERENCE_HOME=/models \
-p 9097:9997 \
--name xinference-master \
--gpus '"device=0"' \
--net=host \
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.16.3 \
xinference-supervisor -H 10.200.22.61 --port 9097




worker1:
docker run -itd --shm-size=12g --ulimit memlock=-1 \
-v /data/work/models/base_models:/models \
-e XINFERENCE_HOME=/models \
-p 9098:9998 \
--name xinference-worker \
--gpus '"device=0"' \
--net=host \
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.16.3 \
xinference-worker -e "http://10.200.22.61:9097" -H 10.200.22.61 --worker-port 9098

worker2:
docker run -itd --shm-size=12g --ulimit memlock=-1 \
-v /data/work/models/base_models:/models \
-e XINFERENCE_HOME=/models \
-p 9098:9998 \
--name xinference-worker \
--gpus '"device=0"' \
--net=host \
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.16.3 \
xinference-worker -e "http://10.200.22.61:9097" -H 10.200.22.62 --worker-port 9098
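Since the error is "model path not found", one quick way to rule out mount problems is to confirm the model URI actually resolves inside each worker container (e.g. via `docker exec` into `xinference-worker`). A minimal sketch of such a check follows; the real path would be the one from the registration JSON (e.g. `/models/embed_model/moka-ai_m3e-large/`), shown here only against a temporary stand-in directory:

```python
# Minimal sketch: verify a model URI exists and is non-empty before
# launching. Intended to be run inside each worker container.
import os
import tempfile

def check_model_uri(uri: str) -> bool:
    """Return True if the path is a directory containing at least one entry."""
    if not os.path.isdir(uri):
        return False
    return any(os.scandir(uri))

# Demonstration with a temporary stand-in directory.
with tempfile.TemporaryDirectory() as d:
    print(check_model_uri(d))   # False: directory exists but is empty
    open(os.path.join(d, "config.json"), "w").close()
    print(check_model_uri(d))   # True: directory now contains a file
```

If this returns False on any worker for the registered `model_uri`, the volume mount (`-v /data/work/models/base_models:/models`) is the first thing to re-check on that machine.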

[error screenshots]

qinxuye (contributor) commented Jan 8, 2025

Is it a registered custom model?

FrankCreen commented

> Is it a registered custom model?

Yes. The embedding model is "moka-ai/m3e-large" and the rerank model is "BAAI/bge-reranker-large"; both were downloaded locally from Hugging Face.

qinxuye (contributor) commented Jan 8, 2025

Custom model registration should be tied to a worker — did you configure that?

@github-actions github-actions bot removed the stale label Jan 8, 2025
FrankCreen commented

The model registration info is as follows:
[registration screenshot]
{
    "model_name": "embed-test",
    "model_id": null,
    "model_revision": null,
    "model_hub": "huggingface",
    "dimensions": 768,
    "max_tokens": 1024,
    "language": [
        "zh"
    ],
    "model_uri": "/models/embed_model/moka-ai_m3e-large/",
    "is_builtin": false
}
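A small sanity check on a registration payload like the one above can catch the most common mistakes (a relative path, a missing field) before registering. The required-field list below is an assumption inferred from the JSON shown, not xinference's authoritative schema:

```python
import json

# Hypothetical validator for a custom embedding-model registration payload.
# Field names are taken from the JSON above; this is NOT xinference's schema.
REQUIRED = {"model_name", "dimensions", "max_tokens", "language", "model_uri"}

def validate_registration(payload: str) -> dict:
    """Parse the JSON payload and check for common registration mistakes."""
    spec = json.loads(payload)
    missing = REQUIRED - spec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not spec["model_uri"].startswith("/"):
        raise ValueError("model_uri should be an absolute path inside the container")
    return spec

spec = validate_registration(json.dumps({
    "model_name": "embed-test",
    "dimensions": 768,
    "max_tokens": 1024,
    "language": ["zh"],
    "model_uri": "/models/embed_model/moka-ai_m3e-large/",
}))
print(spec["model_name"])  # embed-test
```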

When launching with worker_ip set, the model is never found in the container. We tested both workers with no success.

worker1: [error screenshot]
worker2: [error screenshot]
