Deploying the rerank model BAAI/bge-reranker-base with Docker

FROM python:3.9-slim

RUN sed -i 's|http://deb.debian.org|https://mirrors.ustc.edu.cn|g' /etc/apt/sources.list.d/debian.sources \
    && apt-get update \
    && apt-get install -y --no-install-recommends tzdata gcc \
    && ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
    && echo "Asia/Shanghai" > /etc/timezone \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir -i https://pypi.mirrors.ustc.edu.cn/simple \
    numpy==1.26.4 torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 sentence-transformers==4.0.1 fastapi==0.115.12 uvicorn==0.34.0

WORKDIR /app

COPY rerank/app.py /app/

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]


# Force CPU before anything imports torch: setting CUDA_VISIBLE_DEVICES
# after torch has initialized has no effect
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import time
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import CrossEncoder

app = FastAPI(title="BGE-Reranker-Base CPU Service")

# Load the base model (~420 MB) onto the CPU
model = CrossEncoder('BAAI/bge-reranker-base', device='cpu')

# Not used by the endpoints below; schema for scoring a single pair
class QueryDocumentPair(BaseModel):
    query: str
    document: str

class RerankRequest(BaseModel):
    query: str
    documents: list[str]
    top_k: Optional[int] = None  # return all results when unset
    batch_size: int = 16  # default batch size

@app.post("/rerank")
async def rerank_texts(request: RerankRequest):
    start_time = time.time()

    # Safety limit on documents per request
    MAX_DOCS = 100
    if len(request.documents) > MAX_DOCS:
        raise HTTPException(
            status_code=400,
            detail=f"Too many documents (max {MAX_DOCS}); reduce the count or split into batches"
        )

    # Build (query, document) pairs for the cross-encoder
    model_inputs = [[request.query, doc] for doc in request.documents]

    # Score in batches to avoid memory spikes
    scores = []
    for i in range(0, len(model_inputs), request.batch_size):
        batch = model_inputs[i:i + request.batch_size]
        scores.extend(model.predict(batch))

    # Pair each document with its score and sort by score, descending
    results = sorted(
        zip(request.documents, scores),
        key=lambda x: x[1],
        reverse=True
    )

    # Apply the top_k cutoff if one was requested
    if request.top_k is not None and request.top_k > 0:
        results = results[:request.top_k]

    processing_time = time.time() - start_time

    return {
        "model": "bge-reranker-base",
        "device": "cpu",
        "processing_time_seconds": round(processing_time, 3),
        "documents_processed": len(request.documents),
        "results": [
            {"document": doc, "score": float(score), "rank": idx+1}
            for idx, (doc, score) in enumerate(results)
        ]
    }

@app.get("/model-info")
async def get_model_info():
    return {
        "model_name": "BAAI/bge-reranker-base",
        "max_sequence_length": 512,
        "recommended_batch_size": 16,
        "device": "cpu"
    }

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model_loaded": True}

Containerized vLLM DeepSeek deployment: from machine build to delivery

1. Hardware list

1. Intel Core i7-14700KF

2. RTX 4090 x 2

3. DDR5 RAM, 32 GB x 4

4. ASUS Z790 motherboard

5. Storage: 2 TB SSD (system) + 4 TB enterprise drive

2. BIOS configuration changes

2.1 PCIe link check

lspci | grep -i nvidia

01:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1)
07:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1)

lspci -vvv -s 01:00.0 | grep LnkSta
lspci -vvv -s 07:00.0 | grep LnkSta

LnkSta: Speed 2.5GT/s (downgraded), Width x16
LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)

The links currently train as x16 + x4; the slot configuration needs to be switched to x8 + x8 in the BIOS shortly. The 2.5 GT/s speed shown is normal at idle (the link retrains to full speed under load); the widths are what matter here. After the BIOS change and a reboot, the same LnkSta check should report Width x8 on both cards.

2.2 确认是否开启  Above 4G Decoding

cat /proc/cmdline | grep -iE "pci=assign-busses|enable_4g_decoding"
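
A more direct check is whether the kernel mapped the cards' 64-bit prefetchable BARs: with Resizable BAR active the region is far larger than the legacy 256 MB, which only works with Above 4G Decoding enabled (addresses and sizes will vary per system):

lspci -vv -s 01:00.0 | grep -i prefetchable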

2.3 Power settings: disable ASPM (PCIe link power saving) and CEP, and lift the power limits.
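
After rebooting, the ASPM state can be double-checked from the OS; the policy file is a standard kernel interface, and lspci reports the per-link setting (expect "ASPM Disabled" in the LnkCtl line):

cat /sys/module/pcie_aspm/parameters/policy
lspci -vvv -s 01:00.0 | grep ASPM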

3. Driver installation

apt update

ubuntu-drivers autoinstall

nvidia-smi

4. Software installation
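
Docker needs the NVIDIA Container Toolkit before the GPU reservations in the compose file below will work. A minimal sketch, assuming Docker Engine is installed and NVIDIA's apt repository has been added per their install docs:

apt install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker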

cat docker-compose.yml

services:
  vllm:
    image: vllm/vllm-openai:v0.8.1
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    ports:
      - "172.17.0.1:8001:8000"
    volumes:
      - /home/kairui/models:/models
      - /dev/shm:/dev/shm
    logging:
      driver: "json-file"
      options:
        max-size: "1g"
        max-file: "10"
    environment:
      - HF_HOME=/models
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
      - CUDA_VISIBLE_DEVICES=0,1
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command: [
      "--model", "/models/DeepSeek-R1-Distill-Qwen-14B",
      "--served-model-name", "deepseek-r1",
      "--tensor-parallel-size", "2",
      "--gpu-memory-utilization", "0.85",
      "--dtype", "float16",
      "--max-model-len", "8192",
      "--max-num-seqs", "64",
      "--api-key", "5XxBnYwkSAnlmhUVXzuYlBtG8XOfBF9K"
    ]
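
Bring the stack up and smoke-test the OpenAI-compatible API (the bearer token matches --api-key above, the model name matches --served-model-name, and 172.17.0.1:8001 is the host-side binding from the ports section):

docker compose up -d
curl -s http://172.17.0.1:8001/v1/chat/completions \
  -H "Authorization: Bearer 5XxBnYwkSAnlmhUVXzuYlBtG8XOfBF9K" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "hello"}]}'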