+1. Same here with llama-3.1 GGUF models and locally registered GGUF models: they cannot run on the GPU (torch 2.0.1+cu11.8).
System Info / 系統信息
cuda 11.8
llama-cpp-python 0.2.55
python 3.10
windows 10
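
For triage, it may help to first confirm that this llama-cpp-python build was compiled with CUDA support at all, since a CPU-only wheel silently ignores any GPU settings. A minimal sketch, assuming 0.2.55 exposes the low-level `llama_supports_gpu_offload` binding (recent 0.2.x releases do):

```python
# Sketch: check whether the installed llama-cpp-python wheel can offload to the GPU.
# Assumes the low-level llama_supports_gpu_offload binding is available in this version.
import llama_cpp
from llama_cpp.llama_cpp import llama_supports_gpu_offload

print("llama-cpp-python version:", llama_cpp.__version__)
# False means the wheel was built without cuBLAS/CUDA, so inference stays on the CPU.
print("GPU offload supported:", llama_supports_gpu_offload())
```

If this prints False, the usual remedy for 0.2.x on Windows is reinstalling with `CMAKE_ARGS=-DLLAMA_CUBLAS=on` (or installing a prebuilt CUDA wheel) so the library is built against CUDA 11.8.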
Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?
No (launched locally with xinference-local).
Version info / 版本信息
0.14.0
The command used to start Xinference / 用以启动 xinference 的命令
xinference-local
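
If the model is launched programmatically rather than from the web UI, the GPU request can also be made explicit. A hedged sketch: the endpoint is the xinference-local default, and forwarding `n_gpu_layers` to the llama.cpp backend is an assumption, not something shown in this report.

```python
# Sketch: launch the GGUF model through the Xinference client and ask the
# llama.cpp backend to offload all layers. 127.0.0.1:9997 is the default
# xinference-local endpoint; n_gpu_layers is assumed to be forwarded to
# llama-cpp-python as an extra kwarg.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")
model_uid = client.launch_model(
    model_name="qwen2-instruct",
    model_format="ggufv2",
    model_size_in_billions=7,
    quantization="q4_0",
    n_gpu_layers=-1,
)
print("launched:", model_uid)
```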
Reproduction / 复现过程
2024-08-05 20:41:08,332 xinference.core.worker 146864 INFO You specify to launch the model: qwen2-instruct on GPU index: [0] of the worker: 127.0.0.1:59241, xinference will automatically ignore the n_gpu option.
2024-08-05 20:41:24,186 xinference.model.llm.llm_family 146864 INFO Caching from Modelscope: qwen/Qwen2-7B-Instruct-GGUF
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from C:\Users\allan716\.cache\modelscope\hub\qwen\Qwen2-7B-Instruct-GGUF\qwen2-7b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = qwen2-7b-instruct
llama_model_loader: - kv 2: qwen2.block_count u32 = 28
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: quantize.imatrix.file str = ../Qwen2/gguf/qwen2-7b-imatrix/imatri...
llama_model_loader: - kv 23: quantize.imatrix.dataset str = ../sft_2406.txt
llama_model_loader: - kv 24: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 25: quantize.imatrix.chunks_count i32 = 1937
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_0: 194 tensors
llama_model_loader: - type q4_1: 3 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.62 B
llm_load_print_meta: model size = 4.13 GiB (4.66 BPW)
llm_load_print_meta: general.name = qwen2-7b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 30 '?'
llm_load_tensors: ggml ctx size = 0.13 MiB
Exception ignored on calling ctypes callback function: <function llama_log_callback at 0x00000259E17AFD90>
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\xin\lib\site-packages\llama_cpp_logger.py", line 30, in llama_log_callback
print(text.decode("utf-8"), end="", flush=True, file=sys.stderr)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 90: invalid continuation byte
llm_load_tensors: CPU buffer size = 4232.57 MiB
......................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1792.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: CPU input buffer size = 72.26 MiB
llama_new_context_with_model: CPU compute buffer size = 1806.00 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'qwen2.attention.head_count': '28', 'general.name': 'qwen2-7b-instruct', 'general.architecture': 'qwen2', 'qwen2.block_count': '28', 'qwen2.context_length': '32768', 'qwen2.attention.head_count_kv': '4', 'quantize.imatrix.dataset': '../sft_2406.txt', 'qwen2.embedding_length': '3584', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '151643', 'qwen2.feed_forward_length': '18944', 'tokenizer.ggml.padding_token_id': '151643', 'qwen2.rope.freq_base': '1000000.000000', 'qwen2.attention.layer_norm_rms_epsilon': '0.000001', 'tokenizer.ggml.eos_token_id': '151645', 'general.file_type': '2', 'tokenizer.ggml.model': 'gpt2', 'tokenizer.ggml.pre': 'qwen2', 'tokenizer.chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 'tokenizer.ggml.add_bos_token': 'false', 'quantize.imatrix.chunks_count': '1937', 'quantize.imatrix.file': '../Qwen2/gguf/qwen2-7b-imatrix/imatrix.dat', 'quantize.imatrix.entries_count': '196'}
Using gguf chat template: {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant.<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
Using chat eos_token: <|im_end|>
Using chat bos_token: <|endoftext|>
Expected behavior / 期待表现
Be able to run qwen2 inference tasks on the GPU.
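
Note that the log above reports only CPU buffers (llm_load_tensors: CPU buffer size = 4232.57 MiB) and BLAS = 0, which usually means the installed llama-cpp-python was built without CUDA, so no layers can be offloaded regardless of what Xinference requests. A quick way to isolate this from Xinference is to load the cached GGUF directly; a sketch, using the cache path from the log and requesting full offload:

```python
# Sketch: load the cached GGUF with llama-cpp-python alone to rule Xinference out.
# With a CUDA-enabled build the verbose output should mention CUDA buffers and
# offloaded layers; a CPU-only build prints only CPU buffers, as in the log above.
from llama_cpp import Llama

llm = Llama(
    model_path=r"C:\Users\allan716\.cache\modelscope\hub\qwen\Qwen2-7B-Instruct-GGUF\qwen2-7b-instruct-q4_0.gguf",
    n_gpu_layers=-1,  # offload every layer if the build supports it
    n_ctx=4096,
    verbose=True,
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```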