rocm fork very slow on some/large models (goliath q3_k_s) #530
ChristophHaag
started this conversation in General
The ROCm fork has no issue tracker, so I'll post here.
Some smaller models have worked fine with the ROCm fork, but running Goliath Q3_K_S, for example, is very, very slow.
Arch Linux, Ryzen 3950X, Radeon RX 6900 XT, 64 GB of 3200 MHz RAM.
I know it's not going to be fast on this hardware, but with CLBlast it's still much, much faster than with ROCm. As far as I can tell it's not filling VRAM or RAM (I'm trying zram at the moment, so that's somewhat hard to tell).
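To double-check that, something like the following can be left running in a second terminal while the model generates (this assumes rocm-smi from the ROCm stack and the usual procps `free` are installed, which may not match every setup; radeontop would also work):

```
# VRAM actually in use on the 6900 XT (rocm-smi ships with ROCm)
watch -n 1 rocm-smi --showmeminfo vram
# system RAM and swap/zram pressure
watch -n 1 free -h
```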
koboldcpp-rocm 11aa596: Processing:112.84s (7522.5ms/T), Generation:886.43s (42210.8ms/T), Total:999.26s (0.02T/s)
python koboldcpp-rocm/koboldcpp.py goliath-120b.Q3_K_S.gguf --usecublas mmq --gpulayers 16 --contextsize 4096
***
Welcome to KoboldCpp - Version 1.49.yr1-ROCm
Attempting to use hipBLAS library for faster prompt ingestion. A compatible AMD GPU will be required.
Initializing dynamic library: koboldcpp_hipblas.so
==========
Namespace(model=None, model_param='goliath-120b.Q3_K_S.gguf', port=5001, port_param=5001, host='', launch=False, lora=None, config=None, threads=15, blasthreads=15, highpriority=False, contextsize=4096, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['mmq'], gpulayers=16, tensor_split=None, onready='', multiuser=False, remotetunnel=False, foreground=False, preloadstory='')
==========
Loading model: goliath-120b.Q3_K_S.gguf
[Threads: 15, BlasThreads: 15, SmartContext: False, ContextShift: True]
Identified as LLAMA model: (ver 6)
Attempting to Load...
Using automatic RoPE scaling (scale:1.000, base:10000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: maybe
ggml_init_cublas: CUDA_USE_TENSOR_CORES: maybe
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon RX 6900 XT, compute capability 10.3
llama_model_loader: loaded meta data with 20 key-value pairs and 1236 tensors from goliath-120b.Q3_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 137
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 117.75 B
llm_load_print_meta: model size = 47.22 GiB (3.45 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.45 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required = 42746.17 MB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/140 layers to GPU
llm_load_tensors: VRAM used: 5611.00 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 2192.00 MB
llama_build_graph: non-view tensors processed: 3155/3155
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 6179.00 MB (model: 5611.00 MB, context: 568.00 MB)
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 4096, "max_length": 1024, "rep_pen": 1.15, "temperature": 1.5, "top_p": 1, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 0.69, "rep_pen_range": 1024, "rep_pen_slope": 0.1, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "min_p": 0, "genkey": "KCPP4587", "prompt": "\nUSER: tell me a joke\nASSISTANT: ", "quiet": true, "stop_sequence": ["USER:", "ASSISTANT:"], "use_default_badwordsids": false}
Processing Prompt (15 / 15 tokens)
Generating (21 / 1024 tokens)
(EOS token triggered!)
ContextLimit: 36/4096, Processing:112.84s (7522.5ms/T), Generation:886.43s (42210.8ms/T), Total:999.26s (0.02T/s)
Output: Why did the tomato turn red?
Because it saw the salad dressing!
koboldcpp a00a32e: Processing:10.15s (676.8ms/T), Generation:24.14s (1149.3ms/T), Total:34.29s (0.61T/s)
RUSTICL_ENABLE=radeonsi python /koboldcpp/koboldcpp.py goliath-120b.Q3_K_S.gguf --useclblast 0 0 --gpulayers 16 --contextsize 4096
***
Welcome to KoboldCpp - Version 1.49
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.so
==========
Namespace(model=None, model_param='goliath-120b.Q3_K_S.gguf', port=5001, port_param=5001, host='', launch=False, lora=None, config=None, threads=15, blasthreads=15, highpriority=False, contextsize=4096, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=[0, 0], usecublas=None, gpulayers=16, tensor_split=None, onready='', multiuser=False, remotetunnel=False, foreground=False, preloadstory='')
==========
Loading model: goliath-120b.Q3_K_S.gguf
[Threads: 15, BlasThreads: 15, SmartContext: False, ContextShift: True]
Identified as LLAMA model: (ver 6)
Attempting to Load...
Using automatic RoPE scaling (scale:1.000, base:10000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Platform:0 Device:0 - rusticl with AMD Radeon RX 6900 XT (navi21, LLVM 16.0.6, DRM 3.54, 6.6.1-arch1-1)
Platform:1 Device:0 - AMD Accelerated Parallel Processing with gfx1030
ggml_opencl: selecting platform: 'rusticl'
ggml_opencl: selecting device: 'AMD Radeon RX 6900 XT (navi21, LLVM 16.0.6, DRM 3.54, 6.6.1-arch1-1)'
ggml_opencl: device FP16 support: false
CL FP16 temporarily disabled pending further optimization.
llama_model_loader: loaded meta data with 20 key-value pairs and 1236 tensors from goliath-120b.Q3_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 137
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 117.75 B
llm_load_print_meta: model size = 47.22 GiB (3.45 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.45 MB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: mem required = 42746.17 MB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/138 layers to GPU
llm_load_tensors: VRAM used: 5611.00 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 2192.00 MB
llama_build_graph: non-view tensors processed: 3155/3155
llama_new_context_with_model: compute buffer total size = 574.63 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 4096, "max_length": 1024, "rep_pen": 1.15, "temperature": 1.5, "top_p": 1, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 0.69, "rep_pen_range": 1024, "rep_pen_slope": 0.1, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "min_p": 0, "genkey": "KCPP8849", "prompt": "\nUSER: tell me a joke\nASSISTANT: ", "quiet": true, "stop_sequence": ["USER:", "ASSISTANT:"], "use_default_badwordsids": false}
Processing Prompt (15 / 15 tokens)
Generating (21 / 1024 tokens)
(EOS token triggered!)
ContextLimit: 36/4096, Processing:10.15s (676.8ms/T), Generation:24.14s (1149.3ms/T), Total:34.29s (0.61T/s)
Output: Why did the tomato turn red?
Because it saw the salad dressing!
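Both runs use the same prompt and the same 16 offloaded layers, so the per-token numbers from the two summaries compare directly:

prompt processing: 7522.5 ms/T ÷ 676.8 ms/T ≈ 11× slower with the ROCm fork
generation: 42210.8 ms/T ÷ 1149.3 ms/T ≈ 37× slower with the ROCm fork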