[Infer] change block attention as default attention and remove normal attention in inference mode #9770
PR types
Others
PR changes
Others
Description
When `inference_model=1`, make block attention the default attention, remove the use of the original (normal) attention, and delete the now-unused code paths.
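The intended flag handling can be sketched roughly as follows (a minimal illustration; the function and parameter names are hypothetical, not the actual PaddleNLP code):

```python
# Hypothetical sketch of the attention dispatch described in this PR.
# Flag and function names are illustrative only.
def select_attention(block_attn: bool, append_attn: bool) -> str:
    if append_attn:
        # --append_attn=1: use append attention.
        return "append_attn"
    if block_attn:
        # --block_attn (the default in inference mode): use block attention.
        return "block_attn"
    # --append_attn=0 --block_attn=0: normal attention has been removed,
    # so there is no fallback and an exception is raised.
    raise ValueError(
        "Normal attention is removed in inference mode; "
        "enable block_attn or append_attn."
    )
```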
- With `--block_attn`, block attn is used by default.
- With `--append_attn=1`, append attn is used.
- With `--append_attn=0 --block_attn=0`, an exception is raised.

Progress list:
bloom
Test command:
Known issues:
append attn fails at runtime (also fails on develop)
Other issues: the bigscience/bloom-1b1 model has head_dim=96, and block attn still had a compatibility issue at head_dim=96 (also present on develop) (resolved):
[Infer] Add head_dim=96 dispatch for block attention Paddle#70763
chatglm (no block_attn logic; left as-is)
chatglm_v2
Test command:
Known issues:
the append attn computation goes out of bounds (also on develop)
gpt (no block_attn logic; left as-is)
llama
Test command:
mixtral (local environment ran out of GPU memory; needs someone else to help test)
Test command:
opt (no block_attn logic; left as-is)
qwen (no block_attn logic; left as-is)
qwen2
Test command:
qwen2_moe
Test command:
Known issues:
block attn fails to run, although it still ran when implemented via AutoModel (same on develop)
append attn fails to run (same on develop)
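The head_dim=96 compatibility issue noted for bloom above was resolved by adding a head_dim=96 dispatch in Paddle#70763. The idea can be sketched as follows (a hypothetical illustration; the real fix lives in Paddle's C++/CUDA kernel dispatch, and the dimension set here is assumed for the example):

```python
# Hypothetical sketch of head_dim-based kernel dispatch for block attention.
# The supported set below is illustrative; 96 is the value added so that
# models like bigscience/bloom-1b1 (head_dim=96) can run.
SUPPORTED_HEAD_DIMS = {64, 96, 128}

def check_block_attn_head_dim(head_dim: int) -> int:
    # Reject head dims with no specialized kernel instead of computing
    # incorrect results.
    if head_dim not in SUPPORTED_HEAD_DIMS:
        raise NotImplementedError(
            f"block attention has no kernel dispatch for head_dim={head_dim}"
        )
    return head_dim
```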