
[Infer] change block attention as default attention and remove normal attention in inference mode #9770

Open · wants to merge 6 commits into base: develop
Conversation

zeroRains
Contributor

@zeroRains zeroRains commented Jan 11, 2025

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --files XXXX.py
  • Add test cases into the tests folder. If there are codecov issues, please add test cases first.

PR types

Others

PR changes

Others

Description

When inference_model=1, make block attention the default attention, remove the use of the original (normal) attention, and delete the code paths that are no longer used.

  1. When --block_attn is not set, block attn is used by default.
  2. When --append_attn=1, append attn is used.
  3. When --append_attn=0 --block_attn=0, an exception is raised.
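The three rules above can be sketched as a small flag-resolution helper. This is a hypothetical illustration, not the actual PaddleNLP code; the function name `resolve_attention` and the use of `None` to model an unset `--block_attn` flag are assumptions.

```python
from typing import Optional


def resolve_attention(block_attn: Optional[bool], append_attn: bool) -> str:
    """Pick the attention kernel in inference mode (inference_model=1).

    block_attn=None models an unset --block_attn flag, which now
    defaults to block attention.
    """
    if append_attn:
        # --append_attn=1 takes precedence
        return "append_attn"
    if block_attn is None or block_attn:
        # block attention is the default when --block_attn is unset
        return "block_attn"
    # --append_attn=0 --block_attn=0 is invalid
    raise ValueError(
        "inference_model=1 requires block_attn or append_attn to be enabled"
    )
```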

Progress list:

  • bloom

    Test commands:

    cd llm
    # block attn
    python ./predict/predictor.py --model_name_or_path bigscience/bloom-7b1 --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 1 --append_attn 0
    
    # append attn
    python ./predict/predictor.py --model_name_or_path bigscience/bloom-7b1 --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 1
    
    # throw exception
    python ./predict/predictor.py --model_name_or_path bigscience/bloom-7b1 --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 0

    Known issues:
    append attn fails with an error at runtime (this also happens on develop)
    (error screenshot)
    Another issue: the bigscience/bloom-1b1 model has head_dim=96, and block attn still has a compatibility problem when head_dim=96 (also present on develop) (resolved)
    (error screenshot)
    [Infer] Add head_dim=96 dispatch for block attention Paddle#70763

  • chatglm (has no block_attn logic; kept as-is)

  • chatglm_v2
    Test commands:

    cd llm
    # block attn
    python ./predict/predictor.py --model_name_or_path THUDM/chatglm2-6b --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 1 --append_attn 0
    
    # append attn
    python ./predict/predictor.py --model_name_or_path THUDM/chatglm2-6b --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 1
    
    # throw exception
    python ./predict/predictor.py --model_name_or_path THUDM/chatglm2-6b --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 0

    Known issues:
    the append attn computation goes out of bounds (this also happens on develop)
    (error screenshot)

  • gpt (has no block_attn logic; kept as-is)

  • llama

    Test commands:

    cd llm
    # block attn
    python ./predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 1 --append_attn 0
    
    # append attn
    python ./predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 1
    
    # block attn img2txt
    python ./predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 1 --append_attn 0 --model_type img2txt
    
    # append attn img2txt
    python ./predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 1 --model_type img2txt
    
    # throw exception
    python ./predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 0
  • mixtral (ran out of GPU memory in my local environment; could someone help test this?)
    Test commands:

    cd llm
    # block attn
    python ./predict/predictor.py --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 1 --append_attn 0
    
    # append attn
    python ./predict/predictor.py --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 1
    
    # throw exception
    python ./predict/predictor.py --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 0
  • opt (has no block_attn logic; kept as-is)

  • qwen (has no block_attn logic; kept as-is)

  • qwen2
    Test commands:

    cd llm
    # block attn
    python ./predict/predictor.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 1 --append_attn 0
    
    # append attn
    python ./predict/predictor.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 1
    
    # throw exception
    python ./predict/predictor.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 0
  • qwen2_moe
    Test commands:

    cd llm
    # block attn
    python ./predict/predictor.py --model_name_or_path Qwen/Qwen1.5-MoE-A2.7B --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 1 --append_attn 0
    
    # append attn
    python ./predict/predictor.py --model_name_or_path Qwen/Qwen1.5-MoE-A2.7B --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 1
    
    # throw exception
    python ./predict/predictor.py --model_name_or_path Qwen/Qwen1.5-MoE-A2.7B --dtype float16 --mode dynamic --decode_strategy greedy_search --inference_model 1 --block_attn 0 --append_attn 0

    Known issues:
    block attn does not run, although it did run back when this was implemented via AutoModel (same on develop)
    (error screenshot)
    append attn does not run either (same on develop)
    (error screenshot)


paddle-bot bot commented Jan 11, 2025

Thanks for your contribution!


codecov bot commented Jan 11, 2025

Codecov Report

Attention: Patch coverage is 0% with 47 lines in your changes missing coverage. Please review.

Project coverage is 52.67%. Comparing base (fb60645) to head (77996bd).
Report is 10 commits behind head on develop.

Files with missing lines Patch % Lines
...dlenlp/experimental/transformers/llama/modeling.py 0.00% 15 Missing ⚠️
...dlenlp/experimental/transformers/bloom/modeling.py 0.00% 11 Missing ⚠️
...lp/experimental/transformers/qwen2_moe/modeling.py 0.00% 6 Missing ⚠️
...dlenlp/experimental/transformers/qwen2/modeling.py 0.00% 5 Missing ⚠️
...p/experimental/transformers/chatglm_v2/modeling.py 0.00% 4 Missing ⚠️
...enlp/experimental/transformers/mixtral/modeling.py 0.00% 4 Missing ⚠️
paddlenlp/transformers/auto/modeling.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9770      +/-   ##
===========================================
- Coverage    52.70%   52.67%   -0.03%     
===========================================
  Files          731      730       -1     
  Lines       117313   114591    -2722     
===========================================
- Hits         61827    60365    -1462     
+ Misses       55486    54226    -1260     


2 participants