[Infer] change block attention as default attention and remove normal attention in inference mode #9770
PR types
Others
PR changes
Others
Description
When `inference_model=1`, make block attention the default attention, remove the use of the original (normal) attention, and delete the now-unused code paths.
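The intended flag handling can be sketched roughly as follows (a minimal illustration; the function and parameter names are hypothetical, not the actual PaddleNLP code):

```python
# Hypothetical sketch of the attention dispatch described in this PR.
# Flag and function names are illustrative only.
def select_attention(block_attn: bool, append_attn: bool) -> str:
    if append_attn:
        # --append_attn=1: use append attention.
        return "append_attn"
    if block_attn:
        # --block_attn (the default in inference mode): use block attention.
        return "block_attn"
    # --append_attn=0 --block_attn=0: normal attention has been removed,
    # so there is no fallback and an exception is raised.
    raise ValueError(
        "Normal attention is removed in inference mode; "
        "enable block_attn or append_attn."
    )
```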
- With `--block_attn`, block attn is used by default.
- With `--append_attn=1`, append attn is used.
- With `--append_attn=0 --block_attn=0`, an exception is raised.

Progress list:
bloom
Test command:
Known issues:
append attn fails at runtime (also fails on develop)
Other issues: the bigscience/bloom-1b1 model has head_dim=96, and block attn still had a compatibility issue at head_dim=96 (also present on develop) (resolved):
[Infer] Add head_dim=96 dispatch for block attention Paddle#70763
chatglm (no block_attn logic; left as-is)
chatglm_v2
Test command:
Known issues:
the append attn computation goes out of bounds (also on develop)
gpt (no block_attn logic; left as-is)
llama
Test command:
mixtral (local environment ran out of GPU memory; needs someone else to help test)
Test command:
opt (no block_attn logic; left as-is)
qwen (no block_attn logic; left as-is)
qwen2
Test command:
qwen2_moe
Test command:
Known issues:
block attn fails to run, although it still ran when implemented via AutoModel (same on develop)
append attn fails to run (same on develop)
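The head_dim=96 compatibility issue noted for bloom above was resolved by adding a head_dim=96 dispatch in Paddle#70763. The idea can be sketched as follows (a hypothetical illustration; the real fix lives in Paddle's C++/CUDA kernel dispatch, and the dimension set here is assumed for the example):

```python
# Hypothetical sketch of head_dim-based kernel dispatch for block attention.
# The supported set below is illustrative; 96 is the value added so that
# models like bigscience/bloom-1b1 (head_dim=96) can run.
SUPPORTED_HEAD_DIMS = {64, 96, 128}

def check_block_attn_head_dim(head_dim: int) -> int:
    # Reject head dims with no specialized kernel instead of computing
    # incorrect results.
    if head_dim not in SUPPORTED_HEAD_DIMS:
        raise NotImplementedError(
            f"block attention has no kernel dispatch for head_dim={head_dim}"
        )
    return head_dim
```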