8343689: AArch64: Optimize MulReduction implementation #225

Open
mikabl-arm wants to merge 1 commit into base: vectorIntrinsics

@mikabl-arm (Contributor) commented Jan 14, 2025

Add an SVE specialization of the reduce_mul intrinsic for vectors of 256 bits or longer. It repeatedly multiplies the two halves of the source vector using SVE instructions until the intermediate result is 128 bits long and fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
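
The halving strategy described above can be modeled in scalar Java. This is only an illustrative sketch, not the code from this PR: the class `HalvingMulReduce`, the method `reduceMul`, and the `stopLanes` parameter are hypothetical names, and the final loop merely stands in for the existing ASIMD path. It assumes `int` lanes, so a 256-bit vector has 8 lanes and the 128-bit stopping point is 4 lanes.

```java
final class HalvingMulReduce {
    // Hypothetical scalar model of the reduction; names are illustrative only.
    // lanes:     the source vector, one array element per lane
    // stopLanes: lane count of the 128-bit tail (4 for int lanes)
    static int reduceMul(int[] lanes, int stopLanes) {
        int n = lanes.length;
        if (n < stopLanes || Integer.bitCount(n) != 1) {
            throw new IllegalArgumentException("lane count must be a power of two >= stopLanes");
        }
        int[] v = lanes.clone(); // do not modify the caller's array

        // Models the SVE phase: multiply the upper half into the lower half
        // until only stopLanes lanes (128 bits) remain.
        while (n > stopLanes) {
            int half = n / 2;
            for (int i = 0; i < half; i++) {
                v[i] *= v[i + half];
            }
            n = half;
        }

        // Stand-in for the existing 128-bit (ASIMD) reduction.
        int result = 1;
        for (int i = 0; i < n; i++) {
            result *= v[i];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] lanes = {1, 2, 3, 4, 5, 6, 7, 8}; // eight int lanes = 256 bits
        System.out.println(reduceMul(lanes, 4)); // prints 40320
    }
}
```

Integer multiplication is associative and commutative even when it wraps around, so this pairing of lanes gives the same result as a left-to-right product. For float and double lanes the rounding can depend on the order in which lanes are combined.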

Benchmark results for an AArch64 CPU that supports SVE with a 256-bit vector length:

  Benchmark                 (size)   Mode      Old        New  Units
  Byte256Vector.MULLanes      1024  thrpt  502.498  10222.717 ops/ms
  Double256Vector.MULLanes    1024  thrpt  172.116   3130.997 ops/ms
  Float256Vector.MULLanes     1024  thrpt  291.612   4164.138 ops/ms
  Int256Vector.MULLanes       1024  thrpt  362.276   3717.213 ops/ms
  Long256Vector.MULLanes      1024  thrpt  184.826   2054.345 ops/ms
  Short256Vector.MULLanes     1024  thrpt  379.231   5716.223 ops/ms

Benchmark results for an AArch64 CPU that supports SVE with a 512-bit vector length:

  Benchmark                 (size)   Mode      Old       New   Units
  Byte512Vector.MULLanes      1024  thrpt  160.129  2630.600  ops/ms
  Double512Vector.MULLanes    1024  thrpt   51.229  1033.284  ops/ms
  Float512Vector.MULLanes     1024  thrpt   84.617  1658.400  ops/ms
  Int512Vector.MULLanes       1024  thrpt  109.419  1180.310  ops/ms
  Long512Vector.MULLanes      1024  thrpt   69.036   704.144  ops/ms
  Short512Vector.MULLanes     1024  thrpt  131.029  1629.632  ops/ms

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • JDK-8343689: AArch64: Optimize MulReduction implementation (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/panama-vector.git pull/225/head:pull/225
$ git checkout pull/225

Update a local copy of the PR:
$ git checkout pull/225
$ git pull https://git.openjdk.org/panama-vector.git pull/225/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 225

View PR using the GUI difftool:
$ git pr show -t 225

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/panama-vector/pull/225.diff

Using Webrev

Link to Webrev Comment

bridgekeeper bot commented Jan 14, 2025

👋 Welcome back mablakatov! A progress list of the required criteria for merging this PR into vectorIntrinsics will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk bot commented Jan 14, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk bot added the rfr label Jan 14, 2025

mlbridge bot commented Jan 14, 2025

Webrevs

@PaulSandoz (Member) commented

@mikabl-arm, you can create a PR with this change against https://github.com/openjdk/jdk. Since the Vector API is incubating in the jdk/master repo, we prefer that changes such as this one target that repo.

The panama-vector repo is then used for larger, more speculative changes, rather than for accumulating smaller changes into a larger, harder-to-review PR to jdk/master later on.
