
Integrate cpp_double_fp_backend #648

Open · 654 commits into base branch `develop`

Conversation

ckormanyos (Member)

No description provided.

sinandredemption and others added 30 commits August 21, 2021 21:34
check if integer width is adequate in split()
…into gsoc2021_double_float_chris

# Conflicts:
#	.github/workflows/multiprecision_quad_double_only.yml
#	.gitignore
#	performance/performance_test.cpp
#	test/test_arithmetic.hpp

cosurgi commented Jan 16, 2025

Hi Chris (@ckormanyos), this is very hot off the press. I have just managed to reduce the number of `rd_string` calls from 233480 to 48 and the `str` calls from 194842 to 437 in the `yade -n --quickperformance -j 1` benchmark!

I cannot run the timing benchmarks yet, because I need to clean up the code and remove all of these `std::cerr << __PRETTY_FUNCTION__ << "\n";` statements everywhere ;) They slow down the calculations!

Now if we manage to add pow for integer powers we will be good to go!


ckormanyos commented Jan 16, 2025

> Hi Chris (@ckormanyos), this is very hot off the press. I have just managed to reduce the number of `rd_string` calls from 233480 to 48 and the `str` calls from 194842 to 437 in the `./examples/yade -n --quickperformance -j 1` benchmark!

This is great. Nice work.

> Now if we manage to add `pow` for integer powers we will be good to go!

I have not done that, but I verified that we fall through to a reasonably efficient calculation in our existing collection of default functions. If this ends up being one of the "functions to speed up", then I can do that quickly.

Now, we might still face some quirky issue, and the proof will be in your final bench run. If we are speedy, then good. If, for some other reason, we still face slowdowns, then we have two wins even now:

  • YADE is going to get better
  • And if we still need to squeeze more cpp_double_fp functions, then we can find them and push their limits down.

So no matter what actual numbers you get, your work has put us in a stronger position. Let's keep going!

Cc: @sinandredemption and @jzmaddock


cosurgi commented Jan 16, 2025

Wow, with one last small change I reduced the `rd_string` calls to 2!


cosurgi commented Jan 16, 2025

> Now if we manage to add `pow` for integer powers we will be good to go!

> I have not done that, but I verified that we fall through to a reasonably efficient calculation in our existing collection of default functions. If this ends up being one of the "functions to speed up", then I can do that quickly.

Hi Chris (@ckormanyos), what happens if you replace `z = z*z + c;` with `z = pow(z, 2) + c;` in your Mandelbrot benchmark?

@ckormanyos (Member Author)

> Hi Chris (@ckormanyos), what happens if you replace `z = z*z + c;` with `z = pow(z, 2) + c;` in your Mandelbrot benchmark?

It ruins the performance completely. What a great question, Janek. It took so long that I am still waiting for the timing result. I had the real/imag components separated. See the pic below.

In summary, the pow function killed performance on that particular benchmark.

I went from 17 seconds to 170 seconds, a factor of 10.

[screenshot]


ckormanyos commented Jan 16, 2025

Nightmare: timing up by a factor of 10.

Oops, it is time to hand-optimize `pow(x, n)` where $n$ is an integer.

[screenshot]


cosurgi commented Jan 16, 2025

Whew. That's good news for me actually, because yade uses `pow(arg, 2)` 436138 times in the calculations. And I have removed the calls to `rd_string` entirely. Here is the benchmark result (higher iter/sec is better):

| type | calculation speed | factor |
| --- | --- | --- |
| `float128` | 159.4864 iter/sec | 1 |
| `cpp_double_double` | 31.3289 iter/sec | 5.09 |

Meaning that in yade `float128` is still 5 times faster than `cpp_double_double`.

Before removing the calls to `rd_string` it was like this:

| type | calculation speed | factor | commit |
| --- | --- | --- | --- |
| `float128` | 145.3411 iter/sec | 1 | |
| `cpp_double_double` | 30.4207 iter/sec | 4.77 | 9f34658 |

So you may notice that `float128` performance increased by a factor of 1.1 (10% faster) thanks to removing the string streaming. `cpp_double_double` is also faster, but only a tiny bit: by a factor of 1.03 (3% faster). The ratio between the two actually got worse, from 4.77 to 5.09, meaning that `float128` benefited more from the removal of string streaming than `cpp_double_double` did.


cosurgi commented Jan 16, 2025

I am not sure if lines 140 and 141 on that screenshot are correct. You have `zr2 = pow(zr, zr)`;
shouldn't it be `zr2 = pow(zr, 2)`? Or something like this, but not raising $zr^{zr}$.


ckormanyos commented Jan 17, 2025

> I am not sure if lines 140 and 141 on that screenshot are correct. You have `zr2 = pow(zr, zr)`;
> shouldn't it be `zr2 = pow(zr, 2)`?

You are right, Janek. That was a silly, late-evening, hurried blunder.

When I used the proper `pow(zr, 2)`, the timing was worse, $24\,\mathrm{s}$ compared to $17\,\mathrm{s}$, but not as bad as the previous report.

[screenshot]

[screenshot]

@jzmaddock (Collaborator)

Just curious, did we not optimize the default `pow` function for integer exponents?

Also, just FYI, the `boost::math::pow<N>(x)` function is designed to optimise exactly this case: a power with a constant integer exponent. As far as I know, there is no way within the language to detect that `pow(T, int)` is being called with an integer literal.

@ckormanyos (Member Author)

> Just curious, did we not optimize the default `pow` function for integer exponents?

Yes John, you are right. The generic collection of functions in Multiprecision DOES include specializations of `eval_pow` for pure integral powers.

I am experimenting with a local version, but I am not able to get significantly faster than the default version in Multiprecision, maybe only 10 to 20% faster.

At the moment, I do not see any further clear bottlenecks in the overall performance of cpp_double_fp_backend.

@jzmaddock (Collaborator)

@ckormanyos does this PR improve power performance at all: #649 ?


cosurgi commented Jan 17, 2025

Chris (@ckormanyos), can you share your Mandelbrot benchmark code? I want to make sure that I can reproduce your results. Because if I don't, then we know it's not a problem with cpp_double_fp but with my local configuration.


cosurgi commented Jan 17, 2025

Chris (@ckormanyos), in this post with the Mandelbrot benchmark, which g++ version and optimization flags (`-O3`, `-Ofast`?) did you use to compare cpp_double_double with float128?

@ckormanyos (Member Author)

> the Mandelbrot benchmark

See also: BoostGSoC21#190

Hi Janek (@cosurgi), I have made a dedicated issue for this discussion. In that issue, I will provide the benchmark code and, yes, it does offer the ability to compare bin-float, dec-float, float128 and double-double.

Give me a day or so to prepare a branch of the Mandelbrot for your dedicated use.

Cc: @jzmaddock and @sinandredemption


ckormanyos commented Jan 17, 2025

> does this PR improve power performance at all

Hi John (@jzmaddock). In a word, yes. Treating small powers in that super-fast way is something we should probably do.

Another thing I have been playing around with is a more subtle issue. In my recent pushes here, I have introduced a concept called `mul_unchecked`, and this just cycled green.

In `mul_unchecked` I skip the prologue to multiplication which checks for NaN, infinity, zero and the like, in effect making a pure multiplication that is separate from the `eval_mul` operation.

As it turns out, the floating-point-class checks actually do slow down these tiny backends significantly. We also found this to be relevant for the work in decimal. So a bit further down the evolutionary road I will be separating the raw work from the safety checks in the mul/div operations as well.

So if you have already checked the edge cases in a function like `pow` or `exp` or similar, you can squeeze away the further checks on mul/div inside that function's implementation.

As for your changes there, I think they definitely help all of multiprecision, but I still might end up specializing $x^n$ for the double-float backend if that squeezes out $5\%$ or more, as it seems to in my recent studies.

Cc: @cosurgi


cosurgi commented Jan 17, 2025

I posted the latest YADE benchmark results in BoostGSoC21#190; suddenly it starts to look good with clang.
(initially I posted this here, but then I moved this post over there)

@ckormanyos (Member Author)

Note to self (TODO): hit the edge cases of the new `eval_pow` method.

@ckormanyos (Member Author)

Performance of algebraic functions re-affirmed in BoostGSoC21#190


cosurgi commented Jan 18, 2025

OK, so the bad-performance mystery was solved. I ran the YADE benchmark `yade -n --quickperformance -j 4` on a fairly recent Intel i7-14700KF CPU and the results are good. Some are interesting. We can definitely mark the performance problem of the cpp_double_fp_backend as solved. Now only the compiler developers will have something to talk about :)

Here are the results:

cpp_double_double

| type | compiler | calculation speed | factor |
| --- | --- | --- | --- |
| `cpp_double_double` | g++ 12.2 | 449.15 iter/sec | 1 |
| `float128` | g++ 12.2 | 263.15 iter/sec | 1.70 |
| `cpp_bin_float<32>` | g++ 12.2 | 211.81 iter/sec | 2.12 |
| `cpp_dec_float<31>` | g++ 12.2 | 78.15 iter/sec | 5.74 |
| `mpfr_float_backend<31>` | g++ 12.2 | 51.01 iter/sec | 8.80 |

Here we can see that cpp_double_double beats everyone else by over a factor of two.

cpp_double_long_double

| type | compiler | calculation speed | factor |
| --- | --- | --- | --- |
| `cpp_bin_float<39>` | g++ 12.2 | 122.55 iter/sec | 1 |
| `cpp_double_long_double` | clang++ 19.1.4 | 108.79 iter/sec | 1.12 |
| `cpp_bin_float<39>` | clang++ 19.1.4 | 102.19 iter/sec | 1.20 |
| `cpp_dec_float<39>` | g++ 12.2 | 71.42 iter/sec | 1.71 |
| `mpfr_float_backend<39>` | g++ 12.2 | 45.75 iter/sec | 2.67 |
| `cpp_double_long_double` | g++ 12.2 | 14.97 iter/sec | 8.18 |

Here we can see that `cpp_double_long_double` performs very well. But the compiler developers will have a mystery to solve: `cpp_bin_float<39>` under g++ 12.2 is faster than `cpp_double_long_double` under clang++ 19.1.4 by just a little, which in turn is faster than `cpp_double_long_double` under g++ 12.2 by more than a factor of 7.

cpp_double_float128

| type | compiler | calculation speed | factor |
| --- | --- | --- | --- |
| `cpp_bin_float<67>` | g++ 12.2 | 118.43 iter/sec | 1 |
| `mpfr_float_backend<67>` | g++ 12.2 | 43.34 iter/sec | 2.73 |
| `cpp_dec_float<67>` | g++ 12.2 | 40.09 iter/sec | 2.95 |
| `cpp_double_float128` | g++ 12.2 | 14.99 iter/sec | 7.90 |

Here we can see that `cpp_double_float128` has a lot of potential to beat `cpp_bin_float<67>` once the g++ developers sort out the problem seen with `cpp_double_long_double` under g++ 12.2. The performance increase should be roughly a factor of 8 :)

So all is good. I think we can merge this branch once documentation and other small TODOs are complete.


ckormanyos commented Jan 19, 2025

> We can definitely mark the performance problem of the cpp_double_fp_backend as solved.

Thank you, Janek (@cosurgi), that was a big effort, and it really provided a lot of information and clarity.

Some of the results on `cpp_double_long_double`, where `long double` is the 80-bit, 10-byte type, are interesting. That hardware version of the 10-byte floating-point representation runs on the legendary (modernized) descendants of the i387 FPU, the hardware that really put 10-byte floating-point on the map.

The newer i7 processors have extremely powerful 64-bit floating-point hardware operations, and it seems like these are being very well supported nowadays in hardware and software.

Down the road I will be doing some non-x86_64 measurements on M1 and/or M2 and on a few embedded bare-metal controllers, such as an ARM(R) Cortex(R)-M7 with double-precision FPU support.

All in all, I'm somewhat surprised at how fast `cpp_double_double` ended up in certain hardware/software configurations. As mentioned in previous posts, this backend (and of course that type specifically) has lots of room for optimization improvement.

I'm happy enough with it to make a first release out of this state.

Cc: @sinandredemption and @jzmaddock

@jzmaddock (Collaborator)

There might be one more thing to check: that each of the backend/compiler configurations is doing (roughly) the same amount of work. Something that can happen when there is a tolerance set for termination is that you can hit "unfortunate" parameters which cause the code to thrash through many needless iterations that don't actually get you any closer to the end result. I have no idea whether this is the case here, but because they don't behave quite like exactly rounded IEEE types, things like double-double can easily break assumptions present in the code.
