Compiler incorrectly optimises away dependent code #36

Dantali0n · 2020-12-18T18:44:37Z

Hello, I am writing an FFT algorithm in OpenCL and have found a pretty nasty bug in the ROCm OpenCL implementation. The problem revolves around the l2 variable in the following kernel:

void kernel fft(global double *real, global double *imag, ulong size, ulong power) {
	double c1 = -1.0;
	double c2 = 0.0;
	long l2 = 1;

	for (uint l = 0; l < power; l++) {
		uint l1 = l2;
		l2 <<= 1;
		double u1 = 1.0;
		double u2 = 0.0;

		for (uint j = 0; j < l1; j++) {
			for (uint i = j; i < size; i += l2) {
				uint i1 = i + l1;
				double t1 = u1 * real[i1] - u2 * imag[i1];
				double t2 = u1 * imag[i1] + u2 * real[i1];

				real[i1] = real[i] - t1;
				imag[i1] = imag[i] - t2;
				real[i] += t1;
				imag[i] += t2;
			}
			double z = ((u1 * c1) - (u2 * c2));
			u2 = ((u1 * c2) + (u2 * c1));
			u1 = z;
		}

		double onecm = 1.0 - c1;
		double onecp = 1.0 + c1;
		c2 = sqrt(onecm / 2.0);
		c1 = sqrt(onecp / 2.0);

		c2 = -c2;	
	}
}

This kernel is launched with a global range of 1, so there is no parallelism at all: a single CU, a single SE, a single wavefront. Nevertheless, the kernel produces incorrect results.

I know for sure this is an optimization bug, because forcing l2 to be printed during execution makes the kernel produce correct results. Furthermore, adding -cl-opt-disable to the program build options also resolves the issue!

...
for (uint l = 0; l < power; l++) {
	uint l1 = l2;
	l2 <<= 1;
	printf("l2: %ld\n", l2);
	double u1 = 1.0;
	double u2 = 0.0;
...

Once again, this cannot be due to concurrency issues, as the kernel is launched with:

this->cl_queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, cl::NDRange(1), cl::NullRange);

Setting -WB, -simplifycfg-sink-common=0, as mentioned in the DarkTable issue, does not resolve the issue. Any optimization level above -O0 produces incorrect results.

Please also see: ROCm/ROCm-OpenCL-Runtime#115

I have attached a standalone project with an ard-ocl target, whose source can be found in the oclfft folder. Several test cases for ard-ocl are included in the tests folder, which uses Boost as the unit test framework. They use the FFT function shown above, which produces incorrect results when compared against FFTW. The kernel is launched sequentially, i.e. with a global range of 1. When the same kernel code is run on the CPU instead of through ROCm and OpenCL, the results are correct.

This standalone project makes it possible to isolate the optimization bug and check whether the output is correct.
perf-engineering-project-3d31331f3aa00dc5d800af6e2b2210fcf104234b.tar.gz

FFTW, Boost and CMake are required to run the standalone app.

Dantali0n · 2021-01-07T08:15:12Z

I noticed that the isolated project showcasing the bug contains an error. Here is an updated version:

perf-engineering-project-ard-seq.zip

lamb-j · 2023-03-30T19:13:30Z

This was reportedly fixed here: https://reviews.llvm.org/D82603

Report back if you're still having issues with this though!

Dantali0n · 2023-03-31T06:38:17Z

> This was reportedly fixed here: https://reviews.llvm.org/D82603
>
> Report back if you're still having issues with this though!

Could you briefly explain how this patch fixes this particular loop optimization? I think that would be most interesting.

lamb-j closed this as completed Mar 30, 2023
