This repository has been archived by the owner on May 14, 2024. It is now read-only.

Compiler incorrectly optimises away dependent code #36

Closed
Dantali0n opened this issue Dec 18, 2020 · 3 comments

@Dantali0n

Hello, I am writing an FFT algorithm in OpenCL and have found a pretty nasty bug in the ROCm OpenCL implementation. The problem revolves around the l2 variable in the following kernel:

void kernel fft(global double *real, global double *imag, ulong size, ulong power) {
	double c1 = -1.0;
	double c2 = 0.0;
	long l2 = 1;

	for (uint l = 0; l < power; l++) {
		uint l1 = l2;
		l2 <<= 1;
		double u1 = 1.0;
		double u2 = 0.0;

		for (uint j = 0; j < l1; j++) {
			for (uint i = j; i < size; i += l2) {
				uint i1 = i + l1;
				double t1 = u1 * real[i1] - u2 * imag[i1];
				double t2 = u1 * imag[i1] + u2 * real[i1];

				real[i1] = real[i] - t1;
				imag[i1] = imag[i] - t2;
				real[i] += t1;
				imag[i] += t2;
			}
			double z = ((u1 * c1) - (u2 * c2));
			u2 = ((u1 * c2) + (u2 * c1));
			u1 = z;
		}

		double onecm = 1.0 - c1;
		double onecp = 1.0 + c1;
		c2 = sqrt(onecm / 2.0);
		c1 = sqrt(onecp / 2.0);

		c2 = -c2;	
	}
}

This kernel is launched with a global range of 1, so there is no parallelism at all: a single CU, a single SE, a single wavefront. Even so, the kernel produces incorrect results.

I know for sure this is an optimization bug, because forcing l2 to be printed during execution makes the kernel produce correct results. Furthermore, adding -cl-opt-disable to the program build options also resolves the issue!

...
for (uint l = 0; l < power; l++) {
	uint l1 = l2;
	l2 <<= 1;
	printf("l2: %u\n", l2);
	double u1 = 1.0;
	double u2 = 0.0;
...
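
To check the values printed by the workaround above, the stride sequence the outer loop should produce can be computed on the host. This is a minimal C++ sketch, not code from the attached project; expected_strides is a hypothetical helper name, and the fixed-width types are an assumption that mirrors OpenCL's 32-bit uint and 64-bit long.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the original project): returns the
// (l1, l2) pair that each pass of the kernel's outer loop should observe.
std::vector<std::pair<std::uint32_t, std::int64_t>>
expected_strides(std::uint64_t power) {
    assert(power < 63);  // keep the 64-bit signed shift below from overflowing
    std::vector<std::pair<std::uint32_t, std::int64_t>> out;
    std::int64_t l2 = 1;  // mirrors `long l2 = 1;` in the kernel
    for (std::uint64_t l = 0; l < power; ++l) {
        // mirrors `uint l1 = l2;` including the narrowing to 32 bits
        const std::uint32_t l1 = static_cast<std::uint32_t>(l2);
        l2 <<= 1;
        out.emplace_back(l1, l2);
    }
    return out;
}
```

For power = 3 this yields the pairs (1, 2), (2, 4), (4, 8), so a correct run of the kernel should print l2 as 2, 4, 8.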

Once again, this cannot be due to concurrency issues, as the kernel is launched with:

this->cl_queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, cl::NDRange(1), cl::NullRange);

Setting -WB, -simplifycfg-sink-common=0, as mentioned in the DarkTable issue, does not resolve the problem. Any optimization level above -O0 produces incorrect results.

Please also see: ROCm/ROCm-OpenCL-Runtime#115

I have attached a standalone project with an ard-ocl target, the source for which can be found in the oclfft folder. Several test cases for ard-ocl are included in the tests folder, which uses Boost as the unit test framework. The FFT function shown above is used, but it produces incorrect results when compared against FFTW. The kernel is launched sequentially, i.e. with a dimension of 1. When the kernel code is run on the CPU instead of through ROCm and OpenCL, the results are correct.

This standalone project makes it possible to isolate the optimization bug and to test whether the output is correct.
perf-engineering-project-3d31331f3aa00dc5d800af6e2b2210fcf104234b.tar.gz

FFTW, Boost and CMake are required to run the standalone app.
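
The CPU comparison described above can be approximated without the attached project. The sketch below is a plain C++ port of the kernel's loop nest; the name fft_reference, its signature, and the use of 64-bit indices are assumptions made for illustration and are not taken from the attached sources. Note that the kernel as posted performs only the butterfly passes and contains no bit-reversal step, so this port does not reorder its input either.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU port of the kernel's loop nest (name and signature are
// assumptions). Transforms real/imag in place; size must equal 2^power.
void fft_reference(std::vector<double>& real, std::vector<double>& imag,
                   std::uint64_t power) {
    assert(real.size() == imag.size());
    assert(power < 32);
    const std::uint64_t size = real.size();
    assert(size == (std::uint64_t{1} << power));

    double c1 = -1.0, c2 = 0.0;  // per-stage twiddle step, starts at -1 + 0i
    std::uint64_t l2 = 1;
    for (std::uint64_t l = 0; l < power; ++l) {
        const std::uint64_t l1 = l2;  // half the butterfly span of this stage
        l2 <<= 1;                     // full butterfly span of this stage
        double u1 = 1.0, u2 = 0.0;    // running twiddle factor u1 + i*u2
        for (std::uint64_t j = 0; j < l1; ++j) {
            for (std::uint64_t i = j; i < size; i += l2) {
                const std::uint64_t i1 = i + l1;
                const double t1 = u1 * real[i1] - u2 * imag[i1];
                const double t2 = u1 * imag[i1] + u2 * real[i1];
                real[i1] = real[i] - t1;
                imag[i1] = imag[i] - t2;
                real[i] += t1;
                imag[i] += t2;
            }
            // Advance the twiddle factor: u *= (c1 + i*c2).
            const double z = u1 * c1 - u2 * c2;
            u2 = u1 * c2 + u2 * c1;
            u1 = z;
        }
        // Half-angle update; both lines read the old c1, as in the kernel.
        c2 = -std::sqrt((1.0 - c1) / 2.0);
        c1 = std::sqrt((1.0 + c1) / 2.0);
    }
}
```

Two quick sanity inputs are unaffected by any input reordering: a unit impulse at index 0 should transform to all ones, and a constant input of ones should transform to the length N in bin 0 and zero elsewhere. Running the same inputs through the OpenCL build and comparing against this port is one way to see whether the optimized kernel diverges.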

@Dantali0n
Author

I have noticed that the isolated project showcasing the bug contains an error. Here is an updated version:

perf-engineering-project-ard-seq.zip

@lamb-j
Collaborator

lamb-j commented Mar 30, 2023

This was reportedly fixed here: https://reviews.llvm.org/D82603

Report back if you're still having issues with this though!

@lamb-j lamb-j closed this as completed Mar 30, 2023
@Dantali0n
Author

> This was reportedly fixed here: https://reviews.llvm.org/D82603
>
> Report back if you're still having issues with this though!

Could you briefly explain how this patch fixes this particular loop optimization? I think that would be most interesting.
