Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM64 (aarch64) DRC backend #13162

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from
Draft

Conversation

987123879113
Copy link
Contributor

  • Updated asmjit to latest master because it has some fixes for ARM64
  • Removed the vector-related code from UML and backends. It's not used anywhere that I could find so it's better to just cut it out.
  • Added ARM64 (aarch64) DRC backend

It's a big PR. I've done my best to make it work but I'm sure there are still bugs or questionable implementation choices. It passes all of the same tests I was doing for the previous DRC changes so I think the outputs should be the same as the other DRC backends.

Additional testing and/or misc feedback like benchmarks on various devices would also be appreciated. For hardware I've tested with an M1 Pro CPU and a Raspberry Pi 4 Model B. For software I've been testing using a few games on the Naomi (SH4), Firebeat (PPC), NWK-TR (PPC), and some very small testing of dgpix (Hyperstone E1-32XT).

@cuavas cuavas self-assigned this Jan 2, 2025
@cuavas
Copy link
Member

cuavas commented Jan 2, 2025

I’ll take a look at this when I get a chance.

@987123879113
Copy link
Contributor Author

A small note about my experience cross compiling for Raspberry Pi: I had to change the configuration { "x64" } lines to configuration { } in the build scripts to get it to build with the DRC in my Docker mame_raspberrypi_cross_compile environment. I don't know if it's the environment's fault or if it's a difference between MacOS and Linux and I just didn't implement that part correctly.

> uname -mps
Darwin arm64 arm

crosstoolng@ed182c661828:/$ uname -mps
Linux aarch64 aarch64

@belegdol
Copy link
Contributor

belegdol commented Jan 2, 2025

What happens on 32-bit arm? Looking at the makefile and genie scripts changes, it looks like arm64 backend would get built. Is this what we want?

@987123879113
Copy link
Contributor Author

It shouldn't do anything on 32-bit. asmjit only supports aarch64.

@belegdol
Copy link
Contributor

belegdol commented Jan 2, 2025

asmjit only supports aarch64.

I know. But looking at makefile changes, c backend is no longer forced if arm is found in uname -mps output. This could be a problem. On RPi 1B the output looks as follows:

$ uname -mps
Linux armv6l unknown

On an odroid HC1 NAS as follows:

$ uname -mps
Linux armv7l unknown

@987123879113
Copy link
Contributor Author

That doesn't mean it won't use the C backend though. FORCE_DRC_C_BACKEND just makes it skip the check to see if the platform configuration has a DRC available. For it to not use the C backend it would need to match the platform ("arm"/"arm64") and also have the x64 configuration available.

if not _OPTIONS["FORCE_DRC_C_BACKEND"] then
	if _OPTIONS["BIGENDIAN"]~="1" then
		if (_OPTIONS["PLATFORM"]=="arm" or _OPTIONS["PLATFORM"]=="arm64") then
			configuration { "x64" }
				defines {
					"NATIVE_DRC=drcbe_arm64",
				}
			configuration {  }
...

On MacOS it matches using the ifeq ($(findstring arm,$(UNAME)),arm) check, not the aarch64 one.

@rb6502
Copy link
Contributor

rb6502 commented Jan 3, 2025

MIPS: gauntleg shows an out of memory error with the AArch64 BE but -drc_use_c boots fine. Also sfrush, calsped, and mwskins - same error. (The error is shown by the game, not MAME).

PPC: scud doesn't boot without -drc_use_c.

@rb6502
Copy link
Contributor

rb6502 commented Jan 3, 2025

ppctest results:

Mismatch: instr=MULLW., src1=0x80000000, src2=0x80000000
expected: dest=0x0, XER=0x0, CR=0x20000000
got: dest=0x0, XER=0x0, CR=0x40000000

Mismatch: instr=MULLWO., src1=0x80000000, src2=0x80000000
expected: dest=0x0, XER=0xc0000000, CR=0x30000000
got: dest=0x0, XER=0xc0000000, CR=0x50000000
Test file line #: 488

src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
@987123879113
Copy link
Contributor Author

987123879113 commented Jan 3, 2025

Fixed some more multiplication errors in addition to the ones AJR commented. That fixes the Midway/MIPS-based games RB mentioned. Also tried to emulate the call stack similar to how the C backend does it so that mapvars can be recovered properly which fixes scud. Also rebased against latest master to get rid of the build error.

@cuavas
Copy link
Member

cuavas commented Jan 3, 2025

@seleuco you might want to take a look at this, as ARM CPUs are the predominant targets for Android builds.

@seleuco
Copy link

seleuco commented Jan 3, 2025

@seleuco you might want to take a look at this, as ARM CPUs are the predominant targets for Android builds.

Yes. I was already aware of this pull request. I will start testing and will report back to you.

Copy link
Member

@cuavas cuavas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few relatively minor things in the declarations. I haven’t looked at the code generation yet.

src/devices/cpu/drcbearm64.h Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.h Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.h Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.h Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.h Outdated Show resolved Hide resolved
@rb6502
Copy link
Contributor

rb6502 commented Jan 4, 2025

Next up: pmac6100. -drc_use_c will show the mouse pointer and flashing disk, aarch64 loses track of the return address during an exception and crashes to PC=0x00000004. (And to clarify, you don't need any of my WIP changes to the driver to get that).

@987123879113
Copy link
Contributor Author

987123879113 commented Jan 4, 2025

@rb6502 More mapvar/recover issues. The callstack depth wasn't actually storing the new depth so it would overwrite the hashstacksave pointer sooner than it should've.

I tested the previously mentioned scud and MIPS games and those are still working + pmac6100 is booting as far as it does with the C backend now.

@rb6502
Copy link
Contributor

rb6502 commented Jan 4, 2025

Confirmed full functionality in pmac6100 with that change. I think this is good to go in terms of my regression testing now.

@seleuco
Copy link

seleuco commented Jan 4, 2025

I've successfully cross-compiled the pull request and tested performance on these Android devices without any problem at first sight:

Qualcomm Snapdragon 685 (low-midrange; Geekbench 6 single core: 473)

Game No DRC DRC (C backend) DRC (aarch64)
SF III 3D Strike (cps3) (attract mode) 90% 60% 230%
K. Instinct (attract mode) 35% 30% 150%
Die Hard (ST-V) (attract mode) 34% 19% 44%

Qualcomm Snapdragon 8 Gen 2 (high-end; Geekbench 6 single core: 1595)

Game No DRC DRC (C backend) DRC (aarch64)
SF III 3D Strike (cps3) (attract mode) 250-280% 140-160% 600-800%
K. Instinct (attract mode) 90-110% 55-90% 370-560%
Die Hard (ST-V) (attract mode) 65-95% 40-55% 80-150%
C. Taxi (Naomi) (in game) 30% 20% 50% (60% on performance mode)

A build with these changes is available for testing:

https://drive.google.com/file/d/1aJcKK-ugzOzx_f5ridu-nPR3qnbFDYaC/view?usp=sharing

I don't plan to publish anything until the next version of MAME that includes these changes, obviously.
If you want me to remove this file from my Google Drive, just let me know.

@987123879113
Copy link
Contributor Author

@seleuco Thank you for testing and providing benchmarks. Those are some really significant improvements for Android.

@invertego
Copy link
Contributor

I am happy to report this PR is working fine on a Windows ARM64 device with a Snapdragon X Elite CPU.

A small note about my experience cross compiling for Raspberry Pi: I had to change the configuration { "x64" } lines to configuration { } in the build scripts to get it to build with the DRC in my Docker mame_raspberrypi_cross_compile environment. I don't know if it's the environment's fault or if it's a difference between MacOS and Linux and I just didn't implement that part correctly.

I was just looking at the makefile with the intent of streamlining Win/ARM64 builds, and I think I can shed some light on this.

macOS ARM64 builds use this target:

.PHONY: macosx_arm64_clang
macosx_arm64_clang: generate $(PROJECTDIR)/$(MAKETYPE)-osx-clang/Makefile
	$(SILENT) $(MAKE) $(MAKEPARAMS) -C $(PROJECTDIR)/$(MAKETYPE)-osx-clang config=$(CONFIG)64 precompile
	$(SILENT) $(MAKE) $(MAKEPARAMS) -C $(PROJECTDIR)/$(MAKETYPE)-osx-clang config=$(CONFIG)64

...whereas Linux ARM64 builds use this generic target:

.PHONY: linux
linux: generate $(PROJECTDIR)/$(MAKETYPE)-linux/Makefile
	$(SILENT) $(MAKE) $(MAKEPARAMS) -C $(PROJECTDIR)/$(MAKETYPE)-linux config=$(CONFIG) precompile
	$(SILENT) $(MAKE) $(MAKEPARAMS) -C $(PROJECTDIR)/$(MAKETYPE)-linux config=$(CONFIG)

The macOS recipe passes config=$(CONFIG)64 and the Linux one passes config=$(CONFIG). The macOS target is selected by this late ad-hoc fixup to the ARCHITECTURE variable:

ifneq ($(filter arm64%,$(UNAME_M)),)
ARCHITECTURE := _arm64_clang
else

Interestingly, the macosx_arm64_clang recipe is identical to the macosx_x64_clang recipe, so there's little added value in maintaining it as a separate target.

This leaves me with a couple of questions:

  1. Should Linux/Windows ARM64 builds use a *64 configuration (and should macOS continue to use one)?
  2. Should there be separate _arm64 targets for Linux/Windows?

The answers should clarify the right thing to do in this PR.

scripts/src/3rdparty.lua Outdated Show resolved Hide resolved
@danmons
Copy link

danmons commented Jan 6, 2025

I'm unable to compile this via crosstool-ng on a Linux x86_64 machine for a Linux aarch64 target. This is the same environment I use to package up MAME for Raspberry Pi users, which to date I've been doing with NOASM=1 set at compile time. I suspect at this stage it's a problem with crosstool-ng and/or my setup, but I'm not completely sure.

My build environment is here:

If I remove NOASM=1 from my build scripts, MAME builds, however benchmarking with -drc produces the same result as benchmarking with -drc_use_c (and -nodrc is different, and in line with my old NOASM=1 results). So it seems to not use the arm64 asmjit.

If I try the trick above of replacing configuration { "x64" } with configuration { } in scripts/genie.lua, that seems to correctly set NATIVE_DRC=drcbe_arm64, however I get errors that look like the following during Linking mame when linking with GNU ld:

undefined reference to `asmjit::_abi_1_13::BaseEmitter::_emitI(unsigned int, asmjit::_abi_1_13::Operand_ const&)'

GNU gold produces similar output. The asmjit documentation suggests that ASMJIT_STATIC is not set, but I can see that it is in various generated makefiles and when compiling with VERBOSE=1.

Interestingly I see that libasmjit.a has only !<arch> inside it and nothing else, and likewise build/projects/sdl/mame/gmake-linux/obj/Release/3rdparty/asmjit/src/asmjit only has directories and no object files. It seems like asmjit isn't being built perhaps?

I'm out of my depth here a bit, but it almost feels like asmjit isn't picking up the cross-compile environment and arm64 target instructions, and things are getting confused at compile/link time. Other than the linker errors, there are no compile time errors relating to asmjit or drc or anything similar.

I'll keep poking at it, but I'm running out of ideas on what to try next.

@seleuco
Copy link

seleuco commented Jan 6, 2025

Linux aarch64 target

You're probably getting the linkage error because you haven't modified the script in 3rdparty.lua and you haven't removed "configuration { "x64" }" there as well.

But that's not the way to go. The real problem you're having is that you're not running a rule that calls config=$(CONFIG)64 so the x64 configuration isn't being selected.

I personally haven't checked the Linux aarch64 target so I don't know where the problem is right now.

You definitely need to avoid NOASM=1 because it forces the use of FORCE_DRC_C_BACKEND no matter what you do and asmjit won't be included.

@987123879113
Copy link
Contributor Author

987123879113 commented Jan 6, 2025

@danmons seleuco is right, it's due to the same issue with the config string stuff. That'll eventually need to be address for this PR I think but I'm not very comfortable with makefiles and genie and such so it'll have to be discussed and decided on how exactly to approach that.

Having said that, I was using your build environment (inside Docker) and with some changes I did get it to work with the DRC after a few failed attempts. Here's the .patch file I made for testing with your Docker build environment:

diff --git a/scripts/genie.lua b/scripts/genie.lua
index bcdbc82236f..f6a5abc5486 100644
--- a/scripts/genie.lua
+++ b/scripts/genie.lua
@@ -711,11 +711,10 @@ end
 if not _OPTIONS["FORCE_DRC_C_BACKEND"] then
 	if _OPTIONS["BIGENDIAN"]~="1" then
 		if (_OPTIONS["PLATFORM"]=="arm" or _OPTIONS["PLATFORM"]=="arm64") then
-			configuration { "x64" }
+			configuration {  }
 				defines {
 					"NATIVE_DRC=drcbe_arm64",
 				}
-			configuration {  }
 		else
 			configuration { "x64" }
 				defines {
diff --git a/scripts/src/3rdparty.lua b/scripts/src/3rdparty.lua
index e56665ef121..6466637d760 100755
--- a/scripts/src/3rdparty.lua
+++ b/scripts/src/3rdparty.lua
@@ -1949,7 +1949,7 @@ project "asmjit"
 		}

 	if (_OPTIONS["PLATFORM"]=="arm" or _OPTIONS["PLATFORM"]=="arm64") then
-		configuration { "x64" }
+		configuration { }
 			defines {
 				"ASMJIT_NO_X86",
 			}
diff --git a/src/devices/cpu/drcbearm64.cpp b/src/devices/cpu/drcbearm64.cpp
index 5a895392852..6e0b989b8b8 100644
--- a/src/devices/cpu/drcbearm64.cpp
+++ b/src/devices/cpu/drcbearm64.cpp
@@ -826,6 +826,8 @@ drcbe_arm64::drcbe_arm64(drcuml_state &drcuml, device_t &device, drc_cache &cach
 	, m_baseptr(cache.near() + 0x80)
 	, m_near(*(near_state *)cache.alloc_near(sizeof(m_near)))
 {
+	printf("Using ARM DRC\n");
+
 	// get pointers to C functions we need to call
 	using debugger_hook_func = void (*)(device_debug *, offs_t);
 	static const debugger_hook_func debugger_inst_hook = [] (device_debug *dbg, offs_t pc) { dbg->instruction_hook(pc); };

And also removed NOASM=1 from functions/compile since that forces FORCE_DRC_C_BACKEND as seleuco said.

The printf there in drcbearm64.cpp was to be sure it was actually using the ARM DRC.

My guess is you need to also edit scripts/src/3rdparty.lua to get it working for you.

@danmons
Copy link

danmons commented Jan 6, 2025

You definitely need to avoid NOASM=1 because it forces the use of FORCE_DRC_C_BACKEND no matter what you do and asmjit won't be included.

I don't think I was clear on that - yup, NOASM=1 was removed for all of my testing/output above. Only mentioned it because it's still there in my repo (as prior to this, DRC was at best missing, and at worst caused compile time errors which caused me delays in getting builds out).

Thank you both for the notes. I'll try them out shortly.

[edit]

Ooh yeah that sorted it. Running all of these with -bench 90:

Raspberry Pi 4B, 1.5GHz ARM Cortex-A72 (older model, newer RPi4s clock at 1.8GHz):
--
sfiii3n -nodrc     : 63.70%
sfiii3n -drc_use_c : 30.35%
sfiii3n -drc       : 226.03%
--
kinst -nodrc       : 27.17%
kinst -drc_use_c   : 16.76%
kinst -drc         : 125.12% 
--
diehard -nodrc     : 22.57%
diehard -drc_use_c : 8.30% 
diehard -drc       : 32.55%
--

Orange Pi 5B, 2.3GHz ARM Cortex-A76 (roughly equivalent to an RPi5 which is 2.4GHz by spec):
--
sfiii3n -nodrc     : 160.42%
sfiii3n -drc_use_c : 87.72%
sfiii3n -drc       : 526.73%
--
kinst -nodrc       : 62.81%
kinst -drc_use_c   : 48.77%
kinst -drc         : 273.50%
--
diehard -nodrc     : 53.59%
diehard -drc_use_c : 27.00% 
diehard -drc       : 71.13%
--

@sonicboy904
Copy link

I have tested some games out on my Ayn Odin 2 Pro and i have seen a lotta small and big improvement in performance for games that ran poorly on MAME4droid or a lil jump in speed.

Here are links to my tests for the ARM64 beta for MAME4droid.

Panic Park - https://youtu.be/025FNrFFo6M?si=XLUwAjCc8GycoIl5
California Speed - https://youtu.be/K7_z6FM7Rx0?si=YYTQOzKES5bKre-N
NFL Blitz - https://youtu.be/NzdQh7bFCWM?si=rlLqVQNfJh1xDsIR
Final Furlong 2 - https://youtu.be/NzdQh7bFCWM?si=rlLqVQNfJh1xDsIR

AYN Odin 2 Pro Specs

RAM: 12GB
GPU: Adreno 740
Processor: Snapdragon 8 Gen 2
Storage: 256GB + 512GB (micro SD)
OS: Android 13

@cuavas
Copy link
Member

cuavas commented Jan 7, 2025

This leaves me with a couple of questions:

1. Should Linux/Windows ARM64 builds use a *64 configuration (and should macOS continue to use one)?

2. Should there be separate _arm64 targets for Linux/Windows?

The answers should clarify the right thing to do in this PR.

Philosophically, I’d say that Linux and Windows systems, where 32-bit ARM binaries are supported, should probably be using a *64 configuration for AArch64 builds. However, macOS no longer supports 32-bit applications at all, so there may not be any point keeping the *64 there.

@invertego
Copy link
Contributor

FYI, the latest Windows 11 feature update (24H2) dropped support for 32-bit ARM applications.

@987123879113
Copy link
Contributor Author

987123879113 commented Jan 8, 2025

I pushed a change that I think should probably clear up the build issues. I tested macOS (which reports Darwin arm64 arm) and with the Raspberry Pi build environment (which reports Linux aarch64 aarch64).

5683a37

@cuavas
Copy link
Member

cuavas commented Jan 9, 2025

FYI, the latest Windows 11 feature update (24H2) dropped support for 32-bit ARM applications.

What happened to Microsoft’s famous backwards compatibility?

Copy link
Member

@cuavas cuavas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve started going through the actual code generation part. It’s going to take several cycles for me to get through all of it. Please bear with me with stupid or obvious questions.

src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.h Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
@k2-git
Copy link
Contributor

k2-git commented Jan 10, 2025

FYI, the latest Windows 11 feature update (24H2) dropped support for 32-bit ARM applications.

What happened to Microsoft’s famous backwards compatibility?

I think that not Microsoft decided.
ARM announced only support 64-bit.

https://newsroom.arm.com/news/pushing-the-boundaries-of-performance-and-security-to-unleash-the-power-of-64-bit-computing

Edit:
ARM recommended all application 64-bit only transition.
https://newsroom.arm.com/blog/64-bit

987123879113 and others added 2 commits January 11, 2025 10:46
The stack needs to be 16-byte aligned before the call to
debug_log_hashjmp_fail.
Copy link
Member

@cuavas cuavas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the things this time are potential optimisations. Quite a few of them are fairly straightforward. There are a couple of more serious things.

  • The delegate resolver doesn’t account for this pointer adjustment, non-virtual member functions, or the MSVC ABI.
  • There are some vestigial naked argument values.
  • The simple optimisations to use bit field manipulation instructions should be straightforward.
  • Optimising address calculations to use adrp may or may not be worth the effort of adding 1.5 more cases.
  • Optimising ROLAND/ROLINS to use bit field manipulation instructions would substantially complicate the code. I don’t know if we want to go there at this point.
  • Conditionally allocating labels is just something I’m curious about – I haven’t looked at what asmjit actually does when allocating or default constructing a label, so I don’t know whether it would be worthwhile.

The only remaining thing I haven’t reviewed is the ROLC/RORC code generation. I’m pretty sure it can be optimised, I’m just not going to be able to think it through tonight.

src/devices/cpu/drcbearm64.cpp Show resolved Hide resolved
Comment on lines +820 to +834
[] (resolved_handler &handler, address_space &space, auto accessor)
{
struct {
uintptr_t ptr;
ptrdiff_t adj;
} equiv;

assert(sizeof(accessor) == sizeof(equiv));
*reinterpret_cast<decltype(accessor) *>(&equiv) = accessor;

handler.obj = uintptr_t(reinterpret_cast<uint8_t *>(&space));

auto const vptr = *reinterpret_cast<uint8_t const *const *>(handler.obj) + equiv.ptr;
handler.func = *reinterpret_cast<uint8_t *const *>(vptr);
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested this with both virtual and non-virtual member functions? It should also apply this pointer adjustment to deal with multiple inheritance Essentially, this is supposed to be a special-cased version of delegate_mfp_itanium::convert_to_generic combined with enough of delegate_mfp_msvc to get it to work with the Microsoft ABI.

See delegate.cpp for the full-fat versions: https://github.com/mamedev/mame/blob/master/src/lib/util/delegate.cpp

I would have thought it should be something like this (note that it’s different to the x86-64 version because the virtual function flag is stored in the low bit of the this pointer offset for ARM-like architectures, as the low bit of a code pointer indicates a Thumb function and the offset needs to be shifted):

			[] (resolved_handler &handler, address_space &space, auto accessor)
			{
				if (MAME_DELEGATE_USE_TYPE == MAME_DELEGATE_TYPE_ITANIUM)
				{
					struct { uintptr_t ptr; ptrdiff_t adj; } equiv;
					assert(sizeof(accessor) == sizeof(equiv));
					*reinterpret_cast<decltype(accessor) *>(&equiv) = accessor;
					handler.obj = uintptr_t(reinterpret_cast<uint8_t *>(&space) + (equiv.adj >> 1));
					if (BIT(equiv.adj, 0))
					{
						auto const vptr = *reinterpret_cast<u8 const *const *>(handler.obj) + equiv.ptr;
						handler.func = *reinterpret_cast<uint8_t *const *>(vptr);
					}
					else
					{
						handler.func = reinterpret_cast<uint8_t *>(equiv.ptr);
					}
				}
				else if (MAME_DELEGATE_USE_TYPE == MAME_DELEGATE_TYPE_MSVC)
				{
					// interpret the pointer to member function ignoring the virtual inheritance variant
					struct single { uintptr_t ptr; };
					struct multi { uintptr_t ptr; int adj; };
					struct { uintptr_t ptr; int adj; int vadj; int vindex; } unknown;
					assert(sizeof(accessor) <= sizeof(unknown));
					*reinterpret_cast<decltype(accessor) *>(&unknown) = accessor;
					uint32_t const *func = reinterpret_cast<uint32_t const *>(unknown.ptr);
					handler.obj = uintptr_t(&space);
					if ((sizeof(unknown) == sizeof(accessor)) && unknown.vindex)
					{
						handler.obj += unknown.vadj;
						auto const vptr = *reinterpret_cast<uint8_t const *const *>(handler.obj);
						handler.obj += *reinterpret_cast<int const *>(vptr + unknown.vindex);
					}
					if (sizeof(single) < sizeof(accessor))
						handler.obj += unknown.adj;

					// walk past thunks
					while (true)
					{
						if ((0x90000010 == (func[0] & 0x9f00001f)) && (0x91000210 == (func[1] & 0xffc003ff)) && (0xd61f0200 == func[2]))
						{
							// page-relative jump with +/-4GB reach - adrp xip0,... ; add xip0,xip0,#... ; br xip0
							int64_t const page =
									(uint64_t(func[0] & 0x60000000) >> 17) |
									(uint64_t(func[0] & 0x00ffffe0) << 9) |
									((func[0] & 0x00800000) ? (~std::uint64_t(0) << 33) : 0);
							uint32_t const offset = (func[1] & 0x003ffc00) >> 10;
							func = reinterpret_cast<uint32_t const *>(((uintptr_t(func) + page) & (~uintptr_t(0) << 12)) + offset);
						}
						else if ((0xf9400010 == func[0]) && (0xf9400210 == (func[1] & 0xffc003ff)) && (0xd61f0200 == func[2]))
						{
							// virtual function call thunk - ldr xip0,[x0] ; ldr xip0,[x0,#...] ; br xip0
							uint32_t const *const *const vptr = *reinterpret_cast<uint32_t const *const *const *>(object);
							func = vptr[(func[1] & 0x003ffc00) >> 10];
						}
						else
						{
							// not something we can easily bypass
							break;
						}
					}
					handler.func = reinterpret_cast<uint8_t *>(uintptr_t(func));
				}
			};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That went so far over my head it's in outer space. I completely do not understand this so I can't verify if it's correct or not.

I think the code I had is pretty close to MAME_DELEGATE_TYPE_ITANIUM, and checking MAME_DELEGATE_USE_TYPE it does get set to MAME_DELEGATE_TYPE_ITANIUM, although I wasn't using adj. I can say at least that the code I was using did work with the resolved reader/writer functions and that's primarily what I was using. I also did test the non-resolver ones previously and those worked too.

I have no idea how to test MAME_DELEGATE_TYPE_MSVC though. Maybe I could install Win 11 ARM in a VM and test that out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the code is basically copy/pasted from delegate.cpp with the impossible paths cut out, so it probably works. I should probably put this stuff somewhere separate so it can be used independently of the delegate classes. I hacked it into the x86-64 DRC backend as a quick optimisation.

If you really want to test it with the MSVC ABI, you’d need to build with either MSVC itself or with clang set to use the MSVC ABI (the clang-cl wrapper that Visual Studio uses to compile with clang does this), but I wouldn’t worry too much. MSVC is basically in “marginally supported” status, and people are even less likely to be using Visual Studio on an ARM PC. If it’s broken, someone will complain eventually, and then I can worry about it.

Comment on lines +975 to +976
// compute the base by aligning the cache top to a cache line (assumed to be 64 bytes)
uint8_t *dst = (uint8_t *)(uint64_t(m_cache.top() + 63) & ~63);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a sanity check, I looked up cache line sizes for a few CPUs:

  • Apple M series: 128 bytes
  • Cortex-A series (including Kryo): 64 bytes
  • Cortex-M series: 32 bytes

Stick with 64-byte alignment, or bump it up to 128 as the biggest widespread size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this repo that can get the cache line size from the OS: https://github.com/NickStrupat/CacheLineSize

I tested it on my M1 Pro and it correctly gives 128.

On my Raspberry Pi 4B (Arm Cortex-A72) gives 64.

Not ARM, but my Ryzen 5700X gives 64 which appears to be correct from what I could find on Google.

What do you think of pulling in that code for this? I don't know if it's something that you'd want to turn into a util function and so it could be used for the other DRC backends too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll add it to 3rdparty. Looks simple enough. I’ll worry about this later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linked code has no error checking at all. I added helpers for macOS, Windows and Linux in ff92d10. I’m not sure how they’ll behave on systems with little.BIG architecture (efficiency cores and performance cores with different characteristics), but they shouldn’t make anything worse.

src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Show resolved Hide resolved
src/devices/cpu/drcbearm64.cpp Outdated Show resolved Hide resolved
Comment on lines +2369 to +2370
void drcbe_arm64::op_roland(a64::Assembler &a, const uml::instruction &inst)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth detecting cases where this can be special-cased to use a ubfx or ubfiz instruction? If the mask and shift are immediate:

  • ubfx can be used if the mask is right-aligned (i.e. starts at bit position zero)
  • ubfiz can be used if the mask is contiguous and starts at the bit position equal to the shift value

For example:

  • ROLAND(d, s, 3, 0x003f) is equivalent to ubfx(d, s, bits - 3, 6)
  • ROLAND(d, s, 5, 0x07e0) is equivalent to ubfiz(d, s, 5, 6)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real-world PowerPC RLW* usage will probably fall into one of those two cases a majority of the time, and that's not even counting the hard-coded instances the DRC generates for instructions like MFCR.

I'm not sure how far into the weeds we want to go with optimizations before landing the first working version, but this seems safe and useful.

Copy link
Contributor Author

@987123879113 987123879113 Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you can optimize ROLAND any more than it is now. You can't simplify the command down to just one ubfx/ubfiz because you need to rotate the value.

Here's what the current implementation does:

  • If the mask is 0 then the value doesn't change
  • If the shift is a constant then it gets simplified to a ror
  • The final command is always just a single AND of the rotated value

The only way you could probably optimize it even further is to check if srcp is a constant and if then just calculate it as a constant and a mov to the register.

Comment on lines +2431 to +2432
void drcbe_arm64::op_rolins(a64::Assembler &a, const uml::instruction &inst)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth detecting cases where this can be special-cased to use a bfi instruction? If the mask and shift are immediate, bfi can be used if the mask is contiguous and starts at the bit position equal to the shift value.

For example:

  • ROLINS(d, s, 7, 0x0380) is equivalent to bfi(d, s, 7, 3)

In cases where the shift and mask are immediate, the mask is contiguous, and the mask does not start at the bit position equal to the shift value, you can use ror followed by bfi.

For example, ROLINS(d, s, 4, 0x1f80) is equivalent to ror(tmp, s, 3) followed by bfi(d, tmp, 7, 6)

Copy link
Contributor Author

@987123879113 987123879113 Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth detecting cases where this can be special-cased to use a bfi instruction? If the mask and shift are immediate, bfi can be used if the mask is contiguous and starts at the bit position equal to the shift value.

Two things to note:

But the use of bfi to simplify the second half of the code probably will work so I'll try it.

Copy link
Contributor Author

@987123879113 987123879113 Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My implementation of the bfi thing broke rolins. bfi requires the value to start at bit 0 to insert it into the destination register, so it just seems like too much effort to save 1 instruction so I'm going to revert that change.


if (shiftp.is_immediate())
{
const auto shift = ((inst.size() * 8) - (shiftp.immediate() % (inst.size() * 8))) % (inst.size() * 8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to optimise this expression – assuming the size is always a power of two, isn’t it equivalent to:

			const auto shift = ~shiftp.immediate() & ((inst.size() * 8) - 1);

Copy link
Contributor Author

@987123879113 987123879113 Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I threw some numbers at it but they're not equivalent. Your version is consistently off by 1.

int main(int argc, char **argv)
{
    const uint32_t instsize = 4;
    const uint32_t val = 0x400;

    for (int shiftp = 0; shiftp < 8; shiftp++)
    {
        uint32_t shiftcalc1 = ((instsize * 8) - (shiftp % (instsize * 8))) % (instsize * 8);
        uint32_t shiftcalc2 = ~shiftp & ((instsize * 8) - 1);

        uint32_t res1 = rotr_32(val, shiftcalc1);
        uint32_t res2 = rotr_32(val, shiftcalc2);
        uint32_t expected = rotl_32(val, shiftp);

        printf("%d %d %d | %08x %08x %08x\n", shiftp, shiftcalc1, shiftcalc2, expected, res1, res2);
    }

    return 0;
}

result

0 0 31 | 00000400 00000400 00000800
1 31 30 | 00000800 00000800 00001000
2 30 29 | 00001000 00001000 00002000
3 29 28 | 00002000 00002000 00004000
4 28 27 | 00004000 00004000 00008000
5 27 26 | 00008000 00008000 00010000
6 26 25 | 00010000 00010000 00020000
7 25 24 | 00020000 00020000 00040000

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well you can fix the off-by-one by using twos-complement negation rather than inversion:

			const auto shift = -int64_t(shiftp.immediate()) & ((inst.size() * 8) - 1);

It should still be cleaner than needing the bit count in three places and requiring two modulo operations.

Copy link
Member

@cuavas cuavas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just suggested optimisation of ROLC/RORC with immediate shift. You can get them down to just five bit field manipulation instructions.

Sorry, haven’t had time to write up optimisation of ROLAND/ROLINS tonight, or to look at optimising ROLC/RORC with variable shift.

Comment on lines +349 to +363
const uint64_t pagebase = codeoffs & ~make_bitmask<uint64_t>(12);
const int64_t pagerel = (int64_t)ptr - pagebase;
if (is_valid_immediate(abs(pagerel), 12) || ((abs(pagerel) & 0xfff) == 0 && is_valid_immediate(abs(pagerel) >> 12, 12)))
{
a.brk(1);

a.adrp(MEM_SCRATCH_REG, pagebase);

if (is_valid_immediate_signed(pagerel, 9))
return arm::Mem(MEM_SCRATCH_REG, pagerel);
else if (pagerel > 0 && emit_add_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
return arm::Mem(MEM_SCRATCH_REG);
else if (pagerel < 0 && emit_sub_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
return arm::Mem(MEM_SCRATCH_REG);
}
Copy link
Member

@cuavas cuavas Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn’t seem to be right. You’re using adrp on pagebase which is the page base of the current PC for the generated code. Wouldn’t that always produce a zero immediate? You want to use it on the target address.

	const uint64_t pagebase = codeoffs & ~make_bitmask<uint64_t>(12);
	const int64_t pagerel = (int64_t)ptr - pagebase;
	if (is_valid_immediate_signed(pagerel, 33))
	{
		const uint64_t targetpage = ptr & ~make_bitmask<uint64_t>(12);
		const uint64_t pageoffs = ptr & util::make_bitmask<uint64_t>(12);

		a.adrp(MEM_SCRATCH_REG, targetpage);

		if (is_valid_immediate_signed(pageoffs, 9))
		{
			return arm::Mem(MEM_SCRATCH_REG, pageoffs);
		}
		else
		{
			a.add(MEM_SCRATCH_REG, MEM_SCRATCH_REG, pageoffs);
			return arm::Mem(MEM_SCRATCH_REG);
		}
	}

You know the offset relative to the target page will always be a valid 12-bit unsigned value, as it’s just the low twelve bits of the value. It’s always a positive offset from the start of the target page. You don’t need to check for a negative value.

Comment on lines +494 to +515
const uint64_t pagebase = codeoffs & ~make_bitmask<uint64_t>(12);
const int64_t pagerel = (int64_t)ptr - pagebase;
if (is_valid_immediate(abs(pagerel), 12) || ((abs(pagerel) & 0xfff) == 0 && is_valid_immediate(abs(pagerel) >> 12, 12)))
{
a.adrp(MEM_SCRATCH_REG, pagebase);

if (is_valid_immediate_signed(pagerel, 9))
{
a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG, pagerel));
return;
}
else if (pagerel > 0 && emit_add_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
{
a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
return;
}
else if (pagerel < 0 && emit_sub_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
{
a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
return;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the other one – this seems to be using adrp for the page of the current PC, which will always give zero, rather than for the target address.

Comment on lines +1677 to +1686
a.mrs(TEMP_REG1, a64::Predicate::SysReg::kNZCV);

a.bfc(TEMP_REG1, 28, 4); // clear upper 4 status bits

a.and_(TEMP_REG2, FLAGS_REG, 0b1100); // zero + sign
a.ubfx(TEMP_REG3, FLAGS_REG, 1, 1); // overflow flag
a.orr(TEMP_REG2, TEMP_REG2, TEMP_REG3);
a.bfi(TEMP_REG1, TEMP_REG2, 28, 4);

a.msr(a64::Predicate::SysReg::kNZCV, TEMP_REG1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don’t need the a.bfc(TEMP_REG1, 28, 4); any more – the a.bfi(TEMP_REG1, TEMP_REG2, 28, 4); will completely replace the contents of the upper four bits.

Comment on lines +1794 to +1803
a.mrs(TEMP_REG1, a64::Predicate::SysReg::kNZCV);

a.bfc(TEMP_REG1, 28, 4); // clear upper 4 status bits

a.and_(TEMP_REG2, FLAGS_REG, 0b1100); // zero + sign
a.ubfx(TEMP_REG3, FLAGS_REG, 1, 1); // overflow flag
a.orr(TEMP_REG2, TEMP_REG2, TEMP_REG3);
a.bfi(TEMP_REG1, TEMP_REG2, 28, 4);

a.msr(a64::Predicate::SysReg::kNZCV, TEMP_REG1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here – the bfc isn’t necessary any more.

Comment on lines +3466 to +3467
void drcbe_arm64::op_rolc(a64::Assembler &a, const uml::instruction &inst)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can special-case this if the shift is an immediate and do it with bit manipulation operations. Let’s illustrate with an 8-bit example, and a 3-bit rotation.

Before:

x  |7|6|5|4|3|2|1|0|
f  |c|

Desired state after y = ROLC(x, 3):

y  |4|3|2|1|0|c|7|6|
f  |5|

Now obviously, we don’t have a native operation that does this, but we can synthesise it:

Before:

x  |7|6|5|4|3|2|1|0|
f  |c|

Copy desired carry flag output to temp ubfx t,x,8-3,1:

x  |7|6|5|4|3|2|1|0|
t  |5|
f  |c|

Rotate upper bits around ubfx y,x,8-3+1,3-1:

x  |7|6|5|4|3|2|1|0|
y  |-|-|-|-|-|-|7|6|
t  |5|
f  |c|

Insert current carry bfi y,f,3-1,1:

x  |7|6|5|4|3|2|1|0|
y  |-|-|-|-|-|c|7|6|
t  |5|
f  |c|

Rotate lower bits around bfi y,x,3,8-3:

x  |7|6|5|4|3|2|1|0|
y  |4|3|2|1|0|c|7|6|
t  |5|
f  |c|

Now you’ve got the desired output in y and the desired carry flag in t. Assuming I understand correctly that carry is bit 0 of the flags register, conceptually it’s:

ubfx   carry,src1p,bits-shift,1               ; copy desired carry
ubfx   output,src1p,bits-shift+1,shift-1      ; rotate upper bits around
bfi    output,FLAGS_REG,shift-1,1             ; insert current carry
bfi    output,src1p,shift,bits-shift          ; rotate lower bits around
bfi    FLAGS_REG,carry,0,1                    ; copy carry to flags register

So the code to do this would be something like:

...
	if (src2p.is_immediate())
	{
		const auto s = src2p.immediate() % (inst.size() * 8);

		if (s != 0)
		{
			a.ubfx(carry, param1, (inst.size() * 8) - s, 1);
			a.ubfx(output, param1, (inst.size() * 8) - s + 1, s - 1);
			a.bfi(output.x(), FLAGS_REG, s - 1, 1);
			a.bfi(output, param1, shift, (inst.size() * 8) - s);
			a.bfi(FLAGS_REG, carry.x(), 0, 1);
		}
		else
		{
			...
		}
	}
	else
	{
		const a64::Gp shift = src2p.select_register(TEMP_REG2, inst.size());
		mov_reg_param(a, inst.size(), shift, src2p);
...

Comment on lines +3532 to +3533
void drcbe_arm64::op_rorc(a64::Assembler &a, const uml::instruction &inst)
{
Copy link
Member

@cuavas cuavas Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do the same optimisation as for ROLC. Let’s illustrate with an 8-bit example, and a 3-bit rotation again.

Before:

x  |7|6|5|4|3|2|1|0|
f  |c|

Desired state after y = RORC(x, 3):

x  |1|0|c|7|6|5|4|3|
f  |2|

To synthesise it:

Before:

x  |1|0|c|7|6|5|4|3|
f  |2|

Copy desired carry flag output to temp ubfx t,x,3-1,1:

x  |7|6|5|4|3|2|1|0|
t  |2|
f  |c|

Rotate upper bits around ubfx y,x,3,8-3:

x  |7|6|5|4|3|2|1|0|
y  |-|-|-|7|6|5|4|3|
t  |2|
f  |c|

Insert current carry bfi y,f,8-3,1:

x  |7|6|5|4|3|2|1|0|
y  |-|-|c|7|6|5|4|3|
t  |2|
f  |c|

Rotate lower bits around bfi y,x,8-3+1,3-1:

x  |7|6|5|4|3|2|1|0|
y  |1|0|c|7|6|5|4|3|
t  |2|
f  |c|

So the code to do this would be something like:

...
	if (src2p.is_immediate())
	{
		const auto s = src2p.immediate() % (inst.size() * 8);

		if (s != 0)
		{
			a.ubfx(carry, param1, s - 1, 1);
			a.ubfx(output, param1, s, (inst.size() * 8) - s);
			a.bfi(output.x(), FLAGS_REG, (inst.size() * 8) - s, 1);
			a.bfi(output, param1, (inst.size() * 8) - s + 1, s - 1);
			a.bfi(FLAGS_REG, carry.x(), 0, 1);
		}
		else
		{
			...
		}
	}
	else
	{
		const a64::Gp shift = src2p.select_register(TEMP_REG2, inst.size());
		mov_reg_param(a, inst.size(), shift, src2p);
...

@cuavas
Copy link
Member

cuavas commented Jan 12, 2025

Sorry, had to edit two of the review comments due to errors in the suggested code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants