ARM64 (aarch64) DRC backend #13162
Conversation
I’ll take a look at this when I get a chance.
A small note about my experience cross compiling for Raspberry Pi: I had to change the
What happens on 32-bit ARM? Looking at the makefile and GENie script changes, it looks like the arm64 backend would get built. Is this what we want?
It shouldn't do anything on 32-bit. asmjit only supports aarch64.
I know. But looking at the makefile changes, the C backend is no longer forced if
On an ODROID-HC1 NAS as follows:
That doesn't mean it won't use the C backend though.
On macOS it matches using the
MIPS: gauntleg shows an out-of-memory error with the AArch64 backend, but -drc_use_c boots fine. sfrush, calsped, and mwskins show the same error. (The error is shown by the game, not MAME.) PPC: scud doesn't boot without -drc_use_c.
ppctest results:
Mismatch: instr=MULLW., src1=0x80000000, src2=0x80000000
Mismatch: instr=MULLWO., src1=0x80000000, src2=0x80000000
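For context, 0x80000000 is INT32_MIN, whose square does not fit in a signed 32-bit result; a standalone C++ sketch (not MAME code) of the expected MULLW/MULLWO behaviour for that operand pair:

#include <cstdint>
#include <cstdio>

int main()
{
    // The failing operands from ppctest, interpreted as signed 32-bit values.
    const int32_t src1 = INT32_MIN; // 0x80000000
    const int32_t src2 = INT32_MIN; // 0x80000000

    // MULLW keeps the low 32 bits of the signed 64-bit product.
    const int64_t product = int64_t(src1) * int64_t(src2); // 0x4000000000000000
    const int32_t low = int32_t(uint64_t(product));        // 0x00000000

    // MULLWO additionally flags overflow when the product does not fit in
    // 32 bits, i.e. when sign-extending the low word loses information.
    const bool overflow = product != int64_t(low);

    printf("low=%08x overflow=%d\n", uint32_t(low), int(overflow));
    return 0;
}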
Fixed some more multiplication errors in addition to the ones AJR commented on. That fixes the Midway/MIPS-based games RB mentioned. I also tried to emulate the call stack similarly to how the C backend does it so that mapvars can be recovered properly, which fixes
@seleuco you might want to take a look at this, as ARM CPUs are the predominant targets for Android builds.
Yes. I was already aware of this pull request. I will start testing and will report back to you.
Just a few relatively minor things in the declarations. I haven’t looked at the code generation yet.
Next up: pmac6100. -drc_use_c will show the mouse pointer and flashing disk; aarch64 loses track of the return address during an exception and crashes to PC=0x00000004. (And to clarify, you don't need any of my WIP changes to the driver to get that.)
@rb6502 More mapvar/recover issues. The callstack depth tracking wasn't actually storing the new depth, so it would overwrite the hashstacksave pointer sooner than it should've. I tested the previously mentioned
Confirmed full functionality in pmac6100 with that change. I think this is good to go in terms of my regression testing now.
I've successfully cross-compiled the pull request and tested performance on these Android devices without any problems at first sight:
- Qualcomm Snapdragon 685 (low-midrange; Geekbench 6 single core: 473)
- Qualcomm Snapdragon 8 Gen 2 (high-end; Geekbench 6 single core: 1595)
A build with these changes is available for testing: https://drive.google.com/file/d/1aJcKK-ugzOzx_f5ridu-nPR3qnbFDYaC/view?usp=sharing
I don't plan to publish anything until the next version of MAME that includes these changes is released, obviously.
@seleuco Thank you for testing and providing benchmarks. Those are some really significant improvements for Android.
I am happy to report this PR is working fine on a Windows ARM64 device with a Snapdragon X Elite CPU.
I was just looking at the makefile with the intent of streamlining Win/ARM64 builds, and I think I can shed some light on this. macOS ARM64 builds use this target:
...whereas Linux ARM64 builds use this generic target:
The macOS recipe passes
Interestingly, the
This leaves me with a couple of questions:
The answers should clarify the right thing to do in this PR.
I'm unable to compile this via
My build environment is here:
If I remove
If I try the trick above of replacing
GNU gold produces similar output.
The asmjit documentation suggests that
Interestingly I see that
I'm out of my depth here a bit, but it almost feels like asmjit isn't picking up the cross-compile environment and arm64 target instructions, and things are getting confused at compile/link time. Other than the linker errors, there are no compile-time errors relating to asmjit or the DRC or anything similar. I'll keep poking at it, but I'm running out of ideas on what to try next.
You're probably getting the linkage error because you haven't modified the script in 3rdparty.lua and you haven't removed "configuration { "x64" }" there as well. But that's not the way to go. The real problem you're having is that you're not running a rule that calls config=$(CONFIG)64, so the x64 configuration isn't being selected. I personally haven't checked the Linux aarch64 target, so I don't know where the problem is right now. You definitely need to avoid NOASM=1 because it forces FORCE_DRC_C_BACKEND no matter what you do, and asmjit won't be included.
@danmons seleuco is right, it's due to the same issue with the config string stuff. That'll eventually need to be addressed for this PR, I think, but I'm not very comfortable with makefiles and GENie and such, so how exactly to approach that will have to be discussed and decided. Having said that, I was using your build environment (inside Docker) and with some changes I did get it to work with the DRC after a few failed attempts. Here's the .patch file I made for testing with your Docker build environment:
diff --git a/scripts/genie.lua b/scripts/genie.lua
index bcdbc82236f..f6a5abc5486 100644
--- a/scripts/genie.lua
+++ b/scripts/genie.lua
@@ -711,11 +711,10 @@ end
if not _OPTIONS["FORCE_DRC_C_BACKEND"] then
if _OPTIONS["BIGENDIAN"]~="1" then
if (_OPTIONS["PLATFORM"]=="arm" or _OPTIONS["PLATFORM"]=="arm64") then
- configuration { "x64" }
+ configuration { }
defines {
"NATIVE_DRC=drcbe_arm64",
}
- configuration { }
else
configuration { "x64" }
defines {
diff --git a/scripts/src/3rdparty.lua b/scripts/src/3rdparty.lua
index e56665ef121..6466637d760 100755
--- a/scripts/src/3rdparty.lua
+++ b/scripts/src/3rdparty.lua
@@ -1949,7 +1949,7 @@ project "asmjit"
}
if (_OPTIONS["PLATFORM"]=="arm" or _OPTIONS["PLATFORM"]=="arm64") then
- configuration { "x64" }
+ configuration { }
defines {
"ASMJIT_NO_X86",
}
diff --git a/src/devices/cpu/drcbearm64.cpp b/src/devices/cpu/drcbearm64.cpp
index 5a895392852..6e0b989b8b8 100644
--- a/src/devices/cpu/drcbearm64.cpp
+++ b/src/devices/cpu/drcbearm64.cpp
@@ -826,6 +826,8 @@ drcbe_arm64::drcbe_arm64(drcuml_state &drcuml, device_t &device, drc_cache &cach
, m_baseptr(cache.near() + 0x80)
, m_near(*(near_state *)cache.alloc_near(sizeof(m_near)))
{
+ printf("Using ARM DRC\n");
+
// get pointers to C functions we need to call
using debugger_hook_func = void (*)(device_debug *, offs_t);
static const debugger_hook_func debugger_inst_hook = [] (device_debug *dbg, offs_t pc) { dbg->instruction_hook(pc); };
And also removed
The printf there in drcbearm64.cpp was to be sure it was actually using the ARM DRC.
My guess is you need to also edit
I don't think I was clear on that - yup,
Thank you both for the notes. I'll try them out shortly.
[edit] Ooh yeah, that sorted it. Running all of these with
I have tested some games on my AYN Odin 2 Pro and I have seen everything from small to big performance improvements in games that ran poorly on MAME4droid. Here are links to my tests of the ARM64 beta for MAME4droid.
Panic Park - https://youtu.be/025FNrFFo6M?si=XLUwAjCc8GycoIl5
AYN Odin 2 Pro specs
RAM: 12GB
Philosophically, I’d say that Linux and Windows systems, where 32-bit ARM binaries are supported, should probably be using a *64 configuration for AArch64 builds. However, macOS no longer supports 32-bit applications at all, so there may not be any point keeping the *64 there. |
FYI, the latest Windows 11 feature update (24H2) dropped support for 32-bit ARM applications. |
I pushed a change that I think should probably clear up the build issues. I tested macOS (which reports |
What happened to Microsoft’s famous backwards compatibility? |
I’ve started going through the actual code generation part. It’s going to take several cycles for me to get through all of it. Please bear with me if I ask stupid or obvious questions.
I think it was not Microsoft that decided. Edit:
The stack needs to be 16-byte aligned before the call to debug_log_hashjmp_fail.
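For reference, AAPCS64 requires the stack pointer to be 16-byte aligned at every external call; a small illustrative helper (hypothetical, not from the PR) for rounding an ad-hoc frame size up to that boundary:

#include <cstdint>

// AAPCS64: SP % 16 == 0 must hold at every external call, so any ad-hoc
// frame built before calling a C helper has to round its size up to 16.
constexpr uint64_t aligned_frame_size(uint64_t bytes)
{
    return (bytes + 15) & ~uint64_t(15);
}

static_assert(aligned_frame_size(8) == 16, "pad a single pushed register");
static_assert(aligned_frame_size(24) == 32, "pad three pushed registers");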
Most of the things this time are potential optimisations. Quite a few of them are fairly straightforward. There are a couple of more serious things.
- The delegate resolver doesn’t account for this pointer adjustment, non-virtual member functions, or the MSVC ABI.
- There are some vestigial naked argument values.
- The simple optimisations to use bit field manipulation instructions should be straightforward.
- Optimising address calculations to use adrp may or may not be worth the effort of adding 1.5 more cases.
- Optimising ROLAND/ROLINS to use bit field manipulation instructions would substantially complicate the code. I don’t know if we want to go there at this point.
- Conditionally allocating labels is just something I’m curious about – I haven’t looked at what asmjit actually does when allocating or default constructing a label, so I don’t know whether it would be worthwhile.
The only remaining thing I haven’t reviewed is the ROLC/RORC code generation. I’m pretty sure it can be optimised, I’m just not going to be able to think it through tonight.
[] (resolved_handler &handler, address_space &space, auto accessor)
{
    struct {
        uintptr_t ptr;
        ptrdiff_t adj;
    } equiv;

    assert(sizeof(accessor) == sizeof(equiv));
    *reinterpret_cast<decltype(accessor) *>(&equiv) = accessor;

    handler.obj = uintptr_t(reinterpret_cast<uint8_t *>(&space));

    auto const vptr = *reinterpret_cast<uint8_t const *const *>(handler.obj) + equiv.ptr;
    handler.func = *reinterpret_cast<uint8_t *const *>(vptr);
};
Have you tested this with both virtual and non-virtual member functions? It should also apply this pointer adjustment to deal with multiple inheritance. Essentially, this is supposed to be a special-cased version of delegate_mfp_itanium::convert_to_generic combined with enough of delegate_mfp_msvc to get it to work with the Microsoft ABI.
See delegate.cpp for the full-fat versions: https://github.com/mamedev/mame/blob/master/src/lib/util/delegate.cpp
I would have thought it should be something like this (note that it’s different to the x86-64 version because the virtual function flag is stored in the low bit of the this pointer offset for ARM-like architectures, as the low bit of a code pointer indicates a Thumb function and the offset needs to be shifted):
[] (resolved_handler &handler, address_space &space, auto accessor)
{
if (MAME_DELEGATE_USE_TYPE == MAME_DELEGATE_TYPE_ITANIUM)
{
struct { uintptr_t ptr; ptrdiff_t adj; } equiv;
assert(sizeof(accessor) == sizeof(equiv));
*reinterpret_cast<decltype(accessor) *>(&equiv) = accessor;
handler.obj = uintptr_t(reinterpret_cast<uint8_t *>(&space) + (equiv.adj >> 1));
if (BIT(equiv.adj, 0))
{
auto const vptr = *reinterpret_cast<u8 const *const *>(handler.obj) + equiv.ptr;
handler.func = *reinterpret_cast<uint8_t *const *>(vptr);
}
else
{
handler.func = reinterpret_cast<uint8_t *>(equiv.ptr);
}
}
else if (MAME_DELEGATE_USE_TYPE == MAME_DELEGATE_TYPE_MSVC)
{
// interpret the pointer to member function ignoring the virtual inheritance variant
struct single { uintptr_t ptr; };
struct multi { uintptr_t ptr; int adj; };
struct { uintptr_t ptr; int adj; int vadj; int vindex; } unknown;
assert(sizeof(accessor) <= sizeof(unknown));
*reinterpret_cast<decltype(accessor) *>(&unknown) = accessor;
uint32_t const *func = reinterpret_cast<uint32_t const *>(unknown.ptr);
handler.obj = uintptr_t(&space);
if ((sizeof(unknown) == sizeof(accessor)) && unknown.vindex)
{
handler.obj += unknown.vadj;
auto const vptr = *reinterpret_cast<uint8_t const *const *>(handler.obj);
handler.obj += *reinterpret_cast<int const *>(vptr + unknown.vindex);
}
if (sizeof(single) < sizeof(accessor))
handler.obj += unknown.adj;
// walk past thunks
while (true)
{
if ((0x90000010 == (func[0] & 0x9f00001f)) && (0x91000210 == (func[1] & 0xffc003ff)) && (0xd61f0200 == func[2]))
{
// page-relative jump with +/-4GB reach - adrp xip0,... ; add xip0,xip0,#... ; br xip0
int64_t const page =
(uint64_t(func[0] & 0x60000000) >> 17) |
(uint64_t(func[0] & 0x00ffffe0) << 9) |
((func[0] & 0x00800000) ? (~std::uint64_t(0) << 33) : 0);
uint32_t const offset = (func[1] & 0x003ffc00) >> 10;
func = reinterpret_cast<uint32_t const *>(((uintptr_t(func) + page) & (~uintptr_t(0) << 12)) + offset);
}
else if ((0xf9400010 == func[0]) && (0xf9400210 == (func[1] & 0xffc003ff)) && (0xd61f0200 == func[2]))
{
// virtual function call thunk - ldr xip0,[x0] ; ldr xip0,[x0,#...] ; br xip0
uint32_t const *const *const vptr = *reinterpret_cast<uint32_t const *const *const *>(handler.obj);
func = vptr[(func[1] & 0x003ffc00) >> 10];
}
else
{
// not something we can easily bypass
break;
}
}
handler.func = reinterpret_cast<uint8_t *>(uintptr_t(func));
}
};
That went so far over my head it's in outer space. I completely do not understand this, so I can't verify whether it's correct or not.
I think the code I had is pretty close to MAME_DELEGATE_TYPE_ITANIUM, and checking MAME_DELEGATE_USE_TYPE it does get set to MAME_DELEGATE_TYPE_ITANIUM, although I wasn't using adj. I can say at least that the code I was using did work with the resolved reader/writer functions, and that's primarily what I was using. I also tested the non-resolver ones previously and those worked too.
I have no idea how to test MAME_DELEGATE_TYPE_MSVC though. Maybe I could install Win 11 ARM in a VM and test that out?
Well, the code is basically copy/pasted from delegate.cpp with the impossible paths cut out, so it probably works. I should probably put this stuff somewhere separate so it can be used independently of the delegate classes. I hacked it into the x86-64 DRC backend as a quick optimisation.
If you really want to test it with the MSVC ABI, you’d need to build with either MSVC itself or with clang set to use the MSVC ABI (the clang-cl wrapper that Visual Studio uses to compile with clang does this), but I wouldn’t worry too much. MSVC is basically in “marginally supported” status, and people are even less likely to be using Visual Studio on an ARM PC. If it’s broken, someone will complain eventually, and then I can worry about it.
// compute the base by aligning the cache top to a cache line (assumed to be 64 bytes)
uint8_t *dst = (uint8_t *)(uint64_t(m_cache.top() + 63) & ~63);
Just as a sanity check, I looked up cache line sizes for a few CPUs:
- Apple M series: 128 bytes
- Cortex-A series (including Kryo): 64 bytes
- Cortex-M series: 32 bytes
Stick with 64-byte alignment, or bump it up to 128 as the biggest widespread size?
I found this repo that can get the cache line size from the OS: https://github.com/NickStrupat/CacheLineSize
I tested it on my M1 Pro and it correctly gives 128.
On my Raspberry Pi 4B (ARM Cortex-A72) it gives 64.
Not ARM, but my Ryzen 5700X gives 64, which appears to be correct from what I could find on Google.
What do you think of pulling in that code for this? I don't know if it's something that you'd want to turn into a util function and so it could be used for the other DRC backends too.
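For reference, a minimal sketch of the OS queries such a helper boils down to on macOS and Linux (not the linked library's code, and with a conservative fallback):

#include <cstddef>

#if defined(__APPLE__)
#include <sys/sysctl.h>
#elif defined(__linux__)
#include <unistd.h>
#endif

std::size_t cache_line_size()
{
#if defined(__APPLE__)
    std::size_t size = 0; // hw.cachelinesize is a 64-bit integer on arm64
    std::size_t len = sizeof(size);
    if (!sysctlbyname("hw.cachelinesize", &size, &len, nullptr, 0) && size)
        return size;
#elif defined(__linux__)
    const long size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (size > 0)
        return std::size_t(size);
#endif
    return 64; // fall back to the most common line size
}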
I’ll add it to 3rdparty. Looks simple enough. I’ll worry about this later.
The linked code has no error checking at all. I added helpers for macOS, Windows and Linux in ff92d10. I’m not sure how they’ll behave on systems with big.LITTLE architecture (efficiency cores and performance cores with different characteristics), but they shouldn’t make anything worse.
void drcbe_arm64::op_roland(a64::Assembler &a, const uml::instruction &inst)
{
Is it worth detecting cases where this can be special-cased to use a ubfx or ubfiz instruction? If the mask and shift are immediate:
- ubfx can be used if the mask is right-aligned (i.e. starts at bit position zero)
- ubfiz can be used if the mask is contiguous and starts at the bit position equal to the shift value
For example:
- ROLAND(d, s, 3, 0x003f) is equivalent to ubfx(d, s, bits - 3, 6)
- ROLAND(d, s, 5, 0x07e0) is equivalent to ubfiz(d, s, 5, 6)
Real-world PowerPC RLW* usage will probably fall into one of those two cases a majority of the time, and that's not even counting the hard-coded instances the DRC generates for instructions like MFCR.
I'm not sure how far into the weeds we want to go with optimizations before landing the first working version, but this seems safe and useful.
I don't think you can optimize ROLAND any more than it is now. You can't simplify the command down to just one ubfx/ubfiz because you need to rotate the value.
Here's what the current implementation does:
- If the mask is 0 then the value doesn't change
- If the shift is a constant then it gets simplified to a ror
- The final command is always just a single AND of the rotated value
The only way you could probably optimize it further is to check whether srcp is a constant and, if so, just calculate the result as a constant and mov it to the register.
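For reference, the behaviour being simplified here is just rotate-then-mask; a minimal standalone model of the 4-byte ROLAND (hand-written, not MAME's implementation):

#include <cstdint>

static uint32_t rotl32(uint32_t v, unsigned s) { s &= 31; return s ? (v << s) | (v >> (32 - s)) : v; }

// Reference model of the 4-byte UML ROLAND: rotate left, then AND with the mask.
// A constant shift lowers to a single ror; the mask is one and_ at the end.
static uint32_t uml_roland32(uint32_t src, uint32_t shift, uint32_t mask)
{
    return rotl32(src, shift & 31) & mask;
}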
void drcbe_arm64::op_rolins(a64::Assembler &a, const uml::instruction &inst)
{
Is it worth detecting cases where this can be special-cased to use a bfi instruction? If the mask and shift are immediate, bfi can be used if the mask is contiguous and starts at the bit position equal to the shift value.
For example:
- ROLINS(d, s, 7, 0x0380) is equivalent to bfi(d, s, 7, 3)
In cases where the shift and mask are immediate, the mask is contiguous, and the mask does not start at the bit position equal to the shift value, you can use ror followed by bfi.
For example, ROLINS(d, s, 4, 0x1f80) is equivalent to ror(tmp, s, 3) followed by bfi(d, tmp, 7, 6).
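To sanity-check the ror-plus-bfi case on the host, a standalone sketch (not MAME code; rotl32, rotr32 and bfi32 are hand-rolled stand-ins for the instruction semantics):

#include <cstdint>
#include <cstdio>

static uint32_t rotl32(uint32_t v, unsigned s) { s &= 31; return s ? (v << s) | (v >> (32 - s)) : v; }
static uint32_t rotr32(uint32_t v, unsigned s) { s &= 31; return s ? (v >> s) | (v << (32 - s)) : v; }

// bfi d,n,lsb,width: replace d[lsb +: width] with the low `width` bits of n.
static uint32_t bfi32(uint32_t d, uint32_t n, unsigned lsb, unsigned width)
{
    const uint32_t mask = ((uint32_t(1) << width) - 1) << lsb;
    return (d & ~mask) | ((n << lsb) & mask);
}

int main()
{
    // ROLINS(d, s, 4, 0x1f80): rotate s left 4, insert into d under mask 0x1f80.
    // Claimed equivalent: ror(tmp, s, 3) then bfi(d, tmp, 7, 6).
    for (uint32_t i = 0; i < 0x10000; i++)
    {
        const uint32_t s = i * 0x9e3779b1u, d = i * 0x85ebca77u;
        const uint32_t rolins = (d & ~0x1f80u) | (rotl32(s, 4) & 0x1f80u);
        const uint32_t viabfi = bfi32(d, rotr32(s, 3), 7, 6);
        if (rolins != viabfi)
        {
            printf("mismatch at i=%u\n", i);
            return 1;
        }
    }
    printf("ROLINS(4, 0x1f80) matches ror 3 + bfi(7, 6)\n");
    return 0;
}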
Is it worth detecting cases where this can be special-cased to use a bfi instruction? If the mask and shift are immediate, bfi can be used if the mask is contiguous and starts at the bit position equal to the shift value.
Two things to note:
- If the rolins function is called with all immediates and does not use flags, it gets optimized out (rolins -> roland -> various simplifications for roland). This applies to most opcodes if they use constants: https://github.com/mamedev/mame/blob/master/src/devices/cpu/uml.cpp#L436-L474
- My current implementation will simplify shift immediate down to ror or mov as appropriate already
But the use of bfi to simplify the second half of the code probably will work, so I'll try it.
My implementation of the bfi thing broke rolins. bfi requires the value to start at bit 0 to insert it into the destination register, so it seems like too much effort to save one instruction; I'm going to revert that change.
if (shiftp.is_immediate())
{
    const auto shift = ((inst.size() * 8) - (shiftp.immediate() % (inst.size() * 8))) % (inst.size() * 8);
It should be possible to optimise this expression – assuming the size is always a power of two, isn’t it equivalent to:
const auto shift = ~shiftp.immediate() & ((inst.size() * 8) - 1);
I threw some numbers at it but they're not equivalent. Your version is consistently off by 1.
#include <cstdint>
#include <cstdio>

// stand-ins for the rotl_32/rotr_32 helpers MAME provides in osdcomm.h
static uint32_t rotl_32(uint32_t v, unsigned s) { s &= 31; return s ? (v << s) | (v >> (32 - s)) : v; }
static uint32_t rotr_32(uint32_t v, unsigned s) { s &= 31; return s ? (v >> s) | (v << (32 - s)) : v; }

int main(int argc, char **argv)
{
    const uint32_t instsize = 4;
    const uint32_t val = 0x400;
    for (int shiftp = 0; shiftp < 8; shiftp++)
    {
        uint32_t shiftcalc1 = ((instsize * 8) - (shiftp % (instsize * 8))) % (instsize * 8);
        uint32_t shiftcalc2 = ~shiftp & ((instsize * 8) - 1);
        uint32_t res1 = rotr_32(val, shiftcalc1);
        uint32_t res2 = rotr_32(val, shiftcalc2);
        uint32_t expected = rotl_32(val, shiftp);
        printf("%d %d %d | %08x %08x %08x\n", shiftp, shiftcalc1, shiftcalc2, expected, res1, res2);
    }
    return 0;
}
Result:
0 0 31 | 00000400 00000400 00000800
1 31 30 | 00000800 00000800 00001000
2 30 29 | 00001000 00001000 00002000
3 29 28 | 00002000 00002000 00004000
4 28 27 | 00004000 00004000 00008000
5 27 26 | 00008000 00008000 00010000
6 26 25 | 00010000 00010000 00020000
7 25 24 | 00020000 00020000 00040000
Well, you can fix the off-by-one by using two's-complement negation rather than inversion:
const auto shift = -int64_t(shiftp.immediate()) & ((inst.size() * 8) - 1);
It should still be cleaner than needing the bit count in three places and requiring two modulo operations.
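For completeness, a standalone check (assuming a 4-byte operand size) that the negation form matches the original double-modulo expression across the shift range:

#include <cstdint>
#include <cstdio>

int main()
{
    const uint32_t bits = 32; // inst.size() * 8 for the 4-byte case
    for (uint64_t shiftp = 0; shiftp < 256; shiftp++)
    {
        const uint64_t modulo = (bits - (shiftp % bits)) % bits;
        const uint64_t negate = uint64_t(-int64_t(shiftp)) & (bits - 1);
        if (modulo != negate)
        {
            printf("mismatch at %u\n", unsigned(shiftp));
            return 1;
        }
    }
    printf("equivalent for all tested shift values\n");
    return 0;
}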
Mostly just suggested optimisation of ROLC/RORC with immediate shift. You can get them down to just five bit field manipulation instructions.
Sorry, haven’t had time to write up optimisation of ROLAND/ROLINS tonight, or to look at optimising ROLC/RORC with variable shift.
const uint64_t pagebase = codeoffs & ~make_bitmask<uint64_t>(12);
const int64_t pagerel = (int64_t)ptr - pagebase;
if (is_valid_immediate(abs(pagerel), 12) || ((abs(pagerel) & 0xfff) == 0 && is_valid_immediate(abs(pagerel) >> 12, 12)))
{
    a.brk(1);

    a.adrp(MEM_SCRATCH_REG, pagebase);

    if (is_valid_immediate_signed(pagerel, 9))
        return arm::Mem(MEM_SCRATCH_REG, pagerel);
    else if (pagerel > 0 && emit_add_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
        return arm::Mem(MEM_SCRATCH_REG);
    else if (pagerel < 0 && emit_sub_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
        return arm::Mem(MEM_SCRATCH_REG);
}
This doesn’t seem to be right. You’re using adrp on pagebase, which is the page base of the current PC for the generated code. Wouldn’t that always produce a zero immediate? You want to use it on the target address.
const uint64_t pagebase = codeoffs & ~make_bitmask<uint64_t>(12);
const int64_t pagerel = (int64_t)ptr - pagebase;
if (is_valid_immediate_signed(pagerel, 33))
{
const uint64_t targetpage = ptr & ~make_bitmask<uint64_t>(12);
const uint64_t pageoffs = ptr & util::make_bitmask<uint64_t>(12);
a.adrp(MEM_SCRATCH_REG, targetpage);
if (is_valid_immediate_signed(pageoffs, 9))
{
return arm::Mem(MEM_SCRATCH_REG, pageoffs);
}
else
{
a.add(MEM_SCRATCH_REG, MEM_SCRATCH_REG, pageoffs);
return arm::Mem(MEM_SCRATCH_REG);
}
}
You know the offset relative to the target page will always be a valid 12-bit unsigned value, as it’s just the low twelve bits of the value. It’s always a positive offset from the start of the target page. You don’t need to check for a negative value.
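As a concrete example of the page math (the address is arbitrary):

// For ptr = 0x00007f1234567a9a:
//   targetpage = ptr & ~0xfff = 0x00007f1234567000
//   pageoffs   = ptr &  0xfff = 0xa9a  (always a valid unsigned 12-bit page offset)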
const uint64_t pagebase = codeoffs & ~make_bitmask<uint64_t>(12);
const int64_t pagerel = (int64_t)ptr - pagebase;
if (is_valid_immediate(abs(pagerel), 12) || ((abs(pagerel) & 0xfff) == 0 && is_valid_immediate(abs(pagerel) >> 12, 12)))
{
    a.adrp(MEM_SCRATCH_REG, pagebase);

    if (is_valid_immediate_signed(pagerel, 9))
    {
        a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG, pagerel));
        return;
    }
    else if (pagerel > 0 && emit_add_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
    {
        a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
        return;
    }
    else if (pagerel < 0 && emit_sub_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
    {
        a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
        return;
    }
}
Same as the other one – this seems to be using adrp for the page of the current PC, which will always give zero, rather than for the target address.
a.mrs(TEMP_REG1, a64::Predicate::SysReg::kNZCV);

a.bfc(TEMP_REG1, 28, 4); // clear upper 4 status bits

a.and_(TEMP_REG2, FLAGS_REG, 0b1100); // zero + sign
a.ubfx(TEMP_REG3, FLAGS_REG, 1, 1); // overflow flag
a.orr(TEMP_REG2, TEMP_REG2, TEMP_REG3);
a.bfi(TEMP_REG1, TEMP_REG2, 28, 4);

a.msr(a64::Predicate::SysReg::kNZCV, TEMP_REG1);
You don’t need the a.bfc(TEMP_REG1, 28, 4); any more – the a.bfi(TEMP_REG1, TEMP_REG2, 28, 4); will completely replace the contents of the upper four bits.
a.mrs(TEMP_REG1, a64::Predicate::SysReg::kNZCV);

a.bfc(TEMP_REG1, 28, 4); // clear upper 4 status bits

a.and_(TEMP_REG2, FLAGS_REG, 0b1100); // zero + sign
a.ubfx(TEMP_REG3, FLAGS_REG, 1, 1); // overflow flag
a.orr(TEMP_REG2, TEMP_REG2, TEMP_REG3);
a.bfi(TEMP_REG1, TEMP_REG2, 28, 4);

a.msr(a64::Predicate::SysReg::kNZCV, TEMP_REG1);
Same here – the bfc isn’t necessary any more.
void drcbe_arm64::op_rolc(a64::Assembler &a, const uml::instruction &inst)
{
You can special-case this if the shift is an immediate and do it with bit manipulation operations. Let’s illustrate with an 8-bit example, and a 3-bit rotation.
Before:
x |7|6|5|4|3|2|1|0|
f |c|
Desired state after y = ROLC(x, 3):
y |4|3|2|1|0|c|7|6|
f |5|
Now obviously, we don’t have a native operation that does this, but we can synthesise it:
Before:
x |7|6|5|4|3|2|1|0|
f |c|
Copy desired carry flag output to temp ubfx t,x,8-3,1:
x |7|6|5|4|3|2|1|0|
t |5|
f |c|
Rotate upper bits around ubfx y,x,8-3+1,3-1:
x |7|6|5|4|3|2|1|0|
y |-|-|-|-|-|-|7|6|
t |5|
f |c|
Insert current carry bfi y,f,3-1,1:
x |7|6|5|4|3|2|1|0|
y |-|-|-|-|-|c|7|6|
t |5|
f |c|
Rotate lower bits around bfi y,x,3,8-3:
x |7|6|5|4|3|2|1|0|
y |4|3|2|1|0|c|7|6|
t |5|
f |c|
Now you’ve got the desired output in y and the desired carry flag in t. Assuming I understand correctly that carry is bit 0 of the flags register, conceptually it’s:
ubfx carry,src1p,bits-shift,1 ; copy desired carry
ubfx output,src1p,bits-shift+1,shift-1 ; rotate upper bits around
bfi output,FLAGS_REG,shift-1,1 ; insert current carry
bfi output,src1p,shift,bits-shift ; rotate lower bits around
bfi FLAGS_REG,carry,0,1 ; copy carry to flags register
So the code to do this would be something like:
...
if (src2p.is_immediate())
{
const auto s = src2p.immediate() % (inst.size() * 8);
if (s != 0)
{
a.ubfx(carry, param1, (inst.size() * 8) - s, 1);
a.ubfx(output, param1, (inst.size() * 8) - s + 1, s - 1);
a.bfi(output.x(), FLAGS_REG, s - 1, 1);
a.bfi(output, param1, s, (inst.size() * 8) - s);
a.bfi(FLAGS_REG, carry.x(), 0, 1);
}
else
{
...
}
}
else
{
const a64::Gp shift = src2p.select_register(TEMP_REG2, inst.size());
mov_reg_param(a, inst.size(), shift, src2p);
...
void drcbe_arm64::op_rorc(a64::Assembler &a, const uml::instruction &inst)
{
You can do the same optimisation as for ROLC. Let’s illustrate with an 8-bit example, and a 3-bit rotation again.
Before:
x |7|6|5|4|3|2|1|0|
f |c|
Desired state after y = RORC(x, 3):
y |1|0|c|7|6|5|4|3|
f |2|
To synthesise it:
Before:
x |7|6|5|4|3|2|1|0|
f |c|
Copy desired carry flag output to temp ubfx t,x,3-1,1:
x |7|6|5|4|3|2|1|0|
t |2|
f |c|
Rotate upper bits around ubfx y,x,3,8-3:
x |7|6|5|4|3|2|1|0|
y |-|-|-|7|6|5|4|3|
t |2|
f |c|
Insert current carry bfi y,f,8-3,1:
x |7|6|5|4|3|2|1|0|
y |-|-|c|7|6|5|4|3|
t |2|
f |c|
Rotate lower bits around bfi y,x,8-3+1,3-1:
x |7|6|5|4|3|2|1|0|
y |1|0|c|7|6|5|4|3|
t |2|
f |c|
So the code to do this would be something like:
...
if (src2p.is_immediate())
{
const auto s = src2p.immediate() % (inst.size() * 8);
if (s != 0)
{
a.ubfx(carry, param1, s - 1, 1);
a.ubfx(output, param1, s, (inst.size() * 8) - s);
a.bfi(output.x(), FLAGS_REG, (inst.size() * 8) - s, 1);
a.bfi(output, param1, (inst.size() * 8) - s + 1, s - 1);
a.bfi(FLAGS_REG, carry.x(), 0, 1);
}
else
{
...
}
}
else
{
const a64::Gp shift = src2p.select_register(TEMP_REG2, inst.size());
mov_reg_param(a, inst.size(), shift, src2p);
...
Sorry, had to edit two of the review comments due to errors in the suggested code.
It's a big PR. I've done my best to make it work but I'm sure there are still bugs or questionable implementation choices. It passes all of the same tests I was doing for the previous DRC changes so I think the outputs should be the same as the other DRC backends.
Additional testing and/or miscellaneous feedback like benchmarks on various devices would also be appreciated. For hardware, I've tested with an M1 Pro CPU and a Raspberry Pi 4 Model B. For software, I've tested a few games on the Naomi (SH4), Firebeat (PPC), and NWK-TR (PPC), plus some very light testing of dgpix (Hyperstone E1-32XT).