ARM64 (aarch64) DRC backend #13162
Conversation
I’ll take a look at this when I get a chance.
A small note about my experience cross compiling for Raspberry Pi: I had to change the
What happens on 32-bit ARM? Looking at the makefile and GENie script changes, it looks like the arm64 backend would get built. Is this what we want?
It shouldn't do anything on 32-bit. asmjit only supports aarch64.
I know. But looking at the makefile changes, the C backend is no longer forced if
On an ODROID-HC1 NAS as follows:
That doesn't mean it won't use the C backend though.
On macOS it matches using the
MIPS: gauntleg shows an out-of-memory error with the AArch64 backend, but -drc_use_c boots fine. sfrush, calsped, and mwskins show the same error. (The error is shown by the game, not MAME.) PPC: scud doesn't boot without -drc_use_c.
ppctest results:
Mismatch: instr=MULLW., src1=0x80000000, src2=0x80000000
Mismatch: instr=MULLWO., src1=0x80000000, src2=0x80000000
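For context, 0x80000000 is INT32_MIN, whose square does not fit in a signed 32-bit result; a standalone C++ sketch (not MAME code) of the expected MULLW/MULLWO behaviour for that operand pair:

#include <cstdint>
#include <cstdio>

int main()
{
    // The failing operands from ppctest, interpreted as signed 32-bit values.
    const int32_t src1 = INT32_MIN; // 0x80000000
    const int32_t src2 = INT32_MIN; // 0x80000000

    // MULLW keeps the low 32 bits of the signed 64-bit product.
    const int64_t product = int64_t(src1) * int64_t(src2); // 0x4000000000000000
    const int32_t low = int32_t(uint64_t(product));        // 0x00000000

    // MULLWO additionally flags overflow when the product does not fit in
    // 32 bits, i.e. when sign-extending the low word loses information.
    const bool overflow = product != int64_t(low);

    printf("low=%08x overflow=%d\n", uint32_t(low), int(overflow));
    return 0;
}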
Fixed some more multiplication errors in addition to the ones AJR commented on. That fixes the Midway/MIPS-based games RB mentioned. I also tried to emulate the call stack similarly to how the C backend does it so that mapvars can be recovered properly, which fixes
@seleuco you might want to take a look at this, as ARM CPUs are the predominant targets for Android builds.
Yes. I was already aware of this pull request. I will start testing and will report back to you.
Just a few relatively minor things in the declarations. I haven’t looked at the code generation yet.
Next up: pmac6100. -drc_use_c will show the mouse pointer and flashing disk; aarch64 loses track of the return address during an exception and crashes to PC=0x00000004. (And to clarify, you don't need any of my WIP changes to the driver to get that.)
@rb6502 More mapvar/recover issues. The callstack depth tracking wasn't actually storing the new depth, so it would overwrite the hashstacksave pointer sooner than it should've. I tested the previously mentioned
Confirmed full functionality in pmac6100 with that change. I think this is good to go in terms of my regression testing now.
I've successfully cross-compiled the pull request and tested performance on these Android devices without any problems at first sight:
- Qualcomm Snapdragon 685 (low-midrange; Geekbench 6 single core: 473)
- Qualcomm Snapdragon 8 Gen 2 (high-end; Geekbench 6 single core: 1595)
A build with these changes is available for testing: https://drive.google.com/file/d/1aJcKK-ugzOzx_f5ridu-nPR3qnbFDYaC/view?usp=sharing
I don't plan to publish anything until the next version of MAME that includes these changes is released, obviously.
@seleuco Thank you for testing and providing benchmarks. Those are some really significant improvements for Android.
I am happy to report this PR is working fine on a Windows ARM64 device with a Snapdragon X Elite CPU.
I was just looking at the makefile with the intent of streamlining Win/ARM64 builds, and I think I can shed some light on this. macOS ARM64 builds use this target:
...whereas Linux ARM64 builds use this generic target:
The macOS recipe passes
Interestingly, the
This leaves me with a couple of questions:
The answers should clarify the right thing to do in this PR.
I'm unable to compile this via
My build environment is here:
If I remove
If I try the trick above of replacing
GNU gold produces similar output.
The asmjit documentation suggests that
Interestingly I see that
I'm out of my depth here a bit, but it almost feels like asmjit isn't picking up the cross-compile environment and arm64 target instructions, and things are getting confused at compile/link time. Other than the linker errors, there are no compile-time errors relating to asmjit or the DRC or anything similar. I'll keep poking at it, but I'm running out of ideas on what to try next.
You're probably getting the linkage error because you haven't modified the script in 3rdparty.lua and you haven't removed "configuration { "x64" }" there as well. But that's not the way to go. The real problem you're having is that you're not running a rule that calls config=$(CONFIG)64, so the x64 configuration isn't being selected. I personally haven't checked the Linux aarch64 target, so I don't know where the problem is right now. You definitely need to avoid NOASM=1 because it forces FORCE_DRC_C_BACKEND no matter what you do, and asmjit won't be included.
@danmons seleuco is right, it's due to the same issue with the config string stuff. That'll eventually need to be addressed for this PR, I think, but I'm not very comfortable with makefiles and GENie and such, so how exactly to approach that will have to be discussed and decided. Having said that, I was using your build environment (inside Docker) and with some changes I did get it to work with the DRC after a few failed attempts. Here's the .patch file I made for testing with your Docker build environment:
diff --git a/scripts/genie.lua b/scripts/genie.lua
index bcdbc82236f..f6a5abc5486 100644
--- a/scripts/genie.lua
+++ b/scripts/genie.lua
@@ -711,11 +711,10 @@ end
if not _OPTIONS["FORCE_DRC_C_BACKEND"] then
if _OPTIONS["BIGENDIAN"]~="1" then
if (_OPTIONS["PLATFORM"]=="arm" or _OPTIONS["PLATFORM"]=="arm64") then
- configuration { "x64" }
+ configuration { }
defines {
"NATIVE_DRC=drcbe_arm64",
}
- configuration { }
else
configuration { "x64" }
defines {
diff --git a/scripts/src/3rdparty.lua b/scripts/src/3rdparty.lua
index e56665ef121..6466637d760 100755
--- a/scripts/src/3rdparty.lua
+++ b/scripts/src/3rdparty.lua
@@ -1949,7 +1949,7 @@ project "asmjit"
}
if (_OPTIONS["PLATFORM"]=="arm" or _OPTIONS["PLATFORM"]=="arm64") then
- configuration { "x64" }
+ configuration { }
defines {
"ASMJIT_NO_X86",
}
diff --git a/src/devices/cpu/drcbearm64.cpp b/src/devices/cpu/drcbearm64.cpp
index 5a895392852..6e0b989b8b8 100644
--- a/src/devices/cpu/drcbearm64.cpp
+++ b/src/devices/cpu/drcbearm64.cpp
@@ -826,6 +826,8 @@ drcbe_arm64::drcbe_arm64(drcuml_state &drcuml, device_t &device, drc_cache &cach
, m_baseptr(cache.near() + 0x80)
, m_near(*(near_state *)cache.alloc_near(sizeof(m_near)))
{
+ printf("Using ARM DRC\n");
+
// get pointers to C functions we need to call
using debugger_hook_func = void (*)(device_debug *, offs_t);
static const debugger_hook_func debugger_inst_hook = [] (device_debug *dbg, offs_t pc) { dbg->instruction_hook(pc); };
And also removed
The printf there in drcbearm64.cpp was to be sure it was actually using the ARM DRC.
My guess is you need to also edit
I don't think I was clear on that - yup,
Thank you both for the notes. I'll try them out shortly.
[edit] Ooh yeah, that sorted it. Running all of these with
I have tested some games on my AYN Odin 2 Pro and I have seen everything from small to big performance improvements in games that ran poorly on MAME4droid. Here are links to my tests of the ARM64 beta for MAME4droid.
Panic Park - https://youtu.be/025FNrFFo6M?si=XLUwAjCc8GycoIl5
AYN Odin 2 Pro specs
RAM: 12GB
Philosophically, I’d say that Linux and Windows systems, where 32-bit ARM binaries are supported, should probably be using a *64 configuration for AArch64 builds. However, macOS no longer supports 32-bit applications at all, so there may not be any point keeping the *64 there. |
FYI, the latest Windows 11 feature update (24H2) dropped support for 32-bit ARM applications. |
I pushed a change that I think should probably clear up the build issues. I tested macOS (which reports |
What happened to Microsoft’s famous backwards compatibility? |
I’ve started going through the actual code generation part. It’s going to take several cycles for me to get through all of it. Please bear with me if I ask stupid or obvious questions.
I think it was not Microsoft that decided. Edit:
The stack needs to be 16-byte aligned before the call to debug_log_hashjmp_fail.
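For reference, AAPCS64 requires the stack pointer to be 16-byte aligned at every external call; a small illustrative helper (hypothetical, not from the PR) for rounding an ad-hoc frame size up to that boundary:

#include <cstdint>

// AAPCS64: SP % 16 == 0 must hold at every external call, so any ad-hoc
// frame built before calling a C helper has to round its size up to 16.
constexpr uint64_t aligned_frame_size(uint64_t bytes)
{
    return (bytes + 15) & ~uint64_t(15);
}

static_assert(aligned_frame_size(8) == 16, "pad a single pushed register");
static_assert(aligned_frame_size(24) == 32, "pad three pushed registers");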
Most of the things this time are potential optimisations. Quite a few of them are fairly straightforward. There are a couple of more serious things.
- The delegate resolver doesn’t account for this pointer adjustment, non-virtual member functions, or the MSVC ABI.
- There are some vestigial naked argument values.
- The simple optimisations to use bit field manipulation instructions should be straightforward.
- Optimising address calculations to use adrp may or may not be worth the effort of adding 1.5 more cases.
- Optimising ROLAND/ROLINS to use bit field manipulation instructions would substantially complicate the code. I don’t know if we want to go there at this point.
- Conditionally allocating labels is just something I’m curious about – I haven’t looked at what asmjit actually does when allocating or default constructing a label, so I don’t know whether it would be worthwhile.
The only remaining thing I haven’t reviewed is the ROLC/RORC code generation. I’m pretty sure it can be optimised, I’m just not going to be able to think it through tonight.
[] (resolved_handler &handler, address_space &space, auto accessor)
{
    struct {
        uintptr_t ptr;
        ptrdiff_t adj;
    } equiv;

    assert(sizeof(accessor) == sizeof(equiv));
    *reinterpret_cast<decltype(accessor) *>(&equiv) = accessor;

    handler.obj = uintptr_t(reinterpret_cast<uint8_t *>(&space));

    auto const vptr = *reinterpret_cast<uint8_t const *const *>(handler.obj) + equiv.ptr;
    handler.func = *reinterpret_cast<uint8_t *const *>(vptr);
};
Have you tested this with both virtual and non-virtual member functions? It should also apply this pointer adjustment to deal with multiple inheritance. Essentially, this is supposed to be a special-cased version of delegate_mfp_itanium::convert_to_generic combined with enough of delegate_mfp_msvc to get it to work with the Microsoft ABI.
See delegate.cpp for the full-fat versions: https://github.com/mamedev/mame/blob/master/src/lib/util/delegate.cpp
I would have thought it should be something like this (note that it’s different to the x86-64 version because the virtual function flag is stored in the low bit of the this pointer offset for ARM-like architectures, as the low bit of a code pointer indicates a Thumb function and the offset needs to be shifted):
[] (resolved_handler &handler, address_space &space, auto accessor)
{
if (MAME_DELEGATE_USE_TYPE == MAME_DELEGATE_TYPE_ITANIUM)
{
struct { uintptr_t ptr; ptrdiff_t adj; } equiv;
assert(sizeof(accessor) == sizeof(equiv));
*reinterpret_cast<decltype(accessor) *>(&equiv) = accessor;
handler.obj = uintptr_t(reinterpret_cast<uint8_t *>(&space) + (equiv.adj >> 1));
if (BIT(equiv.adj, 0))
{
auto const vptr = *reinterpret_cast<u8 const *const *>(handler.obj) + equiv.ptr;
handler.func = *reinterpret_cast<uint8_t *const *>(vptr);
}
else
{
handler.func = reinterpret_cast<uint8_t *>(equiv.ptr);
}
}
else if (MAME_DELEGATE_USE_TYPE == MAME_DELEGATE_TYPE_MSVC)
{
// interpret the pointer to member function ignoring the virtual inheritance variant
struct single { uintptr_t ptr; };
struct multi { uintptr_t ptr; int adj; };
struct { uintptr_t ptr; int adj; int vadj; int vindex; } unknown;
assert(sizeof(accessor) <= sizeof(unknown));
*reinterpret_cast<decltype(accessor) *>(&unknown) = accessor;
uint32_t const *func = reinterpret_cast<uint32_t const *>(unknown.ptr);
handler.obj = uintptr_t(&space);
if ((sizeof(unknown) == sizeof(accessor)) && unknown.vindex)
{
handler.obj += unknown.vadj;
auto const vptr = *reinterpret_cast<uint8_t const *const *>(handler.obj);
handler.obj += *reinterpret_cast<int const *>(vptr + unknown.vindex);
}
if (sizeof(single) < sizeof(accessor))
handler.obj += unknown.adj;
// walk past thunks
while (true)
{
if ((0x90000010 == (func[0] & 0x9f00001f)) && (0x91000210 == (func[1] & 0xffc003ff)) && (0xd61f0200 == func[2]))
{
// page-relative jump with +/-4GB reach - adrp xip0,... ; add xip0,xip0,#... ; br xip0
int64_t const page =
(uint64_t(func[0] & 0x60000000) >> 17) |
(uint64_t(func[0] & 0x00ffffe0) << 9) |
((func[0] & 0x00800000) ? (~std::uint64_t(0) << 33) : 0);
uint32_t const offset = (func[1] & 0x003ffc00) >> 10;
func = reinterpret_cast<uint32_t const *>(((uintptr_t(func) + page) & (~uintptr_t(0) << 12)) + offset);
}
else if ((0xf9400010 == func[0]) && (0xf9400210 == (func[1] & 0xffc003ff)) && (0xd61f0200 == func[2]))
{
// virtual function call thunk - ldr xip0,[x0] ; ldr xip0,[x0,#...] ; br xip0
uint32_t const *const *const vptr = *reinterpret_cast<uint32_t const *const *const *>(handler.obj);
func = vptr[(func[1] & 0x003ffc00) >> 10];
}
else
{
// not something we can easily bypass
break;
}
}
handler.func = reinterpret_cast<uint8_t *>(uintptr_t(func));
}
};
That went so far over my head it's in outer space. I completely do not understand this, so I can't verify whether it's correct or not.
I think the code I had is pretty close to MAME_DELEGATE_TYPE_ITANIUM, and checking MAME_DELEGATE_USE_TYPE it does get set to MAME_DELEGATE_TYPE_ITANIUM, although I wasn't using adj. I can say at least that the code I was using did work with the resolved reader/writer functions, and that's primarily what I was using. I also tested the non-resolver ones previously and those worked too.
I have no idea how to test MAME_DELEGATE_TYPE_MSVC though. Maybe I could install Win 11 ARM in a VM and test that out?
Well, the code is basically copy/pasted from delegate.cpp with the impossible paths cut out, so it probably works. I should probably put this stuff somewhere separate so it can be used independently of the delegate classes. I hacked it into the x86-64 DRC backend as a quick optimisation.
If you really want to test it with the MSVC ABI, you’d need to build with either MSVC itself or with clang set to use the MSVC ABI (the clang-cl wrapper that Visual Studio uses to compile with clang does this), but I wouldn’t worry too much. MSVC is basically in “marginally supported” status, and people are even less likely to be using Visual Studio on an ARM PC. If it’s broken, someone will complain eventually, and then I can worry about it.
// compute the base by aligning the cache top to a cache line (assumed to be 64 bytes)
uint8_t *dst = (uint8_t *)(uint64_t(m_cache.top() + 63) & ~63);
Just as a sanity check, I looked up cache line sizes for a few CPUs:
- Apple M series: 128 bytes
- Cortex-A series (including Kryo): 64 bytes
- Cortex-M series: 32 bytes
Stick with 64-byte alignment, or bump it up to 128 as the biggest widespread size?
I found this repo that can get the cache line size from the OS: https://github.com/NickStrupat/CacheLineSize
I tested it on my M1 Pro and it correctly gives 128.
On my Raspberry Pi 4B (ARM Cortex-A72) it gives 64.
Not ARM, but my Ryzen 5700X gives 64, which appears to be correct from what I could find on Google.
What do you think of pulling in that code for this? I don't know if it's something that you'd want to turn into a util function and so it could be used for the other DRC backends too.
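For reference, a minimal sketch of the OS queries such a helper boils down to on macOS and Linux (not the linked library's code, and with a conservative fallback):

#include <cstddef>

#if defined(__APPLE__)
#include <sys/sysctl.h>
#elif defined(__linux__)
#include <unistd.h>
#endif

std::size_t cache_line_size()
{
#if defined(__APPLE__)
    std::size_t size = 0; // hw.cachelinesize is a 64-bit integer on arm64
    std::size_t len = sizeof(size);
    if (!sysctlbyname("hw.cachelinesize", &size, &len, nullptr, 0) && size)
        return size;
#elif defined(__linux__)
    const long size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (size > 0)
        return std::size_t(size);
#endif
    return 64; // fall back to the most common line size
}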
I’ll add it to 3rdparty. Looks simple enough. I’ll worry about this later.
The linked code has no error checking at all. I added helpers for macOS, Windows and Linux in ff92d10. I’m not sure how they’ll behave on systems with big.LITTLE architecture (efficiency cores and performance cores with different characteristics), but they shouldn’t make anything worse.
void drcbe_arm64::op_roland(a64::Assembler &a, const uml::instruction &inst)
{
Is it worth detecting cases where this can be special-cased to use a ubfx or ubfiz instruction? If the mask and shift are immediate:
- ubfx can be used if the mask is right-aligned (i.e. starts at bit position zero)
- ubfiz can be used if the mask is contiguous and starts at the bit position equal to the shift value
For example:
- ROLAND(d, s, 3, 0x003f) is equivalent to ubfx(d, s, bits - 3, 6)
- ROLAND(d, s, 5, 0x07e0) is equivalent to ubfiz(d, s, 5, 6)
Real-world PowerPC RLW* usage will probably fall into one of those two cases a majority of the time, and that's not even counting the hard-coded instances the DRC generates for instructions like MFCR.
I'm not sure how far into the weeds we want to go with optimizations before landing the first working version, but this seems safe and useful.
I don't think you can optimize ROLAND any more than it is now. You can't simplify the command down to just one ubfx/ubfiz because you need to rotate the value.
Here's what the current implementation does:
- If the mask is 0 then the value doesn't change
- If the shift is a constant then it gets simplified to a ror
- The final command is always just a single AND of the rotated value
The only way you could probably optimize it further is to check whether srcp is a constant and, if so, just calculate the result as a constant and mov it to the register.
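For reference, the behaviour being simplified here is just rotate-then-mask; a minimal standalone model of the 4-byte ROLAND (hand-written, not MAME's implementation):

#include <cstdint>

static uint32_t rotl32(uint32_t v, unsigned s) { s &= 31; return s ? (v << s) | (v >> (32 - s)) : v; }

// Reference model of the 4-byte UML ROLAND: rotate left, then AND with the mask.
// A constant shift lowers to a single ror; the mask is one and_ at the end.
static uint32_t uml_roland32(uint32_t src, uint32_t shift, uint32_t mask)
{
    return rotl32(src, shift & 31) & mask;
}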
void drcbe_arm64::op_rolins(a64::Assembler &a, const uml::instruction &inst)
{
Is it worth detecting cases where this can be special-cased to use a bfi instruction? If the mask and shift are immediate, bfi can be used if the mask is contiguous and starts at the bit position equal to the shift value.
For example:
- ROLINS(d, s, 7, 0x0380) is equivalent to bfi(d, s, 7, 3)
In cases where the shift and mask are immediate, the mask is contiguous, and the mask does not start at the bit position equal to the shift value, you can use ror followed by bfi.
For example, ROLINS(d, s, 4, 0x1f80) is equivalent to ror(tmp, s, 3) followed by bfi(d, tmp, 7, 6).
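To sanity-check the ror-plus-bfi case on the host, a standalone sketch (not MAME code; rotl32, rotr32 and bfi32 are hand-rolled stand-ins for the instruction semantics):

#include <cstdint>
#include <cstdio>

static uint32_t rotl32(uint32_t v, unsigned s) { s &= 31; return s ? (v << s) | (v >> (32 - s)) : v; }
static uint32_t rotr32(uint32_t v, unsigned s) { s &= 31; return s ? (v >> s) | (v << (32 - s)) : v; }

// bfi d,n,lsb,width: replace d[lsb +: width] with the low `width` bits of n.
static uint32_t bfi32(uint32_t d, uint32_t n, unsigned lsb, unsigned width)
{
    const uint32_t mask = ((uint32_t(1) << width) - 1) << lsb;
    return (d & ~mask) | ((n << lsb) & mask);
}

int main()
{
    // ROLINS(d, s, 4, 0x1f80): rotate s left 4, insert into d under mask 0x1f80.
    // Claimed equivalent: ror(tmp, s, 3) then bfi(d, tmp, 7, 6).
    for (uint32_t i = 0; i < 0x10000; i++)
    {
        const uint32_t s = i * 0x9e3779b1u, d = i * 0x85ebca77u;
        const uint32_t rolins = (d & ~0x1f80u) | (rotl32(s, 4) & 0x1f80u);
        const uint32_t viabfi = bfi32(d, rotr32(s, 3), 7, 6);
        if (rolins != viabfi)
        {
            printf("mismatch at i=%u\n", i);
            return 1;
        }
    }
    printf("ROLINS(4, 0x1f80) matches ror 3 + bfi(7, 6)\n");
    return 0;
}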
Is it worth detecting cases where this can be special-cased to use a bfi instruction? If the mask and shift are immediate, bfi can be used if the mask is contiguous and starts at the bit position equal to the shift value.
Two things to note:
- If the rolins function is called with all immediates and does not use flags, it gets optimized out (rolins -> roland -> various simplifications for roland). This applies to most opcodes if they use constants: https://github.com/mamedev/mame/blob/master/src/devices/cpu/uml.cpp#L436-L474
- My current implementation will simplify shift immediate down to ror or mov as appropriate already
But the use of bfi to simplify the second half of the code probably will work, so I'll try it.
My implementation of the bfi thing broke rolins. bfi requires the value to start at bit 0 to insert it into the destination register, so it seems like too much effort to save one instruction; I'm going to revert that change.
if (shiftp.is_immediate())
{
    const auto shift = ((inst.size() * 8) - (shiftp.immediate() % (inst.size() * 8))) % (inst.size() * 8);
It should be possible to optimise this expression – assuming the size is always a power of two, isn’t it equivalent to:
const auto shift = ~shiftp.immediate() & ((inst.size() * 8) - 1);
I threw some numbers at it but they're not equivalent. Your version is consistently off by 1.
#include <cstdint>
#include <cstdio>

// stand-ins for the rotl_32/rotr_32 helpers MAME provides in osdcomm.h
static uint32_t rotl_32(uint32_t v, unsigned s) { s &= 31; return s ? (v << s) | (v >> (32 - s)) : v; }
static uint32_t rotr_32(uint32_t v, unsigned s) { s &= 31; return s ? (v >> s) | (v << (32 - s)) : v; }

int main(int argc, char **argv)
{
    const uint32_t instsize = 4;
    const uint32_t val = 0x400;
    for (int shiftp = 0; shiftp < 8; shiftp++)
    {
        uint32_t shiftcalc1 = ((instsize * 8) - (shiftp % (instsize * 8))) % (instsize * 8);
        uint32_t shiftcalc2 = ~shiftp & ((instsize * 8) - 1);
        uint32_t res1 = rotr_32(val, shiftcalc1);
        uint32_t res2 = rotr_32(val, shiftcalc2);
        uint32_t expected = rotl_32(val, shiftp);
        printf("%d %d %d | %08x %08x %08x\n", shiftp, shiftcalc1, shiftcalc2, expected, res1, res2);
    }
    return 0;
}
Result:
0 0 31 | 00000400 00000400 00000800
1 31 30 | 00000800 00000800 00001000
2 30 29 | 00001000 00001000 00002000
3 29 28 | 00002000 00002000 00004000
4 28 27 | 00004000 00004000 00008000
5 27 26 | 00008000 00008000 00010000
6 26 25 | 00010000 00010000 00020000
7 25 24 | 00020000 00020000 00040000
Well, you can fix the off-by-one by using two's-complement negation rather than inversion:
const auto shift = -int64_t(shiftp.immediate()) & ((inst.size() * 8) - 1);
It should still be cleaner than needing the bit count in three places and requiring two modulo operations.
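For completeness, a standalone check (assuming a 4-byte operand size) that the negation form matches the original double-modulo expression across the shift range:

#include <cstdint>
#include <cstdio>

int main()
{
    const uint32_t bits = 32; // inst.size() * 8 for the 4-byte case
    for (uint64_t shiftp = 0; shiftp < 256; shiftp++)
    {
        const uint64_t modulo = (bits - (shiftp % bits)) % bits;
        const uint64_t negate = uint64_t(-int64_t(shiftp)) & (bits - 1);
        if (modulo != negate)
        {
            printf("mismatch at %u\n", unsigned(shiftp));
            return 1;
        }
    }
    printf("equivalent for all tested shift values\n");
    return 0;
}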
Mostly just suggested optimisation of ROLC/RORC with immediate shift. You can get them down to just five bit field manipulation instructions.
Sorry, haven’t had time to write up optimisation of ROLAND/ROLINS tonight, or to look at optimising ROLC/RORC with variable shift.
const uint64_t pagebase = codeoffs & ~make_bitmask<uint64_t>(12);
const int64_t pagerel = (int64_t)ptr - pagebase;
if (is_valid_immediate(abs(pagerel), 12) || ((abs(pagerel) & 0xfff) == 0 && is_valid_immediate(abs(pagerel) >> 12, 12)))
{
    a.brk(1);

    a.adrp(MEM_SCRATCH_REG, pagebase);

    if (is_valid_immediate_signed(pagerel, 9))
        return arm::Mem(MEM_SCRATCH_REG, pagerel);
    else if (pagerel > 0 && emit_add_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
        return arm::Mem(MEM_SCRATCH_REG);
    else if (pagerel < 0 && emit_sub_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
        return arm::Mem(MEM_SCRATCH_REG);
}
This doesn’t seem to be right. You’re using adrp on pagebase, which is the page base of the current PC for the generated code. Wouldn’t that always produce a zero immediate? You want to use it on the target address.
const uint64_t pagebase = codeoffs & ~make_bitmask<uint64_t>(12);
const int64_t pagerel = (int64_t)ptr - pagebase;
if (is_valid_immediate_signed(pagerel, 33))
{
const uint64_t targetpage = ptr & ~make_bitmask<uint64_t>(12);
const uint64_t pageoffs = ptr & util::make_bitmask<uint64_t>(12);
a.adrp(MEM_SCRATCH_REG, targetpage);
if (is_valid_immediate_signed(pageoffs, 9))
{
return arm::Mem(MEM_SCRATCH_REG, pageoffs);
}
else
{
a.add(MEM_SCRATCH_REG, MEM_SCRATCH_REG, pageoffs);
return arm::Mem(MEM_SCRATCH_REG);
}
}
You know the offset relative to the target page will always be a valid 12-bit unsigned value, as it’s just the low twelve bits of the value. It’s always a positive offset from the start of the target page. You don’t need to check for a negative value.
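As a concrete example of the page math (the address is arbitrary):

// For ptr = 0x00007f1234567a9a:
//   targetpage = ptr & ~0xfff = 0x00007f1234567000
//   pageoffs   = ptr &  0xfff = 0xa9a  (always a valid unsigned 12-bit page offset)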
const uint64_t pagebase = codeoffs & ~make_bitmask<uint64_t>(12);
const int64_t pagerel = (int64_t)ptr - pagebase;
if (is_valid_immediate(abs(pagerel), 12) || ((abs(pagerel) & 0xfff) == 0 && is_valid_immediate(abs(pagerel) >> 12, 12)))
{
    a.adrp(MEM_SCRATCH_REG, pagebase);

    if (is_valid_immediate_signed(pagerel, 9))
    {
        a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG, pagerel));
        return;
    }
    else if (pagerel > 0 && emit_add_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
    {
        a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
        return;
    }
    else if (pagerel < 0 && emit_sub_optimized(a, MEM_SCRATCH_REG, MEM_SCRATCH_REG, pagerel))
    {
        a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
        return;
    }
}
Same as the other one – this seems to be using adrp for the page of the current PC, which will always give zero, rather than for the target address.
a.mrs(TEMP_REG1, a64::Predicate::SysReg::kNZCV);

a.bfc(TEMP_REG1, 28, 4); // clear upper 4 status bits

a.and_(TEMP_REG2, FLAGS_REG, 0b1100); // zero + sign
a.ubfx(TEMP_REG3, FLAGS_REG, 1, 1); // overflow flag
a.orr(TEMP_REG2, TEMP_REG2, TEMP_REG3);
a.bfi(TEMP_REG1, TEMP_REG2, 28, 4);

a.msr(a64::Predicate::SysReg::kNZCV, TEMP_REG1);
You don’t need the a.bfc(TEMP_REG1, 28, 4); any more – the a.bfi(TEMP_REG1, TEMP_REG2, 28, 4); will completely replace the contents of the upper four bits.
a.mrs(TEMP_REG1, a64::Predicate::SysReg::kNZCV);

a.bfc(TEMP_REG1, 28, 4); // clear upper 4 status bits

a.and_(TEMP_REG2, FLAGS_REG, 0b1100); // zero + sign
a.ubfx(TEMP_REG3, FLAGS_REG, 1, 1); // overflow flag
a.orr(TEMP_REG2, TEMP_REG2, TEMP_REG3);
a.bfi(TEMP_REG1, TEMP_REG2, 28, 4);

a.msr(a64::Predicate::SysReg::kNZCV, TEMP_REG1);
Same here – the bfc isn’t necessary any more.
void drcbe_arm64::op_rolc(a64::Assembler &a, const uml::instruction &inst)
{
You can special-case this if the shift is an immediate and do it with bit manipulation operations. Let’s illustrate with an 8-bit example, and a 3-bit rotation.
Before:
x |7|6|5|4|3|2|1|0|
f |c|
Desired state after y = ROLC(x, 3):
y |4|3|2|1|0|c|7|6|
f |5|
Now obviously, we don’t have a native operation that does this, but we can synthesise it:
Before:
x |7|6|5|4|3|2|1|0|
f |c|
Copy desired carry flag output to temp ubfx t,x,8-3,1:
x |7|6|5|4|3|2|1|0|
t |5|
f |c|
Rotate upper bits around ubfx y,x,8-3+1,3-1:
x |7|6|5|4|3|2|1|0|
y |-|-|-|-|-|-|7|6|
t |5|
f |c|
Insert current carry bfi y,f,3-1,1:
x |7|6|5|4|3|2|1|0|
y |-|-|-|-|-|c|7|6|
t |5|
f |c|
Rotate lower bits around bfi y,x,3,8-3:
x |7|6|5|4|3|2|1|0|
y |4|3|2|1|0|c|7|6|
t |5|
f |c|
Now you’ve got the desired output in y and the desired carry flag in t. Assuming I understand correctly that carry is bit 0 of the flags register, conceptually it’s:
ubfx carry,src1p,bits-shift,1 ; copy desired carry
ubfx output,src1p,bits-shift+1,shift-1 ; rotate upper bits around
bfi output,FLAGS_REG,shift-1,1 ; insert current carry
bfi output,src1p,shift,bits-shift ; rotate lower bits around
bfi FLAGS_REG,carry,0,1 ; copy carry to flags register
So the code to do this would be something like:
...
if (src2p.is_immediate())
{
const auto s = src2p.immediate() % (inst.size() * 8);
if (s != 0)
{
a.ubfx(carry, param1, (inst.size() * 8) - s, 1);
a.ubfx(output, param1, (inst.size() * 8) - s + 1, s - 1);
a.bfi(output.x(), FLAGS_REG, s - 1, 1);
a.bfi(output, param1, s, (inst.size() * 8) - s);
a.bfi(FLAGS_REG, carry.x(), 0, 1);
}
else
{
...
}
}
else
{
const a64::Gp shift = src2p.select_register(TEMP_REG2, inst.size());
mov_reg_param(a, inst.size(), shift, src2p);
...
void drcbe_arm64::op_rorc(a64::Assembler &a, const uml::instruction &inst)
{
You can do the same optimisation as for ROLC. Let’s illustrate with an 8-bit example, and a 3-bit rotation again.
Before:
x |7|6|5|4|3|2|1|0|
f |c|
Desired state after y = RORC(x, 3):
y |1|0|c|7|6|5|4|3|
f |2|
To synthesise it:
Before:
x |7|6|5|4|3|2|1|0|
f |c|
Copy desired carry flag output to temp ubfx t,x,3-1,1:
x |7|6|5|4|3|2|1|0|
t |2|
f |c|
Rotate upper bits around ubfx y,x,3,8-3:
x |7|6|5|4|3|2|1|0|
y |-|-|-|7|6|5|4|3|
t |2|
f |c|
Insert current carry bfi y,f,8-3,1:
x |7|6|5|4|3|2|1|0|
y |-|-|c|7|6|5|4|3|
t |2|
f |c|
Rotate lower bits around bfi y,x,8-3+1,3-1:
x |7|6|5|4|3|2|1|0|
y |1|0|c|7|6|5|4|3|
t |2|
f |c|
So the code to do this would be something like:
...
if (src2p.is_immediate())
{
const auto s = src2p.immediate() % (inst.size() * 8);
if (s != 0)
{
a.ubfx(carry, param1, s - 1, 1);
a.ubfx(output, param1, s, (inst.size() * 8) - s);
a.bfi(output.x(), FLAGS_REG, (inst.size() * 8) - s, 1);
a.bfi(output, param1, (inst.size() * 8) - s + 1, s - 1);
a.bfi(FLAGS_REG, carry.x(), 0, 1);
}
else
{
...
}
}
else
{
const a64::Gp shift = src2p.select_register(TEMP_REG2, inst.size());
mov_reg_param(a, inst.size(), shift, src2p);
...
Sorry, had to edit two of the review comments due to errors in the suggested code.
It's a big PR. I've done my best to make it work but I'm sure there are still bugs or questionable implementation choices. It passes all of the same tests I was doing for the previous DRC changes so I think the outputs should be the same as the other DRC backends.
Additional testing and/or miscellaneous feedback like benchmarks on various devices would also be appreciated. For hardware, I've tested with an M1 Pro CPU and a Raspberry Pi 4 Model B. For software, I've tested a few games on the Naomi (SH4), Firebeat (PPC), and NWK-TR (PPC), plus some very light testing of dgpix (Hyperstone E1-32XT).