cpufreq: armada-37xx: forbid cpufreq for 1.2 GHz variant #20
Comments
@kostapr: Hi! Could you look at this bug report? |
I have no problem with this patch. BTW, Ken Ma, Igal Liberman and Victor Gu are not with Marvell anymore. For the future Armada-related patches, please add Stefan ([email protected]) and Nadav ([email protected]) to the CC list. |
@kostapr But that patch is not a solution, it's just a hotfix to get the devices booting and not constantly crashing due to voltage issues. |
@robimarko I cannot comment on the problem. I personally think that not all 37xx dies are capable of running stably at 1.2GHz. However, this should be confirmed by the HW design or production team at Marvell. Hopefully @haklai can add more on this matter. |
There is part order number So @robimarko, could you confirm that you have the right 37xx die, the one designed for 1.2 GHz? And in that case, @kostapr or @haklai, could you get more information from the HW design / production team about where the issue is? |
@pali I opened one of the ESPRESSObin Ultras I have and the SoC PN is: |
@kostapr so @robimarko's SoC above is certainly designed for 1.2 GHz. |
@pali If you disable DFS feature and boot with 1.2GHz frequency only, do you see any crashes? |
@stefanchulski currently I do not have the 1.2GHz variant of the A3720 SoC. @robimarko and @erdoukki, could you please run the required tests for @stefanchulski? |
Sure, with pleasure, as usual... |
@stefanchulski If I am seeing it correctly, it's using 1200MHz by default after booting as the kernel is not scaling it anymore.
I really need to stress test it before claiming that it's stable with the WTMI-set VDD. But I have seen samples that use 1.26V as well, and I don't think CPUFreq has a way to know this, so it uses too low a voltage for most boards. UPDATE:
|
My guess is that the wtmi firmware is missing some init sequence related to CPU voltage configuration. See function There is array |
The CPUFreq driver armada-37xx-cpufreq.c knows this: it grabs this value from OTP (indirectly, by reading it from a register which is filled by wtmi code, which in turn fills it from OTP). The driver uses the following Marvell algorithm:
For max_freq 1200 MHz the dividers are div1=2, div2=4, div3=6; for 1000 MHz they are div1=2, div2=4, div3=5; and for 800 MHz they are div1=2, div2=3, div3=4. But I do not know the source of the above Marvell algorithm and these constants (especially the 100mV and 150mV subtracted for div1/2/3). I was not able to find this documented in either the Armada 3720 Functional or Hardware specification. And I suspect that these 100mV and 150mV constants are incorrect too, as for a CPU with max_freq=1GHz I had to make a small adjustment in the cpufreq driver. I was told that Marvell reproduced this issue on their 3720 development board last year and was preparing a fix for it, including a documentation/errata update. But I have not seen anything. So somebody at Marvell must have been aware of this issue and should know more details about it (or somebody who is not with Marvell anymore, as @kostapr wrote). Also look at the Armada 3720 Errata document: an issue related to 1.2GHz mode has been documented there for a long time. |
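For reference, the divider values quoted above can be turned into per-load-level frequencies with a minimal sketch like the one below. It is only an illustration, not the driver code: the names `DIVIDERS` and `load_level_freqs` are made up for this example, and it assumes that L0 runs undivided, L1 uses div1 (both as described in this thread), and, by extension, that L2 uses div2 and L3 uses div3.

```python
# Divider values per maximum CPU frequency (MHz), as quoted in this thread.
# The dictionary and function names are hypothetical, for illustration only.
DIVIDERS = {
    1200: (2, 4, 6),
    1000: (2, 4, 5),
    800: (2, 3, 4),
}


def load_level_freqs(max_freq_mhz):
    """Return [L0, L1, L2, L3] frequencies in MHz.

    Assumption: L0 is undivided and load level Ln uses divider divn.
    """
    div1, div2, div3 = DIVIDERS[max_freq_mhz]
    return [max_freq_mhz / d for d in (1, div1, div2, div3)]


if __name__ == "__main__":
    for max_freq in sorted(DIVIDERS, reverse=True):
        freqs = [round(f, 1) for f in load_level_freqs(max_freq)]
        print(max_freq, "MHz ->", freqs)
```

Under these assumptions the 1200 MHz variant gets load levels of 1200, 600, 300 and 200 MHz, so the problematic L1 to L0 switch is the jump from 600 MHz to 1200 MHz.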
@stefanchulski: Do you need some more tests? Or is the above crash confirmation with the log from @robimarko enough? |
@pali So is the issue related to cpufreq as described in the patch, or do you have an issue with 1.2GHz? |
@stefanchulski it seems both. There is the issue related to cpufreq as described on the mailing list. And @robimarko has problems with 1.2GHz as described in post #20 (comment) |
@pali Are all other frequencies stable? Is it a specific-board issue, or does it occur on many boards? |
It is on many boards. The problem occurs either when running on L0 load (i.e. without divisor) or when switching from L1 load (uses div1) to L0. |
After a lot of experiments we somehow worked around this crash on the 1GHz variant of A3720 with this commit: But the fix/workaround does not work for the 1.2GHz variant of A3720 and, as @robimarko wrote, it still crashes. |
I am not familiar with all these AVS configurations on A37XX. But the hardcoded values look strange; should you take chip skew into account and calculate AVS from SVC? |
Yes, I only have 1.2GHz A3720 models and for me, all of the boards I tried are crashing. |
Yes, but we have absolutely no idea what is happening here. And if you look at the changes referenced from the above commit dc33b62, those hardcoded values were done by Marvell developers...
Probably, but we have no idea how... Documentation on this topic is missing. I have not seen any SVC documentation. So this is something which is probably only available internally at Marvell. |
Same for me...
This one crashed quickly... |
another ULTRA
UPDATE : OK |
More from third ULTRA board :
from lscpu :
pretty stable :
UPDATE : OK |
More also from my fourth ULTRA board :
from lscpu
pretty stable :
UPDATE : OK |
Due to bugs in a37xx cpu driver, reported cpu frequency (e.g. by lscpu) could be incorrect. So the best check for (maximal) cpu frequency is to use mhz userspace tool from https://github.com/wtarreau/mhz which reports correct value, even when kernel reports it incorrectly. |
@pali It's running at 1200MHz as that is set by WTMI, and since CPUFreq is blacklisted for the 1200MHz model the kernel won't touch it.
@erdoukki That's the issue: depending on the exact board you test, some are stable with the WTMI-set voltages while others are not. For me, most of them will crash. |
@pali @robimarko |
Hello all, I have some new issues around the CPU bug on the 37xx. I may look at it more deeply if needed, because it is one of my working ULTRAs, which reboots only under CPU load... Add: I have to get into testing the official Marvell SDK, but that was postponed for now... |
FROM SDK10 (SDK-10.3.9.0) and OpenWrt 21.02.0, r16279-5cc0535800
Booting an ESPRESSObin-ULTRA (one of my most unstable... checked before the tests and confirmed to still be CRASHING only a few seconds after boot in OpenWrt with the default kernel)
Then after few seconds...
Now entering the SDK10 tests ! Just booting with SDK10 Image (and modules) in OpenWrt 21.02.0...
stressed with: crash (but not reset):
PANICs look like they are from something else:
more information with the working kernel from SDK10
|
Sorry for the delay... |
still no issue;
|
|
QUICK-CRASH with KERNEL-SDK10 & 0x58e3ffff
|
NOCRASH with KERNEL-SDK10 & 0x78e3ffff
stress... (EDIT: OK no more crash on this one with kernel from SDK10) |
ANOTHER ESPRESSObin-ULTRA (which crashes before the boot process finishes!) SIMPLY WORKS FINE (BOOT: OK - STRESSTEST: WIP / TBD) with the SDK10 kernel!
Tests with SDK10 kernel:
|
I have reflashed the same ULTRA from snapshot to 21.02.1
STRESS-NG: EDITED with results (CRASH)...
|
Same ULTRA with only 0x5a69 forced value before boot:
|
flashing my last (old but working) custom UBOOT on the same BUGGY ULTRA:
EDIT: RESULTS:
CRASH (FREEZE)
Maybe from something other than the CPU? |
Maybe I am mixing some kernel modules for the stress tools? Because this box, which hangs/freezes or panics very quickly with mainline Linux, mostly works with the Marvell-SDK10 kernel:
Does anyone have a proposal for testing and studying this issue more deeply? Advice welcome! |
Well, we have known from the beginning that A3720 crashes when running at 1.2 GHz. And it needs to be fixed. AFAIK there is no patch fixing this issue for 1.2 GHz mode in either the Marvell-4.14 kernel or the mainline kernel. So posting more and more crash logs does not bring anything new; I guess everybody knows it from the first few posts, and people would rather unsubscribe from a spamming thread. Has Marvell privately provided you with any fix for this issue? |
Sure, you're right, and I agree! Some (more) technical details of the ULTRA on which I do my tests (the buggiest one I have):
It is the SDK10 Kernel I compile for testing the NDA-SDK10 from Marvell.
the CPU is working at 1.2 GHz
CPU governors are not implemented and the CPU is always at 1.2 GHz
The box does not bug/freeze/crash/oops at all.
Sorry about this; I apologize.
They only gave me access to the SDK under an NDA, and not directly but through my own contacts. The tests may also be flawed because of the mix of kernel / libs and my OpenWrt-based system... What I see is two Ultras which crash on mainline Linux but appear to work without problems with the SDK10 kernel. These SDKs are based on linux-4.14.x. There is also an SDK-11 based on linux-5.4.x, from which all a37xx supported build names have been removed. Since I do not know these Marvell SDKs very closely, I do not know if they include more, less, or all of the patches from the Marvell public Git, but I think not, because of the NDA... So, sorry again for this long message; it is not spam, it is just my own personal way of sharing the work transparently. @pali, am I completely wrong in my analysis of the first results? I do it all in my free time, for free, with no sponsor and no goal other than helping to fix this issue... |
I think that no more tests are needed unless people from Marvell (or somebody else who is going to fix it) explicitly say what they need. It is now up to Marvell to provide a fix. If you have an NDA contract with Marvell, you could report this issue to them and ask them what they need for fixing it. |
@stefanchulski or anybody from Marvell: Could you please provide some reply/feedback on what is needed for fixing this issue? And do you need some more tests from @erdoukki with the 1.2GHz A3720? |
I can confirm the SDK10 is mostly stable at 1.2 GHz...
As you can see, I stress only the CPU, which stays at 1.2 GHz, for more than 2 hours with the SDK10 kernel, and there is no FREEZE, BUG, or OOPS! It is, again, my buggiest EspressoBin-ULTRA, which just hangs before the end of normal bootup with the OpenWrt 21.02.x default kernel... |
@kostapr Could you please advise whom to ask for help or any feedback here? |
Marvell published a new release of the SDK10 on 2022.02. |
@erdoukki How did you get to run at 1200MHz? |
@robimarko Can you share your current work (private fork or otherwise)? I have not modified the SDK10 kernel sources. I flash the kernel and modules obtained from the SDK10 compilation into an OpenWrt snapshot image built with the OpenWrt toolchain. I can share the simple patch used on SDK10 if you want... |
There is nothing to share really, I just replaced the cpufreq driver in 5.4 kernel with the one from SDK10. |
Initially, Marvell confirmed that SDK10 was the only supported one, and also that the 1200 MHz freeze was solved in the latest SDK10 (end of 2020). Why do you test only parts of SDK10 on top of the community kernel? I want to repeat that I have proposed another possible way to study the problem. My tests showed that SDK10 does not freeze at 1200 MHz. For SDK10, to build the kernel and modules, I use this patch:
Hope this can help... |
Why wouldn't I test it on mainline kernels? It doesn't freeze at 750MHz, obviously. I have been forced to run it at 750MHz, aka the DDR clock, for a while now, because if you disable CPUFreq it will be left at 1200MHz as WTMI set it there and crash under light load, and with CPUFreq it won't even boot properly. I am interested in solving this for everybody, not just running SDK10 and pretending it's all fine now, because it isn't. |
I am sure you want to, as I want also...
It is only to verify the veracity of what Marvell officially claims...
Who says that? I was only answering your last sentence; no problem at all, anyway, with what you said, or with what I may have understood. Let me state again, more precisely, what I suggest as a direction towards a possible solution for this overall problem. #1. Is Marvell's claim that SDK10 fixes the 1200 MHz crashes true? If it really is, then there is a CLUE or a solution in SDK10 which may be found and offered to the community and the latest kernel... #2. We can then try some directions to find the problem (BUG) or, better, find the solution (FIX(ES))? #3. Then we can write some code, compile, and debug, with deeper tests? #4. Then we can pop the champagne? Sorry for not being technical, or only a little... For now, keep this work private if you prefer, or share any experiences with me to help with this analysis [email protected] |
The last part wasn't directed at you at all, I agree with at least most things you said. |
New SDK10-QA-10.22.03 ! |
kostapr commented on 9 Oct 2021
@kostapr @stefanchulski Could you check what is the current state? Because this issue is still present. |
openwrt/openwrt@f407b2f
How can we contact Marvell to get the needed information?