
Optimize stdout&stderr decoding for Windows ANSI Code Page #16

Open
asinbow opened this issue Dec 24, 2018 · 6 comments

Comments

@asinbow

asinbow commented Dec 24, 2018

master...asinbow:patch-1

This is actually NOT a PR, but a suggestion.
I have read the whole content of issue #9, and the changes in static async enableUnicode().
cmd.exe /c chcp has side effects, and I don't think it should be the only choice.
In our own application we have implemented a function ChangeCodePage by using node-ffi:

const ChangeCodePage = (buffer, options = {}) => {
  const {
    fromCodePage = 'acp',
    toCodePage = 'utf8',
    toString = true,
    nullTerminated = false,
  } = options;
  // using Windows API WideCharToMultiByte and MultiByteToWideChar
  return WideCharToMultiByte(MultiByteToWideChar(buffer, fromCodePage), {
    encoding: toCodePage,
    toString,
    nullTerminated,
  });
};
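For context, this is roughly how we use it on reg.exe output in our application (a sketch; it assumes the ffi wrappers above and that the whole stdout has already been collected into a Buffer, here called stdoutBuffer):

// Sketch: decode a reg.exe stdout Buffer from the system ANSI code page to UTF-8.
// stdoutBuffer is assumed to be the concatenated raw chunks from the child process.
const text = ChangeCodePage(stdoutBuffer, {
  fromCodePage: 'acp',   // system ANSI code page
  toCodePage: 'utf8',
  toString: true,
});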

So, if there were a Registry.decodeStdout option, life would be much easier!
💯

@OddMorning
Collaborator

At the moment Registry.enableUnicode is undocumented, but it and its side effects are already described in the readme of the next version:
https://github.com/MikeKovarik/rage-edit/tree/2.0.0-beta#options.unicode
In short, the current implementation reads the current code page, changes it to 65001, reads the registry data, and instantly changes the code page back.
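Roughly, that flow looks like this (just a sketch of the shape described above, not the actual beta code):

// Sketch of the chcp round-trip: read the active code page, switch the
// console to UTF-8, run reg.exe, then restore the original code page.
const { execSync } = require('child_process');

function withUnicode(fn) {
  const previous = execSync('chcp', { encoding: 'utf8' }).match(/\d+/)[0]; // e.g. "866"
  execSync('chcp 65001'); // switch the console to UTF-8
  try {
    return fn(); // run the reg.exe query here
  } finally {
    execSync(`chcp ${previous}`); // instantly change the code page back
  }
}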

It would make sense to include ffi for perfectly safe unicode operations, but:

  1. rage-edit has no dependencies and is only ≈74 KB, while ffi is ≈23 MB and requires node-gyp-related stuff. It would be silly to include such a huge library just to correct unicode output (which mostly isn't even needed).
  2. The package depends on the built-in reg.exe CLI app, and broken unicode is not reg.exe's fault but cmd.exe's. If ffi were part of rage-edit, it would be more prudent to use the Registry API directly instead of just fixing cmd.exe-related problems.

The only idea I can think of is to internally check for ffi availability (without adding it to the dependencies list) and use your ChangeCodePage instead of chcp if ffi is found, but is that worth it? I mean, both chcp and ChangeCodePage look like hacks. Wouldn't other packages affect this package's output? Or should ChangeCodePage be called before each reg.exe call? The latter sounds like just a faster version of the current implementation.

Honestly, I don't know how ffi works so I may be wrong.
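That said, the optional check itself would just be a guarded require, something like this (a sketch; the change-code-page helper is hypothetical, and rage-edit does not do this today):

// Sketch of optional ffi detection: use the ffi-based converter when it is
// installed, otherwise fall back to the current chcp approach.
let changeCodePage = null;
try {
  changeCodePage = require('change-code-page'); // hypothetical ffi-based helper package
} catch (err) {
  // ffi (or the helper) is not installed; keep using chcp
}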

@asinbow
Author

asinbow commented Dec 24, 2018

Thanks for your reply!
It's not a good idea to make rage-edit depend on ffi, yes, I agree. I only integrate ffi in my own application.
Directly converting the stdout buffer to a string leaves no chance to fix the text encoding manually:

proc.stdout.on('data', data => stdout += data.toString())

In my opinion, it would be much better and friendlier if there were a Registry.decodeStdout(function) option.
What do you think?
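Internally, the idea is simply to keep the raw Buffer and let a user-supplied decoder turn it into a string. A minimal sketch (decodeStdout is the proposed option, not existing rage-edit API, and runReg is only an illustrative name):

// Sketch of the proposed hook: collect raw stdout chunks as Buffers and run
// them through Registry.decodeStdout (if set) instead of calling toString().
const { spawn } = require('child_process');

function runReg(args) {
  return new Promise((resolve, reject) => {
    const proc = spawn('reg', args);
    const chunks = [];
    proc.stdout.on('data', chunk => chunks.push(chunk)); // no toString() here
    proc.on('error', reject);
    proc.on('close', () => {
      const decode = Registry.decodeStdout || (buffer => buffer.toString());
      resolve(decode(Buffer.concat(chunks)));
    });
  });
}

// Usage from user code: plug in the ffi-based converter (or iconv-lite).
// Registry.decodeStdout = buffer => ChangeCodePage(buffer);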

@OddMorning
Collaborator

Oh, that's what you mean. But the data still depends on the default code page, at least in my case.

On my current PC the default code page is 866 (Russian) and I can read Cyrillic letters well using your method and a bit of iconv-lite's magic:

proc.stdout.on('data', data => stdout += iconv.decode(data, 'cp866'))

But when I try to read something else (for example, 谷歌翻譯幫助我), it turns into a bunch of nonconvertible data:

await Registry.get('HKLM\\SOFTWARE\\aaaa', 'Unicode')
    ==>
// No encodings, just turning Buffer into a hex string
proc.stdout.on('data', data => stdout += data.toString('hex'))
    ==>
0d0a484b45595f4c4f43414c5f4d414348494e455c534f4654574152455c616161610d0a20202020556e69636f6465202020205245475f535a202020203f3f3f3f3f3f3f0d0a0d0a
    ==>
HKEY_LOCAL_MACHINE\SOFTWARE\aaaa
    Unicode    REG_SZ    ???????

In my case 谷歌翻譯幫助我 turned into ??????? (3f3f3f3f3f3f3f in hex) because cp866 doesn't support those characters.

So, the only thing you can do with the raw data is convert it from your system code page to utf8 (which most probably could be automated). And again, it's just my own research, so I may be wrong.
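To illustrate with the dump above (just decoding the captured bytes; iconv is iconv-lite, as in the earlier snippet):

// 0x3f is the ASCII code of '?', so the data was already lost before it
// reached Node, and no decoder can bring the original characters back.
const tail = Buffer.from('3f3f3f3f3f3f3f', 'hex');
console.log(tail.toString('ascii'));      // prints "???????"
console.log(iconv.decode(tail, 'cp866')); // still "???????"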

@asinbow
Author

asinbow commented Dec 25, 2018

So, automating this process is a really great idea!
"Get Windows console codepage" reveals how to use the Windows APIs GetConsoleOutputCP and SetConsoleOutputCP. And in the documentation of SetConsoleOutputCP, Microsoft tells us the current code page is stored in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP.
We can query it and then apply iconv-lite decoding.

@asinbow changed the title from "Expose decodeStdout option for stdout&stderr" to "Optimize stdout&stderr decoding for Windows ANSI Code Page" on Dec 25, 2018
@OddMorning
Collaborator

In my case (and in yours too, it seems) it's OEMCP, not ACP.

Well, we will see what can be done. All of that looks like a good "lite version" of the unicode option.
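A rough sketch of that "lite version" (assumptions: the OEM code page is read via reg.exe itself, and the value maps to an iconv-lite encoding name like cp866; none of this is implemented anywhere yet):

// Sketch of the "lite version": query the console (OEM) code page from the
// registry once, then decode every reg.exe output Buffer with iconv-lite.
const { execSync } = require('child_process');
const iconv = require('iconv-lite');

function getOemCodePage() {
  const out = execSync(
    'reg query "HKLM\\SYSTEM\\CurrentControlSet\\Control\\Nls\\CodePage" /v OEMCP',
    { encoding: 'utf8' }
  );
  return out.match(/OEMCP\s+REG_SZ\s+(\d+)/)[1]; // e.g. "866"
}

const codePage = getOemCodePage();
const decodeStdout = buffer => iconv.decode(buffer, 'cp' + codePage);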

@OddMorning
Collaborator

I just did some tests and:

  • Raw reg.exe calls take ≈15ms;
  • chcp 65001 + reg.exe + chcp (old value) takes ≈50ms;
  • iconv.decode(data, 'cp866') takes ≈45ms.

So both chcp and iconv are almost equally slow compared to "raw calls", while chcp gives stronger guarantees (full confidence that the data is not corrupted). I doubt that an app would be interrupted after the first chcp call and before the second one. Of course there's a tiny chance of that, but even the get-port package has a similar "failure" chance, and that's not the package's fault.

I want to say, either Registry.decodeStdout or iconv could be added, but again, would that be worth it? Both of them depend on the system. If an app counts on the 936 code page but the system locale is English (437), you won't be able to just decode input from cp936 (just like chcp 936 will always say "Invalid code page"), and decoding from 437 is pointless. There are too many cases where those options would behave differently in different situations. In my opinion, for public usage iconv/decodeStdout is not safe, and for personal usage chcp is already not that bad.

In short, the idea looks good for particular cases, but it would bloat the code without any great benefit globally. I wonder what @MikeKovarik thinks about all of that.
