
Optimize stdout&stderr decoding for Windows ANSI Code Page #16

Open
asinbow opened this issue Dec 24, 2018 · 6 comments

Comments

@asinbow

asinbow commented Dec 24, 2018

master...asinbow:patch-1

This is actually NOT a PR, but a suggestion.
I have read the whole content of issue #9, and the changes in static async enableUnicode().
cmd.exe /c chcp has side effects, and I don't think it should be the only choice.
In our own application we have implemented a function ChangeCodePage by using node-ffi:

const ChangeCodePage = (buffer, options = {}) => {
  const {
    fromCodePage = 'acp',
    toCodePage = 'utf8',
    toString = true,
    nullTerminated = false,
  } = options;
  // using Windows API WideCharToMultiByte and MultiByteToWideChar
  return WideCharToMultiByte(MultiByteToWideChar(buffer, fromCodePage), {
    encoding: toCodePage,
    toString,
    nullTerminated,
  });
};
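For context, this is roughly how we use it on reg.exe output in our application (a sketch; it assumes the ffi wrappers above and that the whole stdout has already been collected into a Buffer, here called stdoutBuffer):

// Sketch: decode a reg.exe stdout Buffer from the system ANSI code page to UTF-8.
// stdoutBuffer is assumed to be the concatenated raw chunks from the child process.
const text = ChangeCodePage(stdoutBuffer, {
  fromCodePage: 'acp',   // system ANSI code page
  toCodePage: 'utf8',
  toString: true,
});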

So, if there were a Registry.decodeStdout option, life would be much easier!
💯

@OddMorning
Collaborator

At the moment Registry.enableUnicode is undocumented, but it and its side effects are already described in the readme of the next version:
https://github.com/MikeKovarik/rage-edit/tree/2.0.0-beta#options.unicode
In short, the current implementation reads the current code page, changes it to 65001, reads the registry data, and instantly changes the code page back.
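Roughly, that flow looks like this (just a sketch of the shape described above, not the actual beta code):

// Sketch of the chcp round-trip: read the active code page, switch the
// console to UTF-8, run reg.exe, then restore the original code page.
const { execSync } = require('child_process');

function withUnicode(fn) {
  const previous = execSync('chcp', { encoding: 'utf8' }).match(/\d+/)[0]; // e.g. "866"
  execSync('chcp 65001'); // switch the console to UTF-8
  try {
    return fn(); // run the reg.exe query here
  } finally {
    execSync(`chcp ${previous}`); // instantly change the code page back
  }
}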

It would make sense to include ffi for perfectly safe unicode operations, but:

  1. rage-edit has no dependencies and is only ≈74 KB, while ffi is ≈23 MB and requires node-gyp-related stuff. It would be silly to include such a huge library just to correct unicode output (which mostly isn't even needed).
  2. The package depends on the built-in reg.exe CLI app, and broken unicode is not reg.exe's fault but cmd.exe's. If ffi were part of rage-edit, it would be more prudent to use the Registry API directly instead of just fixing cmd.exe-related problems.

The only idea I can think of is to internally check for ffi availability (without adding it to the dependencies list) and use your ChangeCodePage instead of chcp if ffi is found, but is that worth it? I mean, both chcp and ChangeCodePage look like hacks. Wouldn't other packages affect this package's output? Or should ChangeCodePage be called before each reg.exe call? The latter sounds like just a faster version of the current implementation.

Honestly, I don't know how ffi works so I may be wrong.
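That said, the optional check itself would just be a guarded require, something like this (a sketch; the change-code-page helper is hypothetical, and rage-edit does not do this today):

// Sketch of optional ffi detection: use the ffi-based converter when it is
// installed, otherwise fall back to the current chcp approach.
let changeCodePage = null;
try {
  changeCodePage = require('change-code-page'); // hypothetical ffi-based helper package
} catch (err) {
  // ffi (or the helper) is not installed; keep using chcp
}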

@asinbow
Author

asinbow commented Dec 24, 2018

Thanks for your reply!
It's not a good idea to make rage-edit depend on ffi, yes, I agree. I only integrate ffi in my own application.
Directly converting the stdout buffer to a string leaves no chance to fix the text encoding manually:

proc.stdout.on('data', data => stdout += data.toString())

In my opinion, it would be much better and friendlier if there were a Registry.decodeStdout(function) option.
What do you think?
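Internally, the idea is simply to keep the raw Buffer and let a user-supplied decoder turn it into a string. A minimal sketch (decodeStdout is the proposed option, not existing rage-edit API, and runReg is only an illustrative name):

// Sketch of the proposed hook: collect raw stdout chunks as Buffers and run
// them through Registry.decodeStdout (if set) instead of calling toString().
const { spawn } = require('child_process');

function runReg(args) {
  return new Promise((resolve, reject) => {
    const proc = spawn('reg', args);
    const chunks = [];
    proc.stdout.on('data', chunk => chunks.push(chunk)); // no toString() here
    proc.on('error', reject);
    proc.on('close', () => {
      const decode = Registry.decodeStdout || (buffer => buffer.toString());
      resolve(decode(Buffer.concat(chunks)));
    });
  });
}

// Usage from user code: plug in the ffi-based converter (or iconv-lite).
// Registry.decodeStdout = buffer => ChangeCodePage(buffer);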

@OddMorning
Collaborator

Oh, that's what you mean. But the data still depends on the default code page, at least in my case.

On my current PC the default code page is 866 (Russian) and I can read Cyrillic letters well using your method and a bit of iconv-lite's magic:

proc.stdout.on('data', data => stdout += iconv.decode(data, 'cp866'))

But when I try to read something else (for example, 谷歌翻譯幫助我), it turns into a bunch of nonconvertible data:

await Registry.get('HKLM\\SOFTWARE\\aaaa', 'Unicode')
    ==>
// No encodings, just turning Buffer into a hex string
proc.stdout.on('data', data => stdout += data.toString('hex'))
    ==>
0d0a484b45595f4c4f43414c5f4d414348494e455c534f4654574152455c616161610d0a20202020556e69636f6465202020205245475f535a202020203f3f3f3f3f3f3f0d0a0d0a
    ==>
HKEY_LOCAL_MACHINE\SOFTWARE\aaaa
    Unicode    REG_SZ    ???????

In my case 谷歌翻譯幫助我 turned into ??????? (3f3f3f3f3f3f3f in hex) because cp866 doesn't support those characters.

So, the only thing you can do with the raw data is convert it from your system code page to utf8 (which most probably could be automated). And again, it's just my own research, so I may be wrong.
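To illustrate with the dump above (just decoding the captured bytes; iconv is iconv-lite, as in the earlier snippet):

// 0x3f is the ASCII code of '?', so the data was already lost before it
// reached Node, and no decoder can bring the original characters back.
const tail = Buffer.from('3f3f3f3f3f3f3f', 'hex');
console.log(tail.toString('ascii'));      // prints "???????"
console.log(iconv.decode(tail, 'cp866')); // still "???????"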

@asinbow
Author

asinbow commented Dec 25, 2018

So, automating this process is a really great idea!
"Get Windows console codepage" reveals how to use the Windows APIs GetConsoleOutputCP and SetConsoleOutputCP. And in the documentation of SetConsoleOutputCP, Microsoft tells us the current code page is stored in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP.
We can query it and then apply iconv-lite decoding.

@asinbow changed the title from "Expose decodeStdout option for stdout&stderr" to "Optimize stdout&stderr decoding for Windows ANSI Code Page" on Dec 25, 2018
@OddMorning
Collaborator

In my case (and in yours too, it seems) it's OEMCP, not ACP.

Well, we will see what can be done. All of that looks like a good "lite version" of the unicode option.
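A rough sketch of that "lite version" (assumptions: the OEM code page is read via reg.exe itself, and the value maps to an iconv-lite encoding name like cp866; none of this is implemented anywhere yet):

// Sketch of the "lite version": query the console (OEM) code page from the
// registry once, then decode every reg.exe output Buffer with iconv-lite.
const { execSync } = require('child_process');
const iconv = require('iconv-lite');

function getOemCodePage() {
  const out = execSync(
    'reg query "HKLM\\SYSTEM\\CurrentControlSet\\Control\\Nls\\CodePage" /v OEMCP',
    { encoding: 'utf8' }
  );
  return out.match(/OEMCP\s+REG_SZ\s+(\d+)/)[1]; // e.g. "866"
}

const codePage = getOemCodePage();
const decodeStdout = buffer => iconv.decode(buffer, 'cp' + codePage);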

@OddMorning
Collaborator

I just did some tests and:

  • Raw reg.exe calls take ≈15ms;
  • chcp 65001 + reg.exe + chcp (old value) takes ≈50ms;
  • iconv.decode(data, 'cp866') takes ≈45ms.

So both chcp and iconv are almost equally slow compared to "raw calls", while chcp gives stronger guarantees (full confidence that the data is not corrupted). I doubt that an app would be interrupted after the first chcp call and before the second one. Of course there's a tiny chance of that, but even the get-port package has a similar "failure" chance, and that's not the package's fault.

I want to say, either Registry.decodeStdout or iconv could be added, but again, would that be worth it? Both of them depend on the system. If an app counts on the 936 code page but the system locale is English (437), you won't be able to just decode input from cp936 (just like chcp 936 will always say "Invalid code page"), and decoding from 437 is pointless. There are too many cases where those options would behave differently in different situations. In my opinion, for public usage iconv/decodeStdout is not safe, and for personal usage chcp is already not that bad.

In short, the idea looks good for particular cases, but it would bloat the code without any great benefit globally. I wonder what @MikeKovarik thinks about all of that.
