Novel escaping algorithm #1384

Open
franz1981 opened this issue Jan 10, 2025 · 5 comments

Comments

@franz1981

Is your feature request related to a problem? Please describe.

Hi,

I made a novel, simple and super fast branch-free escaping algorithm at lemire/Code-used-on-Daniel-Lemire-s-blog#116 and I would like to contribute it here if it is welcome. Any feedback? Where should I look?

Describe the solution you'd like

It's based on lemire/Code-used-on-Daniel-Lemire-s-blog#116

Usage example

No response

Additional context

No response

@pjfanning
Member

Can you provide an example in jackson-databind where this would be used?

@franz1981
Author

franz1981 commented Jan 10, 2025

yes, actually I see this could be "partially" applied here -> https://github.com/FasterXML/jackson-core/blob/2.19/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java#L1758-L1765

What you have here is exactly the same as https://lemire.me/blog/2024/10/14/table-lookups-are-efficient/, which I have improved in https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/6236934cffb0de1d3f17dca43d242b735a4a2125/2024/10/14/src/main/java/me/lemire/MyBenchmark.java#L258-L269

i.e.

  • the Latin replacement path should look up into an int[] table whose entries pack the number of chars in the replacement plus the two replacement bytes
  • the non-replaceable chars should still have an entry in the lookup table, but one that contains 1 as the replacement length and the original char with a bogus byte (0? who cares, the replacement length makes it irrelevant)
  • we should always write 2 chars regardless and advance the output index by the length read from the lookup table (yes, it means always writing and then overwriting)

In this way we can reduce the number of branches, optimizing for whatever is Latin and does NOT belong to the special control chars (0-31 IIRC?).
It is not fully branch-free, since that would require a much deeper change, but it is close to it for the "common" cases.
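
To make the three points above concrete, here is a minimal sketch of the packed-table idea. It is not the code from the linked benchmark or from UTF8JsonGenerator: the class name, the bit layout (length in bits 16-23, second byte in bits 8-15, first byte in bits 0-7) and the restriction to ASCII input are all assumptions made for illustration. Control chars that need a six-byte \u00XX escape, and anything above 127, are left to a slow path that is out of scope here, so one branch remains.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and bit layout are assumptions, not Jackson's API.
final class PackedEscapeSketch {
    // One int per ASCII char: (length << 16) | (secondByte << 8) | firstByte.
    // An entry of 0 marks chars this sketch does not handle.
    private static final int[] TABLE = new int[128];

    private static int pack(int length, int first, int second) {
        return (length << 16) | (second << 8) | first;
    }

    static {
        // Non-escaped chars: length 1, the char itself, and a bogus second byte (0).
        for (int c = 32; c < 128; c++) {
            TABLE[c] = pack(1, c, 0);
        }
        // Two-byte JSON escapes.
        TABLE['"'] = pack(2, '\\', '"');
        TABLE['\\'] = pack(2, '\\', '\\');
        TABLE['\b'] = pack(2, '\\', 'b');
        TABLE['\t'] = pack(2, '\\', 't');
        TABLE['\n'] = pack(2, '\\', 'n');
        TABLE['\f'] = pack(2, '\\', 'f');
        TABLE['\r'] = pack(2, '\\', 'r');
    }

    // Writes the escaped form of 'in' into 'out' and returns the number of bytes used.
    // 'out' must hold at least 2 * in.length() bytes.
    static int escape(String in, byte[] out) {
        int o = 0;
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= 128 || TABLE[c] == 0) {
                // Remaining branch: non-ASCII and \u00XX control chars need a slow path.
                throw new IllegalArgumentException("char not handled by this sketch: " + (int) c);
            }
            int e = TABLE[c];
            // Always write two bytes; the second is overwritten next time if length is 1.
            out[o] = (byte) e;
            out[o + 1] = (byte) (e >>> 8);
            o += e >>> 16;
        }
        return o;
    }

    static String escapeToString(String in) {
        byte[] out = new byte[in.length() * 2];
        int n = escape(in, out);
        return new String(out, 0, n, StandardCharsets.US_ASCII);
    }
}
```

The unconditional two-byte store is what removes the "does this char need escaping?" branch from the common path; the price is an output buffer sized for the worst case of two bytes per input char.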

@pjfanning pjfanning transferred this issue from FasterXML/jackson-databind Jan 10, 2025
@pjfanning
Member

Might be some partial overlap with #1349. Jackson-core would probably be a better place for an escaping implementation.

@franz1981
Author

franz1981 commented Jan 10, 2025

yep, it looks like it could be part of that, but please 🙏 suggest that the user use JMH rather than a hand-rolled benchmark... 😢

@cowtowncoder
Member

Yes, escaping definitely belongs in jackson-core, and UTF8JsonGenerator sounds like the right place as well.

I would perhaps suggest working on the master branch (for Jackson 3.0), although the jackson-core diffs between 2.x and 3.0 are not huge.

3 participants