Investigate allowing UTF8StreamJsonParser to be used without canonicalization (see #994) #995
base: 2.16
Ok thanks! This definitely needs to wait until 2.16, at which point we also need to consider usage from other format modules (mostly Smile, I think).
Great, thank you!
024c3ca to 7854515
7854515 to 3d565bd
I've rebased this and updated the
@carterkozak That sounds good. Wrt Smile there's still some benefit from trying to find a canonicalized instance by quad, although direct decoding is indeed faster (as there's no escaping). I vaguely remember something about the canonicalization setting not being honored; it would be nice to fix, so +1 for doing that if you figure out a way.
On JMH benchmarks: variations are way high so I think you'd want
I haven't had a chance to dig into the smile async parser changes yet, but I've re-run the benchmark with:
java -Xmx256m -jar perf.jar ".*JsonNoCanonicalizeReadVanilla.*" -wi 5 -w 3 -i 5 -r 3 -f 5 -t max -rf json
I've uploaded the structured json results here: https://gist.github.com/carterkozak/5f736d2348edbbc873b9629b18de6929
It's getting a bit late and I haven't had a chance to sit down and analyze the results yet, but I thought I'd share what I've collected nonetheless :-)
Screenshot of the visualization summary for posterity:
edit: I've shared the benchmark I used here: FasterXML/jackson-benchmarks#8
So far it looks like the original comment was correct. I suspect the performance difference for small inputs is due in part to the 8kb buffer created within InputStreamReader. If I add
@carterkozak thanks for the interesting analysis. Would you be able to try different Java LTS versions? 8, 11, 17 and 20 (as a proxy for the forthcoming 21 LTS release). There might be different results, in particular for the SIMD-related support. jackson-core v2.15 is a multi-release jar and in theory we can support different implementations of classes to suit different Java releases.
Happy to re-run benchmarks across jdk releases; I only tested with the latest build of jdk17 in the results above. It may be a few days before I have a chance to re-run benchmarks though. Regarding multi-release jars, I don't think that will help us quite yet because jep 448 is re-incubating the vector API as a preview feature for jdk 21, so it won't be available as a stable feature until a later release.
On Multi-release jars: I don't have much appetite for JDK-specific variants at this point. Just fwtw. I concur with the suggestion that buffer allocation of So at this point I am not sure it makes sense to merge this PR, if it's not quite clear there are consistent performance improvements. And although we could try creating custom Conversely making So I suspect this PR is sort of pending for now. |
I agree with your assessment that the original intent of this PR is unnecessary. It may still be helpful to ensure all factory methods respect the canonicalization setting (perhaps under a new issue+PR, if you agree). I can update this PR+title to solely add information from this investigation to the comment here if you'd like:
jackson-core/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java Lines 260 to 262 in f98e22a
I agree that another implementation likely isn't worthwhile. As an aside with regard to conversion from bytes to quads and chars: I suspect that on jdk9+, after the string compactness changes from jep-254, it may be more efficient to create strings directly from byte-arrays rather than converting to an intermediate char-array, because strings are no longer backed by chars but by an underlying byte-array with metadata describing whether the encoding is
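To make the aside about byte-arrays versus an intermediate char-array concrete, here is a minimal sketch of the two construction paths using only the JDK. The class and method names are hypothetical and nothing here is taken from jackson-core; the char-array path assumes ASCII-only input to keep the hand-written decoding trivial.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not code from jackson-core.
final class DirectStringDecodeSketch {

    // Path A: copy bytes into an intermediate char[] and build the String from it.
    // Assumption: the input is ASCII-only, so each byte maps to exactly one char.
    static String viaCharArray(byte[] buf, int offset, int len) {
        char[] chars = new char[len];
        for (int i = 0; i < len; i++) {
            byte b = buf[offset + i];
            if (b < 0) {
                throw new IllegalArgumentException("non-ASCII byte at " + (offset + i));
            }
            chars[i] = (char) b;
        }
        return new String(chars, 0, len);
    }

    // Path B: let the JDK decode the byte range directly, with no char[] owned by the caller.
    static String direct(byte[] buf, int offset, int len) {
        return new String(buf, offset, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "{\"name\":1}".getBytes(StandardCharsets.UTF_8);
        // The field name occupies bytes 2..5 of the document.
        String a = viaCharArray(doc, 2, 4);
        String b = direct(doc, 2, 4);
        System.out.println(a + " equals " + b + ": " + a.equals(b));
    }
}
```

Both paths produce equal strings; whether path B is actually faster on a given JDK would need to be measured with JMH rather than assumed.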
Yes, +1 for respecting canonicalization + intern settings.
Sounds good.
Yeah, although this is kind of... ugly and nasty, because the real challenge is UTF-8 encoding/decoding. I don't know how other decoding libraries deal with this tho. Seems like a challenge for anything that couples tokenizing with Charset decoding (which is beneficial for performance but has the challenges of tight coupling).
I haven't forgotten about this, it has been a busy week. Planning to put together the updates as described tomorrow! Thanks for bearing with me :-)
… performance comment
3d565bd to 93bc91a
I've updated this PR to expand upon the performance characteristics and point to this discussion for additional context.
If there's anything I can do to help here, I'd be happy to assist. Similar to #593 this pops up on JFRs when using Jackson to parse smaller JSON documents that are already in
Is this PR still active? For the time being it seems it might not be active any more, but I don't want to close it if others find it useful.
Note: I've created this PR against the default branch because I don't know where it is the best fit. I don't wish to cause contention around what should or shouldn't be shipped in 2.15.
Previously, the ReaderBasedJsonParser was used instead, which is less performant when reading from an InputStream (and handling charset decoding in addition to json parsing).
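As a rough illustration of the two input paths being compared, the sketch below scans a small document once through an InputStreamReader (charset decoding first, then scanning chars) and once over the raw bytes. It uses only the JDK; the class and method names are hypothetical, and counting quote characters is just a stand-in for tokenizing, not what either Jackson parser does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not code from jackson-core.
final class ReaderVersusBytesSketch {

    // Reader path: bytes are decoded to chars by InputStreamReader before scanning.
    static int countQuotesViaReader(InputStream in) {
        int count = 0;
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c == '"') {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Byte path: scan the UTF-8 bytes directly. This is safe for '"' because
    // 0x22 never occurs inside a multi-byte UTF-8 sequence.
    static int countQuotesViaBytes(InputStream in) {
        int count = 0;
        byte[] buf = new byte[256]; // small caller-owned buffer (size is an arbitrary choice)
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '"') {
                        count++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] doc = "{\"a\":1,\"b\":\"x\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(countQuotesViaReader(new ByteArrayInputStream(doc)));
        System.out.println(countQuotesViaBytes(new ByteArrayInputStream(doc)));
    }
}
```

The two methods return the same count; the difference is only in how many intermediate buffers and decoding steps sit between the stream and the scanning loop.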
This commit updates the JsonFactory factory methods to respect the canonicalization configuration, where previously a canonicalizing implementation was always used.
I have added guards around both _symbols.addName and _symbols.findName based on the existing implementation from SmileParser. For correctness, only the guards around addName are required, but we avoid unnecessary hashing by guarding both.
Note that several methods on the JsonFactory failed to take the canonicalization configuration into account, so I have updated them. I can extract that to a separate PR if you prefer.
Testing
Testing is tricky here because we don't expect any behavior changes, only for a different implementation to be used to get there. I could test that specific implementations are returned, but I suspect that would cause more problems than it would prevent in future refactors.
Benchmarking
Standard caveats apply, resulting numbers are specific to my system, which may have been running background tasks at the time, and isn't representative of all environments.
Testing using the standard benchmark suite from jackson-benchmarks on my workstation with a new JsonNoCanonicalizeReadVanilla benchmark that's identical to JsonStdReadVanilla except it turns off canonicalization. I haven't created a PR to add this class to the jackson-benchmarks project because I suspect it isn't necessary in addition to the arbitrary key benchmark, but I'd be happy to push it up if you like.
Both run with 4 iterations of 4 seconds apiece for both warmup and measurement, with 14 threads:
java -Xmx256m -jar target/perf.jar ".*JsonNoCanonicalizeReadVanilla.*" -wi 4 -w 4 -i 4 -r 4 -f 1 -t 14
Before:
After:
Values changed quite a bit; it is unclear whether this is due to generally high variance or to the change itself.
JsonArbitraryFieldNameBenchmark shows improvements in the INPUT_STREAM cases without canonicalization, which were previously identical to the READER results:
java -Xmx256m -jar target/perf.jar ".*JsonArbitraryFieldNameBenchmark.*" -wi 4 -w 4 -i 4 -r 4 -f 1 -t 14