
Some benchmarks have concerningly high uop miss rates #702

Open

brandtbucher opened this issue Nov 4, 2024 · 3 comments

Comments

@brandtbucher
Member

When looking at the aggregate stats across all benchmarks, none of the uops have very high miss rates (which is obviously good). However, when looking at the stats for individual benchmarks, there are clearly some outliers that deserve to be addressed.

First, here are all of the instructions with miss rates above 50% for an individual benchmark. Out of the entire benchmark suite, almost half of all benchmarks have at least one instruction with a high miss rate. I've listed the affected instructions per benchmark in the format NAME (TOTAL% @ MISS_RATE%). For example, pylint: _GUARD_TYPE_VERSION (5.2% @ 73.8%) means that _GUARD_TYPE_VERSION makes up 5.2% of the executed instructions for the pylint benchmark and has a miss rate of 73.8%. Some instructions look like _FOR_ITER_TIER_TWO (0.0% @ 100.0%), meaning the instruction occurred only a handful of times but failed every single time. (A small sketch of how these percentages can be derived from raw per-uop counters follows the list.)

  • async_tree_cpu_io_mixed_tg: _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 100.0%)
  • async_tree_io_tg: _GUARD_IS_TRUE_POP (0.1% @ 50.5%)
  • async_tree_memoization_tg: _GUARD_IS_TRUE_POP (0.2% @ 50.3%), _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 100.0%)
  • async_tree_tg: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 100.0%)
  • asyncio_tcp: _GUARD_NOT_EXHAUSTED_LIST (0.8% @ 73.1%), _GUARD_NOT_EXHAUSTED_RANGE (0.4% @ 90.1%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • asyncio_tcp_ssl: _GUARD_NOT_EXHAUSTED_LIST (0.4% @ 61.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 100.0%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • asyncio_websockets: _GUARD_NOT_EXHAUSTED_LIST (0.2% @ 58.1%)
  • bpe_tokenizer: _GUARD_IS_NONE_POP (0.0% @ 80.6%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 62.5%)
  • comprehensions: _TO_BOOL_NONE (0.9% @ 89.8%), _GUARD_IS_FALSE_POP (0.4% @ 75.0%), _CHECK_ATTR_CLASS (0.0% @ 100.0%)
  • concurrent_imap: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 95.6%), _LOAD_ATTR_INSTANCE_VALUE_1 (0.0% @ 100.0%)
  • deepcopy: _GUARD_DORV_NO_DICT (0.2% @ 100.0%)
  • deltablue: _CALL_METHOD_DESCRIPTOR_FAST (0.2% @ 100.0%), _GUARD_IS_NONE_POP (0.0% @ 98.0%)
  • django_template: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100%)
  • docutils: _GUARD_TYPE_VERSION (6.3% @ 55.7%), _CHECK_ATTR_CLASS (0.0% @ 79.2%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 92.5%)
  • dulwich_log: _GUARD_NOT_EXHAUSTED_LIST (0.1% @ 98.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.1% @ 100.0%)
  • generators: _GUARD_IS_FALSE_POP (0.2% @ 100.0%), _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100.0%)
  • genshi: _TO_BOOL_STR (0.1% @ 83.3%), _GUARD_NOT_EXHAUSTED_TUPLE (0.1% @ 99.8%)
  • go: _TO_BOOL_NONE (0.1% @ 52.2%)
  • html5lib: _CHECK_FUNCTION_VERSION (4.1% @ 70.2%), _GUARD_IS_NOT_NONE_POP (0.2% @ 99.7%), _CALL_METHOD_DESCRIPTOR_O (0.1% @ 97.7%), _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 85.7%)
  • logging: _GUARD_NOT_EXHAUSTED_LIST (0.1% @ 100.0%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • mako: _CALL_METHOD_DESCRIPTOR_FAST (0.0% @ 100.0%)
  • mdp: _CALL_METHOD_DESCRIPTOR_NOARGS (0.0% @ 100.0%)
  • pathlib: _GUARD_NOT_EXHAUSTED_TUPLE (1.2% @ 100.0%)
  • pprint: _FOR_ITER_TIER_TWO (0.0% @ 68.4%)
  • pycparser: _CHECK_FUNCTION_VERSION (1.4% @ 80.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 88.3%)
  • pylint: _GUARD_TYPE_VERSION (5.2% @ 73.8%), _CHECK_FUNCTION_VERSION (1.6% @ 64.7%), _TO_BOOL_NONE (0.1% @ 71.8%), _LOAD_ATTR_INSTANCE_VALUE_1 (0.0% @ 100.0%), _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 54.0%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 74.6%), _CHECK_METHOD_VERSION (0.0% @ 100.0%), _CHECK_METHOD_VERSION_KW (0.0% @ 100.0%)
  • regex_compile: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 50.1%)
  • regex_effbot: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100%)
  • sphinx: _GUARD_TYPE_VERSION (5.1% @ 69.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.8% @ 64.3%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 84.2%)
  • sqlglot: _FOR_ITER_TIER_TWO (1.3% @ 62.5%), _GUARD_NOT_EXHAUSTED_LIST (0.7% @ 63.0%), _CONTAINS_OP_SET (0.0% @ 100.0%)
  • sqlglot_optimize: _GUARD_NOT_EXHAUSTED_LIST (0.9% @ 58%), _GUARD_IS_NONE_POP (0.1% @ 89.1%)
  • sqlglot_parse: _TO_BOOL_NONE (0.5% @ 66.8%)
  • sqlglot_transpile: _TO_BOOL_NONE (0.5% @ 59.0%)
  • sympy: _GUARD_IS_NOT_NONE_POP (0.5% @ 58.2%), _CALL_METHOD_DESCRIPTOR_FAST (0.4% @ 94.4%), _SEND_GEN_FRAME (0.0% @ 100.0%)
  • thrift: _GUARD_NOT_EXHAUSTED_TUPLE (1.9% @ 100.0%)
  • unpickle_pure_python: _TO_BOOL_NONE (1.6% @ 99.4%)
  • xml_etree: _GUARD_NOT_EXHAUSTED_LIST (0.4% @ 86.6%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 86.6%), _GUARD_IS_NONE_POP (0.0% @ 100%)
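
For reference, a minimal sketch of how numbers in this shape can be derived from per-uop counters. The dictionary layout, field names, and counts below are assumptions for illustration, not the actual pystats output format:

```python
# Hypothetical per-benchmark uop counters: executions and side exits per uop.
# The field names and counts are illustrative, not the real pystats schema.
counters = {
    "_GUARD_TYPE_VERSION": {"count": 5_200_000, "miss": 3_837_600},
    "_FOR_ITER_TIER_TWO": {"count": 19, "miss": 19},
    "_LOAD_FAST": {"count": 94_800_000, "miss": 0},
}

total = sum(c["count"] for c in counters.values())

for name, c in counters.items():
    share = 100.0 * c["count"] / total          # TOTAL%: share of all executed uops
    miss_rate = 100.0 * c["miss"] / c["count"]  # MISS_RATE%: fraction of executions that side-exit
    if miss_rate > 50.0:
        print(f"{name} ({share:.1f}% @ {miss_rate:.1f}%)")
```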

For ease of analysis, I've broken these up into 3 groups:

Weird for loops

Almost a third of all benchmarks have a for loop header with a very high miss rate. These aren't failed type checks; the stats indicate that these loops are exited after just zero or one iterations. This is almost certainly a bug in our tracing logic, or a bug in the instructions themselves. I didn't see any bugs in the actual instructions at first glance, so we may need to dive a bit deeper into a particular benchmark to really learn what is happening here. I suspect that if we "fix" one, we'll fix them all (an illustrative sketch of the pattern follows the list):

  • async_tree_cpu_io_mixed_tg: _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 100.0%)
  • async_tree_memoization_tg: _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 100.0%)
  • async_tree_tg: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 100.0%)
  • asyncio_tcp: _GUARD_NOT_EXHAUSTED_LIST (0.8% @ 73.1%), _GUARD_NOT_EXHAUSTED_RANGE (0.4% @ 90.1%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • asyncio_tcp_ssl: _GUARD_NOT_EXHAUSTED_LIST (0.4% @ 61.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 100.0%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • asyncio_websockets: _GUARD_NOT_EXHAUSTED_LIST (0.2% @ 58.1%)
  • bpe_tokenizer: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 62.5%)
  • concurrent_imap: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 95.6%)
  • django_template: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100%)
  • dulwich_log: _GUARD_NOT_EXHAUSTED_LIST (0.1% @ 98.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.1% @ 100.0%)
  • generators: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100.0%)
  • genshi: _GUARD_NOT_EXHAUSTED_TUPLE (0.1% @ 99.8%)
  • html5lib: _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 85.7%)
  • logging: _GUARD_NOT_EXHAUSTED_LIST (0.1% @ 100.0%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • pathlib: _GUARD_NOT_EXHAUSTED_TUPLE (1.2% @ 100.0%)
  • pprint: _FOR_ITER_TIER_TWO (0.0% @ 68.4%)
  • pycparser: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 88.3%)
  • pylint: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 54.0%)
  • regex_compile: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 50.1%)
  • regex_effbot: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100%)
  • sphinx: _GUARD_NOT_EXHAUSTED_TUPLE (0.8% @ 64.3%)
  • sqlglot: _FOR_ITER_TIER_TWO (1.3% @ 62.5%), _GUARD_NOT_EXHAUSTED_LIST (0.7% @ 63.0%)
  • sqlglot_optimize: _GUARD_NOT_EXHAUSTED_LIST (0.9% @ 58%)
  • thrift: _GUARD_NOT_EXHAUSTED_TUPLE (1.9% @ 100.0%)
  • xml_etree: _GUARD_NOT_EXHAUSTED_LIST (0.4% @ 86.6%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 86.6%)
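
As a concrete illustration of the pattern (not taken from any specific benchmark), here is a hedged sketch of the kind of loop that would produce these numbers: the loop body is hot enough to get traced, but the iterable is usually empty or has a single element, so the loop-header guard fails almost every time the trace is entered:

```python
# Illustrative only: a hot loop whose iterable usually has 0 or 1 elements.
# If a trace is projected through the loop body, the "not exhausted" guard at
# the loop header (_GUARD_NOT_EXHAUSTED_LIST here) side-exits on almost every
# call, even though the type check on the iterator itself never fails.
def collect(rows):
    out = []
    for row in rows:  # rows is usually [] or a one-element list
        out.append(row)
    return out

data = [[], [1], [], [2], []] * 100_000
for rows in data:
    collect(rows)
```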

Poorly-traced branches

A branch only has two directions, so a miss rate above 50% means that we aren't predicting the likely direction of the branch well (or at least aren't adapting to phase shifts in the program). Over 10% of the benchmarks show opportunities for improvement here (an illustrative sketch follows the list):

  • async_tree_io_tg: _GUARD_IS_TRUE_POP (0.1% @ 50.5%)
  • async_tree_memoization_tg: _GUARD_IS_TRUE_POP (0.2% @ 50.3%)
  • bpe_tokenizer: _GUARD_IS_NONE_POP (0.0% @ 80.6%)
  • comprehensions: _GUARD_IS_FALSE_POP (0.4% @ 75.0%)
  • deltablue: _GUARD_IS_NONE_POP (0.0% @ 98.0%)
  • generators: _GUARD_IS_FALSE_POP (0.2% @ 100.0%)
  • html5lib: _GUARD_IS_NOT_NONE_POP (0.2% @ 99.7%)
  • sqlglot_optimize: _GUARD_IS_NONE_POP (0.1% @ 89.1%)
  • sympy: _GUARD_IS_NOT_NONE_POP (0.5% @ 58.2%)
  • xml_etree: _GUARD_IS_NONE_POP (0.0% @ 100%)
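
Again purely illustrative, a sketch of the kind of data-dependent branch that produces a ~50% (or worse) miss rate: the trace commits to whichever direction was seen during tracing, but the runtime data doesn't favor either side:

```python
# Illustrative only: a branch whose direction is roughly 50/50 at runtime.
# A trace records whichever direction was seen while tracing; every time the
# other direction is taken, the corresponding guard (_GUARD_IS_NONE_POP or
# _GUARD_IS_NOT_NONE_POP) side-exits.
def describe(value):
    if value is None:
        return "missing"
    return str(value)

for i in range(200_000):
    describe(None if i % 2 else i)
```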

Polymorphism

This one is a little fuzzier, since if a site is equally polymorphic across three types, it's basically impossible to avoid a ~66% miss rate on the "likely" path. With that said, almost a quarter of the benchmarks have high miss rates due to polymorphism, some of them concerningly high (again, an illustrative sketch follows the list):

  • comprehensions: _TO_BOOL_NONE (0.9% @ 89.8%), _CHECK_ATTR_CLASS (0.0% @ 100.0%)
  • concurrent_imap: _LOAD_ATTR_INSTANCE_VALUE_1 (0.0% @ 100.0%)
  • deepcopy: _GUARD_DORV_NO_DICT (0.2% @ 100.0%)
  • deltablue: _CALL_METHOD_DESCRIPTOR_FAST (0.2% @ 100.0%)
  • docutils: _GUARD_TYPE_VERSION (6.3% @ 55.7%), _CHECK_ATTR_CLASS (0.0% @ 79.2%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 92.5%)
  • genshi: _TO_BOOL_STR (0.1% @ 83.3%)
  • go: _TO_BOOL_NONE (0.1% @ 52.2%)
  • html5lib: _CHECK_FUNCTION_VERSION (4.1% @ 70.2%), _CALL_METHOD_DESCRIPTOR_O (0.1% @ 97.7%)
  • mako: _CALL_METHOD_DESCRIPTOR_FAST (0.0% @ 100.0%)
  • mdp: _CALL_METHOD_DESCRIPTOR_NOARGS (0.0% @ 100.0%)
  • pycparser: _CHECK_FUNCTION_VERSION (1.4% @ 80.0%)
  • pylint: _GUARD_TYPE_VERSION (5.2% @ 73.8%), _CHECK_FUNCTION_VERSION (1.6% @ 64.7%), _TO_BOOL_NONE (0.1% @ 71.8%), _LOAD_ATTR_INSTANCE_VALUE_1 (0.0% @ 100.0%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 74.6%), _CHECK_METHOD_VERSION (0.0% @ 100.0%), _CHECK_METHOD_VERSION_KW (0.0% @ 100.0%)
  • sphinx: _GUARD_TYPE_VERSION (5.1% @ 69.0%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 84.2%)
  • sqlglot: _CONTAINS_OP_SET (0.0% @ 100.0%)
  • sqlglot_parse: _TO_BOOL_NONE (0.5% @ 66.8%)
  • sqlglot_transpile: _TO_BOOL_NONE (0.5% @ 59.0%)
  • sympy: _CALL_METHOD_DESCRIPTOR_FAST (0.4% @ 94.4%), _SEND_GEN_FRAME (0.0% @ 100.0%)
  • unpickle_pure_python: _TO_BOOL_NONE (1.6% @ 99.4%)
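
For illustration, a hedged sketch of a polymorphic site: a single attribute access sees several unrelated types, so a guard specialized on one type version misses whenever any of the others shows up:

```python
# Illustrative only: one attribute-access site that sees three unrelated
# classes about equally often. A guard on a single cached type version
# (e.g. _GUARD_TYPE_VERSION) will miss roughly two thirds of the time,
# no matter which class it specializes for.
class A:
    x = 1

class B:
    x = 2

class C:
    x = 3

def read_x(obj):
    return obj.x  # polymorphic attribute load

objs = [A(), B(), C()] * 100_000
total = sum(read_x(o) for o in objs)
```
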
@Fidget-Spinner
Collaborator

Should polymorphic inline caches be a consideration for fixing this? If we fix this in the interpreter, does it get fixed in the JIT as well?

@brandtbucher
Member Author

Possibly, now that we have multiple operands. I think it's just tricky when there are several different "kinds" of caches... we want to avoid an explosion in new tier one instructions (or in our inline cache sizes).
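
For what it's worth, a conceptual, pure-Python sketch of that tradeoff (this is not CPython's actual cache layout or API; the names and structure are made up for illustration): going from one cached type version to two lets a bimorphic site keep hitting, but every extra entry costs inline cache space and/or additional specialized instructions.

```python
# Conceptual sketch only; this is not CPython's cache layout or API.
# A monomorphic cache remembers one type version; a polymorphic cache keeps a
# small fixed number of entries, trading inline cache size (and potentially
# new specialized instructions) for a better hit rate at polymorphic sites.
class PolymorphicCache:
    def __init__(self, slots=2):
        self.slots = slots
        self.entries = []  # list of (type_version, cached_lookup_result)

    def lookup(self, type_version):
        for version, result in self.entries:
            if version == type_version:
                return result  # hit: one of the cached versions matches
        return None            # miss: fall back to the generic path

    def record(self, type_version, result):
        if len(self.entries) < self.slots:
            self.entries.append((type_version, result))
```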

@brandtbucher
Member Author

Also, from spot-checking a couple of the weird for-loops, they truly do just seem to be loops that execute 0 or 1 times (or possibly hitting weird edge-cases in our thresholds, like looping exactly 16 times). Not sure what there is to do about that, honestly.
