
Some benchmarks have concerningly high uop miss rates #702

Open

brandtbucher opened this issue Nov 4, 2024 · 3 comments

Comments

@brandtbucher
Member

When looking at the aggregate stats across all benchmarks, none of the uops have very high miss rates (which is obviously good). However, when looking at the stats for individual benchmarks, there are clearly some outliers that deserve to be addressed.

First, here are all of the instructions with miss rates above 50% for an individual benchmark. Out of the entire benchmark suite, almost half of all benchmarks have at least one instruction with a high miss rate. I've listed the affected instructions per benchmark in the format NAME (TOTAL% @ MISS_RATE%). For example, pylint: _GUARD_TYPE_VERSION (5.2% @ 73.8%) means that _GUARD_TYPE_VERSION makes up 5.2% of the executed instructions for the pylint benchmark and has a miss rate of 73.8%. Some instructions look like _FOR_ITER_TIER_TWO (0.0% @ 100.0%), meaning the instruction occurred only a handful of times but failed every single time. (A small sketch of how these percentages can be derived from raw per-uop counters follows the list.)

  • async_tree_cpu_io_mixed_tg: _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 100.0%)
  • async_tree_io_tg: _GUARD_IS_TRUE_POP (0.1% @ 50.5%)
  • async_tree_memoization_tg: _GUARD_IS_TRUE_POP (0.2% @ 50.3%), _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 100.0%)
  • async_tree_tg: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 100.0%)
  • asyncio_tcp: _GUARD_NOT_EXHAUSTED_LIST (0.8% @ 73.1%), _GUARD_NOT_EXHAUSTED_RANGE (0.4% @ 90.1%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • asyncio_tcp_ssl: _GUARD_NOT_EXHAUSTED_LIST (0.4% @ 61.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 100.0%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • asyncio_websockets: _GUARD_NOT_EXHAUSTED_LIST (0.2% @ 58.1%)
  • bpe_tokenizer: _GUARD_IS_NONE_POP (0.0% @ 80.6%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 62.5%)
  • comprehensions: _TO_BOOL_NONE (0.9% @ 89.8%), _GUARD_IS_FALSE_POP (0.4% @ 75.0%), _CHECK_ATTR_CLASS (0.0% @ 100.0%)
  • concurrent_imap: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 95.6%), _LOAD_ATTR_INSTANCE_VALUE_1 (0.0% @ 100.0%)
  • deepcopy: _GUARD_DORV_NO_DICT (0.2% @ 100.0%)
  • deltablue: _CALL_METHOD_DESCRIPTOR_FAST (0.2% @ 100.0%), _GUARD_IS_NONE_POP (0.0% @ 98.0%)
  • django_template: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100%)
  • docutils: _GUARD_TYPE_VERSION (6.3% @ 55.7%), _CHECK_ATTR_CLASS (0.0% @ 79.2%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 92.5%)
  • dulwich_log: _GUARD_NOT_EXHAUSTED_LIST (0.1% @ 98.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.1% @ 100.0%)
  • generators: _GUARD_IS_FALSE_POP (0.2% @ 100.0%), _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100.0%)
  • genshi: _TO_BOOL_STR (0.1% @ 83.3%), _GUARD_NOT_EXHAUSTED_TUPLE (0.1% @ 99.8%)
  • go: _TO_BOOL_NONE (0.1% @ 52.2%)
  • html5lib: _CHECK_FUNCTION_VERSION (4.1% @ 70.2%), _GUARD_IS_NOT_NONE_POP (0.2% @ 99.7%), _CALL_METHOD_DESCRIPTOR_O (0.1% @ 97.7%), _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 85.7%)
  • logging: _GUARD_NOT_EXHAUSTED_LIST (0.1% @ 100.0%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • mako: _CALL_METHOD_DESCRIPTOR_FAST (0.0% @ 100.0%)
  • mdp: _CALL_METHOD_DESCRIPTOR_NOARGS (0.0% @ 100.0%)
  • pathlib: _GUARD_NOT_EXHAUSTED_TUPLE (1.2% @ 100.0%)
  • pprint: _FOR_ITER_TIER_TWO (0.0% @ 68.4%)
  • pycparser: _CHECK_FUNCTION_VERSION (1.4% @ 80.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 88.3%)
  • pylint: _GUARD_TYPE_VERSION (5.2% @ 73.8%), _CHECK_FUNCTION_VERSION (1.6% @ 64.7%), _TO_BOOL_NONE (0.1% @ 71.8%), _LOAD_ATTR_INSTANCE_VALUE_1 (0.0% @ 100.0%), _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 54.0%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 74.6%), _CHECK_METHOD_VERSION (0.0% @ 100.0%), _CHECK_METHOD_VERSION_KW (0.0% @ 100.0%)
  • regex_compile: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 50.1%)
  • regex_effbot: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100%)
  • sphinx: _GUARD_TYPE_VERSION (5.1% @ 69.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.8% @ 64.3%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 84.2%)
  • sqlglot: _FOR_ITER_TIER_TWO (1.3% @ 62.5%), _GUARD_NOT_EXHAUSTED_LIST (0.7% @ 63.0%), _CONTAINS_OP_SET (0.0% @ 100.0%)
  • sqlglot_optimize: _GUARD_NOT_EXHAUSTED_LIST (0.9% @ 58%), _GUARD_IS_NONE_POP (0.1% @ 89.1%)
  • sqlglot_parse: _TO_BOOL_NONE (0.5% @ 66.8%)
  • sqlglot_transpile: _TO_BOOL_NONE (0.5% @ 59.0%)
  • sympy: _GUARD_IS_NOT_NONE_POP (0.5% @ 58.2%), _CALL_METHOD_DESCRIPTOR_FAST (0.4% @ 94.4%), _SEND_GEN_FRAME (0.0% @ 100.0%)
  • thrift: _GUARD_NOT_EXHAUSTED_TUPLE (1.9% @ 100.0%)
  • unpickle_pure_python: _TO_BOOL_NONE (1.6% @ 99.4%)
  • xml_etree: _GUARD_NOT_EXHAUSTED_LIST (0.4% @ 86.6%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 86.6%), _GUARD_IS_NONE_POP (0.0% @ 100%)
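
For reference, a minimal sketch of how numbers in this shape can be derived from per-uop counters. The dictionary layout, field names, and counts below are assumptions for illustration, not the actual pystats output format:

```python
# Hypothetical per-benchmark uop counters: executions and side exits per uop.
# The field names and counts are illustrative, not the real pystats schema.
counters = {
    "_GUARD_TYPE_VERSION": {"count": 5_200_000, "miss": 3_837_600},
    "_FOR_ITER_TIER_TWO": {"count": 19, "miss": 19},
    "_LOAD_FAST": {"count": 94_800_000, "miss": 0},
}

total = sum(c["count"] for c in counters.values())

for name, c in counters.items():
    share = 100.0 * c["count"] / total          # TOTAL%: share of all executed uops
    miss_rate = 100.0 * c["miss"] / c["count"]  # MISS_RATE%: fraction of executions that side-exit
    if miss_rate > 50.0:
        print(f"{name} ({share:.1f}% @ {miss_rate:.1f}%)")
```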

For ease of analysis, I've broken these up into 3 groups:

Weird for loops

Almost a third of all benchmarks have a for loop header with a very high miss rate. These aren't failed type checks; the stats indicate that these loops are exited after just zero or one iterations. This is almost certainly a bug in our tracing logic, or a bug in the instructions themselves. I didn't see any bugs in the actual instructions at first glance, so we may need to dive a bit deeper into a particular benchmark to really learn what is happening here. I suspect that if we "fix" one, we'll fix them all (an illustrative sketch of the pattern follows the list):

  • async_tree_cpu_io_mixed_tg: _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 100.0%)
  • async_tree_memoization_tg: _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 100.0%)
  • async_tree_tg: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 100.0%)
  • asyncio_tcp: _GUARD_NOT_EXHAUSTED_LIST (0.8% @ 73.1%), _GUARD_NOT_EXHAUSTED_RANGE (0.4% @ 90.1%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • asyncio_tcp_ssl: _GUARD_NOT_EXHAUSTED_LIST (0.4% @ 61.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 100.0%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • asyncio_websockets: _GUARD_NOT_EXHAUSTED_LIST (0.2% @ 58.1%)
  • bpe_tokenizer: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 62.5%)
  • concurrent_imap: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 95.6%)
  • django_template: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100%)
  • dulwich_log: _GUARD_NOT_EXHAUSTED_LIST (0.1% @ 98.0%), _GUARD_NOT_EXHAUSTED_TUPLE (0.1% @ 100.0%)
  • generators: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100.0%)
  • genshi: _GUARD_NOT_EXHAUSTED_TUPLE (0.1% @ 99.8%)
  • html5lib: _GUARD_NOT_EXHAUSTED_LIST (0.0% @ 85.7%)
  • logging: _GUARD_NOT_EXHAUSTED_LIST (0.1% @ 100.0%), _FOR_ITER_TIER_TWO (0.0% @ 100.0%)
  • pathlib: _GUARD_NOT_EXHAUSTED_TUPLE (1.2% @ 100.0%)
  • pprint: _FOR_ITER_TIER_TWO (0.0% @ 68.4%)
  • pycparser: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 88.3%)
  • pylint: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 54.0%)
  • regex_compile: _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 50.1%)
  • regex_effbot: _GUARD_NOT_EXHAUSTED_RANGE (0.0% @ 100%)
  • sphinx: _GUARD_NOT_EXHAUSTED_TUPLE (0.8% @ 64.3%)
  • sqlglot: _FOR_ITER_TIER_TWO (1.3% @ 62.5%), _GUARD_NOT_EXHAUSTED_LIST (0.7% @ 63.0%)
  • sqlglot_optimize: _GUARD_NOT_EXHAUSTED_LIST (0.9% @ 58%)
  • thrift: _GUARD_NOT_EXHAUSTED_TUPLE (1.9% @ 100.0%)
  • xml_etree: _GUARD_NOT_EXHAUSTED_LIST (0.4% @ 86.6%), _GUARD_NOT_EXHAUSTED_TUPLE (0.0% @ 86.6%)
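
As a concrete illustration of the pattern (not taken from any specific benchmark), here is a hedged sketch of the kind of loop that would produce these numbers: the loop body is hot enough to get traced, but the iterable is usually empty or has a single element, so the loop-header guard fails almost every time the trace is entered:

```python
# Illustrative only: a hot loop whose iterable usually has 0 or 1 elements.
# If a trace is projected through the loop body, the "not exhausted" guard at
# the loop header (_GUARD_NOT_EXHAUSTED_LIST here) side-exits on almost every
# call, even though the type check on the iterator itself never fails.
def collect(rows):
    out = []
    for row in rows:  # rows is usually [] or a one-element list
        out.append(row)
    return out

data = [[], [1], [], [2], []] * 100_000
for rows in data:
    collect(rows)
```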

Poorly-traced branches

A branch only has two directions, so a miss rate above 50% means that we aren't predicting the likely direction of the branch well (or at least aren't adapting to phase shifts in the program). Over 10% of the benchmarks show opportunities for improvement here (an illustrative sketch follows the list):

  • async_tree_io_tg: _GUARD_IS_TRUE_POP (0.1% @ 50.5%)
  • async_tree_memoization_tg: _GUARD_IS_TRUE_POP (0.2% @ 50.3%)
  • bpe_tokenizer: _GUARD_IS_NONE_POP (0.0% @ 80.6%)
  • comprehensions: _GUARD_IS_FALSE_POP (0.4% @ 75.0%)
  • deltablue: _GUARD_IS_NONE_POP (0.0% @ 98.0%)
  • generators: _GUARD_IS_FALSE_POP (0.2% @ 100.0%)
  • html5lib: _GUARD_IS_NOT_NONE_POP (0.2% @ 99.7%)
  • sqlglot_optimize: _GUARD_IS_NONE_POP (0.1% @ 89.1%)
  • sympy: _GUARD_IS_NOT_NONE_POP (0.5% @ 58.2%)
  • xml_etree: _GUARD_IS_NONE_POP (0.0% @ 100%)
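
Again purely illustrative, a sketch of the kind of data-dependent branch that produces a ~50% (or worse) miss rate: the trace commits to whichever direction was seen during tracing, but the runtime data doesn't favor either side:

```python
# Illustrative only: a branch whose direction is roughly 50/50 at runtime.
# A trace records whichever direction was seen while tracing; every time the
# other direction is taken, the corresponding guard (_GUARD_IS_NONE_POP or
# _GUARD_IS_NOT_NONE_POP) side-exits.
def describe(value):
    if value is None:
        return "missing"
    return str(value)

for i in range(200_000):
    describe(None if i % 2 else i)
```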

Polymorphism

This one is a little fuzzier, since if a site is equally polymorphic across three types, it's basically impossible to avoid a ~66% miss rate on the "likely" path. With that said, almost a quarter of the benchmarks have high miss rates due to polymorphism, some of them concerningly high (again, an illustrative sketch follows the list):

  • comprehensions: _TO_BOOL_NONE (0.9% @ 89.8%), _CHECK_ATTR_CLASS (0.0% @ 100.0%)
  • concurrent_imap: _LOAD_ATTR_INSTANCE_VALUE_1 (0.0% @ 100.0%)
  • deepcopy: _GUARD_DORV_NO_DICT (0.2% @ 100.0%)
  • deltablue: _CALL_METHOD_DESCRIPTOR_FAST (0.2% @ 100.0%)
  • docutils: _GUARD_TYPE_VERSION (6.3% @ 55.7%), _CHECK_ATTR_CLASS (0.0% @ 79.2%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 92.5%)
  • genshi: _TO_BOOL_STR (0.1% @ 83.3%)
  • go: _TO_BOOL_NONE (0.1% @ 52.2%)
  • html5lib: _CHECK_FUNCTION_VERSION (4.1% @ 70.2%), _CALL_METHOD_DESCRIPTOR_O (0.1% @ 97.7%)
  • mako: _CALL_METHOD_DESCRIPTOR_FAST (0.0% @ 100.0%)
  • mdp: _CALL_METHOD_DESCRIPTOR_NOARGS (0.0% @ 100.0%)
  • pycparser: _CHECK_FUNCTION_VERSION (1.4% @ 80.0%)
  • pylint: _GUARD_TYPE_VERSION (5.2% @ 73.8%), _CHECK_FUNCTION_VERSION (1.6% @ 64.7%), _TO_BOOL_NONE (0.1% @ 71.8%), _LOAD_ATTR_INSTANCE_VALUE_1 (0.0% @ 100.0%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 74.6%), _CHECK_METHOD_VERSION (0.0% @ 100.0%), _CHECK_METHOD_VERSION_KW (0.0% @ 100.0%)
  • sphinx: _GUARD_TYPE_VERSION (5.1% @ 69.0%), _CHECK_AND_ALLOCATE_OBJECT (0.0% @ 84.2%)
  • sqlglot: _CONTAINS_OP_SET (0.0% @ 100.0%)
  • sqlglot_parse: _TO_BOOL_NONE (0.5% @ 66.8%)
  • sqlglot_transpile: _TO_BOOL_NONE (0.5% @ 59.0%)
  • sympy: _CALL_METHOD_DESCRIPTOR_FAST (0.4% @ 94.4%), _SEND_GEN_FRAME (0.0% @ 100.0%)
  • unpickle_pure_python: _TO_BOOL_NONE (1.6% @ 99.4%)
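
For illustration, a hedged sketch of a polymorphic site: a single attribute access sees several unrelated types, so a guard specialized on one type version misses whenever any of the others shows up:

```python
# Illustrative only: one attribute-access site that sees three unrelated
# classes about equally often. A guard on a single cached type version
# (e.g. _GUARD_TYPE_VERSION) will miss roughly two thirds of the time,
# no matter which class it specializes for.
class A:
    x = 1

class B:
    x = 2

class C:
    x = 3

def read_x(obj):
    return obj.x  # polymorphic attribute load

objs = [A(), B(), C()] * 100_000
total = sum(read_x(o) for o in objs)
```
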
@Fidget-Spinner
Collaborator

Should polymorphic inline caches be a consideration for fixing this? If we fix this in the interpreter, does it get fixed in the JIT as well?

@brandtbucher
Member Author

Possibly, now that we have multiple operands. I think it's just tricky when there are several different "kinds" of caches... we want to avoid an explosion in new tier one instructions (or in our inline cache sizes).
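
For what it's worth, a conceptual, pure-Python sketch of that tradeoff (this is not CPython's actual cache layout or API; the names and structure are made up for illustration): going from one cached type version to two lets a bimorphic site keep hitting, but every extra entry costs inline cache space and/or additional specialized instructions.

```python
# Conceptual sketch only; this is not CPython's cache layout or API.
# A monomorphic cache remembers one type version; a polymorphic cache keeps a
# small fixed number of entries, trading inline cache size (and potentially
# new specialized instructions) for a better hit rate at polymorphic sites.
class PolymorphicCache:
    def __init__(self, slots=2):
        self.slots = slots
        self.entries = []  # list of (type_version, cached_lookup_result)

    def lookup(self, type_version):
        for version, result in self.entries:
            if version == type_version:
                return result  # hit: one of the cached versions matches
        return None            # miss: fall back to the generic path

    def record(self, type_version, result):
        if len(self.entries) < self.slots:
            self.entries.append((type_version, result))
```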

@brandtbucher
Member Author

Also, from spot-checking a couple of the weird for-loops, they truly do just seem to be loops that execute 0 or 1 times (or possibly hitting weird edge-cases in our thresholds, like looping exactly 16 times). Not sure what there is to do about that, honestly.
