All About Python Benchmarking
This page is dedicated to the numerous facets of benchmarking Python.
A benchmark is a script or application designed for producing concrete measurements of another application or runtime (e.g. CPython). It is often characterized as a "micro" benchmark or "macro" benchmark, depending on its complexity and what functionality it represents. Often a benchmark will aim to represent some specific execution profile based on an existing use case or capability, rather than measuring real-world usage. This is because benchmarks need to give consistent results, as well as emphasize the performance of the target use case or capability.
Why is benchmarking important to us, as we work to make Python faster? It boils down to this: decisions are especially effective when based on reliable data, and even more so when it comes to technology. This is no different for Python implementors. We want to be sure Python meets the needs of its community.
The Needs of the Community
The Python community has a wide variety of needs, only some of which Python itself satisfies. It is critical that Python implementors have the following:
- an understanding of the community's Python needs
- a consistent terminology to communicate about those needs
- a uniform way to measure how well they are met
On this page we focus on two critical aspects of how Python is used: workloads and features. Workloads are the categories of use cases for applications and libraries. Features are the capabilities provided by the Python language and stdlib that those applications rely on.
Python Performance
The central discussion here is around making Python faster. Benchmarks are essential to making that happen. For Python implementors, benchmarks provide the reliable data we need for making good decisions about Python performance. They are useful to users too, as users make their own technology choices.
Users:
- care about how fast (or how efficiently) their applications run
- want to know which of several comparable features is fastest
- factor in performance when considering different Python versions or other Python implementations
Python implementors:
- use benchmark results to communicate about those things with users
- care about which features are slow (or fast)
- need to know how much proposed changes improve or hurt performance
- want to quickly pinpoint the source of performance regressions
Benchmarks are meaningful only if they serve all of these needs. The tricky part is making benchmarks that do so effectively, especially when they can take a long time to run and the time available for analyzing results is limited.
This is where workloads and features come back into play. Benchmarks that focus either on workloads or on specific features are, together, very effective at ticking all those earlier boxes.
A Python "workload" is what we identify as a high-level use case for a Python runtime in the community. At its essence, a workload is a discrete category in which to group Python applications (and libraries), describing that specific case. Some workloads are complex, with applications utilizing many Python features, while others are simpler. Some workloads are long-running, while others are short-lived. The resources on which workloads depend also vary greatly.
XXX TODO:
- top-level workloads vs. sub-workloads
When a benchmark represents the behavior of a specific workload, we call it a "workload" benchmark. Another name for a workload benchmark is a "macro" benchmark.
Related:
- per-workload tables
- per-benchmark tables
Python (and its different implementations) can be partially described as sets of features. Features can be categorized as granular (i.e. "atomic") vs. composite, language vs. stdlib, etc. Features are distinct from workloads in several important ways. A feature is provided by the language/runtime and is a low(er)-level building block with a focus on a specific foundational capability. In contrast, a workload is focused on high-level user applications.
A "feature benchmark" is one that focuses strictly on exercising a specific Python feature in a specific way. Feature benchmarks are all "micro" benchmarks (but not all micro benchmarks are feature benchmarks).
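As a rough illustration of a feature benchmark, the sketch below times one small statement with the stdlib timeit module. The helper name bench_feature and the default repeat/number values are assumptions made for this example, not part of any existing benchmark suite.

```python
import timeit


def bench_feature(stmt, setup="pass", repeat=5, number=100_000):
    """Return the best observed per-execution time (seconds) for `stmt`.

    Hypothetical helper: `repeat` and `number` are illustrative defaults.
    """
    timer = timeit.Timer(stmt, setup=setup)
    # Each entry is the total time for `number` executions of `stmt`.
    totals = timer.repeat(repeat=repeat, number=number)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(totals) / number


if __name__ == "__main__":
    per_call = bench_feature("x.append(1)", setup="x = []")
    print(f"list.append: {per_call * 1e9:.1f} ns per call")
```

The same helper could be pointed at two comparable features (for example, a list comprehension and an equivalent list() call) to see which is faster on a given runtime.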
Related:
- per-feature tables
- per-benchmark tables
XXX operations:
- how to run benchmarks
- getting consistent results
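One common way to approach consistent results, sketched here under assumptions, is to discard a few warm-up runs, collect several timed samples, and report the spread alongside the central value. The function names measure and summarize and the default counts are hypothetical choices for this example.

```python
import statistics
import time


def measure(func, repeat=7, warmup=2):
    """Call `func` repeatedly and return the wall-clock time of each timed run."""
    for _ in range(warmup):
        func()  # warm-up runs are not recorded
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Summarize samples; a large relative spread suggests a noisy measurement."""
    median = statistics.median(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {
        "min": min(samples),
        "median": median,
        "stdev": stdev,
        "rel_stdev": stdev / median if median else 0.0,
    }


if __name__ == "__main__":
    print(summarize(measure(lambda: sorted(range(10_000), reverse=True))))
```

A high rel_stdev is a hint that the machine was busy or the benchmark is too short, and that the run may be worth repeating before drawing conclusions.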
XXX operations:
- how to compare
- sharing results
- fair comparisons between runs, incl. between implementations
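When comparing two runs, one frequently used approach is to compute a per-benchmark ratio and then summarize the ratios with a geometric mean. The sketch below assumes each run is a plain dict mapping a benchmark name to a time in seconds. That data layout and the function names are assumptions for illustration only.

```python
import math


def speedups(baseline, candidate):
    """Return baseline/candidate time ratios for benchmarks present in both runs.

    A ratio above 1.0 means the candidate run was faster on that benchmark.
    """
    common = sorted(baseline.keys() & candidate.keys())
    return {name: baseline[name] / candidate[name] for name in common}


def geometric_mean(values):
    """Geometric mean, which treats a 2x speedup and a 2x slowdown symmetrically."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # Made-up timings, purely to show the shape of the data.
    base = {"bench_a": 2.0, "bench_b": 8.0}
    cand = {"bench_a": 1.0, "bench_b": 16.0}
    ratios = speedups(base, cand)
    print(ratios, geometric_mean(ratios.values()))
```

Only benchmarks present in both runs are compared, which avoids skewing the summary when one implementation cannot run part of the suite.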
We can measure many kinds of performance, but CPU (time) performance is typically the primary subject (with memory use as a secondary one).
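As a minimal sketch of those two measurements, the stdlib offers time.process_time for CPU time and tracemalloc for Python-level memory allocations. The helper names below are hypothetical, and the two measurements are taken in separate calls because tracing allocations slows the code being timed.

```python
import time
import tracemalloc


def cpu_seconds(func):
    """CPU time (user + system) consumed by this process while running `func`."""
    start = time.process_time()
    func()
    return time.process_time() - start


def peak_bytes(func):
    """Peak size of memory blocks traced by tracemalloc while running `func`."""
    tracemalloc.start()
    try:
        func()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    work = lambda: list(range(100_000))
    print(f"cpu: {cpu_seconds(work):.6f} s, peak: {peak_bytes(work)} bytes")
```

Note that tracemalloc only sees allocations made through Python's memory allocators, so memory allocated directly by extension modules may not be counted.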
XXX core benchmarks vs. community benchmarks
XXX practical concerns:
- throughput: meaningful results vs. getting results quickly
- throughput (workloads): getting benchmark results quickly vs. accurate representation of long-running workloads
- benchmarks for features/workloads that use system resources (network, FS, etc.)
- benchmarks for features/workloads that are fundamentally non-deterministic
Related:
...