-
-
Notifications
You must be signed in to change notification settings - Fork 604
Trace analysis using trace.py
The purpose of the script is to display trace data, possibly applying various transformations to make the data more digestable. It operates on a trace file, which needs to be obtained beforehand.
See Tracepoints page to learn more about tracing OSv, how to enable it, etc.
Currently the trace file can be obtained only via GDB. Support for direct download from OSv instance is planned.
To save the trace data into a file (named trace_file
) connect with GDB to OSv and execute:
gdb$ osv trace save trace_file
Alternatively, you can use this shell one-liner to automate the whole process including starting gdb:
$ scripts/trace.py extract trace_file
First, run OSv with tracepoints enabled. I use one of the tests in this example:
$ scripts/run.py -H -e "--trace=memory_malloc --trace-backtrace /tests/misc-fs-stress.so file"
Explanation of parameters:
-
-H
means the OSv instance should not terminate on poweroff, we use it because we need to get the trace before it powers off -
--trace=memory_malloc
enablesmemory_malloc
tracepoint which we are interested in collecting -
--trace-backtrace
enables collecting of call backtrace at tracepoints; we want to know where the tracepoints are called from
Let the test run for a while, then collect the trace:
$ scripts/trace.py extract trace
Then you can print some statistics about the trace file:
$ scripts/trace.py summary trace
Collected 31742 samples spanning 2.57 s
Tracepoint statistics:
name count
---- -----
memory_malloc 31742
You can list the samples. It's good to use less
because typically there will be a lot of samples:
$ scripts/trace.py list trace | less
0xffffc0003e990010 0 1389705265.413128853 memory_malloc buf=0xffffc0003cb73500, len=152
0xffffc0003e990010 0 1389705265.413129568 memory_malloc buf=0xffffc00036677980, len=80
0xffffc0003e990010 0 1389705265.413130045 memory_malloc buf=0xffffc0003cb44b80, len=80
...
Meaning of the columns is:
Thread ID CPU TIMESTAMP TRACEPOINT ARGUMENTS
0xffffc0003e990010 0 1389705265.413128853 memory_malloc buf=0xffffc0003cb73500, len=152
If backtrace was collected you can show it by passing -b
option to trace.py list
:
$ scripts/trace.py list -b trace
0x0000000000d0ee58 0 1389705320.885835886 memory_malloc buf=0xffffc00035f47800, len=192 [log, operator(), zfs_kmem_alloc, zfs_range_lock, zfs_get_data, zil_lwb_commit, zil_close, zfsvfs_teardown, sys_umount2, unmount_rootfs]
0x0000000000d0ee58 0 1389705320.885836601 memory_malloc buf=0xffffc0003fe1f158, len=8 [log, operator(), zfs_kmem_alloc, dmu_buf_hold_array_by_dnode, zfs_get_data, zil_lwb_commit, zil_close, zfsvfs_teardown, sys_umount2, unmount_rootfs]
The backtrace which is associated with trace samples contains only raw addresses. The script is using addr2line
to resolve these symbols. For that to work it needs to have an object file with debug information. By default it uses build/release/loader.elf
for that. You can change the default by passing -d
or --debug
, in which case it uses build/debug/loader.elf
. You can also point it to arbitrary object file using -e
or --exe
option. You can disable symbol resolution by passing -x
or --no-resolve
in which case only raw addresses are displayed.
The user should ensure that correct object is used. There could be no sign, other than suspicious symbol chain in the trace, that the wrong object file was used for symbol resolution.
By default only function name is displayed. More information can be added using these options, any combination is allowed:
-
-L
,--show-line-number
, eg:log:320
-
-A
,--show-address
, eg:log @ 0x2032e3
-
-F
,--show-file-name
, eg:log /home/tgrabiec/src/osv/include/osv/trace.hh
Applying all of the above (-LAF
) will yield log /home/tgrabiec/src/osv/include/osv/trace.hh:320 @ 0x2032e3
Symbol resolution options can be used with all commands which display backtrace.
Tracepoints only carry information about a point in time a particular place in code was reached. However we can use a pair of trace samples to calculate the amount of time it took to execute the code between them. By convention the ending sample has the name starting with the name of entry sample and ending with either _ret
or _err
. The pair can be used to create a so called timed trace, which is basically the entry trace plus duration.
You can print timed trace summary using --timed
option of summary
command:
$ scripts/trace.py summary --timed trace
Collected 36843 samples spanning 8.85 s
Tracepoint statistics:
name count
---- -----
sched_wait 18420
sched_wait_ret 18423
Timed tracepoints [ms]:
name count min 50% 90% 99% 99.9% max total
---- ----- --- --- --- --- ----- --- -----
sched_wait 18255 0.000 0.043 15.210 2226.438 4232.178 6125.616 1066349.137
You can list timed samples using list-timed
command. Passing --sort=duration
sorts by duration, descending. The duration is in the column next to the one with tracepoint name:
$ scripts/trace.py list-timed --sort=duration trace
0xffffc00009680000 1 1389721284.391018152 6125.616 sched_wait
0xffffc0003cc1d000 0 1389721284.387134552 6022.635 sched_wait
0xffffc0003f386000 0 1389721281.852320671 4893.802 sched_wait
0xffffc0003eece000 0 1389721281.852322102 4893.799 sched_wait
0xffffc0003ef37000 0 1389721281.852324009 4893.791 sched_wait
0xffffc0003ef4c000 0 1389721281.852328539 4893.791 sched_wait
0xffffc0003ef61000 0 1389721281.852325916 4893.791 sched_wait
0xffffc0003eef8000 0 1389721281.852327824 4893.790 sched_wait
When backtrace collecting is enabled (--trace-backtrace
) trace samples carry information from where they were hit. This information can be used to identify hot places in code, or generally annotate code with information carried by trace samples. Such annotated call-graph is called a "profile" in this document.
The prof
command can be used to display a profile with the number of times given tracepoint was encountered.
Example:
$ scripts/trace.py prof -t memory_malloc trace
(100.00%, #31742) All
|-- (79.68%, #25291) zfs_kmem_alloc
| |-- (16.95%, #5380) zio_create
| | |-- (8.96%, #2844) dmu_buf_hold_array_by_dnode
| | | |-- (8.86%, #2811) zfs_get_data
| | | | zil_lwb_commit
| | | | zil_commit
| | | | zfsvfs_teardown
| | | | sys_umount2
| | | |
| | | |-- (0.10%, #32) dmu_buf_hold_array
| | | | space_map_sync
| | | | metaslab_sync
| | | | vdev_sync
| | | | spa_sync
| | | |
| | | \-- (0.00%, #1) space_map_load
| | | metaslab_activate
| | | metaslab_group_alloc
| | | zio_dva_allocate
| | | zio_execute
(...)
The tree shows all paths leading to memory_malloc
. On every level nodes are sorted by number of hits in descending order. So the most frequent path is at the top.
We can see that 79.68% of samples come from zfs_kmem_alloc
, which is called from different places. Among these places the most frequent is zio_create
, which is hit in 16.95% of all samples. And so on.
When there is only one path leading to given node the path elements are lined up one after another like this:
- (8.86%, #2811) zfs_get_data
zil_lwb_commit
zil_commit
zfsvfs_teardown
sys_umount2
Wait profile belongs to the class of profiles which annotate call-graph with a time-weight, typically representing time spent inside particular place in code. These profiles are called time profiles.
Wait profile is using time spent in sched::thread::wait()
to annotate the call-graph. The time is calculated based on samples of two tracepoints: sched_wait
and sched_wait_ret
. To be able to show this profile, you need to start OSv with --trace=sched_wait*
or enable these tracepoints dynamically.
Example
First collect some some traces by running OSv like this:
$ scripts/run.py -H -e "--trace=vfs_pwritev* --trace=sched_wait* --trace-backtrace tests/misc-fs-stress.so ff"
The vfs_pwritev*
tracepoints will be needed later to investigate latency of writes.
After the test is done save trace as trace
and run trace.py
on it like this:
$ scripts/trace.py prof-wait -g thread trace | less
You will see a lot of output looking like this:
=== Thread 0xffffc0003d511000 ===
12.10 s (100.00%, #581) All
|-- 12.10 s (99.98%, #561) do_wait_until
| TQ_SLEEP
| sched::thread::main()
|
|-- 1.35 ms (0.01%, #14) waiter::wait(sched::timer*) const
| wait_until<memory::pool::free_different_cpu(memory::pool::free_object*, unsigned int)::__lambda1>
| memory::pool::free(void*)
| |-- 582.01 us (0.00%, #3) zio_destroy
| | zio_execute
| | zio_notify_parent
| | zio_execute
| | taskqueue_run_locked
| |
| |-- 466.61 us (0.00%, #4) dbuf_write_done
| | arc_write_done
| | zio_done
| | zio_execute
| | zio_notify_parent
| |
| |-- 274.89 us (0.00%, #6) zio_remove_child
| | zio_execute
| | |-- 196.16 us (0.00%, #4) zio_notify_parent
| | | zio_execute
| | | taskqueue_run_locked
| | |
| | \-- 78.73 us (0.00%, #2) taskqueue_run_locked
| | sched::thread::main()
What to do with all this information depends on what we're trying to find out. The test we ran is measuring throughput of POSIX write()
which eventually calls zfs_write()
. In addition to that the test records latency of write and prints a summary of after the test, which may look like this:
Latency of write() [s]:
0 0.000000092
0.5 0.000004264
0.9 0.000005597
0.99 0.000012213
0.999 0.004506400
1.0 2.430870533
As we can see, in this run there were calls which took as long as 2.4s to return. Suppose we want to find out why. Let's see if the latency is really inside vfs_pwritev:
$ scripts/trace.py summary --timed trace
(...)
Timed tracepoints [ms]:
name count min 50% 90% 99% 99.9% max total
---- ----- --- --- --- --- ----- --- -----
vfs_pwritev 44949 0.004 0.005 0.008 1.089 18.284 2430.869 8086.956
(...)
Yes, max duration of vfs_pwritev
is 2430ms
, which is what the test reports. Let's locate the long sample by using list-timed
:
$ scripts/trace.py list-timed --sort=duration trace | grep pwritev | head
0xffffc0003e990010 1 1389797838.338931799 2430.869 vfs_pwritev 3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010 0 1389797842.818786860 578.166 vfs_pwritev 3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010 1 1389797837.760517836 578.070 vfs_pwritev 3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010 0 1389797843.445839882 232.688 vfs_pwritev 3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010 1 1389797840.812511921 143.236 vfs_pwritev 3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010 0 1389797842.681091785 95.171 vfs_pwritev 3 0x00002000002ff9b0 0x2 0x-1
0xffffc0003e990010 1 1389797841.075021982 86.724 vfs_pwritev 3 0x00002000002ff9b0 0x2 0x-1
We now know when the issue happened. Now we can pass that information to prof-wait
(note the presence of time units in parameters):
$ scripts/trace.py prof-wait --since 1389797838.338931799s --period 2430.869ms trace | less
What we do here is called time slicing. The profile will include only samples which overlap with specified period. If a sample crosses the edge of the time range its duration will be trimmed to reflect only the overlapping part.
Searching for zfs_write
gives an evidence that the pause is caused by waiting on ZIO to complete:
=== Thread 0xffffc0003e990010 ===
2.43 s (100.00%, #1) All
waiter::wait(sched::timer*) const
condvar::wait(lockfree::mutex*, unsigned long)
zio_wait
dmu_tx_count_write
dmu_tx_hold_write
zfs_write
vfs_file::write(uio*, int)
sys_write
The investigation process now continues. Why are we waiting for ZIO to complete during async write? This subject is out of scope of this tutorial.
The way a profile is displayed can be controlled with various options.
Grouping
By default samples are presented in on tree. The tool allows to group samples by a couple of properties via -g
or --group-by
option:
-
-g thread
- groups data by thread -
-g cpu
- groups data by CPU
Orientation
By default frames closer to the tracepoint call-site are closer to the root. We call this kind of profile callee-oriented, because callees are higher. It shows paths leading to tracepoint.
You can reverse the orientation by passing -r
or --caller-oriented
. In this case we will have the most-bottom frames in the root, its callees branching from them, and finally tracepoint call-sites in the leafs. This orientation is useful to show how bigger operations are composed.
NOTE: be careful with caller-oriented profiles. If the maximum backtrace length used during collecting of trace is too low it may not cover bottom-most frames (thread_main()
) for some samples. As a result traces which would otherwise end up in a single root will be spread across multiple 1st-level nodes.
Example of memory_malloc hit profile, caller oriented:
$ scripts/trace.py prof -r -t memory_malloc trace
(100.00%, #31742) All
|-- (29.60%, #9397) sched::thread::main()
| |-- (10.02%, #3182) txg_sync_thread
| | |-- (10.02%, #3180) spa_sync
| | | |-- (6.57%, #2086) bplist_iterate
| | | | spa_free_sync_cb
| | | | |-- (3.29%, #1043) zio_create
| | | | | zfs_kmem_alloc
| | | | |
| | | | \-- (3.29%, #1043) zio_add_child
| | | | zfs_kmem_alloc
| | | |
| | | |-- (1.90%, #603) dsl_pool_sync
| | | | |-- (1.78%, #565) dmu_objset_sync
| | | | | |-- (1.68%, #532) zil_sync
| | | | | | bplist_append
| | | | | | zfs_kmem_alloc
| | | | | |
| | | | | |-- (0.03%, #11) zio_create
| | | | | | zfs_kmem_alloc
| | | | | |
| | | | | |-- (0.03%, #11) zio_add_child
| | | | | | zfs_kmem_alloc
| | | | | |
| | | | | \-- (0.03%, #11) arc_write
| | | | | zfs_kmem_alloc
Function rooting
Some times you want to focus on particular function. For example, I would like to know the paths leading to memory allocation which are coming from dsl_pool_sync()
. There's an option which allows that: --function
. If that option is used with --caller-oriented
, it will show callees of that function. Otherwise it will show the callers.
Example (caller oriented, showing callees of dsl_pool_sync
):
$ scripts/trace.py prof -r --function dsl_pool_sync -t memory_malloc trace
(100.00%, #813) All
dsl_pool_sync
|-- (76.51%, #622) dmu_objset_sync
| |-- (65.44%, #532) zil_sync
| | bplist_append
| | zfs_kmem_alloc
| |
| |-- (7.01%, #57) dnode_sync
| | |-- (6.64%, #54) dbuf_sync_leaf
| | | dbuf_write
| | | |-- (2.21%, #18) zio_create
| | | | zfs_kmem_alloc
| | | |
| | | |-- (2.21%, #18) zio_add_child
| | | | zfs_kmem_alloc
| | | |
| | | \-- (2.21%, #18) arc_write
| | | zfs_kmem_alloc
| | |
(...)
Filtering
Often the profile has too much data. There are some switches which you can use to filter data. One of them is --max-levels
which reduces the number of tree levels displayed in the profile. Example:
$ scripts/trace.py prof --max-levels 1 -t memory_malloc trace
(100.00%, #31742) All
|-- (79.68%, #25291) zfs_kmem_alloc
|
|-- (14.90%, #4728) operator new(unsigned long)
|
|-- (3.30%, #1047) alloc_bio
|
|-- (2.10%, #665) virtio::vring::add_buf(void*)
| virtio::vring::add_buf_wait(void*)
| virtio::blk::make_request(bio*)
| blk_strategy
| vdev_disk_start_bio
| zio_vdev_io_start
| zio_execute
|
\-- (0.03%, #11) AcpiUtAllocate
You can also use --min-duration
in time profiles to hide nodes with resident time shorter than specified.
You can filter-out infrequently hit paths using --min-hits
, eg:
-
--min-hits 500
- shows only paths which were hit at least 500 times -
--min-hist 10%
- shows only paths which were visited at least in 10% of cases