Examples of iosnoop, the Linux ftrace version.

Here's Linux 3.16, tracing tar(1) as it archives a filesystem:

# ./iosnoop
Tracing block I/O. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
supervise        1809   W    202,1    17039968     4096       1.32
supervise        1809   W    202,1    17039976     4096       1.30
tar              14794  RM   202,1    8457608      4096       7.53
tar              14794  RM   202,1    8470336      4096      14.90
tar              14794  RM   202,1    8470368      4096       0.27
tar              14794  RM   202,1    8470784      4096       7.74
tar              14794  RM   202,1    8470360      4096       0.25
tar              14794  RM   202,1    8469968      4096       0.24
tar              14794  RM   202,1    8470240      4096       0.24
tar              14794  RM   202,1    8470392      4096       0.23
tar              14794  RM   202,1    8470544      4096       5.96
tar              14794  RM   202,1    8470552      4096       0.27
tar              14794  RM   202,1    8470384      4096       0.24
[...]

The I/O from "tar" looks somewhat random (see the BLOCK column), and 4 Kbytes in size (BYTES). One I/O returned in 14.9 milliseconds, but the others were fast, so fast (0.24 ms) that they likely returned from a cache (on the disk or the controller).

The "RM" type means read of metadata. The start of the trace also caught two writes from supervise, PID 1809.
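
If you save the output to a file, a quick awk(1) pass can summarize the I/O types seen. This is just a sketch, assuming the default column layout above (TYPE is the third field; the filename out is only an example):

# ./iosnoop 10 > out
# awk 'NF == 7 && $3 != "TYPE" { t[$3]++ } END { for (k in t) print k, t[k] }' out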

Here's a good example of a random I/O workload:

# ./iosnoop
Tracing block I/O. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
randread         9182   R    202,32   30835224     8192       0.18
randread         9182   R    202,32   21466088     8192       0.15
randread         9182   R    202,32   13529496     8192       0.16
randread         9182   R    202,16   21250648     8192       0.18
randread         9182   R    202,16   1536776      32768      0.30
randread         9182   R    202,32   17157560     24576      0.23
randread         9182   R    202,32   21313320     8192       0.16
randread         9182   R    202,32   862184       8192       0.18
randread         9182   R    202,16   25496872     8192       0.21
randread         9182   R    202,32   31471768     8192       0.18
randread         9182   R    202,16   27571336     8192       0.20
randread         9182   R    202,16   30783448     8192       0.16
randread         9182   R    202,16   21435224     8192       1.28
randread         9182   R    202,16   970616       8192       0.15
randread         9182   R    202,32   13855608     8192       0.16
randread         9182   R    202,32   17549960     8192       0.15
randread         9182   R    202,32   30938232     8192       0.14
[...]

Note the changing offsets (BLOCK). The latencies here are very good, because this system's storage devices are SSDs. If these were rotational disks, I'd expect latencies of around 10 ms.

Here's an idle Linux 3.2 system:

# ./iosnoop
Tracing block I/O. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
supervise        3055   W    202,1    12852496     4096       0.64
supervise        3055   W    202,1    12852504     4096       1.32
supervise        3055   W    202,1    12852800     4096       0.55
supervise        3055   W    202,1    12852808     4096       0.52
jbd2/xvda1-212   212    WS   202,1    1066720      45056     41.52
jbd2/xvda1-212   212    WS   202,1    1066808      12288     41.52
jbd2/xvda1-212   212    WS   202,1    1066832      4096      32.37
supervise        3055   W    202,1    12852800     4096      14.28
supervise        3055   W    202,1    12855920     4096      14.07
supervise        3055   W    202,1    12855960     4096       0.67
supervise        3055   W    202,1    12858208     4096       1.00
flush:1-409      409    W    202,1    12939640     12288     18.00
[...]

This shows various writes from supervise, PID 3055. The high-latency I/O is from jbd2/xvda1-212, the journaling block device driver, which is performing synchronous writes (TYPE=WS).

Both the start (-s) and end (-t) timestamps can be shown:

# ./iosnoop -ts
Tracing block I/O. Ctrl-C to end.
STARTs         ENDs           COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
5982800.302061 5982800.302679 supervise        1809   W    202,1    17039600     4096       0.62
5982800.302423 5982800.302842 supervise        1809   W    202,1    17039608     4096       0.42
5982800.304962 5982800.305446 supervise        1801   W    202,1    17039616     4096       0.48
5982800.305250 5982800.305676 supervise        1801   W    202,1    17039624     4096       0.43
5982800.308849 5982800.309452 supervise        1810   W    202,1    12862464     4096       0.60
5982800.308856 5982800.309470 supervise        1806   W    202,1    17039632     4096       0.61
5982800.309206 5982800.309740 supervise        1806   W    202,1    17039640     4096       0.53
5982800.309211 5982800.309805 supervise        1810   W    202,1    12862472     4096       0.59
5982800.309332 5982800.309953 supervise        1812   W    202,1    17039648     4096       0.62
5982800.309676 5982800.310283 supervise        1812   W    202,1    17039656     4096       0.61
[...]

Including timestamps when collecting I/O events is useful for post-processing the data.
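
For example, here's a quick latency summary from saved output. A sketch, assuming the -ts layout above (LATms is the last field) and that the output was saved to a file called out:

# awk '/^[0-9]/ { sum += $NF; n++ } END { if (n) printf "%d I/O, avg LATms %.2f\n", n, sum/n }' out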

Now matching on a single PID:

# ./iosnoop -p 1805
Tracing block I/O issued by PID 1805. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
supervise        1805   W    202,1    17039648     4096       0.68
supervise        1805   W    202,1    17039672     4096       0.60
supervise        1805   W    202,1    17040040     4096       0.62
supervise        1805   W    202,1    17040056     4096       0.47
supervise        1805   W    202,1    17040624     4096       0.49
supervise        1805   W    202,1    17040632     4096       0.44
^C
Ending tracing...

The -p option filters on the PID in the kernel. There is also a -n option to match on a process name, but that currently happens in user space, so it is less efficient.
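
The in-kernel mechanism is the tracepoint filter file in tracefs. Here's a minimal sketch of doing it by hand (not iosnoop's exact code; newer kernels also mount tracefs at /sys/kernel/tracing):

# cd /sys/kernel/debug/tracing
# echo 'common_pid == 1805' > events/block/block_rq_issue/filter
# echo 1 > events/block/block_rq_issue/enable
# cat trace_pipe        # only events matching the PID filter are emitted
# echo 0 > events/block/block_rq_issue/enable
# echo 0 > events/block/block_rq_issue/filter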

I'd say that this will usually identify the originating process, but there are error margins.

Depending on the file system, block I/O queueing, and the I/O subsystem, events related to this PID may be missed because they are issued from a different context (for example, a readahead triggered by the completion of an earlier I/O, whether in the file system or the storage subsystem).

You can try the -Q option for more reliable process identification.

The -Q option begins tracing on block I/O queue insert, instead of issue.

Here's before and after, while dd(1) writes a large file:

# ./iosnoop
Tracing block I/O. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
dd               26983  WS   202,16   4064416      45056     16.70
dd               26983  WS   202,16   4064504      45056     16.72
dd               26983  WS   202,16   4064592      45056     16.74
dd               26983  WS   202,16   4064680      45056     16.75
cat              27031  WS   202,16   4064768      45056     16.56
cat              27031  WS   202,16   4064856      45056     16.46
cat              27031  WS   202,16   4064944      45056     16.40
gawk             27030  WS   202,16   4065032      45056      0.88
gawk             27030  WS   202,16   4065120      45056      1.01
gawk             27030  WS   202,16   4065208      45056     16.15
gawk             27030  WS   202,16   4065296      45056     16.16
gawk             27030  WS   202,16   4065384      45056     16.16
[...]

The output here shows the block I/O time from issue to completion (LATms), which is largely representative of the device.

The process names and PIDs identify dd, cat, and gawk. By default iosnoop shows who is on-CPU at the time of block I/O issue, but these may not be the processes that originated the I/O.

In this case (having debugged it), the reason is that processes such as cat and gawk are making hypervisor calls (this is a Xen guest instance), eg, for memory operations, and during hypervisor processing a queue of pending work is checked and dispatched. So cat and gawk were on-CPU when the block device I/O was issued, but they didn't originate it.

Now the -Q option is used:

# ./iosnoop -Q
Tracing block I/O. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
kjournald        1217   WS   202,16   6132200      45056    141.12
kjournald        1217   WS   202,16   6132288      45056    141.10
kjournald        1217   WS   202,16   6132376      45056    141.10
kjournald        1217   WS   202,16   6132464      45056    141.11
kjournald        1217   WS   202,16   6132552      40960    141.11
dd               27718  WS   202,16   6132624      4096       0.18
flush:16-1279    1279   W    202,16   6132632      20480      0.52
flush:16-1279    1279   W    202,16   5940856      4096       0.50
flush:16-1279    1279   W    202,16   5949056      4096       0.52
flush:16-1279    1279   W    202,16   5957256      4096       0.54
flush:16-1279    1279   W    202,16   5965456      4096       0.56
flush:16-1279    1279   W    202,16   5973656      4096       0.58
flush:16-1279    1279   W    202,16   5981856      4096       0.60
flush:16-1279    1279   W    202,16   5990056      4096       0.63
[...]

This uses the block_rq_insert tracepoint as the starting point of I/O, instead of block_rq_issue. This makes the following differences to columns and options:

  • COMM: more likely to show the originating process.
  • PID: more likely to show the originating process.
  • LATms: shows the I/O time, including time spent on the block I/O queue.
  • STARTs (not shown above): shows the time of queue insert, not I/O issue.
  • -p PID: more likely to match the originating process.
  • -n name: more likely to match the originating process.
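
For reference, here's a rough sketch of that switch at the tracefs level (this is the mechanism iosnoop drives; paths and details are simplified):

# cd /sys/kernel/debug/tracing
# echo 1 > events/block/block_rq_insert/enable      # -Q: timing starts at queue insert
# echo 1 > events/block/block_rq_complete/enable    # completion marks the end time
# cat trace_pipe

Without -Q, block_rq_issue would be enabled as the start point instead of block_rq_insert.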

The reason that this ftrace-based iosnoop does not just instrument both insert and issue tracepoints is one of overhead. Even with buffering, iosnoop can have difficulty under high load.

If I want to capture events for post-processing, I use the duration mode, which not only lets me set the duration, but also uses buffering, which reduces the overheads of tracing.

Capturing 5 seconds, with both start timestamps (-s) and end timestamps (-t):

# time ./iosnoop -ts 5 > out

real    0m5.566s
user    0m0.336s
sys    0m0.140s
# wc out
  27010  243072 2619744 out

This server is doing over 5,000 disk IOPS. Even with buffering, this did consume a measurable amount of CPU to capture: 0.48 seconds of CPU time in total (user plus sys: 0.336 s + 0.140 s). Note that the run took 5.57 seconds: this is 5 seconds for the capture, followed by the CPU time for iosnoop to fetch and process the buffer.

Now tracing for 30 seconds:

# time ./iosnoop -ts 30 > out

real    0m31.207s
user    0m0.884s
sys    0m0.472s
# wc out
  64259  578313 6232898 out

Since it's the same server and workload, this should have over 150k events, but only has 64k. The trace buffer has overflowed, and events have been dropped. If I really must capture this many events, I can either increase the trace buffer size (it's the bufsize_kb setting in the script), or use a different tracer (perf_events, SystemTap, ktap, etc.). If the IOPS rate is low (eg, less than 5k), then unbuffered mode (no duration), despite the higher overheads, may be sufficient, and will keep capturing events until Ctrl-C.
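
For example, the tracefs equivalent of raising the script's bufsize_kb setting (a sketch; the value is per-CPU, in Kbytes):

# echo 65536 > /sys/kernel/debug/tracing/buffer_size_kb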

Here's an example of digging into the sequence of I/O to explain an outlier. My randread program on an SSD server (which is an AWS EC2 instance) usually experiences about 0.15 ms I/O latency, but there are some outliers as high as 20 milliseconds. Here's an excerpt:

# ./iosnoop -ts > out
# more out
Tracing block I/O. Ctrl-C to end.
STARTs         ENDs           COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
6037559.121523 6037559.121685 randread         22341  R    202,32   29295416     8192       0.16
6037559.121719 6037559.121874 randread         22341  R    202,16   27515304     8192       0.16
[...]
6037595.999508 6037596.000051 supervise        1692   W    202,1    12862968     4096       0.54
6037595.999513 6037596.000144 supervise        1687   W    202,1    17040160     4096       0.63
6037595.999634 6037596.000309 supervise        1693   W    202,1    17040168     4096       0.68
6037595.999937 6037596.000440 supervise        1693   W    202,1    17040176     4096       0.50
6037596.000579 6037596.001192 supervise        1689   W    202,1    17040184     4096       0.61
6037596.000826 6037596.001360 supervise        1689   W    202,1    17040192     4096       0.53
6037595.998302 6037596.018133 randread         22341  R    202,32   954168       8192      20.03
6037595.998303 6037596.018150 randread         22341  R    202,32   954200       8192      20.05
6037596.018182 6037596.018347 randread         22341  R    202,32   18836600     8192       0.16
[...]

It's important to sort on the I/O completion time (ENDs). In this case it's already in the correct order.
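
If it weren't, a one-liner like this would reorder the events by completion time (assuming the -ts layout above, where ENDs is the second field):

# grep '^[0-9]' out | sort -n -k2,2 | more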

So my 20 ms reads happened after a large group of supervise writes were completed (I truncated dozens of supervise write lines to keep this example short). Other latency outliers in this output file showed the same sequence: slow reads after a batch of writes.

Note the I/O request timestamp (STARTs), which shows that these 20 ms reads were issued before the supervise writes – so they had been sitting on a queue. I've debugged this type of issue many times before, but this one is different: those writes were to a different device (202,1), so I would have assumed they would be on different queues, and wouldn't interfere with each other. Somewhere in this system (Xen guest) it looks like there is a shared queue. (Having just discovered this using iosnoop, I can't yet tell you which queue, but I'd hope that after identifying it there would be a way to tune its queueing behavior, so that we can eliminate or reduce the severity of these outliers.)

Use -h to print the USAGE message:

# ./iosnoop -h
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name]
               [duration]
                 -d device       # device string (eg, "202,1)
                 -i iotype       # match type (eg, '*R*' for all reads)
                 -n name         # process name to match on I/O issue
                 -p PID          # PID to match on I/O issue
                 -Q              # use queue insert as start time
                 -s              # include start time of I/O (s)
                 -t              # include completion time of I/O (s)
                 -h              # this usage message
                 duration        # duration seconds, and use buffers
  eg,
       iosnoop                   # watch block I/O live (unbuffered)
       iosnoop 1                 # trace 1 sec (buffered)
       iosnoop -Q                # include queueing time in LATms
       iosnoop -ts               # include start and end timestamps
       iosnoop -i '*R*'          # trace reads
       iosnoop -p 91             # show I/O issued when PID 91 is on-CPU
       iosnoop -Qp 91            # show I/O queued by PID 91, queue time

See the man page and example file for more info.
