A DAX filesystem plus the mmap() file interface lets persistent memory (PM) be mapped directly into user space, and I expect this to become a standard way of working with PM in systems programming. This article verifies, by experiment, 2 MiB huge-page mappings on a DAX filesystem, and shows how to measure their effect on performance.
1. How to configure it
[1] gives a fairly detailed configuration guide, so to avoid repeating it here I only summarize the key points.
According to [1], the kernel's DAX-capable filesystems, ext4 and XFS, now support 2 MiB hugepages. To use this feature, three conditions must be met:
- the mmap() call must map at least 2 MiB;
- the filesystem must allocate blocks in extents of at least 2 MiB;
- the filesystem extents must have the same (2 MiB) alignment as the mmap() mapping.
The first condition (the user-space call) is easy to satisfy. The second and third build on features originally added to support RAID: ext4 and XFS can allocate blocks of a given size and alignment from the underlying device. [1] also shows a quick way to verify 2 MiB huge-page mappings by fallocate-ing a 1 GiB file. This article goes one step further and verifies, for a given file, how it must be created and how it must be mapped so that 2 MiB huge-page mappings are actually used.
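To make condition 1 concrete, below is a minimal sketch (not from [1]; the path /pmem6/testfile and the 32 MiB size are hypothetical) of a user-space mapping that is both a multiple of 2 MiB long and 2 MiB aligned. It uses the same over-map-then-overlay trick as the modified fio engine in section 3; whether a 2 MiB fault actually succeeds still depends on conditions 2 and 3, i.e. on how the filesystem allocated the file's extents.

```c
/*
 * Minimal sketch (not from [1]): create and map a file on a DAX filesystem
 * so that condition 1 holds -- the mapping is a multiple of 2 MiB long and
 * starts at a 2 MiB-aligned address. /pmem6/testfile and the 32 MiB size
 * are hypothetical examples.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define ALIGN_2MB (2UL * 1024 * 1024)

int main(void)
{
	size_t len = 16 * ALIGN_2MB;   /* 32 MiB, a multiple of 2 MiB */

	int fd = open("/pmem6/testfile", O_RDWR | O_CREAT, 0644);
	if (fd < 0 || ftruncate(fd, len) < 0) {
		perror("open/ftruncate");
		return 1;
	}

	/* Over-map by 2 MiB anonymously, then round up to the next 2 MiB
	 * boundary and overlay the file there with MAP_FIXED -- the same
	 * trick used in the modified fio engine below. */
	char *raw = mmap(NULL, len + ALIGN_2MB, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED) {
		perror("mmap (anonymous)");
		return 1;
	}
	char *aligned = (char *)(((uintptr_t)raw + ALIGN_2MB - 1) & ~(ALIGN_2MB - 1));

	char *p = mmap(aligned, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_FIXED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap (file)");
		return 1;
	}

	p[0] = 1;   /* first touch: should fire a dax_pmd_fault_done event */
	printf("mapped %zu bytes at %p\n", len, (void *)p);

	munmap(raw, len + ALIGN_2MB);
	close(fd);
	return 0;
}
```

Touching the mapping should then generate dax_pmd_fault_done events, which is exactly what the next section monitors with bpftrace; whether the result is NOPAGE or FALLBACK still depends on conditions 2 and 3.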
2. Monitoring the result of 2 MiB huge-page faults
To verify whether a huge-page mapping succeeded, we can monitor whether each huge-page fault succeeds. [1] uses the kernel tracepoint mechanism and watches dax_pmd_fault_done, the tracepoint hit when a PMD page-table entry is built. Here we use bpftrace instead, which makes tracepoint monitoring more convenient. The steps are as follows:
**Step 1.** Following [1] and the bpftrace reference guide [2], first inspect the format of the dax_pmd_fault_done tracepoint:
```
$ sudo cat /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_fault_done/format
name: dax_pmd_fault_done
ID: 888
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:unsigned long ino;	offset:8;	size:8;	signed:0;
	field:unsigned long vm_start;	offset:16;	size:8;	signed:0;
	field:unsigned long vm_end;	offset:24;	size:8;	signed:0;
	field:unsigned long vm_flags;	offset:32;	size:8;	signed:0;
	field:unsigned long address;	offset:40;	size:8;	signed:0;
	field:unsigned long pgoff;	offset:48;	size:8;	signed:0;
	field:unsigned long max_pgoff;	offset:56;	size:8;	signed:0;
	field:dev_t dev;	offset:64;	size:4;	signed:0;
	field:unsigned int flags;	offset:68;	size:4;	signed:0;
	field:int result;	offset:72;	size:4;	signed:1;

print fmt: "dev %d:%d ino %#lx %s %s address %#lx vm_start %#lx vm_end %#lx pgoff %#lx max_pgoff %#lx %s", ((unsigned int) ((REC->dev) >> 20)), ((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->ino, REC->vm_flags & 0x00000008 ? "shared" : "private", __print_flags(REC->flags, "|", { 0x01, "WRITE" }, { 0x02, "MKWRITE" }, { 0x04, "ALLOW_RETRY" }, { 0x08, "RETRY_NOWAIT" }, { 0x10, "KILLABLE" }, { 0x20, "TRIED" }, { 0x40, "USER" }, { 0x80, "REMOTE" }, { 0x100, "INSTRUCTION" }), REC->address, REC->vm_start, REC->vm_end, REC->pgoff, REC->max_pgoff, __print_flags(REC->result, "|", { 0x0001, "OOM" }, { 0x0002, "SIGBUS" }, { 0x0004, "MAJOR" }, { 0x0008, "WRITE" }, { 0x0010, "HWPOISON" }, { 0x0020, "HWPOISON_LARGE" }, { 0x0040, "SIGSEGV" }, { 0x0100, "NOPAGE" }, { 0x0200, "LOCKED" }, { 0x0400, "RETRY" }, { 0x0800, "FALLBACK" }, { 0x1000, "DONE_COW" }, { 0x2000, "NEEDDSYNC" })
```
**Step 2.** Then use a one-line bpftrace command to check whether the huge-page mapping succeeds:
```
sudo bpftrace -e 'tracepoint:fs_dax:dax_pmd_fault_done
    { printf("dax_pmd_fault_done result: 0x%x\n", args->result); @[args->result] = count() }'
```
Note that this command prints the result field. For page faults on a 2 MiB-aligned region it normally prints one of two values, 0x100 or 0x800, which correspond to VM_FAULT_NOPAGE and VM_FAULT_FALLBACK respectively: either the 2 MiB PMD entry was installed successfully, or the kernel fell back to ordinary 4 KiB pages.
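For reference, the sketch below is just an illustrative decode helper, not code from [1] or the kernel; the macro values are taken from the { 0x0100, "NOPAGE" } and { 0x0800, "FALLBACK" } entries in the print fmt shown above and match the kernel's VM_FAULT_NOPAGE and VM_FAULT_FALLBACK bits.

```c
/* Sketch: decode the `result` field of dax_pmd_fault_done using the
 * bit values shown in the tracepoint format output above. */
#include <stdio.h>

#define VM_FAULT_NOPAGE   0x0100  /* 2 MiB PMD entry installed, no 4 KiB page needed */
#define VM_FAULT_FALLBACK 0x0800  /* huge fault failed, kernel falls back to 4 KiB PTEs */

static const char *dax_pmd_result(unsigned int result)
{
	if (result & VM_FAULT_NOPAGE)
		return "NOPAGE (2 MiB PMD mapping succeeded)";
	if (result & VM_FAULT_FALLBACK)
		return "FALLBACK (fell back to 4 KiB pages)";
	return "other";
}

int main(void)
{
	printf("0x100 -> %s\n", dax_pmd_result(0x100));
	printf("0x800 -> %s\n", dax_pmd_result(0x800));
	return 0;
}
```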
3. Modifying the fio mmap engine
In fio's mmap engine, the original implementation maps the whole range of the test file:
```c
static int fio_mmap_file(struct thread_data *td, struct fio_file *f,
			 size_t length, off_t off)
{
	...
	fmd->mmap_ptr = mmap(NULL, length, flags, shared, f->fd, off);
	if (fmd->mmap_ptr == MAP_FAILED) {
		fmd->mmap_ptr = NULL;
		td_verror(td, errno, "mmap");
		goto err;
	}
	...
}
```
To test mapping the file at different granularities, we change the mapping scheme: first map an anonymous region (MAP_ANONYMOUS) as large as the file, then overlay it chunk by chunk, at the granularity we want to test, with file mappings (MAP_FIXED):
```c
static int fio_mmap_file(struct thread_data *td, struct fio_file *f,
			 size_t length, off_t off)
{
	...
	unsigned long long align_2mb = 1024 * 1024 * 2;

	/* To satisfy condition 1, map 2 MiB extra, then move fmd->mmap_ptr
	 * up to a 2 MiB-aligned address that still covers `length` bytes. */
	fmd->mmap_ptr = mmap(NULL, length + align_2mb, flags,
			     shared | MAP_ANONYMOUS, -1, 0);
	if (fmd->mmap_ptr == MAP_FAILED) {
		fmd->mmap_ptr = NULL;
		td_verror(td, errno, "mmap");
		goto err;
	}
	fmd->mmap_ptr = (void *)((unsigned long long)fmd->mmap_ptr
				 / align_2mb * align_2mb + align_2mb);

	/*
	 * map_size is the granularity of each file mapping. Here it is 2 MiB;
	 * we will also set it to 1 MiB to check whether 1 MiB mappings can
	 * still use 2 MiB PMD entries.
	 */
	size_t map_size = 1024 * 1024 * 2;
	/* note: the final map_size chunk stays covered by the anonymous mapping */
	for (off_t i = 0; i < length - map_size; i += map_size) {
		void *new_addr = fmd->mmap_ptr + i;
		void *res = mmap(new_addr, map_size, flags,
				 shared | MAP_FIXED, f->fd, off + i);
		if (res == MAP_FAILED) {
			printf("error: %s\n", strerror(errno));
		}
	}
	...
}
```
Set map_size in the code above to 1024 * 1024 * 2 or 1024 * 1024 * 1 (i.e. map the file in 2 MiB or in 1 MiB chunks), rebuild fio for each setting, and test with the following fio job file:
```
# test.fio
[global]
thread
group_reporting
overwrite=1
thinktime=0
sync=0
direct=0
ioengine=mmap
filename=/pmem6/testdata
size=100GB
time_based
runtime=10

[Random_Read]
bs=4k
numjobs=1
iodepth=1
rw=randread
```
4. Tests and result analysis
4.1. Running the test
First, in one shell, trace the results of the dax_pmd_fault_done tracepoint with bpftrace:
```
$ sudo bpftrace -e 'tracepoint:fs_dax:dax_pmd_fault_done
    { printf("dax_pmd_fault_done result: 0x%x\n", args->result); @[args->result] = count() }'
```
Then, in another shell, run test.fio under perf stat to collect performance counters. When it finishes you get both the fio and the perf reports, while the bpftrace shell has recorded what the kernel tracepoint reported in the meantime.
```
$ perf stat -ddd /MY_FIO_BIN_DIR/fio test.fio
```
4.2. Test result 1: mapping the file in 2 MiB chunks
The fio and perf stat reports:
```
zjc@test_dax_hugepage $ perf stat -ddd /home/zjc/bin/fio-mod/bin/fio mmap_lat.fio
SeqR: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=mmap, iodepth=1
fio-3.16
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=1723MiB/s][r=441k IOPS][eta 00m:00s]
SeqR: (groupid=0, jobs=1): err= 0: pid=137311: Sat Nov 23 18:47:58 2019
  read: IOPS=501k, BW=1957MiB/s (2052MB/s)(19.1GiB/10001msec)
    clat (nsec): min=1116, max=1189.9k, avg=1639.48, stdev=1276.94
     lat (nsec): min=1145, max=1189.9k, avg=1666.78, stdev=1277.78
    clat percentiles (nsec):
     |  1.00th=[ 1368],  5.00th=[ 1400], 10.00th=[ 1416], 20.00th=[ 1448],
     | 30.00th=[ 1464], 40.00th=[ 1480], 50.00th=[ 1496], 60.00th=[ 1528],
     | 70.00th=[ 1560], 80.00th=[ 1624], 90.00th=[ 1976], 95.00th=[ 2352],
     | 99.00th=[ 3632], 99.50th=[ 3888], 99.90th=[11712], 99.95th=[13248],
     | 99.99th=[14400]
   bw (  MiB/s): min= 1292, max= 2088, per=100.00%, avg=1969.49, stdev=216.98, samples=19
   iops        : min=330974, max=534696, avg=504188.63, stdev=55547.02, samples=19
  lat (usec)   : 2=90.35%, 4=9.28%, 10=0.20%, 20=0.17%, 50=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=98.13%, sys=1.86%, ctx=19, majf=0, minf=52873
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=5010392,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=1957MiB/s (2052MB/s), 1957MiB/s-1957MiB/s (2052MB/s-2052MB/s), io=19.1GiB (20.5GB), run=10001-10001msec

Disk stats (read/write):
  pmem6: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

 Performance counter stats for '/home/zjc/bin/fio-mod/bin/fio mmap_lat.fio':

      13209.252198      task-clock:u (msec)       #    1.248 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
           157,520      page-faults:u             #    0.012 M/sec
    32,808,107,404      cycles:u                  #    2.484 GHz                      (37.84%)
    13,767,631,499      instructions:u            #    0.42  insn per cycle           (45.78%)
     2,546,685,790      branches:u                #  192.796 M/sec                    (46.08%)
         3,557,939      branch-misses:u           #    0.14% of all branches          (46.44%)
     3,983,151,533      L1-dcache-loads:u         #  301.543 M/sec                    (46.74%)
       350,423,104      L1-dcache-load-misses:u   #    8.80% of all L1-dcache hits    (46.89%)
       147,288,693      LLC-loads:u               #   11.150 M/sec                    (30.98%)
         4,700,903      LLC-load-misses:u         #    6.38% of all LL-cache hits     (30.77%)
   <not supported>      L1-icache-loads:u
         5,876,206      L1-icache-load-misses:u                                       (30.69%)
     4,043,536,403      dTLB-loads:u              #  306.114 M/sec                    (30.63%)
         5,516,303      dTLB-load-misses:u        #    0.14% of all dTLB cache hits   (30.45%)
            19,166      iTLB-loads:u              #    0.001 M/sec                    (30.22%)
           120,540      iTLB-load-misses:u        #  628.93% of all iTLB cache hits   (30.10%)
   <not supported>      L1-dcache-prefetches:u
   <not supported>      L1-dcache-prefetch-misses:u

      10.580851824 seconds time elapsed
```
The bpftrace output:
```
zjc@test_dax_hugepage $ sudo bpftrace -e 'tracepoint:fs_dax:dax_pmd_fault_done
    { printf("dax_pmd_fault_done result: 0x%x\n", args->result); @[args->result] = count() }'
dax_pmd_fault_done result: 0x100
dax_pmd_fault_done result: 0x100
dax_pmd_fault_done result: 0x100
...
dax_pmd_fault_done result: 0x100
dax_pmd_fault_done result: 0x100
^C

@[256]: 51199
```
4.3. Test result 2: mapping the file in 1 MiB chunks
The fio and perf stat reports:
```
zjc@test_dax_hugepage $ perf stat -ddd /home/zjc/bin/fio-mod/bin/fio mmap_lat.fio
SeqR: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=mmap, iodepth=1
fio-3.16
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=885MiB/s][r=227k IOPS][eta 00m:00s]
SeqR: (groupid=0, jobs=1): err= 0: pid=136706: Sat Nov 23 18:39:56 2019
  read: IOPS=217k, BW=849MiB/s (890MB/s)(8488MiB/10001msec)
    clat (usec): min=2, max=1967, avg= 4.15, stdev= 2.82
     lat (usec): min=2, max=1967, avg= 4.18, stdev= 2.82
    clat percentiles (nsec):
     |  1.00th=[ 3440],  5.00th=[ 3600], 10.00th=[ 3696], 20.00th=[ 3792],
     | 30.00th=[ 3856], 40.00th=[ 3920], 50.00th=[ 3984], 60.00th=[ 4048],
     | 70.00th=[ 4128], 80.00th=[ 4192], 90.00th=[ 4384], 95.00th=[ 4640],
     | 99.00th=[ 8896], 99.50th=[ 9792], 99.90th=[16512], 99.95th=[17024],
     | 99.99th=[20096]
   bw (  KiB/s): min=373528, max=908128, per=99.78%, avg=867118.63, stdev=125170.53, samples=19
   iops        : min=93382, max=227032, avg=216779.63, stdev=31292.62, samples=19
  lat (usec)   : 4=54.10%, 10=45.42%, 20=0.47%, 50=0.01%, 100=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=46.18%, sys=53.79%, ctx=89, majf=0, minf=2176958
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2172849,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=849MiB/s (890MB/s), 849MiB/s-849MiB/s (890MB/s-890MB/s), io=8488MiB (8900MB), run=10001-10001msec

Disk stats (read/write):
  pmem6: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

 Performance counter stats for '/home/zjc/bin/fio-mod/bin/fio mmap_lat.fio':

      13256.946076      task-clock:u (msec)       #    1.243 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
         2,282,121      page-faults:u             #    0.172 M/sec
    20,141,999,012      cycles:u                  #    1.519 GHz                      (37.85%)
     6,079,376,470      instructions:u            #    0.30  insn per cycle           (45.56%)
     1,126,849,210      branches:u                #   85.001 M/sec                    (45.83%)
         6,968,748      branch-misses:u           #    0.62% of all branches          (46.12%)
     1,754,527,971      L1-dcache-loads:u         #  132.348 M/sec                    (46.36%)
       248,313,128      L1-dcache-load-misses:u   #   14.15% of all L1-dcache hits    (46.60%)
        71,871,943      LLC-loads:u               #    5.421 M/sec                    (31.13%)
         5,340,391      LLC-load-misses:u         #   14.86% of all LL-cache hits     (30.90%)
   <not supported>      L1-icache-loads:u
       184,362,644      L1-icache-load-misses:u                                       (30.77%)
     1,764,918,323      dTLB-loads:u              #  133.132 M/sec                    (30.73%)
         8,388,448      dTLB-load-misses:u        #    0.48% of all dTLB cache hits   (30.67%)
           114,007      iTLB-loads:u              #    0.009 M/sec                    (30.55%)
         3,475,836      iTLB-load-misses:u        # 3048.79% of all iTLB cache hits   (30.34%)
   <not supported>      L1-dcache-prefetches:u
   <not supported>      L1-dcache-prefetch-misses:u

      10.669283491 seconds time elapsed
```
The bpftrace output:
```
zjc@test_dax_hugepage $ sudo bpftrace -e 'tracepoint:fs_dax:dax_pmd_fault_done
    { printf("dax_pmd_fault_done result: 0x%x\n", args->result); @[args->result] = count() }'
dax_pmd_fault_done result: 0x800
dax_pmd_fault_done result: 0x800
dax_pmd_fault_done result: 0x800
...
dax_pmd_fault_done result: 0x800
dax_pmd_fault_done result: 0x800
^C

@[2048]: 51199
```
4.4. Analysis of the results
Test environment:

| Item | Value |
|---|---|
| kernel version | 4.16.0 |
| fio version | 3.16 |
| filesystem | ext4-dax |
| hardware | AEP (interleaved) |
Comparison of results:

| | 2 MiB mmap | 1 MiB mmap |
|---|---|---|
| dax_pmd_fault_done result | VM_FAULT_NOPAGE | VM_FAULT_FALLBACK |
| random read latency (us) | 1.67 | 4.18 |
| page-faults:u | 157,520 | 2,282,121 |
| dTLB-load-misses:u | 5,516,303 | 8,388,448 |
| iTLB-load-misses:u | 120,540 | 3,475,836 |
We can see that only mmap() calls made at 2 MiB granularity benefit from the DAX filesystem's huge-page mapping: in the 2 MiB run every dax_pmd_fault_done reported NOPAGE (a 2 MiB PMD entry was installed), while in the 1 MiB run every one of the 51,199 faults reported FALLBACK, so the file ended up mapped with 4 KiB pages instead. On AEP, with the fio mmap engine, the huge-page mapping lowers the random 4 KiB read latency from 4.18 us to 1.67 us (roughly 2.5x), and the page-fault and TLB-miss counts drop sharply as well.
[1] https://nvdimm.wiki.kernel.org/2mib_fs_dax
[2] https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#6-tracepoint-static-tracing-kernel-level-arguments