Mapping persistent memory (PM) directly into user space through a DAX filesystem plus the mmap() file interface is, I expect, going to become a standard way of working with PM in systems programming. This post verifies 2 MiB hugepage mappings on a DAX filesystem through experiments and shows how to measure their performance impact.
1. Configuration
[1] gives a fairly detailed configuration guide; rather than repeating it, here is a short summary.
According to [1], the kernel's DAX-capable filesystems, ext4 and XFS, now support 2 MiB hugepages, but three conditions must be met to use this feature:
- the mmap() call must map at least 2 MiB;
- the filesystem must allocate blocks in extents of at least 2 MiB;
- the filesystem blocks must have the same alignment as the mmap() mapping.
Condition 1 (on the user-space side) is easy to satisfy. Conditions 2 and 3 build on features originally added to support RAID: ext4 and XFS can allocate blocks of a given size and alignment from the underlying device. [1] also shows how to quickly verify a 2 MiB hugepage mapping by fallocate-ing a 1 GB file. This post goes one step further and checks, for a given file, how it must be created and mapped so that 2 MiB hugepage mappings are actually used; a minimal sketch of the idea follows.
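The sketch below assumes a DAX mount at /pmem6; the file name and size are illustrative, not taken from [1]. It pre-allocates a large file with posix_fallocate(), maps well over 2 MiB of it, and touches the first 2 MiB so that the PMD fault can be observed with the method described in section 2. Whether a 2 MiB mapping is actually installed still depends on conditions 2 and 3 being satisfied by the filesystem.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* 1 GiB, comfortably above the 2 MiB minimum required by condition 1 */
    const size_t len = 1ULL << 30;
    int fd = open("/pmem6/demofile", O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Pre-allocate the file's blocks, as [1] does with fallocate */
    int err = posix_fallocate(fd, 0, len);
    if (err) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        return 1;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("mapped at %p (2 MiB aligned: %s)\n",
           p, ((unsigned long)p & ((2UL << 20) - 1)) == 0 ? "yes" : "no");
    /* Touch the first 2 MiB; watch dax_pmd_fault_done (section 2) while this runs */
    memset(p, 0, 2UL << 20);
    munmap(p, len);
    close(fd);
    return 0;
}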
2. Monitoring 2 MiB hugepage page faults
To verify whether hugepage mapping succeeds, we can monitor whether the hugepage faults themselves succeed. [1] uses the kernel tracepoint mechanism and watches dax_pmd_fault_done, the tracepoint hit when the kernel tries to install a PMD page-table entry. Here we use bpftrace instead, which makes watching tracepoints even more convenient. The steps are as follows:
**Step 1.** Following [1] and the bpftrace reference guide [2], first inspect the format of the dax_pmd_fault_done tracepoint:
$ sudo cat /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_fault_done/format
name: dax_pmd_fault_done
ID: 888
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long ino; offset:8; size:8; signed:0;
field:unsigned long vm_start; offset:16; size:8; signed:0;
field:unsigned long vm_end; offset:24; size:8; signed:0;
field:unsigned long vm_flags; offset:32; size:8; signed:0;
field:unsigned long address; offset:40; size:8; signed:0;
field:unsigned long pgoff; offset:48; size:8; signed:0;
field:unsigned long max_pgoff; offset:56; size:8; signed:0;
field:dev_t dev; offset:64; size:4; signed:0;
field:unsigned int flags; offset:68; size:4; signed:0;
field:int result; offset:72; size:4; signed:1;
print fmt: "dev %d:%d ino %#lx %s %s address %#lx vm_start %#lx vm_end %#lx pgoff %#lx max_pgoff %#lx %s", ((unsigned int) ((REC->dev) >> 20)), ((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->ino, REC->vm_flags & 0x00000008 ? "shared" : "private", __print_flags(REC->flags, "|", { 0x01, "WRITE" }, { 0x02, "MKWRITE" }, { 0x04, "ALLOW_RETRY" }, { 0x08, "RETRY_NOWAIT" }, { 0x10, "KILLABLE" }, { 0x20, "TRIED" }, { 0x40, "USER" }, { 0x80, "REMOTE" }, { 0x100, "INSTRUCTION" }), REC->address, REC->vm_start, REC->vm_end, REC->pgoff, REC->max_pgoff, __print_flags(REC->result, "|", { 0x0001, "OOM" }, { 0x0002, "SIGBUS" }, { 0x0004, "MAJOR" }, { 0x0008, "WRITE" }, { 0x0010, "HWPOISON" }, { 0x0020, "HWPOISON_LARGE" }, { 0x0040, "SIGSEGV" }, { 0x0100, "NOPAGE" }, { 0x0200, "LOCKED" }, { 0x0400, "RETRY" }, { 0x0800, "FALLBACK" }, { 0x1000, "DONE_COW" }, { 0x2000, "NEEDDSYNC" })
**Step 2.** Then a one-line bpftrace command tells us whether hugepage mapping succeeded:
sudo bpftrace -e ' tracepoint:fs_dax:dax_pmd_fault_done {printf("dax_pmd_fault_done result: 0x%x\n", args->result);@[args->result]=count()}'
Note that this command prints the result field. On a page fault in a 2 MiB-aligned mapping it typically reports one of two values, 0x100 or 0x800, corresponding to VM_FAULT_NOPAGE and VM_FAULT_FALLBACK respectively: either the 2 MiB PMD entry was created successfully, or the huge fault failed and fell back to 4 KiB pages.
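For quick reference, the two results can be decoded as follows; this is just a small sketch using the bit values visible in the tracepoint format from Step 1, not code from the kernel or from [1]:
#include <stdio.h>

/* Bit values as printed in the dax_pmd_fault_done format above */
#define VM_FAULT_NOPAGE   0x0100  /* PMD (2 MiB) entry installed successfully */
#define VM_FAULT_FALLBACK 0x0800  /* huge fault not possible, fell back to 4 KiB PTEs */

static void decode_dax_result(unsigned int result)
{
    if (result & VM_FAULT_NOPAGE)
        printf("0x%x -> NOPAGE: 2 MiB mapping installed\n", result);
    if (result & VM_FAULT_FALLBACK)
        printf("0x%x -> FALLBACK: fell back to 4 KiB pages\n", result);
}

int main(void)
{
    decode_dax_result(0x100);   /* what we want to see */
    decode_dax_result(0x800);   /* what we see when one of the conditions is violated */
    return 0;
}
Also note that bpftrace prints the same keys in decimal in its @ map, so 0x100 shows up as @[256] and 0x800 as @[2048] in the summaries below.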
3. Modifying the fio mmap engine
In fio's mmap engine, the original implementation maps the whole range of the test file:
static int fio_mmap_file(struct thread_data *td, struct fio_file *f,
                         size_t length, off_t off)
{
    ...
    fmd->mmap_ptr = mmap(NULL, length, flags, shared, f->fd, off);
    if (fmd->mmap_ptr == MAP_FAILED) {
        fmd->mmap_ptr = NULL;
        td_verror(td, errno, "mmap");
        goto err;
    }
    ...
}
To test mapping the file at different granularities, we change the mapping scheme: first map an anonymous region the size of the file (using MAP_ANONYMOUS), then, at the granularity we want to test, overlay file mappings onto that region chunk by chunk (using MAP_FIXED):
static int fio_mmap_file(struct thread_data *td, struct fio_file *f,
                         size_t length, off_t off)
{
    ...
    unsigned long long align_2mb = 1024 * 1024 * 2;
    // To satisfy condition 1, map 2 MiB extra, then move fmd->mmap_ptr up to a
    // 2 MiB-aligned address that still leaves a length-sized region available
    fmd->mmap_ptr = mmap(NULL, length + align_2mb, flags, shared | MAP_ANONYMOUS, -1, 0);
    if (fmd->mmap_ptr == MAP_FAILED) {
        fmd->mmap_ptr = NULL;
        td_verror(td, errno, "mmap");
        goto err;
    }
    fmd->mmap_ptr = (void *)((unsigned long long)fmd->mmap_ptr / align_2mb * align_2mb + align_2mb);
    // map_size controls the granularity of each file mapping: here the file is mapped
    // 2 MiB at a time; we will also switch this to 1 MiB at a time to check whether
    // 1 MiB mappings can still use 2 MiB PMD entries
    size_t map_size = 1024 * 1024 * 2;
    for (off_t i = 0; i < length - map_size; i += map_size) {
        void *new_addr = fmd->mmap_ptr + i;
        void *res = mmap(new_addr, map_size, flags, shared | MAP_FIXED, f->fd, off + i);
        if (res == MAP_FAILED) {
            printf("error: %s\n", strerror(errno));
        }
    }
    ...
}
Set map_size in the code above to 1024 * 1024 * 2 or 1024 * 1024 * 1 (i.e. map the file in 2 MiB or in 1 MiB chunks) and rebuild fio for each setting. Then run the following fio job file:
# test.fio
[global]
thread
group_reporting
overwrite=1
thinktime=0
sync=0
direct=0
ioengine=mmap
filename=/pmem6/testdata
size=100GB
time_based
runtime=10
[Random_Read]
bs=4k
numjobs=1
iodepth=1
rw=randread
4. Tests and analysis
4.1. How to run the test
First, in one shell, use bpftrace to trace the results of the dax_pmd_fault_done tracepoint:
$ sudo bpftrace -e ' tracepoint:fs_dax:dax_pmd_fault_done {printf("dax_pmd_fault_done result: 0x%x\n", args->result);@[args->result]=count()}'
Then, in another shell, run test.fio under perf stat to collect performance counters. This produces both the fio and the perf reports, while the bpftrace shell records what the kernel tracepoint saw.
$ perf stat -ddd /MY_FIO_BIN_DIR/fio test.fio
4.2. Test result 1: mapping the file in 2 MiB chunks
The fio and perf stat reports:
zjc@test_dax_hugepage $ perf stat -ddd /home/zjc/bin/fio-mod/bin/fio mmap_lat.fio
SeqR: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=mmap, iodepth=1
fio-3.16
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=1723MiB/s][r=441k IOPS][eta 00m:00s]
SeqR: (groupid=0, jobs=1): err= 0: pid=137311: Sat Nov 23 18:47:58 2019
read: IOPS=501k, BW=1957MiB/s (2052MB/s)(19.1GiB/10001msec)
clat (nsec): min=1116, max=1189.9k, avg=1639.48, stdev=1276.94
lat (nsec): min=1145, max=1189.9k, avg=1666.78, stdev=1277.78
clat percentiles (nsec):
| 1.00th=[ 1368], 5.00th=[ 1400], 10.00th=[ 1416], 20.00th=[ 1448],
| 30.00th=[ 1464], 40.00th=[ 1480], 50.00th=[ 1496], 60.00th=[ 1528],
| 70.00th=[ 1560], 80.00th=[ 1624], 90.00th=[ 1976], 95.00th=[ 2352],
| 99.00th=[ 3632], 99.50th=[ 3888], 99.90th=[11712], 99.95th=[13248],
| 99.99th=[14400]
bw ( MiB/s): min= 1292, max= 2088, per=100.00%, avg=1969.49, stdev=216.98, samples=19
iops : min=330974, max=534696, avg=504188.63, stdev=55547.02, samples=19
lat (usec) : 2=90.35%, 4=9.28%, 10=0.20%, 20=0.17%, 50=0.01%
lat (msec) : 2=0.01%
cpu : usr=98.13%, sys=1.86%, ctx=19, majf=0, minf=52873
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=5010392,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=1957MiB/s (2052MB/s), 1957MiB/s-1957MiB/s (2052MB/s-2052MB/s), io=19.1GiB (20.5GB), run=10001-10001msec
Disk stats (read/write):
pmem6: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
Performance counter stats for '/home/zjc/bin/fio-mod/bin/fio mmap_lat.fio':
13209.252198 task-clock:u (msec) # 1.248 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
157,520 page-faults:u # 0.012 M/sec
32,808,107,404 cycles:u # 2.484 GHz (37.84%)
13,767,631,499 instructions:u # 0.42 insn per cycle (45.78%)
2,546,685,790 branches:u # 192.796 M/sec (46.08%)
3,557,939 branch-misses:u # 0.14% of all branches (46.44%)
3,983,151,533 L1-dcache-loads:u # 301.543 M/sec (46.74%)
350,423,104 L1-dcache-load-misses:u # 8.80% of all L1-dcache hits (46.89%)
147,288,693 LLC-loads:u # 11.150 M/sec (30.98%)
4,700,903 LLC-load-misses:u # 6.38% of all LL-cache hits (30.77%)
<not supported> L1-icache-loads:u
5,876,206 L1-icache-load-misses:u (30.69%)
4,043,536,403 dTLB-loads:u # 306.114 M/sec (30.63%)
5,516,303 dTLB-load-misses:u # 0.14% of all dTLB cache hits (30.45%)
19,166 iTLB-loads:u # 0.001 M/sec (30.22%)
120,540 iTLB-load-misses:u # 628.93% of all iTLB cache hits (30.10%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
10.580851824 seconds time elapsed
The bpftrace output:
zjc@test_dax_hugepage $ sudo bpftrace -e ' tracepoint:fs_dax:dax_pmd_fault_done {printf("dax_pmd_fault_done result: 0x%x\n", args->result);@[args->result]=count()}'
dax_pmd_fault_done result: 0x100
dax_pmd_fault_done result: 0x100
dax_pmd_fault_done result: 0x100
...
dax_pmd_fault_done result: 0x100
dax_pmd_fault_done result: 0x100
^C
@[256]: 51199
4.3. Test result 2: mapping the file in 1 MiB chunks
The fio and perf stat reports:
zjc@test_dax_hugepage $ perf stat -ddd /home/zjc/bin/fio-mod/bin/fio mmap_lat.fio
SeqR: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=mmap, iodepth=1
fio-3.16
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=885MiB/s][r=227k IOPS][eta 00m:00s]
SeqR: (groupid=0, jobs=1): err= 0: pid=136706: Sat Nov 23 18:39:56 2019
read: IOPS=217k, BW=849MiB/s (890MB/s)(8488MiB/10001msec)
clat (usec): min=2, max=1967, avg= 4.15, stdev= 2.82
lat (usec): min=2, max=1967, avg= 4.18, stdev= 2.82
clat percentiles (nsec):
| 1.00th=[ 3440], 5.00th=[ 3600], 10.00th=[ 3696], 20.00th=[ 3792],
| 30.00th=[ 3856], 40.00th=[ 3920], 50.00th=[ 3984], 60.00th=[ 4048],
| 70.00th=[ 4128], 80.00th=[ 4192], 90.00th=[ 4384], 95.00th=[ 4640],
| 99.00th=[ 8896], 99.50th=[ 9792], 99.90th=[16512], 99.95th=[17024],
| 99.99th=[20096]
bw ( KiB/s): min=373528, max=908128, per=99.78%, avg=867118.63, stdev=125170.53, samples=19
iops : min=93382, max=227032, avg=216779.63, stdev=31292.62, samples=19
lat (usec) : 4=54.10%, 10=45.42%, 20=0.47%, 50=0.01%, 100=0.01%
lat (msec) : 2=0.01%
cpu : usr=46.18%, sys=53.79%, ctx=89, majf=0, minf=2176958
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2172849,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=849MiB/s (890MB/s), 849MiB/s-849MiB/s (890MB/s-890MB/s), io=8488MiB (8900MB), run=10001-10001msec
Disk stats (read/write):
pmem6: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
Performance counter stats for '/home/zjc/bin/fio-mod/bin/fio mmap_lat.fio':
13256.946076 task-clock:u (msec) # 1.243 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
2,282,121 page-faults:u # 0.172 M/sec
20,141,999,012 cycles:u # 1.519 GHz (37.85%)
6,079,376,470 instructions:u # 0.30 insn per cycle (45.56%)
1,126,849,210 branches:u # 85.001 M/sec (45.83%)
6,968,748 branch-misses:u # 0.62% of all branches (46.12%)
1,754,527,971 L1-dcache-loads:u # 132.348 M/sec (46.36%)
248,313,128 L1-dcache-load-misses:u # 14.15% of all L1-dcache hits (46.60%)
71,871,943 LLC-loads:u # 5.421 M/sec (31.13%)
5,340,391 LLC-load-misses:u # 14.86% of all LL-cache hits (30.90%)
<not supported> L1-icache-loads:u
184,362,644 L1-icache-load-misses:u (30.77%)
1,764,918,323 dTLB-loads:u # 133.132 M/sec (30.73%)
8,388,448 dTLB-load-misses:u # 0.48% of all dTLB cache hits (30.67%)
114,007 iTLB-loads:u # 0.009 M/sec (30.55%)
3,475,836 iTLB-load-misses:u # 3048.79% of all iTLB cache hits (30.34%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
10.669283491 seconds time elapsed
The bpftrace output:
zjc@test_dax_hugepage $ sudo bpftrace -e ' tracepoint:fs_dax:dax_pmd_fault_done {printf("dax_pmd_fault_done result: 0x%x\n", args->result);@[args->result]=count()}'
dax_pmd_fault_done result: 0x800
dax_pmd_fault_done result: 0x800
dax_pmd_fault_done result: 0x800
...
dax_pmd_fault_done result: 0x800
dax_pmd_fault_done result: 0x800
^C
@[2048]: 51199
4.4. Analysis
Test environment:
Item | Value |
---|---|
Kernel version | 4.16.0 |
fio version | 3.16 |
Filesystem | ext4-dax |
Hardware | AEP (interleaved) |
Comparison of results:
Metric | 2 MiB mmap | 1 MiB mmap |
---|---|---|
dax_pmd_fault_done result | VM_FAULT_NOPAGE | VM_FAULT_FALLBACK |
Random read latency (us) | 1.67 | 4.18 |
page-faults:u | 157,520 | 2,282,121 |
dTLB-load-misses:u | 5,516,303 | 8,388,448 |
iTLB-load-misses:u | 120,540 | 3,475,836 |
The comparison shows that only mmap() calls of at least 2 MiB benefit from the DAX filesystem's hugepage mappings: the roughly 51,200 dax_pmd_fault_done events in the 2 MiB run (about one per 2 MiB chunk of the 100 GB test file) all report NOPAGE, while the 1 MiB run only ever falls back. On AEP, the random 4K read latency measured with the fio mmap engine drops from 4.18 us to 1.67 us (about 2.5x), and page faults and TLB misses are also reduced significantly.
[1] https://nvdimm.wiki.kernel.org/2mib_fs_dax
[2] https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#6-tracepoint-static-tracing-kernel-level-arguments