ANaza
Novice

Critical performance drop on newly created large file

  • NVMe drive model: Intel SSD DC P3700 U.2 NVMe SSD
  • Capacity: 764G
  • FS: XFS
  • Other HW:
    • AIC SB122A-PH
    • 8 Intel NVMe DC P3700 drives: 2 on CPU 0, 6 on CPU 1
    • 128 GiB RAM (8 x 16 GiB DDR4 2400 MHz DIMMs)
    • 2 x Intel E5-2620v3 2.4 GHz CPUs
    • 2 x Intel DC S2510 SATA SSDs (one is used as the system drive).
    • Note that both are engineering samples provided by Intel NSG, but all have had their firmware updated to the latest using isdct 3.0.0.
  • OS: CentOS Linux release 7.2.1511 (Core)
  • Kernel: Linux fs00 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

We have been testing two Intel DC P3700 U.2 800GB NVMe SSDs to see the impact of the emulated sector size (512 vs. 4096 bytes) on throughput. Using fio 2.12, we observed a puzzling collapse in performance. The steps are given below.

Steps:

1. Copy or sequentially write a single large file (300G or larger); one way to do this is sketched below.
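
For example (a sketch only; the path and size mirror the job file in step 2, and either command is just an illustrative sequential write):

dd if=/dev/zero of=/export/beegfs/data0/file_000000 bs=1M count=307200 oflag=direct
# or an equivalent sequential fio write pass:
# fio --name=layout --filename=/export/beegfs/data0/file_000000 --rw=write --bs=1m --size=300g --ioengine=libaio --iodepth=16 --direct=1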

2. Start fio test with the following config:

[readtest]
thread=1
blocksize=2m
filename=/export/beegfs/data0/file_000000
rw=randread
direct=1
buffered=0
ioengine=libaio
nrfiles=1
gtod_reduce=0
numjobs=32
iodepth=128
runtime=360
group_reporting=1
percentage_random=90

3. Observe extremely slow performance:

fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5097: Thu Jul 14 13:00:25 2016
  read : io=65536KB, bw=137028B/s, iops=0, runt=489743msec
    slat (usec): min=4079, max=7668, avg=5279.19, stdev=662.80
    clat (msec): min=3, max=25, avg=18.97, stdev= 6.16
     lat (msec): min=8, max=31, avg=24.25, stdev= 6.24
    clat percentiles (usec):
     |  1.00th=[ 3280],  5.00th=[ 4320], 10.00th=[ 9664], 20.00th=[17536],
     | 30.00th=[18816], 40.00th=[20352], 50.00th=[20608], 60.00th=[21632],
     | 70.00th=[21632], 80.00th=[22912], 90.00th=[25472], 95.00th=[25472],
     | 99.00th=[25472], 99.50th=[25472], 99.90th=[25472], 99.95th=[25472],
     | 99.99th=[25472]
    lat (msec) : 4=3.12%, 10=9.38%, 20=25.00%, 50=62.50%
  cpu          : usr=0.00%, sys=74.84%, ctx=792583, majf=0, minf=16427
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=32/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: io=65536KB, aggrb=133KB/s, minb=133KB/s, maxb=133KB/s, mint=489743msec, maxt=489743msec

Disk stats (read/write):
  nvme0n1: ios=0/64317, merge=0/0, ticks=0/1777871, in_queue=925406, util=0.19%

4. Repeat the test

5. Performance is much higher:

fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5224: Thu Jul 14 13:11:58 2016
  read : io=861484MB, bw=2389.3MB/s, iops=1194, runt=360564msec
    slat (usec): min=111, max=203593, avg=26742.15, stdev=21321.98
    clat (msec): min=414, max=5176, avg=3391.05, stdev=522.29
     lat (msec): min=414, max=5247, avg=3417.79, stdev=524.75
    clat percentiles (msec):
     |  1.00th=[ 1614],  5.00th=[ 2376], 10.00th=[ 2802], 20.00th=[ 3097],
     | 30.00th=[ 3228], 40.00th=[ 3359], 50.00th=[ 3458], 60.00th=[ 3556],
     | 70.00th=[ 3654], 80.00th=[ 3785], 90.00th=[ 3949], 95.00th=[ 4080],
     | 99.00th=[ 4359], 99.50th=[ 4424], 99.90th=[ 4752], 99.95th=[ 4883],
(output truncated)

idata
Community Manager

Since we deal with a similar situation, I tried the steps above and confirmed the issue on our machine. In fact, I tried it with both XFS and EXT4; the symptom showed up regardless.

idata
Community Manager

AlexNZ,

Thanks for bringing this situation to our attention; we would like to verify it and provide a solution as quickly as possible. Please allow us some time to check on this, and we will keep you all posted.

NC
idata
Community Manager

Hello,

After reviewing the settings, we would like to verify the following:

For the read test, could you please try: fio --output=test_result.txt --name=myjob --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --norandommap --randrepeat=0 --runtime=600 --blocksize=4K --rw=randread --iodepth=32 --numjobs=4 --group_reporting

It is important to note that we normally run these tests with 4 jobs (numjobs=4) and iodepth=32 for blocksize=4K.

Please let us know, as we may need to keep researching this.

NC
ANaza
Novice

Hello,

With the proposed settings I received the following result:

myjob: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
...
fio-2.12
Starting 4 processes
myjob: (groupid=0, jobs=4): err= 0: pid=23560: Wed Jul 20 07:06:08 2016
  read : io=1092.2GB, bw=1863.1MB/s, iops=477156, runt=600001msec
    slat (usec): min=1, max=63, avg= 2.76, stdev= 1.57
    clat (usec): min=14, max=3423, avg=260.81, stdev=90.86
     lat (usec): min=18, max=3426, avg=263.68, stdev=90.84
    clat percentiles (usec):
     |  1.00th=[  114],  5.00th=[  139], 10.00th=[  157], 20.00th=[  185],
     | 30.00th=[  207], 40.00th=[  229], 50.00th=[  251], 60.00th=[  274],
     | 70.00th=[  298], 80.00th=[  326], 90.00th=[  374], 95.00th=[  422],
     | 99.00th=[  532], 99.50th=[  588], 99.90th=[  716], 99.95th=[  788],
     | 99.99th=[ 1048]
    bw (KB /s): min= 5400, max=494216, per=25.36%, avg=484036.11, stdev=14017.77
    lat (usec) : 20=0.01%, 50=0.01%, 100=0.23%, 250=49.61%, 500=48.54%
    lat (usec) : 750=1.55%, 1000=0.06%
    lat (msec) : 2=0.01%, 4=0.01%
  cpu          : usr=15.00%, sys=41.78%, ctx=77056567, majf=0, minf=264
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=286294132/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=1092.2GB, aggrb=1863.1MB/s, minb=1863.1MB/s, maxb=1863.1MB/s, mint=600001msec, maxt=600001msec

Disk stats (read/write):
  nvme0n1: ios=286276788/29109, merge=0/0, ticks=72929877/10859607, in_queue=84848144, util=99.33%

But in this case the test was against the raw device (/dev/nvme0n1), whereas in our case it was a file on XFS on the NVMe drive.

Also, during the latest tests we determined that flushing the page cache (echo 1 > /proc/sys/vm/drop_caches) solves the problem.
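
The workaround sequence, roughly (a sketch; "readtest.fio" is just a placeholder name for the job file posted above):

# 1. create or copy the large file as before
# 2. flush the (clean) page cache before starting the random-read test (run as root)
sync
echo 1 > /proc/sys/vm/drop_caches
# 3. run the fio job against the file
fio readtest.fio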

Why the page cache affects direct I/O is still the question.

Could it be something specific to the NVMe driver?

AlexNZ

idata
Community Manager

I read this thread with strong interest. I concur with AlexNZ: testing files residing on a file system is far more relevant to production situations. We do so to figure out the overhead of

  • local file systems (XFS, EXT4, etc.)
  • distributed file systems (Lustre, GPFS, etc.)

over raw devices (individual and aggregated).

The following suggestion from NC is only for testing raw devices.

fio --output=test_result.txt --name=myjob --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --norandommap --randrepeat=0 --runtime=600 --blocksize=4K --rw=randread --iodepth=32 --numjobs=4 --group_reporting

On our end, we have done many hundreds of raw-device tests. The results are always in line with what Intel has published. But this particular file-testing result, as I posted on July 15, is a "shocker"!

It would be great to know why fio reading a regular file from an NVMe SSD with direct=1 is still affected by data in the page cache.

Another point: we understand why numjobs=4 and iodepth=32 are usually used for Intel NVMe SSDs. But such settings are only optimal for raw devices, right? When it comes to reading and writing regular files, IMHO we should configure fio with parameter values that match the actual workloads as closely as possible. NC, your view?
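
For file-based testing, a job file along these lines might track a production workload more closely (a sketch only: the file path follows the original post, while the block size, queue depth, and job count are placeholders to be replaced with values measured from the real application):

[file-randread]
filename=/export/beegfs/data0/file_000000
rw=randread
# match the application's typical request size
blocksize=2m
direct=1
ioengine=libaio
# placeholder: set to the real per-job queue depth
iodepth=16
# placeholder: set to the real number of concurrent readers
numjobs=8
runtime=360
group_reporting=1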

idata
Community Manager

Hello all,

Having reviewed all of the information provided, we will escalate this situation and post updates here. Please expect a response soon.

NC
idata
Community Manager

Hello all,

 

 

We would like you to run the test again, but before that, could you please TRIM the drives first (for example as sketched below)? Once you have done that, please share the results with us.
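
On a mounted file system, the TRIM can be issued with fstrim (a sketch; the mount point here is the one from the original post, adjust as needed):

fstrim -v /export/beegfs/data0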

 

 

Also, please make sure you are using the correct driver, available from this link: https://downloadcenter.intel.com/download/23929/Intel-SSD-Data-Center-Family-for-NVMe-Drivers

Something important to mention is that the performance tools we use are synthetic benchmarking tools, as explained in the Intel® Solid-State Drive DC P3700 evaluation guide. They are intended to measure the behavior of the SSD without taking into consideration other components in the system that could add bottlenecks; synthetic benchmarks measure raw drive I/O transfer rates.

Here is the evaluation guide: http://manuals.ts.fujitsu.com/file/12176/fujitsu_intel-ssd-dc-pcie-eg-en.pdf

Please let us know.

NC
idata
Community Manager

Thanks for your follow-up. I did try fstrim on a DC P3700 NVMe SSD here.

First of all, let's get the driver and firmware questions out of the way. The server runs CentOS 7.2:

[root@fs11 ~]# uname -a
Linux fs11 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@fs11 ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

We also use the latest isdct:

[root@fs11 ~]# isdct version
- Version Information -
Name: Intel(R) Data Center Tool
Version: 3.0.0
Description: Interact and configure Intel SSDs.

And, according to the tool, the drive is healthy:

[root@fs11 ~]# isdct show -intelssd 2
- Intel SSD DC P3700 Series CVFT515400401P6JGN -
Bootloader : 8B1B0131
DevicePath : /dev/nvme2n1
DeviceStatus : Healthy
Firmware : 8DV10171
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 2
ModelNumber : INTEL SSDPE2MD016T4
ProductFamily : Intel SSD DC P3700 Series
SerialNumber : CVFT515400401P6JGN

While the drive still had an XFS file system with data on it, I ran fstrim:

[root@fs11 ~]# fstrim -v /export/beegfs/data2
fstrim: /export/beegfs/data2: FITRIM ioctl failed: Input/output error

So I unmounted the XFS file system, used isdct delete to remove all data, recreated the file system, mounted it again, and then ran fstrim.

Same outcome. Please see the session log below:

[root@fs11 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       192G  2.4G  190G   2% /
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G     0   63G   0% /dev/shm
tmpfs            63G   26M   63G   1% /run
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda1       506M  166M  340M  33% /boot
/dev/sdb        168G   73M  157G   1% /export/beegfs/meta
tmpfs            13G     0   13G   0% /run/user/99
/dev/nvme2n1    1.5T  241G  1.3T  17% /export/beegfs/data2
tmpfs            13G     0   13G   0% /run/user/0
[root@fs11 ~]# umount /export/beegfs/data2
[root@fs11 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       192G  2.4G  190G   2% /
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G     0   63G   0% /dev/shm
tmpfs            63G   26M   63G   1% /run
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda1       506M  166M  340M  33% /boot
/dev/sdb ...

ANaza
Novice

Hello,

I can confirm that after TRIM the result is still poor.

Actually, after a quick look at the Linux kernel code, including the XFS implementation, I found that the page cache is still involved even during direct reads.

But such poor performance still looks weird.

idata
Community Manager

Just a quick supplement regarding the I/O errors that I reported in my last reply. I even tried the following:

  1. umount the drive
  2. Do an nvmeformat: isdct start -intelssd 2 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0
  3. Recreate the XFS file system
  4. Mount it
  5. Run fstrim -v on the mount point

I still got:

[root@fs11 ~]# dmesg |tail -11
[987891.677911] nvme2n1: unknown partition table
[987898.749260] XFS (nvme2n1): Mounting V4 Filesystem
[987898.752844] XFS (nvme2n1): Ending clean mount
[987948.612051] blk_update_request: I/O error, dev nvme2n1, sector 3070890712
[987948.612088] blk_update_request: I/O error, dev nvme2n1, sector 3087667912
[987948.612151] blk_update_request: I/O error, dev nvme2n1, sector 3121222312
[987948.612193] blk_update_request: I/O error, dev nvme2n1, sector 3062502112
[987948.612211] blk_update_request: I/O error, dev nvme2n1, sector 3104445112
[987948.612228] blk_update_request: I/O error, dev nvme2n1, sector 3079279312
[987948.612296] blk_update_request: I/O error, dev nvme2n1, sector 3096056512
[987948.612314] blk_update_request: I/O error, dev nvme2n1, sector 3112833712

So, unlike the SCSI drives I used years ago, the format didn't remap the "bad" sectors. I would appreciate a hint on how to get this issue resolved too.

idata
Community Manager

I tried to narrow down the cause of the fstrim issue further. It seems to me that the hardware (i.e., the NVMe SSD itself) is responsible, rather than the software layer on top of it (XFS). So I decided to add a partition table first and create XFS on the partition. As is evident below, adding the partition didn't help.

Is the drive faulty? If so, why does isdct still deem its DeviceStatus Healthy?

[root@fs11 ~]# isdct delete -f -intelssd 2
Deleting...
- Intel SSD DC P3700 Series CVFT515400401P6JGN -
Status : Delete successful.
[root@fs11 ~]# parteed -a optimal /dev/nvme2n1 mklabel gpt
-bash: parteed: command not found
[root@fs11 ~]# parted -a optimal /dev/nvme2n1 mklabel gpt
Information: You may need to update /etc/fstab.
[root@fs11 ~]# parted /dev/nvme2n1 mkpart primary 1048576B 100%
Information: You may need to update /etc/fstab.
[root@fs11 ~]# parted /dev/nvme2n1
GNU Parted 3.1
Using /dev/nvme2n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Unknown (unknown)
Disk /dev/nvme2n1: 1600GB
Sector size (logical/physical): 4096B/4096B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name     Flags
 1      1049kB  1600GB  1600GB               primary
(parted) quit
[root@fs11 ~]# mkfs.xfs -K -f -d agcount=24 -l size=128m,version=2 -i size=512 -s size=4096 /dev/nvme2n1
meta-data=/dev/nvme2n1      isize=512    agcount=24, agsize=16279311 blks
         =                  sectsz=4096  attr=2, projid32bit=1
         =                  crc=0        finobt=0
data     =                  bsize=4096   blocks=390703446, imaxpct=5
         =                  sunit=0      swidth=0 blks
naming   =version 2         bsize=4096   ascii-ci=0 ftype=0
(output truncated)
idata
Community Manager

Hello,

Thanks, everyone, for trying the suggestion. We would like to gather all of this input and research it with our department in order to work on a resolution for all of you.

Please allow us some time to do the research; we will keep you posted.

NC
idata
Community Manager

Thanks for following up. I reviewed what I had done regarding fstrim and the tests I have run, and came up with two additional plausible causes:

  1. In the way I run mkfs.xfs, I always use the -K option (do not attempt to discard blocks at mkfs time); what if I don't use it?
  2. I would like to take advantage of the variable sector size support provided by the DC P3700, so we are currently evaluating the performance benefits of a larger SectorSize. Thus, the NVMe SSDs that I tested fstrim on have a 4096-byte sector size. What happens if I retain the default 512? (A sketch of how to check the supported LBA formats follows this list.)
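
To check which sector sizes (LBA formats) the drive reports and which one is in use, nvme-cli can be used (a sketch; it assumes the nvme-cli package is installed and that the drive under test is /dev/nvme2n1 as in the session logs below):

# list the namespace's LBA formats; lbads:9 means 512-byte sectors, lbads:12 means 4096-byte sectors
nvme id-ns /dev/nvme2n1 | grep lbaf
# the low-level format change itself is the isdct nvmeformat command quoted earlier in this thread,
# with LBAformat set to the index of the desired sector size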

My tests indicate that the Intel DC P3700 firmware, the Linux NVMe driver, or both may have a bug. The following is my evidence; please review.

We use a lot of Intel DC P3700 SSDs of various capacities - 800GB and 1.6TB are two common ones - and have run hundreds of tests on them.

We also understand that with Intel DC P3700 NVMe SSDs there is no need to run TRIM at all; the firmware's garbage collection takes care of such needs transparently, behind the scenes. Still, IMHO it's a good idea that, when the sector size is changed, well-known Linux utilities continue to work as anticipated. We ran into this issue by serendipity and got a "nice" surprise along the way.

Case 1. mkfs.xfs without -K

We will pick /dev/nvme2n1, umount it, isdct delete all data on it, run mkfs.xfs without the -K flag, and then run fstrim. The intended sequence is sketched below, followed by the actual session log.
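
A sketch of the intended sequence (the commands follow the earlier posts in this thread):

umount /export/beegfs/data2
isdct delete -f -intelssd 2
# same mkfs.xfs options as before, but without -K, so mkfs issues discards itself
mkfs.xfs -f -d agcount=24 -l size=128m,version=2 -i size=512 -s size=4096 /dev/nvme2n1
mount /dev/nvme2n1 /export/beegfs/data2
fstrim -v /export/beegfs/data2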

[root@fs11 ~]# man mkfs.xfs
[root@fs11 ~]# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda         8:0    0 223.6G  0 disk
├─sda1      8:1    0   512M  0 part /boot
├─sda2      8:2    0  31.5G  0 part [SWAP]
└─sda3      8:3    0 191.6G  0 part /
sdb         8:16   0 223.6G  0 disk /export/beegfs/meta
sdc         8:32   0  59.6G  0 disk
sdd         8:48   0  59.6G  0 disk
sr0        11:0    1  1024M  0 rom
nvme0n1   259:2    0   1.5T  0 disk /export/beegfs/data0
nvme1n1   259:6    0   1.5T  0 disk /export/beegfs/data1
nvme2n1   259:7    0   1.5T  0 disk /export/beegfs/data2
nvme3n1   259:5    0   1.5T  0 disk /export/beegfs/data3
nvme4n1   259:0    0   1.5T  0 disk /export/beegfs/data4
nvme5n1   259:3    0   1.5T  0 disk
nvme6n1   259:1    0   1.5T  0 disk
nvme7n1   259:4    0   1.5T  0 disk
[root@fs11 ~]# umount /export/beegfs/data2
[root@fs11 ~]# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda         8:0    0 223.6G  0 disk
├─sda1      8:1    0   512M  0 part /boot
├─sda2      8:2    0  31.5G  0 part [SWAP]
└─sda3      8:3    0 191.6G  0 part /
sdb         8:16   0 223.6G  0 disk /export/beegfs/meta
sdc         8:32   0  59.6G  0 disk
sdd         8:48   0  59.6G  0 disk
sr0        11:0    1  1024M  0 rom
nvme0n1   259:2    0   1.5T  0 disk /export/beegfs/data0
nvme1n1   259:6    0   1.5T  0 disk /export/beegfs/data1
nvme2n1   259:7    0   1.5T  0 disk
nvme3n1   259:5    0   1.5T  0 disk /export/beegfs/data3
nvme4n1   259:0    0   1.5T  0 disk /export/beegfs/data4
(log truncated)
idata
Community Manager

Hello everyone,

We would like to address the performance-drop question first so we don't mix the two situations.

Can you please confirm whether this is the process you followed:

- Create the large file
- Flush the page cache
- Run fio

Also, at which step are you flushing the page cache to avoid the performance drop?

Please let us know.

NC
ANaza
Novice

Hello,

First we skipped flushing the page cache and ran the fio test right after creating the large file. With that approach, the results of the direct-read tests were very poor.

But as I mentioned above, we found that flushing the page cache after file creation improves the situation, which is confusing because O_DIRECT mode is supposed to bypass the page cache entirely.

Later I reviewed the Linux kernel code and found that it performs some page cache operations even in direct mode, so now I suspect this issue is related to the Linux kernel rather than the drive.
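
A quick way to see how much page cache the freshly written file leaves behind, using only standard tools (a rough sketch; the Cached figure is system-wide, so it is only indicative on an otherwise idle machine):

grep ^Cached: /proc/meminfo           # cached memory right after writing the large file
sync
echo 1 > /proc/sys/vm/drop_caches     # drop the clean page cache (as root)
grep ^Cached: /proc/meminfo           # should drop sharply; the direct-read fio test is then fast again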

idata
Community Manager

AlexNZ,

Thanks for the information provided. We will continue with our testing here and will let you know soon.

NC
idata
Community Manager

Hello Everyone,

Our engineering team is investigating this report, and we will share any results once we get them.

Thanks.

NC
idata
Community Manager

Hi AlexNZ,

Chances are that your findings about the kernel are the reason for this drop. We understand that Linux users can submit kernel questions, findings, and bugs here: https://bugzilla.kernel.org/. Here are some instructions that we found: https://www.kernel.org/pub/linux/docs/lkml/reporting-bugs.html

It is very important to bear in mind that the benchmarking we provide using fio (or IOMeter for Windows), as per the evaluation guide shared in a previous post, is not done the same way you have reported doing it. The guide states that these are synthetic tools intended for raw-disk measurements, and you all seem to be getting the numbers we have published in the SSD specs when measuring the raw disk. The drop you see appears once the file system is created and, as you may know, different file systems can yield different SSD performance numbers. Some interesting articles on this (which you may already be aware of, but still worth sharing):

http://www.linux-magazine.com/Issues/2015/172/Tuning-Your-SSD

https://wiki.archlinux.org/index.php/Solid_State_Drives

http://www.phoronix.com/scan.php?page=article&item=linux-43-ssd&num=1

Please let us know if you have any questions.

NC
ANaza
Novice

Hello NC,

Thanks for your reply.

I'll consider asking the kernel community about it. But since I know how to avoid this effect, and I know the kernel actually manipulates the page cache in direct mode, I think it's no longer so important.

Alex

idata
Community Manager

Hi AlexNZ,

Please share any findings from the kernel community; we will be waiting for your response.

NC