
Critical performance drop on newly created large file

ANaza
New Contributor II
  • NVMe drive model: Intel SSD DC P3700 U.2 NVMe SSD
  • Capacity: 764G
  • FS: XFS
  • Other HW:
    • AIC SB122A-PH
    • 8 x Intel DC P3700 NVMe SSDs (2 on CPU 0, 6 on CPU 1)
    • 128 GiB RAM (8 x 16 GiB DDR4 2400 MHz DIMMs)
    • 2 x Intel E5-2620v3 2.4 GHz CPUs
    • 2 x Intel DC S2510 SATA SSDs (one is used as the system drive).
    • Note that both are engineering samples provided by Intel NSG, but all have been updated to the latest firmware using isdct 3.0.0.
  • OS: CentOS Linux release 7.2.1511 (Core)
  • Kernel: Linux fs00 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

We have been testing two Intel DC P3700 U.2 800GB NVMe SSDs to see the impact of the emulated sector size (512 vs. 4096 bytes) on throughput. Using fio 2.12, we observed a puzzling collapse in performance. The steps are given below.
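For reference, one way to check and switch the emulated sector size is shown below. This is only a sketch, using nvme-cli rather than isdct, with /dev/nvme0n1 as a placeholder device; a namespace reformat destroys all data on it.

# List the LBA formats the namespace supports (512- vs 4096-byte data size)
nvme id-ns /dev/nvme0n1 | grep lbaf

# Reformat the namespace to another LBA format index, e.g. index 1 (DESTROYS ALL DATA;
# the index that corresponds to 4096-byte sectors varies per drive)
nvme format /dev/nvme0n1 --lbaf=1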

Steps:

1. Copy or sequentially write a single large file (300G or larger); see the sketch after the job file below.

2. Start the fio test with the following config:

[readtest]
thread=1
blocksize=2m
filename=/export/beegfs/data0/file_000000
rw=randread
direct=1
buffered=0
ioengine=libaio
nrfiles=1
gtod_reduce=0
numjobs=32
iodepth=128
runtime=360
group_reporting=1
percentage_random=90
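For anyone reproducing steps 1 and 2, a minimal sketch is below; the file name is taken from the job above, while the dd parameters and the job-file name readtest.fio are just examples:

# Step 1: sequentially write a single ~300G file (parameters are examples only)
dd if=/dev/zero of=/export/beegfs/data0/file_000000 bs=1M count=307200 oflag=direct

# Step 2: save the job shown above as readtest.fio and start it
fio readtest.fio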

3. Observe extremely slow performance:

fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5097: Thu Jul 14 13:00:25 2016
  read : io=65536KB, bw=137028B/s, iops=0, runt=489743msec
    slat (usec): min=4079, max=7668, avg=5279.19, stdev=662.80
    clat (msec): min=3, max=25, avg=18.97, stdev= 6.16
     lat (msec): min=8, max=31, avg=24.25, stdev= 6.24
    clat percentiles (usec):
     | 1.00th=[ 3280], 5.00th=[ 4320], 10.00th=[ 9664], 20.00th=[17536],
     | 30.00th=[18816], 40.00th=[20352], 50.00th=[20608], 60.00th=[21632],
     | 70.00th=[21632], 80.00th=[22912], 90.00th=[25472], 95.00th=[25472],
     | 99.00th=[25472], 99.50th=[25472], 99.90th=[25472], 99.95th=[25472],
     | 99.99th=[25472]
    lat (msec) : 4=3.12%, 10=9.38%, 20=25.00%, 50=62.50%
  cpu : usr=0.00%, sys=74.84%, ctx=792583, majf=0, minf=16427
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued : total=r=32/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: io=65536KB, aggrb=133KB/s, minb=133KB/s, maxb=133KB/s, mint=489743msec, maxt=489743msec

Disk stats (read/write):
  nvme0n1: ios=0/64317, merge=0/0, ticks=0/1777871, in_queue=925406, util=0.19%

4. Repeat the test

5. Performance is much higher:

fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5224: Thu Jul 14 13:11:58 2016
  read : io=861484MB, bw=2389.3MB/s, iops=1194, runt=360564msec
    slat (usec): min=111, max=203593, avg=26742.15, stdev=21321.98
    clat (msec): min=414, max=5176, avg=3391.05, stdev=522.29
     lat (msec): min=414, max=5247, avg=3417.79, stdev=524.75
    clat percentiles (msec):
     | 1.00th=[ 1614], 5.00th=[ 2376], 10.00th=[ 2802], 20.00th=[ 3097],
     | 30.00th=[ 3228], 40.00th=[ 3359], 50.00th=[ 3458], 60.00th=[ 3556],
     | 70.00th=[ 3654], 80.00th=[ 3785], 90.00th=[ 3949], 95.00th=[ 4080],
     | 99.00th=[ 4359], 99.50th=[ 4424], 99.90th=[ 4752], 99.95th=[ 4883],
<e...


idata
Esteemed Contributor III

I read this thread with strong interest. I concur with AlexNZ: testing files residing on a file system is far more useful for production situations. We do so to figure out the overhead of

  • local file system (XFS, EXT4 etc)
  • Distributed file system(s) (Lustre, GPFS etc)

over raw devices (individual and aggregated).

The following suggestion from NC is only for testing raw devices.

fio --output=test_result.txt --name=myjob --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --norandommap --randrepeat=0 --runtime=600 --blocksize=4K --rw=randread --iodepth=32 --numjobs=4 --group_reporting

On our end, we have done many hundreds of raw device tests. Results are always in line with what Intel has published. But this particular file testing result, as I posted on July 15, is a "shocker"!

It would be great to know why fio reading a regular file from an NVMe SSD with direct=1 is still affected by data in the page cache.

Another point: we understand why numjobs=4 and iodepth=32 are usually recommended for Intel NVMe SSDs. But such settings are only optimal for raw devices, right? When it comes to reading/writing regular files, IMHO we should configure fio with parameter values that match those of the actual workloads as closely as possible. NC, your view?
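As a purely hypothetical illustration of what I mean by matching the workload: if the application mostly streams large files with a few concurrent readers, a job along these lines would be closer to reality than the raw-device settings (the directory, block size, and job counts below are made-up placeholders):

[filetest]
; placeholder mount point
directory=/export/beegfs/data0
ioengine=libaio
direct=1
rw=read
blocksize=1m
size=32g
numjobs=4
iodepth=8
runtime=300
group_reporting=1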

idata
Esteemed Contributor III

Hello all,

Having reviewed the situation and all the information provided, we will be escalating it and will post updates here. Please expect a response soon.

NC

idata
Esteemed Contributor III

Hello all,

We would like you to try the test again, but before that, could you please try to TRIM the drives first? Once you have done that, please share the results with us.

Also, please make sure you are using the correct driver from this link: https://downloadcenter.intel.com/download/23929/Intel-SSD-Data-Center-Family-for-NVMe-Drivers

Something important to mention is that the performance tools we use are synthetic benchmarking tools, as explained in the Intel® Solid-State Drive DC P3700 evaluation guide, and these are intended to measure the behavior of the SSD without taking into consideration other components in the system that would add "bottlenecks". Synthetic benchmarks measure raw drive I/O transfer rates. Here is the evaluation guide: http://manuals.ts.fujitsu.com/file/12176/fujitsu_intel-ssd-dc-pcie-eg-en.pdf

Please let us know.

NC
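For what it is worth, on a raw device that carries no file system, the whole namespace can be TRIMmed with blkdiscard from util-linux. This is only a sketch with a placeholder device name, and it destroys all data on the device:

blkdiscard /dev/nvme0n1

For a mounted file system, fstrim on the mount point is the usual route, as the next reply shows.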

idata
Esteemed Contributor III

Thanks for your follow-up. I did try fstrim on a DC P3700 NVMe SSD here.

First of all, let's get the driver and firmware issue out of the way. The server runs CentOS 7.2:

[root@fs11 ~]# uname -a
Linux fs11 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@fs11 ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

We also use the latest isdct:

[root@fs11 ~]# isdct version
- Version Information -
Name: Intel(R) Data Center Tool
Version: 3.0.0
Description: Interact and configure Intel SSDs.

And, according to the tool, the drive is healthy:

[root@fs11 ~]# isdct show -intelssd 2
- Intel SSD DC P3700 Series CVFT515400401P6JGN -
Bootloader : 8B1B0131
DevicePath : /dev/nvme2n1
DeviceStatus : Healthy
Firmware : 8DV10171
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 2
ModelNumber : INTEL SSDPE2MD016T4
ProductFamily : Intel SSD DC P3700 Series
SerialNumber : CVFT515400401P6JGN

While the drive still had a file system (XFS) with data on it, I ran fstrim:

[root@fs11 ~]# fstrim -v /export/beegfs/data2
fstrim: /export/beegfs/data2: FITRIM ioctl failed: Input/output error
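(As an aside, a quick way to check whether the block device advertises discard support at all, before blaming the file system, is shown below. These are standard util-linux/sysfs interfaces; the device name matches the mount above.)

# Show discard granularity/limits as seen by the kernel
lsblk -D /dev/nvme2n1
cat /sys/block/nvme2n1/queue/discard_max_bytes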

So, I unmounted the XFS file system, used isdct delete to remove all data, recreated the file system, mounted it again, and ran fstrim once more.

Same outcome. Please see the session log below:

[root@fs11 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       192G  2.4G  190G   2% /
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G     0   63G   0% /dev/shm
tmpfs            63G   26M   63G   1% /run
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda1       506M  166M  340M  33% /boot
/dev/sdb        168G   73M  157G   1% /export/beegfs/meta
tmpfs            13G     0   13G   0% /run/user/99
/dev/nvme2n1    1.5T  241G  1.3T  17% /export/beegfs/data2
tmpfs            13G     0   13G   0% /run/user/0
[root@fs11 ~]# umount /export/beegfs/data2
[root@fs11 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       192G  2.4G  190G   2% /
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G     0   63G   0% /dev/shm
tmpfs            63G   26M   63G   1% /run
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda1       506M  166M  340M  33% /boot
/dev/sdb ...

ANaza
New Contributor II

Hello,

I can confirm that after TRIM the result is still poor.

Actually, after a quick look at the Linux kernel code, including the XFS implementation, I found that the page cache is still involved even during direct reads.

But such poor performance still looks weird.
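A rough way to test this hypothesis (assuming the file and job file from the original post; vmtouch is an optional third-party tool, the rest is standard on CentOS 7):

# How much of the freshly written file is still resident in the page cache?
vmtouch /export/beegfs/data0/file_000000

# Flush dirty pages, then drop the clean page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches

# Re-run the first fio pass; if the collapse disappears, the cost was
# page-cache writeback/invalidation triggered by the O_DIRECT reads
fio readtest.fio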