ANaza
Novice

Critical performance drop on newly created large file

  • NVMe drive model: Intel SSD DC P3700 U.2 NVMe SSD
  • Capacity: 764G
  • FS: XFS
  • Other HW:
    • AIC SB122A-PH
    • 8 Intel NVMe DC P3700 drives: 2 on CPU 0, 6 on CPU 1
    • 128 GiB RAM (8 x 16 GiB DDR4 2400 MHz DIMMs)
    • 2 x Intel E5-2620v3 2.4 GHz CPUs
    • 2 x Intel DC S2510 SATA SSDs (one is used as the system drive).
    • Note that both are engineering samples provided by Intel NSG, but all have had their firmware updated to the latest using isdct 3.0.0.
  • OS: CentOS Linux release 7.2.1511 (Core)
  • Kernel: Linux fs00 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

We have been testing two Intel DC P3700 U.2 800GB NVMe SSDs to see the impact of the emulated sector size (512 vs. 4096 bytes) on throughput. Using fio 2.12, we observed a puzzling collapse in performance. The steps are given below.

Steps:

1. Copy or sequentially write a single large file (300G or larger); one way to do this is sketched below.
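
For example (a sketch only; the path and size mirror the job file in step 2, and either command is just an illustrative sequential write):

dd if=/dev/zero of=/export/beegfs/data0/file_000000 bs=1M count=307200 oflag=direct
# or an equivalent sequential fio write pass:
# fio --name=layout --filename=/export/beegfs/data0/file_000000 --rw=write --bs=1m --size=300g --ioengine=libaio --iodepth=16 --direct=1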

2. Start fio test with the following config:

[readtest]
thread=1
blocksize=2m
filename=/export/beegfs/data0/file_000000
rw=randread
direct=1
buffered=0
ioengine=libaio
nrfiles=1
gtod_reduce=0
numjobs=32
iodepth=128
runtime=360
group_reporting=1
percentage_random=90

3. Observe extremely slow performance:

fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5097: Thu Jul 14 13:00:25 2016
  read : io=65536KB, bw=137028B/s, iops=0, runt=489743msec
    slat (usec): min=4079, max=7668, avg=5279.19, stdev=662.80
    clat (msec): min=3, max=25, avg=18.97, stdev= 6.16
     lat (msec): min=8, max=31, avg=24.25, stdev= 6.24
    clat percentiles (usec):
     |  1.00th=[ 3280],  5.00th=[ 4320], 10.00th=[ 9664], 20.00th=[17536],
     | 30.00th=[18816], 40.00th=[20352], 50.00th=[20608], 60.00th=[21632],
     | 70.00th=[21632], 80.00th=[22912], 90.00th=[25472], 95.00th=[25472],
     | 99.00th=[25472], 99.50th=[25472], 99.90th=[25472], 99.95th=[25472],
     | 99.99th=[25472]
    lat (msec) : 4=3.12%, 10=9.38%, 20=25.00%, 50=62.50%
  cpu          : usr=0.00%, sys=74.84%, ctx=792583, majf=0, minf=16427
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=32/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: io=65536KB, aggrb=133KB/s, minb=133KB/s, maxb=133KB/s, mint=489743msec, maxt=489743msec

Disk stats (read/write):
  nvme0n1: ios=0/64317, merge=0/0, ticks=0/1777871, in_queue=925406, util=0.19%

4. Repeat the test

5. Performance is much higher:

fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5224: Thu Jul 14 13:11:58 2016
  read : io=861484MB, bw=2389.3MB/s, iops=1194, runt=360564msec
    slat (usec): min=111, max=203593, avg=26742.15, stdev=21321.98
    clat (msec): min=414, max=5176, avg=3391.05, stdev=522.29
     lat (msec): min=414, max=5247, avg=3417.79, stdev=524.75
    clat percentiles (msec):
     |  1.00th=[ 1614],  5.00th=[ 2376], 10.00th=[ 2802], 20.00th=[ 3097],
     | 30.00th=[ 3228], 40.00th=[ 3359], 50.00th=[ 3458], 60.00th=[ 3556],
     | 70.00th=[ 3654], 80.00th=[ 3785], 90.00th=[ 3949], 95.00th=[ 4080],
     | 99.00th=[ 4359], 99.50th=[ 4424], 99.90th=[ 4752], 99.95th=[ 4883],
(output truncated)

idata
Community Manager

Since we deal with a similar situation, I tried the steps above and confirmed the issue on our machine. In fact, I tried it with both XFS and EXT4; the symptom showed up regardless.

idata
Community Manager

AlexNZ,

Thanks for bringing this situation to our attention; we would like to verify it and provide a solution as quickly as possible. Please allow us some time to check on this, and we will keep you all posted.

NC
idata
Community Manager

Hello,

After reviewing the settings, we would like to verify the following:

For the read test, could you please try: fio --output=test_result.txt --name=myjob --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --norandommap --randrepeat=0 --runtime=600 --blocksize=4K --rw=randread --iodepth=32 --numjobs=4 --group_reporting

It is important to note that we normally run these tests with 4 jobs (numjobs=4) and iodepth=32 for blocksize=4K.

Please let us know, as we may need to keep researching this.

NC
ANaza
Novice

Hello,

With the proposed settings I received the following result:

myjob: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
...
fio-2.12
Starting 4 processes
myjob: (groupid=0, jobs=4): err= 0: pid=23560: Wed Jul 20 07:06:08 2016
  read : io=1092.2GB, bw=1863.1MB/s, iops=477156, runt=600001msec
    slat (usec): min=1, max=63, avg= 2.76, stdev= 1.57
    clat (usec): min=14, max=3423, avg=260.81, stdev=90.86
     lat (usec): min=18, max=3426, avg=263.68, stdev=90.84
    clat percentiles (usec):
     |  1.00th=[  114],  5.00th=[  139], 10.00th=[  157], 20.00th=[  185],
     | 30.00th=[  207], 40.00th=[  229], 50.00th=[  251], 60.00th=[  274],
     | 70.00th=[  298], 80.00th=[  326], 90.00th=[  374], 95.00th=[  422],
     | 99.00th=[  532], 99.50th=[  588], 99.90th=[  716], 99.95th=[  788],
     | 99.99th=[ 1048]
    bw (KB /s): min= 5400, max=494216, per=25.36%, avg=484036.11, stdev=14017.77
    lat (usec) : 20=0.01%, 50=0.01%, 100=0.23%, 250=49.61%, 500=48.54%
    lat (usec) : 750=1.55%, 1000=0.06%
    lat (msec) : 2=0.01%, 4=0.01%
  cpu          : usr=15.00%, sys=41.78%, ctx=77056567, majf=0, minf=264
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=286294132/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=1092.2GB, aggrb=1863.1MB/s, minb=1863.1MB/s, maxb=1863.1MB/s, mint=600001msec, maxt=600001msec

Disk stats (read/write):
  nvme0n1: ios=286276788/29109, merge=0/0, ticks=72929877/10859607, in_queue=84848144, util=99.33%

But in this case the test was against the raw device (/dev/nvme0n1), whereas in our case it was a file on XFS on the NVMe drive.

Also, during the latest tests we determined that flushing the page cache (echo 1 > /proc/sys/vm/drop_caches) solves the problem.
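
The workaround sequence, roughly (a sketch; "readtest.fio" is just a placeholder name for the job file posted above):

# 1. create or copy the large file as before
# 2. flush the (clean) page cache before starting the random-read test (run as root)
sync
echo 1 > /proc/sys/vm/drop_caches
# 3. run the fio job against the file
fio readtest.fio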

Why the page cache affects direct I/O is still the question.

Could it be something specific to the NVMe driver?

AlexNZ

idata
Community Manager

I read this thread with strong interest. I concur with AlexNZ: testing files residing on a file system is far more relevant to production situations. We do so to figure out the overhead of

  • local file systems (XFS, EXT4, etc.)
  • distributed file systems (Lustre, GPFS, etc.)

over raw devices (individual and aggregated).

The following suggestion from NC is only for testing raw devices.

fio --output=test_result.txt --name=myjob --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --norandommap --randrepeat=0 --runtime=600 --blocksize=4K --rw=randread --iodepth=32 --numjobs=4 --group_reporting

On our end, we have done many hundreds of raw-device tests. The results are always in line with what Intel has published. But this particular file-testing result, as I posted on July 15, is a "shocker"!

It would be great to know why fio reading a regular file from an NVMe SSD with direct=1 is still affected by data in the page cache.

Another point: we understand why numjobs=4 and iodepth=32 are usually used for Intel NVMe SSDs. But such settings are only optimal for raw devices, right? When it comes to reading and writing regular files, IMHO we should configure fio with parameter values that match the actual workloads as closely as possible. NC, your view?
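
For file-based testing, a job file along these lines might track a production workload more closely (a sketch only: the file path follows the original post, while the block size, queue depth, and job count are placeholders to be replaced with values measured from the real application):

[file-randread]
filename=/export/beegfs/data0/file_000000
rw=randread
# match the application's typical request size
blocksize=2m
direct=1
ioengine=libaio
# placeholder: set to the real per-job queue depth
iodepth=16
# placeholder: set to the real number of concurrent readers
numjobs=8
runtime=360
group_reporting=1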

idata
Community Manager

Hello all,

Having reviewed all of the information provided, we will escalate this situation and post updates here. Please expect a response soon.

NC
idata
Community Manager

Hello all,

 

 

We would like you to run the test again, but before that, could you please TRIM the drives first (for example as sketched below)? Once you have done that, please share the results with us.
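
On a mounted file system, the TRIM can be issued with fstrim (a sketch; the mount point here is the one from the original post, adjust as needed):

fstrim -v /export/beegfs/data0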

 

 

Also, please make sure you are using the correct driver, available from this link: https://downloadcenter.intel.com/download/23929/Intel-SSD-Data-Center-Family-for-NVMe-Drivers

Something important to mention is that the performance tools we use are synthetic benchmarking tools, as explained in the Intel® Solid-State Drive DC P3700 evaluation guide. They are intended to measure the behavior of the SSD without taking into consideration other components in the system that could add bottlenecks; synthetic benchmarks measure raw drive I/O transfer rates.

Here is the evaluation guide: http://manuals.ts.fujitsu.com/file/12176/fujitsu_intel-ssd-dc-pcie-eg-en.pdf

Please let us know.

NC
idata
Community Manager

Thanks for your follow-up. I did try fstrim on a DC P3700 NVMe SSD here.

First of all, let's get the driver and firmware questions out of the way. The server runs CentOS 7.2:

[root@fs11 ~]# uname -a
Linux fs11 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@fs11 ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

We also use the latest isdct:

[root@fs11 ~]# isdct version
- Version Information -
Name: Intel(R) Data Center Tool
Version: 3.0.0
Description: Interact and configure Intel SSDs.

And, according to the tool, the drive is healthy:

[root@fs11 ~]# isdct show -intelssd 2
- Intel SSD DC P3700 Series CVFT515400401P6JGN -
Bootloader : 8B1B0131
DevicePath : /dev/nvme2n1
DeviceStatus : Healthy
Firmware : 8DV10171
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 2
ModelNumber : INTEL SSDPE2MD016T4
ProductFamily : Intel SSD DC P3700 Series
SerialNumber : CVFT515400401P6JGN

While the drive still had an XFS file system with data on it, I ran fstrim:

[root@fs11 ~]# fstrim -v /export/beegfs/data2
fstrim: /export/beegfs/data2: FITRIM ioctl failed: Input/output error

So I unmounted the XFS file system, used isdct delete to remove all data, recreated the file system, mounted it again, and then ran fstrim.

Same outcome. Please see the session log below:

[root@fs11 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       192G  2.4G  190G   2% /
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G     0   63G   0% /dev/shm
tmpfs            63G   26M   63G   1% /run
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda1       506M  166M  340M  33% /boot
/dev/sdb        168G   73M  157G   1% /export/beegfs/meta
tmpfs            13G     0   13G   0% /run/user/99
/dev/nvme2n1    1.5T  241G  1.3T  17% /export/beegfs/data2
tmpfs            13G     0   13G   0% /run/user/0
[root@fs11 ~]# umount /export/beegfs/data2
[root@fs11 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       192G  2.4G  190G   2% /
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G     0   63G   0% /dev/shm
tmpfs            63G   26M   63G   1% /run
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda1       506M  166M  340M  33% /boot
/dev/sdb ...

ANaza
Novice

Hello,

I can confirm that after TRIM the result is still poor.

Actually, after a quick look at the Linux kernel code, including the XFS implementation, I found that the page cache is still involved even during direct reads.

But such poor performance still looks weird.

idata
Community Manager

Just a quick supplement regarding the I/O errors that I reported in my last reply. I even tried the following:

  1. umount the drive
  2. Do an nvmeformat: isdct start -intelssd 2 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0
  3. Recreate the XFS file system
  4. Mount it
  5. Run fstrim -v on the mount point

I still got:

[root@fs11 ~]# dmesg |tail -11
[987891.677911] nvme2n1: unknown partition table
[987898.749260] XFS (nvme2n1): Mounting V4 Filesystem
[987898.752844] XFS (nvme2n1): Ending clean mount
[987948.612051] blk_update_request: I/O error, dev nvme2n1, sector 3070890712
[987948.612088] blk_update_request: I/O error, dev nvme2n1, sector 3087667912
[987948.612151] blk_update_request: I/O error, dev nvme2n1, sector 3121222312
[987948.612193] blk_update_request: I/O error, dev nvme2n1, sector 3062502112
[987948.612211] blk_update_request: I/O error, dev nvme2n1, sector 3104445112
[987948.612228] blk_update_request: I/O error, dev nvme2n1, sector 3079279312
[987948.612296] blk_update_request: I/O error, dev nvme2n1, sector 3096056512
[987948.612314] blk_update_request: I/O error, dev nvme2n1, sector 3112833712

So, unlike the SCSI drives I used years ago, the format didn't remap the "bad" sectors. I would appreciate a hint on how to get this issue resolved too.

idata
Community Manager

I tried to narrow down the cause of the fstrim issue further. It seems to me that the hardware (i.e., the NVMe SSD itself) is responsible, rather than the software layer on top of it (XFS). So I decided to add a partition table first and create XFS on the partition. As is evident below, adding the partition didn't help.

Is the drive faulty? If so, why does isdct still deem its DeviceStatus Healthy?

[root@fs11 ~]# isdct delete -f -intelssd 2
Deleting...
- Intel SSD DC P3700 Series CVFT515400401P6JGN -
Status : Delete successful.
[root@fs11 ~]# parteed -a optimal /dev/nvme2n1 mklabel gpt
-bash: parteed: command not found
[root@fs11 ~]# parted -a optimal /dev/nvme2n1 mklabel gpt
Information: You may need to update /etc/fstab.
[root@fs11 ~]# parted /dev/nvme2n1 mkpart primary 1048576B 100%
Information: You may need to update /etc/fstab.
[root@fs11 ~]# parted /dev/nvme2n1
GNU Parted 3.1
Using /dev/nvme2n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Unknown (unknown)
Disk /dev/nvme2n1: 1600GB
Sector size (logical/physical): 4096B/4096B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name     Flags
 1      1049kB  1600GB  1600GB               primary
(parted) quit
[root@fs11 ~]# mkfs.xfs -K -f -d agcount=24 -l size=128m,version=2 -i size=512 -s size=4096 /dev/nvme2n1
meta-data=/dev/nvme2n1      isize=512    agcount=24, agsize=16279311 blks
         =                  sectsz=4096  attr=2, projid32bit=1
         =                  crc=0        finobt=0
data     =                  bsize=4096   blocks=390703446, imaxpct=5
         =                  sunit=0      swidth=0 blks
naming   =version 2         bsize=4096   ascii-ci=0 ftype=0
(output truncated)
idata
Community Manager

Hello,

Thanks, everyone, for trying the suggestion. We would like to gather all of this input and research it with our department in order to work on a resolution for all of you.

Please allow us some time to do the research; we will keep you posted.

NC
idata
Community Manager

Thanks for following up. I reviewed what I had done regarding fstrim and the tests I have run, and came up with two additional plausible causes:

  1. In the way I run mkfs.xfs, I always use the -K option (do not attempt to discard blocks at mkfs time); what if I don't use it?
  2. I would like to take advantage of the variable sector size support provided by the DC P3700, so we are currently evaluating the performance benefits of a larger SectorSize. Thus, the NVMe SSDs that I tested fstrim on have a 4096-byte sector size. What happens if I retain the default 512? (A sketch of how to check the supported LBA formats follows this list.)
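
To check which sector sizes (LBA formats) the drive reports and which one is in use, nvme-cli can be used (a sketch; it assumes the nvme-cli package is installed and that the drive under test is /dev/nvme2n1 as in the session logs below):

# list the namespace's LBA formats; lbads:9 means 512-byte sectors, lbads:12 means 4096-byte sectors
nvme id-ns /dev/nvme2n1 | grep lbaf
# the low-level format change itself is the isdct nvmeformat command quoted earlier in this thread,
# with LBAformat set to the index of the desired sector size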

My tests indicate that the Intel DC P3700 firmware, the Linux NVMe driver, or both may have a bug. The following is my evidence; please review.

We use a lot of Intel DC P3700 SSDs of various capacities - 800GB and 1.6TB are two common ones - and have run hundreds of tests on them.

We also understand that with Intel DC P3700 NVMe SSDs there is no need to run TRIM at all; the firmware's garbage collection takes care of such needs transparently, behind the scenes. Still, IMHO it's a good idea that, when the sector size is changed, well-known Linux utilities continue to work as anticipated. We ran into this issue by serendipity and got a "nice" surprise along the way.

Case 1. mkfs.xfs without -K

We will pick /dev/nvme2n1, umount it, isdct delete all data on it, run mkfs.xfs without the -K flag, and then run fstrim. The intended sequence is sketched below, followed by the actual session log.
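
A sketch of the intended sequence (the commands follow the earlier posts in this thread):

umount /export/beegfs/data2
isdct delete -f -intelssd 2
# same mkfs.xfs options as before, but without -K, so mkfs issues discards itself
mkfs.xfs -f -d agcount=24 -l size=128m,version=2 -i size=512 -s size=4096 /dev/nvme2n1
mount /dev/nvme2n1 /export/beegfs/data2
fstrim -v /export/beegfs/data2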

[root@fs11 ~]# man mkfs.xfs
[root@fs11 ~]# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda         8:0    0 223.6G  0 disk
├─sda1      8:1    0   512M  0 part /boot
├─sda2      8:2    0  31.5G  0 part [SWAP]
└─sda3      8:3    0 191.6G  0 part /
sdb         8:16   0 223.6G  0 disk /export/beegfs/meta
sdc         8:32   0  59.6G  0 disk
sdd         8:48   0  59.6G  0 disk
sr0        11:0    1  1024M  0 rom
nvme0n1   259:2    0   1.5T  0 disk /export/beegfs/data0
nvme1n1   259:6    0   1.5T  0 disk /export/beegfs/data1
nvme2n1   259:7    0   1.5T  0 disk /export/beegfs/data2
nvme3n1   259:5    0   1.5T  0 disk /export/beegfs/data3
nvme4n1   259:0    0   1.5T  0 disk /export/beegfs/data4
nvme5n1   259:3    0   1.5T  0 disk
nvme6n1   259:1    0   1.5T  0 disk
nvme7n1   259:4    0   1.5T  0 disk
[root@fs11 ~]# umount /export/beegfs/data2
[root@fs11 ~]# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda         8:0    0 223.6G  0 disk
├─sda1      8:1    0   512M  0 part /boot
├─sda2      8:2    0  31.5G  0 part [SWAP]
└─sda3      8:3    0 191.6G  0 part /
sdb         8:16   0 223.6G  0 disk /export/beegfs/meta
sdc         8:32   0  59.6G  0 disk
sdd         8:48   0  59.6G  0 disk
sr0        11:0    1  1024M  0 rom
nvme0n1   259:2    0   1.5T  0 disk /export/beegfs/data0
nvme1n1   259:6    0   1.5T  0 disk /export/beegfs/data1
nvme2n1   259:7    0   1.5T  0 disk
nvme3n1   259:5    0   1.5T  0 disk /export/beegfs/data3
nvme4n1   259:0    0   1.5T  0 disk /export/beegfs/data4
(log truncated)
idata
Community Manager

Hello everyone,

We would like to address the performance-drop question first so we don't mix the two situations.

Can you please confirm whether this is the process you followed:

- Create the large file
- Flush the page cache
- Run fio

Also, at which step are you flushing the page cache to avoid the performance drop?

Please let us know.

NC
ANaza
Novice

Hello,

First we skipped flushing the page cache and ran the fio test right after creating the large file. With that approach, the results of the direct-read tests were very poor.

But as I mentioned above, we found that flushing the page cache after file creation improves the situation, which is confusing because O_DIRECT mode is supposed to bypass the page cache entirely.

Later I reviewed the Linux kernel code and found that it performs some page cache operations even in direct mode, so now I suspect this issue is related to the Linux kernel rather than the drive.
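
A quick way to see how much page cache the freshly written file leaves behind, using only standard tools (a rough sketch; the Cached figure is system-wide, so it is only indicative on an otherwise idle machine):

grep ^Cached: /proc/meminfo           # cached memory right after writing the large file
sync
echo 1 > /proc/sys/vm/drop_caches     # drop the clean page cache (as root)
grep ^Cached: /proc/meminfo           # should drop sharply; the direct-read fio test is then fast again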

idata
Community Manager

AlexNZ,

Thanks for the information provided. We will continue with our testing here and will let you know soon.

NC
idata
Community Manager

Hello Everyone,

Our engineering team is investigating this report, and we will share any results once we get them.

Thanks.

NC
idata
Community Manager

Hi AlexNZ,

Chances are that your findings about the kernel are the reason for this drop. We understand that Linux users can submit kernel questions, findings, and bugs here: https://bugzilla.kernel.org/. Here are some instructions that we found: https://www.kernel.org/pub/linux/docs/lkml/reporting-bugs.html

It is very important to bear in mind that the benchmarking we provide using fio (or IOMeter for Windows), as per the evaluation guide shared in a previous post, is not done the same way you have reported doing it. The guide states that these are synthetic tools intended for raw-disk measurements, and you all seem to be getting the numbers we have published in the SSD specs when measuring the raw disk. The drop you see appears once the file system is created and, as you may know, different file systems can yield different SSD performance numbers. Some interesting articles on this (which you may already be aware of, but still worth sharing):

http://www.linux-magazine.com/Issues/2015/172/Tuning-Your-SSD

https://wiki.archlinux.org/index.php/Solid_State_Drives

http://www.phoronix.com/scan.php?page=article&item=linux-43-ssd&num=1

Please let us know if you have any questions.

NC
ANaza
Novice

Hello NC,

Thanks for your reply.

I'll consider asking the kernel community about it. But since I know how to avoid this effect, and I know the kernel actually manipulates the page cache in direct mode, I think it's no longer so important.

Alex

idata
Community Manager

Hi AlexNZ,

Please share any findings from the kernel community; we will be waiting for your response.

NC