Re: SLES 10 SP3/SP4 Modular Server Dual Storage and MPIO (Multipath) Setup

idata · ‎03-27-2012

Overview

I recently had a long discussion with Intel technical support about MPIO drivers for SUSE Linux Enterprise Server 10 SP3/SP4. Drivers for SP1 and SP2 are available for download (http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=17588&ProdId=3034&lang=eng&OSVersion=SUSE%20Linux%20Enterprise%20Server%2010*&DownloadType=Drivers), but there is nothing for SP3 and SP4. Intel eventually told me to upgrade to SLES 11, which was not possible for my client. SLES 10 also still has long-term support from SUSE, so Intel ought to be providing drivers for it.

I then contacted SUSE and they replied to say that the MPIO drivers are now part of SP3 and SP4, but there is no documentation (that I can find) to support this.

Using the SLES 10 SP1/SP2 installation guide as a base along with other sources from the web I have come up with a working solution.

Update

Be sure to read the additional comments on the path grouping policy in my post of Apr 19, 2012.

Installation

Start by following the detailed http://www.intel.com/support/motherboards/server/sb/CS-029441.htm PDF for SLES 10 SP1/SP2 taking my notes below into account.

The best way to do this is to start with a single controller, install SLES, configure MPIO and then add the second controller.

Check that fstab refers to the disks by-id, as per the PDF.
Check that you have the SLES MPIO packages installed. If they are not there, install them from YaST.

# rpm -qa | grep device

device-mapper-1.02.13-6.14

# rpm -qa | grep multi

multipath-tools-0.4.7-34.38

Do NOT install the Intel packages (dm-intel, mpath_prio_intel)
Set services to start

# chkconfig boot.multipath on

# chkconfig multipathd on

Edit kernel settings (note there is no dm-intel)

# vi /etc/sysconfig/kernel

INITRD_MODULES="mptsas processor thermal fan reiserfs edd dm-multipath"

Run mkinitrd

# mkinitrd

Create a multipath.conf file

# vi /etc/multipath.conf

devnode_blacklist {

devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"

devnode "^hd[a-z]"

devnode "^cciss!c[0-9]d[0-9]*"

}

devices {

device {

vendor "Intel"

product "Multi-Flex"

path_grouping_policy group_by_prio

getuid_callout "/sbin/scsi_id -g -u -s /block/%n"

prio "alua /dev/%n"

path_checker tur

path_selector "round-robin 0"

hardware_handler "1 alua"

failback immediate

rr_weight uniform

no_path_retry queue

rr_min_io 100

features "1 queue_if_no_path"

}

}

You will notice some key differences in the device setup compared to the sample multipath.conf.SLES.txt that comes with the Intel drivers. Because there is no mpath_prio_intel, we use alua instead. The prio line is key: it determines the priority of each path, which is what allows fail-over when a controller fails.
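Before rebooting, the package and initrd steps above can be double-checked with a small script. This is only a sketch: the function names check_packages and check_initrd_modules are my own (hypothetical), and the checks simply mirror the package names and the INITRD_MODULES line shown in this post.

```sh
#!/bin/sh
# Pre-reboot sanity checks for the steps above (a sketch, not an official tool).

# Reads a package list (the output of `rpm -qa`) on stdin.
# Succeeds if device-mapper and multipath-tools are present and
# neither Intel package (dm-intel, mpath_prio_intel) is installed.
check_packages() {
    list=$(cat)
    echo "$list" | grep -q '^device-mapper-[0-9]' || return 1
    echo "$list" | grep -q '^multipath-tools-[0-9]' || return 1
    if echo "$list" | grep -Eq '^(dm-intel|mpath_prio_intel)'; then
        return 1
    fi
    return 0
}

# Succeeds if the INITRD_MODULES line in the given file
# (default /etc/sysconfig/kernel) lists dm-multipath but not dm-intel.
check_initrd_modules() {
    file="${1:-/etc/sysconfig/kernel}"
    line=$(grep '^INITRD_MODULES=' "$file") || return 1
    # Strip the variable name and quotes, then pad with spaces so that
    # whole module names can be matched.
    case " $(echo "$line" | sed 's/^INITRD_MODULES=//; s/"//g') " in
        *" dm-intel "*) return 1 ;;
        *" dm-multipath "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Typical use would be: rpm -qa | check_packages && check_initrd_modules && echo "ready to reboot"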

Reboot

# shutdown -r now

After startup you can now check multipath output on the single controller

# multipath -ll

22222000155e8d800 dm-0 Intel,Multi-Flex

[size=100G][features=1 queue_if_no_path][hwhandler=1 alua]

\_ round-robin 0 [prio=1][active]

\_ 0:0:2:0 sda 8:0 [active][ready]

Shutdown

# shutdown -h now

Insert the second controller and monitor the Modular Server web interface to make sure it is installed correctly, then start up SLES again.
You should now have two paths:

# multipath -ll

22222000155e8d800 dm-0 Intel,Multi-Flex

[size=100G][features=1 queue_if_no_path][hwhandler=1 alua]

\_ round-robin 0 [prio=2][active]

\_ 0:0:2:0 sda 8:0 [active][ready]

\_ 0:0:3:0 sdb 8:16 [failed][ready]

If you run the command repeatedly you will see the paths alternate between failed and active. This is normal and will also show in /var/log/messages

Mar 27 09:42:19 sles10 kernel: sd 0:0:2:0: alua: port group 00 state S supports touSnA

Mar 27 09:42:19 sles10 multipathd: sda: tur checker reports path is up

Mar 27 09:42:19 sles10 multipathd: 8:0: reinstated

Mar 27 09:42:19 sles10 ...
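Instead of re-running the command by hand, the path states can be counted from the output. The sketch below assumes the SLES 10 bracketed format shown in this post (SLES 11 prints path states without brackets, so it would not match); count_path_states is a hypothetical helper name.

```sh
#!/bin/sh
# Print "<active> <failed>" path counts from SLES 10 style
# `multipath -ll` output read on stdin.
count_path_states() {
    awk '
        # Path lines carry a SCSI address, e.g.
        #   \_ 0:0:2:0 sda 8:0 [active][ready]
        # Path group lines (round-robin ...) have no such address
        # and are skipped.
        /[0-9]+:[0-9]+:[0-9]+:[0-9]+/ {
            if ($0 ~ /\[failed\]/) failed++
            else if ($0 ~ /\[active\]/) active++
        }
        END { printf "%d %d\n", active, failed }
    '
}
```

For example, multipath -ll | count_path_states prints "1 1" for the two-path output above.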

Daniel_O_Intel · ‎03-28-2012

Very nice solution - thanks for posting this.

idata · ‎03-28-2012

Pleasure.

I'm also testing a Promise VTrak with redundant controllers attached to this setup. I'm getting mixed results when simulating fail-over under heavy system load. I'll try to report back if there is anything useful. At the moment it seems that if there is too much load while writing to the Promise and a controller fails, the OS becomes unresponsive. I'm trying different path grouping policies to see if that makes any difference.

idata · ‎03-29-2012

Some feedback on my fail-over testing with a Promise VTrak E310s (dual controllers). I had problems with the system hanging when I simulated a controller failure on the modular server while doing heavy writes to the Promise. I went back and did the same test on the local disk in the modular server and it was fine. So the problem was only with the Promise.

I had a feeling this had something to do with the Active/Active versus Active/Passive setup, so I did some more reading on multipath and started looking at all the path grouping policy settings. Your options are as follows:

multibus: One path group is formed with all paths to a LUN. Suitable for devices that are in Active/Active mode.

failover: Each path group will have only one path.

group_by_serial: One path group per storage controller(serial). All paths that connect to the LUN through a controller are assigned to a path group. Suitable for devices that are in Active/Passive mode.

group_by_prio: Paths with same priority will be assigned to a path group.

group_by_node_name: Paths with same target node name will be assigned to a path group.
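When trying the policies listed above one after another, a small helper can swap the value in the config file. This is a sketch only: set_path_grouping_policy is a hypothetical name, it changes every device section in the file, and it keeps a .bak copy of the previous file. The new policy only takes effect once the maps are reloaded (rebooting, as done elsewhere in this thread, is the simplest way).

```sh
#!/bin/sh
# Usage: set_path_grouping_policy <policy> [config file]
# Rewrites every path_grouping_policy line in the config file
# (default /etc/multipath.conf) and keeps the old file as <file>.bak.
set_path_grouping_policy() {
    policy="$1"
    conf="${2:-/etc/multipath.conf}"

    # Only accept the policy names listed above.
    case "$policy" in
        multibus|failover|group_by_serial|group_by_prio|group_by_node_name) ;;
        *) echo "unknown policy: $policy" >&2; return 1 ;;
    esac

    cp "$conf" "$conf.bak" || return 1
    # Keep the indentation and keyword, replace only the value.
    sed "s/^\([[:space:]]*path_grouping_policy[[:space:]]\{1,\}\).*/\1$policy/" \
        "$conf.bak" > "$conf"
}
```

For example, set_path_grouping_policy group_by_serial switches all device sections in /etc/multipath.conf to group_by_serial.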

The default Intel suggests is group_by_prio. I tried multibus, which also failed. I then tried group_by_serial and voila, problem solved! So my updated multipath.conf file (including the Promise VTrak) is as follows:

devices {

device {

vendor "Promise"

product "VTrak"

path_grouping_policy group_by_serial

getuid_callout "/sbin/scsi_id -g -u -s /block/%n"

path_checker tur

path_selector "round-robin 0"

hardware_handler "0"

failback immediate

rr_weight uniform

no_path_retry 20

rr_min_io 100

features "1 queue_if_no_path"

}

device {

vendor "Intel"

product "Multi-Flex"

path_grouping_policy group_by_serial

getuid_callout "/sbin/scsi_id -g -u -s /block/%n"

prio "alua /dev/%n"

path_checker tur

path_selector "round-robin 0"

hardware_handler "1 alua"

failback immediate

rr_weight uniform

no_path_retry queue

rr_min_io 100

features "1 queue_if_no_path"

}

}

There was also a note to say that using multibus on an Active/Passive setup would reduce I/O performance. My understanding is that both the modular server and the VTrak support Active/Active, but I tested it anyway and there was no real performance difference.

Here are the bonnie tests I ran on the VTrak for each grouping policy.

Disks 12 x Seagate ST2000NM0011 in one pool with 3 x 6TB RAID 6 volumes.

group_by_prio

# bonnie -d /home1/ -s 40000 -m sles10-prio

Bonnie 1.4: File '/home1//Bonnie.30739', size: 41943040000, volumes: 1

Writing with putc()... done: 67740 kB/s 87.2 %CPU

Rewriting... done: 1949875 kB/s 84.9 %CPU

Writing intelligently... done: 125190 kB/s 10.4 %CPU

Reading with getc()... done: 98664 kB/s 95.6 %CPU

Reading intelligently... done: 4000456 kB/s 100.0 %CPU

Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...

---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-

-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-

Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU

sles10 1*40000 67740 87.2 125190 10.4 1949875 84.9 98664 95.6 4000456 100 262467.2 184

multibus

# bonnie -d /home1/ -s 40000 -m sles10-multi

Bonnie 1.4: File '/home1//Bonnie.6718', size: 41943040000, volumes: 1

Writing with putc()... done: 68732 kB/s 87.6 %CPU

Rewriting... done: 2262718 kB/s 98.1 %CPU

Writing intelligently... done: 130749 kB/s 8.5 %CPU

Reading with getc()... done: 100383 kB/s 96.7 %CPU

Reading intelligently... done: 5008622 kB/s 100.0 %CPU

Seeker 2...Seeker 1...Seeker 3...start 'em...done...done...done...

---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-

-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-

Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU

sles10 1*40000 68732 87.6 130749 8.5 2262718 98.1 100383 96.7 5008622 100 320667.0 224

group_by_serial

# bonnie -d /home1/ -s 40000 -m sles10-serial

Bonnie 1.4: File '/home1//Bonnie.8445', size: 41943040000, volumes: 1

Writing with putc()... done: 61271 kB/s 89.4 %CPU

Rewriting... done: 1910663 kB/s 94.4 %CPU

Writing intelligently... done: 123190 kB/s 9.9 %CPU

Reading with getc()... done: 101686 kB/s 97.7 %CPU

Reading intelligently... done: 4074685 kB/s 100.0 %CPU

Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...

---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-

-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-

Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU

sles10 1*40000 61271 89.4 123190 9.9 1910663 94.4 101686 97.7 4074685 100 278299.6 223
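To put a number on how close the three runs are, the block-write figure can be pulled out of each run and compared. A rough sketch follows; bonnie_block_write and pct_diff are hypothetical helper names, and the parsing assumes the Bonnie 1.4 progress lines shown above.

```sh
#!/bin/sh
# Print the block-write throughput (kB/s) from Bonnie 1.4 progress
# output on stdin, i.e. the number before "kB/s" on the
# "Writing intelligently" line.
bonnie_block_write() {
    awk '/^Writing intelligently/ {
        for (i = 2; i <= NF; i++)
            if ($i == "kB/s") print $(i - 1)
    }'
}

# Print how much larger $2 is than $1, as a percentage with one decimal.
pct_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) * 100 / a }'
}
```

With the block-write figures above, pct_diff 123190 130749 (group_by_serial against multibus) prints 6.1, so the slowest and fastest runs are about 6% apart.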

idata · ‎04-19-2012

Update

After spending a lot of time trying to make this work and then doing a whole lot of performance testing on the VTrak, I made the system live at a client site. They quickly reported performance problems and I discovered a huge degradation in the I/O performance on the "local" disk. I then removed the secondary controller and the I/O performance returned to normal.

I then reproduced this in our lab and discovered that group_by_prio and group_by_serial both cause at least a 50% loss in disk I/O performance. This is quite a shock, as Intel recommend group_by_prio! I then did more reading and, after testing many settings, settled on failover as my preferred path grouping policy. This does not suffer from the same performance loss, and the system remains stable under heavy load and simulated controller failure.

Configuration

multipath.conf

devnode_blacklist {

devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"

devnode "^hd[a-z]"

devnode "^cciss!c[0-9]d[0-9]*"

}

devices {

device {

vendor "Intel"

product "Multi-Flex"

path_grouping_policy failover

getuid_callout "/sbin/scsi_id -g -u -s /block/%n"

prio "alua /dev/%d"

path_checker tur

path_selector "round-robin 0"

hardware_handler "1 alua"

failback immediate

# rr_weight uniform

rr_weight priorities

no_path_retry queue

rr_min_io 100

features "1 queue_if_no_path"

}

}

So now multipath -ll output looks as follows

# multipath -ll

22206000155abb71e dm-0 Intel,Multi-Flex

[size=136G][features=1 queue_if_no_path][hwhandler=1 alua]

\_ round-robin 0 [prio=1][active]

\_ 0:0:2:0 sdb 8:16 [active][ready]

\_ round-robin 0 [prio=1][enabled]

\_ 0:0:3:0 sda 8:0 [active][ready]

As you can see, the controllers are now in an active/enabled state. You will also see that /var/log/messages is now free of the annoying messages I originally thought to be normal. The paths swapping between active and failed were the cause of the I/O performance problems!
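A quick way to confirm that the flapping has stopped is to count the multipathd "reinstated" messages (like the one in my first post) in the log. A minimal sketch, with count_reinstated as a hypothetical helper name:

```sh
#!/bin/sh
# Print the number of multipathd "reinstated" messages in a log file
# (default /var/log/messages). A count that keeps growing suggests the
# paths are still alternating between failed and active.
# Note: grep -c prints 0 but returns a non-zero status when nothing matches.
count_reinstated() {
    grep -c 'multipathd: .*: reinstated' "${1:-/var/log/messages}"
}
```

Running count_reinstated a few minutes apart and comparing the two numbers shows whether new messages are still being logged.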

Test fail-over

So now when we remove a controller we have the following in /var/log/messages

Apr 17 15:43:30 sles10 kernel: end_device-0:1:1: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 1, phy 11,sas_addr 0x500015500002050a

Apr 17 15:43:30 sles10 kernel: phy-0:1:40: mptsas: ioc0: delete phy 11, phy-obj (0xffff810266ddd800)

Apr 17 15:43:30 sles10 kernel: port-0:1:1: mptsas: ioc0: delete port 1, sas_addr (0x500015500002050a)

Apr 17 15:43:30 sles10 kernel: sd 0:0:1:0: alua: Detached

Apr 17 15:43:30 sles10 kernel: Synchronizing SCSI cache for disk sdb:

Apr 17 15:43:30 sles10 kernel: phy-0:3: mptsas: ioc0: delete phy 3, phy-obj (0xffff8102672c7c00)

Apr 17 15:43:30 sles10 kernel: port-0:1: mptsas: ioc0: delete port 1, sas_addr (0x5001517b9f5e03ff)

Apr 17 15:43:30 sles10 kernel: mptsas: ioc0: delete expander: num_phys 25, sas_addr (0x5001517b9f5e03ff)

Apr 17 15:43:30 sles10 multipathd: sdb: remove path (uevent)

Apr 17 15:43:30 sles10 multipathd: 22206000155abb71e: load table [0 285149758 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 1 1 8:0 100]

Apr 17 15:43:30 sles10 multipathd: sdb: path removed from map 22206000155abb71e

Apr 17 15:43:30 sles10 multipathd: dm-0: add map (uevent)

Apr 17 15:43:30 sles10 multipathd: dm-0: devmap already registered

Apr 17 15:43:30 sles10 multipathd: dm-1: add map (uevent)

Apr 17 15:43:30 sles10 multipathd: dm-3: add map (uevent)

Apr 17 15:43:30 sles10 multipathd: dm-2: add map (uevent)

Apr 17 15:43:30 sles10 multipathd: dm-5: add map (uevent)

Apr 17 15:43:30 sles10 multipathd: dm-6: add map (uevent)

Apr 17 15:43:30 sles10 multipathd: dm-7: add map (uevent)

Apr 17 15:43:31 sles10 kernel: sd 0:0:0:0: alua: port group 00 state S supports touSnA

Apr 17 15:43:31 sles10 kernel: sd 0:0:0:0: alua: port group 00 switched to state A

Apr 17 15:43:33 sles10 multipathd: dm-4: add map (uevent)

Let's check the multipath status:

# multipath -ll

22206000155abb71e dm-0 Intel,Multi-Flex

[size=136G][features=1 queue_if_no_path][hwhandler=1 alua]

\_ round-robin 0 [prio=1][active]

\_ 0:0:2:0 sdb 8:16 [active][ready]

We then push the controller back in and have the following in the logs:

Apr 17 15:47:12 sles10 kernel: mptsas: ioc0: add expander: num_phys 25, sas_addr (0x5001517b9f5e03ff)

Apr 17 15:47:12 sles10 kernel: mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 11, sas_addr 0x500015500002050a

Apr 17 15:47:12 sles10 kernel: Vendor: Intel Model: Multi-Flex Rev: 0308

Apr 17 15:47:12 sles10 kernel: Type: Direct-Access ANSI SCSI revision: 05

Apr 17 15:47:12 sles10 kernel: 0:0:2:0: mptscsih: ioc0: qdepth=64, tagged=1, simple=1, ordered=0, scsi_level=6, cmd_que=1

Apr 17 15:47:12 sles10 kernel: 0:0:2:0: alua: supports explicit TPGS

Apr 17 15:47:12 sles10 kernel: 0:0:2:0: alua: port group 01 rel port 06

Apr 17 15:47:12 sles10 kernel: 0:0:2:0: alua: port group 01 ...

idata · ‎04-23-2012

A quick update on some testing I have done on SLES 11 SP2. Here the recommended config from Intel does work. The interesting part though is why.

Config for SLES11

# cat /etc/multipath.conf

blacklist {

devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"

devnode "^(hd|xvd)[a-z][[0-9]*]"

devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"

}

devices {

device {

vendor "Intel"

product "Multi-Flex"

path_grouping_policy "group_by_prio"

getuid_callout "/lib/udev/scsi_id -g -u /dev/%n"

prio "alua"

path_checker tur

path_selector "round-robin 0"

hardware_handler "1 alua"

failback immediate

rr_weight uniform

rr_min_io 100

no_path_retry queue

features "1 queue_if_no_path"

}

}

SLES11 with path grouping policy group_by_prio

# multipath -ll

2224b000155126b27 dm-0 Intel,Multi-Flex

size=60G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw

|-+- policy='round-robin 0' prio=130 status=active

| `- 0:0:3:0 sdb 8:16 active ready running

`-+- policy='round-robin 0' prio=1 status=enabled

`- 0:0:0:0 sda 8:0 active ready running

SLES10 with path grouping policy group_by_prio

# multipath -ll

22222000155e8d800 dm-0 Intel,Multi-Flex

[size=100G][features=1 queue_if_no_path][hwhandler=1 alua]

\_ round-robin 0 [prio=2][active]

\_ 0:0:2:0 sda 8:0 [active][ready]

\_ 0:0:3:0 sdb 8:16 [failed][ready]

As you can see, SLES 11 with group_by_prio lists the controllers as active/enabled, just as my failover policy does on SLES 10. The noticeable difference is in the prio values in the multipath output: SLES 11 reports a different prio for each path group (130 and 1). This suggests that the algorithm that is supposed to calculate the prio values doesn't work on SLES 10, which is why group_by_prio does not work there. Both controllers come back with the same prio value, so they end up in the same path group and keep alternating between failed and active.
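The prio comparison can also be done mechanically by counting how many different prio values appear in the multipath -ll output. This is a sketch under the assumption that prio is printed as prio=N, as in both the SLES 10 and SLES 11 outputs in this thread; distinct_prios is a hypothetical helper name.

```sh
#!/bin/sh
# Print the number of distinct prio= values found in `multipath -ll`
# output read on stdin. A result of 1 with two controllers means
# group_by_prio has nothing to tell the paths apart.
distinct_prios() {
    sed -n 's/.*prio=\([0-9][0-9]*\).*/\1/p' | sort -u | awk 'END { print NR }'
}
```

On the SLES 11 output above, multipath -ll | distinct_prios prints 2 (prio 130 and prio 1); on the SLES 10 outputs in this thread it prints 1.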

idata · ‎05-21-2012

Thank you a lot for your work and contribution to the Community, emilec.

We got SLES 11 SP2 running with multipath I/O and can do an "affinity change" without problems.

But one problem still exists: When shutting down the server, the system is not able to do a clean unmount of the partitions.

"Not shutting down MD Raid - reboot/halt scripts do this." missing

Removing multipath targets: May 21 08:53:52 | 22289xxxxx_part2: map in use

May 21 08:53:52 | failed to remove multipath map 22289xxxxx

When the server boots up again, it does a fsck (File-System Check). One time so far, it has found orphaned inodes.

Does anybody know how to solve this problem? Do you have it too?

idata · ‎05-22-2012

Hi bic_admin

I must admit I didn't pay much attention to the shutdown procedure on the SLES 11 SP2 platform I was testing on. I know my SLES 10 platforms all shut down and boot cleanly. Unfortunately my lab equipment has been reassigned to another task, but when it's free I'll see if I can reproduce your problem.