Overview
I recently had a long discussion with Intel technical support about MPIO drivers for SUSE Linux Enterprise Server 10 SP3/SP4. On the download page (http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=17588&ProdId=3034&lang=eng&OSVersion=SUSE%20Linux%20Enterprise%20Server%2010*&DownloadType=Drivers) drivers for SP1 and SP2 are available, but nothing for SP3 and SP4. Intel eventually told me to upgrade to SLES 11, which was not possible for my client, not to mention that SLES 10 has long-term support from SUSE, for which Intel needs to provide drivers.
I then contacted SUSE and they replied to say that the MPIO drivers are now part of SP3 and SP4, but there is no documentation (that I can find) to support this.
Using the SLES 10 SP1/SP2 installation guide as a base along with other sources from the web I have come up with a working solution.
Update
Be sure to read the additional comments on the path grouping policy in my post of Apr 19, 2012.
Installation
Start by following the detailed PDF for SLES 10 SP1/SP2 (http://www.intel.com/support/motherboards/server/sb/CS-029441.htm), taking my notes below into account.
The best way to do this is to start with a single controller, install SLES, configure MPIO and then add the second controller.
- Check that fstab mounts disks by-id as per the PDF
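For reference, a by-id entry in /etc/fstab looks something like this (the ID is the WWID of your multipath device as reported by multipath -ll; mine below is purely illustrative):
/dev/disk/by-id/scsi-22222000155e8d800-part1 / reiserfs acl,user_xattr 1 1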
- Check that you have the SLES MPIO packages installed. If they are not there, install them from YaST
# rpm -qa | grep multi
multipath-tools-0.4.7-34.38
# rpm -qa | grep device-mapper
device-mapper-1.02.13-6.14
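If they are missing you can also install them from the command line instead of the YaST UI (just a convenience; same packages as above):
# yast -i device-mapper
# yast -i multipath-tools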
- Do NOT install the Intel packages (dm-intel, mpath_prio_intel)
- Set services to start
# chkconfig boot.multipath on
# chkconfig multipathd on
- Edit kernel settings in /etc/sysconfig/kernel (note there is no dm-intel)
INITRD_MODULES="mptsas processor thermal fan reiserfs edd dm-multipath"
- Run mkinitrd
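# mkinitrd
Optionally (my own addition, not from the PDF) confirm that dm-multipath made it into the new initrd; on SLES 10 the initrd is a gzipped cpio archive:
# zcat /boot/initrd | cpio -it | grep multipath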
- Create a multipath.conf file
# vi /etc/multipath.conf
devnode_blacklist {
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"devnode "^hd[a-z]"
devnode "^cciss!c[0-9]d[0-9]*"
}
devices {
device {
vendor "Intel"product "Multi-Flex"
path_grouping_policy group_by_prio
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
prio "alua /dev/%n"
path_checker tur
path_selector "round-robin 0"
hardware_handler "1 alua"
failback immediate
rr_weight uniform
no_path_retry queue
rr_min_io 100
features "1 queue_if_no_path"
}
}
You will notice some key differences in the device setup compared to the sample multipath.conf.SLES.txt that comes with the Intel drivers. Because there is no mpath_prio_intel we use alua instead. The prio line is key: it checks the priority of the devices in the event that a controller fails and allows you to fail over.
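Before rebooting you can also sanity-check the new configuration with a dry run (my suggestion, not part of the Intel PDF; -d stops multipath from actually creating the device maps):
# multipath -v2 -d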
- Reboot
- After startup you can now check the multipath -ll output on the single controller
22222000155e8d800 dm-0 Intel,Multi-Flex
[size=100G][features=1 queue_if_no_path][hwhandler=1 alua]
\_ round-robin 0 [prio=1][active]
\_ 0:0:2:0 sda 8:0 [active][ready]
- Shutdown
- Insert the second controller and monitor the Modular Server web interface to make sure it's installed correctly, then start SLES again
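(As an aside: if you don't want to shut down, a SCSI rescan should also make the new paths appear; host0 below is an example, check /sys/class/scsi_host for yours.)
# echo "- - -" > /sys/class/scsi_host/host0/scan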
- You should now have two paths
22222000155e8d800 dm-0 Intel,Multi-Flex
[size=100G][features=1 queue_if_no_path][hwhandler=1 alua]
\_ round-robin 0 [prio=2][active]
\_ 0:0:2:0 sda 8:0 [active][ready]
\_ 0:0:3:0 sdb 8:16 [failed][ready]
If you run the command repeatedly, you will see the paths alternate between failed and active. This is normal and will also show in /var/log/messages:
Mar 27 09:42:19 sles10 kernel: sd 0:0:2:0: alua: port group 00 state S supports touSnA
Mar 27 09:42:19 sles10 multipathd: sda: tur checker reports path is up
Mar 27 09:42:19 sles10 multipathd: 8:0: reinstated
Mar 27 09:42:19 sles10 ...
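A convenient way to keep an eye on this (my own habit, not from the PDF):
# watch -n 5 'multipath -ll'
# tail -f /var/log/messages | grep multipathd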
Very nice solution - thanks for posting this.
Pleasure.
I'm also testing a Promise VTrak with redundant controllers attached to this setup. I'm getting mixed results when simulating fail-over under heavy system load. I'll try to report back if there is anything useful. At the moment it seems that if there is too much load while writing to the Promise and a controller fails, the OS becomes unresponsive. I'm trying different path grouping policies to see if that makes any difference.
Some feedback on my fail-over testing with a Promise VTrak E310s (dual controllers). I had problems with the system hanging when I simulated a controller failure on the modular server while doing heavy writes to the Promise. I went back and did the same test on the local disk in the modular server and it was fine. So the problem was only with the Promise.
I had a feeling this had something to do with the Active/Active vs. Active/Passive setup, so I did some more reading on multipath and started looking at all the path grouping policy settings. Your options are as follows:
- multibus: One path group is formed with all paths to a LUN. Suitable for devices that are in Active/Active mode.
- failover: Each path group has only one path.
- group_by_serial: One path group per storage controller (serial). All paths that connect to the LUN through a controller are assigned to a path group. Suitable for devices that are in Active/Passive mode.
- group_by_prio: Paths with the same priority are assigned to a path group.
- group_by_node_name: Paths with the same target node name are assigned to a path group.
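As an aside, if you want to experiment with a policy without editing multipath.conf each time, multipath can be forced to a grouping policy on the command line (this is my understanding of the -p flag, so treat it as a hint; flushing only works on unused maps):
# multipath -F
# multipath -v2 -p group_by_serial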
The default Intel suggests is group_by_prio. I tried multibus, which also failed. I then tried group_by_serial and voila, problem solved! So my updated multipath.conf file (including the Promise VTrak) is as follows:
devices {
device {
vendor "Promise"
product "VTrak"
path_grouping_policy group_by_serial
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
path_checker tur
path_selector "round-robin 0"
hardware_handler "0"
failback immediate
rr_weight uniform
no_path_retry 20
rr_min_io 100
features "1 queue_if_no_path"
}
device {
vendor "Intel"
product "Multi-Flex"
path_grouping_policy group_by_serial
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
prio "alua /dev/%n"
path_checker tur
path_selector "round-robin 0"
hardware_handler "1 alua"
failback immediate
rr_weight uniform
no_path_retry queue
rr_min_io 100
features "1 queue_if_no_path"
}
}
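To apply the changed configuration without a reboot I restart the daemon and rebuild the maps (note that flushing fails on maps that are mounted or otherwise in use):
# /etc/init.d/multipathd restart
# multipath -F
# multipath -v2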
There was also a note to say that using multibus on an Active/Passive setup would reduce I/O performance. My understanding is that both the modular server and the VTrak support Active/Active, but I tested it anyway and there was no real performance difference.
Here are some bonnie tests I did on the VTrak for each grouping policy.
Disks 12 x Seagate ST2000NM0011 in one pool with 3 x 6TB RAID 6 volumes.
group_by_prio
# bonnie -d /home1/ -s 40000 -m sles10-prio
Bonnie 1.4: File '/home1//Bonnie.30739', size: 41943040000, volumes: 1
Writing with putc()... done: 67740 kB/s 87.2 %CPU
Rewriting... done: 1949875 kB/s 84.9 %CPU
Writing intelligently... done: 125190 kB/s 10.4 %CPU
Reading with getc()... done: 98664 kB/s 95.6 %CPU
Reading intelligently... done: 4000456 kB/s 100.0 %CPU
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
sles10 1*40000 67740 87.2 125190 10.4 1949875 84.9 98664 95.6 4000456 100 262467.2 184
multibus
# bonnie -d /home1/ -s 40000 -m sles10-multi
Bonnie 1.4: File '/home1//Bonnie.6718', size: 41943040000, volumes: 1
Writing with putc()... done: 68732 kB/s 87.6 %CPU
Rewriting... done: 2262718 kB/s 98.1 %CPU
Writing intelligently... done: 130749 kB/s 8.5 %CPU
Reading with getc()... done: 100383 kB/s 96.7 %CPU
Reading intelligently... done: 5008622 kB/s 100.0 %CPU
Seeker 2...Seeker 1...Seeker 3...start 'em...done...done...done...
---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
sles10 1*40000 68732 87.6 130749 8.5 2262718 98.1 100383 96.7 5008622 100 320667.0 224
group_by_serial
# bonnie -d /home1/ -s 40000 -m sles10-serial
Bonnie 1.4: File '/home1//Bonnie.8445', size: 41943040000, volumes: 1
Writing with putc()... done: 61271 kB/s 89.4 %CPU
Rewriting... done: 1910663 kB/s 94.4 %CPU
Writing intelligently... done: 123190 kB/s 9.9 %CPU
Reading with getc()... done: 101686 kB/s 97.7 %CPU
Reading intelligently... done: 4074685 kB/s 100.0 %CPU
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
sles10 1*40000 61271 89.4 123190 9.9 1910663 94.4 101686 97.7 4074685 100 278299.6 223
Update
After spending a lot of time trying to make this work and then doing a whole lot of performance testing on the VTrak, I made the system live at a client site. They quickly reported performance problems and I discovered a huge degradation in the I/O performance of the "local" disk. I then removed the secondary controller and the I/O performance returned to normal.
I then reproduced this in our lab and discovered that using group_by_prio and group_by_serial both cause at least a 50% performance loss in disk I/O. This is quite a shock, as Intel recommends group_by_prio! I then went and did more reading and, after testing many settings, settled on failover as my preferred path grouping policy. This does not suffer from the same performance loss, and the system remains stable under heavy load and simulated controller failure.
Configuration
multipath.conf
devnode_blacklist {
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^hd[a-z]"
devnode "^cciss!c[0-9]d[0-9]*"
}
devices {
device {
vendor "Intel"
product "Multi-Flex"
path_grouping_policy failover
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
prio "alua /dev/%d"
path_checker tur
path_selector "round-robin 0"
hardware_handler "1 alua"
failback immediate
# rr_weight uniform
rr_weight priorities
no_path_retry queue
rr_min_io 100
features "1 queue_if_no_path"
}
}
So now the multipath -ll output looks as follows:
# multipath -ll
22206000155abb71e dm-0 Intel,Multi-Flex
[size=136G][features=1 queue_if_no_path][hwhandler=1 alua]
\_ round-robin 0 [prio=1][active]
\_ 0:0:2:0 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 0:0:3:0 sda 8:0 [active][ready]
As you can see, the controllers are now in an active/enabled state. You will also see that /var/log/messages is now free of the annoying messages I originally thought to be normal. The paths swapping between active and failed were the cause of the I/O performance problems!
Test fail-over
So now when we remove a controller we have the following in /var/log/messages
Apr 17 15:43:30 sles10 kernel: end_device-0:1:1: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 1, phy 11, sas_addr 0x500015500002050a
Apr 17 15:43:30 sles10 kernel: phy-0:1:40: mptsas: ioc0: delete phy 11, phy-obj (0xffff810266ddd800)
Apr 17 15:43:30 sles10 kernel: port-0:1:1: mptsas: ioc0: delete port 1, sas_addr (0x500015500002050a)
Apr 17 15:43:30 sles10 kernel: sd 0:0:1:0: alua: Detached
Apr 17 15:43:30 sles10 kernel: Synchronizing SCSI cache for disk sdb:
Apr 17 15:43:30 sles10 kernel: phy-0:3: mptsas: ioc0: delete phy 3, phy-obj (0xffff8102672c7c00)
Apr 17 15:43:30 sles10 kernel: port-0:1: mptsas: ioc0: delete port 1, sas_addr (0x5001517b9f5e03ff)
Apr 17 15:43:30 sles10 kernel: mptsas: ioc0: delete expander: num_phys 25, sas_addr (0x5001517b9f5e03ff)
Apr 17 15:43:30 sles10 multipathd: sdb: remove path (uevent)
Apr 17 15:43:30 sles10 multipathd: 22206000155abb71e: load table [0 285149758 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 1 1 8:0 100]
Apr 17 15:43:30 sles10 multipathd: sdb: path removed from map 22206000155abb71e
Apr 17 15:43:30 sles10 multipathd: dm-0: add map (uevent)
Apr 17 15:43:30 sles10 multipathd: dm-0: devmap already registered
Apr 17 15:43:30 sles10 multipathd: dm-1: add map (uevent)
Apr 17 15:43:30 sles10 multipathd: dm-3: add map (uevent)
Apr 17 15:43:30 sles10 multipathd: dm-2: add map (uevent)
Apr 17 15:43:30 sles10 multipathd: dm-5: add map (uevent)
Apr 17 15:43:30 sles10 multipathd: dm-6: add map (uevent)
Apr 17 15:43:30 sles10 multipathd: dm-7: add map (uevent)
Apr 17 15:43:31 sles10 kernel: sd 0:0:0:0: alua: port group 00 state S supports touSnA
Apr 17 15:43:31 sles10 kernel: sd 0:0:0:0: alua: port group 00 switched to state A
Apr 17 15:43:33 sles10 multipathd: dm-4: add map (uevent)
Let's check the multipath status:
# multipath -ll
22206000155abb71e dm-0 Intel,Multi-Flex
[size=136G][features=1 queue_if_no_path][hwhandler=1 alua]
\_ round-robin 0 [prio=1][active]
\_ 0:0:2:0 sdb 8:16 [active][ready]
We then push the controller back in and have the following in the logs:
Apr 17 15:47:12 sles10 kernel: mptsas: ioc0: add expander: num_phys 25, sas_addr (0x5001517b9f5e03ff)
Apr 17 15:47:12 sles10 kernel: mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 11, sas_addr 0x500015500002050a
Apr 17 15:47:12 sles10 kernel: Vendor: Intel Model: Multi-Flex Rev: 0308
Apr 17 15:47:12 sles10 kernel: Type: Direct-Access ANSI SCSI revision: 05
Apr 17 15:47:12 sles10 kernel: 0:0:2:0: mptscsih: ioc0: qdepth=64, tagged=1, simple=1, ordered=0, scsi_level=6, cmd_que=1
Apr 17 15:47:12 sles10 kernel: 0:0:2:0: alua: supports explicit TPGS
Apr 17 15:47:12 sles10 kernel: 0:0:2:0: alua: port group 01 rel port 06
Apr 17 15:47:12 sles10 kernel: 0:0:2:0: alua: port group 01 ...
A quick update on some testing I have done on SLES 11 SP2. Here the recommended config from Intel does work. The interesting part, though, is why.
Config for SLES11
# cat /etc/multipath.conf
blacklist {
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^(hd|xvd)[a-z][[0-9]*]"
devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}
devices {
device {
vendor "Intel"
product "Multi-Flex"
path_grouping_policy "group_by_prio"
getuid_callout "/lib/udev/scsi_id -g -u /dev/%n"
prio "alua"
path_checker tur
path_selector "round-robin 0"
hardware_handler "1 alua"
failback immediate
rr_weight uniform
rr_min_io 100
no_path_retry queue
features "1 queue_if_no_path"
}
}
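On SLES 11 I reload a changed configuration without rebooting via the multipathd interactive key (from memory, so treat as a hint):
# multipathd -k"reconfigure"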
SLES11 with path grouping policy group_by_prio
# multipath -ll
2224b000155126b27 dm-0 Intel,Multi-Flex
size=60G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| `- 0:0:3:0 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 0:0:0:0 sda 8:0 active ready running
SLES10 with path grouping policy group_by_prio
# multipath -ll
22222000155e8d800 dm-0 Intel,Multi-Flex
[size=100G][features=1 queue_if_no_path][hwhandler=1 alua]
\_ round-robin 0 [prio=2][active]
\_ 0:0:2:0 sda 8:0 [active][ready]
\_ 0:0:3:0 sdb 8:16 [failed][ready]
As you can see, just as with my failover path grouping policy on SLES 10, the controllers are listed active/enabled. The noticeable difference is the prio values in the multipath output: on SLES 11 the two path groups get distinct priorities (130 and 1), while on SLES 10 both controllers come back with the same prio value. In other words, the algorithm that is supposed to calculate the prio values doesn't work on SLES 10, which is why group_by_prio fails there and the paths keep flapping between failed and active.
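If you want to check the prioritizer by hand on SLES 10, the callout binary shipped with multipath-tools can be run directly against each path (device names are just the ones from my output above; compare what it prints with the prio values):
# /sbin/mpath_prio_alua /dev/sda
# /sbin/mpath_prio_alua /dev/sdb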
Thank you very much for your work and contribution to the Community, emilec.
We got SLES 11 SP2 running with multipath I/O and can do an "affinity change" without problems.
But one problem still exists: When shutting down the server, the system is not able to do a clean unmount of the partitions.
"Not shutting down MD Raid - reboot/halt scripts do this." missing
Removing multipath targets: May 21 08:53:52 | 22289xxxxx_part2: map in use
May 21 08:53:52 | failed to remove multipath map 22289xxxxx
When the server boots up again, it does an fsck (file-system check). One time so far, it has found orphaned inodes.
Does anybody know how to solve this problem? Do you have it too?
Hi bic_admin
I must admit I didn't pay much attention to the shutdown procedure on the SLES 11 SP2 platform I was testing on. I know my SLES 10 platforms all shut down and boot cleanly. Unfortunately my lab equipment has been reassigned to another task, but when it's free I'll see if I can reproduce your problem.