2005-11-17

Unconventional firmware update on a p5-520

Today I connected a p5-520 to one of my HMCs, in order to partition it and manage it remotely afterwards. Until today the machine had been running in stand-alone (non-partitioned) mode. The plan was to "encapsulate" the existing OS inside an LPAR, assigning to it the resources it was using and removing those it wasn't. A straight-through RJ45 twisted-pair cable had been run beforehand to connect the machine to the private HMC DHCP LAN.

After connecting the cable, the HMC discovers the new managed system automatically after about 5 minutes. So far so good. Note here that when a machine is transitioned like this from stand-alone to partitioned mode, you end up with a single LPAR designated as the service partition for the managed system. The name of the LPAR matches the system serial number, and it has only one default profile, named accordingly "default". I decided to delete this LPAR and start from scratch, so I had to go into the managed system's properties and set the service partition to "unassigned" before I could delete it.

Now I create my new LPAR with its default (and only) profile. That's when the strangeness started. After creation, I go back into the profile to verify the resources, and discover that all the fields are greyed out (inactive), and no modifications are possible!! I can't add or remove any resources, nor change the name of the LPAR or profile! I know this is not right because I have a dozen other LPARs on the same HMC (but in different managed systems) without this kind of problem.

My first guess is that this is a firmware problem, so I check the machine's firmware level and find that it is at level SF223, the latest being SF235 at this moment. Between them there are several releases with severity HIPER, as shown here. That pretty much decided it, so I downloaded the latest SF235 level to another machine accessible via FTP. According to the docs, system firmware can be updated either from the HMC or from an AIX partition (which I presume must be the service partition; I didn't read in detail). I decided to update from the HMC, and maybe try the other method on another occasion.

Before starting the firmware update, I deleted the partition and profile that I had created, powered the machine down and brought it back to the "partition standby" state.

The update involved several phases:

1) fetching the firmware package (one RPM package and one XML file) from the other machine via FTP
2) installing the updates
3) powering down the system
4) apparently rebooting the service processor (at one point the operator panel value showed "Firmware not ready")
5) powering on the system into the "standby" state

During phase 2 I got a huge scare when, reading the firmware description in parallel (see the link above), I found this:

Before attempting to load this system firmware please ensure that your HMC software has been upgraded to Version 5, Release 1.0.

Shit!! My HMC is at level 4.5.0! I could see myself calling IBM, saying that I attempted a firmware update on an outdated HMC, getting brushed off for not reading the procedure carefully enough, etc, while users were already calling asking when the machine is coming back up. I would be humiliated twice on the same day (the first time for a failed RAM upgrade on the same machine, but that's another story).

Meanwhile, phase 2 is taking forever, and cancelling or backing out is impossible (except by killing the window and/or rebooting the HMC, but I didn't have the guts). So I just waited and prayed. Luckily for me I had asked the users for a whole day's downtime beforehand, so I told them that they should not expect the machine today, and was ready to go to the site (the machine is 1.5 hours away by train and bus), look at the LEDs and call IBM from there (I hate debugging this kind of problem without physical access to the machine).

Then I saw the process going into phase 3 and my hopes were restored. The whole thing took in excess of 30 minutes, and then - success!! WHEW. Just in case, I rebooted the HMC before creating the LPAR again. Then I recreated the LPAR and profile, went back into the profile, and no more greyed fields! Problem resolved. I boot the system, and everything is fine and dandy. I haven't felt this kind of relief in a long time.

So, in spite of IBM's recommendation (or should I say "requirement"), installing firmware SF235 on a p5-520 via an HMC at level 4.5.0 works. Don't know if it will work for other models, but I don't think I'll take my chances :-)

IBM too eager to ship new stuff

Back in August I got my hands on my first p5 machine, a 590. Two weeks later, the machine's service processor fries (a thing almost unheard of according to IBM, so of course it had to happen to me). It turns out that these machines have two service processors, but automatic failover will be supported only with the next firmware level. So for now the second SP just sits there and does nothing.

Back then I checked the IBM website - indeed, it said that this model features "redundant service processors (planned for 2H/2005)". Today I checked the same page and the "planned for" bit was gone! I go to the firmware page - a new p5 firmware level SF235 is out. The description says:

Added support for redundant service processors with dynamic failover in models 570, 590, and 595.

Yippee!! But wait, what's this - earlier on the same page:

This level of firmware does not support the 9119-590, 9119-595, 9406-595 or 9118-575 systems. Firmware to support these systems is being tested and is expected to be available mid-November 2005.

Gee, thanks but no thanks, still waiting...

2005-11-16

In reference to the previous post

Just remembered on the way home that PowerPath itself does a bosboot during installation, so downtime is required in any case. I feel less guilty now :-)

EMC AIX ODM package

Dilemma of the day: to install or not to install the EMC AIX ODM package on AIX 5.3 machines connected to an EMC DMX SAN??

This is from the AIX 5.3 release notes:

A device configured as MPIO other FC disk has the following properties:
[...]
Is supported in a production environment. Device-specific vendor ODM pre-definitions are not required to be installed before using in a production environment.

So, I install my machines with the default AIX configuration and they run happily for weeks with EMC disks configured as "MPIO Other FC SCSI Disk Drive", with no problems whatsoever.

Today I find this in the "EMC Host Connectivity Guide for IBM AIX":

A Symmetrix device configured as an MPIO other FC device is not
supported. The only supported configuration is with the EMC
Symmetrix FCP MPIO ODM predefined attributes which has been
certified to operate with the AIX default PCMs. These predefines get
installed with the EMC.Symmetrix.fcp.MPIO.rte fileset.

My machines haven't gone into production yet, but some will very soon. So now I have to vary off the VGs, delete the disks, install the goddamn EMC filesets (which do a bosboot), and reboot each machine. Luckily I only have ten machines so far.

According to the same EMC doc, two filesets are necessary, in one of the following two combinations:

EMC.Symmetrix.aix.rte
EMC.Symmetrix.fcp.rte

or

EMC.Symmetrix.aix.rte
EMC.Symmetrix.fcp.MPIO.rte

The former configures the disks without MPIO (lspath shows nothing for those disks), while the latter configures them as MPIO devices with a path to each HBA they are connected to.

But can EMC.Symmetrix.fcp.rte and EMC.Symmetrix.fcp.MPIO.rte coexist? No, they can't. This is what happens if you try to install the two at the same time:

*************************************************************************
* EMC Symmetrix FCP MPIO Software Support for Symmetrix devices has *
* been found installed or Symmetrix FCP Software Support is being *
* installed simultaneously. It is necessary to remove all previous *
* versions of the EMC.Symmetrix.fcp.MPIO.rte software before Symmetrix *
* FCP Software Support can be installed. *
* Select the correct software fileset to install or run the following *
* command to remove the EMC.Symmetrix.fcp.MPO.rte software. *
* *
* Please run 'installp -u EMC.Symmetrix.fcp.MPIO.rte' to remove the *
* installed fileset. *
* *
*************************************************************************
instal: Failed while executing the EMC.Symmetrix.fcp.rte.pre_i script.

*************************************************************************
* EMC Symmetrix FCP Software Support for Symmetrix devices has been *
* found installed or Symmetrix FCP MPIO Software Support is being *
* installed simultaneously. It is necessary to remove all previous *
* versions of the EMC.Symmetrix.fcp.rte software before Symmetrix FCP *
* MPIO Software Support can be installed. *
* Select the correct software fileset to install or run the following *
* command to remove the EMC.Symmetrix.fcp.rte software *
* *
* Please run 'installp -u EMC.Symmetrix.fcp.rte' to remove the *
* installed fileset. *
* *
*************************************************************************
instal: Failed while executing the EMC.Symmetrix.fcp.MPIO.rte.pre_i script.
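
The either/or rule is easy to check before handing a list of filesets to installp. Below is a minimal sketch of such a check, not anything EMC ships: fcp_variant is a hypothetical helper name of my own, and it only looks at the fileset names passed to it as arguments. Where those names come from on a real machine (the list of installed or to-be-installed filesets) is left out.

```sh
#!/bin/sh
# Hypothetical helper (not part of any EMC or AIX package):
# given a list of fileset names, report which Symmetrix FCP variant
# they select. Prints one of: mpio, non-mpio, none, conflict.
# Returns 1 when both mutually exclusive filesets are present.
fcp_variant() {
    plain=0
    mpio=0
    for fs in "$@"; do
        case "$fs" in
            EMC.Symmetrix.fcp.rte)      plain=1 ;;
            EMC.Symmetrix.fcp.MPIO.rte) mpio=1 ;;
        esac
    done
    if [ "$plain" -eq 1 ] && [ "$mpio" -eq 1 ]; then
        echo conflict
        return 1
    elif [ "$mpio" -eq 1 ]; then
        echo mpio
    elif [ "$plain" -eq 1 ]; then
        echo non-mpio
    else
        echo none
    fi
}
```

For example, `fcp_variant EMC.Symmetrix.aix.rte EMC.Symmetrix.fcp.MPIO.rte` prints mpio, while passing both fcp filesets prints conflict and returns a non-zero status.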

So, it's either one or the other. Why the hassle, you may ask? Thing is, right now we do not have multiple HBA attachments for the same disks. In this situation, either case will work. In the case of MPIO, each disk will have one path through the lone HBA. However, management says one day we might do double HBA attachment, and the choice will be between MPIO and EMC PowerPath. PowerPath requires EMC.Symmetrix.fcp.rte, which cannot coexist with EMC.Symmetrix.fcp.MPIO.rte, as we have just seen. So, the possible scenarios are:

1. Configure now with MPIO, and schedule downtime if PowerPath is chosen
2. Configure now without MPIO, and schedule downtime if MPIO is chosen

To me, choice 1 seems clear. Screw EMC!!
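
Going with choice 1, the per-machine steps mentioned earlier (vary off the VGs, delete the disks, install the filesets, reboot) could be laid out as a dry run first. This is only a sketch under assumptions: plan_mpio_switch is a hypothetical function name, the package directory, VG name and hdisk names are placeholders I made up, and the function prints the commands instead of running them, so the list can be reviewed per machine before anything is touched.

```sh
#!/bin/sh
# Dry-run sketch: print (do NOT execute) the commands for one volume group.
# Usage: plan_mpio_switch <pkgdir> <vg> <hdisk>...
# All arguments are placeholders chosen by the caller.
plan_mpio_switch() {
    if [ "$#" -lt 3 ]; then
        echo "usage: plan_mpio_switch <pkgdir> <vg> <hdisk>..." >&2
        return 2
    fi
    pkgdir=$1
    vg=$2
    shift 2
    # File systems in the VG must be unmounted before this step.
    echo "varyoffvg $vg"
    # Delete the definitions of the disks configured as "MPIO Other".
    for disk in "$@"; do
        echo "rmdev -dl $disk"
    done
    # Apply and commit the MPIO combination from the EMC doc.
    echo "installp -acgX -d $pkgdir EMC.Symmetrix.aix.rte EMC.Symmetrix.fcp.MPIO.rte"
    # Fast shutdown and restart.
    echo "shutdown -Fr"
}
```

Calling it as `plan_mpio_switch /tmp/emc datavg hdisk2 hdisk3` prints five lines: one varyoffvg, two rmdev, one installp and one shutdown. Printing rather than executing is deliberate, since a mistake in the disk list on a machine 1.5 hours away is not something I want to discover after the fact.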
