Thursday, September 4, 2008

SAN event data gathering tips

Taken from: http://www.redbooks.ibm.com/abstracts/tips0553.html

The purpose of this tip is to outline the basic information that needs to be collected to assist in resolving SAN-related problems. You might find it daunting to go through several steps to gather all the data requested, but the most common cause of delays in problem resolution is a lack of data. Never assume that the root cause of a problem is in the most obvious place. By gathering logs from all parts of the SAN, you give yourself the greatest chance of a fast and effective resolution.

The second most common cause of delays in problem resolution is data that was collected hours or even days after the problem occurred. By then, there is often no longer any evidence of the original problem. Timely and complete data collection helps problems get resolved quickly. The data to collect on each host platform is outlined in the following sections.

The collection of log information is critical to understanding the cause of an event in a SAN environment. To aid support in analyzing the collected logs, it is also useful to provide the time offset of each piece of equipment. Some hardware may never have had its real-time clock set to the local time, which makes it very difficult to match events from one piece of equipment to another. Another important piece of information for timely error analysis is a physical diagram of the SAN topology. This diagram should be kept up to date and include all hosts, switches, directors, and storage devices within the SAN. It can save countless hours of reconstructing the original picture by piecing together log information.
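To illustrate why the time offsets matter, here is a minimal sketch of how an offset can be used to line up log entries from two devices. The function names and the sample readings are hypothetical and chosen only for illustration. The sketch assumes you read the clock on a device and on a reference host at the same moment.

```python
from datetime import datetime, timedelta


def clock_offset(device_time: datetime, reference_time: datetime) -> timedelta:
    """Return how far the device clock is ahead of (+) or behind (-) the reference."""
    return device_time - reference_time


def to_reference_time(event_time: datetime, offset: timedelta) -> datetime:
    """Convert a timestamp from the device's log into reference time."""
    return event_time - offset


# Hypothetical readings taken at the same moment on a switch and a reference host.
switch_now = datetime(2008, 9, 4, 10, 17, 30)
reference_now = datetime(2008, 9, 4, 14, 5, 0)
offset = clock_offset(switch_now, reference_now)

# A switch log entry stamped 10:02:11 corresponds to 13:49:41 in reference time.
event = to_reference_time(datetime(2008, 9, 4, 10, 2, 11), offset)
```

Recording one such offset per device when the data is collected lets whoever analyzes the logs place all events on a single timeline.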

AIX

Time difference
Use the date command to display the system date and time.

Log collection
Collect both errpt and errpt -a (each piped to a file).

Hardware configuration collection
Take a snap (the errpt is found in a snap, but it is good to have a separate copy).
The preferred snap for IBM TotalStorage DS Family problems is:

snap -gfiLc where:

g - Gathers the output of the lslpp -hBc command, which collects the exact operating system environment
f - Gathers file system information
i - Gathers installation debug vital product data (VPD) information
L - Gathers LVM information
c - Creates a compressed pax image (snap.pax.Z file)


Multi-pathing data collection
SDD (all versions of AIX)
Issue the following commands and capture the output. This data is not found in a snap. Preferably provide the output of these commands during the failure.

datapath query adapter
datapath query device
lsvpcfg


MPIO (available on AIX 5.2 and above)
Issue the following commands and capture the output. This data is not included in a snap.

pcmpath query adapter
pcmpath query device
pcmpath query essmap
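The AIX commands above can be captured in one pass so that nothing is forgotten while the failure is still present. The following is a rough sketch, not an IBM tool: the `capture` helper, the output file naming, and the `AIX_COMMANDS` list are assumptions made for illustration. Commands that are not installed on the host (for example, `pcmpath` on an SDD system) are skipped rather than treated as errors. The snap command is left out because it writes its own output files.

```python
import re
import shutil
import subprocess
from datetime import datetime
from pathlib import Path


def capture(commands, out_dir):
    """Run each command and save its output to a file in out_dir.

    Returns a dict mapping each command line to the file written,
    or to None when the executable is not found on this host.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    results = {}
    for argv in commands:
        key = " ".join(argv)
        if shutil.which(argv[0]) is None:
            results[key] = None  # command not installed; skip it
            continue
        proc = subprocess.run(argv, capture_output=True, text=True)
        # Build a file name that is safe on any file system.
        safe = re.sub(r"[^A-Za-z0-9.-]+", "_", key).strip("_")
        target = out_dir / f"{safe}.{stamp}.out"
        target.write_text(proc.stdout + proc.stderr)
        results[key] = target
    return results


# Commands taken from the AIX section above.
AIX_COMMANDS = [
    ["date"],
    ["errpt"],
    ["errpt", "-a"],
    ["datapath", "query", "adapter"],
    ["datapath", "query", "device"],
    ["lsvpcfg"],
    ["pcmpath", "query", "adapter"],
    ["pcmpath", "query", "device"],
    ["pcmpath", "query", "essmap"],
]
```

Calling `capture(AIX_COMMANDS, "san_data")` would write one time-stamped file per available command into a `san_data` directory. The same helper could be given the command lists from the other platform sections.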


HP-UX

Time difference
Use the date command to get the system date and time.

Log collection
Collect the contents of the /var/adm/syslog/syslog.log file.

Hardware configuration collection
Provide the following server details for each server involved with the SAN:

Manufacturer
Machine Type/Model Number
Feature details, for example, number of CPUs, amount of memory


HBA details
For SAN problems we always need the following details about the FC HBAs:

Hardware manufacturer, brand and model
BIOS (firmware) level, plus the BIOS settings if the HBA is a QLogic
Driver level


Software configuration collection
Capture the output of the uname -a command

Multi-pathing data collection
SDD
Issue the following commands and capture the output. Preferably provide the output of these commands during the failure.

datapath query adapter
datapath query device


Linux

Time difference
Use the date command to get the system date and time.

Log collection
Capture the contents of /var/log/messages
Capture the output of the dmesg command

Hardware configuration collection
For IBM xSeries hardware, the best way to collect configuration data is by using the e-gatherer tool. Make sure you also supply the HBA details. You can download e-gatherer from:




Software configuration collection
Capture the output of the command: uname -a

If you are running Red Hat, install and run sysreport, and send the output.

Multi-pathing data collection
SDD
Issue the following commands and capture the output. Preferably provide the output of these commands during the failure.

datapath query adapter
datapath query device


Microsoft Windows

Time difference
Display the system date and time using the clock in the bottom right-hand corner, or by issuing the time and date commands at a command prompt.

Log collection
Always save the system logs and the application logs as soon as possible after the event.
Do not export the logs in EVT format; that format is not helpful.
To find the system logs, either right-click My Computer and select Manage, or click:

Start -> Programs -> Administrative Tools -> Computer Management
When this opens, go to:
System Tools -> Event Viewer -> System
And then:
Action -> Save Log File As, changing the Save as type to CSV.
Repeat for Application logs.


Hardware configuration collection
For IBM xSeries hardware, the best way to collect configuration data is by using the e-gatherer tool. Make sure you also supply the HBA details. You can download e-gatherer from:




Software configuration data collection
If you cannot collect the e-gatherer data, provide:

Operating System
Service Pack Level


Multi-pathing data collection

SDD
If you are running SDD, issue the following commands and capture the output.

datapath query adapter
datapath query device


Novell NetWare

Time difference
Display the system date and time.

Log collection
CONLOG.EXE is a utility which writes all system console messages to a .LOG file.
More details can be found at url:




Hardware configuration collection
There is no e-gatherer for NetWare.

Software configuration data collection
Provide:

Operating System Level
State whether this is a clustered system


Multi-pathing data collection

SDD
If you are running SDD, issue the following commands and capture the output.

datapath query adapter
datapath query device


Sun Solaris

Time difference
Use the date command to get the system date and time.

Log collection
Save the /var/adm/messages file. Previous days' messages are normally available as /var/adm/messagesx, where x is the number of days since the logs rolled.
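Saving the current file together with its rolled copies can be sketched as follows. This is only an illustration: the `collect_logs` helper and the destination directory name are hypothetical. It copies every file whose name starts with `messages`, so it does not depend on the exact suffix used for the rolled files.

```python
import shutil
from pathlib import Path


def collect_logs(log_dir, dest_dir, base="messages"):
    """Copy the current log and its rolled copies from log_dir to dest_dir.

    Returns the names of the files copied, in sorted order.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(log_dir).glob(base + "*")):
        if src.is_file():
            # copy2 keeps the modification time, which helps when matching events.
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
    return copied
```

For example, `collect_logs("/var/adm", "san_data/solaris")` would copy the messages files into a `san_data/solaris` directory for sending to support.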

Hardware configuration collection
There is no e-gatherer or snap command to collect these details, so a good description of the hardware including the following is required:

Software configuration data collection
Provide:

Operating System details.
A copy of your sd.conf file
Output from iostat -El


Depending on the HBA, there will also be a driver configuration file, /kernel/drv/*.conf, where the * depends on the HBA vendor (for example, QLogic or JNI).

Multi-pathing data collection

SDD
Issue the following commands and capture the output.

datapath query adapter
datapath query device


Veritas Volume Manager DMP
Provide the output from the following commands:

ls -lL /dev/rdsk/*
ls -la /dev/vx/dmp/*
format


**Contributed by shah_mr