Looking at PERC Raid controller log files for disk errors

Log files can be obtained from Dells OpenManage interface – see below:

image

This can show up disk errors and RAID errors for disks that have yet themselves to flag themselves as failing.  This can cause all sorts of IO bottlenecks etc.

Note – if the disks are not Dells – the logs will often include SENSE errors which can be ignored e.g.

T61: DM_DevConfigParamCallback: SENSE Len 12 ResponseCode 70 senseKey 5 asc 26 ascq 0
T61: DM_DevConfigParamCallback: Rdm x80500400  FAILED Pd 2 Data 84275da0 Len 10 Cdb 15 pageIndex 3

Every so often the RAID controller goes off and does a “Patrol Read” which may look something like this:

09/01/12  3:52:49: prCallback: PR completed for pd=00
09/01/12  4:08:51: EVT#06109-09/01/12  4:08:51:  94=Patrol Read progress on PD 03(e0x20/s3) is 19.99%(3971s)
09/01/12  4:09:00: EVT#06110-09/01/12  4:09:00:  94=Patrol Read progress on PD 02(e0x20/s2) is 19.99%(3964s)
09/01/12  4:09:24: EVT#06111-09/01/12  4:09:24:  94=Patrol Read progress on PD 05(e0x20/s5) is 19.99%(3980s)
.
.
09/01/12 10:15:48: EVT#06141-09/01/12 10:15:48:  94=Patrol Read progress on PD 03(e0x20/s3) is 99.99%(25220s)
09/01/12 10:15:50: prCallback: PR completed for pd=03
09/01/12 10:17:33: EVT#06142-09/01/12 10:17:33:  94=Patrol Read progress on PD 05(e0x20/s5) is 99.99%(25349s)
09/01/12 10:17:36: prCallback: PR completed for pd=05
09/01/12 10:18:34: EVT#06143-09/01/12 10:18:34:  94=Patrol Read progress on PD 02(e0x20/s2) is 99.99%(25378s)
09/01/12 10:18:37: prCallback: PR completed for pd=02
09/01/12 10:21:50: EVT#06144-09/01/12 10:21:50:  94=Patrol Read progress on PD 04(e0x20/s4) is 99.99%(25606s)

This is basically just checking disks are all in alignment etc – and is good if it completes without errors as above (though it may temporarily slow down I/O a little)

If however your log has lots of sections like this in it you have problems.

06/12/13 21:43:38: EVT#36801-06/12/13 21:43:38: 113=Unexpected sense: PD 04(e0x20/s4) Path 4433221103000000, CDB: 8f 00 00 00 00 00 c3 e9 4f 7b 00 00 10 00 00 00, Sense: 3/11/00
06/12/13 21:43:38: Raw Sense for PD 4: f0 00 03 c3 e9 4f 8e 0a 00 00 00 00 11 00 00 00 00 00
06/12/13 21:43:38: DEV_REC:Medium Error DevId[4] devHandle f RDM=807ad600 retires=0
06/12/13 21:43:38: MedErr is for: cmdId=2489, ld=1, src=0, cmd=2, lba=662bea96, cnt=20, rmwOp=0
06/12/13 21:43:38: ErrLBAOffset (13) LBA(c3e94f7b) BadLba=c3e94f8e
06/12/13 21:43:38: prCallback: Medium Error on pd=04, StartLba=c3e94f7b, ErrLba=c3e94f8e
06/12/13 21:43:38: prRecQueue: starting pd=04 recovery - blocking host commands
06/12/13 21:43:38: prRecGo: Ready to attempt recovery errLBA=c3e94f8e on pd=04
06/12/13 21:43:38: prRecGo: dataErr found on ld=1, span=0, physArm=2, logArm=1, row=187d29f, strip=49777de, startRef=e,type=0
06/12/13 21:43:38: prRecGo: data NOT valid in cache
06/12/13 21:43:38: Issuing write verify pd=04 physArm=2 span=0 startBlk=c3e94f8e numBlks=1
06/12/13 21:43:38: EVT#36802-06/12/13 21:43:38: 110=Corrected medium error during recovery on PD 04(e0x20/s4) at c3e94f8e
06/12/13 21:43:38: EVT#36803-06/12/13 21:43:38:  93=Patrol Read corrected medium error on PD 04(e0x20/s4) at c3e94f8e

This indicates that Disk 4 (PD4) is on its way out and should be replaced ASAP. 
Note Disk 4 is probably Disk 4 as installed in the server – not necessarily disk 4 in that raid array – so take care – and remember the first dis is Disk 0

Comments

Andrew

re: Looking at PERC Raid controller log files for disk errors

24 September 2016

Thank you!  That was very helpful!  I had a SQL server giving me some fits but everything was reporting ok, obviously i've got a bad drive and you helped me locate it.  Thanks for posting this.

Find out more