Continuous MEM_FAIL errors and elevated CPU with Hardware Maintenance Process on the blade

Symptoms

CPU utilization of the Hardware Maintenance process is high on the blade (between 40% and 70%).
MEM_FAIL errors repeat constantly, as shown below:
 
==============================================================================
Message 1/144 Syslog Message 08.32.01.0021 03/24/2016 13:27:41
<3>bcmStrat[11.tLcIntr]MEM_FAIL_INT_STAT=0x00000000, EP_INTR_STATUS=0x00
000000, IP0_INTR_STATUS=0x00000000, IP1_INTR_STATUS=0x00000040, IP2_INTR
_STATUS=0x00000000, XQPORT_INTR_STATUS.xgport0=0x00000000, XQPORT_INTR_S
TATUS.xgport1=0x00000000
==============================================================================
Message 2/144 Syslog Message 08.32.01.0021 03/24/2016 13:27:41
<3>bcmStrat[11.tLcIntr]MEM_FAIL interrupt occurred on chip 4!
==============================================================================

The following has also been seen:
 
Message  74/89   Syslog Message            08.42.01.0007   08/01/2016 06:21:39

    <3>bcmStrat[10.tNimIntr]MEM_FAIL_INT_STAT=0x00000400, EGR_INTR0_STATUS=0
    x00000000, EGR_INTR1_STATUS=0x00000000, IP0_INTR_STATUS=0x00000001, IP1_
    INTR_STATUS=0x00000000, IP2_INTR_STATUS=0x00000000, IP2_INTR_STATUS_2=0x
    00000000, IP3_INTR_STATUS=0x00000000, IP4_INTR_STATUS=0x00000000, IP5_IN
    TR_STATUS=0x00000000, IP5_INTR_STATUS_1=0x00000000, IP5_INTR_STATUS_2=0x
    00000000, PG4_XINTR_STATUS=0x00000000, PG4_YINTR_STATUS=0x00000000, PG5_
    XINTR_STATUS=0x00000000, PG5_YINTR_STATUS=0x00000000
==============================================================================
Message  75/89   Syslog Message            08.42.01.0007   08/01/2016 06:21:39

    <3>bcmStrat[10.tNimIntr]MEM_FAIL interrupt occurred on chip 12!
==============================================================================
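
When many of these messages accumulate, it can help to see which status registers are non-zero and which chip is named. The following is a minimal Python sketch, not a vendor tool. It assumes the message text has been saved from the log exactly as shown above, with long lines wrapped at a fixed width, and the helper names (`unwrap`, `nonzero_registers`, `chip_number`) are hypothetical.

```python
import re

# Matches NAME=0xVALUE pairs such as IP1_INTR_STATUS=0x00000040.
REGISTER_RE = re.compile(r"([A-Za-z0-9_.]+)=(0x[0-9a-fA-F]+)")
# Matches the companion message that names the affected chip.
CHIP_RE = re.compile(r"MEM_FAIL interrupt occurred on chip (\d+)")


def unwrap(lines):
    """Rejoin a message body that was wrapped at a fixed width.

    Assumption: the wrap splits tokens mid-word, so the pieces are
    concatenated with no separator after trimming indentation.
    """
    return "".join(line.strip() for line in lines)


def nonzero_registers(body):
    """Return {register: value} for every status register that is not zero."""
    return {
        name: int(value, 16)
        for name, value in REGISTER_RE.findall(body)
        if int(value, 16) != 0
    }


def chip_number(body):
    """Return the chip number from a 'MEM_FAIL interrupt occurred' line, or None."""
    match = CHIP_RE.search(body)
    return int(match.group(1)) if match else None
```

Applied to the first message above, `nonzero_registers` would report only `IP1_INTR_STATUS`; applied to the second sample, it would report `MEM_FAIL_INT_STAT` and `IP0_INTR_STATUS`.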

The blade may require a manual reboot, or it may eventually crash and reset, leaving entries like these in the fault log:

 
==============================================================================
Message 22/154 Informational 08.32.01.0021 03/29/2016 17:03:19
Device was last fully operational in user mode: 42 seconds ago. Last res
et caused by: processor exception.
==============================================================================
Message 23/154 Informational 08.32.01.0021 03/29/2016 17:02:58
Board reset itself
==============================================================================

Message 24/154 Exception PPC750 Info 08.32.01.0021 03/29/2016 17:02:58

Exc Vector: DSI exception (0x00000300)
Thread Name: bcmDPC
Exc Addr: 0x010d21cc
Thread Stack: 0x0c113000..0x0c10f000
Stack Pointer: 0x0c112bb0
Traceback Stack:

[ 0] 0x010d2100
[ 1] 0x010cec4c
[ 2] 0x01101c60
[ 3] 0x011027b4
[ 4] 0x01089c3c
[ 5] 0x016731a4
[ 6] 0xeeeeeeee
[ 7] 0x00000000

GENERAL EXCEPTION INFO:
srr0 : 0x010d21cc srr1 : 0x0000b032 dar : 0x00009954
cr : 0x20002822 xer : 0x00000000 fpcsr:0x00000000
dsisr:0x40000000

GENERAL REGISTER INFO:
msr : 0x0000b032 lr : 0x010d2100 ctr : 0x014d9e74
pc : 0x010d21cc cr : 0x20002822 xer : 0x00000000
r[ 0]:0x00000010 r[ 1]:0x0c112cd0 r[ 2]:0x03491fd0 r[ 3]:0x0c112cd8
r[ 4]:0x00000000 r[ 5]:0x00000064 r[ 6]:0x0c112bac r[ 7]:0x00000001
r[ 8]:0x00001762 r[ 9]:0x00010000 r[10]:0x00001763 r[11]:0x0c112be0
r[12]:0x00000b04 r[13]:0x040d1e40 r[14]:0x0c112e18 r[15]:0xfffffff6
r[16]:0x049f8958 r[17]:0x00000050 r[18]:0x00000005 r[19]:0x02cd2ce4
r[20]:0x00000000 r[21]:0x00000000 r[22]:0x00000003 r[23]:0x04a20fb4
r[24]:0x049f8958 r[25]:0x000001e0 r[26]:0x04a21194 r[27]:0x00000004
r[28]:0x04a211cc r[29]:0x0c112cd8 r[30]:0x0c112d88 r[31]:0x00000003
--------------------------------------------------------------------
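
To compare a fault log entry from another blade against the exception record above, the identifying header fields can be pulled out and checked side by side. This is a rough Python sketch under the assumption that the record is available as plain text in the layout shown above; the function name `exception_signature` and the choice of fields are illustrative only and do not define what GTAC considers a match.

```python
import re

# Header fields of the exception record, one "Name: value" pair per line.
FIELD_RE = re.compile(
    r"^(Exc Vector|Thread Name|Exc Addr):[ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def exception_signature(record):
    """Return the header fields of an exception record as a dict."""
    return dict(FIELD_RE.findall(record))


def same_signature(record_a, record_b):
    """True when both records carry the same, non-empty header fields."""
    sig_a = exception_signature(record_a)
    return bool(sig_a) and sig_a == exception_signature(record_b)
```

For the record above, `exception_signature` would return the exception vector, the thread name `bcmDPC`, and the exception address.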


 
Environment
S-Series
K-Series
Firmware 8.32.xx
Firmware 8.42.xx
Cause
The root cause of the issue was the bcmStrat[11.tLcIntr]MEM_FAIL interrupt error. This is a non-fatal error that can occur, rarely, under normal circumstances.
An SER (Software Error Recovery) feature, introduced in firmware version 8.31, recovers from this error without any impact to the switch. The following article has details:

MEM_FAIL_INT_STAT errors in debug error log NOT accompanied by a blade reset
 
In the vast majority of cases this resolves the issue.

However, in a couple of rare cases involving the MEM_FAIL error above, this error recovery did not work: the messages repeated constantly and the blade eventually reset.
Resolution
Additional notes
Note that to match this article, the error must be exactly the same in all parts. It is highly likely that GTAC diagnosis will be required.
