Continuous MEM_FAIL errors and elevated CPU with Hardware Maintenance Process on the blade


CPU utilization of the Hardware Maintenance process is high on the blade (between 40% and 70%).
MEM_FAIL errors repeat constantly, as in the following:
Message 1/144 Syslog Message 03/24/2016 13:27:41
<3>bcmStrat[11.tLcIntr]MEM_FAIL_INT_STAT=0x00000000, EP_INTR_STATUS=0x00
000000, IP0_INTR_STATUS=0x00000000, IP1_INTR_STATUS=0x00000040, IP2_INTR
_STATUS=0x00000000, XQPORT_INTR_STATUS.xgport0=0x00000000, XQPORT_INTR_S
Message 2/144 Syslog Message 03/24/2016 13:27:41
<3>bcmStrat[11.tLcIntr]MEM_FAIL interrupt occurred on chip 4!

The following has also been seen:
Message  74/89   Syslog Message     08/01/2016 06:21:39

    <3>bcmStrat[10.tNimIntr]MEM_FAIL_INT_STAT=0x00000400, EGR_INTR0_STATUS=0
    x00000000, EGR_INTR1_STATUS=0x00000000, IP0_INTR_STATUS=0x00000001, IP1_
    INTR_STATUS=0x00000000, IP2_INTR_STATUS=0x00000000, IP2_INTR_STATUS_2=0x
    00000000, IP3_INTR_STATUS=0x00000000, IP4_INTR_STATUS=0x00000000, IP5_IN
    TR_STATUS=0x00000000, IP5_INTR_STATUS_1=0x00000000, IP5_INTR_STATUS_2=0x
    00000000, PG4_XINTR_STATUS=0x00000000, PG4_YINTR_STATUS=0x00000000, PG5_
    XINTR_STATUS=0x00000000, PG5_YINTR_STATUS=0x00000000
Message  75/89   Syslog Message     08/01/2016 06:21:39

    <3>bcmStrat[10.tNimIntr]MEM_FAIL interrupt occurred on chip 12!

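To judge whether the MEM_FAIL messages are truly repeating, it can help to count them per chip in a saved copy of the message log. The following is a minimal sketch, not a vendor tool. It assumes the log has been exported as plain text and that each "interrupt occurred" message sits on a single line, as in the samples above. The function name count_mem_fail and the field name "instance" (the number inside the brackets, whose meaning this article does not state) are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   <3>bcmStrat[11.tLcIntr]MEM_FAIL interrupt occurred on chip 4!
# "instance" is a hypothetical label for the number inside the brackets.
MEM_FAIL_RE = re.compile(
    r"bcmStrat\[(?P<instance>\d+)\.(?P<thread>\w+)\]"
    r"MEM_FAIL interrupt occurred on chip (?P<chip>\d+)!"
)


def count_mem_fail(log_text):
    """Count MEM_FAIL interrupt messages per (instance, thread, chip)."""
    counts = Counter()
    for match in MEM_FAIL_RE.finditer(log_text):
        key = (int(match["instance"]), match["thread"], int(match["chip"]))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Message 2/144 Syslog Message 03/24/2016 13:27:41\n"
        "<3>bcmStrat[11.tLcIntr]MEM_FAIL interrupt occurred on chip 4!\n"
    )
    for key, total in count_mem_fail(sample).items():
        print(key, total)
```

A high count for the same chip over a short period is consistent with the symptom described here, but the count alone does not confirm the diagnosis.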
The blade may require a manual reboot, or it may eventually crash with a reset like the following in the fault log:

Message 22/154 Informational 03/29/2016 17:03:19
Device was last fully operational in user mode: 42 seconds ago. Last res
et caused by: processor exception.
Message 23/154 Informational 03/29/2016 17:02:58
Board reset itself

Message 24/154 Exception PPC750 Info 03/29/2016 17:02:58

Exc Vector: DSI exception (0x00000300)
Thread Name: bcmDPC
Exc Addr: 0x010d21cc
Thread Stack: 0x0c113000..0x0c10f000
Stack Pointer: 0x0c112bb0
Traceback Stack:

[ 0] 0x010d2100
[ 1] 0x010cec4c
[ 2] 0x01101c60
[ 3] 0x011027b4
[ 4] 0x01089c3c
[ 5] 0x016731a4
[ 6] 0xeeeeeeee
[ 7] 0x00000000

srr0 : 0x010d21cc srr1 : 0x0000b032 dar : 0x00009954
cr : 0x20002822 xer : 0x00000000 fpcsr:0x00000000

msr : 0x0000b032 lr : 0x010d2100 ctr : 0x014d9e74
pc : 0x010d21cc cr : 0x20002822 xer : 0x00000000
r[ 0]:0x00000010 r[ 1]:0x0c112cd0 r[ 2]:0x03491fd0 r[ 3]:0x0c112cd8
r[ 4]:0x00000000 r[ 5]:0x00000064 r[ 6]:0x0c112bac r[ 7]:0x00000001
r[ 8]:0x00001762 r[ 9]:0x00010000 r[10]:0x00001763 r[11]:0x0c112be0
r[12]:0x00000b04 r[13]:0x040d1e40 r[14]:0x0c112e18 r[15]:0xfffffff6
r[16]:0x049f8958 r[17]:0x00000050 r[18]:0x00000005 r[19]:0x02cd2ce4
r[20]:0x00000000 r[21]:0x00000000 r[22]:0x00000003 r[23]:0x04a20fb4
r[24]:0x049f8958 r[25]:0x000001e0 r[26]:0x04a21194 r[27]:0x00000004
r[28]:0x04a211cc r[29]:0x0c112cd8 r[30]:0x0c112d88 r[31]:0x00000003

Firmware 8.32.xx
Firmware 8.42.xx
The root cause of the issue was the bcmStrat[11.tLcIntr]MEM_FAIL interrupt error. This is a non-fatal error that can occur, rarely, under normal circumstances.
The SER (Software Error Recovery) feature, introduced in version 8.31, recovers from this error without any impact to the switch. The following article has the details:

MEM_FAIL_INT_STAT errors in debug error log NOT accompanied by a blade reset
In the vast majority of cases this resolves the issue.

However, in a couple of rare cases involving the MEM_FAIL error above, this error recovery did not work: the messages repeated constantly and the blade eventually reset.
Additional notes
Note that to match this issue, every part of the error must be exactly the same as shown above. It is highly likely that GTAC diagnosis will be required.
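Because the register dumps in the log are wrapped mid-word, comparing them by eye is error-prone. The sketch below is an illustration under assumptions, not a GTAC procedure. It assumes the wrapped lines of one dump are available as a list of strings and can simply be rejoined after trimming whitespace. The names parse_registers and nonzero are hypothetical. It makes no attempt to recover a dump that the log itself truncated.

```python
import re

# NAME=0xVALUE pairs, e.g. IP1_INTR_STATUS=0x00000040 or
# XQPORT_INTR_STATUS.xgport0=0x00000000
REG_RE = re.compile(r"([A-Za-z0-9_.]+)=(0x[0-9a-fA-F]+)")


def parse_registers(wrapped_lines):
    """Rejoin wrapped log lines and return {register_name: int_value}."""
    joined = "".join(line.strip() for line in wrapped_lines)
    return {name: int(value, 16) for name, value in REG_RE.findall(joined)}


def nonzero(registers):
    """Keep only the registers whose value is not zero."""
    return {name: value for name, value in registers.items() if value != 0}


if __name__ == "__main__":
    dump = [
        "<3>bcmStrat[11.tLcIntr]MEM_FAIL_INT_STAT=0x00000000, EP_INTR_STATUS=0x00",
        "000000, IP0_INTR_STATUS=0x00000000, IP1_INTR_STATUS=0x00000040, IP2_INTR",
        "_STATUS=0x00000000, XQPORT_INTR_STATUS.xgport0=0x00000000, XQPORT_INTR_S",
    ]
    for name, value in nonzero(parse_registers(dump)).items():
        print(f"{name} = 0x{value:08x}")
```

Two dumps parsed this way can be compared as dictionaries. What counts as a match for this article is still a decision for GTAC.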


