BR-MLX-100Gx2-CFP2 line card module suddenly reboots

Symptoms
In an MLXe system running NI 5.8.00ec, a BR-MLX-100Gx2-CFP2 line card module suddenly reboots and leaves a crash dump, visible in "show save", that contains the following elements:
EXCEPTION 1200, Data TLB error

Possible Stack Trace (function call return address list)
21580f78: memcpy(pc)
2118df08: kbp_memcpy(lr)
20b3d984: kbp_npxxpt_compare_data
20b3ceec: kbp_npxxpt_execute_req
20b3cce8: kbp_npxxpt_service_reqs
214b3af0: kbp_xpt_service_requests
214b238c: kbp_dm_12k_cbwlpm
214990b0: device_compare
2149a308: kbp_instruction_search
215042c8: NlmNsTrie__CheckAndFixRpt
2150435c: NlmNsTrie__FindIptUnderRpt
21504378: NlmNsTrie__FindIptUnderRpt
21504378: NlmNsTrie__FindIptUnderRpt
21504378: NlmNsTrie__FindIptUnderRpt
21504378: NlmNsTrie__FindIptUnderRpt
21504378: NlmNsTrie__FindIptUnderRpt
21504378: NlmNsTrie__FindIptUnderRpt
215043e4: NlmNsTrie__FindRptEntries
215043f4: NlmNsTrie__FindRptEntries
215043f4: NlmNsTrie__FindRptEntries
215043f4: NlmNsTrie__FindRptEntries
215043f4: NlmNsTrie__FindRptEntries
215043f4: NlmNsTrie__FindRptEntries
215043f4: NlmNsTrie__FindRptEntries
215043f4: NlmNsTrie__FindRptEntries
215043f4: NlmNsTrie__FindRptEntries
21504538: NlmNsTrie__SearchAndRepairRpt
21512bec: kbp_ftm_search_and_repair_rpt
214f3420: kbp_lpm_db_advanced_search_and_repair
21525ce0: kbp_device_advanced_fix_errors
214a14c4: kbp_device_12k_fix_parity_errors
21496c38: kbp_device_fix_errors
20b37004: netroute_ifsr_fix_errors
20ab0ea0: nlcam_ifsr_netroute_scan_errors
20ab04e4: nlcam_ifsr_fifo_poll
200058b0: perform_callback
200062b8: timer_timeout
00040160: sys_end_entry
0005e49c: suspend
0005cf74: dev_sleep
00005024: xsyscall
207ed6a0: main
00040158: sys_end_task

An examination of "show logging" shows that the line card reboot happened after a series of TCAM In-Field Soft Repair (IFSR) events for that module. For example (newest entry first):
Jul 28 08:09:55:N:System: Module down in slot 3, reason CARD_DOWN_REASON_REBOOTED. Error Code 0
Jul 28 08:08:39:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:08:39:I:IFSR: Soft Repair failed on SLOT 3 PPCR 1
Jul 28 08:07:39:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:07:39:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:07:39:I:IFSR: Soft Repair failed on SLOT 3 PPCR 1
Jul 28 08:06:39:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:05:39:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:04:38:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:04:38:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:03:38:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:02:38:I:IFSR: Soft Repair at TCAM index 0x00062197 of SLOT 3 PPCR 2
Jul 28 08:02:38:I:IFSR: Soft Repair at TCAM index 0x00062196 of SLOT 3 PPCR 2
Jul 28 08:02:38:I:IFSR: Soft Repair at TCAM index 0x00062195 of SLOT 3 PPCR 2
Jul 28 08:02:38:I:IFSR: Soft Repair at TCAM index 0x00062194 of SLOT 3 PPCR 2
Jul 28 08:02:38:I:IFSR: Soft Repair at TCAM index 0x00062193 of SLOT 3 PPCR 2
Jul 28 08:02:38:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:02:38:I:IFSR: Soft Repair at TCAM index 0x00062191 of SLOT 3 PPCR 2
Jul 28 08:02:38:I:IFSR: Soft Repair at TCAM index 0x00062190 of SLOT 3 PPCR 2
Jul 28 08:01:38:I:IFSR: Soft Repair at TCAM index 0x00062197 of SLOT 3 PPCR 2
Jul 28 08:01:38:I:IFSR: Soft Repair at TCAM index 0x00062196 of SLOT 3 PPCR 2
Jul 28 08:01:38:I:IFSR: Soft Repair at TCAM index 0x00062195 of SLOT 3 PPCR 2
Jul 28 08:01:38:I:IFSR: Soft Repair at TCAM index 0x00062194 of SLOT 3 PPCR 2
Jul 28 08:01:38:I:IFSR: Soft Repair at TCAM index 0x00062193 of SLOT 3 PPCR 2
Jul 28 08:01:38:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:01:38:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:01:38:I:IFSR: Soft Repair at TCAM index 0x00062191 of SLOT 3 PPCR 2
Jul 28 08:01:38:I:IFSR: Soft Repair at TCAM index 0x00062190 of SLOT 3 PPCR 2
Jul 28 08:00:38:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 07:59:38:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
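
When reviewing a long "show logging" capture, it can help to tally the IFSR events per slot, PPCR, and TCAM index before looking for the module-down entry. The following Python is a minimal sketch of that idea, not a supported tool. The regular expressions are inferred only from the log lines shown above and may not match other releases, and the name summarize_ifsr is hypothetical.

```python
import re
from collections import Counter

# Patterns inferred from the log excerpt in this article (assumption:
# other NI releases may word these messages differently).
REPAIR = re.compile(
    r"IFSR: Soft Repair at TCAM index (0x[0-9a-fA-F]+) of SLOT (\d+) PPCR (\d+)"
)
FAILED = re.compile(r"IFSR: Soft Repair failed on SLOT (\d+) PPCR (\d+)")
DOWN = re.compile(r"Module down in slot (\d+), reason (\w+)")

# A few lines copied from the excerpt above, newest first.
SAMPLE_LOG = """\
Jul 28 08:09:55:N:System: Module down in slot 3, reason CARD_DOWN_REASON_REBOOTED. Error Code 0
Jul 28 08:08:39:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
Jul 28 08:08:39:I:IFSR: Soft Repair failed on SLOT 3 PPCR 1
Jul 28 08:07:39:I:IFSR: Soft Repair at TCAM index 0x00062192 of SLOT 3 PPCR 2
"""


def summarize_ifsr(lines):
    """Count IFSR repairs and failures, and collect module-down events."""
    repairs = Counter()   # (slot, ppcr, tcam_index) -> number of repairs
    failures = Counter()  # (slot, ppcr) -> number of failed repairs
    down = []             # (slot, reason) in the order encountered
    for line in lines:
        m = REPAIR.search(line)
        if m:
            index, slot, ppcr = m.groups()
            repairs[(int(slot), int(ppcr), index)] += 1
            continue
        m = FAILED.search(line)
        if m:
            slot, ppcr = m.groups()
            failures[(int(slot), int(ppcr))] += 1
            continue
        m = DOWN.search(line)
        if m:
            down.append((int(m.group(1)), m.group(2)))
    return repairs, failures, down


if __name__ == "__main__":
    repairs, failures, down = summarize_ifsr(SAMPLE_LOG.splitlines())
    for key, count in repairs.most_common():
        print("repairs", key, count)
    for key, count in failures.items():
        print("failures", key, count)
    for slot, reason in down:
        print("module down", slot, reason)
```

A slot that shows the same TCAM index repaired minute after minute, together with "Soft Repair failed" entries and a later CARD_DOWN_REASON_REBOOTED entry, matches the pattern described in this article.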

Examination of the routing table size shows a large number of BGP routes.
============================================================
BEGIN: show ip route summary
CONTEXT: MP
TIME-STAMP: 2155905152 milli sec since device started
============================================================
IP Routing Table - 831983 entries
  7 connected, 3 static, 0 RIP, 70 OSPF, 831903 BGP, 0 ISIS
  Number of prefixes:
  /0: 1 /8: 10 /9: 11 /10: 38 /11: 100 /12: 293 /13: 570 /14: 1132 /15: 1902 /16: 13263 /17: 8020 /18: 13869 /19: 25769 /20: 40295 /21: 47155 /22: 95525 /23: 77200 /24: 436252 /25: 509 /26: 509 /27: 488 /28: 308 /29: 392 /30: 416 /31: 28 /32: 67928 
Nexthop Table Entry - 36 entries
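
To confirm that a captured summary is internally consistent and to see how much of the table is BGP, the output can be parsed and the counts cross-checked. The Python below is a rough sketch under the assumption that the field layout matches the excerpt above; the function name parse_route_summary is hypothetical and the patterns have not been validated against other releases.

```python
import re

# Relevant lines copied from the "show ip route summary" output above.
SUMMARY = """\
IP Routing Table - 831983 entries
  7 connected, 3 static, 0 RIP, 70 OSPF, 831903 BGP, 0 ISIS
  Number of prefixes:
  /0: 1 /8: 10 /9: 11 /10: 38 /11: 100 /12: 293 /13: 570 /14: 1132 /15: 1902 /16: 13263 /17: 8020 /18: 13869 /19: 25769 /20: 40295 /21: 47155 /22: 95525 /23: 77200 /24: 436252 /25: 509 /26: 509 /27: 488 /28: 308 /29: 392 /30: 416 /31: 28 /32: 67928
"""


def parse_route_summary(text):
    """Return (total, counts by protocol, counts by prefix length)."""
    total = int(re.search(r"IP Routing Table - (\d+) entries", text).group(1))
    by_protocol = {
        name: int(count)
        for count, name in re.findall(
            r"(\d+) (connected|static|RIP|OSPF|BGP|ISIS)", text
        )
    }
    by_length = {
        int(length): int(count)
        for length, count in re.findall(r"/(\d+): (\d+)", text)
    }
    return total, by_protocol, by_length


if __name__ == "__main__":
    total, by_protocol, by_length = parse_route_summary(SUMMARY)
    # Both breakdowns should add up to the reported table size.
    print("protocol sum matches:", sum(by_protocol.values()) == total)
    print("prefix-length sum matches:", sum(by_length.values()) == total)
    print("BGP share: {:.2%}".format(by_protocol["BGP"] / total))
```

In this capture both breakdowns add up to the reported 831983 entries, and all but 80 of those entries are BGP routes.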

 
Environment
  • MLXe chassis
  • BR-MLX-100Gx2-CFP2 line card
  • NI 5.8.00ec software
  • BGP
  • Full global BGP routing table
Cause
This issue corresponds to DEFECT000660088, listed in the release notes as fixed in versions NI 6.3.00, NI 6.0.00g, and NI 6.2.00c.
In extremely rare instances, the line card can reload during the course of In-Field Soft Repair. This may be more likely when BGP, an extremely large routing table, and an extremely large number of TCAM entries are in use.

The release notes list "clear bgp neighbor" as a trigger for DEFECT000660088. During the investigation of DEFECT000660088, "clear ip bgp neighbor all" was used in a lab reproduction to trigger the issue. However, the issue has also been observed occurring suddenly in production environments where "clear ip bgp neighbor all" was not executed, so running that command is not required for the issue to occur.
Resolution
As a long-term solution, upgrade to a software version that contains the fix for DEFECT000660088, such as NI 6.3.00, NI 6.0.00g, NI 6.2.00c, or later.

As of the creation of this article, this module reboot has not been observed more than once in a given chassis or module. A short-term solution is therefore not strictly necessary, and efforts should be focused on the software upgrade. If a temporary workaround is still desired, TCAM In-Field Soft Repair (IFSR) can be disabled with the global setting "cam ifsr disable". After the software upgrade, IFSR can be enabled again with "cam ifsr enable".