Input Buffer Drops (RX discards) on VDX switches

Symptoms
  • Ingress discards on the interfaces between switches
  • Ingress discards on several interfaces on switches
Environment
Several hops of switches connected to each other, in a VCS fabric or standalone
Cause
Egress congestion on one or several edge ports towards a server that receives traffic from multiple sources 
Resolution

Head of Line Blocking


Switches with input buffering allow building high bandwidth environments and normally use cut-through switching with smaller buffers to provide practically the same latency and performance for any load up to line rate, with any packet size and with features turned on.

In certain traffic scenarios, when an egress port is overloaded, these switches can suffer from the so-called Head of Line Blocking (HOLB) problem, which in a simple case can be illustrated as:

Head of Line Blocking

Here, with the default FIFO queuing, port P11 is congested and packets for the free port P12 arriving on port P1 have to wait until the P1 ingress queue can send out the packets for P11. If the congestion lasts long enough, the queue will build up and eventually result in input drops on P1.

The scenario can be seen both on a single switch and in chains of switches/Ethernet fabrics. There can be different reasons for the congestion on an egress port, such as:
  • Several ports on the same switch trying to transmit to the same egress port
  • Several ports on other switches trying to transmit to the same port on one switch. In this case the congestion can build up even faster, as switch uplinks will normally have a higher bandwidth, e.g. 40Gb/s, while edge ports will have a lower one, e.g. 10Gb/s
  • One server, either on the same or a different switch, with a higher access port speed, e.g. 10Gb/s, sending traffic to another server with a lower port speed, e.g. 1Gb/s
  • In extreme cases even one-to-one traffic between servers on different switches can result in egress congestion, when the inter-switch link (ISL) has a higher bandwidth. As transmission on an interface is always at line rate, some packets arriving at 10Gb/s could theoretically be buffered on the source switch and then transmitted out on a 40Gb/s ISL to the destination switch, where they have to egress on a 10Gb/s link again
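
To make the blocking described above more concrete, the following minimal Python sketch (purely an illustration of the queuing behaviour, with made-up port names and timings, not switch code) models a single FIFO ingress buffer on P1 feeding a congested egress port P11 and a free egress port P12. Because only the head-of-line frame may leave the FIFO, frames for the idle port P12 end up waiting behind frames for P11, the shared buffer fills up, and ingress drops are counted even for traffic whose egress port is not congested:

from collections import deque

INGRESS_CAPACITY = 8          # frames the P1 input buffer can hold
TICKS = 60

ingress = deque()             # single FIFO input queue on ingress port P1
drops = {"P11": 0, "P12": 0}  # RX discards on P1, counted per destination

for tick in range(TICKS):
    # One frame arrives on P1 per tick, alternating between the congested
    # egress port P11 and the free egress port P12.
    dst = "P11" if tick % 2 == 0 else "P12"
    if len(ingress) < INGRESS_CAPACITY:
        ingress.append(dst)
    else:
        drops[dst] += 1       # input buffer full -> ingress discard on P1

    # Egress readiness: P11 is congested (ready only once every 8 ticks),
    # while P12 is idle and always ready to transmit.
    ready = {"P11": tick % 8 == 0, "P12": True}

    # FIFO: only the head-of-line frame may leave the ingress queue, so a
    # frame for the free port P12 cannot overtake a stuck frame for P11.
    if ingress and ready[ingress[0]]:
        ingress.popleft()

print("ingress drops on P1:", drops)

Running the sketch shows drops for frames destined to P12 as well, even though P12 itself is never congested, which is exactly the effect seen as RX discards on the interfaces between switches.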
There are several different solutions, described further in separate sections, that can be combined or used individually to address the problem both within a single switch and across interconnected switches and fabrics.

Virtual Output Queuing

The most common solution is enabling Virtual Output Queuing (VOQ) on switches. When VOQ is enabled, the switch creates one or more individual queues from each input port to each output port. With this mechanism, all frames received on an input port are buffered in those newly created queues instead of the port's own input buffer. Therefore, frames for one output port are not blocked by a frame in front of them that is destined to a different output port.

Virtual Output Queuing
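
In contrast to the FIFO sketch in the previous section, the fragment below (again only an illustration of the queuing idea with hypothetical port names, not the switch implementation) keeps a separate virtual queue per output port. A stalled egress port then only backs up its own queue, and frames for other ports continue to flow:

from collections import defaultdict, deque

# One virtual output queue per egress port instead of a single FIFO per ingress port.
voq = defaultdict(deque)

def enqueue(dst_port: str, frame: str) -> None:
    # Buffer the frame in the virtual queue belonging to its egress port.
    voq[dst_port].append(frame)

def service(ready: dict) -> list:
    # Dequeue one frame from every virtual queue whose egress port is ready.
    sent = []
    for dst, queue in voq.items():
        if queue and ready.get(dst, False):
            sent.append((dst, queue.popleft()))
    return sent

# Frames for the congested port P11 and the free port P12 arrive interleaved.
for i in range(4):
    enqueue("P11", f"frame-{i}-to-P11")
    enqueue("P12", f"frame-{i}-to-P12")

# P11 is congested and accepts nothing, P12 is free: P12 traffic is not blocked
# and only the queue towards P11 keeps growing.
print(service({"P11": False, "P12": True}))
print(len(voq["P11"]), "frames still queued for P11 only")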

Topology Changes

When there are multiple switches connected together, standalone or in Ethernet fabrics, the topology of the connections and the traffic patterns can have a significant impact on congestion and HOLB build-up.

Squares and multiple hierarchy levels

Problem Topology

The topology above, with SW01 and SW02 being the central/core switches in the environment and hosting the uplinks to the outside, illustrates several problems that can have differing levels of impact on HOLB. Solid lines represent 40Gb/s links and dashed lines are 10Gb/s links. While hosts would often be uplinked to a pair of switches with Port Channels, any single flow will be hashed onto one member link, so a dashed line can represent that flow.

In the first case SW21 and SW22 are connected in a square topology. Some of the issues that tend to be more of a problem in squares, as compared to triangles, are:
  • Suboptimal traffic flow for the diagonal switches - SW01 to SW12 and SW12 to SW01 - which can introduce both latency and congestion. The latter can happen, for example, when traffic from both Host11 and Host12 to Host21 crosses the link between SW01 and SW21. In that case, at times, they can add up to 20Gb/s of traffic coming from SW01 to SW21 that needs to exit on a 10Gb/s link
  • Higher chances of congestion on a core-to-edge ISL. For example, if there are two large flows towards SW21 coming via SW02 and another two via SW01, there is a higher probability of the hashing ending up with three flows coming via one ISL and only one flow via the other
  • The traffic to SW21 from SW22 will compete with roughly half of the traffic to SW21 from every other switch, namely the part that is hashed from SW02 via SW22
  • Higher convergence time for a core switch to learn about an indirect link failure and divert traffic, e.g. for SW01 to start sending traffic for SW22 only to SW02 when the SW21-SW22 link fails
  • Increased number of hops, resulting in a higher delay and latency, in case of a link failure
In the second illustrated case there are three hierarchical levels, with SW11/SW12 connected in a square to SW01/SW02 and then SW111/SW112 connected to SW11/SW12. In addition to the previous issues, it is now possible that the traffic between two hosts on one pair of switches competes with the traffic between two other hosts on a completely different pair of switches. For example, the traffic from Host21 to Host111 and the traffic from Uplink1 to Host11 will both cross the ISL from SW01 to SW11.

The third case represents a ring topology, which can have the same problems as both previous cases, affecting it to a greater or lesser degree depending on the number of switches in the ring. Additionally, the problems related to link failures get much worse as the number of switches in the ring grows.

The next two topologies show a more optimal design. If changing the topology is not feasible for some reason, it may still be possible to improve traffic flows/congestion with some other steps, such as:
  • If there are physical servers that mostly talk to each other, they could be moved to the same switch, so that traffic between them does not cross an ISL
  • If there are hypervisors in use, hosted VMs could be rebalanced to optimize traffic patterns
  • For Ethernet fabric, multiple levels of hierarchy can be broken into separate fabrics, so that VoQ/QoS mechanisms can be applied on the links between switches
  • Group servers so that each switch receives roughly the same amount of traffic. When only 1-2 edge ports on a switch receive, at line rate, most of the traffic from higher speed uplinks, they might not have enough time to send it out. If that traffic is spread over 10-20 servers, the statistical chance of packets arriving on an uplink all exiting via the same edge port is lower, and each port is more likely to have enough time to empty its queue

Core/Edge topology with triangles

Triangle Topology

A Core/Edge two-level topology with triangles might be best suited for environments where most of the traffic is north-south. Normally all hosts in this case would be connected to edge switches, with core switches only having higher speed southbound downlinks to the edges and northbound uplinks to the outside (see the "Special Cases" section below).

It is still possible that a single port on an edge switch will be overloaded with traffic from other hosts and uplinks, which should be taken care of with VoQ, but now the traffic towards one switch no longer interferes with the traffic to the other switches. The chances of an edge switch uplink being congested are lower, as statistically the traffic distribution from the core switches to the edge switches should be more even, especially with a higher number of edge switches. There is also no problem with going from higher speed interfaces to lower speed ones on the core switches.

Spine/Leaf Clos topology

Clos Topology

A Spine/Leaf topology is a better option when most of the traffic in the environment is east-west. There are no hosts or even uplinks on the Spine switches and no links between them, so it is a Clos network, with each edge switch always two hops away from all the others.

The chances of edge uplink congestion should be even lower now, especially with an increasing number of Spine switches, as there are multiple paths between each pair of edge switches and all access interfaces have lower bandwidth than the interfaces between switches.

Port Settings Changes

Depending on the number of ports that experience congestion and on the access/core port traffic over-subscription on the switch, two different approaches to modifying port speed and configuration can help:
  • Add extra links between switches. This will double the ingress buffers available on the switch uplinks, without negatively affecting flow delay, as each flow will still take one interface/path. Additionally, this can help with receive discards on edge ports in cases when the switch is oversubscribed and the uplink bandwidth is not enough to accommodate the amount of traffic received on the edge ports.
  • Increase the bandwidth on permanently overloaded edge ports, e.g. by converting them to port channels or using higher speed access ports, so that traffic can be sent out faster

PFC and Ethernet Pause

Priority Flow Control (PFC) is a QoS technique, defined in the Data Center Bridging (DCB) Ethernet enhancements, used to provide lossless transport for certain types of traffic (normally storage). The idea behind this technology is to use PAUSE frame signalling but, as opposed to the legacy Ethernet pause mechanism, to send the PAUSE for a specific Class of Service (CoS) used by this type of traffic. This CoS is normally the one assigned to the lossless class of traffic, e.g. FCoE with CoS 3.

It is also possible to stop the discards by enabling PFC pause frames inside the fabric for the default traffic CoS on ISLs, and then PFC or Ethernet pause on the edge ports towards the servers/devices connected to them. All of those edge devices should also be configured to support PFC or Ethernet pause, so that they honor the PAUSE signals received. Technically this does not eliminate the HOLB issue, but makes it more intelligent by asking the connected devices to stop transmitting when the input buffer is full. Since TCP/IP was not originally designed as a lossless protocol, this is more likely to result in worse performance than allowing RED to drop some traffic, as PFC will completely pause the traffic on some ports and create upstream backpressure until the congestion is cleared, while with RED the traffic can continue to flow towards the hosts/ports that do not experience congestion.

TCP has its own, better built-in mechanisms to deal with the packet drops that RED introduces, rather than having the traffic on a port stopped; in fact, loss and delay are used as the feedback mechanisms in most TCP congestion control algorithms, as described in more detail in the following section.

TCP congestion control

In TCP, the congestion window is one of the factors that determine the number of bytes that can be outstanding at any time. The congestion window is maintained by the sender and is a means of stopping the links between the sender and the receiver from becoming overloaded with too much traffic. It is calculated by estimating how much congestion there is on the link, and the sender compares it with the TCP receive window advertised by the receiver to determine how much data can be sent at any given time. When a connection is set up, the congestion window, a value maintained independently at each host, is set to a small multiple of the MSS allowed on that connection. If all the transmitted segments are received and the acknowledgments reach the sender on time, the window keeps growing exponentially until a timeout occurs or the sender reaches its limit.

When there is enough data to transmit (elephant flows), just two flows can easily overload the same egress port, since initially the TCP congestion window and session speed grow fast, especially in a high bandwidth, low latency environment, as illustrated by the single session iperf TCP test below, where it takes less than two seconds to fill a 10Gb/s link:
 
HOST1~ # iperf -s -i 1
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 10.8.8.1 port 5001 connected with 10.8.4.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  1.08 GBytes  9.31 Gbits/sec
[  3]  1.0- 2.0 sec  1.09 GBytes  9.33 Gbits/sec
[  3]  2.0- 3.0 sec  1.09 GBytes  9.33 Gbits/sec
[  3]  3.0- 4.0 sec  1.09 GBytes  9.33 Gbits/sec
[  3]  4.0- 5.0 sec  1.09 GBytes  9.33 Gbits/sec

The session speed keeps growing until it hits some limit, such as the interface speed above or a QoS policy, but after a couple of drops and retransmits it quickly adjusts to the minimum link speed available between the sender and the receiver.
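
The growth pattern behind the iperf output above can be approximated with a very simplified slow start/AIMD model. The sketch below is idealized (assumed 10Gb/s bottleneck, 0.1 ms RTT, an initial window of 10 segments, ignoring the receive window, delayed ACKs and the differences between real congestion control algorithms), but it shows why a flow ramps up towards link speed almost immediately and then oscillates around the bottleneck rate after drops:

# Idealized TCP throughput growth: slow start doubles the congestion window
# every RTT until a loss, then AIMD (halve on loss, +1 MSS per RTT) takes over.
MSS = 1460                     # bytes per segment
RTT = 0.0001                   # 100 microseconds, typical for a low latency DC
BOTTLENECK_BPS = 10e9          # 10Gb/s egress link

cwnd = 10 * MSS                # common initial window of 10 segments
elapsed = 0.0
in_slow_start = True

for rtt_round in range(60):
    rate_bps = cwnd * 8 / RTT                  # rough sending rate this RTT
    if rate_bps >= BOTTLENECK_BPS:             # queue overflows -> packet drop
        cwnd //= 2                             # multiplicative decrease
        in_slow_start = False
    elif in_slow_start:
        cwnd *= 2                              # exponential growth
    else:
        cwnd += MSS                            # additive increase
    elapsed += RTT
    print(f"t={elapsed * 1000:5.1f} ms  cwnd={cwnd // MSS:4d} segments  "
          f"~{min(rate_bps, BOTTLENECK_BPS) / 1e9:4.1f} Gb/s")

With two or more such flows converging on the same 10Gb/s egress port, each of them ramps up in the same way, so the port is overloaded within a fraction of a second; the drops introduced by RED or by a full queue are what makes each flow back off to its fair share.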

Other QoS Mechanisms

Some other QoS mechanisms, such as adjusting buffers or configuring shaping and policing, can also address HOLB problems.

Configuring those policies just to prevent input buffer drops is somewhat overcomplicated and can be justified only when there is a need to additionally provide differentiated treatment for various traffic types.

Increasing buffers can help when the issues are a result of short bursts only; sustained congestion will eventually fill even a larger buffer. Additionally, larger buffers contribute to other potential problems, such as variable latency and delay (jitter), and, if the traffic is sensitive to those, the resulting issues will be even harder to address and will require much more complex QoS policies.
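
As a rough back-of-the-envelope check (using the 2 MB recommended per-port burst limit and 8 MB maximum from Appendix A together with the 40Gb/s/10Gb/s speed mix discussed earlier, purely as assumed example numbers), the Python sketch below estimates how long a buffer of a given size lasts under a sustained rate mismatch; even the maximum buffer absorbs only a few milliseconds of line-rate overload:

def time_to_fill_buffer(buffer_bytes: float, ingress_bps: float, egress_bps: float) -> float:
    # Seconds until the buffer overflows under a sustained ingress/egress mismatch.
    excess_bps = ingress_bps - egress_bps      # rate at which the backlog grows
    if excess_bps <= 0:
        return float("inf")                    # no overload, the buffer never fills
    return buffer_bytes * 8 / excess_bps

# Two 10Gb/s senders towards a single 10Gb/s edge port:
print(f"2 MB buffer: {time_to_fill_buffer(2 * 1024 * 1024, 20e9, 10e9) * 1000:.2f} ms")
print(f"8 MB buffer: {time_to_fill_buffer(8 * 1024 * 1024, 20e9, 10e9) * 1000:.2f} ms")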

Special cases

Centralized application servers

In some environments there can be hosts or devices that provide services for many other hosts and exchange constant traffic with them and with each other, e.g. distributed storage nodes or load balancers. As illustrated in the diagram for the core/edge topology, those can be connected to the core switches with higher access speed links, similar to uplinks.

Remote SPAN sessions

In some cases there is a need to have traffic mirroring for troubleshooting, monitoring or some other purposes. Port mirroring is essentially an endless unidirectional elephant flow, as the traffic from the source port being mirrored to the destination monitor port will never stop.

It is normally fine to use Remote SPAN and cross ISLs for troubleshooting, as those are normally one-to-one mirror sessions and are de-configured relatively quickly, once the necessary data has been collected. If port mirroring is to run permanently, care should be taken when selecting the destination port. Depending on the amount of traffic and the number of ports being mirrored, the following considerations could help:
  • connect monitoring destination to the core switches with high access speed links and Port Channels, effectively treating them as application servers in the previous case
  • make sure the ISL that is used to transport RSPAN session is not being mirrored itself
  • use multiple monitoring probes, connected to each switch, so that the traffic does not cross ISLs
  • if there has to be a single monitoring destination, connect it to an intermediate switch, which is then uplinked to each of the other switches where the source ports have to be mirrored
 

Appendix A. Configuration options for addressing input drops on VDX 6740

The options below can help when input drops are a result of HOLB and congestion.

Note that receive discards can also be a result of other problems, such as MAC moves or corrupt packets. As the VDX 6740 is a cut-through switch, some of those problems might be detected on ingress at a later point, when parts of the packet have already been sent towards the destination switch in the fabric and have to be invalidated upstream.

Configuring dynamic buffering

The VDX 6740 series provides knobs to configure QoS buffer limits, thus enabling customers to adapt their networks to application needs.

Commands introduced for this feature:
  • Configure Ingress buffering upper-limit for handling high ingress bursts:
With enhanced shared dynamic buffering mechanism, an interface is capable of bursting up to the recommended 2MB limit. Though a maximum of 8MB is allowed, you should consult your Brocade Engineer, as it may impact the performance of the other ports that may need to burst at the same time.
sw0(config-rbridge-1)# qos rcv-queue limit <buffering upper-limit>
buffering_upper_limit defines the upper limit of buffering for the port. The range of queue limit values is from 128 KB through 8 MB. While any value within this range is valid, the recommended values are 128, 256, 512, 1024, and 2048.

Default may vary depending on the NOS version and can be verified with
show qos rcv-queue interface tengigabitethernet X/Y/Z
  • Configure Egress buffering upper-limit for a “many-to-one” traffic pattern:
With enhanced shared dynamic buffering mechanism, an interface is capable of bursting up to the recommended 2MB limit. Though a maximum of 8MB is allowed, you should consult your Brocade Engineer, as it may impact the performance of the other ports that may need to burst at the same time.
sw0(config-rbridge-1)# qos tx-queue limit <buffering upper-limit>

buffering_upper_limit defines the upper limit of buffering for the port. The range of queue limit values is from 128 KB through 8 MB. While any value within this range is valid, recommended values are 128, 256, 512, 1024, and 2048.

Default may vary depending on the NOS version and can be verified with
show qos tx-queue interface tengigabitethernet X/Y/Z


Configuring Random Early Detection

Head of line blocking issues can be avoided on the VDX 6740 by configuring Random Early Detection (RED), which can be done on a per-port basis for each of the eight traffic classes. Additionally, RED will drop packets for some TCP flows at the configured levels in advance, prior to the interface becoming fully congested. That helps to avoid the TCP global synchronization problem, which can happen when many TCP flows drop packets, back off, and increase their speed again in lockstep.
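
The parameters of the RED profile defined in the next step (min-threshold 65, max-threshold 95, drop-probability 50) behave roughly as in the classic RED curve sketched below in Python (a simplification for illustration only; the switch performs its own queue averaging and hardware-level accounting, which is not modelled here). Nothing is dropped below the minimum threshold, the drop chance ramps linearly up to the configured probability between the thresholds, and above the maximum threshold packets are dropped:

def red_drop_probability(queue_fill_pct: float,
                         min_threshold: float = 65.0,
                         max_threshold: float = 95.0,
                         drop_probability: float = 50.0) -> float:
    # Classic RED drop curve, with all values expressed in percent.
    if queue_fill_pct < min_threshold:
        return 0.0                             # below min-threshold: no drops
    if queue_fill_pct >= max_threshold:
        return 100.0                           # above max-threshold: drop
    span = max_threshold - min_threshold
    return (queue_fill_pct - min_threshold) / span * drop_probability

for fill in (50, 65, 80, 95, 100):
    print(f"queue {fill:3d}% full -> drop probability {red_drop_probability(fill):5.1f}%")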

To configure RED on a VDX, a RED profile must first be defined:
RB1(config)# qos red-profile 1 min-threshold 65 max-threshold 95 drop-probability 50
Then it can be applied on an interface for each or all traffic classes. In order to find the TC value used on the egress port, check the output of “show qos interface ...” for the ingress port of the switch and find the CoS-to-TC mapping. In most cases the ingress port will be an ISL and TC6 will be used by default.
RB1(config-Port-channel-1)# qos random-detect cos 6 red-profile-id 1

Note that this is an edge port configuration only and cannot be applied on an ISL.

Any RED drops will now be reflected in the corresponding counters:
 
RB1# show qos red statistics interface port-channel 1
Statistics for interface:  1 (RbridgeId 1)
        traffic-class: 0, ProfileId: 1
        Port Statistics:
                Packets Dropped: 0, Queue Full Drops: 0
                Bytes Dropped: 0, Queue Full Drops: 0

Configuring PFC

While not recommended for the default traffic, PFC can be enabled for a specific traffic class (e.g. iSCSI), as in the example below for CoS 4.

1. Enable PFC for CoS4
VDX(config-cee-map-default)# priority-table 2 2 2 2 1 2 2 15.0
2012/09/08-01:05:36, [SSMD-1302], 9906,, INFO, SW-138,  CEEMap default priority table 2,2,2,1,2,2,2,15.0 is changed to 2,2,2,2,1,2,2,15.0.
2. Map the incoming iSCSI traffic to CoS 4:
a) If the incoming traffic is tagged, you can enable PFC for the corresponding priority in the CEE map and apply the map on the edge port (“lldp disable” may be required if the other end does not support LLDP negotiation).
VDX# show running-config int t 134/0/21
interface TenGigabitEthernet 134/0/21
 cee default
 fabric isl enable
 fabric trunk enable
 lldp disable
 no shutdown

b) If the incoming traffic is untagged, PFC cannot be generated and Ethernet pause has to be used instead. The CoS value can be set on the edge port:
VDX-134# show running-config int t 134/0/21
interface TenGigabitEthernet 134/0/21
 fabric isl enable
 fabric trunk enable
 qos cos 4
 qos flowcontrol tx on rx on
 no shutdown

Note that some QoS operations, such as marking or shaping, are not allowed for the lossless priority.


 