Developing an Optimum Design for Layer 3
To achieve high availability and fast convergence in the Cisco enterprise campus network, the designer needs to manage multiple objectives, including the following:
- Managing oversubscription and bandwidth
- Supporting link load balancing
- Routing protocol design
This section reviews design models and recommended practices for high availability and fast convergence in Layer 3 of the Cisco enterprise campus network.
Managing Oversubscription and Bandwidth
Typical campus networks are designed with oversubscription, as illustrated in Figure 2-10. The rule-of-thumb recommendation for data oversubscription is 20:1 for access ports on the access-to-distribution uplink. The recommendation is 4:1 for the distribution-to-core links. When you use these oversubscription ratios, congestion may occur infrequently on the uplinks. QoS is needed for these occasions. If congestion is frequently occurring, the design does not have sufficient uplink bandwidth.
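The ratios above are simple capacity arithmetic. The following Python sketch shows one way to check a design against the 20:1 and 4:1 rules of thumb; the port counts and link speeds are hypothetical values chosen for illustration, not figures from this chapter.

```python
def oversubscription(downstream_gbps: float, uplink_gbps: float) -> float:
    """Return offered downstream capacity divided by uplink capacity."""
    if uplink_gbps <= 0:
        raise ValueError("uplink capacity must be positive")
    return downstream_gbps / uplink_gbps


def within_guideline(ratio: float, limit: float) -> bool:
    """True when the ratio does not exceed the rule-of-thumb limit."""
    return ratio <= limit


# Hypothetical access switch: 48 x 1 Gb/s ports, 2 x 1 Gb/s uplinks.
access_ratio = oversubscription(48 * 1, 2 * 1)   # 24:1, above the 20:1 guideline

# Hypothetical distribution pair: 8 x 10 Gb/s downlinks, 2 x 10 Gb/s core uplinks.
core_ratio = oversubscription(8 * 10, 2 * 10)    # 4:1, at the guideline

print(f"access-to-distribution {access_ratio:.0f}:1 ok={within_guideline(access_ratio, 20)}")
print(f"distribution-to-core   {core_ratio:.0f}:1 ok={within_guideline(core_ratio, 4)}")
```

A ratio inside the guideline still allows occasional congestion, which is why QoS remains part of the design.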
Figure 2-10 Managing Oversubscription and Bandwidth
As access layer bandwidth capacity increases to 1 Gb/s, multiples of 1 Gb/s, and even 10 Gb/s, the bandwidth aggregation on the distribution-to-core uplinks might be supported on many Gigabit Ethernet EtherChannels, on 10 Gigabit Ethernet links, and on 10 Gigabit EtherChannels.
Bandwidth Management with EtherChannel
As bandwidth from the distribution layer to the core increases, oversubscription to the access layer must be managed, and some design decisions must be made.
Just adding more uplinks between the distribution and core layers leads to more peer relationships, with an increase in associated overhead.
EtherChannels can reduce the number of peers by creating a single logical interface. However, you must consider how routing protocols will react to the failure of a single link in the bundle:
- OSPF running on a Cisco IOS Software-based switch will notice a failed link, and will increase the link cost. Traffic is rerouted, and this design leads to a convergence event.
- OSPF running on a Cisco Hybrid-based switch will not change link cost. Because it will continue to use the EtherChannel, this may lead to an overload in the remaining links in the bundle as OSPF continues to divide traffic equally across channels with different bandwidths.
- EIGRP might not change link cost, because the protocol looks at the end-to-end cost. This design might also overload remaining links.
The EtherChannel Min-Links feature is supported on LACP EtherChannels. This feature allows you to configure the minimum number of member ports that must be in the link-up state and bundled in the EtherChannel for the port channel interface to transition to the link-up state. You can use the EtherChannel Min-Links feature to prevent low-bandwidth LACP EtherChannels from becoming active.
Bandwidth Management with 10 Gigabit Interfaces
Upgrading the uplinks between the distribution and core layers to 10 Gigabit Ethernet links is an alternative design for managing bandwidth. The 10 Gigabit Ethernet links can also support the increased bandwidth requirements.
This is a recommended design:
- Unlike the multiple link solution, 10 Gigabit Ethernet links do not increase routing complexity. The number of routing peers is not increased.
- Unlike the EtherChannel solution, the routing protocols will have the ability to deterministically select the best path between the distribution and core layer.
Link Load Balancing
In Figure 2-11, many equal-cost, redundant paths are provided in the recommended network topology from one access switch to the other across the distribution and core switches. From the perspective of the access layer, there are at least three sets of equal-cost, redundant links to cross to reach another building block, such as the data center.
Figure 2-11 CEF Load Balancing (Default Behavior)
Cisco Express Forwarding (CEF) is a deterministic algorithm. As shown in the figure, when packets that all use the same input value to the CEF hash traverse the network, the same "go to the right" or "go to the left" decision is made at each redundant path. When this results in some redundant links being ignored or underutilized, the network is said to be experiencing CEF polarization.
To avoid CEF polarization, you can tune the input into the CEF algorithm across the layers in the network. The default hash input is the Layer 3 source and destination addresses. If you change this input to Layer 3 plus Layer 4 information, the output hash value also changes.
As a recommendation, use alternating hashes in the core and distribution layer switches:
- In the core layer, continue to use the default, which is based on only Layer 3 information.
- In the distribution layer, use the Layer 3 plus Layer 4 information as input into the CEF hashing algorithm with the command Dist2-6500(config)#mls ip cef load-sharing full.
This alternating approach helps eliminate the always-right or always-left biased decisions and helps balance the traffic over equal-cost, redundant links in the network.
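The effect of alternating hash inputs can be illustrated with a small simulation. This Python sketch is not the CEF algorithm: SHA-256 is used only as a deterministic stand-in hash, and the two-tier topology, addresses, and port numbers are hypothetical. It shows that when both tiers hash on the same Layer 3 fields, each core switch uses only one of its two egress links, whereas hashing on Layer 3 plus Layer 4 at the distribution layer lets the core use both.

```python
import hashlib


def bucket(*fields) -> int:
    """Stand-in hash (not real CEF): map header fields to link 0 or link 1."""
    digest = hashlib.sha256("|".join(map(str, fields)).encode()).digest()
    return digest[0] & 1


def l3_hash(flow) -> int:
    src, dst, _sport, _dport = flow
    return bucket(src, dst)            # Layer 3 source and destination only


def l3_l4_hash(flow) -> int:
    return bucket(*flow)               # Layer 3 plus Layer 4 ports


def core_links_used(flows, dist_hash, core_hash):
    """Per core switch, collect which of its two egress links carry traffic."""
    used = {0: set(), 1: set()}
    for flow in flows:
        core_switch = dist_hash(flow)            # distribution uplink choice selects the core switch
        used[core_switch].add(core_hash(flow))   # that core switch then selects its egress link
    return used


# Hypothetical flows: 200 clients talking to one server on TCP port 443.
flows = [(f"10.1.{i % 8}.{i}", "10.9.0.10", 40000 + i, 443) for i in range(200)]

polarized = core_links_used(flows, l3_hash, l3_hash)
alternating = core_links_used(flows, l3_l4_hash, l3_hash)

print("same hash at both tiers:", polarized)
print("alternating hash inputs:", alternating)
```

In the polarized case, every flow that reaches core switch 0 already produced hash value 0 at the distribution layer, so the core repeats the same decision and its other link stays idle.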
EtherChannel Load Balancing
EtherChannel allows load sharing of traffic among the links in the channel and redundancy in the event that one or more links in the channel fail.
You can tune the hashing algorithm used to select the specific EtherChannel link on which a packet is transmitted. You can use the default Layer 3 source and destination information, or you can add a level of load balancing to the process by adding the Layer 4 TCP/IP port information as an input to the algorithm.
Figure 2-12 illustrates some results from experiments at Cisco in a test environment using a typical IP addressing scheme of one subnet per VLAN, two VLANs per access switch, and the RFC 1918 private address space. The default Layer 3 hash algorithm provided about one-third to two-thirds utilization. When the algorithm was changed to include Layer 4 information, nearly full utilization was achieved with the same topology and traffic pattern.
Figure 2-12 EtherChannel Load Balancing
The recommended practice is to use Layer 3 plus Layer 4 load balancing to provide as much information as possible for input to the EtherChannel algorithm, which gives the most uniform utilization of EtherChannel members. The port-channel load-balance command presents more unique values to the hashing algorithm, for example: dist1-6500(config)#port-channel load-balance src-dst-port.
To achieve the best load balancing, use two, four, or eight ports in the port channel.
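One way to see why two, four, or eight ports balance best is to assume that the hash result selects one of eight buckets that are then divided among the member links. The eight-bucket figure is an assumption about classic Catalyst EtherChannel behavior and is not stated in this section; the Python sketch below only illustrates the arithmetic under that assumption.

```python
def bucket_shares(num_links: int, num_buckets: int = 8) -> list:
    """Split hash buckets across member links as evenly as possible.

    num_buckets=8 is an assumed platform value, not taken from this chapter.
    """
    if not 1 <= num_links <= num_buckets:
        raise ValueError("num_links must be between 1 and num_buckets")
    base, extra = divmod(num_buckets, num_links)
    # The first `extra` links each receive one additional bucket.
    return [base + 1 if i < extra else base for i in range(num_links)]


for links in range(2, 9):
    shares = bucket_shares(links)
    balanced = len(set(shares)) == 1
    print(f"{links} links -> buckets per link {shares} balanced={balanced}")
```

Under this assumption, only link counts that divide eight evenly give every member the same share; three links, for example, end up with a 3:3:2 split.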
Routing Protocol Design
This section reviews design recommendations for routing protocols in the enterprise campus.
Routing protocols are typically deployed across the distribution-to-core and core-to-core interconnections.
Layer 3 routing design can be used in the access layer, too, but this design is currently not as common.
Layer 3 routing protocols are used to quickly reroute around failed nodes or links while providing load balancing over redundant paths.
Build Redundant Triangles
For optimum distribution-to-core layer convergence, build redundant triangles, not squares, to take advantage of equal-cost, redundant paths for the best deterministic convergence.
The topology connecting the distribution and core switches should be built using triangles, with equal-cost paths to all redundant nodes. The triangle design is shown in Figure 2-13 Model A, and uses dual equal-cost paths to avoid timer-based, nondeterministic convergence. Instead of indirect neighbor or route-loss detection using hellos and dead timers, the triangle design failover is hardware based and relies on physical link loss to mark a path as unusable and reroute all traffic to the alternate equal-cost path. There is no need for OSPF or EIGRP to recalculate a new path.
Figure 2-13 Build Redundant Triangles
In contrast, the square topology shown in Figure 2-13 Model B requires routing protocol convergence to fail over to an alternate path in the event of a link or node failure. It is possible to build a topology that does not rely on equal-cost, redundant paths to compensate for limited physical fiber connectivity or to reduce cost. However, with this design, it is not possible to achieve the same deterministic convergence in the event of a link or node failure, and for this reason the design will not be optimized for high availability.
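The difference between the two models can be checked by counting equal-cost paths. The Python sketch below is an illustration only: the node names are hypothetical, every link is assumed to have the same cost (so hop count stands in for the routing metric), and "R" represents any remote block attached to both core switches.

```python
from collections import deque


def equal_cost_paths(links, src, dst) -> int:
    """Count shortest (equal hop-count) paths between src and dst."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    dist = {src: 0}
    count = {src: 1}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:                    # first time reached: shortest distance
                dist[nxt] = dist[node] + 1
                count[nxt] = count[node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:      # another shortest path to nxt
                count[nxt] += count[node]
    return count.get(dst, 0)


core_side = [("C1", "C2"), ("C1", "R"), ("C2", "R")]

# Model A (triangles): each distribution switch connects to both core switches.
triangle = core_side + [("D1", "C1"), ("D1", "C2"), ("D2", "C1"), ("D2", "C2"), ("D1", "D2")]

# Model B (square): each distribution switch connects to only one core switch.
square = core_side + [("D1", "C1"), ("D2", "C2"), ("D1", "D2")]

print("triangle:", equal_cost_paths(triangle, "D1", "R"))
print("square:  ", equal_cost_paths(square, "D1", "R"))
```

With two equal-cost paths already installed, the triangle design can shift traffic on link loss alone; the square design has a single best path, so the routing protocol must compute a new one.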
Figure 2-14 Use Passive Interfaces at the Access Layer
Peer Only on Transit Links
Another recommended practice is to limit unnecessary peering across the access layer by peering only on transit links.
By default, the distribution layer switches send routing updates and attempt to peer across the uplinks from the access switches to the remote distribution switches on every VLAN. This is unnecessary and wastes CPU processing time.
Figure 2-14 shows an example network where, with 4 VLANs per access switch and 3 access switches, 12 unnecessary adjacencies are formed. Only the adjacency on the link between the distribution switches is needed. This redundant Layer 3 peering has no benefit from a high-availability perspective, and only adds load in terms of memory, routing protocol update overhead, and complexity. In addition, in the event of a link failure, it is possible for traffic to transit through a neighboring access layer switch, which is not desirable.
As a recommended practice, limit unnecessary routing peer adjacencies by configuring the ports toward Layer 2 access switches as passive, which will suppress the advertising of routing updates. If a distribution switch does not receive routing updates from a potential peer on a specific interface, it will not need to process these updates, and it will not form a neighbor adjacency with the potential peer across that interface.
There are two approaches to configuring passive interfaces for the access switches:
- Use the passive-interface default command, and selectively use the no passive-interface command to enable a neighboring relationship where peering is desired.
- Use the passive-interface command to selectively make specific interfaces passive.
Passive interface configuration example for OSPF:
AGG1(config)#router ospf 1
AGG1(config-router)#passive-interface Vlan 99
! Or
AGG1(config)#router ospf 1
AGG1(config-router)#passive-interface default
AGG1(config-router)#no passive-interface Vlan 99
Passive interface configuration example for EIGRP:
AGG1(config)#router eigrp 1
AGG1(config-router)#passive-interface Vlan 99
! Or
AGG1(config)#router eigrp 1
AGG1(config-router)#passive-interface default
AGG1(config-router)#no passive-interface Vlan 99
You should use whichever technique requires the fewest lines of configuration or is the easiest for you to manage.
Summarize at the Distribution Layer
A hierarchical routing design reduces routing update traffic and avoids unnecessary routing computations. Such a hierarchy is achieved through allocating IP networks in contiguous blocks that can be easily summarized by a dynamic routing protocol.
It is a recommended practice to configure route summarization at the distribution layer to advertise a single summary route to represent multiple IP networks within the building (switch block). As a result, fewer routes will be advertised through the core layer and subsequently to the distribution layer switches in other buildings (switch blocks). If the routing information is not summarized toward the core, EIGRP and OSPF require interaction with a potentially large number of peers to converge around a failed node.
Summarization at the distribution layer optimizes the rerouting process. If a link to an access layer device goes down, return traffic at the distribution layer to that device is dropped until the IGP converges. When summarization is used and the distribution nodes send summarized information toward the core, an individual distribution node does not advertise loss of connectivity to a single VLAN or subnet. This means that the core does not know that it cannot send traffic to the distribution switch where the access link has failed. Summaries limit the number of peers that an EIGRP router must query or the number of link-state advertisements (LSA) that OSPF must process, and thereby speed the rerouting process.
Summarization should be performed at the boundary where the distribution layer of each building connects to the core. The method for configuring route summarization varies, depending on the IGP being used. Route summarization is covered in detail in Chapter 3, "Developing an Optimum Design for Layer 3." These designs require a Layer 3 link between the distribution switches, as shown in Figure 2-15, to allow the distribution node that loses connectivity to a given VLAN or subnet the ability to reroute traffic across the distribution-to-distribution link. To be effective, the address space selected for the distribution-to-distribution link must be within the address space being summarized.
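The requirement that the distribution-to-distribution link fall inside the summarized address space can be checked mechanically. The following Python sketch uses the standard ipaddress module; the subnets are hypothetical examples of a contiguous block allocated to one building, not addresses taken from the figures.

```python
import ipaddress


def covering_summary(subnets):
    """Return the smallest single prefix that contains every given subnet."""
    nets = [ipaddress.ip_network(s) for s in subnets]
    summary = nets[0]
    # Widen the prefix one bit at a time until it covers all subnets.
    while not all(n.subnet_of(summary) for n in nets):
        summary = summary.supernet()
    return summary


# Hypothetical building block: six access VLAN subnets, 10.1.0.0/24 to 10.1.5.0/24.
vlan_subnets = [f"10.1.{i}.0/24" for i in range(6)]
summary = covering_summary(vlan_subnets)

# Hypothetical addressing choices for the Layer 3 link between the distribution switches.
link_inside = ipaddress.ip_network("10.1.7.252/30")
link_outside = ipaddress.ip_network("10.255.0.0/30")

print("summary advertised toward the core:", summary)
print("link inside summary: ", link_inside.subnet_of(summary))
print("link outside summary:", link_outside.subnet_of(summary))
```

A link addressed outside the summary would not be covered by the advertised prefix, which is the situation the last sentence above warns against.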
Figure 2-15 Summarize at the Distribution Layer
Summarization relies on a solid network addressing design.
First-hop redundancy or default-gateway redundancy is an important component in convergence in a highly available hierarchical network design.
First-hop redundancy allows a network to recover from the failure of the device acting as the default gateway for end nodes on a physical segment. When the access layer is Layer 2, the distribution layer switches act as the default gateway for the entire Layer 2 domain that they support, as illustrated in Figure 2-16.
Figure 2-16 First-Hop Redundancy
A first-hop redundancy protocol is needed only if the design implements Layer 2 between the access switch and the distribution switch. If Layer 3 is supported to the access switch, the default gateway for end devices is at the access level.
In Cisco deployments, HSRP, developed by Cisco, is typically used as the first-hop redundancy protocol (FHRP). VRRP is an Internet Engineering Task Force (IETF) standards-based method of providing default-gateway redundancy. More deployments are starting to use GLBP, which can more easily achieve load balancing on the uplinks from the access layer to the distribution layer while still providing first-hop redundancy and failure protection.
HSRP and VRRP with Cisco enhancements both provide a robust method of backing up the default gateway, and can provide subsecond failover to the redundant distribution switch when tuned properly. HSRP is the recommended protocol over VRRP because it is a Cisco-owned standard, which allows for the rapid development of new features and functionality before VRRP. VRRP is the logical choice over HSRP when interoperability with other vendor devices is required.
HSRP or GLBP timers can be reliably tuned to achieve 800-ms convergence for link or node failure in the Layer 2 and Layer 3 boundary in the building distribution layer. The following configuration snippet shows how HSRP can be tuned down from its default 3-second hello timer and 10-second hold timer in a campus environment to achieve subsecond convergence on aggregation switches:
interface Vlan5
 ip address 10.1.5.3 255.255.255.0
 ip helper-address 10.5.10.20
 standby 1 ip 10.1.5.1
 standby 1 timers msec 200 msec 750
 standby 1 priority 150
 standby 1 preempt delay minimum 180
Preempt Delay Tuning
One important factor to take into account when tuning default gateway redundancy using HSRP or another protocol is its preemptive behavior.
Preemption causes the primary HSRP peer to re-assume the primary role when it comes back online after a failure or maintenance event. Preemption is the desired behavior because the RSTP root should be the same device as the HSRP primary for a given subnet or VLAN. However, if HSRP and RSTP are not synchronized after failure recovery, the interconnection between the distribution switches can become a transit link, and traffic takes a multihop Layer 2 path to its default gateway.
HSRP preemption needs to be aware of switch boot time and connectivity to the rest of the network. Preempt delay must be longer than the switch boot time, which includes the time needed for the following:
- Layer 1 traffic forwarding on line cards
- Layer 2 STP convergence
- Layer 3 IGP convergence
It is possible for HSRP neighbor relationships to form and preemption to occur before the primary switch has Layer 3 connectivity to the core. If this happens, traffic from the access layer can be dropped until full connectivity is established to the core.
The recommended practice is to measure the system boot time, and set the HSRP preempt delay with the standby preempt delay minimum command to 50 percent greater than this value. This ensures that the HSRP primary distribution node has established full connectivity to all parts of the network before HSRP preemption is allowed to occur.
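The 50 percent rule is straightforward to apply once the boot time has been measured. This Python sketch shows the calculation; the 120-second boot time is a hypothetical measurement, chosen because it yields the 180-second value used in the earlier HSRP configuration snippet.

```python
import math


def preempt_delay_seconds(measured_boot_seconds: float, margin: float = 0.5) -> int:
    """Return the boot time plus a safety margin (50 percent by default), rounded up."""
    if measured_boot_seconds <= 0:
        raise ValueError("boot time must be positive")
    return math.ceil(measured_boot_seconds * (1 + margin))


boot_time = 120  # hypothetical measured boot time in seconds
delay = preempt_delay_seconds(boot_time)
print(f"standby 1 preempt delay minimum {delay}")
```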
Figure 2-17 demonstrates the positive impact that proper HSRP tuning can have on network convergence.
Figure 2-17 HSRP Preempt Delay Tuning
Overview of Gateway Load Balancing Protocol
GLBP is a first-hop redundancy protocol designed by Cisco that allows packet load sharing between groups of redundant routers.
When HSRP or VRRP is used to provide default-gateway redundancy, the backup members of the peer relationship are idle, waiting for a failure event to occur before they take over and actively forward traffic. Methods to use uplinks with HSRP or VRRP are difficult to implement and manage. In one technique, the HSRP and STP or RSTP roots alternate between distribution node peers, with the even VLANs homed on one peer and the odd VLANs homed on the alternate. Another technique uses multiple HSRP groups on a single interface and uses DHCP to alternate between the multiple default gateways. These techniques work but are not optimal from a configuration, maintenance, or management perspective.
GLBP provides all the benefits of HSRP and includes load balancing, too. For HSRP, a single virtual MAC address is given to the endpoints when the endpoints use Address Resolution Protocol (ARP) to learn the physical MAC address of their default gateways. GLBP allows a group of routers to function as one virtual router by sharing one virtual IP address while using multiple virtual MAC addresses for traffic forwarding. Figure 2-18 shows a sample configuration supporting GLBP and its roles.
Figure 2-18 Gateway Load Balancing Protocol
When an endpoint uses ARP for its default gateway, by default the virtual MACs are provided by the GLBP active virtual gateway (AVG) on a round-robin basis. These gateways that assume responsibility for forwarding packets sent to the virtual MAC address are known as active virtual forwarders (AVF) for their virtual MAC address. Because the traffic from a single common subnet goes through multiple redundant gateways, all the uplinks can be used.
Failover and convergence in GLBP work in a fashion similar to HSRP. A secondary virtual forwarder (SVF) takes over for traffic destined to a virtual MAC impacted by the failure and begins forwarding traffic for its failed peer. The end result is that a more equal utilization of the uplinks is achieved with minimal configuration. As a side effect, a convergence event on the uplink or on the primary distribution node affects only half as many hosts with a pair of GLBP switches, giving a convergence event an average of 50 percent less impact.
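The round-robin behavior and the "half as many hosts" observation can be modeled in a few lines. The Python sketch below is a simplified model, not the GLBP protocol itself: host and forwarder names are hypothetical, and each forwarder name stands for the virtual MAC address that the AVG would return in its ARP reply.

```python
from collections import Counter
from itertools import cycle


def assign_virtual_macs(hosts, forwarders):
    """Model the AVG answering ARP requests round-robin, one virtual MAC per AVF."""
    if not forwarders:
        raise ValueError("at least one forwarder is required")
    next_forwarder = cycle(forwarders)
    return {host: next(next_forwarder) for host in hosts}


hosts = [f"host-{n}" for n in range(1, 11)]          # ten hypothetical end hosts
assignment = assign_virtual_macs(hosts, ["dist-1", "dist-2"])

load = Counter(assignment.values())
affected_by_dist1_failure = [h for h, fwd in assignment.items() if fwd == "dist-1"]

print("hosts per forwarder:", dict(load))
print("hosts affected if dist-1 fails:", len(affected_by_dist1_failure), "of", len(hosts))
```

With two forwarders, each ends up serving half of the hosts, so a failure on one side touches only that half until the SVF takes over.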
Note that using GLBP in topologies where STP has blocked one of the access layer uplinks may result in a two-hop path at Layer 2 for upstream traffic, as illustrated in Figure 2-19.
Figure 2-19 GLBP VLAN Spanning
In environments where VLANs span across the distribution switches, HSRP is the preferred FHRP implementation.
In some cases, the STP environment can be tuned so that the Layer 2 link between the distribution switches is the blocking link while the uplinks from the access layer switches are in a forwarding state.
Figure 2-20 illustrates how you can tune STP by using the spanning-tree cost interface configuration command to change the port cost on the interface between the distribution layer switches on the STP secondary root switch. This option works if no VLANs span access switches.
Figure 2-20 GLBP and STP Tuning
However, if the same VLAN is on multiple access switches, you will have a looped figure-eight topology where one access layer uplink is still blocking. The preferred design is to not span VLANs across access switches.
Optimizing FHRP Convergence
HSRP can be reliably tuned to achieve 800-ms convergence for link or node failure. With HSRP, all flows from one subnet go through the active HSRP router; so the longest, shortest, and average convergence times are the same and less than a second.
VRRP can be tuned with subsecond timers, although the results of this timer tuning are not known. With VRRP, all flows from one subnet go through the same VRRP master router, so the longest, shortest, and average convergence times are the same and about a second.
GLBP can also be reliably tuned to achieve 800-ms convergence for link or node failure. With GLBP, a convergence event on an uplink or on the primary distribution node affects only half as many hosts, so a convergence event has an average of 50 percent less impact than with HSRP or VRRP if the default round-robin load-balancing algorithm is used.
GLBP is currently supported on the Cisco Catalyst 6500 series switches and the Cisco Catalyst 4500 series switches.
Figure 2-21 illustrates the difference in convergence times among the FHRPs when deployed on a distribution-to-access link in a server farm.
Figure 2-21 Optimizing FHRP Convergence
Supporting a Layer 2 to Layer 3 Boundary Design
The following section reviews design models and recommended practices for supporting the Layer 2 to Layer 3 boundary in highly available enterprise campus networks.
Layer 2 to Layer 3 Boundary Design Models
There are several design models for placement of the Layer 2 to Layer 3 boundary in the enterprise campus.
Layer 2 Distribution Switch Interconnection
If the enterprise campus requirements must support VLANs spanning multiple access layer switches, the design model uses a Layer 2 link for interconnecting the distribution switches.
The design, illustrated here in Figure 2-22, is more complex than the Layer 3 interconnection of the distribution switches. The STP convergence process will be initiated for uplink failures and recoveries.
Figure 2-22 Layer 2 Distribution Switch Interconnection
You can improve this suboptimal design as follows:
- Use RSTP as the version of STP.
- Provide a Layer 2 trunk between the two distribution switches to avoid unexpected traffic paths and multiple convergence events.
- If you choose to load balance VLANs across uplinks, be sure to place the HSRP primary and the STP primary on the same distribution layer switch. The HSRP and RSTP root should be collocated on the same distribution switches to avoid using the inter-distribution link for transit.
Layer 3 Distribution Switch Interconnection (HSRP)
Figure 2-23 shows the model which supports a Layer 3 interconnection between distribution switches using HSRP as the FHRP.
Figure 2-23 Layer 3 Distribution Switch Interconnection
In this time-proven topology, no VLANs span between access layer switches across the distribution switches. A subnet equals a VLAN, which equals an access switch. The root for each VLAN is aligned with the active HSRP instance. From an STP perspective, both access layer uplinks are forwarding, so the only convergence dependencies are the default gateway and return-path route selection across the distribution-to-distribution link.
This recommended design provides the highest availability.
With this design, a distribution-to-distribution link is required for route summarization. A recommended practice is to map the Layer 2 VLAN number to the Layer 3 subnet for ease of use and management.
Layer 3 Distribution Switch Interconnection (GLBP)
GLBP can also be used as the FHRP with the Layer 3 distribution layer interconnection model, as shown in Figure 2-24.
Figure 2-24 Layer 3 Distribution Switch Interconnection with GLBP
GLBP allows full utilization of the uplinks from the access layer. However, because the distribution of ARP responses is random, it is less deterministic than the design with HSRP. The distribution-to-distribution link is still required for route summarization. Because the VLANs do not span access switches, STP convergence is not required for uplink failure and recovery.
Layer 3 Access to Distribution Interconnection
The design extending Layer 3 to the access layer, shown here in Figure 2-25, provides the fastest network convergence.
Figure 2-25 Layer 3 Access to Distribution Interconnection
A routing protocol such as EIGRP, when properly tuned, can achieve better convergence results than designs that rely on STP to resolve convergence events. A routing protocol can even achieve better convergence results than the time-tested design placing the Layer 2 to Layer 3 boundary at the distribution layer. The design is easier to implement than configuring Layer 2 in the distribution layer because you do not need to align STP with HSRP or GLBP.
This design supports equal-cost Layer 3 load balancing on all links between the network switches. No HSRP or GLBP configuration is needed because the access switch is the default gateway for the end users. VLANs cannot span access switches in this design.
The convergence time required to reroute around a failed access-to-distribution layer uplink is reliably under 200 ms, compared to 900 ms for the design placing the Layer 2 to Layer 3 boundary at the distribution layer. Return-path traffic also converges in under 200 ms for an EIGRP reroute, again compared to 900 ms for the traditional Layer 2 to Layer 3 distribution layer model.
Because both EIGRP and OSPF load share over equal-cost paths, this design provides a convergence benefit similar to GLBP. Approximately 50 percent of the hosts are not affected by a convergence event because their traffic is not flowing over the failed link or through the failed node.
However, this design alternative brings some additional complexity in uplink IP addressing and subnetting, along with a loss of flexibility.
Routing in the access layer is not as widely deployed in the enterprise environment as the Layer 2 and Layer 3 distribution layer boundary model.
EIGRP Access Design Recommendations
When EIGRP is used as the routing protocol for a fully routed or routed access layer solution, with tuning it can achieve sub-200 ms convergence.
EIGRP to the distribution layer is similar to EIGRP in the branch, but it's optimized for fast convergence using these design rules:
- Limit scope of queries to a single neighbor:
- Summarize at the distribution layer to the core as is done in the traditional Layer 2 to Layer 3 border at the distribution layer. This confines impact of an individual access link failure to the distribution pair by stopping EIGRP queries from propagating beyond the core of the network. When the distribution layer summarizes toward the core, queries are limited to one hop from the distribution switches, which optimizes EIGRP convergence.
- Configure all access switches to use EIGRP stub nodes so that the access devices are not queried by the distribution switches for routes. EIGRP stub nodes cannot act as transit nodes and do not participate in EIGRP query processing. When the distribution node learns through the EIGRP hello packets that it is talking to a stub node, it does not flood queries to that node.
- Control route propagation to access switches using distribution lists. The access switches need only a default route to the distribution switches. An outbound distribution list applied to all interfaces facing the access layer from the distribution switch will conserve memory and optimize performance at the access layer.
- Set hello and hold timers to 1 and 3 seconds as a secondary mechanism to speed up convergence. A link failure or node failure should be the primary trigger for convergence events. Tune the EIGRP hello and hold timers to 1 and 3 seconds, respectively, to protect against a soft failure in which the physical links remain active but hello and route processing has stopped.
EIGRP optimized configuration example:
interface GigabitEthernet1/1
 ip hello-interval eigrp 100 1
 ip hold-time eigrp 100 3
router eigrp 100
 eigrp stub connected
OSPF Access Design Recommendations
When OSPF is used as the routing protocol for a fully routed or routed access layer solution, with tuning it can also achieve sub-200-ms convergence.
OSPF to the distribution layer is similar to OSPF in the branch, but it's optimized for fast convergence. With OSPF, summarization and limits to the diameter of OSPF LSA propagation are provided through implementation of Layer 2 to Layer 3 boundaries or Area Border Routers (ABR). It follows these design rules:
- Control the number of routes and routers in each area:
- Configure each distribution block as a separate, totally stubby OSPF area. The distribution switches become ABRs with their core-facing interfaces in area 0, and the access layer interfaces in unique, totally stubby areas for each access layer switch. Do not extend area 0 to the access switch because the access layer is not used as a transit area in a campus environment. Each access layer switch is configured into its own unique, totally stubby area. In this configuration, LSAs are isolated to each access layer switch so that a link flap for one access layer switch is not communicated beyond the distribution pairs.
- Tune OSPF millisecond hello, dead-interval, SPF, and LSA throttle timers as a secondary mechanism to improve convergence. Because CPU resources are not as scarce in a campus environment as they might be in a WAN environment, and the media types common in the access layer are not susceptible to the same half-up or rapid transitions as are those commonly found in the WAN, OSPF timers can safely be tuned, as shown in the configuration snippet here:
interface GigabitEthernet1/1
 ip ospf dead-interval minimal hello-multiplier 4
router ospf 100
 area 120 stub no-summary
 timers throttle spf 10 100 5000
 timers throttle lsa all 10 100 5000
 timers lsa arrival 80
Potential Design Issues
The following sections discuss potential design issues for placement of the Layer 2 to Layer 3 boundary in the enterprise campus.
Daisy Chaining Access Layer Switches
If multiple fixed-configuration switches are daisy chained in the access layer of the network, there is a danger that black holes will occur in the event of a link or node failure.
In the topology in Figure 2-26, before failures no links are blocking from an STP or RSTP perspective, so both uplinks are available to actively forward and receive traffic. Both distribution nodes can forward return-path traffic from the rest of the network toward the access layer for devices attached to all members of the stack or chain.
Figure 2-26 Daisy Chaining Access Layer Switches
Two scenarios can occur if a link or node in the middle of the chain or stack fails. In the first case, the standby HSRP peer can go active as it loses connectivity to its primary peer, forwarding traffic outbound for the devices that still have connectivity to it. The primary HSRP peer remains active and also forwards outbound traffic for its half of the stack. Although this is not optimum, it is not detrimental from the perspective of outbound traffic.
The second scenario is the issue. Return-path traffic has a 50 percent chance of arriving on a distribution switch that does not have physical connectivity to the half of the stack where the traffic is destined. The traffic that arrives on the wrong distribution switch is dropped.
The solution to this issue with this design is to provide alternate connectivity across the stack in the form of a loop-back cable running from the top to the bottom of the stack. This link needs to be carefully deployed so that the appropriate STP behavior will occur in the access layer.
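The black-hole condition and the effect of the loop-back cable can be reasoned about as Layer 2 reachability. The Python sketch below is a simplified model with hypothetical names: three chained access switches (A1 to A3), distribution switches D1 and D2 joined only by a Layer 3 link (so that link is left out of the Layer 2 graph), and a failure of the A1-A2 link in the middle of the chain.

```python
def reachable(links, start):
    """Return every node reachable from start over undirected Layer 2 links."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Layer 2 links only; the D1-D2 interconnection is Layer 3 and therefore omitted.
chain = [("D1", "A1"), ("A1", "A2"), ("A2", "A3"), ("A3", "D2")]
failed_link = ("A1", "A2")
loopback_cable = ("A1", "A3")   # top-to-bottom cable across the stack

after_failure = [link for link in chain if link != failed_link]
with_loopback = after_failure + [loopback_cable]

print("D1 reaches A3 after failure:      ", "A3" in reachable(after_failure, "D1"))
print("D1 reaches A3 with loopback cable:", "A3" in reachable(with_loopback, "D1"))
```

Without the cable, return traffic for hosts on A2 or A3 that arrives at D1 has no Layer 2 path and is dropped; with it, the stack stays connected at Layer 2, which is why STP behavior on that extra link needs care.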
An alternate design uses a Layer 2 link between the distribution switches.
Cisco StackWise Technology in the Access Layer
Cisco StackWise technology can eliminate the danger of black holes occurring in the access layer after a link or node failure. It can also eliminate the need for loop-back cables in the access layer or Layer 2 links between distribution nodes.
StackWise technology, shown in the access layer in Figure 2-27, supports the recommended practice of using a Layer 3 connection between the distribution switches without having to use a loop-back cable or perform extra configuration.
Figure 2-27 StackWise Technology
The true stack creation provided by the Cisco Catalyst 3750 series switches makes using stacks in the access layer much less complex than chains or stacks of other models. A stack of 3750 switches appears as one node from the network topology perspective.
If you use a modular chassis switch to support ports in the aggregation layer, such as the Cisco Catalyst 4500 or Catalyst 6500 family of switches, these design considerations are not required.
Too Much Redundancy
Although some redundancy is good, more redundancy is not necessarily better.
In Figure 2-28, a third switch is added to the distribution switches in the center. This extra switch adds unneeded complexity to the design and leads to these design questions:
- Where should the root switch be placed? With this design, it is not easy to determine where the root switch is located.
- What links should be in a blocking state? It is very hard to determine how many ports will be in a blocking state.
- What are the implications of STP and RSTP convergence? The network convergence is definitely not deterministic.
- When something goes wrong, how do you find the source of the problem? The design is much harder to troubleshoot.
Figure 2-28 Too Much Redundancy
Too Little Redundancy
For most designs, a link between the distribution layer switches is required for redundancy.
Figure 2-29 shows a less-than-optimal design where VLANs span multiple access layer switches. Without a Layer 2 link between the distribution switches, the design is a looped figure-eight topology. One access layer uplink will be blocking. HSRP hellos are exchanged by transiting the access switches.
Figure 2-29 Too Little Redundancy
Initially, traffic is forwarded from both access switches to the Distribution A switch that supports the STP root and the primary or active HSRP peer for VLAN 2. However, this design will black-hole traffic and be affected by multiple convergence events with a single network failure.
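The starting state described above, with Distribution A acting as both the STP root and the active HSRP peer for VLAN 2, could be configured roughly as follows. This is a minimal sketch rather than a validated configuration: the IP addresses, HSRP group number, and priority value are hypothetical, and exact syntax can vary by platform and software release.

```
! Distribution A (addresses, group number, and priority are assumed values)
spanning-tree vlan 2 root primary
!
interface Vlan2
 ip address 10.1.2.2 255.255.255.0
 standby 1 ip 10.1.2.1
 standby 1 priority 110
 standby 1 preempt
```

In such a sketch, Distribution B would typically be the secondary root for VLAN 2 and keep the default HSRP priority of 100, so that it takes over both roles only after a failure.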
Example: Impact of an Uplink Failure
This example looks at the impact of an uplink failure on the design when there is no link between the distribution layer switches.
In Figure 2-30, when the uplink from Access A to Distribution A fails, three convergence events occur:
- Access A sends traffic across its active uplink to Distribution B to get to its default gateway. The traffic is black-holed at Distribution B because Distribution B does not initially have a path to the primary or active HSRP peer on Distribution A because of the STP blocking. The traffic is dropped until the standby HSRP peer takes over as the default gateway after not receiving HSRP hellos from Distribution A.
- The indirect link failure is eventually detected by Access B after the maximum-age (max_age) timer expires, and Access B removes blocking on the uplink to Distribution B. With standard STP, transitioning to forwarding can take as long as 50 seconds. If BackboneFast is enabled with PVST+, this time can be reduced to 30 seconds, and RSTP can reduce this interval to as little as 1 second.
- After STP and RSTP converge, the distribution nodes reestablish their HSRP relationships and Distribution A (the primary HSRP peer) preempts. This causes yet another convergence event when Access A endpoints start forwarding traffic to the primary HSRP peer. The unexpected side effect is that Access A traffic goes through Access B to reach its default gateway. The Access B uplink to Distribution B is now a transit link for Access A traffic, and the Access B uplink to Distribution A must now carry traffic for both the originally intended Access B and for Access A.
Figure 2-30 Impact of an Uplink Failure
Example: Impact on Return-Path Traffic
Because the distribution layer in Figure 2-31 is routing with equal-cost load balancing, up to 50 percent of the return-path traffic arrives at Distribution A and is forwarded to Access B. Access B drops this traffic until the uplink to Distribution B is forwarding. This indirect link-failure convergence can take as long as 50 seconds. PVST+ with UplinkFast reduces the time to three to five seconds, and RSTP further reduces the outage to one second. After the STP and RSTP convergence, the Access B uplink to Distribution B is used as a transit link for Access A return-path traffic.
Figure 2-31 Impact on Return Path Traffic
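The convergence figures quoted in these two examples follow from the default 802.1D timers, and the faster options are selected with global spanning-tree commands. The fragment below is a hedged sketch, not a tested configuration: the two options are alternatives rather than a single configuration, UplinkFast and BackboneFast apply only to PVST+, and command availability depends on platform and software release.

```
! 802.1D defaults: max_age 20 s + listening 15 s + learning 15 s = 50 s
! BackboneFast skips the max_age wait: 15 s + 15 s = 30 s
!
! Option 1: keep PVST+ and enable the Cisco enhancements
! (UplinkFast on access switches; BackboneFast on every switch in the domain)
spanning-tree mode pvst
spanning-tree uplinkfast
spanning-tree backbonefast
!
! Option 2: run Rapid PVST+ (RSTP per VLAN) instead
spanning-tree mode rapid-pvst
```

Either option shortens the outage but does not remove the transit-link side effect described above, which comes from the topology rather than from the timers.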
These significant outages could affect the performance of mission-critical applications, such as voice or video. Traffic engineering or link-capacity planning for both outbound and return-path traffic is difficult and complex, and must support the traffic for at least one additional access layer switch.
The conclusion is that if VLANs must span the access switches, a Layer 2 link is needed between either the distribution layer switches or the access switches.
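As an illustration of the first option, a Layer 2 link between the distribution switches might look like the sketch below. The interface name and VLAN number are assumptions chosen to match the VLAN 2 example, and the encapsulation command is needed only on platforms that also support ISL.

```
! Distribution A; mirror the same settings on Distribution B
! (interface name and VLAN list are hypothetical)
interface TenGigabitEthernet1/1
 description Layer 2 trunk to Distribution B
 switchport
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport trunk allowed vlan 2
```

Limiting the allowed VLAN list to the VLANs that actually span access switches keeps the Layer 2 domain on this link as small as possible.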
Asymmetric Routing (Unicast Flooding)
When VLANs span access switches, an asymmetric routing situation can result because of equal-cost load balancing between the distribution and core layers.
Up to 50 percent of the return-path traffic with equal-cost routing arrives at the standby HSRP, VRRP, or alternate, nonforwarding GLBP peer. If the content-addressable memory (CAM) table entry ages out before the ARP entry for the end node, the peer may need to flood the traffic to all access layer switches and endpoints in the VLAN.
In Figure 2-32, the CAM table entry ages out on the standby HSRP router because the default ARP timer is four hours and the default CAM aging timer is five minutes. The CAM timer expires because the endpoint sends no traffic upstream toward the standby HSRP peer after it initially uses ARP to determine its default gateway. Once the CAM entry has aged out and been removed from the CAM table, the standby HSRP peer must flood the return-path traffic to all ports in the common VLAN. Most of the access layer switches also lack a CAM entry for the target MAC address, so they in turn flood the return traffic on all ports in the common VLAN. This unicast flooding can have a significant performance impact on the connected end stations because they may receive a large amount of traffic that is not intended for them.
Figure 2-32 Asymmetric Routing
Unicast Flooding Prevention
The unicast flooding situation can be easily avoided by not spanning VLANs across access layer switches.
Unicast flooding is not an issue when each VLAN is confined to a single access layer switch, because the flooding occurs only to switches supporting the VLAN where the traffic would normally have been switched. When the VLANs are local to individual access layer switches, asymmetrically routed traffic is flooded on only the one VLAN interface on the distribution switch. Traffic is flooded out the same interface that would normally be used to forward to the appropriate access switch. In addition, the access layer switch receiving the flooded traffic has a CAM table entry for the host because the host is directly attached, so traffic is switched only to the intended host. As a result, no additional end stations are affected by the flooded traffic.
If you must implement a topology where VLANs span more than one access layer switch, the recommended workaround is to tune the ARP timer so that it is equal to or less than the CAM aging timer. A shorter ARP cache timer causes the standby HSRP peer to use ARP for the target IP address before the CAM entry timer expires and the MAC entry is removed. The subsequent ARP response repopulates the CAM table before the CAM entry is aged out and removed. This removes the possibility of flooding asymmetrically routed return-path traffic to all ports. You can also consider biasing the routing metrics to remove the equal-cost routes.
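Both workarounds can be sketched in configuration form. The values below are assumptions for illustration only: 270 seconds is simply a value below the 300-second default CAM aging time, the VLAN number follows the VLAN 2 example, and the metric bias assumes OSPF is the routing protocol between the distribution and core layers.

```
! Workaround 1, on both distribution switches:
! default ARP timeout is 14400 s (4 hours); default CAM aging is 300 s (5 minutes)
interface Vlan2
 arp timeout 270
!
! Workaround 2, on the standby peer only (for example, Distribution B):
! raise the OSPF cost of the VLAN interface so that the core prefers
! the path through the active HSRP peer (cost value is hypothetical)
interface Vlan2
 ip ospf cost 20
```

A shorter ARP timeout generates more ARP traffic on the VLAN, so the value is a trade-off that should be checked against the number of hosts on the segment.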