This post represents the solution and explanation for quiz-18. Have a look at it to understand the problem.
Quiz Review
A company using multi-vendor routing platforms (Cisco and Juniper) has an HQ and multiple spoke sites connected by an MPLS provider. Each remote site has a GRE tunnel to the headquarters (HQ) and runs BGP over it.
After attending a security training, your Security Team raised concerns about ICMP-based attacks and decided to drop all ICMP messages.
Some time later, all the BGP sessions between Cisco and Juniper devices started flapping, impacting the connectivity between the HQ and the Juniper-based sites, while the BGP sessions between the HQ (Cisco-based) and the other Cisco-based sites were fine.
As most of you spotted already, dropping all ICMP messages affects Path MTU Discovery (PMTUD), which in turn impacts end-to-end connectivity (in this case, the BGP sessions)... but why is there a difference between Cisco and Juniper? We will get to that, but first let's do some simple math:
- by default, Ethernet MTU is 1500 bytes (a full Ethernet frame is 1518 = 1500 + 14 bytes Ethernet II header + 4 bytes checksum)
- by default, GRE tunnel MTU is 1476 = 1500 - (20 bytes IP header + 4 bytes GRE header)
- MPLS adds a 4-byte overhead for each label
- by default, if the MPLS MTU is not configured, it works out to 1492 bytes (accounting for 2 labels)
- by default, the TCP MSS (Maximum Segment Size) is automatically calculated by subtracting 40 (20-byte IP header + 20-byte TCP header) from the MTU of the outgoing interface:
  - TCP MSS is the maximum size of the TCP payload
  - TCP MSS is negotiated (the lower value is chosen) between source and destination during the TCP 3-way handshake, in the SYN & SYN/ACK packets
  - for example: the MSS for a TCP session going out an Ethernet interface would be 1500 - 40 = 1460 bytes
  - another example: the MSS for a TCP session going out a GRE tunnel interface would be 1476 - 40 = 1436 bytes
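The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are my own, and it assumes plain 20-byte IP and TCP headers with no options.

```python
# Header sizes in bytes (assumption: no IP or TCP options).
IP_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4

ETHERNET_MTU = 1500


def gre_tunnel_mtu(physical_mtu: int) -> int:
    """MTU left inside a GRE tunnel after the outer IP and GRE headers."""
    return physical_mtu - (IP_HEADER + GRE_HEADER)


def tcp_mss(interface_mtu: int) -> int:
    """Largest TCP payload that fits in one packet on the given interface."""
    return interface_mtu - (IP_HEADER + TCP_HEADER)


def negotiated_mss(local_mss: int, remote_mss: int) -> int:
    """Effective MSS for a session: the lower of the two advertised values."""
    return min(local_mss, remote_mss)


print(gre_tunnel_mtu(ETHERNET_MTU))           # 1476
print(tcp_mss(ETHERNET_MTU))                  # 1460
print(tcp_mss(gre_tunnel_mtu(ETHERNET_MTU)))  # 1436
```

With both ends of the BGP session sitting on a GRE tunnel, `negotiated_mss(1436, 1436)` stays at 1436, which is the value used in the rest of this post.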
Note the entire frame size of 1522 = 1508 (packet with 2 MPLS labels) + 14 (Ethernet II header)
Now, for the BGP sessions, the math is like this:
- the maximum BGP UPDATE message would have a size of 1436 bytes = this is the TCP MSS for BGP over a GRE tunnel (see above)
- when such a packet reaches the PE, its size would be 1500 bytes = 1436 (BGP payload) + 20 (TCP header) + 20 (IP header) + 4 (GRE header) + 20 (outer IP header)
- the quiz does not make any reference to the MTU size inside the MPLS cloud as there is no MTU configuration on the MPLS links - this is done on purpose to create the quiz => a packet of 1500 bytes is too large for the MPLS links (because the PE needs to add 2 labels = another 8 bytes)
- as a result, the PE will need to perform fragmentation of the BGP UPDATE message
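To double-check the numbers in this list, here is a small Python sketch. The helper names are hypothetical, and it assumes that the MPLS links carry at most 1500 bytes including the labels, which is what the quiz setup implies.

```python
IP_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4
MPLS_LABEL = 4


def packet_size_at_pe(tcp_payload: int) -> int:
    """Size of the GRE-encapsulated IP packet that the CE hands to the PE."""
    inner_packet = tcp_payload + TCP_HEADER + IP_HEADER
    return inner_packet + GRE_HEADER + IP_HEADER  # add GRE + outer IP header


def fits_mpls_link(ip_packet: int, labels: int, link_mtu: int = 1500) -> bool:
    """True if the packet plus its label stack fits on the MPLS link."""
    return ip_packet + labels * MPLS_LABEL <= link_mtu


size = packet_size_at_pe(1436)
print(size)                      # 1500
print(fits_mpls_link(size, 2))   # False: 1500 + 8 bytes of labels > 1500
print(fits_mpls_link(1492, 2))   # True: 1492 is the largest packet that fits
```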
Path MTU Discovery (PMTUD)
For completeness of this article, in short, Path MTU Discovery consists of:
- source host sets DF-bit in the IP header to indicate that packet must not be fragmented in transit
- intermediate routers (PE in our quiz) will drop these large packets: because they exceed the MTU of outgoing interface and because they are not allowed to fragment them due to DF-bit setting
- intermediate routers will send an ICMP "Fragmentation Needed and DF set" back to source host (CE router for BGP session, in our quiz)
- very important: the ICMP "Fragmentation Needed" message also contains the recommended MTU value
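The steps above can be modelled with a toy simulation in Python. It is a simplified sketch, not a real IP stack: `pe_forward` and `send_with_pmtud` are names I made up, and the `icmp_filtered` flag stands for an ACL that drops the ICMP reply before it reaches the source.

```python
from typing import Optional, Tuple


def pe_forward(packet_size: int, df_bit: bool, next_hop_mtu: int) -> Tuple[str, Optional[int]]:
    """Decision of an intermediate router: (action, MTU reported via ICMP)."""
    if packet_size <= next_hop_mtu:
        return "forward", None
    if not df_bit:
        return "fragment", None
    # DF set and packet too big: drop it and report the next-hop MTU
    # in an ICMP type 3 / code 4 message.
    return "drop", next_hop_mtu


def send_with_pmtud(packet_size: int, next_hop_mtu: int,
                    icmp_filtered: bool, max_tries: int = 5) -> Optional[int]:
    """Source host retrying with DF set; returns the size that got through."""
    size = packet_size
    for _ in range(max_tries):
        action, reported_mtu = pe_forward(size, True, next_hop_mtu)
        if action == "forward":
            return size
        if icmp_filtered:
            continue  # the ICMP never arrives, so the same size is resent
        size = reported_mtu  # shrink to the MTU recommended by the router
    return None  # black hole: nothing ever got through


print(pe_forward(1500, False, 1492))        # ('fragment', None)
print(send_with_pmtud(1500, 1492, False))   # 1492
print(send_with_pmtud(1500, 1492, True))    # None
```

The last call is the situation in the quiz: the DF-bit is set, the packet is too large, and the ICMP feedback is filtered, so the large packets never arrive.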
Cisco vs. Juniper
The difference between the BGP sessions established between Cisco-only sites (that were not impacted) and Cisco-Juniper ones (sites impacted) lies in the DF-bit setting !
- By default, Cisco does not set the DF-bit for GRE tunnels => this means that a BGP UPDATE of 1500 bytes would be fragmented by the PE before being sent over the 1492-byte MPLS links.
- Junipers, on the other hand, by default set the DF-bit for GRE tunnels => so a 1500-byte BGP UPDATE with the DF-bit set would not fit the 1492-byte MPLS links. The PEs would drop these packets and send back to the CEs an ICMP "Fragmentation Needed" indicating the MTU of the outgoing link (see above screenshot: 1492).
This is visible on both Cisco PE and Juniper CE:
- debugging ICMP on PE:
R5#
*Mar 1 00:22:32.851: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
*Mar 1 00:22:33.747: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
R5#
*Mar 1 00:22:40.291: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
*Mar 1 00:22:41.699: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
- firewall logs on Juniper CE with the drops:
root@Router-1> show firewall
Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                        Bytes   Packets
deny-icmp-ge-0/0/0.0-i      22400       400

root@Router-1> show firewall log
Log :
Time      Filter                  Action  Interface   Protocol  Src Addr     Dest Addr
23:14:23  DENY_ICMP-ge-0/0/0.0-i  D       ge-0/0/0.0  ICMP      192.168.2.1  192.168.255.2
23:14:07  DENY_ICMP-ge-0/0/0.0-i  D       ge-0/0/0.0  ICMP      192.168.2.1  192.168.255.2
23:13:59  DENY_ICMP-ge-0/0/0.0-i  D       ge-0/0/0.0  ICMP      192.168.2.1  192.168.255.2
...

root@Router-1> show firewall log detail
Time of Log: 2014-01-24 23:14:23 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
Time of Log: 2014-01-24 23:14:07 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
If you are curious how the BGP session behaves on each end, here it is:
Cisco CE in HQ
The BGP session gets established but it does not learn any route. Notice:
- the 0 counter on the PfxRcd
- the Up/Down timer never gets higher than "1:29" = 90 sec (the BGP hold time on this session; 90 seconds is the Junos default)
CE-HQ#
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
%BGP-3-NOTIFICATION: sent to neighbor 192.168.12.2 4/0 (hold time expired) 0 bytes
%BGP-5-NBR_RESET: Neighbor 192.168.12.2 reset (BGP Notification sent)
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Down BGP Notification sent
CE-HQ#
*Jan 24 22:42:18.519: %BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
CE-HQ#sh ip bgp summary
...
Neighbor        V    AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
192.168.12.2    4  65200       2        8    1935    0     0  00:01:27  0
192.168.13.2    4  65300      11       15    1935    0     0  00:04:35  848
CE-HQ#
%BGP-3-NOTIFICATION: sent to neighbor 192.168.12.2 4/0 (hold time expired) 0 bytes
Juniper CE in remote site
The BGP session gets established and prefixes are learned over it. Notice that the Flaps counter is non-zero.
root@Router-1> show bgp summary
Groups: 1 Peers: 1 Down peers: 0
Table    Tot Paths  Act Paths  Suppressed  History  Damp State  Pending
inet.0        1934       1934           0        0           0        0
Peer          AS     InPkt  OutPkt  OutQ  Flaps Last Up/Dwn  State|#Active/Received...
192.168.12.1  65000      9      16     0  101:27             1934/1934/1934/0

root@Router-1> show firewall
Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                        Bytes   Packets
deny-icmp-ge-0/0/0.0-i       5152        92

root@Router-1> show firewall log detail
Time of Log: 2014-01-24 22:46:38 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
Solutions
1. Set a higher MTU inside the MPLS cloud
As mentioned above, the MPLS MTU was not set to take into account the labels, for the sake of this quiz. Considering the default MTU of physical interface of 1500, the MPLS MTU would be 1492 (for 2 labels). This value is easily seen in the ICMP "Fragmentation Needed" messages, as shown above.
Usually MPLS providers do provide an MTU of 1500 bytes to their customers. To do this we need to increase the MPLS MTU to at least 1508 - usually you set the MPLS MTU to 1516 (to accommodate 4 labels), but for this quiz we use only 2 MPLS labels:
PE-2#sh mpls interfaces
Interface          IP         Tunnel  Operational
FastEthernet0/1    Yes (ldp)  No      Yes
PE-2#
PE-2#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE-2(config)#int fa0/1
PE-2(config-if)#mpls mtu 1508
PE-2#
*Mar 1 00:47:45.651: %SYS-5-CONFIG_I: Configured from console by console
PE-2#sh mpls int detail
Interface FastEthernet0/1:
  IP labeling enabled (ldp):
    Interface config
  LSP Tunnel labeling not enabled
  BGP tagging not enabled
  Tagging operational
  Fast Switching Vectors:
    IP to MPLS Fast Switching Vector
    MPLS Turbo Vector
  MTU = 1508
PE-2#
This is, by far, the best solution, because it avoids fragmentation!!
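The 1508 and 1516 values used in this solution follow directly from the 4-byte label size. A tiny Python sketch (the function name is just for illustration):

```python
MPLS_LABEL = 4  # bytes per MPLS label


def required_mpls_mtu(customer_mtu: int, labels: int) -> int:
    """MPLS MTU needed to carry a full customer packet plus its label stack."""
    return customer_mtu + labels * MPLS_LABEL


print(required_mpls_mtu(1500, 2))  # 1508
print(required_mpls_mtu(1500, 4))  # 1516
```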
2. Allow ICMP "Fragmentation Needed" into the ACL (on Juniper side)
Another solution to this problem is to modify the access-list / Juniper filter to permit the ICMP messages of type 3 (destination unreachable), code 4 (fragmentation needed), which are used by PMTUD (Path MTU Discovery).
In general, it's a good practice to allow the ICMP "Fragmentation Needed" messages in access-lists whenever the ICMP protocol is filtered.
For this quiz, these ICMP messages need to be allowed by the firewall filter only on the Juniper devices (because it's the Juniper that sets the DF-bit in the GRE packets).
root@Router-1> show configuration firewall filter DENY_ICMP
interface-specific;
term ALLOW_PMTUD {
    from {
        protocol icmp;
        icmp-type unreachable;
        icmp-code fragmentation-needed;
    }
    then {
        count allow-pmtud;
        log;
        accept;
    }
}
term DENY_ICMP {
    from {
        protocol icmp;
    }
    then {
        count deny-icmp;
        log;
        discard;
    }
}
term ALLOW_ALL {
    then accept;
}
After committing this change, the BGP sessions between the Cisco HQ and the Juniper sites became stable. The firewall counters and logs show that ICMP "Fragmentation Needed" messages are allowed on Juniper:
root@Router-1> show firewall
Filter: __default_bpdu_filter__
Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                        Bytes   Packets
allow-pmtud-ge-0/0/0.0-i      224         4
deny-icmp-ge-0/0/0.0-i        168         2

root@Router-1> show firewall log detail
Time of Log: 2014-01-22 22:55:30 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: accept, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
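The order of the terms in the DENY_ICMP filter matters, because the first matching term decides what happens to the packet. Here is a rough Python model of that logic. It is my own simplification of the filter, not how Junos evaluates filters internally, and the function name is hypothetical.

```python
from typing import Optional

ICMP_DEST_UNREACHABLE = 3   # ICMP type
ICMP_FRAG_NEEDED = 4        # ICMP code


def filter_action(protocol: str,
                  icmp_type: Optional[int] = None,
                  icmp_code: Optional[int] = None) -> str:
    """First-match evaluation of the three terms, in configuration order."""
    # term ALLOW_PMTUD: only "Fragmentation Needed" (type 3, code 4)
    if (protocol == "icmp"
            and icmp_type == ICMP_DEST_UNREACHABLE
            and icmp_code == ICMP_FRAG_NEEDED):
        return "accept"
    # term DENY_ICMP: every other ICMP message
    if protocol == "icmp":
        return "discard"
    # term ALLOW_ALL: everything else (GRE, TCP, ...)
    return "accept"


print(filter_action("icmp", 3, 4))   # accept  (PMTUD feedback)
print(filter_action("icmp", 8, 0))   # discard (echo request)
print(filter_action("gre"))          # accept
```

If the ALLOW_PMTUD term were placed after DENY_ICMP, the type 3 / code 4 messages would hit the discard first and the problem would remain.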
3. Apply allow-fragmentation on the Tunnel Interface (on Juniper)
By default, GRE packets will be dropped if they exceed the MTU of the outgoing physical interface. Instead of dropping them, you can tell the Juniper router to split them into multiple IP fragments - this is achieved with the allow-fragmentation command under the gr- (tunnel) interface:
root@Router-1> show configuration interfaces gr-0/0/0
unit 0 {
    tunnel {
        source 192.168.255.2;
        destination 192.168.255.1;
        allow-fragmentation;
    }
    family inet {
        address 192.168.12.2/30;
    }
}
Since you allow fragmentation of the GRE packets, the router will not set the DF-bit anymore. This is the reason why I consider this solution to be more of a workaround: it does not actually solve the problem, since large BGP UPDATE messages are still sent and they get fragmented on the MPLS PE routers.
A real solution would be to avoid fragmentation !
4. Implement MSS Clamping
Another good solution to avoid fragmentation is to use "MSS Clamping". This feature modifies (usually decreases) the MSS value in the SYN and SYN/ACK packets to the configured value. As shown above, the MSS value for the BGP sessions that run over GRE tunnels is 1436 (= 1476 (GRE MTU) - 40 (IP+TCP headers)).
On Cisco devices, this is implemented at the global level with ip tcp mss or at the interface level with ip tcp adjust-mss:
CE-HQ(config)#ip tcp mss 1400
CE-HQ(config)#end
CE-HQ#
CE-HQ#clear ip bgp 192.168.12.2
CE-HQ#
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Down User reset
%BGP_SESSION-5-ADJCHANGE: neighbor 192.168.12.2 IPv4 Unicast topology base removed from session User reset
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
CE-HQ#sh ip bgp s
CE-HQ#sh ip bgp summary
...
Neighbor        V    AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
192.168.12.2    4  65200      13       12    2835    0     0  00:00:19  999
192.168.13.2    4  65300      83       88    2835    0     0  01:10:53  848
CE-HQ#
CE-HQ#sh ip bgp nei 192.168.12.2 | i max
Number of NLRIs in the update sent: max 1010, min 0
minRTT: 48 ms, maxRTT: 484 ms, ACK hold: 200 ms
Datagrams (max data segment is 1400 bytes):
CE-HQ#
This is a screenshot of the TCP 3-way handshake for the BGP between HQ and remote-site:
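To see why an MSS of 1400 is a safe choice here, the following Python sketch works backwards from the path MTU. It assumes the 1492-byte value reported in the ICMP "Fragmentation Needed" messages and 20-byte IP and TCP headers without options. The function names are mine.

```python
IP_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4


def max_safe_mss(path_mtu: int) -> int:
    """Largest MSS whose GRE-encapsulated packet still fits the path MTU."""
    tunnel_overhead = IP_HEADER + GRE_HEADER   # outer IP + GRE
    return path_mtu - tunnel_overhead - (IP_HEADER + TCP_HEADER)


def outer_packet_size(mss: int) -> int:
    """Size on the wire (outer IP packet) of a full-sized TCP segment."""
    return mss + TCP_HEADER + IP_HEADER + GRE_HEADER + IP_HEADER


print(max_safe_mss(1492))        # 1428
print(outer_packet_size(1400))   # 1464, below the 1492-byte limit
print(outer_packet_size(1436))   # 1500, too large without clamping
```

Any value up to 1428 would avoid fragmentation on this path. 1400 simply leaves some extra headroom.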
5. Additional tests run on Juniper
On Juniper, I tried several other options that, theoretically, represent solutions to this quiz:
- use no-gre-path-mtu-discovery to disable PMTUD for GRE. This can be applied either on the GRE interface or under system internet-options. For unknown reasons (I suspect due to the virtual hardware that I used for testing) this solution did not work for me.
- use no-path-mtu-discovery to disable PMTUD for all outgoing TCP connections. This can also be applied either on the GRE interface or under "system internet-options". Although this may look like a solution at first sight, it does not work because it disables PMTUD for TCP (the BGP sessions, in our case), which represents the inner header, not for the outer IP header.
Last but not least, let me mention here that, with current IOS versions, BGP performs PMTUD by default:
CE-HQ#sh ip bgp nei 192.168.12.2 | i path-mtu
Transport(tcp) path-mtu-discovery is enabled
CE-HQ#
Uuuuu, I had no idea that this post would be sooo long... but I wanted to touch on all aspects of the quiz and I hope it makes for interesting reading!
Thank you for your comments and interest in the quiz!