This post presents the solution and explanation for quiz-18. Have a look at the quiz first to understand the problem.

Quiz Review

A company using multi-vendor routing platforms (Cisco and Juniper) has an HQ and multiple spoke sites connected by an MPLS provider. Each remote site has a GRE tunnel to the Headquarters (HQ) and runs BGP over it.

After attending a security training, your Security Team raised concerns about ICMP-based attacks and decided to block ICMP messages on all physical interfaces connected to outside networks, on all border routers, in all sites.
Some time later, all the BGP sessions between Cisco and Juniper devices started flapping up/down, impacting the connectivity between HQ and Juniper-based sites, while the BGP sessions between HQ (Cisco-based) and the other Cisco-based sites were ok.

As most of you spotted already, dropping all ICMP messages affects Path MTU Discovery (PMTUD), which in turn impacts end-to-end connectivity (in this case, the BGP session)... but why is there a difference between Cisco and Juniper? We will see that, but first let's do some simple math:

Review of different MTU values
  • by default, Ethernet MTU is 1500 bytes (a full Ethernet frame is 1518 = 1500 payload + 14 bytes Ethernet II header + 4 bytes checksum)
  • by default, GRE tunnel MTU is 1476 = 1500 - (20 bytes IP header + 4 bytes GRE header)
  • MPLS adds a 4-byte overhead for each label - by default, if MPLS MTU is not configured, only 1492 bytes remain for the IP packet (1500 minus 2 labels)
  • by default, the TCP MSS (Maximum Segment Size) is automatically calculated by subtracting 40 (20-byte IP header + 20-byte TCP header) from the MTU of the outgoing interface:
    • TCP MSS is the maximum size of the TCP payload
    • TCP MSS is negotiated (the lower value is chosen) between source and destination during the TCP 3-way handshake, in the SYN & SYN/ACK packets
    • for example: MSS for a TCP session leaving an Ethernet interface would be 1500 - 40 = 1460 bytes
    • another example: MSS for a TCP session leaving a GRE tunnel interface would be 1476 - 40 = 1436 bytes
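
The arithmetic in the list above can be double-checked with a few lines of Python. This is only a sketch: the constant and function names are mine (they are not commands or APIs of any router), and the header sizes assume no IP or TCP options.

```python
# Header sizes in bytes, assuming no IP/TCP options (illustrative constants).
IP_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4
ETHERNET_MTU = 1500


def gre_tunnel_mtu(physical_mtu: int = ETHERNET_MTU) -> int:
    """IP MTU left inside a GRE tunnel after the outer IP and GRE headers."""
    return physical_mtu - (IP_HEADER + GRE_HEADER)


def tcp_mss(interface_mtu: int) -> int:
    """Default TCP MSS: interface MTU minus the IP and TCP headers."""
    return interface_mtu - (IP_HEADER + TCP_HEADER)


print("GRE tunnel MTU:", gre_tunnel_mtu())
print("MSS over Ethernet:", tcp_mss(ETHERNET_MTU))
print("MSS over GRE:", tcp_mss(gre_tunnel_mtu()))
```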

[Figure: How Could MTU affect BGP Sessions]
Note the entire frame size of 1522 = 1508 (packet with 2 MPLS labels) + 14 (Ethernet II header)

Now, for the BGP sessions, the math is like this:

  • the largest TCP segment carrying a BGP UPDATE message would have a size of 1436 bytes = this is the TCP MSS for BGP over a GRE tunnel (see above)
  • when such a packet reaches the PE, its size would be 1500 bytes = 1436 (BGP payload) + 20 (TCP header) + 20 (IP header) + 4 (GRE header) + 20 (outer IP header)
  • the quiz does not make any reference to the MTU size inside the MPLS cloud as there is no MTU configuration on the MPLS links - this is done on purpose to create the quiz => a packet of 1500 bytes is too large for the MPLS links (because the PE needs to add 2 labels = another 8 bytes)
  • as a result, the PE will need to fragment the packet carrying the BGP UPDATE message, or drop it if fragmentation is not allowed (DF-bit set)
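
To make the math above easier to replay with other values, here is a small Python sketch. The function names are made up for illustration, and it assumes plain 20-byte IP/TCP headers and a basic 4-byte GRE header.

```python
# Sizes in bytes; all names below are illustrative, not router commands.
IP_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4
MPLS_LABEL = 4


def packet_size_at_pe(tcp_payload: int) -> int:
    """Size of the GRE-encapsulated IP packet when it arrives at the PE."""
    inner_packet = tcp_payload + TCP_HEADER + IP_HEADER
    return inner_packet + GRE_HEADER + IP_HEADER  # GRE + outer IP header


def labeled_size(ip_packet: int, labels: int) -> int:
    """Size of the packet once the PE has pushed the MPLS labels."""
    return ip_packet + labels * MPLS_LABEL


def fits(ip_packet: int, labels: int, mpls_mtu: int) -> bool:
    """True if the labeled packet fits the MPLS MTU of the outgoing link."""
    return labeled_size(ip_packet, labels) <= mpls_mtu


size = packet_size_at_pe(1436)
print(size, labeled_size(size, 2), fits(size, 2, 1500))
```

With the quiz values, a full-size segment arrives at the PE as 1500 bytes, grows to 1508 with 2 labels, and does not fit a 1500-byte MPLS MTU.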

Path MTU Discovery (PMTUD)

For completeness of this article, in short, Path MTU Discovery consists of:

  • source host sets DF-bit in the IP header to indicate that packet must not be fragmented in transit
  • intermediate routers (PE in our quiz) will drop these large packets: because they exceed the MTU of outgoing interface and because they are not allowed to fragment them due to DF-bit setting
  • intermediate routers will send an ICMP "Fragmentation Needed and DF set" back to source host (CE router for BGP session, in our quiz)
  • very important: the ICMP "Fragmentation Needed" message also contains the recommended MTU value (the MTU of the next hop)
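
As an illustration of the last point, the sketch below builds and parses the 8-byte ICMP header of a "Fragmentation Needed" message, where the next-hop MTU sits in the last 2 bytes (as defined in RFC 1191). The helper names are hypothetical and the checksum is left at 0, so this is not a valid packet to put on the wire, only a way to see the field layout.

```python
import struct

ICMP_DEST_UNREACHABLE = 3   # ICMP type
ICMP_FRAG_NEEDED = 4        # ICMP code

# Layout: type (1 byte), code (1), checksum (2), unused (2), next-hop MTU (2)
HEADER_FORMAT = "!BBHHH"


def build_frag_needed(next_hop_mtu: int) -> bytes:
    """Build the ICMP header only; checksum is left at 0 for illustration."""
    return struct.pack(HEADER_FORMAT, ICMP_DEST_UNREACHABLE,
                       ICMP_FRAG_NEEDED, 0, 0, next_hop_mtu)


def parse_next_hop_mtu(icmp_header: bytes):
    """Return the next-hop MTU for a type 3 / code 4 message, else None."""
    icmp_type, icmp_code, _checksum, _unused, mtu = struct.unpack(
        HEADER_FORMAT, icmp_header[:8])
    if (icmp_type, icmp_code) == (ICMP_DEST_UNREACHABLE, ICMP_FRAG_NEEDED):
        return mtu
    return None


print(parse_next_hop_mtu(build_frag_needed(1492)))
```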

[Screenshot: ICMP "Fragmentation Needed" message]

Cisco vs. Juniper

The difference between the BGP sessions established between Cisco-only sites (that were not impacted) and Cisco-Juniper ones (sites impacted) lies in the DF-bit setting!

  • By default, Cisco does not set the DF-bit for GRE tunnels => this means that a 1500-byte packet carrying a BGP UPDATE would be fragmented by the PE before being sent over the 1492-byte MPLS links.
  • Junipers, on the other hand, set the DF-bit for GRE tunnels by default => so a 1500-byte packet carrying a BGP UPDATE, with the DF-bit set, would not fit the 1492-byte MPLS links. The PEs would drop it and send back to the CEs an ICMP "Fragmentation Needed" indicating the MTU of the outgoing link (see above screenshot: 1492).

This is visible on both Cisco PE and Juniper CE:
- debugging ICMP on PE:

R5#
*Mar  1 00:22:32.851: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
*Mar  1 00:22:33.747: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
R5#
*Mar  1 00:22:40.291: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
*Mar  1 00:22:41.699: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2

- firewall logs on Juniper CE with the drops:

root@Router-1> show firewall

Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                                                Bytes              Packets
deny-icmp-ge-0/0/0.0-i                              22400                  400

root@Router-1> show firewall log
Log :
Time      Filter    Action         Interface       Protocol        Src Addr                         Dest Addr
23:14:23  DENY_ICMP-ge-0/0/0.0-i D ge-0/0/0.0      ICMP            192.168.2.1                      192.168.255.2
23:14:07  DENY_ICMP-ge-0/0/0.0-i D ge-0/0/0.0      ICMP            192.168.2.1                      192.168.255.2
23:13:59  DENY_ICMP-ge-0/0/0.0-i D ge-0/0/0.0      ICMP            192.168.2.1                      192.168.255.2
...
root@Router-1> show firewall log detail
Time of Log: 2014-01-24 23:14:23 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
Time of Log: 2014-01-24 23:14:07 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
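
The behaviour seen in these logs can be summarized in a small Python model. It is a simplification under my own naming (the `Action` enum and both functions are hypothetical), with 1492 taken as the usable MTU reported by the PE.

```python
from enum import Enum


class Action(Enum):
    FORWARD = "forward"
    FRAGMENT = "fragment"
    DROP_AND_SEND_ICMP = "drop and send ICMP type 3 code 4"


def pe_decision(packet_size: int, df_bit: bool, usable_mtu: int) -> Action:
    """What a router does with a packet that may exceed the outgoing MTU."""
    if packet_size <= usable_mtu:
        return Action.FORWARD
    if df_bit:
        return Action.DROP_AND_SEND_ICMP
    return Action.FRAGMENT


def source_learns_mtu(action: Action, icmp_filtered: bool) -> bool:
    """PMTUD only works if the ICMP error actually reaches the sender."""
    return action is Action.DROP_AND_SEND_ICMP and not icmp_filtered


# Cisco CE (DF-bit not set on GRE packets) vs. Juniper CE (DF-bit set)
print(pe_decision(1500, df_bit=False, usable_mtu=1492))
print(pe_decision(1500, df_bit=True, usable_mtu=1492))
```

In the Juniper case the packet is dropped and, because the filter discards the ICMP message, `source_learns_mtu(..., icmp_filtered=True)` is false: the CE keeps sending packets that are too large until the hold time expires.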


If you are curious how the BGP session behaves on each end, here it is:

Cisco CE in HQ
The BGP session gets established but it does not learn any routes. Notice:
- the 0 counter on the PfxRcd
- the Up/Down timer never gets past "1:29" = 90 sec (the hold time of this session: Junos defaults to 90 seconds and the lower value of the two peers is used)

CE-HQ#
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
%BGP-3-NOTIFICATION: sent to neighbor 192.168.12.2 4/0 (hold time expired) 0 bytes
%BGP-5-NBR_RESET: Neighbor 192.168.12.2 reset (BGP Notification sent)
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Down BGP Notification sent
CE-HQ#
*Jan 24 22:42:18.519: %BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
CE-HQ#sh ip bgp summary
...
Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.168.12.2    4        65200       2       8     1935    0    0 00:01:27        0
192.168.13.2    4        65300      11      15     1935    0    0 00:04:35      848
CE-HQ#
%BGP-3-NOTIFICATION: sent to neighbor 192.168.12.2 4/0 (hold time expired) 0 bytes


Juniper CE in remote site
The BGP session gets established and prefixes are learned over it. Notice the Flaps counter is non-zero

root@Router-1> show bgp summary
Groups: 1 Peers: 1 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0              1934       1934          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received...
192.168.12.1          65000          9         16       0       10        1:27 1934/1934/1934/0

root@Router-1> show firewall

Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                                                Bytes              Packets
deny-icmp-ge-0/0/0.0-i                              5152                   92

root@Router-1> show firewall log detail
Time of Log: 2014-01-24 22:46:38 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, 
Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, 
Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4

Solutions

1. Set a higher MTU inside MPLS

As mentioned above, the MPLS MTU was not set to take into account the labels, for the sake of this quiz. Considering the default MTU of the physical interface (1500), only 1492 bytes remain for the IP packet (with 2 labels). This value is easily seen in the ICMP "Fragmentation Needed" messages, as shown above.

MPLS providers usually offer an MTU of 1500 bytes to their customers. To do this we need to increase the MPLS MTU to at least 1508 - usually you set the MPLS MTU to 1516 (to accommodate 4 labels), but for this quiz we use only 2 MPLS labels:

PE-2#sh mpls interfaces
Interface              IP            Tunnel   Operational
FastEthernet0/1        Yes (ldp)     No       Yes
PE-2#
PE-2#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
PE-2(config)#int fa0/1
PE-2(config-if)#mpls mtu 1508
PE-2#
*Mar  1 00:47:45.651: %SYS-5-CONFIG_I: Configured from console by console
PE-2#sh mpls int detail
Interface FastEthernet0/1:
        IP labeling enabled (ldp):
          Interface config
        LSP Tunnel labeling not enabled
        BGP tagging not enabled
        Tagging operational
        Fast Switching Vectors:
          IP to MPLS Fast Switching Vector
          MPLS Turbo Vector
        MTU = 1508
PE-2#

This is, by far, the best solution, because it avoids fragmentation!!
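
If you want to size the MPLS MTU for a different number of labels, the rule is simply "customer MTU + 4 bytes per label". A tiny Python sketch (function names are mine, for illustration only):

```python
MPLS_LABEL_BYTES = 4
ETHERNET_II_HEADER = 14


def required_mpls_mtu(customer_ip_mtu: int, labels: int) -> int:
    """MPLS MTU needed to carry a full-size customer packet unfragmented."""
    return customer_ip_mtu + labels * MPLS_LABEL_BYTES


def frame_size(labeled_packet: int) -> int:
    """Ethernet II frame size without the 4-byte checksum."""
    return labeled_packet + ETHERNET_II_HEADER


print(required_mpls_mtu(1500, 2))   # the value configured in this quiz
print(required_mpls_mtu(1500, 4))   # the more common, safer value
print(frame_size(required_mpls_mtu(1500, 2)))
```

Keep in mind that the physical links and switches between the MPLS routers must also be able to carry these slightly larger frames.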

2. Allow ICMP "Fragmentation Needed" into the ACL (on Juniper side)

Another solution to this problem is to modify the access-list / Juniper filter to permit the ICMP messages of type 3 (destination unreachable) - code 4 (fragmentation needed) that are used by PMTUD (Path MTU Discovery).
In general, it's a good practice to allow the ICMP "Fragmentation Needed" messages in access-lists whenever the ICMP protocol is filtered.
For this quiz, these ICMP messages need to be allowed by the firewall filter only on the Juniper devices (because it's the Juniper that sets the DF-bit in the GRE packets).

root@Router-1> show configuration firewall filter DENY_ICMP
interface-specific;
term ALLOW_PMTUD {
    from {
        protocol icmp;
        icmp-type unreachable;
        icmp-code fragmentation-needed;
    }
    then {
        count allow-pmtud;
        log;
        accept;
    }
}
term DENY_ICMP {
    from {
        protocol icmp;
    }
    then {
        count deny-icmp;
        log;
        discard;
    }
}
term ALLOW_ALL {
    then accept;
}

After committing this change, the BGP sessions between Cisco-HQ and Juniper-sites became stable. The firewall counters and logs show that ICMP "Fragmentation Needed" messages are allowed on Juniper:

root@Router-1> show firewall

Filter: __default_bpdu_filter__

Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                                                Bytes              Packets
allow-pmtud-ge-0/0/0.0-i                            224                    4
deny-icmp-ge-0/0/0.0-i                              168                    2

root@Router-1> show firewall log detail
Time of Log: 2014-01-22 22:55:30 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: accept, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
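
The order of the terms in the filter matters, because the first matching term wins. The Python sketch below models that first-match logic for the three terms of the filter above; it is only a toy model (the `TERMS` table and `evaluate` function are hypothetical), not how Junos implements filters.

```python
# Terms in evaluation order: (name, protocol, icmp_type, icmp_code, action).
# None means "match anything" for that field.
TERMS = [
    ("ALLOW_PMTUD", "icmp", 3, 4, "accept"),
    ("DENY_ICMP", "icmp", None, None, "discard"),
    ("ALLOW_ALL", None, None, None, "accept"),
]


def evaluate(protocol, icmp_type=None, icmp_code=None):
    """Return (term name, action) of the first term matching the packet."""
    for name, t_proto, t_type, t_code, action in TERMS:
        if t_proto is not None and t_proto != protocol:
            continue
        if t_type is not None and t_type != icmp_type:
            continue
        if t_code is not None and t_code != icmp_code:
            continue
        return name, action
    # Not reachable here because ALLOW_ALL matches everything.
    return "implicit", "discard"


print(evaluate("icmp", 3, 4))   # Fragmentation Needed
print(evaluate("icmp", 8, 0))   # echo request
print(evaluate("tcp"))          # e.g. the BGP session itself
```

If ALLOW_PMTUD were placed after DENY_ICMP, the type 3 / code 4 messages would still be discarded.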

3. Apply the allow-fragmentation on the Tunnel Interface (on Juniper)

By default, GRE packets will be dropped if they exceed the MTU of the outgoing physical interface. Instead of dropping them, you can tell the Juniper router to split them into more IP fragments - this is achieved with command allow-fragmentation under the gr- (tunnel) interface:

root@Router-1> show configuration interfaces gr-0/0/0
unit 0 {
    tunnel {
        source 192.168.255.2;
        destination 192.168.255.1;
        allow-fragmentation;
    }
    family inet {
        address 192.168.12.2/30;
    }
}

Since you allow fragmentation of the GRE packets, the router will not set the DF-bit anymore. This is the reason why I consider this solution to be more of a workaround: in fact you don't solve the problem, because large BGP UPDATE messages are still sent and they get fragmented on the MPLS PE routers.
A real solution would be to avoid fragmentation!
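
To see what this fragmentation looks like, here is a rough Python sketch that splits an IP packet the way IPv4 fragmentation does (every fragment except the last carries a payload that is a multiple of 8 bytes). The function name is mine and it assumes a 20-byte IP header without options.

```python
from typing import List

IP_HEADER = 20


def fragment_sizes(packet_size: int, mtu: int) -> List[int]:
    """Sizes (IP header included) of the fragments of one IP packet."""
    if packet_size <= mtu:
        return [packet_size]
    payload = packet_size - IP_HEADER
    # Largest per-fragment payload: fits the MTU and is a multiple of 8.
    max_payload = (mtu - IP_HEADER) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(payload, max_payload)
        sizes.append(chunk + IP_HEADER)
        payload -= chunk
    return sizes


print(fragment_sizes(1500, 1492))
```

For the 1500-byte GRE packet of this quiz and a usable MTU of 1492, every full-size packet becomes one large fragment plus one tiny fragment, which is exactly the kind of overhead you want to avoid.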

4. Implement MSS Clamping

Another good solution to avoid fragmentation is to use MSS clamping. This feature modifies (usually decreases) the MSS value in the SYN and SYN/ACK packets to the configured value. As shown above, the MSS value for the BGP sessions that run over GRE tunnels is 1436 (= 1476 (GRE MTU) - 40 (IP+TCP headers)).
On Cisco devices, this is implemented at the global level with ip tcp mss or at the interface level with ip tcp adjust-mss:

CE-HQ(config)#ip tcp mss 1400
CE-HQ(config)#end
CE-HQ#
CE-HQ#clear ip bgp 192.168.12.2
CE-HQ#
 %BGP-5-ADJCHANGE: neighbor 192.168.12.2 Down User reset
 %BGP_SESSION-5-ADJCHANGE: neighbor 192.168.12.2 IPv4 Unicast topology base removed from session  User reset
 %BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
CE-HQ#sh ip bgp s
CE-HQ#sh ip bgp summary
...

Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.168.12.2    4        65200      13      12     2835    0    0 00:00:19      999
192.168.13.2    4        65300      83      88     2835    0    0 01:10:53      848
CE-HQ#
CE-HQ#sh ip bgp nei 192.168.12.2 | i max
  Number of NLRIs in the update sent: max 1010, min 0
minRTT: 48 ms, maxRTT: 484 ms, ACK hold: 200 ms
Datagrams (max data segment is 1400 bytes):
CE-HQ#

[Screenshot: MSS values in the TCP 3-way handshake for the BGP session between HQ and the remote site]
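
How low does the MSS need to be? A quick Python sketch, assuming the 1492-byte usable MTU of this quiz and the same header sizes as before (the function names are only illustrative):

```python
IP_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4


def max_safe_mss(path_ip_mtu: int) -> int:
    """Largest MSS whose GRE-encapsulated packet still fits the path MTU."""
    overhead = IP_HEADER + GRE_HEADER + IP_HEADER + TCP_HEADER
    return path_ip_mtu - overhead


def is_safe(mss: int, path_ip_mtu: int) -> bool:
    """True if a full-size segment crosses the path without fragmentation."""
    return mss <= max_safe_mss(path_ip_mtu)


print(max_safe_mss(1492))
print(is_safe(1400, 1492), is_safe(1436, 1492))
```

So any value up to 1428 would work here; the 1400 configured above is a round number with a bit of safety margin, while the default 1436 is just 8 bytes too large.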

5. Additional tests run on Juniper

On Juniper, I tried several other options that, theoretically, represent solutions to this quiz:

  1. use no-gre-path-mtu-discovery to disable PMTUD for GRE. This can be applied either on the GRE interface or under system internet-options
    For unknown reasons (I suspect due to the virtual hardware that I used for testing) this solution did not work for me.
  2. use no-path-mtu-discovery to disable PMTUD for all outgoing TCP connections.
    This can also be applied either on the GRE interface or under "system internet-options".
    Although this may look like a solution at first sight, it does not work because it disables PMTUD for TCP (the BGP sessions, in our case), which represents the inner header, not for the outer IP header.

Last but not least, let me mention here that, with current IOS versions, BGP performs PMTUD by default:

CE-HQ#sh ip bgp nei 192.168.12.2 | i path-mtu
  Transport(tcp) path-mtu-discovery is enabled
CE-HQ#

Uuuuu, I had no idea that this post would be sooo long... but I wanted to touch all aspects of the quiz and I hope it was an interesting read!

Thank you for your comments and interest in the quiz!
Subscribe to this blog to get more interesting quizzes and detailed solutions.