
This post presents the solution and explanation for quiz-22.
Have a look at the quiz first to understand the problem.

Quiz Review

This quiz starts an interesting discussion about fragmentation and Router ACL behavior. Here are the details of the network configuration from the quiz:

  • Network configuration of your company:
    • the company has 3 sites (behind R1, R2 and R3)
    • Site-1 (R1) and Site-2 (R2) each have a dedicated internet uplink, while Site-3 (R3) connects to everything (intranet and internet) via R2
    • for backup purposes, a backdoor link exists between Site-1 (R1) and Site-2 (R2)
    • your main application server, 1.1.1.10, is hosted in Site-1 (behind R1)
    • the main applications on this server are using TCP 1001 and TCP 1002

[Figure: quiz-22 ACL for fragments, flows]

  • Connectivity from the Site-3 (behind R3, from 172.16.1.0/24):
    • a GRE tunnel is built between R3 and R2 with an MTU of 1440 (due to constraints in the transit network between them)
    • since the TCP 1001 application is consuming a lot of bandwidth, Policy Based Routing (PBR) was configured on R2 to forward TCP 1001 over the backdoor link (so that internet access for users in Site-2 will not be impacted)
    • traffic for the TCP 1002 application (and for other potential applications) will be NAT-ed and sent over the Internet
  • As shown in the above diagram, the TCP connectivity is ok for both applications, TCP 1001 and 1002 (SYN - SYN/ACK - ACK)

Problem Description

Unfortunately, Site-3 users (172.16.1.10) are reporting the following, while trying to upload data to the server:

  • the application on TCP 1001 works OK, using the backdoor link
  • the application on TCP 1002 does not work: the connection to server 1.1.1.10 gets established, but the data transfer freezes soon after the connection is made and eventually times out

At first it may seem strange: how is it possible that the control channel (TCP connectivity) works fine while the data transfer does not? There's no firewall in the path, only routing is involved!
One of the things that you should consider in such scenarios is MTU! Reading the quiz more carefully shows that MTU is indeed involved (set to 1440 on the GRE tunnel between R3 and R2), and its presence is also visible in the 2 packet captures attached to the quiz: both of them show "Fragmented IP protocol" (fragments):

TCP 1001 (working)
TCP 1002 (not working)

How do ACLs handle fragments ?

The trick in this quiz is the way ACLs handle fragments, especially ACLs that contain lines referring to Layer 4 (port numbers). The problem is that not all fragments contain the Layer 4 information.
Let's see how ACLs deal with fragments:

  • ACL entry contains only Layer-3 information:
    • if match, perform the action of that ACL entry (permit or deny)
    • if NO match, evaluate the next ACL entry/line
  • ACL entry contains only Layer-3 information with the "fragments" keyword:
    • if the packet is a non-initial fragment, then perform the action (permit or deny)
    • if the packet is NOT a non-initial fragment (a non-fragment or an initial fragment), then evaluate the next ACL entry
  • ACL entry contains Layer-3 and Layer-4 information:
    • non-fragments (contain both L3 and L4) => ACL can evaluate all fields and perform the indicated action
    • initial fragments also contain both L3 & L4 => ACL can evaluate all fields and perform the indicated action
    • non-initial fragments contain only Layer-3 info => ACL cannot evaluate all fields:
      • L3 info in the fragment matches the ACL entry and action is PERMIT => packet allowed
      • L3 info in the fragment matches the ACL entry and action is DENY => move to next ACL entry
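
To make these rules concrete, here is a minimal Python sketch of the evaluation logic. It is only a model under stated assumptions, not IOS code: the Packet and Ace classes and the evaluate function are hypothetical names, addresses are matched as exact hosts or "any" (no wildcard masks), and the "fragments" keyword is modeled as matching non-initial fragments only. The sample ACL mirrors the PBR ACL from the quiz (permit TCP 1001, then deny everything).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    src: str
    dst: str
    proto: str = "tcp"
    dport: Optional[int] = None  # None: no L4 header (non-initial fragment)
    frag_offset: int = 0         # non-zero means non-initial fragment


@dataclass
class Ace:
    action: str                  # "permit" or "deny"
    src: str                     # host address or "any"
    dst: str                     # host address or "any"
    proto: str = "ip"            # "ip" matches every protocol
    dport: Optional[int] = None  # None: Layer-3-only entry
    fragments: bool = False      # the "fragments" keyword


def l3_match(ace: Ace, pkt: Packet) -> bool:
    return (ace.proto in ("ip", pkt.proto)
            and ace.src in ("any", pkt.src)
            and ace.dst in ("any", pkt.dst))


def evaluate(acl: List[Ace], pkt: Packet) -> str:
    non_initial = pkt.frag_offset > 0
    for ace in acl:
        if not l3_match(ace, pkt):
            continue
        if ace.fragments:
            # Assumption: the keyword applies to non-initial fragments only.
            if non_initial:
                return ace.action
            continue
        if ace.dport is None:
            return ace.action        # L3-only entry: applies to every packet
        if non_initial:
            # L4 fields cannot be checked: permit wins, deny falls through.
            if ace.action == "permit":
                return "permit"
            continue
        if pkt.dport == ace.dport:
            return ace.action
    return "deny"                    # implicit deny at the end of the ACL


SRC, DST = "172.16.1.10", "1.1.1.10"
ACL_BACKDOOR = [
    Ace("permit", SRC, DST, proto="tcp", dport=1001),
    Ace("deny", "any", "any"),
]

print(evaluate(ACL_BACKDOOR, Packet(SRC, DST, dport=1001)))        # permit
print(evaluate(ACL_BACKDOOR, Packet(SRC, DST, dport=1002)))        # deny
print(evaluate(ACL_BACKDOOR, Packet(SRC, DST, frag_offset=1416)))  # permit
```

The last call is the interesting one: a non-initial fragment carries no port number, so it is permitted by the TCP 1001 entry no matter which application it belongs to.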

Now, getting back to our quiz and the PBR on R2:

[Figure: quiz-22 ACL for fragments, explanation]

NOTE that this example is for an end-to-end Path MTU of 1500.
The calculations for our quiz (with the GRE header involved) are a bit more complicated.

Applying the rules explained above, here is what happens in case of flows over TCP 1002:

  • non-fragments do not match ACL (TCP port does not match) => no PBR & sent over the internet WITH NAT
  • initial fragments do not match ACL (TCP port does not match) => no PBR & sent over the internet WITH NAT
  • non-initial fragments match ACL (these fragments don't contain Layer-4 port numbers and since the Layer-3/IP addresses match, they will be allowed) => PBR is applied and packets are sent over the Backdoor link WITHOUT NAT

To demonstrate this, let's perform a packet capture at the server side for TCP 1002:

root@vb02-freebsd9:~ # tcpdump -vvv -n -i em2
tcpdump: listening on em2, link-type EN10MB (Ethernet), capture size 65535 bytes
#
# This is the SYN-SYN/ACK-ACK
# Note the MSS=1436
# Note the TTL=60 for all packets routed over the internet
#
IP (tos 0x0, ttl 60, id 413, offset 0, flags [none], proto TCP (6), length 60)
    2.2.2.2.51668 > 1.1.1.10.1002: Flags [S], cksum 0x8f4f (correct), seq 1602060910, win 65535, options [mss 1436,nop,wscale 6,sackOK,
    TS val 1442555 ecr 0], length 0
IP (tos 0x0, ttl 64, id 227, offset 0, flags [none], proto TCP (6), length 60)
    1.1.1.10.1002 > 2.2.2.2.51668: Flags [S.], cksum 0x063d (incorrect -> 0x1ac9), seq 454032707, ack 1602060911, win 65535, options [mss 1436,nop,wscale 6,sackOK,
    TS val 2252658141 ecr 1442555], length 0
IP (tos 0x0, ttl 60, id 414, offset 0, flags [none], proto TCP (6), length 52)
    2.2.2.2.51668 > 1.1.1.10.1002: Flags [.], cksum 0x4510 (correct), seq 1, ack 1, win 1045, options [nop,nop,
    TS val 1442642 ecr 2252658141], length 0
#
#
# A fragment follows without NAT - note the following:
# - the source address is real IP instead of NAT
# - there is no Layer-4 information
# - the non-zero fragment offset (this is a fragment)
# - the TTL=61 (lower number of hops since it goes via backdoor)
#
IP (tos 0x0, ttl 61, id 415, offset 1416, flags [none], proto TCP (6), length 60)
    172.16.1.10 > 1.1.1.10: ip-proto-6
#
#
# This is an initial-fragment - note the following:
# - it arrives out of order
# - Layer-3 info: fragment offset=0 (1st fragment), flags ="More Fragments"
# - it contains Layer-4 info
#
IP (tos 0x0, ttl 60, id 415, offset 0, flags [+], proto TCP (6), length 1436)
    2.2.2.2.51668 > 1.1.1.10.1002: Flags [.], seq 1:1385, ack 1, win 1045, options [nop,nop,TS val 1442642 ecr 2252658141], length 1384
...
...
#
# This repeats over and over (retransmissions) until it times out
#
#############################################
#
# The NAT table on R2:
#
R2#sh ip nat tra
Pro Inside global      Inside local       Outside local      Outside global
tcp 2.2.2.2:51668      172.16.1.10:51668  1.1.1.10:1002      1.1.1.10:1002
R2#
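
The fragment sizes seen in the capture can be reproduced with a bit of arithmetic. The sketch below is a simplified IPv4 fragmentation model in Python; the fragment function is a hypothetical helper, not part of any tool used in the quiz. It assumes a 20-byte IP header without options and an original packet of 1476 bytes, a value I am inferring from the two fragment lengths in the capture (1416 + 40 bytes of payload plus one 20-byte header).

```python
def fragment(total_len, mtu, ihl=20):
    """Split an IPv4 packet into (offset, fragment_length, more_fragments) tuples.

    Simplified model: every fragment carries an ihl-byte header, and each
    fragment payload except the last must be a multiple of 8 bytes.
    """
    payload = total_len - ihl
    max_chunk = (mtu - ihl) // 8 * 8   # largest 8-byte-aligned payload
    frags = []
    offset = 0
    while offset < payload:
        chunk = min(max_chunk, payload - offset)
        more = offset + chunk < payload
        frags.append((offset, chunk + ihl, more))
        offset += chunk
    return frags


# Assumed 1476-byte packet crossing the GRE tunnel with MTU 1440
for offset, length, more in fragment(1476, 1440):
    print(f"offset {offset}, length {length}, more fragments: {more}")
# offset 0, length 1436, more fragments: True
# offset 1416, length 60, more fragments: False
```

These are the two packets from the capture: the 1436-byte initial fragment with the "More Fragments" flag and the 60-byte fragment at offset 1416.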

Solutions

The best solution (and the recommendation from a design point of view) is to eliminate the fragments, and the best way to achieve this is with Path MTU Discovery (PMTUD).
Attention though: you cannot solve this quiz by tweaking the ACL used for the PBR while fulfilling the requirements at the same time (details below)!

1. Enable PMTUD
If you want to review how Path MTU Discovery (PMTUD) works, have a look at this section that I wrote in a separate article.
Most servers already have PMTUD enabled by default... but since we don't live in a perfect world, there might be exceptions.
Another thing that you must be aware of is that hosts (servers and clients alike) cache the PMTUD value per destination. Every time a host initiates a TCP connection to that particular destination, it sets the MSS (Maximum Segment Size) to the value of MTU - 40 (20 bytes of IP header plus 20 bytes of TCP header). For the MTU, it uses either the cached PMTUD value OR the MTU of the outgoing interface (usually 1500) if there's no cached value for that particular destination.
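
As a quick illustration of the MSS rule, here is a small Python sketch. The advertised_mss function is a hypothetical name, and the model assumes plain 20-byte IP and TCP headers (the MSS value itself does not account for TCP options).

```python
IP_HEADER = 20   # bytes, no IP options assumed
TCP_HEADER = 20  # bytes, TCP options are not part of the MSS calculation


def advertised_mss(cached_path_mtu=None, interface_mtu=1500):
    """Return the MSS a host would announce: MTU - 40.

    The cached path MTU is used when present, otherwise the MTU of the
    outgoing interface.
    """
    mtu = cached_path_mtu if cached_path_mtu is not None else interface_mtu
    return mtu - IP_HEADER - TCP_HEADER


print(advertised_mss())                      # 1460, no cached value
print(advertised_mss(cached_path_mtu=1476))  # 1436
print(advertised_mss(cached_path_mtu=1440))  # 1400
```

The second value, 1436, is the MSS visible in the SYN of the server-side capture, which is what a cached path MTU of 1476 would produce.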

#
# Check if PMTUD is enabled (1) or disabled (0)
#
root@vb03-freebsd:~/client # sysctl -a | grep mtu_discovery
net.inet.tcp.path_mtu_discovery: 0
#
# Check if there's a cache list of hosts:
# The Count number should tell how many hosts are cached
#
root@vb03-freebsd:~/client # sysctl -a | grep cache | grep tcp | grep count
net.inet.tcp.hostcache.count: 1
#
# Check the cached PMTUD values
#
root@vb03-freebsd:~/client # sysctl -o net.inet.tcp.hostcache.list
net.inet.tcp.hostcache.list:
IP address        MTU  SSTRESH      RTT   RTTVAR BANDWIDTH     CWND SENDPIPE RECVPIPE HITS  UPD  EXP
1.1.1.10         1476        0     97ms     31ms         0    11252        0        0   16    3 1800

root@vb03-freebsd:~/client #

As you saw in my quiz, the client has a cached PMTUD value of 1476 and, at this moment, PMTUD is disabled!
You may wonder why there's a cached PMTUD value when PMTUD is disabled. The answer is that, to demonstrate this quiz, I did the following:

  • with the default host config (PMTUD enabled) and no MTU set on the GRE tunnel -> made a test connection -> client learned the PMTUD = 1476 (1500 - 24 bytes of GRE overhead)
  • then I configured a lower MTU of 1440 on the GRE tunnels
  • I also disabled PMTUD with the command sysctl -w net.inet.tcp.path_mtu_discovery="0" so that the client cannot learn the new PMTUD value

You will say that it was not nice of me to hack it this way, but I'll say: it was worth it to demonstrate this quiz ☺

As already mentioned, the best solution for the quiz would be to re-enable PMTUD on the client so that it discovers the new MTU:

root@vb03-freebsd:~/client # sysctl -w net.inet.tcp.path_mtu_discovery=1
net.inet.tcp.path_mtu_discovery: 0 -> 1
root@vb03-freebsd:~/client #
root@vb03-freebsd:~/client # nc 1.1.1.10 1002 < test.file                !! This is my test and it's successful !!
root@vb03-freebsd:~/client #
root@vb03-freebsd:~/client # sysctl -o net.inet.tcp.hostcache.list
net.inet.tcp.hostcache.list:
IP address        MTU  SSTRESH      RTT   RTTVAR BANDWIDTH     CWND SENDPIPE RECVPIPE HITS  UPD  EXP
1.1.1.10         1440        0    106ms     28ms         0    11252        0        0   26    6 3600


2. MSS Clamping

MSS Clamping means that the network engineer configures the routers to modify the MSS value that the client and server exchange in the TCP 3-way handshake. For the sake of the exercise, I will use a value of 1300:

R3#conf t
R3(config)#int tun1
R3(config-if)#ip tcp adjust-mss ?
  <500-1460>  Maximum segment size in bytes

R3(config-if)#ip tcp adjust-mss 1300
R3(config-if)#end
root@vb03-freebsd:~/client # nc 1.1.1.10 1002 < test.file   !! Test ok !!
root@vb03-freebsd:~/client #
root@vb03-freebsd:~/client # sysctl -o net.inet.tcp.hostcache.list
net.inet.tcp.hostcache.list:
IP address        MTU  SSTRESH      RTT   RTTVAR BANDWIDTH     CWND SENDPIPE RECVPIPE HITS  UPD  EXP
1.1.1.10         1476        0    111ms     40ms         0    11252        0        0   13    3 3600

root@vb03-freebsd:~/client #

As shown, the cached PMTUD value is still the wrong one, 1476, since PMTUD is disabled in the quiz! But this no longer affects the communication: with the MSS clamped to 1300, the largest IP packet is 1340 bytes (1300 + 40 bytes of IP and TCP headers), well below the tunnel MTU of 1440, so nothing gets fragmented.

3. Modify the ACL PBR to send TCP 1002 over backdoor (same as working TCP 1001)

As some of you already indicated in the quiz, you can modify the ACL used for the PBR to send TCP 1002 over the backdoor link, same as for TCP 1001:

R2#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R2(config)#ip access-list extended ACL_BACKDOOR
R2(config-ext-nacl)#15 permit tcp host 172.16.1.10 host 1.1.1.10 eq 1002
R2(config-ext-nacl)#end

Problems with this solution:
- it violates the quiz requirements (which could represent a business requirement that you, as a network engineer, would have to follow!)
- it still does not solve the problem for other flows that contain fragments (TCP ports other than 1001 & 1002)

4. Deny fragments from PBR - breaks TCP 1001 !!

Since the problem is that fragments of TCP 1002 match the PBR ACL, one could modify this ACL to deny such fragments from being routed over the backdoor by the PBR:

R2#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R2(config)#ip access-list extended ACL_BACKDOOR
R2(config-ext-nacl)#5 deny ip host 172.16.1.10 host 1.1.1.10 fragments
R2(config-ext-nacl)#end
R2#sh access-list
Extended IP access list ACL_BACKDOOR
    5 deny ip host 172.16.1.10 host 1.1.1.10 fragments
    10 permit tcp host 172.16.1.10 host 1.1.1.10 eq 1001 (296 matches)
    20 deny ip any any (433 matches)
root@vb03-freebsd:~/client # nc 1.1.1.10 1002 < test.file        !! Test ok !!
root@vb03-freebsd:~/client #
root@vb03-freebsd:~/client # nc 1.1.1.10 1001 < test.file        !! TCP 1001 does not work anymore !!               
^C
root@vb03-freebsd:~/client #

As you can see, we made it work for TCP 1002 but at the same time we broke TCP 1001... Why is that?
It's because TCP 1001 also uses fragments, and those fragments are now denied by the new ACL entry #5, so PBR no longer applies to them and the router sends them via the Internet (while TCP 1001 non-fragments and initial fragments are still permitted by the ACL and policy-routed over the backdoor)!
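
The effect of entry #5 on both applications can be sketched in a few lines of Python. This is a simplified model with hypothetical names (pbr_path, flow_paths): every packet is assumed to go from 172.16.1.10 to 1.1.1.10, so only the port and the fragment type are modeled, and the "fragments" keyword is assumed to match non-initial fragments only. A flow whose packets end up on two different paths is a broken flow in this scenario, since only the internet path is NAT-ed.

```python
# Each entry: (action, tcp_port or None, has_fragments_keyword)
ORIGINAL_ACL = [
    ("permit", 1001, False),  # 10 permit tcp host 172.16.1.10 host 1.1.1.10 eq 1001
    ("deny", None, False),    # 20 deny ip any any
]
# Entry 5: deny ip host 172.16.1.10 host 1.1.1.10 fragments
MODIFIED_ACL = [("deny", None, True)] + ORIGINAL_ACL


def pbr_path(acl, dport, non_initial):
    """Return 'backdoor' if the PBR ACL permits the packet, else 'internet'.

    dport is ignored for non-initial fragments: the router cannot see it.
    """
    for action, port, frag_kw in acl:
        if frag_kw:
            if not non_initial:
                continue          # keyword entry skips non-fragments/initial
        elif port is not None:
            if non_initial:
                if action != "permit":
                    continue      # deny with L4 info: try the next entry
            elif dport != port:
                continue          # port mismatch: try the next entry
        return "backdoor" if action == "permit" else "internet"
    return "internet"             # implicit deny: normal routing


def flow_paths(acl, dport):
    """Paths used by a flow that sends both initial and non-initial fragments."""
    return {pbr_path(acl, dport, False), pbr_path(acl, dport, True)}


for name, acl in (("original", ORIGINAL_ACL), ("modified", MODIFIED_ACL)):
    for port in (1001, 1002):
        print(name, port, sorted(flow_paths(acl, port)))
# original 1001 ['backdoor']
# original 1002 ['backdoor', 'internet']
# modified 1001 ['backdoor', 'internet']
# modified 1002 ['internet']
```

In this model the new entry simply moves the split from TCP 1002 to TCP 1001, which matches the test results above.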

5. Adjust (lower) the MTU on all links

Adjusting the MTU on all links will eliminate the need for the servers to use PMTUD and thus the fragments will disappear... Unfortunately, even though it's recommended and desirable to have the same MTU value along the path, it's not always possible, especially for links that are not under your administration.

Thanks again for all your comments in the quiz !