How is Pod traffic addressed in VPN-connected Google Kubernetes Engine?

A post-mortem describing how I learned to pay closer attention to Pod secondary IP ranges and SNAT when GKE private clusters talk to external resources in a VPN-connected cloud.

Josh Bielick
May 13, 2020
A network diagram of Google Cloud and Amazon Web Service Private Clouds networked via VPN

The following post is a redacted incident report I wrote recently for a temporary connectivity outage between our GCP Composer workloads in a GKE cluster and our AWS VPC private resources. This may be helpful to you if your GCP VPC is VPN-connected without BGP dynamic routing.

What happened?

Around 9PM EDT on 4/16/2020, GCP Composer Airflow workers lost connectivity to AWS VPC private resources like Redshift and other databases. This caused the majority of Airflow tasks to continuously fail until the issue was resolved at 12:24 PM EDT on 4/17/2020.

The errors looked similar to this:

Can't connect to MySQL server on 'prod-db.us-east1.rds.amazonaws.com'

Why did this happen?

An automatic node upgrade to kubelet 1.14 occurred on our Composer Kubernetes cluster. In that version, GKE introduced new masquerading (SNAT) rules for Pod IP address translation. More details are available in this table.

Prior to kubelet 1.14, a Pod's IP address was masqueraded as the Node's IP for all outbound traffic except traffic destined for 10.0.0.0/8 (GCP VPC local). Therefore, when a Pod sent traffic to the AWS VPC, its source address was that of a GCE VM (good). The router routed this properly, and the firewalls were configured to allow this traffic (based on the similar assumption that anything from 10.128.0.0/9 came from the Google side).

In GKE, Pod IP addresses are taken from secondary ranges (10.2.0.0/20, for example) rather than from the subnet range the node pool resides in. When Pod traffic is not masqueraded by the Node, the source IP address is the Pod's, not the Node's. As a result, the traffic could not be routed properly and was also being blocked by firewall rules: the source addresses we expected were in 10.128.0.0/9, but the traffic was coming from the secondary range 10.2.0.0/20, outside of the expected and allowed subnet range.

In GKE kubelet 1.14, 172.16.0.0/12 was introduced as a "non-masquerade destination" (meaning traffic to that range would not be masqueraded by the Node), because GKE assumes all private, reserved ranges are internal and prefers to preserve the Pod IP. This 172 range overlaps with our AWS VPC private range.

Therein lies the problem. Traffic sent to an AWS VPC private IP inside 172.16.0.0/12 was no longer being masqueraded by the Node and thus had the Pod's IP as its source address.
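
If you want to see this directly on a node, a quick check like the following is enough. This is a sketch, assuming GKE's kubenet setup where the masquerade rules live in an IP-MASQ chain of the nat table (the same chain shown in full later in this post):

# On an affected GKE node (via SSH): list the NAT rules for Pod traffic
# and look for a RETURN rule covering the AWS range.
sudo iptables -t nat -L IP-MASQ -n -v | grep "172.16"

# A RETURN rule for 172.16.0.0/12 means traffic to that range leaves the
# node with the Pod IP as the source address (no SNAT applied).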

Discovery

Without any initial debugging or context, the error messages from the Airflow jobs indicated only that they could not connect to a host. The easiest explanations I could think of were the following:

  1. One or both of the VPN tunnels from GCP to AWS were down and not allowing traffic
  2. DNS was resolving public IP addresses and public traffic is not permitted to the private resources
  3. A firewall rule change was suddenly blocking traffic or traffic wasn’t being routed through the VPN tunnel
  4. Something was broken or misconfigured in the Kubernetes/GCP network fabric

My typical debugging process is partitioning the problem space into a tree of causal possibilities. The root node is the symptom we're experiencing (what's wrong). Children of the root node are possible causes of, or contributors to, the symptom. More edges and nodes grow from each child until something describes the root cause. When a few good root-cause possibilities are identified, you can assign a cost to collecting data and proving or disproving each root (or subsequent) edge. This might look like breadth-first traversal. "Low-cost" edges that connect large branches are often the best to get out of the way first. In my experience they're often the root cause you're looking for, too (see: Occam's Razor).

1. One or both of the VPN tunnels from GCP to AWS were down and not allowing traffic

The easiest thing to verify from the above list was that the tunnel and the AWS DNS server were up and running. I checked the tunnel status in GCP and both tunnels were "Established". Google recently released a fantastic new Connectivity Test tool (in which we created a VPN connectivity test to an AWS server), and the test showed that connectivity was "Reachable". Though we would have seen an alert, it's still necessary to confirm and collect data (i.e., can we connect right now? what is the tunnel status right now?).
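
For reference, these checks can also be run from the CLI. This is a rough sketch, assuming gcloud is authenticated for the right project; the test name is hypothetical, the IPs come from examples later in this post, the MySQL port is an assumption, and depending on your gcloud version the Connectivity Test commands may live under gcloud beta (you may also need to specify the source network or project):

# Tunnel state: both tunnels should report ESTABLISHED.
gcloud compute vpn-tunnels list --format="table(name,status,detailedStatus)"

# A Connectivity Test from a GCP VM IP to a private AWS IP over TCP.
gcloud network-management connectivity-tests create gcp-to-aws-db \
  --source-ip-address=10.142.0.6 \
  --destination-ip-address=172.30.12.110 \
  --destination-port=3306 \
  --protocol=TCP
gcloud network-management connectivity-tests describe gcp-to-aws-db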

Conclusion: VPN Tunnel Connectivity was not the issue. Next.

2. DNS was resolving public IP addresses and public traffic is not permitted to the private resources

I then checked the status of the AWS Route53 Resolver (inbound DNS server) through the AWS console. The Google VPC uses this server to resolve Amazon domain queries, so that we always use private IP addresses for AWS endpoints. The AWS console did not show any Resolvers in our region/account, but this is an unrelated issue that I've seen before. I verified via Terraform that our DNS server still exists, and it was reported to be online (it is used by employee VPN connections as well).
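
Console quirks aside, the resolver endpoint can also be checked from the CLI. A sketch, assuming AWS credentials and the correct region are configured:

# List Route53 Resolver endpoints; the inbound endpoint used by GCP should
# show Status OPERATIONAL.
aws route53resolver list-resolver-endpoints \
  --query "ResolverEndpoints[].{Id:Id,Direction:Direction,Status:Status}"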

Conclusion: The DNS server is up and running. Next.

The next easiest thing to verify from the above list was that DNS was resolving to private IPs properly. I SSHed into a Composer Kubernetes node and tested DNS inside and outside of a container. DNS was resolving properly from both, which meant A.) AWS DNS in GCP was working correctly and B.) connectivity to AWS was definitely working.

However, connectivity to the resolved private IP address was only possible from the Node, not the Pod. (argh!)
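
Roughly what those checks looked like. This is a sketch: the Pod name is hypothetical, the hostname and IP come from examples in this post, the MySQL port is an assumption, and it assumes dig and nc (or equivalents) are available in the node and container images:

# From the Node (after SSHing in): DNS resolves to a private AWS IP and the
# TCP connection succeeds.
dig +short prod-db.us-east1.rds.amazonaws.com
nc -vz -w 5 172.30.12.110 3306

# From a Pod on the same Node: DNS still resolves correctly...
kubectl exec airflow-worker-abc123 -- dig +short prod-db.us-east1.rds.amazonaws.com
# ...but the same TCP connection hangs and times out.
kubectl exec airflow-worker-abc123 -- nc -vz -w 5 172.30.12.110 3306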

This suggested (to me) some unexpected behavior in the Kubernetes network plugin (kubenet). Networking in Kubernetes is still a bit mystical to me, so I opted not to start debugging there until I had verified whether traffic from a Pod was reaching the AWS side of the network at all.

Conclusion: DNS was not the issue (though they say it usually is). Next.

3. A firewall rule change was suddenly blocking traffic or traffic wasn’t being routed through the VPN tunnel

Verifying that traffic is leaving one VPC and entering another through a VPN is tricky. I set out to verify that traffic from GCP Pods was actually reaching either A.) the VPN tunnel to AWS or B.) a network interface in AWS. Typically, traceroute or tcptraceroute can help, but in this case there were no detectable hops between the GCP network and the AWS network. Debugging continues. There are no logs for AWS Site-to-Site VPN, so finding log information about packets getting routed was a no-go. The next idea I had was looking at AWS VPC flow logs, which can show ACCEPT or REJECT for connections on all network interfaces in your VPC. I turned on flow logs for the AWS VPC and used a private host in the AWS VPC as a test target, on a port that allowed inbound traffic from the GCP range. I watched the flow logs of this EC2 instance's network interface in CloudWatch while attempting to connect to that EC2 instance from the GCP Node and from Pods on that Node.
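
Enabling the flow logs looked roughly like this. A sketch: the VPC ID, log group name, and IAM role ARN are placeholders, and the aws logs tail command assumes AWS CLI v2:

# Turn on flow logs for the whole VPC, delivered to CloudWatch Logs.
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name aws-vpc-flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role

# Watch entries for the test target while attempting connections from the
# GCP Node and from Pods on that Node.
aws logs tail aws-vpc-flow-logs --follow --filter-pattern "172.30.12.110"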

For a request from the GCP Node, the source address was as expected, in the 10.128.0.0/9 range (10.142.0.6, for example):

2020-04-17T12:01:17... eni-... 10.142.15.203 172.30.12.110 ... ACCEPT OK

When attempting to connect via the Pod, the source address was the Pod's IP address! Remember that secondary range? These packets were rejected by the firewall rules:

2020-04-17T11:55:14... eni-... 10.44.14.15 172.30.12.110 ... REJECT OK

The Pod IP as the source address is an issue because Pods use CIDR blocks separate from the normal GCP VPC range, and the cloud router on the AWS side probably doesn't know how to route that. Furthermore, the firewall rules on private instances did not allow traffic from this secondary Pod range, hence the REJECT.

Root cause: Nodes were no longer masquerading Pod traffic.

This seemed like brand new behavior because this Airflow cluster is more than a year old. I had always been under the impression that the Node would NAT pod traffic before it leaves the Node.

I began testing traffic behavior from a different production cluster that wasn't having any issues making new connections to private AWS infrastructure, and confirmed that Pod traffic there was being masqueraded as the Node's IP. So I began googling.

I found a particularly helpful GitHub issue thread. It clearly describes the issue we were experiencing and even offers debug output from others and great discourse on the underlying reasons things were not working as expected.

This comment, which is highly praised, links to a Google GKE guide for manually creating an IP Masquerade Agent in your cluster with settings of your choosing. This agent (a DaemonSet) automatically updates the iptables rules on each node based on the configuration you give it. The Kubernetes docs also have some information about this agent. It overrides the default behavior of not masquerading traffic to all the reserved ranges (like 172.16.0.0/12, the problematic range).

I also found this Stack Overflow post, which explains that GKE/kubelet versions prior to 1.14 masqueraded Pod IPs for any destination range that wasn't 10.0.0.0/8. In other words, 10.0.0.0/8 was the only destination range for which masquerading was turned off (i.e., the Pod IP, not the Node IP, is the source address on the packet).

I believe the underlying Kubernetes cluster for the Composer environment was updated to Kubernetes 1.14.10-gke.27. This cluster is part of a hybrid, Google-managed product, Cloud Composer, and control/maintenance of the cluster is shared.
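
The cluster and node versions can be confirmed with something like the following sketch (the cluster name and zone are placeholders):

# Master and node versions as GKE reports them.
gcloud container clusters describe composer-cluster --zone us-east1-b \
  --format="value(currentMasterVersion,currentNodeVersion)"

# Kubelet version per node, as Kubernetes reports it.
kubectl get nodes -o wide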

That information is corroborated in the following table, in the rows for "Versions before 1.14" and "other image types without ip-masq-agent enabled".

It is my belief that our cluster was pre-1.14 before the outage and that the node image was not COS (Container-Optimized OS). I can attest that it did not have ip-masq-agent enabled (network policy was not enabled, and the cluster's Pod CIDR range is within 10.0.0.0/8).
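
A quick way to check whether ip-masq-agent is deployed; on a cluster like this one, where it isn't, both commands come back NotFound:

kubectl -n kube-system get daemonset ip-masq-agent
kubectl -n kube-system get configmap ip-masq-agent -o yaml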

The node upgrade theory is supported by the fact that 18 hours earlier, at 9 PM EDT, the pods on those nodes were evicted. Furthermore, the node's current uptime is 18 hours, suggesting it was restarted (or started for the first time) at 9 PM last night.

What did the “default non-masquerade destinations” look like after the upgrade?

Matching this table in the IP Masquerade Agent Docs, the node had the following iptables rules:

Chain IP-MASQ (2 references)
target prot opt source destination
RETURN all -- anywhere 169.254.0.0/16 /* local traffic is not subject to MASQUERADE */
RETURN all -- anywhere 10.0.0.0/8 /* local traffic is not subject to MASQUERADE */
RETURN all -- anywhere 172.16.0.0/12 /* local traffic is not subject to MASQUERADE */
^ this overlaps with AWS range,
where we *need* masquerading
RETURN all -- anywhere 192.168.0.0/16 /* local traffic is not subject to MASQUERADE */
RETURN all -- anywhere 100.64.0.0/10 /* local traffic is not subject to MASQUERADE */
RETURN all -- anywhere 192.0.0.0/24 /* local traffic is not subject to MASQUERADE */
RETURN all -- anywhere 192.0.2.0/24 /* local traffic is not subject to MASQUERADE */
RETURN all -- anywhere 192.88.99.0/24 /* local traffic is not subject to MASQUERADE */
RETURN all -- anywhere 198.18.0.0/15 /* local traffic is not subject to MASQUERADE */
RETURN all -- anywhere 198.51.100.0/24 /* local traffic is not subject to MASQUERADE */
RETURN all -- anywhere 203.0.113.0/24 /* local traffic is not subject to MASQUERADE */
RETURN all -- anywhere 240.0.0.0/4 /* local traffic is not subject to MASQUERADE */
MASQUERADE all -- anywhere anywhere /* outbound traffic is subject to MASQUERADE (must be last in chain) */

The problematic rule is 172.16.0.0/12, which overlaps with our AWS VPC CIDR block of 172.30.0.0/16. Thus, traffic to private IPs in AWS like 172.30.10.112 was not masqueraded, which meant the AWS VPC saw the traffic coming from Pod IP addresses (outside of the allowed GCP VPC CIDR of 10.128.0.0/9, because Pod IP ranges are secondary ranges).

Resolution

I created an IP Masquerade Agent DaemonSet, per this guide, to revert to the old rule of treating just 10.0.0.0/8 as a non-masquerade destination.
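
A minimal sketch of the configuration that achieves this, following the Kubernetes ip-masq-agent docs; the DaemonSet manifest itself comes from the guide linked above. Only 10.0.0.0/8 is left as a non-masquerade destination, so traffic to 172.16.0.0/12 is SNATed again:

# Config consumed by ip-masq-agent (the "config" key of a ConfigMap named
# ip-masq-agent in kube-system).
cat <<'EOF' > config
nonMasqueradeCIDRs:
  - 10.0.0.0/8
resyncInterval: 60s
EOF

kubectl -n kube-system create configmap ip-masq-agent --from-file=config

# Then deploy the ip-masq-agent DaemonSet from the guide; it periodically
# reloads this ConfigMap and rewrites the IP-MASQ chain on every node.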

Final Thoughts

The silver lining of this situation is that it did not occur on the core production cluster, and now we know what we need to fix. The HA Cloud VPN now available in GCP, with BGP dynamic routing, is next on our implementation list. I walked away with a much better understanding of Pod IP address translation and of the different Kubernetes CNI plugins. I hope this helps someone solve a similar issue.

If you’d like to hear more about a multi-cloud setup with all private DNS and VPN connectivity, connect with me on Twitter and let’s chat!

