Ghost in the Edge: How a 30-Second EIP Fix Ended a Production Outage That Defied Every Diagnostic
A deep-dive into a phantom AWS networking failure where every indicator said the server was healthy, every diagnostic came back clean, and the fix turned out to be one CLI command that most troubleshooting guides never mention.
The Sunday Morning Alert
It started the way production incidents always start: quietly, at a bad time. Early on a Sunday morning, routine monitoring showed that web3 — a public-facing Amazon Linux 2 EC2 instance in us-east-1 — was responding intermittently. Pings were dropping. SSH connections were sluggish and unreliable. HTTP requests were timing out.
The initial ping test told the story in numbers:
$ ping web3
PING ec2-52-201-74-79.compute-1.amazonaws.com (52.201.74.79): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
64 bytes from 52.201.74.79: icmp_seq=2 ttl=246 time=20.107 ms
64 bytes from 52.201.74.79: icmp_seq=3 ttl=246 time=19.997 ms
Request timeout for icmp_seq 4
64 bytes from 52.201.74.79: icmp_seq=5 ttl=246 time=20.666 ms
64 bytes from 52.201.74.79: icmp_seq=6 ttl=246 time=20.216 ms
64 bytes from 52.201.74.79: icmp_seq=7 ttl=246 time=20.209 ms
64 bytes from 52.201.74.79: icmp_seq=8 ttl=246 time=20.249 ms
64 bytes from 52.201.74.79: icmp_seq=9 ttl=246 time=19.765 ms
64 bytes from 52.201.74.79: icmp_seq=10 ttl=246 time=19.556 ms
64 bytes from 52.201.74.79: icmp_seq=11 ttl=246 time=19.983 ms
Request timeout for icmp_seq 12
64 bytes from 52.201.74.79: icmp_seq=13 ttl=246 time=19.928 ms
64 bytes from 52.201.74.79: icmp_seq=14 ttl=246 time=20.326 ms
--- ec2-52-201-74-79.compute-1.amazonaws.com ping statistics ---
15 packets transmitted, 11 packets received, 26.7% packet loss
Nearly twenty-seven percent packet loss to a production endpoint. On its own, ICMP loss isn't conclusive — routers regularly deprioritize ping traffic. But SSH confirmed the problem was real: connections established, but they were crippled, with visible lag and frequent stalls. This wasn't cosmetic. It was a production outage affecting all public traffic.
Ruling Out the Obvious
The first instinct in any EC2 networking incident is to look at the instance itself. Is the NIC failing? Has the kernel wedged something? Did a reboot break the driver? We ran through the standard checklist methodically, and everything came back clean.
ENA Driver and NIC Health
The Elastic Network Adapter statistics — the gold standard for diagnosing EC2 networking problems — showed nothing wrong:
# ethtool -S eth0 (filtered)
tx_timeout: 0
missing_intr: 0
missing_tx_cmpl: 0
bw_in_allowance_exceeded: 0
bw_out_allowance_exceeded: 0
pps_allowance_exceeded: 0
conntrack_allowance_exceeded: 0
conntrack_allowance_available: 51299
queue_0_rx_page_alloc_fail: 0
queue_0_rx_dma_mapping_err: 0
queue_0_rx_bad_desc_num: 0
Every counter that matters was zero. No bandwidth allowance exhaustion, no packet-per-second throttling, no conntrack overflow, no DMA mapping errors, no missed interrupts. The ENA driver logged a completely normal initialization sequence on boot with no resets, link flaps, or timeout storms. On paper, this NIC was in perfect health.
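Checking these counters by hand is error-prone under incident pressure. A small script can scan ethtool -S output for the handful of counters that indicate throttling or NIC trouble. This is a hedged sketch: the counter names match the ENA driver output above, but nonzero_critical is a hypothetical helper, not an AWS-provided tool.

```python
import re

# Counters whose nonzero values indicate instance-level throttling or NIC
# trouble in the ENA driver's "ethtool -S" output.
CRITICAL = {
    "tx_timeout", "missing_intr", "missing_tx_cmpl",
    "bw_in_allowance_exceeded", "bw_out_allowance_exceeded",
    "pps_allowance_exceeded", "conntrack_allowance_exceeded",
}

def nonzero_critical(ethtool_output: str) -> dict:
    """Return {counter: value} for critical counters with nonzero values."""
    found = {}
    for line in ethtool_output.splitlines():
        m = re.match(r"\s*([\w.]+):\s*(\d+)\s*$", line)
        if m and m.group(1) in CRITICAL and int(m.group(2)) > 0:
            found[m.group(1)] = int(m.group(2))
    return found

sample = """\
tx_timeout: 0
missing_intr: 0
bw_out_allowance_exceeded: 0
pps_allowance_exceeded: 0
conntrack_allowance_exceeded: 0
"""
print(nonzero_critical(sample))  # -> {} : nothing to blame on the NIC
```

On the incident host every critical counter was zero, so a check like this returns an empty dict — which is exactly the "all clean" result that made the diagnosis hard.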
Interface Counters
The interface-level statistics told the same story:
# ip -s link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP
RX: bytes packets errors dropped missed mcast
4359848 12188 0 0 0 0
TX: bytes packets errors dropped carrier collsns
3347121 8855 0 0 0 0
Zero errors, zero drops, zero missed packets at the interface level. Whatever was happening to traffic wasn't being caught by the local NIC's counters.
Firewall and Routing
The routing table was textbook-simple: a default gateway through the VPC router, a local subnet route, and the metadata service endpoint. No stray routes, no blackholes, no policy routing complexity.
# ip route
default via 10.1.0.1 dev eth0
10.1.0.0/24 dev eth0 proto kernel scope link src 10.1.0.207
169.254.169.254 dev eth0
Firewall rules were minimal — just a single ipset-based blocklist with a default ACCEPT policy on all chains. No NAT rules. No nftables ruleset. Nothing that could explain blanket public-path degradation.
The Reboot Question
An important clarification came early: the problem started before any reboot. The reboot was attempted as a remediation, not a root cause. This was significant because it immediately deprioritized kernel driver regression, post-reboot NIC initialization failures, and DHCP lease problems — the usual suspects when trouble appears after a restart.
A full stop/start cycle — which, unlike a reboot, migrates the instance to a completely different physical hypervisor — was also performed. The problem persisted. Whatever was wrong wasn't tied to the underlying hardware host.
Key Distinction: Reboot vs. Stop/Start
An EC2 reboot restarts the operating system on the same physical host. A stop/start deallocates the instance entirely and relaunches it on a new hypervisor, giving you a new underlying server, new NUMA topology, and potentially a different rack. The fact that stop/start didn't help was a strong signal: this wasn't a host-level hardware or hypervisor problem.
The Decisive Clue: Private vs. Public Path
The breakthrough came from comparing SSH socket statistics between two simultaneous connections to the same server — one arriving over the private VPC path from a jump host, and one arriving over the public internet.
# ss -ti (SSH sockets, side by side)
# Public SSH session (from Mac via Vermontel ISP):
ESTAB 10.1.0.207:ssh ← 216.66.125.161:55080
cwnd:2 ssthresh:2 bytes_retrans:20212
bytes_acked:15157 retrans:0/25
send 1.12Mbps
# Private SSH session (from jump host within VPC):
ESTAB 10.1.0.207:ssh ← 10.1.1.21:58860
cwnd:20 ssthresh:20 bytes_retrans:0
bytes_acked:70577 retrans:0/0
send 2.09Gbps
Look at those numbers. The private path was running at full wire speed with a congestion window of 20, zero retransmissions, and sub-millisecond RTT. The public path had collapsed to a congestion window of 2, had 25 retransmissions on a single session, and was barely pushing a megabit. Same server, same kernel, same NIC, same moment in time. The server was healthy. The private network was healthy. Something between the AWS edge and the public internet was broken.
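The throughput collapse follows directly from TCP arithmetic: a connection can have at most one congestion window of data in flight per round trip, so throughput is bounded by cwnd × MSS / RTT. A quick back-of-envelope sketch (the 1448-byte MSS is an assumption for a standard 1500-byte internet path; the instance's 9001-byte MTU applies only inside the VPC):

```python
def cwnd_throughput_bps(cwnd_segments: int, mss_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput: one congestion window per round trip."""
    return cwnd_segments * mss_bytes * 8 / rtt_seconds

# Public path: cwnd collapsed to 2 segments, ~20 ms RTT (from the ping data).
public = cwnd_throughput_bps(2, 1448, 0.020)
print(f"{public / 1e6:.2f} Mbps")  # -> 1.16 Mbps
```

About 1.16 Mbps — almost exactly the 1.12 Mbps figure ss reported for the public session. The numbers weren't noise; they were TCP doing precisely what a cwnd of 2 forces it to do.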
The MTR Comparison That Sealed It
To confirm, we ran MTR tests from the same Mac client to two different EC2 instances — one to web3 (the problem host) and one to dev10 (a healthy host in the same region). Both tests traversed the same ISP path through Vermontel, the same upstream routers, and the same initial hops:
| Test Target | Final Hop Loss | Avg Latency | Verdict |
|---|---|---|---|
| dev10 (prod06.thewyz.net) | 0.0% | 19.7ms | CLEAN |
| web3 (web2.thewyz.net) | 14.0% | 17.3ms | DEGRADED |
Same client. Same ISP. Same upstream path through Vermontel. One AWS host was clean, the other was losing 14% of packets at the final hop. The MTR from within web3 to the client's public IP also showed dramatic latency spikes — hop 5, sitting between the AWS edge network (AS16509) and the client's ISP, averaged 363ms with spikes to 7,197ms:
# mtr from web3 to client (hop 5)
HOST: web3.thewyz.net          Loss%   Snt    Last     Avg   Best    Wrst   StDev
  5. AS???  206.82.104.8        0.0%   100  1026.0   363.1    6.8  7197.0   933.4
That hop — 206.82.104.8 — sat at the boundary between AWS's internal edge network and the transit path toward Vermontel. It was the inflection point where packets went from healthy to sick.
One More Confirmation
We also verified that web3 was reachable cleanly from dev10 inside AWS. That meant the instance itself, its VPC path, its security groups, and its internal networking were all fine. The problem was exclusively on the public-facing path — specifically, on the path associated with web3's Elastic IP.
Understanding the Invisible Layer: How EC2 Public IPs Actually Work
To understand why this happened and why the fix worked, you need to understand something that AWS doesn't heavily advertise: EC2 instances never actually have public IP addresses.
When you assign a public IPv4 address — whether it's an auto-assigned public IP or an Elastic IP — that address doesn't live on the instance's network interface. Run ifconfig or ip addr on an EC2 instance and you'll only see the private IP. The public address exists only as a NAT mapping maintained by the Internet Gateway (IGW) at the edge of AWS's network.
As AWS's own documentation states, public IPv4 addresses are "technically implemented as a network address translation mechanism at the edge of AWS's network." Here's the packet flow for every single public request hitting an EC2 instance:
Inbound: A packet arrives at AWS's edge network addressed to the EIP (e.g., 52.201.74.79). The IGW translates the destination from the EIP to the instance's private IP (10.1.0.207) and forwards it into the VPC. The instance sees only a packet addressed to its private IP.
Outbound: The instance sends a packet from its private IP. The VPC router forwards it to the IGW. The IGW translates the source from the private IP to the EIP and sends it out to the internet.
This 1:1 NAT mapping is maintained by the Internet Gateway — a managed, horizontally scaled AWS service that operates at the edge of the VPC. It's the invisible layer between your instance and the public internet. You never see it. You can't SSH into it. You can't reboot it. You can't even ping it. But every public packet to and from your instance passes through it.
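The translation itself is simple enough to model in a few lines. This toy sketch mirrors the documented 1:1 NAT behavior; the class and method names are invented for illustration, and the real IGW is a distributed managed service, not a single object.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    src: str
    dst: str

class InternetGateway:
    """Toy model of the IGW's 1:1 NAT: one EIP <-> private-IP mapping."""
    def __init__(self, eip: str, private_ip: str):
        self.eip, self.private_ip = eip, private_ip

    def inbound(self, pkt: Packet) -> Packet:
        # Internet -> VPC: rewrite the destination EIP to the private IP.
        assert pkt.dst == self.eip
        return Packet(src=pkt.src, dst=self.private_ip)

    def outbound(self, pkt: Packet) -> Packet:
        # VPC -> internet: rewrite the source private IP to the EIP.
        assert pkt.src == self.private_ip
        return Packet(src=self.eip, dst=pkt.dst)

igw = InternetGateway(eip="52.201.74.79", private_ip="10.1.0.207")
print(igw.inbound(Packet("216.66.125.161", "52.201.74.79")))
print(igw.outbound(Packet("10.1.0.207", "216.66.125.161")))
```

The key takeaway from the model: the instance only ever sees packets addressed to or from 10.1.0.207. Everything involving 52.201.74.79 happens at the edge, out of the instance's sight.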
The Fix: 30 Seconds, Two Commands
With the diagnosis pointing squarely at the edge-layer mapping rather than the instance, we tried the simplest possible intervention: disassociating the Elastic IP from the instance and immediately re-associating it.
# Step 1: Get current association details
$ aws ec2 describe-addresses --public-ips 52.201.74.79
# Step 2: Disassociate the EIP
$ aws ec2 disassociate-address \
--association-id eipassoc-xxxxxxxx
# Step 3: Re-associate the same EIP to the same instance
$ aws ec2 associate-address \
--allocation-id eipalloc-xxxxxxxx \
--instance-id i-xxxxxxxx
That was it. Same EIP. Same instance. No DNS change. No server migration. No configuration change. The entire operation took less than 30 seconds.
Immediately after re-association, the public path was clean. Pings returned to 20ms with zero loss. SSH was instantaneous. HTTP traffic flowed normally. The production outage was over.
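The same fix is easy to script into a runbook. This sketch uses the boto3-style EC2 calls that mirror the three CLI commands above; the FakeEC2 stand-in (with invented IDs) exists only so the example runs without AWS credentials, and the sketch has not been exercised against live AWS.

```python
def remap_eip(ec2, public_ip: str) -> None:
    """Tear down and rebuild an EIP association to force a fresh edge mapping.

    `ec2` is any client exposing the three boto3-style EC2 calls below;
    a real boto3.client("ec2") matches this shape.
    """
    addr = ec2.describe_addresses(PublicIps=[public_ip])["Addresses"][0]
    ec2.disassociate_address(AssociationId=addr["AssociationId"])
    ec2.associate_address(
        AllocationId=addr["AllocationId"],
        InstanceId=addr["InstanceId"],
    )

class FakeEC2:
    """Stand-in for boto3.client('ec2') so the sketch runs without AWS."""
    def __init__(self):
        self.calls = []
    def describe_addresses(self, PublicIps):
        self.calls.append("describe")
        return {"Addresses": [{"AssociationId": "eipassoc-123",
                               "AllocationId": "eipalloc-456",
                               "InstanceId": "i-789"}]}
    def disassociate_address(self, AssociationId):
        self.calls.append(("disassociate", AssociationId))
    def associate_address(self, AllocationId, InstanceId):
        self.calls.append(("associate", AllocationId, InstanceId))

fake = FakeEC2()
remap_eip(fake, "52.201.74.79")
print(fake.calls)
```

Note the brief window between disassociate and associate during which the instance has no public address at all — harmless when the public path is already down, but worth knowing before running this against a half-working endpoint.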
What Actually Went Wrong: A Technical Speculation
AWS does not publish the internal architecture of the Internet Gateway or its EIP NAT subsystem in detail. What follows is informed speculation based on publicly available information about AWS networking architecture, observed behavior, and general principles of large-scale NAT and edge networking systems.
The IGW as a Distributed NAT Fabric
AWS's Internet Gateway is not a single device. It's a horizontally scaled, distributed service that operates at the edge of each VPC. When an EIP is associated with an instance, the IGW creates an internal mapping record that ties the public address to the instance's private address and ENI. This mapping determines not just the address translation, but also the physical path that packets take through AWS's edge infrastructure.
AWS's edge network connects to the public internet through a mesh of peering points, transit agreements, and edge routers across each region. The AS16509 hops visible in traceroute output represent this edge infrastructure. Different EIPs — even in the same region and AZ — may be mapped to different physical edge nodes based on load balancing, IP range assignments, and internal topology decisions.
Hypothesis 1: Stale or Wedged Edge-Node Mapping
The most likely explanation is that the EIP's association had become bound to a specific edge node or NAT processing path that was experiencing degradation. This could happen through several mechanisms.
Large-scale NAT systems often maintain persistent forwarding state for each mapping. This state includes not just the address translation rule, but also the specific forwarding path — which edge router, which line card, which interface. If the underlying node experiences a partial failure (think: a single line card dropping packets intermittently, or a buffer overflow in a specific forwarding ASIC), the NAT mapping would continue to send traffic through the degraded path because the mapping itself was still "valid."
Disassociating and re-associating the EIP forced the IGW to tear down the existing mapping and create a new one from scratch. The new mapping was assigned to a different (healthy) edge path, and traffic immediately recovered.
Hypothesis 2: Asymmetric Path Degradation
The MTR data showed different behavior depending on direction and source. Traffic from dev10 (another AWS host) to web3 was clean — because that traffic never leaves the AWS fabric. Traffic from web3 to the client showed massive latency spikes at the edge boundary. This pattern is consistent with a specific outbound edge path being degraded.
In large BGP-based routing fabrics, the outbound path (AWS → internet) and the inbound path (internet → AWS) are often asymmetric. AWS's edge routers select outbound paths based on BGP best-path calculations, local preference settings, and traffic engineering policies. An EIP mapped to a particular edge node would have its outbound traffic follow that node's BGP-selected path. If that specific path was congested or partially failed, all traffic through that mapping would suffer — while other EIPs mapped to different edge nodes would be unaffected.
This perfectly explains why dev10 (different EIP, different edge mapping) was clean from the same client, while web3 was degraded.
Hypothesis 3: AWS Internal Maintenance or Micro-Outage
AWS operates a massive edge network that peers with thousands of ISPs and transit providers. Within this infrastructure, maintenance events — BGP session resets, line card replacements, firmware updates, fiber cuts — happen continuously. Most are invisible because traffic is rerouted seamlessly.
However, if an EIP's NAT mapping was pinned to a specific edge path during a micro-outage, and the IGW's internal health-checking didn't detect the partial degradation (perhaps because the node was still forwarding some packets, just with high loss), the mapping could remain stuck on the bad path indefinitely. The stop/start didn't help because it moves the instance to a new hypervisor — it doesn't remap the EIP's edge path. Only disassociating and re-associating the EIP forced the edge-layer remapping.
Why Stop/Start Didn't Fix It
This is the crucial architectural point. When you stop and start an EC2 instance, several things change: the underlying physical host, the hypervisor slot, and potentially the rack. But the EIP association is maintained transparently across stop/start cycles — that's the entire point of Elastic IPs. AWS preserves the mapping so your public endpoint remains stable.
The problem is that "preserving the mapping" likely also preserves the edge-layer forwarding state. The IGW doesn't rebuild the NAT mapping from scratch during a stop/start — it maintains the existing mapping and simply updates the internal private-IP target when the instance comes back on a new host. The edge path stays the same. The degraded forwarding path stays the same.
Only explicitly breaking and recreating the EIP association forces the IGW to fully tear down and rebuild the mapping — including the edge forwarding path selection.
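If the speculation above is right, the two operations differ in exactly one way, which this deliberately simplistic toy model captures. Everything here is invented for illustration — edge-path selection in particular — since nothing about the IGW's real internals is public.

```python
class EipMapping:
    """Toy model: the mapping pins both the NAT target and an edge path."""
    def __init__(self, private_ip: str, edge_path: str):
        self.private_ip = private_ip
        self.edge_path = edge_path  # speculative: which edge node carries traffic

    def stop_start(self, new_private_ip: str):
        # Stop/start: AWS preserves the association; only the NAT target
        # can change. The edge path is deliberately left untouched.
        self.private_ip = new_private_ip

def reassociate(private_ip: str, pick_edge_path) -> EipMapping:
    # Disassociate + associate: the mapping is rebuilt from scratch,
    # including a fresh edge-path selection.
    return EipMapping(private_ip, pick_edge_path())

m = EipMapping("10.1.0.207", edge_path="degraded-node")
m.stop_start("10.1.0.207")
print(m.edge_path)  # -> degraded-node (same bad path after migration)

m = reassociate("10.1.0.207", lambda: "healthy-node")
print(m.edge_path)  # -> healthy-node (fresh mapping, fresh path)
```

In this model, stop/start mutates the existing mapping while reassociation constructs a new one — which is exactly the behavioral difference observed during the incident.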
The Car Analogy
Imagine you're driving to work every day using GPS navigation. One day, a bridge on your usual route develops a dangerous pothole that causes intermittent tire damage. Your GPS keeps routing you over that bridge because the bridge is technically "open." Buying a new car (stop/start = new hypervisor) doesn't help — the GPS still picks the same route. Even moving to a different house on the same street (instance resize) doesn't help. The fix is to close and reopen the GPS app (disassociate/re-associate the EIP), forcing it to recalculate the route from scratch and pick a different bridge.
The Diagnostic Trail: Why Each Test Mattered
What made this incident challenging was that every standard diagnostic returned clean results. Here's a summary of what each test told us — and, critically, what it didn't tell us:
| Diagnostic | Result | What It Proved |
|---|---|---|
| ethtool -S eth0 | ALL ZEROS | ENA driver and NIC hardware are healthy |
| ip -s link show | NO ERRORS | Interface is passing packets cleanly at local level |
| ip route | NORMAL | No routing anomalies inside the instance |
| iptables -L -n -v | MINIMAL | No firewall rules blocking or degrading traffic |
| Private SSH (jump host) | 2.09 Gbps | Instance internals, VPC networking, private path perfect |
| Public SSH (Mac/ISP) | 1.12 Mbps | Public path is severely degraded |
| MTR to dev10 (good host) | 0% LOSS | ISP path to AWS is healthy. Problem is host-specific |
| MTR to web3 (bad host) | 14% LOSS | Something specific to web3's public endpoint is broken |
| dev10 → web3 (AWS internal) | CLEAN | Problem is not on the instance. It's on the public edge path |
| Stop/start (hypervisor migration) | NO CHANGE | Problem is not hardware. EIP mapping preserved bad path |
| EIP disassociate/re-associate | FIXED | Problem was in the EIP's edge-layer forwarding state |
The TCP Evidence That Tells the Whole Story
The netstat -s output captured during the incident provides a TCP-level view of the damage. These counters represent cumulative pain across all connections on the instance:
| Counter | Value | Significance |
|---|---|---|
| Segments retransmitted | 444 | Substantial retransmission load for a lightly-trafficked host |
| TCPLostRetransmit | 215 | Retransmitted segments themselves lost — double loss |
| Fast retransmits | 79 | TCP detected loss via duplicate ACKs, not just timeouts |
| TCPSackRecoveryFail | 35 | SACK-based recovery couldn't fix the loss |
| IpOutNoRoutes | 61 | Some packets had no route — possibly edge-layer churn |
The TCPLostRetransmit counter at 215 is particularly telling. This means the kernel retransmitted a segment, and the retransmission itself was lost. That only happens with sustained, non-trivial packet loss — exactly what you'd expect from a degraded forwarding path at the edge layer. The SACK recovery failures (35 events) compound this: even TCP's most sophisticated loss-recovery mechanism (Selective Acknowledgment) was unable to recover gracefully because the underlying path was continuously dropping packets.
The per-socket state on the degraded public SSH connection showed the TCP congestion control algorithm had given up trying to grow the window. The cwnd:2 and ssthresh:2 values mean TCP's congestion window had collapsed to its minimum — the connection was operating in permanent slow-start-like behavior, unable to sustain throughput because every attempt to open the window was met with more loss.
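The classic Mathis model predicts the same collapse from the loss rate alone: steady-state TCP throughput is roughly (MSS/RTT) × C/√p, where p is the loss probability and C ≈ √(3/2). Plugging in the observed numbers is a hedged back-of-envelope estimate, not a measurement:

```python
import math

def mathis_bps(mss_bytes: float, rtt_s: float, loss: float,
               C: float = math.sqrt(1.5)) -> float:
    """Mathis et al. steady-state TCP throughput bound: (MSS/RTT) * C/sqrt(p)."""
    return (mss_bytes * 8 / rtt_s) * C / math.sqrt(loss)

# Observed values: ~14% loss (MTR), ~20 ms RTT (ping), assumed 1448-byte MSS.
bw = mathis_bps(1448, 0.020, 0.14)
print(f"{bw / 1e6:.1f} Mbps")  # -> 1.9 Mbps
```

About 1.9 Mbps — the same order of magnitude as the 1.12 Mbps the degraded session actually achieved. A path losing one packet in seven simply cannot carry meaningful TCP throughput, no matter how healthy the endpoints are.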
Why This Diagnosis Was So Hard
This incident was tricky because it violated several standard assumptions that guide network troubleshooting:
Assumption: if the NIC is healthy, the network is healthy. Not true. The NIC only sees packets after the edge-layer NAT. A degraded edge path drops or delays packets before they ever reach the NIC on the inbound side, and after they leave the NIC on the outbound side. The NIC's counters will be spotless even as the public path bleeds packets.
Assumption: a stop/start fixes host-level problems. It does — for hypervisor, hardware, and NIC problems. It does not reset the EIP's edge-layer forwarding state. The EIP association is maintained across stop/start cycles by design.
Assumption: if the problem isn't the ISP, it must be the instance. Not necessarily. The IGW's edge-layer NAT is a third party in the conversation — neither the ISP nor the instance. It's an invisible, unmonitorable intermediary that you can't SSH into, can't traceroute through, and can't inspect with any standard tool.
Assumption: if another host works from the same client, the problem is on the failing instance. Close, but not quite. It could also be on the failing instance's EIP mapping — a distinction that matters enormously for selecting the right fix.
Broader Lessons for AWS Operators
Add EIP Reassociation to Your Troubleshooting Playbook
Most AWS troubleshooting guides for EC2 networking focus on security groups, NACLs, route tables, ENA driver health, and instance-level firewalls. Almost none mention EIP disassociation and re-association as a diagnostic or remediation step. Based on this incident, it should be among the first things you try when you see public-path-specific degradation with clean private-path behavior. It takes 30 seconds and has no downside when the public path is already broken.
Always Compare Private and Public Paths
The single most valuable diagnostic in this incident was the side-by-side ss -ti comparison of a private-path SSH socket and a public-path SSH socket. If you have a jump host or bastion in the same VPC, use it. Compare congestion windows, retransmission counts, and throughput. If the private path is perfect and the public path is degraded, you know the problem is above the instance — somewhere in the edge/IGW/transit layer.
Test From Multiple External Paths
This incident would have been resolved faster if we had initially tested from a second ISP path (a cellular hotspot, a VPN endpoint, or a remote colleague). Confirming that the problem was specific to one EIP's edge path — rather than a general AWS issue or a general ISP issue — would have pointed directly at EIP reassociation as the fix.
Don't Migrate When You Can Remap
The initial plan was a full server migration from Amazon Linux 2 to Rocky Linux 10 — a multi-hour project under production outage pressure. That migration is still strategically correct (AL2 reaches end of support on June 30, 2026), but doing it as an emergency response to a networking incident would have been unnecessarily risky. The actual fix took 30 seconds. The migration can now happen on a scheduled maintenance window with proper testing and validation.
References and Further Reading
- AWS VPC NAT Gateways Documentation — How NAT gateways perform source-NAT and how the IGW maps private addresses to Elastic IPs at the edge.
- AWS re:Post — EIP NAT at the Edge — Confirms that public IPv4 addresses are "technically implemented as a network address translation mechanism at the edge of AWS's network."
- AWS Architecture Blog — Internet Routing and Traffic Engineering — Deep dive into how AWS's edge network uses BGP to route traffic.
- AWS Knowledge Center — Troubleshoot VPC to On-Premises Over IGW — Using MTR, tcpdump, and bidirectional traceroutes.
- AWS ENA Driver Troubleshooting — Diagnosing ENA-level issues.
- AWS Network Peering Policy — BGP peering requirements and edge network operations.
- AWS Enhanced Networking with ENA — ENA capabilities and ethtool -S counters.
- AWS Knowledge Center — Fix Connection with Elastic IP — AWS's troubleshooting guide for EIP connection issues.
Timeline
| Time (ET) | Event |
|---|---|
| ~03:00 | Symptoms first noticed — intermittent SSH, ICMP loss |
| ~03:30 | Reboot attempted — no improvement |
| ~04:00 | Full stop/resize/start cycle — no improvement |
| ~04:30 | ENA, interface, firewall, routing diagnostics — all clean |
| ~05:00 | TCP socket comparison reveals private-path healthy, public-path degraded |
| ~05:05 | MTR comparison: dev10 clean, web3 degraded from same client |
| ~05:10 | Jump host confirms private access to web3 is perfect |
| ~05:15 | dev10 → web3 confirmed clean (AWS internal path healthy) |
| ~05:16 | EIP disassociated and re-associated |
| ~05:17 | SERVICE RESTORED — all public paths clean |
Total time from first symptom to resolution: approximately 2 hours 17 minutes. Time spent on the actual fix: approximately 30 seconds.
Final Thought
The lesson of this incident isn't "EIPs are unreliable." They're not — this was a rare edge case, probably a one-in-a-million interaction between a specific EIP mapping and a specific edge node state. The lesson is that AWS's abstraction layers are deep, and when something goes wrong in a layer you can't see, the symptoms can be profoundly confusing. Adding EIP reassociation to your mental toolkit — right alongside "have you tried turning it off and on again" — could save you hours of misdiagnosis on a day when hours matter.
Published March 29, 2026 · Written during a live production incident · No servers were harmed in the writing of this post (one was fixed)