Issue: asymmetric bmx6 tunnels break PMTUD - lime-dev

23 Oct 2013

Hey axel!
so, we've been scratching our heads today, facing some MTU issues on our 
shiny new libre-mesh network,
and after a few hours debugging, tcpdumping and discussing, we came to 
these conclusions:

* the invented src-addr in the bmxOut tunnel makes it impossible for 
hops along the path to return "Packet Too Big" to the originating node. 
So, if a particular link has a smaller MTU than the first link (say, a 
VPN is involved), the packet will be silently dropped.

     A==1500==B---1400---C===1500===D

A wants to send a packet to a network announced by D, so it creates a 
tunnel with D as destination and a "D-derived" fake address as src, 
which matches the bmxIn "catchall" tunnel peer-addr in D.
It then sends a 1460 + 40 = 1500 byte packet through that tunnel.
B cannot push that packet to C, so it tries to send back an ICMPv6 PTB...
but the src-addr it finds in the encapsulation is not A, but instead the 
"D-derived" fake addr. So the ICMPv6 PTB is lost and A can never find 
out about the smaller MTU.
This fundamentally breaks PMTUD, which is a bad idea in IPv6: routers 
never fragment in transit, so the sender depends on those PTB messages.
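
To make the blackhole concrete, here is a rough Python sketch of the 
diagram above. It is only a toy model, not bmx6 code: the names 
(OuterPacket, send, sender_learns_mtu) and the "D-derived-fake" label 
are made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels, not real bmx6 identifiers.
REAL_A = "2001:db8::A"
REAL_D = "2001:db8::D"
FAKE_SRC = "D-derived-fake"

LINK_MTUS = [1500, 1400, 1500]  # A==B, B--C, C==D from the diagram


@dataclass
class OuterPacket:
    src: str   # source address of the encapsulating IPv6 header
    dst: str
    size: int  # inner packet + 40 bytes of outer IPv6 header


def send(packet, link_mtus):
    """Return (delivered, ptb_target, ptb_mtu) for a packet crossing the links in order."""
    for mtu in link_mtus:
        if packet.size > mtu:
            # The router drops the packet and addresses the PTB to the outer src.
            return False, packet.src, mtu
    return True, None, None


def sender_learns_mtu(packet, link_mtus, sender=REAL_A):
    """True only if the packet was dropped AND the PTB is addressed to the real sender."""
    delivered, ptb_target, _ = send(packet, link_mtus)
    return (not delivered) and ptb_target == sender
```

With the fake src, the PTB from B is addressed to "D-derived-fake", so 
sender_learns_mtu() is False; with 2001:db8::A as outer src it is True.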

to avoid this there are 3 options:
* set mtu=1280 on every bmxOut tunnel (yuck! :( )
* probe before establishing each tunnel, with the real src-addr so that 
PMTUD can happen correctly, until the probe reaches the desired endpoint 
node. Then, use the discovered PMTU for the new tunnel. (*downside*: this 
will only work as long as the path doesn't change to pass through some 
thinner link. In that case, PMTU will not be rediscovered, and packets 
will be dropped again.)

* use the current bmxIn "catchall" tunnels only for sending special bmx6 
control packets, that ask for a symmetric tunnel.
i.e.
1) A sends to D (2001:db8::D) a packet (encapsulated with a fake 
src-addr, to be caught by the catchall @ D) with content "I'm A and 
this is my real address 2001:db8::A; please make a dedicated tunnel for me"
2) D gets that packet and creates a tunnel with "peer 2001:db8::A", then 
sends back an ACK to A, again using "A-derived" fake-addr as src
3) A gets the ACK and creates the tunnel with "peer 2001:db8::D"
*** now both ends have a symmetric tunnel between them, with real src 
and dest address ***
4) A finally sends the real payload through the symmetric tunnel. This 
payload (which may be bigger than 1280, say... 1450) will be encapsulated 
with the real src-addr of A, so if any node in the path needs to send 
back a Packet-too-big, it will be able to, and PMTUD will happen correctly.

(at the cost of a full RTT of latency before the first payload packet, 
but with a reasonable tunnel expiry time like the current one, that 
shouldn't be terrible)
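
Steps 1) to 4) could be sketched roughly like this in Python. This is 
just a model of the handshake under my own assumptions: the message 
types "TUN_REQ" / "TUN_ACK", the Node class and the fake_src_for() helper 
are all hypothetical and don't exist in bmx6.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    addr: str                                  # real address, e.g. 2001:db8::A
    tunnels: set = field(default_factory=set)  # peers with a dedicated symmetric tunnel

    def fake_src_for(self, peer_addr):
        # Stand-in for the "peer-derived" fake src matched by the peer's catchall.
        return f"derived-from({peer_addr})"

    def make_request(self, dst_addr):
        # Step 1: control packet via the catchall, carrying our real address.
        return {"type": "TUN_REQ", "outer_src": self.fake_src_for(dst_addr),
                "outer_dst": dst_addr, "real_addr": self.addr}

    def handle_request(self, msg):
        # Step 2: create the dedicated tunnel, then ACK via the requester's catchall.
        self.tunnels.add(msg["real_addr"])
        return {"type": "TUN_ACK", "outer_src": self.fake_src_for(msg["real_addr"]),
                "outer_dst": msg["real_addr"], "real_addr": self.addr}

    def handle_ack(self, msg):
        # Step 3: the requester creates its end of the symmetric tunnel.
        self.tunnels.add(msg["real_addr"])

    def outer_src_for(self, dst_addr):
        # Step 4: payload uses the real src only once the symmetric tunnel exists.
        return self.addr if dst_addr in self.tunnels else self.fake_src_for(dst_addr)
```

Usage: a = Node("2001:db8::A"); d = Node("2001:db8::D"); then 
a.handle_ack(d.handle_request(a.make_request(d.addr))). After that, 
a.outer_src_for(d.addr) is A's real address, so a PTB can find its way back.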

back in April, i remember we discussed the idea of symmetric tunnels, 
and you brought up this "control connection" idea which i'm simply 
redescribing here.
But that discussion was in another context, more like a 'feature', and 
in the end it didn't really solve the problem we had originally.
This PMTUD issue was not taken into account back then, AFAIR, so the 
"symmetric tunnel" idea now becomes more like a bugfix (i.e. don't 
create PMTUD blackholes)

what do you think?

Cheers!!

gui