Closed Bug 1439231 Opened 4 years ago Closed 4 years ago

Firefox TRR Mode 3 semi-reliably crashes Mac

Categories

(Core :: Networking, defect, P1)

RESOLVED FIXED

People

(Reporter: ekr, Assigned: bagder)

Details

(Whiteboard: [necko-triaged][trr])

Repro

1. Turn on TRR mode 3 (set network.trr.mode to 3)
2. Restart Firefox
3. Type something in the URL bar

Results:
Crash with the following stack trace


*** Panic Report ***
panic(cpu 3 caller 0xffffff800ec0f8af): assertion failed: inp->inp_flowhash != 0, file: /BuildRoot/Library/Caches/com.apple.xbs/Sources/xnu/xnu-4570.41.2/bsd/netinet/tcp_output.c, line: 1860
Backtrace (CPU 3), Frame : Return Address
0xffffff9225bcb800 : 0xffffff800e84f606 
0xffffff9225bcb850 : 0xffffff800e97c654 
0xffffff9225bcb890 : 0xffffff800e96e149 
0xffffff9225bcb910 : 0xffffff800e801120 
0xffffff9225bcb930 : 0xffffff800e84f03c 
0xffffff9225bcba60 : 0xffffff800e84edbc 
0xffffff9225bcbac0 : 0xffffff800ec0f8af 
0xffffff9225bcbc60 : 0xffffff800ec1cbb4 
0xffffff9225bcbcc0 : 0xffffff800ed722bc 
0xffffff9225bcbde0 : 0xffffff800ed82b93 
0xffffff9225bcbed0 : 0xffffff800ed82891 
0xffffff9225bcbf40 : 0xffffff800edfa978 
0xffffff9225bcbfa0 : 0xffffff800e801906 

BSD process name corresponding to current thread: firefox

Mac OS version:
17D47
Obviously this is a macOS defect, but given that this seems to be the only thing in Firefox that triggers it, we should probably try to figure out how to avert it.
FWIW, Google shows a similar problem related to a VPN, but that's the only search result for that assertion.
Assignee: nobody → daniel
Blocks: 1434852
Priority: -- → P1
Whiteboard: [necko-triaged][trr]
Do you have any further details on the stack with some Firefox code? I mean, which particular function/method in Firefox triggers this problem?
No. Because the entire operating system crashes, no stack gets gathered.
At least none I was able to find.
Not that it helps much, but it looks like it is this assert: https://github.com/apple/darwin-xnu/blob/master/bsd/netinet/tcp_output.c#L1860
Ack, I could easily reproduce this on my Mac (a Mac mini fully updated with the latest macOS version). I also repeatedly get the *exact* same backtrace as shown in the original post.

When this crash happens, the machine reboots instantly so it is really hard to figure out exactly what Firefox did to trigger it:

1. Regular MOZ_LOG logging of "nsHostResolver" to a file isn't flushed often enough, so the file ends up empty after the reboot.
2. Running 'mach run --debug' doesn't help because gdb won't catch any problem before the reboot.
I also tried using "sync" with MOZ_LOG, but the last lines it caught were not really helpful:

[838:DNS Resolver #1]: D/nsHostResolver CompleteLookup: [dns host name] has [IP]
[838:DNS Resolver #1]: D/nsHostResolver nsHostResolver record 0x10344f680 calling back dns users
[838:Socket Thread]: D/nsHostResolver Checking blacklist for host [dns host name], host record [0x10344f680].
(repeated a few times)
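
For reference, here is a minimal Python sketch of what "log lines that survive an instant reboot" requires in general. It is not how MOZ_LOG or its "sync" option is implemented. The helper names (open_log, durable_append, demo) and the file name are made up for illustration. The idea is to force every line to stable storage before continuing. On macOS a plain fsync() may leave data in the drive's cache, so the sketch tries F_FULLFSYNC first where the platform provides it.

```python
import fcntl
import os
import tempfile


def open_log(path: str) -> int:
    """Open (or create) a log file for appending and return its descriptor."""
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)


def durable_append(fd: int, line: str) -> None:
    """Append one line and push it to stable storage before returning."""
    os.write(fd, (line.rstrip("\n") + "\n").encode("utf-8"))
    # F_FULLFSYNC only exists on macOS; elsewhere getattr() yields None.
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return
        except OSError:
            pass  # filesystem does not support it; fall back to fsync()
    os.fsync(fd)


def demo() -> str:
    """Write two lines to a scratch file (hypothetical name) and return its path."""
    path = os.path.join(tempfile.mkdtemp(), "trr-debug.log")
    fd = open_log(path)
    try:
        durable_append(fd, "about to connect")
        durable_append(fd, "connect returned")
    finally:
        os.close(fd)
    return path
```

Syncing after every line is slow, so this only makes sense for a short debugging session like the one described here.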
(Apple at least looked at my bug report yesterday)

"Engineering has determined that your bug report (37706926) is a duplicate of 34406902 and will be closed."
daniel and I had a good meeting on this bug today.

1 - daniel has observed through testing that it is linked to AAAA lookups. If we tweak TRR to not look up AAAA, the crash appears to be gone (though the testing isn't extensive).

2 - we observe that both daniel and ekr have no e2e v6. daniel has a link local v6 address.

3 - logging with TRR off (mode 0) shows no v6 is returned to nsHostResolver from the system resolver. This is a bit different from other platforms, e.g. Linux sees v6 addresses here and then has to fall back when it cannot use them.

4 - hypothesis: the system resolver normally filters v6 addresses when v6 is not confirmed to be working. When TRR bypasses the system resolver and tries to use addresses for which there is no connectivity (maybe no route?), the kernel panics.

5 - todo - daniel to have TRR honor the disable-v6 gecko pref. This is not very useful (it's normally set to enabled independent of the connectivity).

6 - todo - daniel to add a pref to disable aaaa only on mac as a temporary workaround

7 - todo - patrick to ask around and see if there is a way to query when v6 would be returned by the system resolver

8 - todo - daniel to consider feasibility to add a v6 probe for aaaa example.com using the system resolver to determine when to bypass 6. don't do it in trr-only mode.
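
To make the hypothesis in item 4 concrete, here is a rough Python sketch of the filtering idea: drop AAAA answers when the host has no usable v6 route. This is an illustration only, not Gecko code and not what the system resolver is known to do. The function names are hypothetical, and the probe address is a documentation-range placeholder (2001:db8::/32). The route check relies on connect() on a UDP socket, which only picks a route and source address and sends no packets.

```python
import ipaddress
import socket


def is_usable_v6(addr: str) -> bool:
    """True for an IPv6 address that is usable beyond the local link."""
    ip = ipaddress.ip_address(addr.split("%", 1)[0])  # strip any %scope suffix
    return ip.version == 6 and not (
        ip.is_link_local or ip.is_loopback or ip.is_unspecified
    )


def has_v6_route(probe_addr: str = "2001:db8::1", port: int = 53) -> bool:
    """Ask the kernel whether it would route traffic to a global v6 address."""
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as s:
            s.connect((probe_addr, port))  # no packet is sent for UDP
            return is_usable_v6(s.getsockname()[0])
    except OSError:
        return False  # no v6 support or no route


def filter_addresses(addrs, v6_ok: bool):
    """Keep all answers when v6 works; otherwise keep only IPv4 answers."""
    if v6_ok:
        return list(addrs)
    return [a for a in addrs if ipaddress.ip_address(a).version == 4]
```

A caller would do something like filter_addresses(answers, has_v6_route()) before handing addresses to the connection code. A host with only a link-local v6 address, as in item 2, would end up with IPv4 answers only.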
It has been suggested that this might be related to a known TFO (TCP Fast Open) bug in macOS that is believed to be fixed in the 10.13.4 beta.
Depends on: 1444453
Confirmed! It is certainly TFO related. When I switch off "network.tcp.tcp_fastopen_enable", I can leave the AAAA code in there, and I've been able to click around and open and close many tabs to an extent that was not previously possible on the Mac.
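
For anyone repeating this test, both prefs can be pinned in a user.js file in the profile directory instead of being toggled by hand in about:config. The small Python sketch below renders the lines. The helper name render_user_pref is made up for illustration, and network.trr.mode is assumed to be the pref behind "mode 3" in the repro steps.

```python
import json


def render_user_pref(name: str, value) -> str:
    """Format one pref as a user.js line.

    json.dumps produces valid JavaScript literals for bool, int, and str.
    """
    return f"user_pref({json.dumps(name)}, {json.dumps(value)});"


def render_workaround() -> str:
    """Lines for the configuration described above: TRR mode 3, TFO off."""
    prefs = [
        ("network.trr.mode", 3),
        ("network.tcp.tcp_fastopen_enable", False),
    ]
    return "\n".join(render_user_pref(n, v) for n, v in prefs) + "\n"
```

Appending the output of render_workaround() to the profile's user.js applies both settings on the next start.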
Bug 1444453 has landed, which indirectly fixes this one as well.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED