My Nightly instance where the issue happened crashed for no apparent reason. Well before that happened, I recorded syscalls by launching [Instruments from Xcode](https://help.apple.com/instruments/mac/current/#/devc1724975) and selecting the existing Nightly process.

In my test case, I disabled automatic HTTPS requests (see comment 12), and loaded `http://httpbin.org/?httponly`. I chose this domain because it responds with plain HTTP, without forcing HTTPS upgrades. Also, it only has IPv4 addresses, to rule out IPv6 issues.

I'm observing the following:

- I confirmed that no other threads interact with the fd returned by `socket()` for this request. The remaining calls/logs in this list are all on the **Socket Thread**.
- After the socket is created, `connect()` is called on it and we start polling via `PR_Poll`, which ends up calling [`ssl_Poll`, which ultimately triggers the `getpeername` syscall](https://searchfox.org/mozilla-central/rev/6e2b186c296474e032d9ae2e000b7c870396775c/security/nss/lib/ssl/sslsock.c#3407) that appears in the syscall trace. This tells us that the socket is not connected (this is normal; at this point the socket is new and we want to wait until it is connected).
- The next syscall after `getpeername` is `semaphore_signal_trap`, triggered via some Glean code, rooted in [an `OnSocketReady` call immediately after the `PR_ConnectContinue` call returned](https://searchfox.org/mozilla-central/rev/6e2b186c296474e032d9ae2e000b7c870396775c/netwerk/base/nsSocketTransport2.cpp#2183-2185).
  - Note that this `PR_ConnectContinue` was already mentioned in comment 13. From the log analysis in comment 13, it is obvious that the internal state is already bad here.
- The next syscall is `sys_ulock_wait2`, from a `malloc`, indirectly called via `nsSocketTransportService::DetachSocket`.
  - [`nsSocketTransportService::DetachSocket` logs](https://searchfox.org/mozilla-central/rev/6e2b186c296474e032d9ae2e000b7c870396775c/netwerk/base/nsSocketTransportService2.cpp#433-436) `nsSocketTransportService::DetachSocket [handler=`. This log appears in the initial report and in the other logs and profiles that I have shared in this bug so far, always after `ErrorAccordingToNSPR [in=-5999 out=80004005]`.
- The next syscall (ignoring syscalls related to memory management rooted in `DetachSocket`) is `sys_close`, which is passed the fd.
  - This happens [immediately after the log call above](https://searchfox.org/mozilla-central/rev/6e2b186c296474e032d9ae2e000b7c870396775c/netwerk/base/nsSocketTransportService2.cpp#440-443) (after the `nsSocketTransport::OnSocketDetached` log entry).

So, based on these observations, no syscall result told Firefox that the fd was bad, yet Firefox somehow concluded otherwise.
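
To illustrate why a `getpeername` call on a fresh socket is harmless, here is a minimal sketch of that kind of probe using plain POSIX sockets from Python. It is not the NSS or NSPR code; the helper name `is_connected` and the loopback server standing in for the remote host are assumptions made for the example.

```python
import errno
import socket


def is_connected(sock: socket.socket) -> bool:
    """Return True if getpeername() succeeds, i.e. the socket has a peer."""
    try:
        sock.getpeername()
        return True
    except OSError as exc:
        # ENOTCONN is the expected answer for a socket that has no peer yet;
        # any other errno is re-raised as a genuine error.
        if exc.errno == errno.ENOTCONN:
            return False
        raise


# A listening socket on loopback stands in for the remote server (assumption).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
before = is_connected(client)          # probed before any connect()
client.connect(server.getsockname())   # blocking connect, for simplicity
after = is_connected(client)           # probed once the connection exists

client.close()
server.close()
```

An `ENOTCONN` result here only means "not connected yet". It says nothing about the fd being invalid, which matches the reading of the trace above.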
Bug 1980171 Comment 14 Edit History
My Nightly instance where the issue happened crashed for no apparent reason. Well before that happened, I recorded syscalls by launching [Instruments from Xcode](https://help.apple.com/instruments/mac/current/#/devc1724975) and selecting the existing Nightly process.

In my test case, I disabled automatic HTTPS requests (see comment 12), and loaded `http://httpbin.org/?httponly`. I chose this domain because it responds with plain HTTP, without forcing HTTPS upgrades. Also, it only has IPv4 addresses, to rule out IPv6 issues.

I'm observing the following:

- I confirmed that no other threads interact with the fd returned by `socket()` for this request. The remaining calls/logs in this list are all on the **Socket Thread**.
- After the socket is created, `connect()` is called on it and we start polling via `PR_Poll`, which ends up calling [`ssl_Poll`, which ultimately triggers the `getpeername` syscall](https://searchfox.org/mozilla-central/rev/6e2b186c296474e032d9ae2e000b7c870396775c/security/nss/lib/ssl/sslsock.c#3407) that appears in the syscall trace. This tells us that the socket is not connected (this is normal; at this point the socket is new and we want to wait until it is connected).
- The next syscall is `select`, still from `PR_Poll`. This blocks the thread until `select` returns.
- The next syscall is `semaphore_signal_trap`, triggered via some Glean code, rooted in [an `OnSocketReady` call immediately after the `PR_ConnectContinue` call returned](https://searchfox.org/mozilla-central/rev/6e2b186c296474e032d9ae2e000b7c870396775c/netwerk/base/nsSocketTransport2.cpp#2183-2185).
  - Note that this `PR_ConnectContinue` was already mentioned in comment 13. From the log analysis in comment 13, it is obvious that the internal state is already bad here.
- The next syscall is `sys_ulock_wait2`, from a `malloc`, indirectly called via `nsSocketTransportService::DetachSocket`.
  - [`nsSocketTransportService::DetachSocket` logs](https://searchfox.org/mozilla-central/rev/6e2b186c296474e032d9ae2e000b7c870396775c/netwerk/base/nsSocketTransportService2.cpp#433-436) `nsSocketTransportService::DetachSocket [handler=`. This log appears in the initial report and in the other logs and profiles that I have shared in this bug so far, always after `ErrorAccordingToNSPR [in=-5999 out=80004005]`.
- The next syscall (ignoring syscalls related to memory management rooted in `DetachSocket`) is `sys_close`, which is passed the fd.
  - This happens [immediately after the log call above](https://searchfox.org/mozilla-central/rev/6e2b186c296474e032d9ae2e000b7c870396775c/netwerk/base/nsSocketTransportService2.cpp#440-443) (after the `nsSocketTransport::OnSocketDetached` log entry).

So, based on these observations, no syscall result told Firefox that the fd was bad, yet Firefox somehow concluded otherwise.
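
For comparison, this is roughly what a healthy non-blocking connect looks like at the syscall level: `connect()` reports that it is in progress, `select` blocks until the socket is writable, and only then is the outcome read back. The sketch below is generic POSIX socket code in Python, not NSPR's implementation; `finish_connect`, the 5-second timeout, and the loopback server are assumptions for the example.

```python
import errno
import select
import socket


def finish_connect(sock: socket.socket, timeout: float = 5.0) -> int:
    """Wait for a non-blocking connect to settle; return 0 or an errno."""
    # select() blocks until the socket becomes writable or the timeout expires.
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable:
        return errno.ETIMEDOUT
    # SO_ERROR holds the pending result of the asynchronous connect.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)


# A listening socket on loopback stands in for the remote server (assumption).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.setblocking(False)
# connect_ex returns an errno instead of raising; on POSIX systems
# EINPROGRESS is the usual answer for a non-blocking TCP connect.
rc = client.connect_ex(server.getsockname())
result = finish_connect(client) if rc in (0, errno.EINPROGRESS) else rc

client.close()
server.close()
```

In this flow, an error would surface either as a failing `select` or as a non-zero `SO_ERROR`. The trace above shows neither before the fd is closed.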