Closed Bug 1402724 Opened 8 years ago Closed 3 years ago

Crash in shutdownhang | mozilla::net::detail::BlockingIOWatcher::WatchAndCancel

Categories

(Core :: Networking: Cache, defect, P3)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
108 Branch
Tracking Status
firefox-esr52 --- wontfix
firefox-esr60 --- wontfix
firefox-esr102 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- verified disabled
firefox108 --- fixed

People

(Reporter: yoasif, Assigned: valentin)

References

Details

(Keywords: crash, regression, Whiteboard: [necko-triaged] [qa-not-actionable])

Crash Data

This bug was filed from the Socorro interface and is report bp-f2766625-52bc-49d2-b62b-05b550170924.
=============================================================

201 crashes in the last week. Two commenters mention Facebook:

"Mozilla hangs completely, plugins don't work, problem with FB (specifically, no connection to the chat)" a3bf6bb6-ea3e-4a35-b234-559020170923

"This happens every time I go to Facebook's site." ca23ee9e-3573-41d1-a3ba-45ecf0170921
Product: Core → Firefox
Seems low volume, but has picked up in 56 and 52.4.0 ESR. Any recent landing that might be suspect?
Component: General → Networking: Cache
Keywords: regression
Product: Firefox → Core
This is a known issue that used to be pretty massive, and which we have mitigated as far as we could. Although we allow very few IO operations after shutdown in the HTTP cache, it is still possible to hang even before the browser is shut down, for reasons we still don't fully understand. The cause may be an overloaded disk, buggy drivers, a disconnected network disk that stores the cached data, or a Windows kernel bug. I think a Windows update is more likely to have changed this than a code change (we haven't landed any major change here for quite a long time).
Priority: -- → P5
Priority: P5 → P3
Whiteboard: [necko-triaged]
(In reply to Asif Youssuff from comment #0) > This happens every time I go to Facebook's site. > ca23ee9e-3573-41d1-a3ba-45ecf0170921 I noticed klsihk64.dll on the stack, which should be something from Kaspersky Lab. This might cause an IO slowdown. Try disabling it. Anyway, if you can reproduce it reliably, can you provide a log? See https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging#Using_aboutnetworking; you should use MOZ_LOG=timestamp,sync,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5,cache2:5
Flags: needinfo?(yoasif)
Sorry, I can't provide the information since I am not experiencing the issue. I just copied comments from the comments available on Socorro.
Flags: needinfo?(yoasif)
bp-cbc66c74-c30a-470b-bd15-abb450180103 has modules iNetSafe.dll and KeyCrypt32(9).dll. That user also has several other crash signatures:
bp-7a7e9d38-2fc8-4c70-b258-6893f0180104 shutdownhang | ZwQueryAttributesFile
bp-3a569570-d6bd-4aca-9438-90e650180103 shutdownhang | ZwCreateFile
bp-1d87c26c-64e0-4567-962f-63fda0180103 shutdownhang | NtSetInformationFile

This signature and the one in bug 1435343 are both spiking as of Thursday April 12. That could correlate with the 66.0.3 release or could be unrelated. There are also crashes now showing up in 67 beta 9.

While this still isn't a very high volume crash, I'd like to keep an eye on this and it may warrant investigation.

The recent crash spike seems to be skewed towards users with German builds: 80% of crash reports have antivirus software from "Avira" showing up in the telemetry environment.

I left a comment on the Avira support site - let's see if they answer.

Selena - since this crash volume is now quite high for release and beta, can you help find someone to investigate these crashes? Thanks!

Priority: P3 → P1

Selena is out - Dragana, could you take a look at this one

Flags: needinfo?(dd.mozilla)

This is not a new bug. There were always shutdown hangs caused by the cache code.

At shutdown we try to join the IO thread, which may be blocked doing synchronous IO. The code where this crash happens tries to interrupt that blocking IO, and we strongly suspect the interruption doesn't really work in extreme cases, e.g. when A/V software is involved. If we remove that interruption code, the crash will not go away; we will still hang joining the thread.

If we remove the cache2 IO thread join (or make it timeout-able), let that thread leak in production builds, and let the system terminate it, I believe any IO blocking will just shift somewhere else, to other code such as cookies or storage.
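To make the "timeout-able join" idea concrete, here is a minimal sketch in standard C++. It is not Gecko code: `RunWithTimedJoin` is a hypothetical helper invented for this illustration, and the actual cache2 IO thread uses Gecko's own thread primitives. The sketch waits a bounded time for the worker to signal completion and otherwise detaches it, which is the "let that thread leak" behavior described above.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper (illustration only, not part of Gecko).
// Runs `work` on a new thread and waits at most `timeout` for it to finish.
// Returns true if the thread finished and was joined, false if it was
// detached and left for the OS to clean up at process exit.
template <typename Fn>
bool RunWithTimedJoin(Fn work, std::chrono::milliseconds timeout) {
  assert(timeout.count() >= 0);

  std::promise<void> done;
  std::future<void> finished = done.get_future();

  std::thread worker([w = std::move(work), d = std::move(done)]() mutable {
    w();            // may block in synchronous IO
    d.set_value();  // signal completion to the waiting thread
  });

  if (finished.wait_for(timeout) == std::future_status::ready) {
    worker.join();  // returns promptly: the work is already done
    return true;
  }

  worker.detach();  // give up waiting; the thread keeps running ("leaks")
  return false;
}
```

Note that detaching does nothing to unblock the IO itself. It only stops the shutdown path from waiting on it, which is consistent with the concern above that the hang may simply move to other code doing IO.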

I'm the author of the cache thread code and the IO interruption code, but I can't attend to this bug sooner than next week. Michal knows it too; he may think of some urgent fix for this.

Flags: needinfo?(dd.mozilla) → needinfo?(michal.novotny)

I've noticed that in some of the reports the cache thread is doing IO that comes from WriteEvent. That is quite strange, because we ignore any write operation 2 seconds after shutdown was requested: https://searchfox.org/mozilla-central/rev/ec489aa170b6486891cf3625717d6fa12bcd11c1/netwerk/cache2/CacheFileIOManager.cpp#1969

If the call stacks are correct, this means that either we have a very short time to finish IO operations when shutting down, or the cache thread is blocked on the write event for some reason.
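The "ignore writes 2 seconds after shutdown was requested" rule can be sketched as a small gate that is consulted before each write starts. This is an illustration only and not the CacheFileIOManager implementation: the class `ShutdownWriteGate`, its method names, and the injected timestamps are assumptions made for this example.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical illustration of a shutdown grace period for writes.
// Not the actual CacheFileIOManager code.
class ShutdownWriteGate {
 public:
  explicit ShutdownWriteGate(std::chrono::milliseconds grace) : mGrace(grace) {
    assert(grace.count() >= 0);
  }

  // Records the time of the first shutdown request; later calls are ignored.
  void RequestShutdown(Clock::time_point now) {
    if (!mShutdownRequested) {
      mShutdownRequested = true;
      mShutdownAt = now;
    }
  }

  // True if a write may still start at time `now`.
  bool ShouldWrite(Clock::time_point now) const {
    if (!mShutdownRequested) {
      return true;  // normal operation
    }
    return now - mShutdownAt <= mGrace;  // still inside the grace period
  }

 private:
  std::chrono::milliseconds mGrace;
  bool mShutdownRequested = false;
  Clock::time_point mShutdownAt{};
};
```

A gate like this only prevents new writes from starting. A write that passed the check and is already blocked in the kernel is unaffected, which matches the second possibility raised above: the cache thread may be stuck inside a write event that began before the deadline.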

Flags: needinfo?(michal.novotny)

The write event may be hanging there for quite a long time (a minute or more). A blocked cache IO thread may also cause pages/resources to stop loading (channels will wait indefinitely for cache entries), which may be part of the reason users try to close Firefox and start it again to fix the problem.

The recent crash spike from German users has been gone again since the beginning of the week, so we can probably downgrade the priority of this bug again.

Dropping to P3 because this is not very actionable.

Priority: P1 → P3
Whiteboard: [necko-triaged] → [necko-triaged] [qa-not-actionable]
Severity: critical → S2

I'm removing BlockingIOWatcher::WatchAndCancel in bug 1794376. Should fix these crashes.

Depends on: 1794376
Severity: S2 → S3

This should have been fixed in bug 1794376.
Setting a reminder to check crash reports.

Whiteboard: [necko-triaged] [qa-not-actionable] → [necko-triaged] [qa-not-actionable][reminder-deprecation 2023-01-15]
See Also: → 1800864

2 months ago, Valentin Gosu [:valentin] (he/him) placed a reminder on the bug using the whiteboard tag [reminder-deprecation 2023-01-15] .

kershaw, please refer to the original comment to better understand the reason for the reminder.

Flags: needinfo?(kershaw)
Whiteboard: [necko-triaged] [qa-not-actionable][reminder-deprecation 2023-01-15] → [necko-triaged] [qa-not-actionable]

Looks like this is really fixed. I didn't see any crash after 108.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(kershaw)
Resolution: --- → FIXED
Assignee: nobody → valentin.gosu
Target Milestone: --- → 108 Branch