Request hangs until cancellation because cache entry callback is never fired
Categories
(Core :: Networking: Cache, defect)
Tracking
()
People
(Reporter: kershaw, Assigned: jesup, NeedInfo)
References
Details
(Keywords: webcompat:platform-bug)
User Story
user-impact-score:225
Attachments
(1 file)
Summary: Channel 163296100 stalls waiting on a cache callback that is never invoked
Channel: nsHttpChannel 163296100 (nsIChannel=163296140)
Request: https://apis.google.com/js/api.js (script), partitioned under (https,google.com)
Cache entry: 14beede40, callback 163296b10
Log file: log.txt-main.31466.moz_log
The channel opened cache entry 14beede40 while it was REVALIDATING and owned by
another channel ("not the writer"). It registered callback 163296b10 and parked in
AwaitingCacheCallbacks, but the callback was never invoked. The request hung ~46s
and only ended when the doc loader was stopped (NS_BINDING_ABORTED, 0x804b0002) -
it never opened a connection or issued a transaction.
Timeline
-
Created & opened - 16:40:46.790-.793
54206 D/nsHttp Creating nsHttpChannel [this=163296100, nsIChannel=163296140]
54708 D/nsHttp nsHttpChannel::AsyncOpen [this=163296100]
54730 D/nsHttp nsHttpChannel::BeginConnect [this=163296100]
54950 D/nsHttp nsHttpChannel::Connect [this=163296100] -
Opens cache entry - REVALIDATING, owned by another channel - 16:40:46.795 (root of the stall)
54987 D/nsHttp nsHttpChannel::OpenCacheEntry [this=163296100]
54997 D/cache2 CacheStorageService::AddStorageEntry [entryKey=:https://apis.google.com/js/api.js, contextKey=O^partitionKey=%28https%2Cgoogle.com%29,]
55005 D/cache2 CacheEntry::AsyncOpen [this=14beede40, state=REVALIDATING, flags=20, callback=163296b10]
55015 D/cache2 CacheEntry::RememberCallback [this=14beede40, cb=163296b10, state=REVALIDATING]
55019 D/cache2 entry is being written/revalidated
55023 D/cache2 not the writer
55025 D/nsHttp nsHttpChannel::Connect 163296100 AwaitingCacheCallbacks forces async -
~46 seconds of silence - 16:40:46.795 -> 16:41:32.541
Callback 163296b10 is NEVER invoked for 163296100 (no OnCacheEntryAvailable).
Meanwhile entry 14beede40 is repeatedly purged/re-added in the background:91687 D/cache2 CacheEntry::Purge [this=14beede40, what=2]
... (recurs through 613248) ... -
Cancelled by doc-loader stop, then torn down - 16:41:32.541
624964 D/nsHttp nsHttpChannel::Cancel [this=163296100 status=804b0002, reason=nsDocLoader::Stop]
624967 D/nsHttp HttpAsyncAborter::AsyncAbort [this=163296100 status=804b0002]
624977 D/nsHttp HttpChannelParent::OnStopRequest: [this=15f14a8d0 aRequest=163296140 status=804b0002]
624990 D/nsHttp Destroying HttpChannelParent [this=15f14a8d0]
Note: this is a definite stall-until-cancellation; why the callback never fired
(revalidating writer never releasing waiters, repeated purges, etc.) is not yet
established from this log and is the thing to investigate.
Comment 2•9 days ago
|
||
I experienced this issue and provided a log. Feel free to let me know if additional info is needed.
Updated•9 days ago
|
| Assignee | ||
Comment 3•9 days ago
|
||
Looks like this may be an existing race condition which became more likely due to the No-Vary-Search landings. Claude analysis:
Here's what I've found. There's a pre-existing race condition in CacheEntry::InvokeCallback that I believe is the root cause, and the recent cache2 changes may increase the chance of hitting it.
The race
InvokeCallback releases mLock for OnCacheEntryCheck (line 831). While the lock is released, the management thread can run BackgroundOp(Ops::CALLBACKS) → InvokeCallbacks() → process a different callback from the same entry. Both see state READY, both call OnCacheEntryCheck concurrently (the TODO at line 829 acknowledges this).
If both return ENTRY_NEEDS_REVALIDATION:
- Thread A re-acquires lock → sets
mState = REVALIDATING(line 865) - Thread A releases lock for
InvokeAvailableCallback(line 899) - Thread A calls
NewWriteHandle()→ setsmWriter = handle_A(line 1017) - Channel A gets handle_A via
OnCacheEntryAvailable - Thread B re-acquires lock → state is already REVALIDATING, no-op (line 865)
- Thread B calls
NewWriteHandle()→ overwritesmWriter = handle_B(line 1017) - Channel B gets handle_B
Now mWriter points to handle_B. When Channel A finishes and closes handle_A:
// OnHandleClosed:
if (mWriter != aHandle) { // mWriter == handle_B, aHandle == handle_A
LOG((" not the writer"));
return; // <-- NO state transition, NO callback dispatch
}
The state stays REVALIDATING. If Channel B also fails to complete (cancelled, navigated away, etc.), the entry is stuck permanently in REVALIDATING, and all waiters' callbacks never fire — exactly what bug 2043675 describes.
How the recent changes may increase exposure
Bug 2038008 (No-Vary-Search): Landed 2026-05-28, one day before 2043675 was filed. The NVS secondary lookup in AddStorageEntry can match channels to existing entries they wouldn't have matched before (different query strings → same entry). More callbacks on the same entry = more opportunities for the concurrent OnCacheEntryCheck race. For a popular script like apis.google.com/js/api.js, Google may well serve No-Vary-Search on it (excluding tracking parameters like utm_source), causing multiple channels with different query strings to share one cache entry.
The OnFileReady lock-scope change also adds mFile->GetElement("no-vary-search") inside mLock, which could subtly affect lock timing.
Bug 2036906 and 2031577: These are about doom callbacks and eviction, respectively. They don't affect the InvokeCallback race directly.
Summary
- Root cause: Pre-existing race in
InvokeCallback(acknowledged by the TODO at line 829). Two concurrentOnCacheEntryCheckcalls can both getENTRY_NEEDS_REVALIDATION, leading toNewWriteHandlebeing called twice — the second call silently overwritesmWriter, orphaning the first writer handle. - Why now?: Bug 2038008's NVS secondary lookup likely increased the number of callbacks sharing a single cache entry, making the race more probable.
- The repeated purges are a red herring — the management thread calls
Purge(PURGE_WHOLE)but it fails each time because the entry is still referenced by the waiter's handle.
| Assignee | ||
Comment 4•9 days ago
|
||
The analysis here seems reasonable (though it might not be the cause)
| Assignee | ||
Comment 5•9 days ago
|
||
Updated•9 days ago
|
Updated•8 days ago
|
| Assignee | ||
Comment 7•6 days ago
|
||
[Tracking Requested - why for this release]: For people this hits it's affecting ability to use google properties, worth considering for uplifts to beta and possibly release.
Comment 8•6 days ago
|
||
It sounds like this is a longstanding defect. Not sure it's something we need to track specifically but happy to consider uplift requests when they come along.
Updated•5 days ago
|
Comment 10•5 days ago
|
||
The patch landed in nightly and beta is affected.
:jesup, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- See https://wiki.mozilla.org/Release_Management/Requesting_an_Uplift for documentation on how to request an uplift.
- If no, please set
status-firefox152towontfix.
For more information, please visit BugBot documentation.
Comment 11•4 days ago
|
||
(proven in Slack that this affects release too)
Updated•3 days ago
|
Description
•