Closed Bug 1760580 Opened 2 years ago Closed 9 months ago

Gmail irrevocably loses communication after 5-12 hours online from version 98 of Firefox

Categories

(Core :: Networking, defect, P2)

Firefox 98
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: 4pub, Assigned: kershaw)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(8 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0

Steps to reproduce:

I entered to Gmail and I leave the computer on to keep track of new emails for long time (from several days to several weeks).
(FYI: OS: Windows Server 2008 SP1, Firefox with over a thousand tabs.)

Actual results:

After 5-12 hours with a probability of 70-80% Gmail issues a message that it cannot connect to the server, although the Internet works fine. Gmail works again only if I restart the browser. The bug appeared after the update on Firefox 98.
(FYI: I ran into this mistake 5-6 times during the week. As a result, I had to restore the disk from the backup with Firefox 97.0.2 in which there is no this bug.)

Expected results:

Gmail before FF 98 did not losing connection if everything was ok with the Internet.
Previously, Gmail automatically restored the connection if the Internet was restored after problems.

The Bugbug bot thinks this bug should belong to the 'Core::Networking' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Networking
Product: Firefox → Core

Can you try to record a HTTP log when the problem occurs.
Instructions to create a log can be fond here?
Please log out from gmail after collecting logs because the logs contain cookies.
You can also send logs to necko@mozilla.com.

Flags: needinfo?(4pub)

I tried to unsuccessfully reproduce this bug during the last 4 days. Finally, I saw this error again in the browser today in the morning (Firefox 98.0.1 64-bit).
I did not launch other browsers in parallel with FF during these days.
I tried to record HTTP Log, but Firefox crashed in a minute after that. Therefore, I'm not sure that information about the error is recorded in this first log.
After that, I launched Firefox again. I launched parallel to Pale Moon after 2-3 hours. After 5-10 minutes of work in Pale Moon, I noticed that this bug appeared in Firefox again. At the same time, Gmail continued to work correctly in Pale Moon. I do not know if this is linked, but it is possible that parallel launch of other browsers speeds up the occurrence of this bug. I recorded the second HTTP log of this bug. I sent this second http log to necko@mozilla.com.

Flags: needinfo?(4pub)

The log show that this is related to service worker.
I've seen multiple requests are redirected to intercepted channel, but they are never opened.

Eden, is this a known issue? Does it help if I forward the http log to you?

Flags: needinfo?(echuang)

Sorry for the late reply. Yes, Kershaw, Dragana, please forward me the log, I will take a look at the interception part.

Flags: needinfo?(kershaw)
Flags: needinfo?(dd.mozilla)

I'll forward the log to (In reply to Eden Chuang[:edenchuang] from comment #5)

Sorry for the late reply. Yes, Kershaw, Dragana, please forward me the log, I will take a look at the interception part.

I'll forward the log.

Flags: needinfo?(kershaw)
Flags: needinfo?(echuang)
Flags: needinfo?(dd.mozilla)

Kershaw pointed me to look at the canceled channel which uri is https://mail.google.com/sync/u/0/i/bv?hl=ru&c=113.
According to the log, the channel cancelation seems to be from the child process.
Since we had already enabled parent interception from Firefox 74, so the cancelation is not from the internal ServiceWorker.
We probably need more information for debugging why HttpChannelChild::SendCancel() is called with NS_BINDING_ABORT.

Channel creation for https://mail.google.com/sync/u/0/i/bv?hl=ru&c=113.

2022-03-28 09:57:39.479000 UTC - [Parent 6928: Main Thread]: V/nsHttp Creating nsHttpChannel [this=22791d00]
2022-03-28 09:57:39.479000 UTC - [Parent 6928: Main Thread]: E/nsHttp HttpBaseChannel::Init [this=22791d00]
2022-03-28 09:57:39.479000 UTC - [Parent 6928: Main Thread]: E/nsHttp host=mail.google.com port=-1
2022-03-28 09:57:39.479000 UTC - [Parent 6928: Main Thread]: E/nsHttp uri=https://mail.google.com/sync/u/0/i/bv?hl=ru&c=113

IntecrceptedHttpChannel creation for https://mail.google.com/sync/u/0/i/bv?hl=ru&c=113.

2022-03-28 09:57:39.487000 UTC - [Parent 6928: Main Thread]: V/nsHttp Creating HttpBaseChannel @79b95e00
2022-03-28 09:57:39.487000 UTC - [Parent 6928: Main Thread]: E/nsHttp HttpBaseChannel::Init [this=79b95e00]
2022-03-28 09:57:39.487000 UTC - [Parent 6928: Main Thread]: E/nsHttp host=mail.google.com port=-1
2022-03-28 09:57:39.487000 UTC - [Parent 6928: Main Thread]: E/nsHttp uri=https://mail.google.com/sync/u/0/i/bv?hl=ru&c=113

InterceptedHttpChannel redirect (ServiceWorker Interception)

2022-03-28 09:57:39.487000 UTC - [Parent 6928: Main Thread]: D/nsHttp HttpChannelParent::AsyncOnChannelRedirect [this=215e25c0, old=22791d40, new=79b95e40, flags=4]
2022-03-28 09:57:39.487000 UTC - [Parent 6928: Main Thread]: D/nsHttp HttpChannelParent::StartRedirect [this=215e25c0, newChannel=79b95e40 callback=34261f20]
2022-03-28 09:57:39.488000 UTC - [Parent 6928: Main Thread]: D/nsHttp HttpChannelParent::CompleteRedirect [this=215e25c0 succeeded=1]

Receiving channel cancel from child process.

2022-03-28 09:58:39.480000 UTC - [Parent 6928: Main Thread]: D/nsHttp HttpChannelParent::RecvCancel [this=215e25c0]
2022-03-28 09:58:39.480000 UTC - [Parent 6928: Main Thread]: D/nsHttp HttpAsyncAborter::AsyncAbort [this=79b95e00 status=804b0002]
2022-03-28 09:58:39.480000 UTC - [Parent 6928: Main Thread]: V/nsHttp HttpBaseChannel::DoNotifyListener this=79b95e00

The interesting thing is the channel seems to be monitored by WebRequest.jsm, which means there could be some extensions monitoring this request.
I am not sure whether this cancelation is related to the extensions or not.

2022-03-28 09:57:39.486000 UTC - [Parent 6928: Main Thread]: V/nsHttp nsHttpChannel::SuspendInternal [this=22791d00]
2022-03-28 09:57:39.486000 UTC - [Parent 6928: Main Thread]: V/nsHttp 22791d00 called from script: resource://gre/modules/WebRequest.jsm:998:18
2022-03-28 09:57:39.486000 UTC - [Parent 6928: Main Thread]: V/nsHttp Waiting until resume [this=22791d00]

Transfer NI back to Kershaw to analyze the reason for HttpChannelChild::Cancel().

Flags: needinfo?(kershaw)

Hi Reporter,

Could you try to get the http log again with the latest Firefox Nightly?
We've added more log information in newer Firefox versions. Hope this can five us some clues to move forward.

Thanks.

Flags: needinfo?(kershaw) → needinfo?(4pub)

I use Firefox in the production environment, so I have some difficulties for using Firefox Nightly. But I will try to install Firefox Nightly in the next week.
Unfortunately, the bug is very unpredictable, it may arise after launching Firefox from 5 minutes to 6 days.
According to indirect impressions, the bug may be associated with memory leakage, because It usually occurs when RAM filled on 60-80% and the total size of Firefox processes in memory exceeds 2 GB.
Also today, I saw another manifestation of this error: I tried to go to another folder in Gmail, the message "Loading" appeared and - no result.
I also found another sign of this error: if the bug arose, then an attempt to click on the "Reload" button does not reload the Firefox tab with Gmail - only the icon on the hourglass is changed.

Hi,

I sent the new http log (from Firefox 100.0 64 bit) to necko@mozilla.com

Flags: needinfo?(4pub)
Whiteboard: [necko-priority-review]
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-priority-review] → [necko-priority-queue]

Sorry for the long delay.
I've looks at the log file. Looks like there were some requests (play.google.com) manipulated by some web extension.
For example, I saw the request fdd6900 was redirected to a data URL and got cancelled with NS_ERROR_DOM_BAD_URI.

2022-05-07 18:15:52.388 ⁃ nsHttpChannel ⁃ fdd6900 ⁃ cancelled ⁃ status=n/a ⁃ http-status=n/a ⁃ url=https://play.google.com/log?format=json&hasfast=true
...
2022-05-07 18:15:52.397000 UTC - [Parent 668: Main Thread]: D/nsHttp HttpBaseChannel::RedirectTo [this=fdd6900, uri=data:text/plain;base64,Cg==]
2022-05-07 18:15:52.397000 UTC - [Parent 668: Main Thread]: V/nsHttp fdd6900 called from script: resource://gre/modules/WebRequest.jsm:1060:20
...
2022-05-07 18:15:52.459000 UTC - [Parent 668: Main Thread]: V/nsHttp nsHttpChannel::Cancel [this=fdd6900 status=805303f4]

I am not sure if this is related. Could you try to disable all web extensions and see if you can still reproduce this?
Also, when this happens, have you noticed any error messages in web console?
Thanks.

Flags: needinfo?(4pub)

I use a few web extensions (screenshot of the list is attached) and I do not think that the cause of the bug is in them.
I turned off all the web extensions and restarted Firefox 102.0 64-bit. I saw this bug again after two days online.
I sent the new http log (from Firefox 102.0 64 bit without all web extensions) to necko@mozilla.com
I see in the logs that there are again attempts to connects to play.google.com (and also to www.facebook.com). I do not go to these sites on this OS.
I attached screenshots of error messages in web console (FF with extensions and FF without extensions).
Error messages contain links to the following URL:
https://developer.mozilla.org/docs/Web/HTTP/CORS/Errors/CORSRequestNotHttp?utm_source=devtools&utm_medium=firefox-cors-errors&utm_campaign=default
https://developer.mozilla.org/docs/Web/HTTP/CORS/Errors/CORSDidNotSucceed?utm_source=devtools&utm_medium=firefox-cors-errors&utm_campaign=default

Flags: needinfo?(4pub)

Thanks for the log.
It shows the same symptom as the previous one (comment #7): some requests to https://mail.google.com are just cancelled.
The log also shows which script in child process cancels the request (searching cancelled call in child process from script in the log), but it's not helpful to understand the reason.

Another thing I observed from the log is that there is one connection to play.google.com keeps being closed by the server side.

2022-07-03 20:02:26.882000 UTC - [Parent 2704: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketWritable [this=56326100] host=play.google.com
2022-07-03 20:02:26.882000 UTC - [Parent 2704: Socket Thread]: V/nsHttp TlsHandshaker::EnsureNPNComplete [mOwner=56326100] drive TLS handshake
2022-07-03 20:02:26.882000 UTC - [Parent 2704: Socket Thread]: V/nsHttp nsHttpConnection::GetSecurityInfo trans=5c903e00 tlsfilter=0 socket=56505818
2022-07-03 20:02:26.882000 UTC - [Parent 2704: Socket Thread]: D/nsSocketTransport nsSocketOutputStream::Write [this=56505ab8 count=542]
2022-07-03 20:02:26.882000 UTC - [Parent 2704: Socket Thread]: D/nsSocketTransport   calling PR_Write [count=542]
2022-07-03 20:02:26.882000 UTC - [Parent 2704: Socket Thread]: D/nsSocketTransport   PR_Write returned [n=-1]
2022-07-03 20:02:26.882000 UTC - [Parent 2704: Socket Thread]: D/nsSocketTransport ErrorAccordingToNSPR [in=-5961 out=804b0014]
2022-07-03 20:02:26.882000 UTC - [Parent 2704: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnMsgOutputClosed [this=56505800 reason=804b0014]
2022-07-03 20:02:26.882000 UTC - [Parent 2704: Socket Thread]: V/nsHttp nsHttpConnection::CloseTransaction[this=56326100 trans=5c903e00 reason=804b0014]

I am not sure why, but this looks like is caused by a firewall.

Move this bug to our priority candidate list, since it's not clear how to move this forward.

Whiteboard: [necko-priority-queue] → [necko-priority-review]

Hmm ... really http/https traffic does not go to play.google.com from browsers. But Ping works. And http/https traffic passes to IP in which play.google.com resolves.
I also see that Gmail works despite the errors of connects to play.google.com in the web console.
And also Android smartphones can go to GooglePlay without problems in the same network.
Let's temporarily leave aside the reason why http/https traffic began to constantly block from my network to play.google.com from browsers.
I have several assumptions, and I will exploring the situation over the next few days.

Let's figure out why Gmail first works, and then stops working, given that http/https traffic to play.google.com is constantly blocked.
I suppose that Firefox incorrectly processes the errors of the connection to play.google.com and some OS resources are exhausted over time (there may be memory, there may be something else). As a result, the tab with Gmail freezes when these resources are exhausted. In favor of this assumption, I see a constant slight increase of used memory of the Firefox processes in Windows Task Manager, given that I do not touch anything in the browser.

I think that the bug can be reproduced if you add a line to the hosts-file:
127.0.0.1 play.google.com
Then go to Gmail and wait a few days.
I think OS can be Windows 7 64 Bit and with a small amount of RAM.

Whiteboard: [necko-priority-review] → [necko-priority-queue]
See Also: → 1774211
Assignee: nobody → kershaw
Attachment #9292433 - Attachment description: Bug 1760580 - P1: Make nsIRequest support to store canceled reason, r=#necko → Bug 1760580 - P1: Make nsIRequest store canceled reason, r=#necko

Remove this from priority queue, since the diagnostic patches are ready to land.
This bug will be tracked by necko-hang meta bug.

Whiteboard: [necko-priority-queue] → [necko-triaged]
Attachment #9292433 - Attachment description: Bug 1760580 - P1: Make nsIRequest store canceled reason, r=#necko → Bug 1760580 - P1: Make nsIRequest support to store canceled reason, r=#necko
Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b490044fd0e1
P1: Make nsIRequest support to store canceled reason, r=necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/93d4e7303430
P2: Provide a reason when request is cancelled with NS_BINDING_ABORTED, r=necko-reviewers,media-playback-reviewers,valentin,chunmin
https://hg.mozilla.org/integration/autoland/rev/721e07ef9d4f
P3: Use more CancelWithReason in HttpChannelChild, r=necko-reviewers,valentin
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Keywords: leave-open
Regressions: 1789943

Comment on attachment 9293828 [details]
Port Bug 1760580 - Implement CancelWithReason methods. r=darktrojan

Revision D156906 was moved to bug 1789943. Setting attachment 9293828 [details] to obsolete.

Attachment #9293828 - Attachment is obsolete: true
See Also: → 1788173
See Also: → 1794763

The severity field for this bug is set to S3. However, the following bug duplicate has higher severity:

:kershaw, could you consider increasing the severity of this bug to S2?

For more information, please visit auto_nag documentation.

Flags: needinfo?(kershaw)

I think this can be stayed as S3.

Flags: needinfo?(kershaw)

Hi Reporter,

We've added more log information recently.
If you are still able to reproduce this issue, could you try to record the http log again?

Thanks.

Flags: needinfo?(4pub)
See Also: → 1811117

Clear a needinfo that is pending on an inactive user.

Inactive users most likely will not respond; if the missing information is essential and cannot be collected another way, the bug maybe should be closed as INCOMPLETE.

For more information, please visit auto_nag documentation.

Flags: needinfo?(4pub)

Sorry for the delay, I was sick.
I still see this bug in every new version of Firefox.

The leave-open keyword is there and there is no activity for 6 months.
:kershaw, maybe it's time to close this bug?
For more information, please visit BugBot documentation.

Flags: needinfo?(kershaw)

I filed a new bug 1845574 for analyzing the latest log from reporter.

Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Keywords: leave-open
Resolution: --- → FIXED
Flags: needinfo?(kershaw)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: