Closed Bug 1268922 Opened 8 years ago Closed 8 years ago

Users report Firefox stopped loading pages after the 46 update

Categories

(Core :: Networking, defect)

46 Branch
All
Windows NT
defect
Not set
critical

Tracking


VERIFIED FIXED
mozilla49
Tracking Status
firefox46 blocking fixed
firefox47 + fixed
firefox48 + fixed
firefox49 + fixed

People

(Reporter: philipp, Assigned: mayhemer, NeedInfo)

References

Details

(Keywords: regression, Whiteboard: [necko-active])

Crash Data

Attachments

(3 files, 1 obsolete file)

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-3ac0097f-27ca-40b7-8435-d78932160428.
=============================================================

we are seeing a number of reports on sumo by users saying they cannot load any pages after the firefox 46 update - sites will just stay blank.
troubleshooting data from one user suggests he was frequently hitting the mozilla::net::ShutdownEvent::PostAndWait shutdownhang (bug 1263199).

if you look through the user comments for that signature a number of them seem to describe the same issue:
*Firefox won't load any pages at all. It was working fine just 30 minutes ago, and it won't load any pages at this point. I recently updated it. Even after the update, it worked fine. I don't understand how it could suddenly stop loading pages. Other browsers work fine.
https://crash-stats.mozilla.com/report/index/3ac0097f-27ca-40b7-8435-d78932160428
*HELP. Post update - I can't use Firefox
https://crash-stats.mozilla.com/report/index/70510c05-707b-4e8f-b4ba-a6f452160427
*I cant even get to other websites when i use the search help!!
https://crash-stats.mozilla.com/report/index/75e1f0b5-2d58-4261-ba1a-cba9e2160429
*only private window loads;
https://crash-stats.mozilla.com/report/index/830604be-1533-4c5e-9104-07fdc2160429
*it doesn't load anything
https://crash-stats.mozilla.com/report/index/3bd0cc5e-4bc0-4d38-818d-fbf5f2160427
*Firefox just keeps spinning never getting to a website, even after upgrading.
https://crash-stats.mozilla.com/report/index/a65a6fb5-a183-4b9e-9792-15b902160428
*Firefox tries to connect but never does.
Uninstalled/reinstalled and same crash before install is complete.
https://crash-stats.mozilla.com/report/index/5a21f377-d3af-4598-87c0-045b22160429
*Will not open any web page just continues to circle anti-clockwise in black and not clockwise in blue.
https://crash-stats.mozilla.com/report/index/8e122d14-8b25-4e0a-a8ae-6c9182160429
*firefox opens to only blank tab when starting up or when trying to connect to various websites. System Restore didn't work. Firefox refresh didn't work. I've uninstalled firefox and reinstalled, problem continues
https://crash-stats.mozilla.com/report/index/20475e35-f197-4c42-8121-32b022160429
*What is the problem? I haven't been able to use browser for the last three hours. It just spins and spins!!
https://crash-stats.mozilla.com/report/index/a282cf4a-5afb-4fce-bf55-d69c52160427
and threads on sumo that look like the same issue: https://support.mozilla.org/questions/firefox?tagged=bug1268922&show=all
Marking this as blocking 46 right now. We have a large chunk of users already on 46, but updates are currently disabled.
Patrick or Honza, this sounds pretty bad on 46 release, can you investigate? Thanks.
Flags: needinfo?(mcmanus)
Flags: needinfo?(honzab.moz)
At first glance anyhow, this looks to be cache related... so honza and michal are your experts (ni'd, along with jason, to track it).

unfortunately I am PTO today and AFK (mostly) for the weekend - I just saw this on my phone and don't have my laptop with me. I'll check back.
Flags: needinfo?(michal.novotny)
Flags: needinfo?(mcmanus)
Flags: needinfo?(jduell.mcbugs)
Liz, can we ask for email contacts from reporters to help with reproduction?  We will probably ask for logging.  Interesting that we didn't see anything like this on 46 while it was in beta...


Michal, all of the reports from comment 0 are stuck in:

NtQueryFullAttributesFile
GetFileAttributesExW
GetFileInfo
nsLocalFile::ResolveAndStat()
nsLocalFile::Exists(bool*)
mozilla::net::CacheFileIOManager::TrashDirectory(nsIFile*)
mozilla::net::CacheFileIOManager::CheckAndCreateDir(nsIFile*, char const*, bool)

and the rest of the stack is missing.  My theory is that we are failing to create the cache directory and retry with every IO operation (which would be bug 1268587).


Also, all of the reports have the following set of add-ons:

loop@mozilla.org	1.2.6
e10srollout@mozilla.org	1.0
firefox@getpocket.com	1.0
^^^ comment 5.
Flags: needinfo?(michal.novotny)
Flags: needinfo?(lhenry)
(In reply to Honza Bambas (:mayhemer) from comment #5)
> Also, all of the reports have following set of add-ons:
> 
> loop@mozilla.org	1.2.6
> e10srollout@mozilla.org	1.0
> firefox@getpocket.com	1.0

Which are standard now...
OK, got it.  We definitely loop here:

https://dxr.mozilla.org/mozilla-central/source/netwerk/cache2/CacheFileIOManager.cpp#3157-3167

Interestingly, Exists() probably fails for some reason and we keep spinning.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Flags: needinfo?(michal.novotny)
Flags: needinfo?(lhenry)
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(honzab.moz)
Attached patch v1 (obsolete) — Splinter Review
This is untested; I will test it in a moment.

Adjusted theory:
- rename of the directory creates the target dir
- fails to move
- we repeat this with every cache IO op
- hence probably a huge number of trash directories, and the loop then takes that long
Attachment #8747150 - Flags: review?(michal.novotny)
this is the pushlog that an affected user came up with when running the mozregression tool:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=e355cacefc881ba360d412853b57e8e060e966f4
Almost all of the reports have AV software hooked in.  I think AVs are scanning the doomed directory for some reason and maybe leave some files open, so we are able to create the trash.nnnn dir every time we want to create the dir structure, but we cannot actually move the content of the doomed dir into it.  When there is a huge number of trash directories (reproducing that locally right now), everything slows down further when AVs are involved.  A theory, but IMO a plausible one.
Comment on attachment 8747150 [details] [diff] [review]
v1

Review of attachment 8747150 [details] [diff] [review]:
-----------------------------------------------------------------

The patch is OK, by which I mean it should not break anything.

(In reply to Honza Bambas (:mayhemer) from comment #9)
> Adjusted theory:
> - rename of the directory creates the target dir
> - fails to move

Do you mean that this happens inside the nsIFile::MoveToNative method?

One theory why this started to appear in FF46 is that we fixed the condvar problem during shutdown (bug 1257611). I.e. previous versions had the same problem, but we simply didn't wait for all operations to finish during shutdown. OTOH that would mean the users would not be able to load pages with older versions either; they just wouldn't experience the shutdown crash.  But I'm not aware of any other change in this code.

Let's try this patch and see if it helps, but we need to understand why this is happening. Btw, a Windows-only bug again :-/
Attachment #8747150 - Flags: review?(michal.novotny) → review+
the one networking-related bug that shows up in the changeset posted at comment #10 would be bug 1242755.
Keywords: regression
(In reply to Michal Novotny (:michal) from comment #12)
> Let's try this patch if it helps, but we need to understand why this is
> happening. 

Yes, I agree.  Also, my local testing doesn't show any major slowdown, let alone pages failing to load.  I can see we create a trash dir on every IO op (I get over a thousand in a minute), but the cache works and pages load.

> Btw, Windows only bug again :-/

Could be the AV influence, but hard to say.

Also, I strongly suspect the filedesc leaking may have an influence here as well; bug 1239687 was fixed exactly in 46, and there may be some interference with AV software.

Tentatively blocking that bug.
(In reply to philipp from comment #13)
> the one networking related bug that shows up in the changeset posted at
> comment #10 would be bug 1242755.

Very unlikely that would be related.
only passing on what i got from an affected user at https://support.mozilla.org/en-US/questions/1120184#answer-871442
i can try to get feedback from others as well...
Looks like this did show up first in beta 5 and then became worse towards the end of beta, looking at the last month of crashes with this signature, https://crash-stats.mozilla.com/report/list?range_unit=days&range_value=28&signature=shutdownhang+|+WaitForSingleObjectEx+|+WaitForSingleObject+|+PR_WaitCondVar+|+mozilla%3A%3ACondVar%3A%3AWait+|+mozilla%3A%3Anet%3A%3AShutdownEvent%3A%3APostAndWait 

For context, we had many users on beta 5 for longer than usual since there wasn't a beta 6, and beta 7 was only around for one day. Here's what the beta release schedule looked like:

            '46.0b4' => '2016-03-23',
            '46.0b5' => '2016-03-25',
            // 46.0b6 skipped because of infrastructure issues
            '46.0b7' => '2016-04-04',
            '46.0b8' => '2016-04-05',
            '46.0b9' => '2016-04-08',
            '46.0b10' => '2016-04-12',
            '46.0b11' => '2016-04-15',

So, something else may have made the issue worse in a later beta build, but it's hard to narrow that down from the pattern of crash reports by date/build. If it were just a regression in beta 5 i would expect to see many more crashes from beta 5.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #17)
> Looks like this did show up first in beta 5 and then became worse towards
> the end of beta [...]
> So, something else may have made the issue worse in a later beta build, but
> it's hard to narrow that down from the pattern of crash reports by
> date/build. If it were just a regression in beta 5 i would expect to see
> many more crashes from beta 5.


It's because bug 1257611 was uplifted to beta on 2016-03-24.  The problem was already there; we just changed the signature with that bug.  It would be great to hunt for crashes that have mozilla::net::CacheFileIOManager::TrashDirectory on ANY thread when the shutdown watchdog fires.

My biggest suspect (bug 1239687) was regularly riding the trains.
(In reply to philipp from comment #16)
> only passing on what i got from an affected user at
> https://support.mozilla.org/en-US/questions/1120184#answer-871442
> i can try to get feedback from others as well...

Maybe we can give them a try build for testing when we have one?
Locally checked that the repeated attempts to create the dir tree and delete the doomed sub-dir are gone with this patch, and nothing seems broken.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5e37830871d
Attachment #8747150 - Attachment is obsolete: true
Attachment #8747233 - Flags: review+
Keywords: checkin-needed
Comment on attachment 8747233 [details] [diff] [review]
Don't hard-fail cache init on doomed directory removal failure

Approval Request Comment
[Feature/regressing bug #]: unknown, can be indirectly caused by bug 1239687
[User impact if declined]: total browser breakdown
[Describe test coverage new/current, TreeHerder]: somewhat local testing, m-c based try push pending
[Risks and why]: I believe zero; this just bypasses an unneeded operation that otherwise loops with a lot of IO that never succeeds anyway
[String/UUID change made/needed]: none

The reason I ask for approval is to get a decent build we can give the affected users to try, so we know this patch will fix the issue on release.
Attachment #8747233 - Flags: approval-mozilla-release?
Comment on attachment 8747233 [details] [diff] [review]
Don't hard-fail cache init on doomed directory removal failure

Let's make sure this applies to m-r, and this will also give us a build for testing since the crash is hard to reproduce (yet common)
Attachment #8747233 - Flags: approval-mozilla-release? → approval-mozilla-release+
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #22)
> Comment on attachment 8747233 [details] [diff] [review]
> v1 [correct commit message]
> 
> Let's make sure this applies to m-r, and this will also give us a build for
> testing since the crash is hard to reproduce (yet common)

It does.  Waiting for the push from m-r to try.
release-based try (no guarantee the build succeeds)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=aca10cf61c76
Whiteboard: [necko-active]
a user at sumo who originally pointed us to the shutdownhang crashes related to the issue ran the test build from comment #24 without success; it also crashed on shutdown: https://crash-stats.mozilla.com/report/index/c1b7b7e5-70ed-4d93-8290-d4f9d2160430
https://support.mozilla.org/en-US/questions/1120558#answer-871699

he then went on to run the mozregression tool, which yielded inconclusive results imho; all the tested builds worked on the first run. so it may be that the issue isn't immediately reproducible, or that it isn't caused by a change in firefox 46 directly but by some other third-party change, like an AV signature update, that started triggering it...
(In reply to philipp from comment #27)
> a user at sumo who originally pointed us to the shutdownhang crashes related
> to the issue ran the test build via comment #24 unsuccessfully & it also
> crashed on shutdown:
> https://crash-stats.mozilla.com/report/index/c1b7b7e5-70ed-4d93-8290-
> d4f9d2160430

This is different from the other reports because it isn't stuck on cache shutdown. I don't see the cache thread at all.
It's easy to check if the problem is caused by non-responsive Cache IO thread. Just try to load some page in private window.

For further investigation it would be good to have NSPR log with cache2 module logged. Also full listing of cache2 directory would be helpful.
https://hg.mozilla.org/mozilla-central/rev/63826ac5bd1f
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
(In reply to Michal Novotny (:michal) from comment #28)
> > https://crash-stats.mozilla.com/report/index/c1b7b7e5-70ed-4d93-8290-
> > d4f9d2160430
> 
> This is different from other reports because it isn't stuck on cache
> shutdown. I don't see cache thread at all.

Probably because it doesn't have symbols loaded for xul.dll and others?

Reopening based on comment 27.

(In reply to Michal Novotny (:michal) from comment #29)
> It's easy to check if the problem is caused by non-responsive Cache IO
> thread. Just try to load some page in private window.

Yep, I agree, that's a good test.

> 
> For further investigation it would be good to have NSPR log with cache2
> module logged. Also full listing of cache2 directory would be helpful.

Yes, the modules we are interested in are:

NSPR_LOG_MODULES=timestamp,sync,nsHttp:4,cache2:5

See https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging for how to set it up; please don't forget to change the list of modules to the one a few lines above!
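For anyone collecting the log, the setup sketched below follows the linked instructions with the module list from this comment; the log file path and launching Firefox from the same shell are illustrative, not prescriptive:

```shell
# Enable NSPR logging with the cache2 module (module list per the comment
# above). The log file path is illustrative; pick any writable location.
export NSPR_LOG_MODULES=timestamp,sync,nsHttp:4,cache2:5
export NSPR_LOG_FILE=/tmp/firefox-cache2.log

# On Windows (where this bug occurs) the cmd.exe equivalent would be:
#   set NSPR_LOG_MODULES=timestamp,sync,nsHttp:4,cache2:5
#   set NSPR_LOG_FILE=%TEMP%\firefox-cache2.log

# Start Firefox from this same shell so it inherits the variables.
echo "$NSPR_LOG_MODULES"
```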

philipp, could you please contact the reporter?  thanks.
Status: RESOLVED → REOPENED
Flags: needinfo?(madperson)
Keywords: leave-open
Resolution: FIXED → ---
Also, in the report I again see the Rapport AV software involved.  This AV has already caused problems in the past when cache was more heavily used.  I think it's the main culprit here, but I'm not sure what may have triggered that behavior for 46.  

Based also on the inconclusive result of the regression-tool search, I think this is caused by the influence of external software, such as AV.
Status: REOPENED → NEW
Attachment #8747233 - Flags: checkin+
Attached file log.txt
here's the log, courtesy of that particular sumo user
Flags: needinfo?(madperson)
Trashing the doomed directory doesn't finish, so the patch can't help. I guess there are a lot of trash directories, or maybe worse, all possible trash directories already exist (RAND_MAX is only 32767 on Windows), in which case TrashDirectory() will never finish. We should simply limit the number of tries when looking for an unused trash directory name. But that is not going to fix the basic problem that for some reason we fail to delete trash directories.
(In reply to Michal Novotny (:michal) from comment #34)
> Trashing doomed directory doesn't finish, so the patch can't help. I guess
> there is a lot of trash directories, or maybe worse there are already all
> possible trash directories (RAND_MAX is only 32767 in windows) and in this
> case TrashDirectory() will never finish. We should simply limit number of
> tries when we try to find unused trash directory name. But this is not going
> to fix the basic problem that for some reason we fail to delete trash
> directories.

Can you write a patch for the try-to-find loop?

I think the reason we create the trash dirs has already been explained.  Moving a dir first checks whether a simple rename is possible.  If not, it falls back to copy-and-delete, which means creating a dir (the trash dir in this case) and then copying.  But the copy fails (some file is being kept open, i.e. locked).  On every network op we try again and create a new trash dir, on and on.  My simple patch prevents that creation, but doesn't fix the endless loop when there is no dir number left to use.
Comment on attachment 8747807 [details] [diff] [review]
limit number of tries to find unused trash directory

Review of attachment 8747807 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks.
Attachment #8747807 - Flags: review?(honzab.moz) → review+
Comment on attachment 8747807 [details] [diff] [review]
limit number of tries to find unused trash directory

Approval Request Comment
same as for the first patch, risk zero.
Attachment #8747807 - Flags: approval-mozilla-release?
Comment on attachment 8747807 [details] [diff] [review]
limit number of tries to find unused trash directory

Fix for 46 page loading / antivirus conflict, please uplift to m-r
Attachment #8747807 - Flags: approval-mozilla-release? → approval-mozilla-release+
Letting Honza figure that out for now (tomorrow)
Flags: needinfo?(honzab.moz)
(In reply to Wes Kocher (:KWierso) from comment #41)
> https://hg.mozilla.org/releases/mozilla-release/rev/5dfda1b1f575
> 
> 
> Does this need to land anywhere else?

Both patches here should uplift to aurora and beta as well IMO.
Flags: needinfo?(honzab.moz)
Closing for now; we need verification of the new release build (comment 41) from some of the users able to reproduce.  Thanks!
Status: NEW → RESOLVED
Closed: 8 years ago
Keywords: leave-open
Resolution: --- → FIXED
Comment on attachment 8747233 [details] [diff] [review]
Don't hard-fail cache init on doomed directory removal failure

Approved for uplift to aurora and beta from comment 43.
Attachment #8747233 - Flags: approval-mozilla-beta+
Attachment #8747233 - Flags: approval-mozilla-aurora+
Comment on attachment 8747807 [details] [diff] [review]
limit number of tries to find unused trash directory

[Triage Comment]
Attachment #8747807 - Flags: approval-mozilla-beta+
Attachment #8747807 - Flags: approval-mozilla-aurora+
Attachment #8747807 - Flags: checkin+
Attachment #8747233 - Attachment description: v1 [correct commit message] → Don't hard-fail cache init on doomed directory removal failure
issue seems fixed by the build1 candidate for 46.0.1 as per https://support.mozilla.org/questions/1120558?page=2#answer-872657
(In reply to philipp from comment #47)
> issue seems fixed by the build1 candidate for 46.0.1 as per
> https://support.mozilla.org/questions/1120558?page=2#answer-872657

Marking verified based on comment 47.
Status: RESOLVED → VERIFIED
I believe this bug was connected to login failures with PayPal (general account logins, or payment logins).  After a support call with PayPal regarding the failed login attempts, they indicated that they could see "no" activity for the login (what-so-ever).  That led me to believe there was some kind of cache failure occurring.

With the 46.0.1 update, this problem appears fixed.  (I can no longer reproduce the problem).
(In reply to steven.goss from comment #52)
> I believe this bug was connected to Login Failures with PayPal (general
> account logins, or payment logins).  After support call with PayPal,
> regarding the failed login attempts, they were indicating that they could
> see "no" activity of the login (what-so-ever).  That led me to believe their
> was some kind of cache failure occurring.
> 
> With the 46.0.1 update, this problem appears fixed.  (I can no longer
> reproduce the problem).
Flags: needinfo?(shetan.st)
I have already upgraded all browsers to the 46.0.1, and I don't get the problem logging into PayPal anymore.

I would enter my password into the PayPal password (login) field, and their server would reject the login, stating an incorrect password, when I know that I entered it correctly.  As I was worried (about my security), I started exploring this further.  With PayPal on the phone, I would attempt the login on an old Firefox V38, and they could see my login attempt and its success.  However, I would try the same thing on Firefox V46.0, and it would fail (stating incorrect password), and the PayPal rep (on the phone) said they could not even "see" the login attempt from Firefox V46.0 (even though I did get back a returned page about an invalid login attempt).

The failure was on a Mac running Yosemite 10.10.5.

Sorry, that's all the info that I can now provide.  (other than 46.0.1 appears to have fixed the problem).
Comment on attachment 8747233 [details] [diff] [review]
Don't hard-fail cache init on doomed directory removal failure

Please land this patch on both mozilla-aurora and mozilla-beta ASAP.  Both have been approved, but apparently not landed.

Thanks!
Flags: needinfo?(wkocher)