Open Bug 1698289 Opened 3 years ago Updated 2 years ago

Intermittent website authentication failure - requires clearing site cache to repair - hugely annoying to users.

Categories

(Core :: Networking: Cookies, defect, P2)

Firefox 85
x86_64
macOS
defect

Tracking

()

UNCONFIRMED

People

(Reporter: ian.graham, Unassigned)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

This is an odd, intermittent (it has happened on two occasions since upgrading to FF 85) but extremely annoying bug.

Scenario 1 (FF 85) I was trying to log in to a secure (banking) website. Authentication 'seemed' to work (I did not get an 'authentication failed' response), but the returned content looked wrong / did not display the expected information. When I then tried to interaction with the page (e.g. get a list of bank accounts) I was told I needed to login. Which also seemed to succeed but just led again to the above scenario. Shutting down and restarting the browser didn't fix it.

Scenario 2. (FF 86) I was trying to access an online newspaper website that lets you to retain authentication over multi-day sessions. In this case when I returned to the site it looked as if I had been signed out. When I tried to login using the login tool authentication 'seemed' to work (there was no failure message) but the returned pages again looked wrong just as they did at the start of this scenario. And so on ..... Shutting down and restarting the browser did not fix this.

I struggled with this for a while (2 days the first time, 2 hours the second) and in both cases found I needed to clear the cache for the site (Preferences -> Privacy & Security -> Cookies & Site Data -> Manage Data ... select site and purge data) to fix the problem.

Most users will not get to this workaround - they will just get frustrated and stop using Firefox.

IMO it is the type of defect that will turn people away from FF, which is why I rated the severity so high. Moreover I have never had a similar problem using Chrome (I use Chrome / FF equally ) or Safari (used much less).

If needed I can provide additional details to whoever works on the bug.

Please can you try checking your machine for malware, we feel like this might be an effect caused by that:

https://support.mozilla.org/en-US/kb/troubleshoot-firefox-issues-caused-malware

The other option is to try disabling any the internet protection part of any anti-virus you have and see if that helps.

Also, please do not set the severity on bugs as this will cause them to be skipped in our triage processes, which may mean your bug doesn't get seen / looked at for a while.

Severity: S2 → --
Flags: needinfo?(ian.graham)
Status: NEW → UNCONFIRMED
Ever confirmed: false

Thank you for the note about Severity / needinfo request. I reviewed the malware troubleshooting document and installed and ran Malwarebytes. In summary:

  • Firefox is not displaying any of the symptoms noted in "How do I know that my Firefox problem is a result of malware?"
  • My active FF extensions: "Facebook Container; HTTPS Everywhere; RegretsReporter; To Google Translate; and uBlock Origin."
  • My installed FF plugins: OpenH265 Video Codec (Cisco Systems) and Widevine Content Decryption Module (Google).
  • OS X System Integrity Protection is enabled (and has been so since the system was set up)
  • Malwarebytes reported zero problems (no Threats or Pups, no detections ignored or quarantined).
  • System (MacOX 10.13.6) is up to date
  • I am pretty paranoid about system hygiene / safety (but try not to be over confident ;-) )

If / when the problem happens again I will disable my anti-virus (Avast) and see if that resolves the problem. But as I can fix the problem by erasing the browser's site cache, I think this is 'probably' not an AV issue.

Also, I have a few Macs, all running Avast, and when this problem happened I was able to successfully connect and login to the problematic sites from another one and from within the same internal home network (how I verified this was neither a web site nor home internal network problem).

Last - I really appreciate how hard it is to built a test scenario for and then diagnose sporadic problems like this. I will do what I can to help!

Flags: needinfo?(ian.graham)

I'll add a data point to this (personal experience); sites I visit often that sets a lot of cookies (personal banking, in my case), it gets its session cookies all messed up, likely by a server programming mistake of their own. To unbreak the site I indeed needed to purge the cookies for that domain and all was well again.
So this is basically a session cookie eviction problem, where old cookies are never automatically cleaned up after a rational expiration period. Would it be at all sensible to have a maximum session cookie lifetime of 1 year? 2 years, perhaps?

Component: General → Networking: Cookies
Product: Firefox → Core

(In reply to Mike de Boer [:mikedeboer] from comment #3)

I'll add a data point to this (personal experience); sites I visit often that sets a lot of cookies (personal banking, in my case), it gets its session cookies all messed up, likely by a server programming mistake of their own. To unbreak the site I indeed needed to purge the cookies for that domain and all was well again.
So this is basically a session cookie eviction problem, where old cookies are never automatically cleaned up after a rational expiration period. Would it be at all sensible to have a maximum session cookie lifetime of 1 year? 2 years, perhaps?

In my first scenario purging cookies did not solve the problem: that's what pushed me to purge the entire site cache (which did fix it).
Of course that doesn't mean cookies are not part of the problem .....

I think this is difficult to debug until we have enough information, so could you try to get a http log when logging into the newspaper website (I am not comfortable to ask a log from a banking site)?
Since the log may contain some privacy information, you may want to send it to my email (kershaw at mozilla dot com) directly.
I may also need another log that shows everything works fine, for comparing and finding out the problem.

Thanks.

Flags: needinfo?(ian.graham)

(In reply to Kershaw Chang [:kershaw] from comment #5)

I think this is difficult to debug until we have enough information, so could you try to get a http log when logging into the newspaper website (I am not comfortable to ask a log from a banking site)?
Since the log may contain some privacy information, you may want to send it to my email (kershaw at mozilla dot com) directly.
I may also need another log that shows everything works fine, for comparing and finding out the problem.

Thanks.

Yes I can do that, but it may take a while: right now I am not experiencing the problem (cleared the cache) - and this problem has only come up twice in the last year. The next time it happens I will (a) take a log during a failed login / access; clear the cache and then (b) take a log during a successful login.

Flags: needinfo?(ian.graham)
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged]

Finally had authentication to the Globe & Mail fail again so I turned on HTTP logging and went through the process of trying to authenticate. Hopefully there is information in this log file that will help diagnose and fix the problem. If there is something else I can do let me know - I can go for a while without clearing the cache .... but not forever .....

(In reply to Ian Graham from comment #7)

Created attachment 9235624 [details]
log.txt-main.694.moz_log.gz - Log file from unsuccessful authentication to GLobe & Mail

Finally had authentication to the Globe & Mail fail again so I turned on HTTP logging and went through the process of trying to authenticate. Hopefully there is information in this log file that will help diagnose and fix the problem. If there is something else I can do let me know - I can go for a while without clearing the cache .... but not forever .....

Did some fiddling and this time I was able to get authentication to work by clearing cache / site data (Preferences -> Privacy & Security -> Cookies & Site Data -> Manage Data.. I did not have to restart the browser.

Ok, this has now failed again, in the same way. Kershaw - was the previous log file of any help? I can re-do the log file but don't want to do this if it was not helpful.....

Hello? is anyone monitoring this bug? I am again experiencing this and can capture / send an HTTP log to someone, if there is anyone to send it to.... And perhaps It would be helpful if I got advice on how to do the logging. For example I could do it (a) while it is broken and (b) after it is fixed, and potentiall (c) somewhere in between so someone can diagnose the problem properly? If so I would benefit with some guidance / discussion on how to do that, what stages in the interaction, etc.

Sorry, adding more commentary - I am really pissed this bug has not been tackled / fixed, and want to explain why.

In my mind I would expetd defects that (a) fundamentally break a user's browser experience, and for which (b) there is no documented workaround would be top fix. Since, if such defects are not fixed - an impacted user will stop using Firefox.

This report is such a defects. Authentication to a website suddenly breaks for a FF user. There is no simple workaround (advising me, via bugzilla, to delete cache entries associated with the domain name is not a regular user workaround!!!).

Any user who hits this on a paid subscription website will just throw up their hands and stop using Firefox.

And this is not just any old website. The Globe and Mail is the most popular newspaper site in Canada with circa 300,000-400,000 online subscribers.

Does that not mean you might want to fix it?

I've even offered to help, if I can!!!!

But nothing....

How about this as a criteria: is the defect likely to cause a user to stop using Firefox? If the answer is 'yes,' then fix it!

Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-queue]

Hi Ian,

Sorry for the long delay.
I've looked at the log you provided. Looks like some http requests are cancelled by some web extension, such as:

sHttpChannel ⁃ 18e42c000 ⁃ cancelled ⁃ status=n/a ⁃ http-status=n/a ⁃ url=https://sec.theglobeandmail.com/user/static/scripts/krux.js
nsHttpChannel ⁃ 196d0f000 ⁃ cancelled ⁃ status=n/a ⁃ http-status=n/a ⁃ url=https://cdn.permutive.app/6b6b990e-d9d8-4116-a028-76da837d7607-web.js
nsHttpChannel ⁃ 1979ea000 ⁃ cancelled ⁃ status=n/a ⁃ http-status=n/a ⁃ url=https://s.go-mpulse.net/boomerang/JQ423-JN76F-XMLFJ-DAJ7S-H9FA7
nsHttpChannel ⁃ 1985ce000 ⁃ cancelled ⁃ status=n/a ⁃ http-status=n/a ⁃ url=https://confiant-integrations.global.ssl.fastly.net/hljCI0zHoD_vRE1cDUCBspfKb0s/gpt_and_prebid/config.js
nsHttpChannel ⁃ 1988c1000 ⁃ cancelled ⁃ status=n/a ⁃ http-status=n/a ⁃ url=https://s.ntv.io/serve/load.js
nsHttpChannel ⁃ 1980c9000 ⁃ cancelled ⁃ status=n/a ⁃ http-status=n/a ⁃ url=https://www.dianomi.com/smartads.epl?id=5136&url=https%3A%2F%2Fwww.theglobeandmail.com%2Fbusiness%2Farticle-pitching-in-supporting-health-care-projects-in-africa%2F
nsHttpChannel ⁃ 13daeb000 ⁃ cancelled ⁃ status=n/a ⁃ http-status=n/a ⁃ url=https://z.moatads.com/globeandmailheader638913312107/moatheader.js

Not sure if these canceled requests have something to do with this issue, but could you try to reproduce this again without all extensions disabled?

Thanks.

Flags: needinfo?(ian.graham)

Hi Kershaw;

It's possible: the first URL (sec.theglobeandmail.com) is loading a script from the newspaper site, so that could be the source of the issue.

If the bug happens again I will try that. Just to confirm the test would be something like this:

  • I wait until the authentication problem starts to happen again.
  • I the go into add-ons manager and disable all the extensions
  • I try re-accessing/re-authenticating to see if I can now authenticate properly.

Or, should I do the above one extension at a time, to see if I can identify which extension is causing the issue?

And should I try capturing log files for each of these attempts?

Ian

Flags: needinfo?(ian.graham)

Or, should I do the above one extension at a time, to see if I can identify which extension is causing the issue?

And should I try capturing log files for each of these attempts?

Thanks for offering your help.
I think it's best not to spend too much of your time on this issue, so I suggest to:

  1. Try to disable all extension or the most suspicious one (uBlock Origin) and see if you can still reproduce this. No need to capture http log when doing this, because we already know some http request is cancelled by a extension.
  2. Immediately start logging after Firefox starts, but please remember to add "rotate" to limit the size of the log file. When the authentication problem happens, stop logging and send us the log file. Hope doing this can capture any networking issues when the problem occurs.

I'd keep the ni? flag. Thanks.

Flags: needinfo?(ian.graham)

Will do my best. The problem is very intermittent (i.e. hasn't happened in a year or more....) so I will just have to wait for it to comes back.... I hate these types of bugs and appreciate how hard they are to find / resolve.

Flags: needinfo?(ian.graham)

I'll keep the ni flag to indicate this bug is waiting for more information.
Please clear ni once the log is available. Thanks.

Flags: needinfo?(ian.graham)
Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged]

I had a issue that feels very similar to this one:

  • When using our company Redmine instance (redmine.company.tld), I was getting regular "Invalid form authenticity token." errors from Redmine - usually once peer week, today and yesterday multiple times, even after deleting all cookies of the site and logging back in.
  • Logging in to our company OpenNebula instance (cloud.company.tld) hasn't worked at all for me for several months. After entering my name and password and clicking on Log in, I would get logged out immediately after having logged in.

In both cases, the problem was easy to work around by opening the sites in incognito mode, which wasn't affected at all.

Today, the most recent "Invalid form authenticity token." error in Redmine was the last straw that broke the camel's back, and I sat down and started to properly debug the issue.

In Redmine, the following happens:

  1. When opening the site without any cookies, Redmine will create a _redmine_session cookie with session data, even before the actual login. Redmine apparently doesn't store sessions on the server, but on the client as a cookie. This session cookie contains the form authenticity token which is used to protect the site against CSRF attacks. Notably, it's Secure: false.
  2. When entering name, password and checking the "Stay logged in" checkbox and submitting the form, Redmine will create a autologin cookie. Notably, it's Secure: true.
  3. In my normal browsing window, the _redmine_session cookie vanishes as soon as the autologin cookie is created, thereby breaking Redmine. Without its session data, it is no longer able to verify the CSRF token, and will reject any requests that are protected by it, e.g. when attempting to create a new issue.
  4. In my incognito window, both cookies can coexist at the same time, and Redmine works fine.

I examined the network requests in the inspector in close detail, trying to figure out if Redmine itself was somehow deleting its own _redmine_session cookie, but this wasn't the case.

In OpenNebula, the following happens:

  1. When logging in, OpenNebula attempts to set two cookies - one-user and sunstone. Notably, both are Secure: false.
  2. In my normal browsing window, no cookie ends up in Storage -> Cookies. I'm logged out immediately after logging in.
  3. In my incognito window, both cookies end up in Storage -> Cookies, and OpenNebula works fine.

There was also no indication that OpenNebula was deleting its own cookies. At this point, I was convinced that this has to be a Firefox issue - Firefox was somehow losing my cookies?!

I started to search for a way to debug this, and found these promising options in snap run firefox --help:

  --MOZ_LOG=<modules> Treated as MOZ_LOG=<modules> environment variable,
                     overrides it.
  --MOZ_LOG_FILE=<file> Treated as MOZ_LOG_FILE=<file> environment variable,
                     overrides it. If MOZ_LOG_FILE is not specified as an
                     argument or as an environment variable, logging will be
                     written to stdout.

They are horribly documented here:

Debugging — Firefox Source Docs documentation [firefox-source-docs.mozilla.org]
prlog.h - mozsearch [searchfox.org]

Apparently, no one has bothered to actually document which modules I can enable logging for here. Luckily, googling for "moz_log cookie" confirmed that there is a cookie module, so I enabled logging for it:

snap run firefox --MOZ_LOG=cookie:9 --MOZ_LOG_FILE=/tmp/moz.log

(I tried log level 3 before, but that was too low.)

The log is written to /tmp/snap.firefox/tmp/moz.log.moz_log and is only readable as root, because snaps are always run as root for some reason. Go figure.

And here it finally is, the actual problem:

[Parent 30305: Main Thread]: D/cookie ===== COOKIE EVICTED =====
[Parent 30305: Main Thread]: D/cookie Too many cookies for this domain and the new cookie is not a secure cookie
[Parent 30305: Main Thread]: D/cookie current time: Thu Oct 27 13:52:29 2022 GMT
[Parent 30305: Main Thread]: D/cookie ----------------
[Parent 30305: Main Thread]: D/cookie name: sunstone
[Parent 30305: Main Thread]: D/cookie value: xxx
[Parent 30305: Main Thread]: D/cookie host: cloud.company.tld
[Parent 30305: Main Thread]: D/cookie path: /
[Parent 30305: Main Thread]: D/cookie expires: Sat Nov 26 13:52:28 2022 GMT
[Parent 30305: Main Thread]: D/cookie created: Thu Oct 27 13:52:29 2022 GMT
[Parent 30305: Main Thread]: D/cookie is secure: false
[Parent 30305: Main Thread]: D/cookie is httpOnly: true
[Parent 30305: Main Thread]: D/cookie sameSite: lax - rawSameSite: none
[Parent 30305: Main Thread]: D/cookie schemeMap 2 (http: false | https: true | file: false)
[Parent 30305: Main Thread]: D/cookie origin attributes: {empty}

So apparently, one of the many subdomains of company.tld has stored too many cookies on my computer, and this has caused Firefox to evict cookies for the domain - and it especially chooses insecure cookies first, which explains why specifically the session cookie of Redmine and both cookies of OpenNebula were affected. I feel like this should have warranted a bigger warning :(

I've also figured out where the "too many cookies" came from, which is surprisingly also much harder than I expected. Settings -> Privacy & Security -> Manage Data -> company.tld reports that there are 185 cookies for that domain, and apparently 180 cookies per domain is the limit. Unfortunately, you cannot actually list the cookies for the whole domain (I think this was possible in the past, but maybe I'm misremembering from back when I still used Chrome...?), and therefore cannot tell to which subdomain they belong. I opened https://company.tld in a new tab and looked in the inspector in Storage -> Cookies, but there is only a single cookie for the whole domain, so the cookies have to be part of a subdomain, but at this point I don't know which one(s).

I figured I could work around this problem by directly reading cookies.sqlite3, but they aren't there either:

sqlite> select host, count(*) from moz_cookies where host like '%company.tld' group by host;
.company.tld|1
somehost.company.tld|4
redmine.company.tld|1

https://addons.mozilla.org/de/firefox/addon/cookie-quick-manager/ finally gave the answer - the cookies are session cookies of BigBlueButton servers that we host on subdomains of company.tld. At some point, we had thousands of those (we are hosting a videoconference system for up to 3 million of users), and during the last few years, I of course used hundreds of these servers. Each server created a JSESSIONID session cookie, and I have at least a hundred of these servers listed in the cookie manager. These session cookies aren't stored in cookies.sqlite3, but in sessionstore.jsonlz4 (only when Firefox is closed) as part of the session restore feature.

Martin - this is very interesting and hopefully helps diagnose this: I have been waiting for months to see my own reported problem return (and log the results), but it has not failed for me since my report of a year ago.

Here's a thought: given your description would a 'force quit' of Firefox after a login be likely to mess up cookies that need to be stored for session management - but that are not stored unless you do a proper 'non-forced' browser shutdown? If that is possible I could try doing that to reproduce my own login problem....

Flags: needinfo?(ian.graham)

Here's a thought: given your description would a 'force quit' of Firefox after a login be likely to mess up cookies that need to be stored for session management - but that are not stored unless you do a proper 'non-forced' browser shutdown?

No, I don't think so. If a force quit could cause Firefox to lose the session data, it might even resolve the issue (e.g. in my case, if Firefox had lost the session data at some point in the last two years, all those session cookies of our BigBlueButton servers would have gone away).

Use Settings -> Privacy & Security -> Cookies and Site Data -> Manage Data and the Cookie Quick Manager extension I've linked to figure out if you're affected by the same problem as me. If there is a site some.random.domain.tld where the login doesn't work, search for domain.tld in both Manage Data and Cookie Quick Manager -> Manage all cookies; Manage Data will tell you the total amount of cookies for the domain, and Cookie Quick Manager will show you which subdomain(s) have stored the cookies. If the total amount is ~180 cookies (180 is the per-domain limit), you are affected.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: