Closed Bug 1484149 Opened 6 years ago Closed 6 years ago

Cache racing breaks NTLM authentication - Load / NTLM Auth / cache issue in Firefox and Sharepoint on premises

Categories

(Core :: Networking: HTTP, defect, P1)

60 Branch
defect

Tracking

RESOLVED DUPLICATE of bug 1477684
Tracking Status
geckoview62 --- wontfix
firefox-esr60 62+ fixed
firefox62 --- wontfix
firefox63 --- fixed
firefox64 --- fixed

People

(Reporter: alberto.suarez.caballero, Assigned: michal)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [necko-triaged][ntlm][http-conn])

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
Build ID: 20180621064021

Steps to reproduce:

I am experiencing a very weird issue when trying to access SharePoint on premises from the new Firefox version (Quantum, v60.0 and later), with the browser running on a Windows computer.

SCENARIO:
- Firefox is correctly configured to use NTLM for auth (in about:config).
- SharePoint 2013 on premises. Firefox ESR 60.
- SharePoint loads just fine in IE and Chrome; Firefox, however, fails to fully load the site, apparently at random. After a page refresh, Firefox manages to load the site just fine for some time (I guess this time depends on the NTLM auth cookie expiration).
- Other Firefox users experience the same issue.
- The previous Firefox ESR version, v52.0, doesn't show this issue or have any problem loading SharePoint sites.
- No errors are shown in the Firefox console (JavaScript, XHR, etc.).



Actual results:

The problem is that Firefox (v60.0 and later) fails to successfully load the page, or sometimes even to start loading it. However, if the Firefox cache is disabled, this issue doesn't happen at all.

The "Network" tab of the Firefox developer tools shows that the load process stops at some point (you can see in the source code that some files/code have been downloaded, but not all of them), and it only completes if you refresh the page or disable the cache.


Expected results:

The site should have loaded just fine, whether the cache is enabled or disabled. Firefox v52 loads the SharePoint site just fine.
Hi,

I have the same issue, in SharePoint 2013 and now in a newly installed SharePoint 2016.

Unsatisfying workarounds we currently use:

1. Reload the Page
2. Open SharePoint in Private-Mode (no Cache)
Correct, that is exactly the same behaviour I have noticed. If you disable the cache, then Sharepoint pages load without any issues. I hope some FF/SP expert can help us. Thanks
Hi,

We have the same issue with SharePoint 2010.

Please Help.

Kind Regards
Hi,

We have the same issue with SharePoint 2016 On-Premise and Firefox 60.0.2 :-(.
All users are affected.

A page reload is needed very often because the page does not load completely.
Hi,
see also this Mozilla Support Forum entry: https://support.mozilla.org/de/questions/1213246
Since the creator of the post used Firefox 59, it shouldn't be related to Quantum.

We experience the same behavior as described in Firefox 61 and 62 across several users in our company.
We do use SharePoint 2007 (NTLM) and 2013 (Negotiate / Kerberos), but the problem is not limited to it. It also happens on our on-premise Team Foundation Server with NTLM and on custom ASP.NET applications hosted by our team on IIS.
Hi,

For us too, the problem is not limited to SharePoint; other sites using NTLM are affected (served by IIS and Apache).
(In reply to Sebastian Segerer from comment #5)
> Hi,
> see also this Mozilla Support Forum entry:
> https://support.mozilla.org/de/questions/1213246
> Since the creator of the post used Firefox 59, it shouldn't be related to
> Quantum.
> 
> We experience the same behavior as described in Firefox 61 and 62 across
> several users in our company.
> We do use SharePoint 2007 (NTLM) and 2013 (Negotiate / Kerberos), but the
> problem is not limited to it. It also happens on our on-premise Team
> Foundation Server with NTLM and on custom ASP.NET applications hosted by our
> team on IIS.

Sebastian, please read the initial description of the issue. I clearly stated that I noticed this issue when working with Firefox 60 (Quantum version). When I mention Firefox v52, what I am saying is that that version doesn't show the SharePoint NTLM load issue.
@Alberto Suarez
I was referring to the creator of the linked support.mozilla.org post, in which Firefox 59 was used.
(In reply to Sebastian Segerer from comment #8)
> @Alberto Suarez
> I was referring to the creator of the linked support.mozilla.org post, in
> which Firefox 59 was used.

My apologies, Sebastian. I have taken a look at the post you shared. It seems that many users are experiencing the same issue. Thanks.
Based on many users experiencing this issue I am placing this under Core:Networking, so someone from the team can look into this issue. Thanks!
Component: Untriaged → Networking
Product: Firefox → Core
Thanks for putting this to the right component and thanks for the report.

Alberto, Sebastian, may I kindly ask you to produce HTTP logs according to [1]?

Please set the list of logging modules (MOZ_LOG) as:
timestamp,rotate:400,nsHttp:5,cache2:5,negotiateauth:5,NTLM:5

As the logs may contain sensitive information, it will be best to send them to my bugzilla email directly.

Thank you!


[1] https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
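
For anyone scripting the log capture requested above, here is a minimal sketch in Python. It only builds the environment for a Firefox process using the MOZ_LOG module list from this comment and the MOZ_LOG_FILE variable described in [1]. The helper name `logging_env` and the paths in the usage comment are assumptions for illustration, not part of any Mozilla tooling.

```python
import os

# Module list copied verbatim from the comment above.
MOZ_LOG_MODULES = "timestamp,rotate:400,nsHttp:5,cache2:5,negotiateauth:5,NTLM:5"


def logging_env(log_file, base_env=None):
    """Return a copy of the environment with Firefox HTTP logging enabled.

    `logging_env` is a hypothetical helper; only the MOZ_LOG and
    MOZ_LOG_FILE variable names come from the logging documentation.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MOZ_LOG"] = MOZ_LOG_MODULES
    env["MOZ_LOG_FILE"] = log_file
    return env


# Usage (both paths are assumptions; adjust them for your machine):
# import subprocess
# subprocess.run([r"C:\Program Files\Mozilla Firefox\firefox.exe"],
#                env=logging_env(r"C:\temp\firefox-http.log"))
```

As noted above, the resulting log may contain sensitive information, so review it before sharing.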
Assignee: nobody → honzab.moz
Status: UNCONFIRMED → NEW
Component: Networking → Networking: HTTP
Ever confirmed: true
Flags: needinfo?(sebbl)
Flags: needinfo?(alberto.suarez.caballero)
Priority: -- → P2
Whiteboard: [necko-triaged][ntlm]
So, the problem is that when we race cache with network responses, NTLM authentication can be broken.

What I can see in a privately provided log is that a _first request on a new connection_ is being raced with cache.  We send a GET request (no authentication) and also open an entry that can be used w/o revalidation.  The cache wins even before we get the first 401: NTLM response.  The channel is finished, the 401 response is thrown away, and the resource is satisfied from the cache.

The expected processing chain is to authenticate the connection with three round trips answered 401, 401, 200.  The authentication state is kept in the requesting channel, so when that channel is finished, the next request on the connection has to start over with a plain GET w/o any auth headers, restarting the NTLM auth process from scratch.  If a previous raced request has already sent an NTLM type 1 message, this likely confuses the server, as it expects an NTLM type 3 message in the next request.  That is likely the cause of loads the server never satisfies, but I have to look carefully into the log again later.


Possible fixes for the problem found so far:

- in case of Basic or Digest auth we keep information on a cache entry that it was served with authentication, which makes us revalidate it ; I believe this also disallows cache racing (Michal?), but I can see we are sending conditional headers in requests...  OTOH, this may be an incomplete solution when NTLM is established later for the resource (we have an entry w/o "auth" marking)

- let the channel finish the authentication despite a cache win ; this is probably the "easiest" and most clear way to fix this bug
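
The hypothesis above (a handshake abandoned after a cache win leaves the server waiting for a type 3 message) can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not Necko or real NTLM code: `ServerConnection`, `run_channel`, the `cache_wins_after` parameter and the "confused" reply are all hypothetical names, and the server behaviour is only what the comment guesses at.

```python
class ServerConnection:
    """Toy server-side view of one connection (not real NTLM or Necko code)."""

    def __init__(self):
        self.awaiting_type3 = False
        self.authenticated = False

    def request(self, auth=None):
        # auth is None for a plain GET, or "type1" / "type3".
        if self.awaiting_type3:
            self.awaiting_type3 = False
            if auth == "type3":
                self.authenticated = True
                return 200
            return "confused"           # hypothetical outcome of the guess above
        if self.authenticated:
            return 200
        if auth == "type1":
            self.awaiting_type3 = True  # server answered with its challenge
        return 401


def run_channel(conn, cache_wins_after=None):
    """Send GET, type 1, type 3 in order; stop early if the cache wins."""
    replies = []
    for sent, auth in enumerate([None, "type1", "type3"]):
        if cache_wins_after is not None and sent == cache_wins_after:
            break                       # channel finished from cache, auth state lost
        replies.append(conn.request(auth))
    return replies


# Undisturbed handshake on a fresh connection.
normal = run_channel(ServerConnection())

# Cache wins after type 1 was sent; the next channel reuses the connection
# and starts over with a plain GET.
shared = ServerConnection()
abandoned = run_channel(shared, cache_wins_after=2)
restarted = run_channel(shared)
```

In this model the second fix candidate corresponds to never passing `cache_wins_after`, i.e. always letting the channel run the handshake to its end.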
Flags: needinfo?(sebbl)
Flags: needinfo?(michal.novotny)
Flags: needinfo?(alberto.suarez.caballero)
Summary: Load / NTLM Auth / cache issue in Firefox and Sharepoint on premises → Cache racing breaks NTLM authentication - Load / NTLM Auth / cache issue in Firefox and Sharepoint on premises
Whiteboard: [necko-triaged][ntlm] → [necko-triaged][ntlm][http-conn]
Hmm.. I inspected the provided log in more detail and I see a different problem (actually a second one).  Michal, I will send you the log with a reference to the affected channel to inspect.  It's also definitely related to racing.
The second problem, that actually manifests as reported - responses are hanging - is the following:
- we do a request
- find an entry that needs to be re-validated with the server
- we do a request
- we do it on a new connection
- we get a 401:NTLM response
=> the cache racing algorithm takes it as if the first response came from the network, but that is a totally wrong assumption
- we do the full NTLM authentication round: another GET, 401, GET 304
- now the server has confirmed that the cached entry can be used
- we call ReadFromCache, but it's skipped because cache racing believes we have already provided the response from network

This leads to total omission of calling OnStartRequest/OnDataAvailable/OnStopRequest of the final listener (HttpChannelParent) and thus the child request is hanging forever.

If this happens for a top level page, a user just stares at a blank page and a spinning throbber.
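
The sequence above can be replayed with a small Python sketch. It is a toy model only, assuming a single boolean that records whether the network "won" the race; `deliver` and `count_401_as_network_win` are hypothetical names and do not correspond to actual Necko members.

```python
LISTENER_CALLS = ["OnStartRequest", "OnDataAvailable", "OnStopRequest"]


def deliver(statuses, count_401_as_network_win):
    """Replay network statuses for a cache entry that needs revalidation.

    Returns the listener callbacks that would fire in this toy model.
    """
    network_won = False
    calls = []
    for status in statuses:
        if status == 401:
            if count_401_as_network_win:
                network_won = True          # the wrong assumption described above
        elif status == 304:
            if not network_won:             # models the skipped ReadFromCache
                calls.extend(LISTENER_CALLS)
        elif status == 200:
            network_won = True
            calls.extend(LISTENER_CALLS)
    return calls


# GET -> 401, GET -> 401, GET -> 304, as in the list above.
hanging = deliver([401, 401, 304], count_401_as_network_win=True)
working = deliver([401, 401, 304], count_401_as_network_win=False)
```

With the 401 counted as a network win, no callback fires at all, which is the hang seen by the child request; without it, the cached entry is delivered after the 304.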

P1 as this is a serious bug for corporate users.  I'll file bugs to disable cache racing on ESR branches.
Priority: P2 → P1
Blocks: RCWN
Depends on: 1494405
For all affected users, the actual fix/good workaround is to switch 'network.http.rcwn.enabled' to |false| in about:config.
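
For admins who want to roll the workaround out without touching about:config on every machine, one option is a `user.js` file in the profile directory. The Python sketch below appends the pref line once; `disable_rcwn` is a hypothetical helper name and the profile path has to be supplied by you.

```python
from pathlib import Path

# Pref name taken from the comment above; user_pref() is the user.js syntax.
PREF_LINE = 'user_pref("network.http.rcwn.enabled", false);'


def disable_rcwn(profile_dir):
    """Append the pref to user.js in the given profile directory, only once."""
    user_js = Path(profile_dir) / "user.js"
    existing = user_js.read_text(encoding="utf-8") if user_js.exists() else ""
    if PREF_LINE not in existing:
        # Keep existing content and make sure the new pref starts on its own line.
        sep = "" if existing == "" or existing.endswith("\n") else "\n"
        user_js.write_text(existing + sep + PREF_LINE + "\n", encoding="utf-8")
    return user_js
```

Firefox reads `user.js` at startup, so the pref takes effect after a restart. Remove the line again once a fixed version is installed.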
Thanks Honza, I'm testing the workaround on a few machines and so far so good.
status-geckoview62=wontfix because NTLM is not a critical use case for Focus+GeckoView. We don't need to uplift a fix for GeckoView 62 in Focus 7.0.
(In reply to Honza Bambas (:mayhemer) from comment #12)
> - in case of Basic or Digest auth we keep an information on a cache entry
> that it was served with authentication what makes us revalidate it ; I
> believe this also disallows cache racing (Michal?)

We don't use this information when we're deciding whether to race or not because we don't know it. It's not stored in the index and we don't have the entry.
Flags: needinfo?(michal.novotny)
(In reply to Honza Bambas (:mayhemer) from comment #14)
> The second problem, that actually manifests as reported - responses are
> hanging - is the following:
> - we do a request
> - find an entry that needs to be re-validated with the server
> - we do a request
> - we do it on a new connection
> - we get a 401:NTLM response
> => the cache racing algorithm takes it as if that the first response came
> from the network, but that is a totally wrong assumption
> - we do the full NTLM authentication round: another GET, 401, GET 304
> - now the server has confirmed that the cached entry can be used

To not end up with 304 response while we don't have the entry, we remove conditional headers before we send the request. This landed in bug 1382831. I need to understand more how NTLM works to understand why the problem persists for NTLM.
So the problem is that we remove the conditional headers only in the first request but not in subsequent requests. It seems we need to change the condition a bit at https://searchfox.org/mozilla-central/rev/ce57be88b8aa2ad03ace1b9684cd6c361be5109f/netwerk/protocol/http/nsHttpChannel.cpp#1186.
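
To make the idea concrete, here is a minimal Python sketch of stripping conditional headers from every request of an authentication round instead of only the first one. It does not reproduce the condition in nsHttpChannel.cpp; `strip_conditional` and the header dictionary are assumptions for illustration, and the header names are the standard HTTP conditional request headers.

```python
# Standard HTTP conditional request headers, lower-cased for comparison.
CONDITIONAL_HEADERS = {
    "if-modified-since",
    "if-none-match",
    "if-unmodified-since",
    "if-match",
    "if-range",
}


def strip_conditional(headers):
    """Return a copy of `headers` without any conditional request headers."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in CONDITIONAL_HEADERS
    }


# Hypothetical request headers built while the cache entry is not available.
request_headers = {"Accept": "*/*", "If-None-Match": '"abc"'}

# Apply the stripping to all three requests of the 401, 401, final-response
# round, not just to the first one.
attempts = [strip_conditional(request_headers) for _ in range(3)]
```

Without conditional headers on the third request, the server cannot answer 304 for an entry the channel does not have.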
RCWN has been disabled for this week's forthcoming ESR 60.2.2 release, which should resolve this issue for those users.
After studying the code, it seems that this is a dupe of bug 1477684 which was fixed in version 62 and was uplifted to ESR 60.2.

Alberto, what version of ESR did you use when you were able to reproduce the bug? ESR 60 or ESR 60.2?
Flags: needinfo?(alberto.suarez.caballero)
From an end-user perspective, I can confirm that I did not encounter this problem for some days / maybe weeks; probably since the FF 62 update mid September.
I'm sorry I did not report this earlier, but I wasn't sure if I was just "lucky" to not have this issue for some days.
I also just checked with our team and no one has seen this behaviour anymore.
In our case, I provided a log from 60.0 to Honza.

I've updated Firefox to 60.2, so I guess I should switch "network.http.rcwn.enabled" back to true and try.
(In reply to Arnaud Meurou from comment #25)
> I've updated firefox to 60.2, i guess i should switch back to true
> "network.http.rcwn.enabled" and try.

Yes, please enable rcwn again and let me know whether ESR 60.2 works correctly. Thanks.
Great news!  Thanks.  When confirmed, we can back bug 1494405 out from ESR.
Assignee: honzab.moz → michal.novotny
(In reply to Sebastian Segerer from comment #24)
> From an end-user perspective, I can confirm that I did not encounter this
> problem for some days / maybe weeks; probably since the FF 62 update mid
> September.
> I'm sorry I did not report this earlier, but I wasn't sure if I was just
> "lucky" to not have this issue for some days.
> I also just checked with our team and no one hat this behaviour anymore.

Yes, I think, as Sebastian has observed, that I have not run into this issue since the last Firefox update. I am currently using Firefox ESR 60.2.
Flags: needinfo?(alberto.suarez.caballero)
(In reply to Michal Novotny (:michal) from comment #23)
> After studying the code, it seems that this is a dupe of bug 1477684 which
> was fixed in version 62 and was uplifted to ESR 60.2.
> 
> Alberto, what version of ESR did you use when you were able to reproduce the
> bug? ESR 60 or ESR 60.2?

ESR 60
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
OK, it works for me too with 60.2!

thanks guys