Closed Bug 1797729 Opened 3 years ago Closed 3 years ago

Initial mozilla safebrowsing/tracking download almost always fails on local artifact build

Categories

(Core :: Networking: HTTP, defect, P1)

Desktop
macOS
defect

Tracking

()

RESOLVED FIXED
109 Branch
Tracking Status
firefox109 --- fixed

People

(Reporter: Gijs, Assigned: kershaw)

References

Details

(Whiteboard: [necko-triaged])

Attachments

(4 files)

I was having issues testing out styling changes to the tracking protection panel because in new profiles, there was no effective TP at all.

Digging into it, what seems to happen is that on an artifact build and a clean profile, only the mozilla provider is being listed on about:url-classifier, and attempting to force an update hits the backoff timer ("cannot update").

From logging, it appears that the initial request errors out without a status argument provided to https://searchfox.org/mozilla-central/rev/2d1097c27f31cb15f7588b043d691cda9710d14b/toolkit/components/url-classifier/UrlClassifierListManager.jsm#755-759 . The url that fails is https://shavar.services.mozilla.com/downloads?client=Firefox&appver=108.0a&pver=2.2 but no request is logged at all in the network part of the browser toolbox. I have the impression the request times out entirely or fails to connect, but I'm not sure how to get more details for it.

It doesn't appear possible to force an update despite the backoff time from the UI, but if I set a breakpoint in the debugger here and overwrite the nextRequestTime_ when clicking "Force update" in about:url-classifier the request succeeds fine and TP starts working.

The same issue is more intermittent when using mozregression downloaded builds, which makes it difficult/impossible to determine if this is some kind of regression or specific to artifact builds. It smells like a race condition to me but I'm unsure how to dig into it more.

Thanks for the detailed report! I couldn't reproduce this locally with an artifact build and mach run --temp-profile yet. The ETP list seems to get populated within a few seconds of browser startup.

Dimi, do you have time to take a look at this issue when you're back? I'm worried that it might be more widespread and actually affect users in the wild. Happy to take a look together too, feel free to reach out on Slack or Matrix.

Flags: needinfo?(dlee)
Attached file log.zip

I'm able to reproduce this issue on my platform. It looks like when SafeBrowsing tries to fetch update from the server, the OnStartRequest is never called, hence SafeBrowsing returns a timeout error.

Somehow this issue only happens when SafeBrowsing triggers an update during startup, if I trigger an update manually after startup, everything works fine.

Hi Kershaw,
I collected a debug log with http logging, could someone in your team help take a look at the log to see if you have any idea about why this issue is happening? Thanks. Please also let me know if you need more information from me.

Some information about the attached log

  1. The first two SafeBrowsing requests succeed (line 31721, line 125277), the third one fails (line 161661).
  2. I change the SafeBrowsing timeout to 10sec. The timeout callback is called in line 169290 (the second file)
Flags: needinfo?(dlee) → needinfo?(kershaw)
See Also: → 1775718

Thanks for the log, Dimi.
This is actually a necko bug. I'll try to fix this.

Basically, what happens in necko is that we got stuck in an infinite loop of dispatching a transaction.
The first condition to trigger this is having a http3 connection (tracking-protection.cdn.mozilla.net) available.

2022-11-07 11:27:04.586894 UTC - [Parent 8909: Socket Thread]: V/nsHttp nsHttpConnectionMgr::DispatchAbstractTransaction [ci=.S........[tlsflags0x00000000]tracking-protection.cdn.mozilla.net:443 <ROUTE-via tracking-protection.cdn.mozilla.net:443> {NPN-TOKEN h3}^firstPartyDomain=safebrowsing.86868755-6b82-4842-b301-72671a0db32e.mozilla trans=1489f2320 caps=400 conn=139d6fd20]
2022-11-07 11:27:04.586897 UTC - [Parent 8909: Socket Thread]: E/nsHttp HttpConnectionUDP::Activate [this=139d6fd20 trans=1489f2320 caps=400]

Next, another channel to the same host (tracking-protection.cdn.mozilla.net) is created, but with http3 disabled.

2022-11-07 11:27:04.585965 UTC - [Parent 8909: Main Thread]: D/nsHttp Creating nsHttpChannel [this=14a20be00]
2022-11-07 11:27:04.585969 UTC - [Parent 8909: Main Thread]: E/nsHttp HttpBaseChannel::Init [this=14a20be00]
2022-11-07 11:27:04.585975 UTC - [Parent 8909: Main Thread]: E/nsHttp uri=https://tracking-protection.cdn.mozilla.net/social-track-digest256/1604686195

Then, we successfully created a http2 connection to the server. But this connection can not be used, since we try to coalesce this http2 connection to the http3 one.

2022-11-07 11:27:04.652369 UTC - [Parent 8909: Socket Thread]: V/nsHttp FindCoalescableConnection .S........[tlsflags0x00000000]tracking-protection.cdn.mozilla.net:443^firstPartyDomain=safebrowsing.86868755-6b82-4842-b301-72671a0db32e.mozilla
2022-11-07 11:27:04.652380 UTC - [Parent 8909: Socket Thread]: V/nsHttp FindCoalescableConnectionByHashKey() found match conn=139d6fd20 key=34.120.158.37~.:~.:443/[^firstPartyDomain=safebrowsing.86868755-6b82-4842-b301-72671a0db32e.mozilla]viaDNS newCI=.S........[tlsflags0x00000000]tracking-protection.cdn.mozilla.net:443^firstPartyDomain=safebrowsing.86868755-6b82-4842-b301-72671a0db32e.mozilla matchedCI=.S........[tlsflags0x00000000]tracking-protection.cdn.mozilla.net:443 <ROUTE-via tracking-protection.cdn.mozilla.net:443> {NPN-TOKEN h3}^firstPartyDomain=safebrowsing.86868755-6b82-4842-b301-72671a0db32e.mozilla join ok
2022-11-07 11:27:04.652384 UTC - [Parent 8909: Socket Thread]: V/nsHttp FindCoalescableConnection(.S........[tlsflags0x00000000]tracking-protection.cdn.mozilla.net:443^firstPartyDomain=safebrowsing.86868755-6b82-4842-b301-72671a0db32e.mozilla) match conn 139d6fd20 on dns key 34.120.158.37~.:~.:443/[^firstPartyDomain=safebrowsing.86868755-6b82-4842-b301-72671a0db32e.mozilla]viaDNS
2022-11-07 11:27:04.652388 UTC - [Parent 8909: Socket Thread]: V/nsHttp UpdateCoalescingForNewConn() found existing active conn that could have served newConn graceful close of newConn=14cb82800 to migrate to existingConn 139d6fd20

We dispatch the same transaction again. Since http3 is disabled, we end up with a look of creating a http2 connection, closing the connection, and dispatching the transaction again.

Assignee: nobody → kershaw
Severity: -- → S3
Component: Privacy: Anti-Tracking → Networking: HTTP
Flags: needinfo?(kershaw)
Priority: -- → P2
Whiteboard: [necko-triaged]

More context about what triggers this bug.
This bug is triggered by a server side change. To be more specifically, this is triggered by a server that advertises alt-svc header.

 alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000

When necko sees this header, we create a HTTP/3 connection to the server to verify this information. However, the request to load tracking protection list has beConservative flag set, which means HTTP/3 is disabled for this request. In the end, this old necko bug is triggered.

We should also ask the server to stop advertising alt-svc header, since this bug also affects multiple Firefox release versions (also ESR).
Dimi, do you know who to talk at the server side?

Severity: S3 → S2
Flags: needinfo?(dlee)
Priority: P2 → P1
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-queue]

Sven,

Is there any recent change on the server side that make the server starts to send alt-svc header?

Flags: needinfo?(dlee) → needinfo?(sven)
See Also: → 1796117

Thanks Dimi and Kershaw for tracking this down!

(In reply to Tim Huang[:timhuang] from comment #6)

Is there any recent change on the server side that make the server starts to send alt-svc header?

There were no changes to the Shavar servers, i.e. the servers backing the domain name shavar.services.mozilla.com. However, as mentioned above, there were changes to the CDN we use to distribute the actual tracking protection lists, i.e. to the CDN that tracking-protection.cdn.mozilla.net points to. We used to use Amazon CloudFront before, and now we are using Google Cloud CDN. The header is added by Google. I'll look into whether we can get rid of it.

Flags: needinfo?(sven)

Add leave-open, since I am working on a test case as a separate patch.

Keywords: leave-open

I've disabled HTTP3 on the production CDN https://tracking-protection.cdn.mozilla.net . I've left it on the stage CDN https://cdn.stage.shavar.nonprod.cloudops.mozgcp.net in case folks need to use that for testing fixes

Pushed by kjang@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2e39366a830c Avoid coalescing a Http/2 connection with a Http/3 connection when a transaction doesn't allow Http/3, r=necko-reviewers,valentin
Keywords: leave-open

(In reply to Atila Butkovits from comment #15)

Backed out for causing failures complaining about security/sandbox/test/browser_bug1717599_XDG-CONFIG-DIRS.ini.

Backout link: https://hg.mozilla.org/integration/autoland/rev/1a5ebe6b0dac211cf4f8096c5409de5255fcd737

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=f16db56098b0ec268a7dd0cf551f2abebba42135

failure log: https://treeherder.mozilla.org/logviewer?job_id=396266394&repo=autoland&lineNumber=2333

Not sure why adding a xpcshell test can cause this failure...
Anyway, I'll try to rename my test and land again.

Flags: needinfo?(kershaw)
Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged]

I'm seeing a crash on my M1 MacOS 13 running a debug build with this change 2e39366a830c.

thread #10, name = 'Socket Thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000002bcb8b608 XUL`mozilla::net::Http2Session::Caps(this=0x00000001770f4e00) at Http2Session.cpp:4319:3
    frame #1: 0x00000002bcda9adc XUL`mozilla::net::nsHttpConnection::StartSpdy(this=0x0000000173785000, sslControl=0x0000000173750400, spdyVersion=HTTP_2) at nsHttpConnection.cpp:299:38
    frame #2: 0x00000002bcdb5914 XUL`mozilla::net::nsHttpConnection::HandshakeDoneInternal(this=0x0000000173785000) at nsHttpConnection.cpp:2370:7
    frame #3: 0x00000002bcd57d78 XUL`mozilla::net::TlsHandshaker::HandshakeDone(this=0x0000000163957c98)::$_22::operator()() const at TlsHandshaker.cpp:45:27
    frame #4: 0x00000002bcd57c7c XUL`mozilla::detail::RunnableFunction<mozilla::net::TlsHandshaker::HandshakeDone()::$_22>::Run(this=0x0000000163957c70) at nsThreadUtils.h:531:5
    frame #5: 0x00000002bbf89e2c XUL`nsThread::ProcessNextEvent(this=0x0000000113c8b580, aMayWait=true, aResult=0x000000016d412527) at nsThread.cpp:1198:16
    frame #6: 0x00000002bbf91c20 XUL`NS_ProcessNextEvent(aThread=0x0000000113c8b580, aMayWait=true) at nsThreadUtils.cpp:465:10
    frame #7: 0x00000002bc498100 XUL`mozilla::net::nsSocketTransportService::Run(this=0x0000000103ba9900) at nsSocketTransportService2.cpp:1235:11
    frame #8: 0x00000002bbf89e2c XUL`nsThread::ProcessNextEvent(this=0x0000000113c8b580, aMayWait=false, aResult=0x000000016d412ae7) at nsThread.cpp:1198:16
    frame #9: 0x00000002bbf91c20 XUL`NS_ProcessNextEvent(aThread=0x0000000113c8b580, aMayWait=false) at nsThreadUtils.cpp:465:10
    frame #10: 0x00000002bd423e6c XUL`mozilla::ipc::MessagePumpForNonMainThreads::Run(this=0x0000000113c91400, aDelegate=0x000000016d412dc8) at MessagePump.cpp:300:20
    frame #11: 0x00000002bd28d434 XUL`MessageLoop::RunInternal(this=0x000000016d412dc8) at message_loop.cc:381:10
    frame #12: 0x00000002bd28d378 XUL`MessageLoop::RunHandler(this=0x000000016d412dc8) at message_loop.cc:374:3
    frame #13: 0x00000002bd28d2f4 XUL`MessageLoop::Run(this=0x000000016d412dc8) at message_loop.cc:356:3
    frame #14: 0x00000002bbf8469c XUL`nsThread::ThreadFunc(aArg=0x00000001138f5c40) at nsThread.cpp:383:10
    frame #15: 0x00000001042c1e04 libnss3.dylib`_pt_root(arg=0x00000001032cd8b0) at ptthread.c:201:5
    frame #16: 0x0000000196b4e06c libsystem_pthread.dylib`_pthread_start + 148
Flags: needinfo?(kershaw)

(In reply to Dan Baker[:dbaker] from comment #17)

I'm seeing a crash on my M1 MacOS 13 running a debug build with this change 2e39366a830c.

thread #10, name = 'Socket Thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000002bcb8b608 XUL`mozilla::net::Http2Session::Caps(this=0x00000001770f4e00) at Http2Session.cpp:4319:3
    frame #1: 0x00000002bcda9adc XUL`mozilla::net::nsHttpConnection::StartSpdy(this=0x0000000173785000, sslControl=0x0000000173750400, spdyVersion=HTTP_2) at nsHttpConnection.cpp:299:38
    frame #2: 0x00000002bcdb5914 XUL`mozilla::net::nsHttpConnection::HandshakeDoneInternal(this=0x0000000173785000) at nsHttpConnection.cpp:2370:7
    frame #3: 0x00000002bcd57d78 XUL`mozilla::net::TlsHandshaker::HandshakeDone(this=0x0000000163957c98)::$_22::operator()() const at TlsHandshaker.cpp:45:27
    frame #4: 0x00000002bcd57c7c XUL`mozilla::detail::RunnableFunction<mozilla::net::TlsHandshaker::HandshakeDone()::$_22>::Run(this=0x0000000163957c70) at nsThreadUtils.h:531:5
    frame #5: 0x00000002bbf89e2c XUL`nsThread::ProcessNextEvent(this=0x0000000113c8b580, aMayWait=true, aResult=0x000000016d412527) at nsThread.cpp:1198:16
    frame #6: 0x00000002bbf91c20 XUL`NS_ProcessNextEvent(aThread=0x0000000113c8b580, aMayWait=true) at nsThreadUtils.cpp:465:10
    frame #7: 0x00000002bc498100 XUL`mozilla::net::nsSocketTransportService::Run(this=0x0000000103ba9900) at nsSocketTransportService2.cpp:1235:11
    frame #8: 0x00000002bbf89e2c XUL`nsThread::ProcessNextEvent(this=0x0000000113c8b580, aMayWait=false, aResult=0x000000016d412ae7) at nsThread.cpp:1198:16
    frame #9: 0x00000002bbf91c20 XUL`NS_ProcessNextEvent(aThread=0x0000000113c8b580, aMayWait=false) at nsThreadUtils.cpp:465:10
    frame #10: 0x00000002bd423e6c XUL`mozilla::ipc::MessagePumpForNonMainThreads::Run(this=0x0000000113c91400, aDelegate=0x000000016d412dc8) at MessagePump.cpp:300:20
    frame #11: 0x00000002bd28d434 XUL`MessageLoop::RunInternal(this=0x000000016d412dc8) at message_loop.cc:381:10
    frame #12: 0x00000002bd28d378 XUL`MessageLoop::RunHandler(this=0x000000016d412dc8) at message_loop.cc:374:3
    frame #13: 0x00000002bd28d2f4 XUL`MessageLoop::Run(this=0x000000016d412dc8) at message_loop.cc:356:3
    frame #14: 0x00000002bbf8469c XUL`nsThread::ThreadFunc(aArg=0x00000001138f5c40) at nsThread.cpp:383:10
    frame #15: 0x00000001042c1e04 libnss3.dylib`_pt_root(arg=0x00000001032cd8b0) at ptthread.c:201:5
    frame #16: 0x0000000196b4e06c libsystem_pthread.dylib`_pthread_start + 148

Thanks for letting me know.
I'll submit a patch in bug 1800690 to fix this.

Flags: needinfo?(kershaw)

(In reply to Kershaw Chang [:kershaw] from comment #18)

(In reply to Dan Baker[:dbaker] from comment #17)

I'm seeing a crash on my M1 MacOS 13 running a debug build with this change 2e39366a830c.

thread #10, name = 'Socket Thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000002bcb8b608 XUL`mozilla::net::Http2Session::Caps(this=0x00000001770f4e00) at Http2Session.cpp:4319:3
    frame #1: 0x00000002bcda9adc XUL`mozilla::net::nsHttpConnection::StartSpdy(this=0x0000000173785000, sslControl=0x0000000173750400, spdyVersion=HTTP_2) at nsHttpConnection.cpp:299:38
    frame #2: 0x00000002bcdb5914 XUL`mozilla::net::nsHttpConnection::HandshakeDoneInternal(this=0x0000000173785000) at nsHttpConnection.cpp:2370:7
    frame #3: 0x00000002bcd57d78 XUL`mozilla::net::TlsHandshaker::HandshakeDone(this=0x0000000163957c98)::$_22::operator()() const at TlsHandshaker.cpp:45:27
    frame #4: 0x00000002bcd57c7c XUL`mozilla::detail::RunnableFunction<mozilla::net::TlsHandshaker::HandshakeDone()::$_22>::Run(this=0x0000000163957c70) at nsThreadUtils.h:531:5
    frame #5: 0x00000002bbf89e2c XUL`nsThread::ProcessNextEvent(this=0x0000000113c8b580, aMayWait=true, aResult=0x000000016d412527) at nsThread.cpp:1198:16
    frame #6: 0x00000002bbf91c20 XUL`NS_ProcessNextEvent(aThread=0x0000000113c8b580, aMayWait=true) at nsThreadUtils.cpp:465:10
    frame #7: 0x00000002bc498100 XUL`mozilla::net::nsSocketTransportService::Run(this=0x0000000103ba9900) at nsSocketTransportService2.cpp:1235:11
    frame #8: 0x00000002bbf89e2c XUL`nsThread::ProcessNextEvent(this=0x0000000113c8b580, aMayWait=false, aResult=0x000000016d412ae7) at nsThread.cpp:1198:16
    frame #9: 0x00000002bbf91c20 XUL`NS_ProcessNextEvent(aThread=0x0000000113c8b580, aMayWait=false) at nsThreadUtils.cpp:465:10
    frame #10: 0x00000002bd423e6c XUL`mozilla::ipc::MessagePumpForNonMainThreads::Run(this=0x0000000113c91400, aDelegate=0x000000016d412dc8) at MessagePump.cpp:300:20
    frame #11: 0x00000002bd28d434 XUL`MessageLoop::RunInternal(this=0x000000016d412dc8) at message_loop.cc:381:10
    frame #12: 0x00000002bd28d378 XUL`MessageLoop::RunHandler(this=0x000000016d412dc8) at message_loop.cc:374:3
    frame #13: 0x00000002bd28d2f4 XUL`MessageLoop::Run(this=0x000000016d412dc8) at message_loop.cc:356:3
    frame #14: 0x00000002bbf8469c XUL`nsThread::ThreadFunc(aArg=0x00000001138f5c40) at nsThread.cpp:383:10
    frame #15: 0x00000001042c1e04 libnss3.dylib`_pt_root(arg=0x00000001032cd8b0) at ptthread.c:201:5
    frame #16: 0x0000000196b4e06c libsystem_pthread.dylib`_pthread_start + 148

Thanks for letting me know.
I'll submit a patch in bug 1800690 to fix this.

Just to be clear this is happening on startup preventing me from launching the debug build.

Regressions: 1800690
See Also: → 1800690
See Also: 1800690
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: