Closed Bug 1577937 Opened 5 years ago Closed 5 years ago

Intermittent Tier 2 testing/firefox-ui/tests/functional/safebrowsing/test_warning_pages.py TestSafeBrowsingWarningPages.test_warning_pages | application crashed [@ mozilla::Telemetry::CanRecordExtended()]

Categories

(Toolkit :: Telemetry, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox69 --- wontfix
firefox70 --- wontfix
firefox71 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: chutten)

References

Details

(5 keywords, Whiteboard: [post-critsmash-triage][adv-main71+r])

Crash Data

Attachments

(2 files)

Filed by: rgurzau [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=264324953&repo=autoland
Full log: https://queue.taskcluster.net/v1/task/a3n_iDH3QOudZIKpTJPeXA/runs/0/artifacts/public/logs/live_backing.log


[task 2019-08-30T20:40:44.246Z] 20:40:44 INFO - ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2019-08-30T20:40:44.246Z] 20:40:44 INFO - ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2019-08-30T20:40:44.345Z] 20:40:44 INFO - Exiting due to channel error.
[task 2019-08-30T20:40:44.345Z] 20:40:44 INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown
[task 2019-08-30T20:40:44.345Z] 20:40:44 INFO - [Child 904, Chrome_ChildThread] WARNING: pipe error (3): Connection reset by peer: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 358
[task 2019-08-30T20:40:44.345Z] 20:40:44 INFO - Exiting due to channel error.
[task 2019-08-30T20:40:44.347Z] 20:40:44 INFO - [Child 1058, Chrome_ChildThread] WARNING: pipe error (3): Connection reset by peer: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 358
[task 2019-08-30T20:40:44.348Z] 20:40:44 INFO - Exiting due to channel error.
[task 2019-08-30T20:40:44.410Z] 20:40:44 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/HMWDQuhTTFGsU-Bpgf6jwQ/artifacts/public/build/target.crashreporter-symbols.zip
[task 2019-08-30T20:40:48.231Z] 20:40:48 INFO - mozcrash Copy/paste: /builds/worker/workspace/build/linux64-minidump_stackwalk /builds/worker/workspace/build/tmpQousue.mozrunner/minidumps/4aead703-b6df-7310-9936-7040b9a16c41.dmp /tmp/tmpe_tevC
[task 2019-08-30T20:40:53.055Z] 20:40:53 INFO - mozcrash Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/4aead703-b6df-7310-9936-7040b9a16c41.dmp
[task 2019-08-30T20:40:53.055Z] 20:40:53 INFO - mozcrash Saved app info as /builds/worker/workspace/build/blobber_upload_dir/4aead703-b6df-7310-9936-7040b9a16c41.extra
[task 2019-08-30T20:40:53.209Z] 20:40:53 INFO - PROCESS-CRASH | testing/firefox-ui/tests/functional/safebrowsing/test_warning_pages.py TestSafeBrowsingWarningPages.test_warning_pages | application crashed [@ mozilla::Telemetry::CanRecordExtended()]
[task 2019-08-30T20:40:53.210Z] 20:40:53 INFO - Crash dump filename: /builds/worker/workspace/build/tmpQousue.mozrunner/minidumps/4aead703-b6df-7310-9936-7040b9a16c41.dmp
[task 2019-08-30T20:40:53.210Z] 20:40:53 INFO - Operating system: Linux
[task 2019-08-30T20:40:53.211Z] 20:40:53 INFO - 0.0.0 Linux 4.4.0-1014-aws #14taskcluster1-Ubuntu SMP Tue Apr 3 10:27:00 UTC 2018 x86_64
[task 2019-08-30T20:40:53.211Z] 20:40:53 INFO - CPU: amd64
[task 2019-08-30T20:40:53.212Z] 20:40:53 INFO - family 6 model 85 stepping 4
[task 2019-08-30T20:40:53.213Z] 20:40:53 INFO - 1 CPU
[task 2019-08-30T20:40:53.213Z] 20:40:53 INFO -
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO - GPU: UNKNOWN
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO -
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO - Crash reason: SIGSEGV /SEGV_ACCERR
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO - Crash address: 0x7f355d0b537f
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO - Process uptime: not available
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO -
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO - Thread 17 (crashed)
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - 0 libxul.so!mozilla::Telemetry::CanRecordExtended() [Telemetry.cpp:466e8f97c54f14584c6c23d8a8057255e5957abb : 1997 + 0x0]
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - rax = 0x000005e3704fc2f0 rdx = 0x00000000430bd000
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - rcx = 0x000005e36e455b1a rbx = 0x00000000430bd000
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - rsi = 0x0000000000000002 rdi = 0x00007f35430bd000
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - rbp = 0x00007f354e88d630 rsp = 0x00007f354e88d5d8
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - r8 = 0x89c4ddff1b344a00 r9 = 0x89c4ddff1b344a00
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - r10 = 0x00007f354e88d5b0 r11 = 0x0000000000000246
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - r12 = 0x00007f35430bd000 r13 = 0x0000000000000001
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - r14 = 0x00007f354e88d601 r15 = 0x00007f354241a000
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - rip = 0x00007f355d0b537f
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - Found by: given as instruction pointer in context
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - 1 libxul.so!mozilla::layers::AsyncPanZoomController::ReportCheckerboard(mozilla::TimeStamp const&) [AsyncPanZoomController.cpp:466e8f97c54f14584c6c23d8a8057255e5957abb : 4352 + 0x5]
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - rbx = 0x00000000430bd000 rbp = 0x00007f354e88d630
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - rsp = 0x00007f354e88d5e0 r12 = 0x00007f35430bd000
[task 2019-08-30T20:40:53.217Z] 20:40:53 INFO - r13 = 0x0000000000000001 r14 = 0x00007f354e88d601
[task 2019-08-30T20:40:53.217Z] 20:40:53 INFO - r15 = 0x00007f354241a000 rip = 0x00007f355a34effe
[task 2019-08-30T20:40:53.217Z] 20:40:53 INFO - Found by: call frame info
[task 2019-08-30T20:40:53.217Z] 20:40:53 INFO - 2 libxul.so!mozilla::layers::ForEachNode<mozilla::layers::ForwardIterator, mozilla::layers::LayerMetricsWrapper, (lambda at /builds/worker/workspace/build/src/gfx/layers/TreeTraversal.h:189:14), (lambda at /builds/worker/workspace/build/src/gfx/layers/apz/src/APZSampler.cpp:111:7)> [TreeTraversal.h:466e8f97c54f14584c6c23d8a8057255e5957abb : 145 + 0x1a]
[task 2019-08-30T20:40:53.217Z] 20:40:53 INFO - rbx = 0x00007f35430bd000 rbp = 0x00007f354e88d690
[task 2019-08-30T20:40:53.217Z] 20:40:53 INFO - rsp = 0x00007f354e88d640 r12 = 0x00007f354e88d7c0
[task 2019-08-30T20:40:53.218Z] 20:40:53 INFO - r13 = 0x0000000000000000 r14 = 0x00007f354e88d6b0
[task 2019-08-30T20:40:53.218Z] 20:40:53 INFO - r15 = 0x00007f354241a000 rip = 0x00007f355a381b70
[task 2019-08-30T20:40:53.218Z] 20:40:53 INFO - Found by: call frame info
[task 2019-08-30T20:40:53.218Z] 20:40:53 INFO - 3 libxul.so!mozilla::layers::ForEachNode<mozilla::layers::ForwardIterator, mozilla::layers::LayerMetricsWrapper, (lambda at /builds/worker/workspace/build/src/gfx/layers/TreeTraversal.h:189:14), (lambda at /builds/worker/workspace/build/src/gfx/layers/apz/src/APZSampler.cpp:111:7)> [TreeTraversal.h:466e8f97c54f14584c6c23d8a8057255e5957abb : 142 + 0xe]
[task 2019-08-30T20:40:53.218Z] 20:40:53 INFO - rbx = 0x00007f354241a000 rbp = 0x00007f354e88d6f0

Happened when loading https://www.itisatrap.org/firefox/its-an-attack.html:

[task 2019-08-30T20:40:44.218Z] 20:40:44 INFO - 1567197644213 Marionette TRACE [39] Received DOM event DOMContentLoaded for https://www.itisatrap.org/firefox/its-an-attack.html
[task 2019-08-30T20:40:44.246Z] 20:40:44 INFO - ExceptionHandler::GenerateDump cloned child 1119
[task 2019-08-30T20:40:44.246Z] 20:40:44 INFO - ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2019-08-30T20:40:44.246Z] 20:40:44 INFO - ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2019-08-30T20:40:44.345Z] 20:40:44 INFO - Exiting due to channel error.
[task 2019-08-30T20:40:44.345Z] 20:40:44 INFO - [GFX1-]: Receive IPC close with reason=AbnormalShutdown

The crash itself seems to be security sensitive:

[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO - Crash reason: SIGSEGV /SEGV_ACCERR
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO - Crash address: 0x7f355d0b537f
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO - Process uptime: not available
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO -
[task 2019-08-30T20:40:53.214Z] 20:40:53 INFO - Thread 17 (crashed)
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - 0 libxul.so!mozilla::Telemetry::CanRecordExtended() [Telemetry.cpp:466e8f97c54f14584c6c23d8a8057255e5957abb : 1997 + 0x0]
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - rax = 0x000005e3704fc2f0 rdx = 0x00000000430bd000
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - rcx = 0x000005e36e455b1a rbx = 0x00000000430bd000
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - rsi = 0x0000000000000002 rdi = 0x00007f35430bd000
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - rbp = 0x00007f354e88d630 rsp = 0x00007f354e88d5d8
[task 2019-08-30T20:40:53.215Z] 20:40:53 INFO - r8 = 0x89c4ddff1b344a00 r9 = 0x89c4ddff1b344a00
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - r10 = 0x00007f354e88d5b0 r11 = 0x0000000000000246
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - r12 = 0x00007f35430bd000 r13 = 0x0000000000000001
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - r14 = 0x00007f354e88d601 r15 = 0x00007f354241a000
[task 2019-08-30T20:40:53.216Z] 20:40:53 INFO - rip = 0x00007f355d0b537f

Group: firefox-core-security
Component: Firefox UI Tests → Telemetry
Product: Testing → Toolkit
QA Contact: hskupin
Version: Version 3 → unspecified

Shall I presume that the stack pointer being at Telemetry::CanRecordExtended is more likely meaning TelemetryImpl::CanRecordExtended instead?

If so, the only thing I can think of is if sTelemetry is non-null, but is also not the singleton TelemetryImpl. This would mean APZ is calling in to CanRecordExtended before CreateTelemetryInstance has been called (which is called the first time the Telemetry service is requested). Since APZ accesses the C++ API directly, it doesn't ask for the service, so this is entirely possible.

I think we should throw sTelemetry in a DataMutex or Atomic or some other mutual exclusion data access thingy. Might not be the problem causing this crash, but it seems like a decent idea anyway.

ni?Dexter for a second pair of eyes.

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Flags: needinfo?(alessio.placitelli)
Priority: -- → P1

(In reply to Chris H-C :chutten from comment #3)

If so, the only thing I can think of is if sTelemetry is non-null, but is also not the singleton TelemetryImpl. This would mean APZ is calling in to CanRecordExtended before CreateTelemetryInstance has been called (which is called the first time the Telemetry service is requested). Since APZ accesses the C++ API directly, it doesn't ask for the service, so this is entirely possible.

Yes, I agree, this is entirely possible.

I think we should throw sTelemetry in a DataMutex or Atomic or some other mutual exclusion data access thingy. Might not be the problem causing this crash, but it seems like a decent idea anyway.

I agree on using Atomic or similar here. I wonder why this even crashes, though. The first thing we do in TelemetryImpl::CanRecordExtended is to check for sTelemetry being valid.

Flags: needinfo?(alessio.placitelli)

(In reply to Alessio Placitelli [:Dexter] from comment #4)

I agree on using Atomic or similar here. I wonder why this even crashes, though. The first thing we do in TelemetryImpl::CanRecordExtended is to check for sTelemetry being valid.

The first thing we check is if sTelemetry is non-null, which is different from "valid". Imagine if we caught it with only one half of the pointer written (to my knowledge C++ doesn't require atomic assignment of pointer types).

And, come to think of it, there's nothing stopping sTelemetry from being nulled out after we've done the check, either. t1 could check sTelemetry and it could be 100% fine, and then t2 comes in and shuts down Telemetry, and then t1 continues and tries to use sTelemetry.

See Also: → 1367344

It could very well happen that someone calls into a static method of
TelemetryImpl during Telemetry shutdown. There could be a race where sTelemetry
is non-null at the top of a method like CanRecordExtended, but is null by the
time we try to use it.

So let's put a DataMutex around it. This isn't trying to mutex the data itself.
Instead I'm just trying to protect places where the pointer is being read
or changed.

(In reply to Chris H-C :chutten from comment #5)

(In reply to Alessio Placitelli [:Dexter] from comment #4)

I agree on using Atomic or similar here. I wonder why this even crashes, though. The first thing we do in TelemetryImpl::CanRecordExtended is to check for sTelemetry being valid.

The first thing we check is if sTelemetry is non-null, which is different from "valid". Imagine if we caught it with only one half of the pointer written (to my knowledge C++ doesn't require atomic assignment of pointer types).

It does not require the type to be written "all at once", as it were (e.g. 64-bit types on 32-bit machines), but we can assume that word-size things are written with word-size load/store instructions. That is, they are written "all at once", and this assumption would apply to pointers on all the machines we care about.

Whether those stores are visible to other threads is a different matter. And whether the non-nullness remains true in the body of the if after being checked in the conditional is also up for debate in a multi-threaded world, as you pointed out.

Attachment #9090848 - Attachment description: Bug 1577937 - Throw a DataMutex around sTelemetry r?janerik!,froydnj! → Bug 1577937 - Throw a StaticDataMutex around sTelemetry r?janerik!,froydnj!
Keywords: checkin-needed
Group: firefox-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

This grafts cleanly to Beta and ESR68. Is this something we should consider backporting or can it ride Fx71 to release?

Flags: needinfo?(chutten)

I don't think we're in a hurry here. I think it can ship in 71.

Flags: needinfo?(chutten)
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main71+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: