Closed Bug 1160148 Opened 9 years ago Closed 9 years ago

startup crash in log via mozilla::Telemetry::Accumulate

Categories

(Toolkit :: Telemetry, defect)

Unspecified
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla41
Tracking Status
firefox39 + fixed
firefox40 + fixed
firefox41 --- fixed

People

(Reporter: tracy, Assigned: away)

References

Details

(Keywords: crash, topcrash-thunderbird, topcrash-win, Whiteboard: [tbird topcrash][startupcrash])

Crash Data

Attachments

(4 files)

Startup crasher is number 3 on the 3 day top crash list.  First affected Nightly build was 2015042703 on Windows 7 amd64 only.

This bug was filed from the Socorro interface and is 
report bp-2f8ef536-1f91-4c00-9413-6a3a92150427.
=============================================================

Frame 	Module 	Signature 	Source
0 	msvcr120.dll 	log 	f:\dd\vctools\crt\fpw32\tran\amd64\log.asm:429
1 	mozglue.dll 	moz_xmalloc 	memory/mozalloc/mozalloc.cpp
2 	xul.dll 	base::Histogram::InitializeBucketRange() 	ipc/chromium/src/base/histogram.cc
3 	xul.dll 	base::Histogram::FactoryGet(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int, unsigned __int64, base::Histogram::Flags) 	ipc/chromium/src/base/histogram.cc
4 	xul.dll 	`anonymous namespace'::HistogramGet(char const*, char const*, unsigned int, unsigned int, unsigned int, unsigned int, bool, base::Histogram**) 	toolkit/components/telemetry/Telemetry.cpp
5 	xul.dll 	`anonymous namespace'::GetHistogramByEnumId(mozilla::Telemetry::ID, base::Histogram**) 	toolkit/components/telemetry/Telemetry.cpp
6 	xul.dll 	mozilla::Telemetry::Accumulate(mozilla::Telemetry::ID, unsigned int) 	toolkit/components/telemetry/Telemetry.cpp
7 	xul.dll 	nsSocketTransportService::DoPollIteration(bool, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator>*) 	netwerk/base/nsSocketTransportService2.cpp
8 	xul.dll 	nsSocketTransportService::Run() 	netwerk/base/nsSocketTransportService2.cpp
9 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
10 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp
11 	xul.dll 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp
12 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc
13 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc
14 	xul.dll 	nsThread::ThreadFunc(void*) 	xpcom/threads/nsThread.cpp
15 	nss3.dll 	PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c
16 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c
17 	msvcr120.dll 	_callthreadstartex 	f:\dd\vctools\crt\crtw32\startup\threadex.c:376
18 	msvcr120.dll 	_threadstartex 	f:\dd\vctools\crt\crtw32\startup\threadex.c:354
19 	kernel32.dll 	BaseThreadInitThunk 	
20 	ntdll.dll 	RtlUserThreadStart 	
21 	kernel32.dll 	BasepReportFault 	
22 	kernel32.dll 	BasepReportFault
Summary: crash in log → crash in log via mozilla::Telemetry::Accumulate
Summary: crash in log via mozilla::Telemetry::Accumulate → startup crash in log via mozilla::Telemetry::Accumulate
Whiteboard: [tbird topcrash][startupcrash]
The crash happens when the CPU does not support the "vpsrlq" instruction. Normally an alternate codepath would be used for these processors, but there is a bug in MSVCR120: https://connect.microsoft.com/VisualStudio/feedback/details/987093/x64-log-function-uses-vpsrlq-avx-instruction-without-regard-to-operating-system-so-it-crashes-on-vista-x64

It seems the only workaround is to disable these fast-path computations. That could affect perf, so we should try to disable only where necessary.
Assignee: nobody → dmajor
(It's ok to call cpuid7 when it's not supported, as long as you don't use the result.)
Attachment #8601767 - Flags: review?(nfroyd)
Attachment #8601767 - Flags: review?(nfroyd) → review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/3bf223d7eedd
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
There haven't been any reports of this bug with builds since the fix here was landed.
Status: RESOLVED → VERIFIED
I think I missed the nightly by a few hours so really it's tomorrow's stats that will be interesting.
Er, what I mean is that there hasn't yet been a nightly with this fix out in the wild.
ok, I'll circle back to reconfirm crash stats on Monday.
bp-4d516b25-457a-463a-a848-d83482150508

The fix isn't working, and I know why. My patch put the code in firefox.exe, which static-links the CRT. We'll also need to disable FMA3 in the DLL version of the CRT.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
https://hg.mozilla.org/mozilla-central/rev/f7a79386ab63
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
no crashes reported of this signature with builds from yesterday, 20150510.
Status: RESOLVED → VERIFIED
(Ah sorry Tracy, I was hoping to save you the trouble, but we mid-aired.)

I still see this happening on 20150510: bp-2a095a65-1f29-4966-aee3-622652150511

I notice that all the reports are Win7 RTM, build 7600. It seems that Windows did not gain support for these instructions until Win7 SP1: http://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Operating_system_support
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
dmajor, how is it you're seeing reports, but I don't in the signature summary > table?
ok, lesson learned: don't trust the table.  I see the crashes with 20150501030205 in the reports list.
(In reply to [:tracy] Tracy Walker - QA Mentor from comment #15)
> dmajor, how is it you're seeing reports, but I don't in the signature
> summary > table?

Can you tell me exactly what tab on what URL you are looking at? The "Table" tab is known to be buggy.
Flagging this one for another review since I'm doing something slightly yucky.

I just realized that XRE_main doesn't run in content processes. Instead of adding a third instance of this code, I decided to move it to DllBlocklist_Initialize. It doesn't fit the blocklist theme, but it's run by chrome and content processes alike, and it has the added benefit of being the very first line of code we run after loading MSVCR120.dll.
Attachment #8604253 - Flags: review?(nfroyd)
Doh, trailing space. Already fixed locally.
Comment on attachment 8604253 [details] [diff] [review]
Also disable on Win7RTM, and move to mozglue

Review of attachment 8604253 [details] [diff] [review]:
-----------------------------------------------------------------

So, just to make sure I understand what's going on here:

- We initially put the code only in firefox.exe (nsWindowsWMain.cpp);
- That actually wasn't good enough, so we had to duplicate the code into nsAppRunner.cpp so libxul's calls into the CRT would be covered;
- (I admit to not really understanding why we have different CRTs for firefox.exe and xul.dll, but I trust there's good reasons for it)
- But that *still* doesn't cover content processes, because they start with another call in libxul (and they load a different CRT?);
- So rather than duplicate the code again for the content process's entry point, put it someplace where it's more-or-less covering firefox.exe, libxul, and plugin-container.exe?

Or did it not really matter that we had a call in nsWindowsWMain.cpp once we had the call in nsAppRunner.cpp?  So we could really get away with removing the bits in nsWindowsWMain.cpp and pasting them into the appropriate plugin-container.exe entry point?

I think the patch is reasonable, but I'd appreciate some short explanation that covers the above questions, if you could.
Attachment #8604253 - Flags: review?(nfroyd) → review+
So, the first problem is that there are two copies of the CRT in the process. One is static-linked into firefox.exe (unfortunate fallout from the XPSP2 hack) and the other is a dynamic-linked instance of msvcr120.dll shared by the other binaries. If code in firefox.exe calls |log|, the use of FMA3 will be controlled by a flag located in firefox.exe memory. If code anywhere else calls |log|, FMA3 will be gated by a flag in msvcr120.dll's globals. So to be on the safe side, we need to disable both.

The other problem is that the particular place where I tried to disable msvcr120.dll's copy of the flag, was in a codepath that isn't executed by child processes. The blocklist code is executed on both types of processes, so it's a better place to disable the msvcr120 flag.
https://hg.mozilla.org/mozilla-central/rev/730d9ce9776b
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Finally! The crashes are gone from nightly. In the meantime version 40 has moved to aurora, but there haven't been any builds released yet. Hopefully I can get this uplifted before the first build goes out.

Despite a lack of reports on beta, I'd like to uplift there too, because I don't see why the issue couldn't happen there as well.
[Tracking Requested - why for this release]:
Given this is a startup crash, requesting tracking. Ideally, we should get this into 40 today so it will be in tomorrow when we want to turn on updates on Dev Edition.
Attached patch patch for auroraSplinter Review
Attachment #8605870 - Flags: review+
Comment on attachment 8605870 [details] [diff] [review]
patch for aurora

Approval Request Comment
[Feature/regressing bug #]: Microsoft C library
[User impact if declined]: Startup crashes on win64
[Describe test coverage new/current, TreeHerder]: confirmed by crash-stats
[Risks and why]: Low risk, and win64 only
[String/UUID change made/needed]: none
Attachment #8605870 - Flags: approval-mozilla-aurora?
Setting 39 status to "presumed affected" despite a lack of reports, since the tools key off this flag.
Attached patch patch for betaSplinter Review
Attachment #8605880 - Flags: review+
Comment on attachment 8605880 [details] [diff] [review]
patch for beta

Approval Request Comment
[Feature/regressing bug #]: Microsoft C library
[User impact if declined]: Possible startup crashes on win64
[Describe test coverage new/current, TreeHerder]: confirmed by crash-stats
[Risks and why]: Low risk, and win64 only
[String/UUID change made/needed]: none
Attachment #8605880 - Flags: approval-mozilla-beta?
Comment on attachment 8605870 [details] [diff] [review]
patch for aurora

Verified fix. Good work on this one. Aurora+
Attachment #8605870 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Nomming for 39 as per comment 28 this may affect the 39 release as well.
Target Milestone: mozilla40 → mozilla41
Comment on attachment 8605880 [details] [diff] [review]
patch for beta

Approved for uplift to beta (39). No reported crashes for 39 but based on comment 28, sounds like this needs to be on 39 as well.
Attachment #8605880 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Depends on: 1279769
You need to log in before you can comment on or make changes to this bug.