Closed Bug 1334105 Opened 3 years ago Closed 3 years ago

Crash in wrap_strdup | mozilla::CompareVersions

Categories

(Toolkit :: Telemetry, defect, P1, critical)

52 Branch
All
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- fixed
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: philipp, Assigned: gfritzsche)

Details

(Keywords: crash, regression, Whiteboard: [measurement:client])

Crash Data

Attachments

(4 files)

This bug was filed from the Socorro interface and is 
report bp-9674279b-b13f-41b6-89b3-d465c2170126.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	mozglue.dll 	wrap_strdup 	memory/build/mozmemory_wrap.c:86
1 	xul.dll 	mozilla::CompareVersions(char const*, char const*) 	xpcom/glue/nsVersionComparator.cpp:349
2 	xul.dll 	mozilla::Telemetry::Common::IsExpiredVersion(char const*) 	toolkit/components/telemetry/TelemetryCommon.cpp:26
3 	xul.dll 	TelemetryEvent::InitializeGlobalState(bool, bool) 	toolkit/components/telemetry/TelemetryEvent.cpp:366
4 	xul.dll 	`anonymous namespace'::TelemetryImpl::CreateTelemetryInstance 	toolkit/components/telemetry/Telemetry.cpp:1922
5 	xul.dll 	`anonymous namespace'::nsITelemetryConstructor 	toolkit/components/telemetry/Telemetry.cpp:2265
6 	xul.dll 	nsComponentManagerImpl::CreateInstanceByContractID(char const*, nsISupports*, nsID const&, void**) 	xpcom/components/nsComponentManager.cpp:1170
7 	xul.dll 	nsComponentManagerImpl::GetServiceByContractID(char const*, nsID const&, void**) 	xpcom/components/nsComponentManager.cpp:1526
8 	xul.dll 	nsTHashtable<nsBaseHashtableET<nsCStringHashKey, nsFactoryEntry*> >::s_HashKey(void const*) 	obj-firefox/dist/include/nsTHashtable.h:376
9 	xul.dll 	nsCOMPtr_base::assign_from_gs_contractid(nsGetServiceByContractID, nsID const&) 	xpcom/glue/nsCOMPtr.cpp:95
10 	xul.dll 	mozilla::Telemetry::Init() 	toolkit/components/telemetry/Telemetry.cpp:2909
11 	xul.dll 	nsLocalFile::GetParent(nsIFile**) 	xpcom/io/nsLocalFileWin.cpp:2857
12 	xul.dll 	mozilla::ClearOnShutdown<mozilla::StaticRefPtr<nsPrintingProxy> >(mozilla::StaticRefPtr<nsPrintingProxy>*, mozilla::ShutdownPhase) 	obj-firefox/dist/include/mozilla/ClearOnShutdown.h:115
13 	xul.dll 	mozilla::dom::ContentChild::Init(MessageLoop*, unsigned long, IPC::Channel*) 	dom/ipc/ContentChild.cpp:605
14 	xul.dll 	mozilla::dom::ContentProcess::Init() 	dom/ipc/ContentProcess.cpp:138
15 	xul.dll 	XRE_InitChildProcess 	toolkit/xre/nsEmbedFunctions.cpp:669

crashes with this signature are increasing in firefox 52 builds and later.
Priority: -- → P1
Whiteboard: [measurement:client]
From the signature summary and skimming through reports:
(1) the crash is limited to content processes
(2) limited to Fx 52
(3) limited to beta channel
(4) limited to crashes under "TelemetryEvent::InitializeGlobalState"
(5) always crashing on `strdup(aStrB)` in `CompareVersions()`, with a null ptr read

(2) makes sense as the first regularly recording event landed in 52 (bug 1316281).
In 53, we made event recording default to disabled (bug 1329139).

(3) is strange, why don't we see any crashes from Nightly & Aurora?
I wonder if this is some odd build?

For (4), comparing this to the scalar and histogram implementations, the event initialization is the only one checking expiry on initialization.
For scalars & histograms we do this lazily, also using `Telemetry::Common::IsExpiredVersion()`:
https://dxr.mozilla.org/mozilla-central/search?q=path%3ATelemetryHistogram.cpp+IsExpiredVersion
https://dxr.mozilla.org/mozilla-central/search?q=path%3ATelemetryScalar.cpp+IsExpiredVersion
... so they could also be affected, but hidden by this crash.

(5) is strange, coming from:
* current_version = mozilla::Version(MOZ_APP_VERSION)
* (mozilla::Version(aExpiration) <= current_version)
* this calls: CompareVersions(versionContent, aRhs.ReadContent())
* CompareVersions(const char* aStrA, const char* aStrB)
* char* B2 = strdup(aStrB);
(see e.g. bp-9674279b-b13f-41b6-89b3-d465c2170126)
I can't see how we can get to a 0 pointer there - strdup(MOZ_APP_VERSION) seems like it should always be valid.
The other expiration parameter is asserted to be non-0.
All crashes do have the same buildid, which is legit - it's this build:
http://ftp.mozilla.org/pub/firefox/candidates/52.0b1-candidates/build2/win32_info.txt
Assignee: nobody → gfritzsche
Another theory would be OOM or similar, i'm not sure how that shows up in crash reports now.

Benjamin, do you know who can help with odd crashers like this these days?
Flags: needinfo?(benjamin)
it's also a crash at startup (<60s uptime) in all the reported cases last week.
That part is not surprising, this is initialization code for the Telemetry module running at/around startup.
Just checking the stack, it looks like "B" string passed to CompareVersions is null, and by debugging the minidump I've verified that 

  static mozilla::Version current_version = mozilla::Version(MOZ_APP_VERSION);

is null.

It's probably not a good idea to have a function-scope static variable like that. It doesn't add a static destructor, but it does add a threadsafe dynamic constructor check to this codepath.

Here is the disassembly for initializing current_version:

57832F40  push        offset string "52.0" (59332560h)  
57832F45  mov         dword ptr ds:[5998C5FCh],eax  
57832F4A  call        dword ptr [__imp__strdup (5921E39Ch)]  
57832F50  mov         dword ptr [current_version (599A9B30h)],eax  
57832F55  mov         dword ptr [esp],offset `mozilla::Telemetry::Common::IsExpiredVersion'::`2'::`dynamic atexit destructor for 'current_version'' (57B75E21h)  
57832F5C  call        atexit (57B0310Bh)  
57832F61  pop         ecx  
57832F62  mov         ecx,dword ptr [esp+8]  
57832F66  jmp         mozilla::Telemetry::Common::IsExpiredVersion+1Bh (57832EBEh)

I don't think strdup is infallible in our codebase, although it's obviously rather unlikely that it would fail during process startup like this. I don't know if glandium has thoughts about what might be happening with the CRT and allocators.
Flags: needinfo?(benjamin) → needinfo?(mh+mozilla)
Bonus points if a fix here can also address bug 959745.
There shouldn't be a problem with that strdup there. It would be worth double checking what dumpbin -imports says about __imp_strdup. It should come from mozglue.dll. Either way, it should not have returned NULL. But maybe the problem is that it wasn't called at all? what does the dynamic constructor check look like?
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #8)
> There shouldn't be a problem with that strdup there. It would be worth
> double checking what dumpbin -imports says about __imp_strdup. It should
> come from mozglue.dll.

It does (per attachment 8833920 [details], thanks Alessio).

> Either way, it should not have returned NULL. But
> maybe the problem is that it wasn't called at all? what does the dynamic
> constructor check look like?

Benjamin, can you answer that one? Thanks!
Flags: needinfo?(benjamin)
I can't spend more time on this now, but you should be able to answer this by disassembling this function xul.dll for the affected build(s).
Flags: needinfo?(benjamin)
Thanks, here it is:
https://pastebin.mozilla.org/8974145
Glandium, does that tell you something?
Flags: needinfo?(mh+mozilla)
Copy/pasting only the relevant instruction that lead to the static variable initialization:
10412ea4 a1fcc55612      mov     eax,dword ptr [xul!`nsTHashtable<nsBaseHashtableET<nsStringHashKey,nsAutoPtr<`anonymous namespace'::HangReports::AnnotationInfo> > >::Ops'::`2'::`local static guard'+0x4 (1256c5fc)]
10412eaa 33db            xor     ebx,ebx
10412eaf 43              inc     ebx
10412eb6 84c3            test    bl,al
10412eb8 0f8480000000    je      xul!mozilla::Telemetry::Common::IsExpiredVersion+0x9b (10412f3e)

So, it loads the guard value at 0x1256c5fc into eax, sets ebx to 1, and compares the lower bits of both registers, and goes to the initialization code depending on the result of the comparison.

The x86 test instruction is kind of weird. It sets flags depending on the result of a bitwise AND of the two given values. Assuming the uninitialized guard is 0, 0 & 1 is 0, which sets the ZF flag to 1. The je instruction jumps when the ZF flags is 1, so the jump happens if the guard is 0. If the guard is 1, then 1 & 1 is 1, which sets ZF to 0, and the jump doesn't happen.

Then, in the initialization code:
10412f3e 0bc3            or      eax,ebx
10412f45 a3fcc55612      mov     dword ptr [xul!`nsTHashtable<nsBaseHashtableET<nsStringHashKey,nsAutoPtr<`anonymous namespace'::HangReports::AnnotationInfo> > >::Ops'::`2'::`local static guard'+0x4 (1256c5fc)],eax

Since ebx is always 1, essentially, the or just sets eax to 1. Then eax is put into the guard value.

All in all, the guard looks normal. Now the question is whether the guard is 0 or 1 when this all happens, and we likely don't have that info in the minidump, although, maybe we do, I don't know what minidumps include on windows. Surely the contents of the (modified) data segments of libraries would be interesting to have in minidump.
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #15)
> All in all, the guard looks normal. Now the question is whether the guard is
> 0 or 1 when this all happens, and we likely don't have that info in the
> minidump, although, maybe we do, I don't know what minidumps include on
> windows. Surely the contents of the (modified) data segments of libraries
> would be interesting to have in minidump.

Per Benjamin the minidump can't tell us that as its heap memory.
Do you have an idea on next steps we can take here?
Flags: needinfo?(mh+mozilla)
Speculative fix idea: Version has an operator<=(const char* aRhs), so how about just comparing (mozilla::Version(aExpiration) <= MOZ_APP_VERSION)?
Comment 17 sounds good.
Flags: needinfo?(mh+mozilla)
Thanks for the help. I would love to know what is actually going on here, but i will push this fix up now. Chris, can you take a look?
Attachment #8834875 - Flags: review?(chutten)
Comment on attachment 8834875 [details] [diff] [review]
Work around startup crashes in Version(MOZ_APP_VERSION) checks

Review of attachment 8834875 [details] [diff] [review]:
-----------------------------------------------------------------

Seems legit.
Attachment #8834875 - Flags: review?(chutten) → review+
Pushed by georg.fritzsche@googlemail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/66e0fa002986
Work around startup crashes in Version(MOZ_APP_VERSION) checks. r=chutten
https://hg.mozilla.org/mozilla-central/rev/66e0fa002986
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Going off crash-stats, I assume we're not going to get any real sense of whether this patch helps or not unless we uplift to Beta, so please nominate it for Aurora/Beta approval when you get a chance.
Flags: needinfo?(gfritzsche)
Yes, that is the plan.
Flags: needinfo?(gfritzsche)
Comment on attachment 8834875 [details] [diff] [review]
Work around startup crashes in Version(MOZ_APP_VERSION) checks

Approval Request Comment
[Feature/Bug causing the regression]: Event Telemetry
[User impact if declined]: Startup crashes probably fixed. We will need to uplift to Beta to confirm though.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: This is a trivial change that is presumed to work around the startup crash here.
[String changes made/needed]: None.
Attachment #8834875 - Flags: approval-mozilla-beta?
Attachment #8834875 - Flags: approval-mozilla-aurora?
Comment on attachment 8834875 [details] [diff] [review]
Work around startup crashes in Version(MOZ_APP_VERSION) checks

Fix a crash. Aurora53+.
Attachment #8834875 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
We will probably only see confirmation of a fix on 52 beta here.
Comment on attachment 8834875 [details] [diff] [review]
Work around startup crashes in Version(MOZ_APP_VERSION) checks

startup crash fix for beta52
Attachment #8834875 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Setting qe-verify- per comment 26.
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.