Crash while updating telemetry in Necko

RESOLVED WORKSFORME

Status

Core
Networking
RESOLVED WORKSFORME
3 years ago
3 years ago

People

(Reporter: seth, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
In bug 1128255, we got reports of crashes that look related to updating telemetry in Necko. Here are some crashstats links:

https://crash-stats.mozilla.com/report/index/bp-b137ac89-1966-4259-9990-e82912150402

https://crash-stats.mozilla.com/report/index/5e67fba4-a4ec-4283-8a98-ae67a2150305

More details and STR can be found in bug 1128255.
(Reporter)

Updated

3 years ago
See Also: → bug 1128255
(Reporter)

Comment 1

3 years ago
Stack, posted by Patrick in bug 1128255 comment 19:

Frame 	Module 	Signature 	Source
0 	libxul.so 	base::Histogram::Add(int) 	ipc/chromium/src/base/histogram.cc
1 	libxul.so 	mozilla::Telemetry::Accumulate(mozilla::Telemetry::ID, unsigned int) 	toolkit/components/telemetry/Telemetry.cpp
2 	libxul.so 	mozilla::net::CacheStorageService::TelemetryRecordEntryRemoval(mozilla::net::CacheEntry const*) 	netwerk/cache2/CacheStorageService.cpp
3 	libxul.so 	mozilla::net::CacheStorageService::UnregisterEntry(mozilla::net::CacheEntry*) 	netwerk/cache2/CacheStorageService.cpp
4 	libxul.so 	mozilla::net::CacheEntry::Purge(unsigned int) 	netwerk/cache2/CacheEntry.cpp
5 	libxul.so 	mozilla::net::CacheStorageService::MemoryPool::PurgeByFrecency(bool&, unsigned int) 	netwerk/cache2/CacheStorageService.cpp
6 	libxul.so 	mozilla::net::CacheStorageService::MemoryPool::PurgeOverMemoryLimit() 	netwerk/cache2/CacheStorageService.cpp
7 	libxul.so 	mozilla::net::CacheStorageService::PurgeOverMemoryLimit() 	netwerk/cache2/CacheStorageService.cpp
8 	libxul.so 	nsRunnableMethodImpl<void (mozilla::net::CacheStorageService::*)(), void, true>::Run() 	xpcom/glue/nsThreadUtils.h
9 	libxul.so
(Reporter)

Updated

3 years ago
Flags: needinfo?(michal.novotny)
I don't see anything wrong in the cache code. The telemetry call is not protected by a lock, but it is only ever called on one thread. On crash-stats I see a lot of crashes in base::Histogram::Add(int) called from code other than the cache, so why is this considered a bug in the cache rather than a bug in telemetry?
Flags: needinfo?(michal.novotny)
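
As an illustration of the "called only on one thread" argument above, here is a minimal sketch of how such an assumption can be checked at runtime rather than by inspection. SingleThreadChecker and EntryRemovalCounter are hypothetical names invented for this example; this is not the Necko or Telemetry code, and the real cache code may enforce its threading rules differently.

```cpp
#include <thread>

// Hypothetical helper (not Gecko code): remembers the thread that created it.
class SingleThreadChecker {
 public:
  SingleThreadChecker() : owner_(std::this_thread::get_id()) {}

  // True only when the caller runs on the thread that constructed the checker.
  bool CalledOnOwningThread() const {
    return std::this_thread::get_id() == owner_;
  }

 private:
  const std::thread::id owner_;
};

// Hypothetical stand-in for an unlocked telemetry counter.
class EntryRemovalCounter {
 public:
  // Records one removal. Returns false and records nothing when called
  // from any thread other than the owning one, since count_ has no lock.
  bool Record() {
    if (!checker_.CalledOnOwningThread()) {
      return false;
    }
    ++count_;
    return true;
  }

  int count() const { return count_; }

 private:
  SingleThreadChecker checker_;
  int count_ = 0;
};
```

A check like this would only rule out a threading race in the caller; it says nothing about whether the histogram object itself is still alive when it is used.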
(Reporter)

Comment 3

3 years ago
I have no idea where the fault lies. Vladan, does this seem like it's a bug in the Telemetry code?
Flags: needinfo?(vdjeric)
Flags: needinfo?(vdjeric)
From those 0x5a5a5a92 addresses, it looks like the histograms (or their buckets) have been freed. That sounds bad. What could cause that?
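
For readers unfamiliar with the reasoning above: it assumes the allocator overwrites freed memory with the byte 0x5a, so a pointer loaded from a freed object reads as 0x5a5a5a5a, and dereferencing a field through it faults at that value plus a small offset. The sketch below shows that heuristic on a 32-bit value like the one quoted in the comment. The function names are made up for this example and are not part of crash-stats or Gecko.

```cpp
#include <cstdint>

// Assumption: freed memory is filled with this byte by the allocator.
constexpr std::uint32_t kPoisonPointer = 0x5a5a5a5au;

// Heuristic: the upper three bytes of the crash address still carry the
// poison pattern, and only the lowest byte was changed by a field offset.
constexpr bool LooksLikeFreedPointer(std::uint32_t crash_address) {
  return (crash_address >> 8) == (kPoisonPointer >> 8);
}

// Distance of the crash address from the pure poison value, which hints at
// the offset of the field that was being accessed.
constexpr std::uint32_t OffsetFromPoison(std::uint32_t crash_address) {
  return crash_address - kPoisonPointer;
}
```

Applied to 0x5a5a5a92, the check matches and the offset works out to 0x38, which is consistent with reading a member of an object reached through a pointer that itself lived in freed memory. This is only a heuristic: it cannot say which object was freed or why.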
Could it be a shutdown ordering thing?
(Reporter)

Comment 6

3 years ago
(In reply to Patrick McManus [:mcmanus] from comment #5)
> could it be a shutdown ordering thing?

I don't think so; the STR in bug 1128255 don't involve shutdown.
Since it's on Linux and has STR, I bet rr will have the answer.

Comment 8

3 years ago
dmajor, does that mean you can take this to debug with rr?
Flags: needinfo?(dmajor)
Not really. My rr VM was super dusty and got lost to spring cleaning. According to the docs, reverse-execution (which is what we need here) needs real hardware anyway.
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #7)
> Since it's on Linux and has STR, I bet rr will have the answer.

I don't think we actually have STR here. I tried to reproduce based on the descriptions in that bug on my linux machine and I didn't get any crash.
(In reply to Timothy Nikkel (:tn) from comment #10)
> (In reply to David Major [:dmajor] from comment #7)
> > Since it's on Linux and has STR, I bet rr will have the answer.
> 
> I don't think we actually have STR here. I tried to reproduce based on the
> descriptions in that bug on my linux machine and I didn't get any crash.

Same here. Vladan set me up with a VNC linux machine and it didn't crash either.
Ryan, can you help David and Timothy reproduce this crash?
Flags: needinfo?(yixxt)

Comment 13

3 years ago
I will try in the next few days. I am still using aurora 35 since the bug has affected every release since 36. As I said before in the other thread, it takes between 5 and 20 crashes before the crash reporter even pops up.

Comment 14

3 years ago
http://officialfan.proboards.com/thread/519714/divas-pics-thread-bella-edition?page=35

http://officialfan.proboards.com/thread/516101/wwe-pics-gifs-vigilante-thread?page=72

I crash when going to the above threads, clicking the page numbers to move through the pages within the thread, and then scrolling to view more pictures. Most of the time I will crash or hard freeze after viewing one to three pages of those threads. Those pages work fine under Stable and Nightly Firefox on Windows and Firefox 35 under Linux.
Flags: needinfo?(yixxt)
I still haven't seen the crash, but it's possible that my remote session is interfering with the scrolling. Timothy, does it crash for you?
Flags: needinfo?(tnikkel)
Created attachment 8595039 [details]
stack

Hmm, so I was able to get three crashes using dev edition official builds. Two of them just said "Fatal IO error 11 (Resource temporarily unavailable) on X server :0". One of them dumped some hex addresses (I'll attach). I tried for quite some time in my own m-c build with debug+opt, both under gdb and not under gdb, but I never got a single crash. I even tried tagging my build as official and specifically enabling telemetry in my mozconfig.

I'm not sure what to try.
Flags: needinfo?(tnikkel)
Any ideas on how I can use official builds (I'm assuming a try build will also reproduce the crash) to track this down (get a stack or something)?
Flags: needinfo?(dmajor)
I made a try build with some printfs at the crashing site from comment 0 but I couldn't get it to crash.
I don't have much experience debugging on Linux so I don't really know what the options are. Can you take the build that does crash, and have gdb auto-attach when it crashes? Or run it under gdb from the start? Does it crash if you run under rr?
Flags: needinfo?(dmajor)
Running under gdb it crashes but it never breaks in gdb. One session ended with this crash:
[NPAPI 15507] ###!!! ABORT: Aborting on channel error.: file /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/glue/MessageChannel.cpp, line 1597
[NPAPI 15507] ###!!! ABORT: Aborting on channel error.: file /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/glue/MessageChannel.cpp, line 1597
[Inferior 1 (process 15418) exited with code 01]

I guess I'll try rr next.
rr does not seem to like official builds of firefox.
Um... that's not good. Talk to roc, I'm sure he'll want to know!
https://github.com/mozilla/rr/wiki/Building-And-Installing says that one needs to --disable-gstreamer in the build, so that is probably why.

Comment 24

3 years ago
There does not seem to be any more crashing or freezing on those web pages with the few 40.0a2 builds I tested over the past couple of weeks.
Thanks for the update, Ryan. I'm going to resolve this but please re-open if you run into the issue again.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → WORKSFORME