1151548 - Crash while updating telemetry in Necko

Reporter

Description

•

9 years ago

In bug 1128255, we got reports of crashes that look related to updating telemetry in Necko. Here are some crashstats links:

https://crash-stats.mozilla.com/report/index/bp-b137ac89-1966-4259-9990-e82912150402

https://crash-stats.mozilla.com/report/index/5e67fba4-a4ec-4283-8a98-ae67a2150305

More details and STR can be found in bug 1128255.

Seth Fowler [:seth] [:s2h]

Reporter

Updated

•

9 years ago

Comment 1

•

9 years ago

Stack, posted by Patrick in bug 1128255 comment 19:

Frame 	Module 	Signature 	Source
0 	libxul.so 	base::Histogram::Add(int) 	ipc/chromium/src/base/histogram.cc
1 	libxul.so 	mozilla::Telemetry::Accumulate(mozilla::Telemetry::ID, unsigned int) 	toolkit/components/telemetry/Telemetry.cpp
2 	libxul.so 	mozilla::net::CacheStorageService::TelemetryRecordEntryRemoval(mozilla::net::CacheEntry const*) 	netwerk/cache2/CacheStorageService.cpp
3 	libxul.so 	mozilla::net::CacheStorageService::UnregisterEntry(mozilla::net::CacheEntry*) 	netwerk/cache2/CacheStorageService.cpp
4 	libxul.so 	mozilla::net::CacheEntry::Purge(unsigned int) 	netwerk/cache2/CacheEntry.cpp
5 	libxul.so 	mozilla::net::CacheStorageService::MemoryPool::PurgeByFrecency(bool&, unsigned int) 	netwerk/cache2/CacheStorageService.cpp
6 	libxul.so 	mozilla::net::CacheStorageService::MemoryPool::PurgeOverMemoryLimit() 	netwerk/cache2/CacheStorageService.cpp
7 	libxul.so 	mozilla::net::CacheStorageService::PurgeOverMemoryLimit() 	netwerk/cache2/CacheStorageService.cpp
8 	libxul.so 	nsRunnableMethodImpl<void (mozilla::net::CacheStorageService::*)(), void, true>::Run() 	xpcom/glue/nsThreadUtils.h
9 	libxul.so

Seth Fowler [:seth] [:s2h]

Reporter

Updated

•

9 years ago

Flags: needinfo?(michal.novotny)

Michal Novotny [:michal]

Comment 2

•

9 years ago

I don't see anything wrong in the cache code. The telemetry call is not protected by a lock, but it is only ever called on one thread. On crash-stats I see a lot of crashes in base::Histogram::Add(int) called from code other than the cache, so why is this considered a bug in the cache rather than a bug in telemetry?
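
To check the one-thread assumption rather than rely on reading the code, a debug-only thread-affinity guard around the telemetry call is one option. A minimal sketch in standard C++, where `ThreadAffinity` and `EntryRemovalCounter` are hypothetical names for illustration and not classes from the cache code:

```cpp
#include <cassert>
#include <thread>

// Remembers the thread it was created on.
class ThreadAffinity {
 public:
  ThreadAffinity() : mOwner(std::this_thread::get_id()) {}

  // True if `id` is the thread this object was created on.
  bool IsOwner(std::thread::id id) const { return id == mOwner; }

  // True if the calling thread is the owning thread.
  bool OnOwningThread() const { return IsOwner(std::this_thread::get_id()); }

 private:
  const std::thread::id mOwner;
};

// Hypothetical stand-in for unlocked state that is only safe on one thread.
class EntryRemovalCounter {
 public:
  void Record() {
    // Fires in debug builds if a second thread ever reaches this path.
    assert(mAffinity.OnOwningThread());
    ++mCount;
  }

  unsigned Count() const { return mCount; }
  const ThreadAffinity& Affinity() const { return mAffinity; }

 private:
  ThreadAffinity mAffinity;
  unsigned mCount = 0;
};
```

If such an assertion never fires while the crash still reproduces, the unlocked access is unlikely to be the cause.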

Flags: needinfo?(michal.novotny)

Seth Fowler [:seth] [:s2h]

Reporter

Comment 3

•

9 years ago

I have no idea where the fault lies. Vladan, does this seem like it's a bug in the Telemetry code?

Flags: needinfo?(vdjeric)

Vladan Djeric (:vladan)

Updated

•

9 years ago

Flags: needinfo?(vdjeric)

(Away)

Comment 4

•

9 years ago

From those 0x5a5a5a92 addresses, it looks like the histograms (or their buckets) have been freed. That sounds bad. What could cause that?
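
As a rough illustration of that reasoning (a sketch, not code from the tree): assuming the allocator overwrites freed memory with the byte 0x5a, a pointer read back out of a freed object would be 0x5a5a5a5a, and a field access through it would fault at that value plus a small offset. The helper name, the 32-bit pattern, and the offset cutoff below are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Assumed pattern: a 32-bit word read from memory filled with 0x5a bytes.
constexpr std::uint64_t kFreePoisonWord = 0x5a5a5a5aULL;

// Hypothetical cutoff: treat anything within 4 KiB above the pattern as a
// plausible "poisoned pointer + member offset" access.
constexpr std::uint64_t kMaxPlausibleOffset = 0x1000;

// Returns true if `faultAddr` looks like a member access through a poisoned
// pointer, and stores the implied member offset in `*offsetOut`.
bool LooksLikePoisonedAccess(std::uint64_t faultAddr,
                             std::uint64_t* offsetOut) {
  assert(offsetOut != nullptr);
  if (faultAddr < kFreePoisonWord) {
    return false;
  }
  const std::uint64_t offset = faultAddr - kFreePoisonWord;
  if (offset >= kMaxPlausibleOffset) {
    return false;
  }
  *offsetOut = offset;
  return true;
}
```

For the address above this gives an offset of 0x38, which is consistent with reading a member a few dozen bytes into an object through a pointer that itself came from freed memory.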

Patrick McManus [:mcmanus]

Comment 5

•

9 years ago

could it be a shutdown ordering thing?

Seth Fowler [:seth] [:s2h]

Reporter

Comment 6

•

9 years ago

(In reply to Patrick McManus [:mcmanus] from comment #5)
> could it be a shutdown ordering thing?

I don't think so; the STR in bug 1128255 don't involve shutdown.

(Away)

Comment 7

•

9 years ago

Since it's on Linux and has STR, I bet rr will have the answer.

Benjamin Smedberg

Comment 8

•

9 years ago

dmajor, does that mean you can take this to debug with rr?

Flags: needinfo?(dmajor)

(Away)

Comment 9

•

9 years ago

Not really. My rr VM was super dusty and got lost to spring cleaning. According to the docs, reverse-execution (which is what we need here) needs real hardware anyway.

Flags: needinfo?(dmajor)

Timothy Nikkel (:tnikkel)

Comment 10

•

9 years ago

(In reply to David Major [:dmajor] from comment #7)
> Since it's on Linux and has STR, I bet rr will have the answer.

I don't think we actually have STR here. I tried to reproduce based on the descriptions in that bug on my linux machine and I didn't get any crash.

(Away)

Comment 11

•

9 years ago

(In reply to Timothy Nikkel (:tn) from comment #10)
> (In reply to David Major [:dmajor] from comment #7)
> > Since it's on Linux and has STR, I bet rr will have the answer.
> 
> I don't think we actually have STR here. I tried to reproduce based on the
> descriptions in that bug on my linux machine and I didn't get any crash.

Same here. Vladan set me up with a VNC linux machine and it didn't crash either.

Vladan Djeric (:vladan)

Comment 12

•

9 years ago

Ryan, can you help David and Timothy reproduce this crash?

Flags: needinfo?(yixxt)

Ryan

Comment 13

•

9 years ago

I will try in the next few days. I am still using aurora 35 since the bug has affected every release since 36. As I said before in the other thread, it takes between 5 and 20 crashes before the bug reporter even pops up.

Ryan

Comment 14

•

9 years ago

http://officialfan.proboards.com/thread/519714/divas-pics-thread-bella-edition?page=35

http://officialfan.proboards.com/thread/516101/wwe-pics-gifs-vigilante-thread?page=72

I crash when going to the above threads, clicking the page numbers to move through the pages within the thread, and then scrolling to view more pictures. Most of the time I will crash or hard freeze after viewing a page or two or three of those threads. Those pages work fine under Stable and Nightly Firefox on Windows and Firefox 35 under Linux.

Flags: needinfo?(yixxt)

(Away)

Comment 15

•

9 years ago

I still haven't seen the crash, but it's possible that my remote session is interfering with the scrolling. Timothy, does it crash for you?

Flags: needinfo?(tnikkel)

Timothy Nikkel (:tnikkel)

Comment 16

•

9 years ago

Attached file: stack

Hmm, so I was able to get three crashes using dev edition official builds. Two of them just said "Fatal IO error 11 (Resource temporarily unavailable) on X server :0". One of them dumped some hex addresses (I'll attach). I tried for quite some time in my own m-c build with debug+opt, both under gdb and not under gdb, but I never got a single crash. I even tried tagging my build as official and specifically enabling telemetry in my mozconfig.

I'm not sure what to try.

Flags: needinfo?(tnikkel)

Timothy Nikkel (:tnikkel)

Comment 17

•

9 years ago

Any ideas on how I can use official builds (I'm assuming a try build will also reproduce the crash) to track this down (get a stack or something)?

Flags: needinfo?(dmajor)

Timothy Nikkel (:tnikkel)

Comment 18

•

9 years ago

I made a try build with some printfs at the crashing site from comment 0 but I couldn't get it to crash.

(Away)

Comment 19

•

9 years ago

I don't have much experience debugging on Linux so I don't really know what the options are. Can you take the build that does crash, and have gdb auto-attach when it crashes? Or run it under gdb from the start? Does it crash if you run under rr?

Flags: needinfo?(dmajor)

Timothy Nikkel (:tnikkel)

Comment 20

•

9 years ago

Running under gdb it crashes but it never breaks in gdb. One session ended with this crash:
[NPAPI 15507] ###!!! ABORT: Aborting on channel error.: file /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/glue/MessageChannel.cpp, line 1597
[NPAPI 15507] ###!!! ABORT: Aborting on channel error.: file /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/glue/MessageChannel.cpp, line 1597
[Inferior 1 (process 15418) exited with code 01]

I guess I'll try rr next.

Timothy Nikkel (:tnikkel)

Comment 21

•

9 years ago

rr does not seem to like official builds of firefox.

(Away)

Comment 22

•

9 years ago

Um... that's not good. Talk to roc, I'm sure he'll want to know!

Timothy Nikkel (:tnikkel)

Comment 23

•

9 years ago

https://github.com/mozilla/rr/wiki/Building-And-Installing says that one needs to --disable-gstreamer in the build, so that is probably why.

Ryan

Comment 24

•

9 years ago

There does not seem to be any more crashing or freezing on those web pages with the few 40.0a2 builds I tested over the past couple of weeks.

(Away)

Comment 25

•

9 years ago

Thanks for the update, Ryan. I'm going to resolve this but please re-open if you run into the issue again.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → WORKSFORME