Closed Bug 798366 (opened 7 years ago, closed 7 years ago)

Intermittent crash in test_obj_group.xul, test_pluginstream_newstream.html, test_html5_tree_construction_part2.html, browser_aboutHome.js [@ libpthread-2.11.so@0xe6ad | unixSync | xSync]

Categories: Toolkit :: Telemetry, defect, critical
Status: RESOLVED WORKSFORME
People: (Reporter: emorley, Unassigned)
References: (Blocks 1 open bug)
Keywords: crash, intermittent-failure
Attachments: (2 files, 1 obsolete file)

Rev3 Fedora 12x64 mozilla-inbound debug test mochitest-other on 2012-10-04 02:10:07 PDT for push 739f8aac68a6

slave: talos-r3-fed64-044

https://tbpl.mozilla.org/php/getParsedLog.php?id=15807160&tree=Mozilla-Inbound

{
1312 INFO TEST-START | chrome://mochitests/content/a11y/accessible/attributes/test_obj_group.xul
++DOMWINDOW == 28 (0x18c81d0) [serial = 41] [outer = 0x3832210]
1313 INFO TEST-INFO | chrome://mochitests/content/a11y/accessible/attributes/test_obj_group.xul | must wait for load
TEST-UNEXPECTED-FAIL | chrome://mochitests/content/a11y/accessible/attributes/test_obj_group.xul | application timed out after 330 seconds with no output
args: ['/home/cltbld/talos-slave/test/build/bin/screentopng']
INFO | automation.py | Application ran for: 0:13:12.861257
INFO | automation.py | Reading PID log: /tmp/tmpUDj2Nspidlog
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64-debug/1349339357/firefox-18.0a1.en-US.linux-x86_64.crashreporter-symbols.zip
PROCESS-CRASH | chrome://mochitests/content/a11y/accessible/attributes/test_obj_group.xul | application crashed (minidump found)
Crash dump filename: /tmp/tmp0QaMNZ/minidumps/501cd257-f3c8-a21a-1f6214b2-6b38d7f7.dmp
Operating system: Linux
                  0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64
CPU: amd64
     family 0 model 0 stepping 0
     1 CPU

Crash reason:  SIGABRT
Crash address: 0x1f400000cbb

Thread 0 (crashed)
 0  libpthread-2.11.so + 0xe6ad
    rbx = 0x0000000002b589b0   r12 = 0x00000000000000e8
    r13 = 0x0000000000000000   r14 = 0x0000000000000002
    r15 = 0x0000000000000000   rip = 0x00000034d360e6ad
    rsp = 0x00007fffdb279570   rbp = 0x00007fffdb2795b0
    Found by: given as instruction pointer in context
 1  libmozsqlite3.so!unixSync [sqlite3.c : 26073 + 0x7]
    rip = 0x00007f328a91459a   rsp = 0x00007fffdb279580
    rbp = 0x00007fffdb2795b0
    Found by: stack scanning
 2  libxul.so!xSync [TelemetryVFS.cpp : 158 + 0x5]
    rbx = 0x0000000002b589a0   r12 = 0x00000000000000e8
    r13 = 0x0000000000000000   rip = 0x00007f328813ef37
    rsp = 0x00007fffdb2795c0   rbp = 0x00007fffdb2795f0
    Found by: call frame info
 3  libmozsqlite3.so!sqlite3PagerCommitPhaseOne [sqlite3.c : 42892 + 0x7]
    rbx = 0x0000000002b58848   r12 = 0x0000000003c4ff18
    r13 = 0x0000000000000000   rip = 0x00007f328a93e77b
    rsp = 0x00007fffdb279600   rbp = 0x00007fffdb279660
    Found by: call frame info
 4  libmozsqlite3.so!sqlite3BtreeCommitPhaseOne [sqlite3.c : 51065 + 0xd]
    rbx = 0x00000000021f6908   r12 = 0x0000000003c4ff18
    r13 = 0x0000000000000000   r14 = 0x0000000004816f48
    r15 = 0x0000000000000000   rip = 0x00007f328a93ea02
    rsp = 0x00007fffdb279670   rbp = 0x00007fffdb2796c0
    Found by: call frame info
 5  libmozsqlite3.so!sqlite3VdbeHalt [sqlite3.c : 59920 + 0xd]
    rbx = 0x00000000019ecf58   r12 = 0x0000000003c4ff18
    r13 = 0x0000000000000000   r14 = 0x0000000000000000
    r15 = 0x0000000000000000   rip = 0x00007f328a93f618
    rsp = 0x00007fffdb2796d0   rbp = 0x00007fffdb279780
    Found by: call frame info
 6  libmozsqlite3.so!sqlite3VdbeExec [sqlite3.c : 64314 + 0x7]
    rbx = 0x0000000003d9c188   r12 = 0x0000000000000000
    r13 = 0x0000000003c4ff18   r14 = 0x00000000019ecf58
    r15 = 0x0000000003ae53e0   rip = 0x00007f328a960944
    rsp = 0x00007fffdb279790   rbp = 0x00007fffdb279c30
    Found by: call frame info
 7  libmozsqlite3.so!sqlite3_step [sqlite3.c : 61763 + 0x7]
}

Followed later in mochitest-ipcplugins by:

{
805 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/plugins/test/test_pluginstream_newstream.html | Test timed out.
TEST-UNEXPECTED-FAIL | /tests/dom/plugins/test/test_pluginstream_newstream.html | application timed out after 330 seconds with no output
INFO | automation.py | Application ran for: 0:15:57.428228
INFO | automation.py | Reading PID log: /tmp/tmpc0VS3lpidlog
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64-debug/1349339357/firefox-18.0a1.en-US.linux-x86_64.crashreporter-symbols.zip
PROCESS-CRASH | /tests/dom/plugins/test/test_pluginstream_newstream.html | application crashed (minidump found)
Crash dump filename: /tmp/tmpIF3fip/minidumps/7cdf0735-50e9-fd24-50d516fb-5048778f.dmp
Operating system: Linux
                  0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64
CPU: amd64
     family 0 model 0 stepping 0
     1 CPU

Crash reason:  SIGABRT
Crash address: 0x1f400000fa7

Thread 0 (crashed)
 0  libpthread-2.11.so + 0xe6ad
    rbx = 0x00000000064c8110   r12 = 0x00000000000000e8
    r13 = 0x0000000000000000   r14 = 0x0000000000000002
    r15 = 0x0000000000000000   rip = 0x00000034d360e6ad
    rsp = 0x00007fff5e91c3c0   rbp = 0x00007fff5e91c400
    Found by: given as instruction pointer in context
 1  libmozsqlite3.so!unixSync [sqlite3.c : 26073 + 0x7]
    rip = 0x00007fab2a23159a   rsp = 0x00007fff5e91c3d0
    rbp = 0x00007fff5e91c400
    Found by: stack scanning
 2  libxul.so!xSync [TelemetryVFS.cpp : 158 + 0x5]
    rbx = 0x00000000064c8100   r12 = 0x00000000000000e8
    r13 = 0x0000000000000000   rip = 0x00007fab27a5bf37
    rsp = 0x00007fff5e91c410   rbp = 0x00007fff5e91c440
    Found by: call frame info
 3  libmozsqlite3.so!sqlite3PagerCommitPhaseOne [sqlite3.c : 42892 + 0x7]
    rbx = 0x00000000064c7fa8   r12 = 0x0000000005b39708
    r13 = 0x0000000000000000   rip = 0x00007fab2a25b77b
    rsp = 0x00007fff5e91c450   rbp = 0x00007fff5e91c4b0
    Found by: call frame info
 4  libmozsqlite3.so!sqlite3BtreeCommitPhaseOne [sqlite3.c : 51065 + 0xd]
    rbx = 0x000000000415fcf8   r12 = 0x0000000005b39708
    r13 = 0x0000000000000000   r14 = 0x00000000055eb0b8
    r15 = 0x0000000000000000   rip = 0x00007fab2a25ba02
    rsp = 0x00007fff5e91c4c0   rbp = 0x00007fff5e91c510
    Found by: call frame info
 5  libmozsqlite3.so!sqlite3VdbeHalt [sqlite3.c : 59920 + 0xd]
    rbx = 0x00000000021f0e38   r12 = 0x0000000005b39708
    r13 = 0x0000000000000000   r14 = 0x0000000000000000
    r15 = 0x0000000000000000   rip = 0x00007fab2a25c618
    rsp = 0x00007fff5e91c520   rbp = 0x00007fff5e91c5d0
    Found by: call frame info
 6  libmozsqlite3.so!sqlite3VdbeExec [sqlite3.c : 64314 + 0x7]
    rbx = 0x0000000005db9328   r12 = 0x0000000000000000
    r13 = 0x0000000005b39708   r14 = 0x00000000021f0e38
    r15 = 0x0000000004d413a0   rip = 0x00007fab2a27d944
    rsp = 0x00007fff5e91c5e0   rbp = 0x00007fff5e91ca80
    Found by: call frame info
 7  libmozsqlite3.so!sqlite3_step [sqlite3.c : 61763 + 0x7]
}
Attachment #671558 - Attachment description: Bug 798366 - part 1: delete TelemetryVFS; r=mak → part 1: delete TelemetryVFS
Attachment #671558 - Flags: review?(mak77)
Attachment #671560 - Attachment description: Bug 798366 - part 2: delete MOZ_SQLITE histogram definitions; r=taras → part 2: delete MOZ_SQLITE histogram definitions
Attachment #671560 - Flags: review?(taras.mozilla)
Attachment #671560 - Flags: review?(taras.mozilla) → review+
Comment on attachment 671558 [details] [diff] [review]
part 1: delete TelemetryVFS

my only concern is about the weird nfs bits that were added at some point. Can you reach out to whoever made those to let them know they'll have to find another avenue for their hack?
(In reply to Taras Glek (:taras) from comment #3)
> Comment on attachment 671558 [details] [diff] [review]
> part 1: delete TelemetryVFS
> 
> my only concern is about the weird nfs bits that were added at some point.
> Can you reach out to whoever made those to let them know they'll have to
> find another avenue for their hack?

Oooh, good point.  I can move those bits over into the main storage code...I think.  CC'ing Martin to see if he has any better ideas, since he added this for bug 433129.
Well, if you remove the TelemetryVFS I'll rewrite the nfs patch in bug 433129.
May I have some more information about what happens and why we are doing this? Here I just see a stack and a removal patch.
Do we think the VFS is causing so many issues that it's simpler to kill it than to investigate them? Or is this part of a wider plan?
Comment on attachment 671558 [details] [diff] [review]
part 1: delete TelemetryVFS

Review of attachment 671558 [details] [diff] [review]:
-----------------------------------------------------------------

So the nfs thing is in Aurora, right? So this patch is basically a regression.
Imo either we back it out from Aurora too, avoiding the regression, or we wait for a new version of that patch that doesn't use the VFS.
Apart from that, a couple of comments to be fixed, but the patch looks OK.

::: storage/src/mozStorageService.cpp
@@ +437,2 @@
>    if (rc != SQLITE_OK)
>      NS_WARNING("sqlite3 did not shutdown cleanly.");

While here could you please fix this (and next ones in this destructor) to use NS_WARN_IF_FALSE? And define rc as DebugOnly<int> rc; or it would warn on compile.

@@ +575,5 @@
>    rc = ::sqlite3_initialize();
>    if (rc != SQLITE_OK)
>      return convertResultCode(rc);
>  
> +  rc = ::sqlite3_quota_initialize(nullptr, 0);

nullptr is C++, but sqlite APIs are C, shouldn't we use NULL then?
Attachment #671558 - Flags: review?(mak77) → review+
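To make the NS_WARN_IF_FALSE / DebugOnly suggestion from the review concrete, here is a minimal self-contained sketch. The `DebugOnly` class below is a simplified stand-in for the real `mozilla::DebugOnly` in mfbt (which compiles the stored value away in release builds), and `shutdownCleanly` is a hypothetical helper written for illustration, not code from the patch:

```cpp
#include <cassert>

// Simplified stand-in for mozilla::DebugOnly<T>. In the real mfbt class the
// stored value exists only in debug builds, so an `rc` that is assigned but
// then checked only by a debug-only macro does not trigger unused-variable
// warnings when compiling in release mode.
template <typename T>
class DebugOnly {
public:
  T value;
  DebugOnly& operator=(const T& aOther) {
    value = aOther;
    return *this;
  }
  bool operator==(const T& aOther) const { return value == aOther; }
};

// SQLITE_OK stands in for the real SQLite constant (it is 0 in sqlite3.h).
const int SQLITE_OK = 0;

// The pattern the review asks for: assign into a DebugOnly<int> and test it
// in what would be a debug-only warning macro.
bool shutdownCleanly(int sqliteRc) {
  DebugOnly<int> rc;
  rc = sqliteRc;
  // NS_WARN_IF_FALSE(rc == SQLITE_OK, "sqlite3 did not shutdown cleanly.");
  return rc == SQLITE_OK;
}
```

The point of the wrapper is purely to silence release-build warnings; the runtime behavior of the check is unchanged.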
Well, the r+ is clearly dependent on answering comment 6 and figuring out a way of proceeding for the nfs issue; this was just a code-wise review.
I don't know that the VFS is causing a lot of issues; we've had one intermittent orange that's seemingly attributable to the VFS, but the VFS has been in the tree for quite some time.

 (In reply to Marco Bonardo [:mak] from comment #7)
> So the nfs thing is in Aurora, right? So this patch is basically a
> regression.
> Imo either we back it out from Aurora too, avoiding the regression, or we
> wait for a new version of that patch that doesn't use the VFS.
> Apart from that, a couple of comments to be fixed, but the patch looks OK.

I think just this part:

-  rc = ::sqlite3_quota_initialize("telemetry-vfs", 0);
+  rc = ::sqlite3_quota_initialize(nullptr, 0);
   if (rc != SQLITE_OK)

would have to be fixed to use the appropriate VFS name and preference that TelemetryVFS currently uses.

> ::: storage/src/mozStorageService.cpp
> @@ +437,2 @@
> >    if (rc != SQLITE_OK)
> >      NS_WARNING("sqlite3 did not shutdown cleanly.");
> 
> While here could you please fix this (and next ones in this destructor) to
> use NS_WARN_IF_FALSE? And define rc as DebugOnly<int> rc; or it would warn
> on compile.

Sure.

> @@ +575,5 @@
> >    rc = ::sqlite3_initialize();
> >    if (rc != SQLITE_OK)
> >      return convertResultCode(rc);
> >  
> > +  rc = ::sqlite3_quota_initialize(nullptr, 0);
> 
> nullptr is C++, but sqlite APIs are C, shouldn't we use NULL then?

I don't think so.
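For what it's worth, the equivalence is easy to check: in C++, `nullptr` implicitly converts to any pointer type, so passing it through a C prototype is identical to passing `NULL`. The stub below is a hypothetical stand-in with the same shape as `sqlite3_quota_initialize` (a C function taking a VFS name that may be null), not the real SQLite API:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical C-style stub shaped like sqlite3_quota_initialize: a null
// VFS name means "wrap the default VFS".
extern "C" int quota_initialize_stub(const char* zOrigVfsName, int makeDefault) {
  (void)makeDefault;
  return zOrigVfsName == NULL ? 0 : 1;
}

// nullptr converts to const char* at the call site, so both calls hit the
// same code path in the C function.
int callWithNullptr() { return quota_initialize_stub(nullptr, 0); }
int callWithNull()    { return quota_initialize_stub(NULL, 0); }
```

So the disagreement in this thread is about project convention, not correctness; either spelling produces the same null pointer argument.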
Rev3 Fedora 12x64 mozilla-central opt test mochitests-5/5 on 2012-10-19 05:15:06 PDT for push 987d2f22f83f

slave: talos-r3-fed64-044

https://tbpl.mozilla.org/php/getParsedLog.php?id=16264522&tree=Firefox
Summary: Intermittent crash in test_obj_group.xul, test_pluginstream_newstream.html [@ libpthread-2.11.so@0xe6ad | unixSync | xSync] → Intermittent crash in test_obj_group.xul, test_pluginstream_newstream.html, test_html5_tree_construction_part2.html [@ libpthread-2.11.so@0xe6ad | unixSync | xSync]
(In reply to Nathan Froyd (:froydnj) from comment #9)
> > nullptr is C++, but sqlite APIs are C, shouldn't we use NULL then?
> 
> I don't think so.

Well, at least in Storage code we usually use NULL when dealing with SQLite APIs. It's possible nullptr is safe (likely the compiler takes care of that), but per convention I'd keep using NULL.
Whiteboard: [orange]
FWIW, I was unable to find a single instance of this crash signature in the last 4 weeks on crash-stats.
https://tbpl.mozilla.org/php/getParsedLog.php?id=18742675&tree=Firefox
Summary: Intermittent crash in test_obj_group.xul, test_pluginstream_newstream.html, test_html5_tree_construction_part2.html [@ libpthread-2.11.so@0xe6ad | unixSync | xSync] → Intermittent crash in test_obj_group.xul, test_pluginstream_newstream.html, test_html5_tree_construction_part2.html, browser_aboutHome.js [@ libpthread-2.11.so@0xe6ad | unixSync | xSync]
Updated patch that applies, addresses the NFS VFS issue (I think), and
fixes the potential non-main thread accesses to prefs.
Attachment #671558 - Attachment is obsolete: true
Attachment #707666 - Flags: review?(mak77)
Blocks: 832677
Comment on attachment 707666 [details] [diff] [review]
part 1 - delete TelemetryVFS

Review of attachment 707666 [details] [diff] [review]:
-----------------------------------------------------------------

::: storage/src/mozStorageService.cpp
@@ +43,5 @@
> + * This preference is a workaround to allow users/sysadmins to identify
> + * that the profile exists on an NFS share whose implementation
> + * is incompatible with SQLite's default locking implementation.
> + * Bug 433129 attempted to automatically identify such file-systems, 
> + * but a reliable way was not found and it was determined that the fallback 

some trailing spaces on these lines

@@ +55,5 @@
> +#else
> +#define STANDARD_VFS "unix"
> +#define NFS_VFS "unix-excl"
> +#endif
> +  

trailing spaces
Attachment #707666 - Flags: review?(mak77) → review+
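The selection logic behind the `#define`s quoted in the review hunk can be sketched as follows. `SelectVfsName` is a hypothetical helper written for illustration (in the actual patch the flag would come from the user-set preference described in the quoted comment block), while the VFS names mirror the quoted Unix branch:

```cpp
#include <cassert>
#include <cstring>

// VFS names from the quoted hunk (Unix branch). "unix-excl" makes SQLite use
// exclusive locking, which sidesteps broken POSIX advisory locking on some
// NFS implementations.
#define STANDARD_VFS "unix"
#define NFS_VFS "unix-excl"

// Hypothetical helper: pick the exclusive-locking VFS only when the user or
// sysadmin has flagged the profile as living on an NFS share.
const char* SelectVfsName(bool profileIsOnNfs) {
  return profileIsOnNfs ? NFS_VFS : STANDARD_VFS;
}
```

The name then gets handed to SQLite when opening the database, which is why the pref has to be readable before the first connection is created.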
...and backed out due to mochitest-2 orange:

https://tbpl.mozilla.org/php/getParsedLog.php?id=19252208&tree=Mozilla-Inbound

http://hg.mozilla.org/integration/mozilla-inbound/rev/46c47be9d44e

I thought the quota handling for IndexedDB happened someplace above sqlite nowadays, but apparently not...
(In reply to Nathan Froyd (:froydnj) from comment #17)
> ...and backed out due to mochitest-2 orange:
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=19252208&tree=Mozilla-
> Inbound
> 
> http://hg.mozilla.org/integration/mozilla-inbound/rev/46c47be9d44e
> 
> I thought the quota handling for IndexedDB happened someplace above sqlite
> nowadays, but apparently not...

Apparently we ditched sqlite's quota handling at some point and baked the quota handling into the Telemetry VFS.  Thus, ditching the Telemetry VFS also ditched the quota handling.  I love (non-)composable systems...
So, it's easy enough to solve this by ripping the Telemetry bits out of TelemetryVFS and moving the pref reading into ServiceMainThreadInitializer.  However, I think we have a race:

1. Non-main thread asks for the service.
2. Service initializes, enqueues ServiceMainThreadInitializer.
3. Non-main thread now has a handle on the service, but things are not fully initialized.
4. A database connection is established before ServiceMainThreadInitializer runs.

What happens now depends on where we construct the VFS that we use to supply quota handling:

a) We construct the VFS as soon as the service initializes and read the (cached) value of the pref every time we open a file.  In this scenario, any databases opened before ServiceMainThreadInitializer runs may have NFS problems, but quota handling will be OK.

b) We construct the VFS inside of ServiceMainThreadInitializer.  In this scenario, both quota handling and NFS issues crop up for databases opened before initialization completes.

To be fair, we have approximately the same sort of issues now, except that we're dealing with prefs read off the main thread, which may or may not work...

Mak, WDYT?  Is it worth trying to address this, or should we just go with solution (a)?  I almost think that dispatching ServiceMainThreadInitializer to the main thread DISPATCH_SYNC ought to work, though we'd have to tweak Service::getSingleton to make sure that we're not returning a partially-initialized service on some *other* thread while we're waiting for ServiceMainThreadInitializer to complete...
Flags: needinfo?(mak77)
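The sync-dispatch idea in the comment above can be sketched in portable C++. Here `std::mutex`/`std::condition_variable` stand in for the actual XPCOM dispatch machinery, and `Service`, `getSingleton`, and `completeMainThreadInit` are simplified stand-ins for the real mozStorage code, not its implementation:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

class Service {
public:
  static Service& getSingleton() {
    Service& s = instance();
    std::unique_lock<std::mutex> lock(s.mMutex);
    // Block any caller, on any thread, until main-thread init has run --
    // the moral equivalent of dispatching ServiceMainThreadInitializer
    // DISPATCH_SYNC before handing the singleton out.
    s.mInitDone.wait(lock, [&s] { return s.mReady; });
    return s;
  }

  // Stand-in for ServiceMainThreadInitializer::Run(): read prefs, construct
  // the VFS, then release any threads blocked in getSingleton().
  static void completeMainThreadInit() {
    Service& s = instance();
    {
      std::lock_guard<std::mutex> lock(s.mMutex);
      s.mReady = true;
    }
    s.mInitDone.notify_all();
  }

  bool ready() const { return mReady; }

private:
  static Service& instance() {
    static Service sInstance;
    return sInstance;
  }
  std::mutex mMutex;
  std::condition_variable mInitDone;
  bool mReady = false;
};
```

This removes the race at the cost of making other threads wait on the main thread briefly, which matches the trade-off discussed in the following comments.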
(In reply to Nathan Froyd (:froydnj) from comment #19)
> So, it's easy enough to solve this by ripping the Telemetry bits out of
> TelemetryVFS and moving the pref reading into ServiceMainThreadInitializer. 
> However, I think we have a race:

I actually noticed this some days ago, and it looks like it has always "worked" like this; I didn't have the time to investigate it, though.

> What happens now depends on where we construct the VFS that we use to supply
> quota handling:

Afaik, we no longer use a VFS for quota handling; that is managed directly by the quota manager, which is then used by IndexedDB and DOMStorage.

> a) We construct the VFS as soon as the service initializes and read the
> (cached) value of the pref every time we open a file.  In this scenario, any
> databases opened before ServiceMainThreadInitializer runs may have NFS
> problems, but quota handling will be OK.

That is unfortunately problematic: the nfs case is disruptive, and just guessing that the first database opened may not be that important looks like a poor choice.

> Mak, WDYT?  Is it worth trying to address this

Yes, I think we should try to address this, instead of working around it.
Can we hack up something with mutexes so the async thread waits for the main thread? I don't think it's so bad if another thread waits on the main thread for a little while.
Flags: needinfo?(mak77)
And honestly, I didn't actually verify whether by chance we already mutex properly here...
(In reply to Marco Bonardo [:mak] from comment #20)
> > What happens now depends on where we construct the VFS that we use to supply
> > quota handling:
> 
> Afaik, we no longer use a VFS for quota handling; that is managed directly
> by the quota manager, which is then used by IndexedDB and DOMStorage.

That's what I thought, too!  But then all those quota-related tests wouldn't have failed when I landed my stuff yesterday.  And then I discovered that the quota manager is actually called by the TelemetryVFS stuff:

http://mxr.mozilla.org/mozilla-central/source/storage/src/TelemetryVFS.cpp#94
http://mxr.mozilla.org/mozilla-central/source/storage/src/TelemetryVFS.cpp#140
http://mxr.mozilla.org/mozilla-central/source/storage/src/TelemetryVFS.cpp#322

> > Mak, WDYT?  Is it worth trying to address this
> 
> Yes, I think we should try to address this, instead of working around it.
> Can we hack up something with mutexes so the async thread waits for the
> main thread? I don't think it's so bad if another thread waits on the main
> thread for a little while.

I think sync dispatching and some careful programming in getSingleton() will work; I will attempt to do that.
Depends on: 836493
(In reply to Nathan Froyd (:froydnj) from comment #22)
> That's what I thought, too!  But then all those quota-related tests wouldn't
> have failed when I landed my stuff yesterday.  And then I discovered that
> the quota manager is actually called by the TelemetryVFS stuff:
> 
> http://mxr.mozilla.org/mozilla-central/source/storage/src/TelemetryVFS.cpp#94
> http://mxr.mozilla.org/mozilla-central/source/storage/src/TelemetryVFS.
> cpp#140
> http://mxr.mozilla.org/mozilla-central/source/storage/src/TelemetryVFS.
> cpp#322

Ugh, this complicates things a bit, also as it relates to this patch. I was actually told the quota manager didn't need the VFS at all anymore, but I didn't verify that directly :(
Resolving WFM keyword:intermittent-failure bugs last modified >3 months ago, whose whiteboard contains none of:
{random,disabled,marked,fuzzy,todo,fails,failing,annotated,time-bomb,leave open}

There will inevitably be some false positives; for that (and the bugspam) I apologise. Filter on orangewfm.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME