This bug was filed from the Socorro interface and is report bp-0d4d970c-80a8-4cdb-a765-9cd102140320. ============================================================= This is in the top 10 crashes for Firefox 31.0a1. This crash signature showed up in Firefox 27 but didn't appear in other versions until 31, so it will probably not be too hard to track down a regression range.
0 nss3.dll sqlite3VdbeSerialGet db/sqlite3/src/sqlite3.c
1 nss3.dll sqlite3VdbeRecordCompare db/sqlite3/src/sqlite3.c
2 nss3.dll getAndInitPage db/sqlite3/src/sqlite3.c
3 nss3.dll nss3.dll@0x12a3c0
4 nss3.dll nss3.dll@0x12a3c0
5 nss3.dll sqlite3BtreeMovetoUnpacked db/sqlite3/src/sqlite3.c
6 nss3.dll setSharedCacheTableLock db/sqlite3/src/sqlite3.c
7 nss3.dll nss3.dll@0x12a3c0
8 nss3.dll allocateCursor db/sqlite3/src/sqlite3.c
9 nss3.dll sqlite3VdbeExec db/sqlite3/src/sqlite3.c
10 nss3.dll sqlite3VdbeMakeReady db/sqlite3/src/sqlite3.c
11 nss3.dll codeTableLocks db/sqlite3/src/sqlite3.c
12 nss3.dll sqlite3FinishCoding db/sqlite3/src/sqlite3.c
13 nss3.dll yy_reduce db/sqlite3/src/sqlite3.c
14 mozglue.dll arena_run_dalloc memory/mozjemalloc/jemalloc.c
15 mozglue.dll arena_dalloc memory/mozjemalloc/jemalloc.c
16 mozglue.dll arena_malloc_small memory/mozjemalloc/jemalloc.c
17 mozglue.dll arena_malloc memory/mozjemalloc/jemalloc.c
18 xul.dll nsStringBuffer::Alloc(unsigned __int64) xpcom/string/src/nsSubstring.cpp
19 mozglue.dll arena_dalloc_small memory/mozjemalloc/jemalloc.c
20 xul.dll nsACString_internal::MutatePrep(unsigned int,char * *,unsigned int *) xpcom/string/src/nsTSubstring.cpp
21 mozglue.dll arena_dalloc memory/mozjemalloc/jemalloc.c
22 xul.dll nsACString_internal::ReplacePrepInternal(unsigned int,unsigned int,unsigned int,unsigned int) xpcom/string/src/nsTSubstring.cpp
23 xul.dll nsACString_internal::ReplacePrep(unsigned int,unsigned int,unsigned int) obj-firefox/dist/include/nsTSubstring.h
24 nss3.dll vdbeUnbind db/sqlite3/src/sqlite3.c
25 nss3.dll sqlite3_bind_int64 db/sqlite3/src/sqlite3.c
26 mozglue.dll arena_dalloc_small memory/mozjemalloc/jemalloc.c
27 xul.dll nsTArray_base<nsTArrayInfallibleAllocator,nsTArray_CopyWithMemutils>::SwapArrayElements<nsTArrayInfallibleAllocator>(nsTArray_base<nsTArrayInfallibleAllocator,nsTArray_CopyWithMemutils> &,unsigned int,unsigned __int64) xpcom/glue/nsTArray-inl.h
28 mozglue.dll arena_dalloc_small memory/mozjemalloc/jemalloc.c
29 mozglue.dll arena_dalloc memory/mozjemalloc/jemalloc.c
30 mozglue.dll arena_dalloc memory/mozjemalloc/jemalloc.c
31 xul.dll nsTArray_base<nsTArrayInfallibleAllocator,nsTArray_CopyWithMemutils>::ShrinkCapacity(unsigned int,unsigned __int64) xpcom/glue/nsTArray-inl.h
32 nss3.dll sqlite3Step db/sqlite3/src/sqlite3.c
33 nss3.dll sqlite3_step db/sqlite3/src/sqlite3.c
34 mozglue.dll arena_dalloc memory/mozjemalloc/jemalloc.c
35 xul.dll mozilla::storage::Connection::stepStatement(sqlite3_stmt *) storage/src/mozStorageConnection.cpp
36 xul.dll mozilla::storage::Statement::ExecuteStep(bool *) storage/src/mozStorageStatement.cpp
37 xul.dll mozilla::storage::Statement::BindInt32ByName(nsACString_internal const &,int) storage/src/mozStorageStatement.cpp
38 xul.dll mozilla::net::Seer::TryPredict(mozilla::net::Seer::QueryType,mozilla::net::Seer::TopLevelInfo const &,__int64,nsMainThreadPtrHandle<nsINetworkSeerVerifier> &,mozilla::TimeStamp &) netwerk/base/src/Seer.cpp
39 nss3.dll sqlite3_reset db/sqlite3/src/sqlite3.c
40 xul.dll mozilla::net::Seer::WouldRedirect(mozilla::net::Seer::TopLevelInfo const &,__int64,mozilla::net::Seer::UriInfo &) netwerk/base/src/Seer.cpp
41 xul.dll mozilla::net::Seer::UpdateTopLevel(mozilla::net::Seer::QueryType,mozilla::net::Seer::TopLevelInfo const &,__int64) netwerk/base/src/Seer.cpp
42 xul.dll mozilla::net::Seer::PredictForPageload(mozilla::net::Seer::UriInfo const &,nsMainThreadPtrHandle<nsINetworkSeerVerifier> &,int,mozilla::TimeStamp &) netwerk/base/src/Seer.cpp
43 ntdll.dll LdrInitializeThunk
44 nss3.dll md_UnlockAndPostNotifies nsprpub/pr/src/md/windows/w95cv.c
45 KERNELBASE.dll WaitForSingleObjectEx
46 xul.dll mozilla::TimeStamp::operator-(mozilla::TimeStamp const &) obj-firefox/dist/include/mozilla/TimeStamp.h
47 xul.dll mozilla::TimeStamp::Now(bool) xpcom/ds/TimeStamp_windows.cpp
48 xul.dll mozilla::Telemetry::AccumulateTimeDelta(mozilla::Telemetry::ID,mozilla::TimeStamp,mozilla::TimeStamp) toolkit/components/telemetry/Telemetry.cpp
49 xul.dll mozilla::net::SeerPredictionEvent::Run() netwerk/base/src/Seer.cpp
50 nss3.dll PR_Unlock nsprpub/pr/src/threads/combined/prulock.c
51 nss3.dll PR_Wait nsprpub/pr/src/threads/prmon.c
52 xul.dll nsEventQueue::GetEvent(bool,nsIRunnable * *) xpcom/threads/nsEventQueue.cpp
53 xul.dll nsThread::ProcessNextEvent(bool,bool *) xpcom/threads/nsThread.cpp
54 nss3.dll MD_CURRENT_THREAD nsprpub/pr/src/md/windows/w95thred.c
55 KERNELBASE.dll CreateEventExW
56 xul.dll CallCreateInstance(nsID const &,nsISupports *,nsID const &,void * *) xpcom/glue/nsComponentManagerUtils.cpp
57 nss3.dll PR_Unlock nsprpub/pr/src/threads/combined/prulock.c
58 mozglue.dll arena_malloc_small memory/mozjemalloc/jemalloc.c
59 xul.dll NS_ProcessNextEvent(nsIThread *,bool) xpcom/glue/nsThreadUtils.cpp
60 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate *) ipc/glue/MessagePump.cpp
61 xul.dll MessageLoop::RunHandler() ipc/chromium/src/base/message_loop.cc
62 xul.dll MessageLoop::Run() ipc/chromium/src/base/message_loop.cc
63 xul.dll MessageLoop::MessageLoop(MessageLoop::Type) ipc/chromium/src/base/message_loop.cc
64 xul.dll nsThread::ThreadFunc(void *) xpcom/threads/nsThread.cpp
65 nss3.dll PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c
66 nss3.dll pr_root nsprpub/pr/src/md/windows/w95thred.c
67 msvcr100.dll _callthreadstartex f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c
68 msvcr100.dll _threadstartex f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c
69 kernel32.dll BaseThreadInitThunk
70 ntdll.dll RtlUserThreadStart
71 kernel32.dll BasepReportFault
72 kernel32.dll BasepReportFault
Assignee: nobody → hurley
The crash signature seems to show a spike between 2014031422 and 2014031903, with no crash information from 20140315, which makes the range really broad: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f073b3d6db1f&tochange=3bc3b9e2cd99 Here are the commits between 3/18 and 3/19 in case that helps. There was an upgrade to SQLite during that range. http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=082761b7bc54&tochange=3bc3b9e2cd99
May be related to bug 986577, also a top crash that showed up for the first time with Firefox 31.0a1.
Oops. That was meant for a different bug.
Marco do you think this may be related to the sqlite upgrade? Thanks! This one and bug 981720.
Component: Networking → Storage
Product: Core → Toolkit
I mean, 987228. Too many tabs open today :)
Reassign to me if the seer really is the culprit here, though with the sqlite upgrade, no seer changes in the pushlogs, and crashes with the same signature not involving the seer (for example, https://crash-stats.mozilla.com/report/index/cb6e2e9f-4c4e-4eee-bf61-0ed512140324), I don't *think* that will be the case.
Assignee: hurley → nobody
I will reach out to SQLite support and ask whether they can infer anything from the stacks we got.
The whole SQLite team has been looking into this, but we haven't gotten very far. One concern is that the stack traces don't seem right. I see arena_run_dalloc() calling yy_reduce(), for example. (yy_reduce() is an internal routine in SQLite's parser.) This is clearly impossible, so there must be some stack corruption somewhere or another. Either that, or we are misinterpreting the stack trace in the bug report.
The stack walk is clearly busted. Breakpad reverted to stack-scanning because stackwalking failed, so we really don't know much more than sqlite3VdbeSerialGet was in fact the IP at the time of the crash. dmajor can load one of these minidumps in a debugger when he gets back next week, or I can send someone else a minidump to load into VS/windbg to check out what's going on.
In the meantime, if we limit the analysis to stacks reported for Firefox 31a1 that don't involve yy_reduce, we get some more realistic stacks:

0 sqlite3.dll sqlite3VdbeSerialGet mozilla/db/sqlite3/src/sqlite3.c
1 sqlite3.dll sqlite3VdbeIdxRowid mozilla/db/sqlite3/src/sqlite3.c
2 sqlite3.dll sqlite3VdbeExec mozilla/db/sqlite3/src/sqlite3.c
3 sqlite3.dll sqlite3Step mozilla/db/sqlite3/src/sqlite3.c
4 sqlite3.dll sqlite3_step mozilla/db/sqlite3/src/sqlite3.c
5 xul.dll mozStorageStatement::ExecuteStep(int *) mozilla/storage/src/mozStorageStatement.cpp
6 xul.dll mozStorageStatement::Execute() mozilla/storage/src/mozStorageStatement.cpp
7 xul.dll nsUrlClassifierStore::Expire(unsigned int,unsigned int) mozilla/toolkit/components/url-classifier/src/nsUrlClassifierDBService.cpp
8 xul.dll nsUrlClassifierDBServiceWorker::ProcessResponseLines(int *) mozilla/toolkit/components/url-classifier/src/nsUrlClassifierDBService.cpp
9 xul.dll nsUrlClassifierDBServiceWorker::UpdateStream(nsACString_internal const &) mozilla/toolkit/components/url-classifier/src/nsUrlClassifierDBService.cpp
---
0 nss3.dll sqlite3VdbeSerialGet db/sqlite3/src/sqlite3.c
1 nss3.dll sqlite3VdbeRecordCompare db/sqlite3/src/sqlite3.c
2 nss3.dll vdbeRecordCompareInt db/sqlite3/src/sqlite3.c
3 nss3.dll sqlite3BtreeMovetoUnpacked db/sqlite3/src/sqlite3.c
4 nss3.dll sqlite3VdbeExec db/sqlite3/src/sqlite3.c
5 nss3.dll sqlite3Step db/sqlite3/src/sqlite3.c
6 nss3.dll sqlite3_step db/sqlite3/src/sqlite3.c
7 xul.dll mozilla::storage::Connection::stepStatement(sqlite3_stmt *) storage/src/mozStorageConnection.cpp
8 xul.dll mozilla::storage::Statement::ExecuteStep(bool *) storage/src/mozStorageStatement.cpp
9 xul.dll mozilla::net::Seer::WouldRedirect(mozilla::net::Seer::TopLevelInfo const &,__int64,mozilla::net::Seer::UriInfo &) netwerk/base/src/Seer.cpp
10 xul.dll mozilla::net::Seer::PredictForPageload(mozilla::net::Seer::UriInfo const &,nsMainThreadPtrHandle<nsINetworkSeerVerifier> &,int,mozilla::TimeStamp &) netwerk/base/src/Seer.cpp
11 xul.dll mozilla::net::SeerPredictionEvent::Run() netwerk/base/src/Seer.cpp
---
0 nss3.dll sqlite3VdbeSerialGet db/sqlite3/src/sqlite3.c
1 nss3.dll sqlite3VdbeRecordCompare db/sqlite3/src/sqlite3.c
2 nss3.dll vdbeRecordCompareInt db/sqlite3/src/sqlite3.c
3 nss3.dll sqlite3BtreeMovetoUnpacked db/sqlite3/src/sqlite3.c
4 nss3.dll sqlite3VdbeExec db/sqlite3/src/sqlite3.c
5 nss3.dll sqlite3Step db/sqlite3/src/sqlite3.c
6 nss3.dll sqlite3_step db/sqlite3/src/sqlite3.c
7 nss3.dll sqlite3_exec db/sqlite3/src/sqlite3.c
8 xul.dll mozilla::storage::Connection::executeSql(char const *) storage/src/mozStorageConnection.cpp
9 xul.dll mozilla::storage::Connection::ExecuteSimpleSQL(nsACString_internal const &) storage/src/mozStorageConnection.cpp
10 xul.dll mozilla::net::Seer::ResetInternal() netwerk/base/src/Seer.cpp

These calls are all off the main thread. Most of the crashes do indeed start from Seer, which may make it a good starting point for figuring out which behavior ends in such a crash. I wouldn't stop investigating Seer thread handling yet; it may be useful.
the first stack involving a sqlite3.dll sounds a little bit strange though, so I'd probably ignore it, that leaves us just with crashes starting from Seer...
How accurate are the line numbers on the stack traces? All of the "sane" stack traces are saying that the problem occurs at http://www.sqlite.org/src/artifact/e45e3f9da?ln=3013 which is a very odd place for a crash to occur. Might the problem really be on the previous line?
Does Seer put SQLite in memory-mapped-I/O mode using the "PRAGMA mmap_size=N" statement?
The Seer code is here: http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/Seer.cpp The only pragmas I see there are basic stuff (synchronous, foreign_keys, page_count, page_size); we are not using mmap I/O mode anywhere yet. Regarding comment 13, I think it's possible that in some cases the line numbers are not precise (breakpad may fall back to scanning), though Benjamin can definitely answer that question better than me...
On Windows the instruction/line number for crashes is almost always the line after the crash because that's where the instruction pointer would be if you resumed.
We (the SQLite developers) now have a test case which can cause SQLite to crash two lines prior to the line indicated in the stack traces. We have a preliminary patch checked into the SQLite source tree and are working on further testing and validation. Probably there will be a 3.8.4.2 release of SQLite soon to address this oversight. Even if this is the underlying problem, the question still remains: Why are so many users coming up with corrupt Seer databases? SQLite databases shouldn't be going corrupt. And not just any corrupt database either - the corruption must be carefully crafted in order to provoke the crash. Is there some other problem somewhere else that is corrupting databases? Or, are all of the bug reports coming from the same profile and that profile just happens to have a corrupt database that causes the crash?
Seer uses synchronous = OFF, which may explain why its database corrupts more easily, especially in the Nightly population, where crashes can happen more frequently. I don't know the right answer to the second question. I think some crashes are indeed coming from the same profiles, though I could identify different profiles here (just based off app notes). The crash management team can tell more precisely which kind of filtering we do to avoid multiple submissions.
Thanks for the info that Seer uses synchronous=OFF. Note that synchronous=OFF will only lead to database corruption if the computer crashes or hard-resets or powers-off in the middle of a transaction commit. An application crash should not cause database corruption even with synchronous=OFF. Perhaps all of the bug reports are coming from people pressing the reset button while Seer is committing.
(In reply to Marco Bonardo [:mak] from comment #18) > I don't know the right answer to the second question, I think some crashes > are indeed coming from the same profiles, though I could identify different > profiles here (just based off app notes). The crash management team can tell > more precisely which kind of filtering we do to avoid multiple submissions. We do not send unique profile identifiers in crash reports for privacy reasons. From the "Install Time" in the Reports tab of https://crash-stats.mozilla.com/report/list?product=Firefox&signature=sqlite3VdbeSerialGet you can get a feel of the different installations though, and the "Crashes per Install" section of the Signature Summary tab uses that to proxy the numbers of different installations seeing the crash (note that the same profile updating to a new Nightly build bumps the install time and therefore is counted as two installations, and two profiles using the exact same install of a build are counted as a single installation).
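As a rough illustration of the "Crashes per Install" proxy described above (this is not Socorro's code; the record fields and sample values are made up for the example), grouping reports by install time looks something like this:

```python
from collections import Counter

def crashes_per_install(reports):
    """Count crash reports per install time.

    Install time stands in for "one installation" because no profile
    identifier is sent. As noted in the comment above, this over-counts a
    profile that updates to a new build and under-counts two profiles that
    share one install.
    """
    return Counter(report["install_time"] for report in reports)

# Hypothetical sample reports; field names and values are illustrative only.
reports = [
    {"install_time": "2014-03-19 10:02:11", "signature": "sqlite3VdbeSerialGet"},
    {"install_time": "2014-03-19 10:02:11", "signature": "sqlite3VdbeSerialGet"},
    {"install_time": "2014-03-20 08:45:30", "signature": "sqlite3VdbeSerialGet"},
    {"install_time": "2014-03-21 23:10:05", "signature": "sqlite3VdbeSerialGet"},
]

per_install = crashes_per_install(reports)
print(len(per_install), "approximate installations,", len(reports), "crashes")
```

A low installation count relative to the crash count would suggest a small number of affected profiles crashing repeatedly.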
thanks Robert, so it's about 69 "unique" installs in the last 28 days, considered there's one new build per day the number feels low, corruption "may" be a good explanation in this case. It's puzzling if a very specific kind of corruption is needed to cause this crash. Nicholas, I wonder if Seer has any kind of check and recover mode for corrupt databases, since the less secure mode was selected, was that taken into account in the design stage?
(In reply to Marco Bonardo [:mak] from comment #21) > thanks Robert, so it's about 69 "unique" installs in the last 28 days I actually see those 69 in the last 7 days (which is the default view).
fwiw, doesn't seem to change if I select 28 days at the top...
(In reply to Marco Bonardo [:mak] from comment #21) > Nicholas, I wonder if Seer has any kind of check and recover mode for > corrupt databases, since the less secure mode was selected, was that taken > into account in the design stage? There is no check and recover mode (not sure how to implement check, but recover is easy once I have that - just delete everything), but it sounds like it might be useful in at least a few limited circumstances. TBH, I'm not even sure why I chose synchronous=OFF, presumably it was on the recommendation of my local necko sqlite expert, but I have no documentation anywhere of the why. Does using synchronous=OFF reduce the number of fsyncs performed? (Seems likely to me, and that would definitely have been a reason for me using that pragma.)
Synchronous=OFF causes all fsync() calls to be omitted. SQLite uses fsync() (or FlushFileBuffers()) as a write barrier. So normally, everything will work fine without fsync(). But if you take an OS crash, hard-reset, or power-loss while writing, content might get written to oxide out-of-order, meaning that SQLite will be unable to recover the database after reboot.
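For readers who want to see the setting itself, here is a minimal sketch using Python's standard sqlite3 module rather than mozStorage (the file name and helper are hypothetical). It only sets the pragma and reads it back; it does not demonstrate the out-of-order writes, which need an OS crash or power loss to show up.

```python
import os
import sqlite3
import tempfile

# Integer values SQLite reports for the common synchronous levels.
SYNCHRONOUS_LEVELS = {"OFF": 0, "NORMAL": 1, "FULL": 2}

def open_with_synchronous(path, level):
    """Open a database and set PRAGMA synchronous to the given level."""
    conn = sqlite3.connect(path)
    conn.execute(f"PRAGMA synchronous = {SYNCHRONOUS_LEVELS[level]}")
    return conn

# Hypothetical scratch database, not the real Seer file.
db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite")
conn = open_with_synchronous(db_path, "OFF")

# Reading the pragma back returns the integer level (0 means OFF).
current_level = conn.execute("PRAGMA synchronous").fetchone()[0]
print("synchronous level:", current_level)
conn.close()
```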
SQLite version 3.8.4.2 is now available on the SQLite download page (http://www.sqlite.org/download.html). Version 3.8.4.2 differs from 3.8.4.1 by a single line (http://www.sqlite.org/src/fdiff?v1=e45e3f9daf38c5be&v2=714df4e1c82f629d&sbs=1). In version 3.8.4.1, a database with a very particular kind of database corruption might cause a buffer overread which could result in the crash signature seen in this bug. The line added to version 3.8.4.2 will prevent the overread.
(In reply to Nicholas Hurley [:hurley] from comment #24) > There is no check and recover mode (not sure how to implement check PRAGMA integrity_check can do that, and in some cases so can PRAGMA quick_check; both return OK if the database is fine and errors otherwise. The former is _quite_ slow, especially on a db the size of the current netpredictions.sqlite. The latter is fast but doesn't cover index corruption, though I guess in this case it may be enough to detect this specific kind of corruption (just guessing). An alternative that reduces the number of fsyncs while keeping some durability is to use a WAL journal with synchronous=NORMAL. In this case the db should never corrupt, but you may lose recent transactions. It clearly still makes more fsyncs than synchronous=OFF, so it depends on what you expect. Usually, when you can rebuild your data at any time, you don't care about durability, but you have the additional burden of checking that the db is sane. In other cases a compromise like WAL may be better. (In reply to D. Richard Hipp from comment #26) > SQLite version 3.8.4.2 is now available on the SQLite download page > (http://www.sqlite.org/download.html). Thanks, I'm going to file a bug for the update; we can then check whether crash stats go down as expected.
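A minimal sketch of the two options mentioned above, written against Python's standard sqlite3 module instead of mozStorage. The helper names and the scratch table are hypothetical; the pragmas are standard SQLite.

```python
import os
import sqlite3
import tempfile

def is_database_sane(conn, quick=True):
    """Run quick_check (fast) or integrity_check (thorough, slow).

    Both pragmas return a single row containing 'ok' when no problem is found.
    """
    pragma = "quick_check" if quick else "integrity_check"
    rows = conn.execute(f"PRAGMA {pragma}").fetchall()
    return rows == [("ok",)]

def enable_wal(conn):
    """Switch to a WAL journal with synchronous=NORMAL; returns the journal mode."""
    mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    conn.execute("PRAGMA synchronous = NORMAL")
    return mode

# Hypothetical scratch database, not the real netpredictions.sqlite.
db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite")
conn = sqlite3.connect(db_path)
journal_mode = enable_wal(conn)
conn.execute("CREATE TABLE hits (uri TEXT PRIMARY KEY, n INTEGER)")
conn.execute("INSERT INTO hits VALUES ('http://example.com/', 1)")
conn.commit()

sane_quick = is_database_sane(conn, quick=True)
sane_full = is_database_sane(conn, quick=False)
sync_level = conn.execute("PRAGMA synchronous").fetchone()[0]
print(journal_mode, sane_quick, sane_full, sync_level)
conn.close()
```

Whether a startup check is worth its I/O cost depends on the database size; the WAL route avoids the need for the check at the price of some extra fsyncs.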
(In reply to Marco Bonardo [:mak] from comment #23) > fwiw, doesn't seem to change if I select 28 days at the top... Sure, it only started to appear within the last 7 days, actually the first day they were recorded is 2014-03-19.
For detecting problems, you can also just look for SQLITE_CORRUPT/NS_ERROR_FILE_CORRUPTED errors to come back from your requests and nuke the database at that point. No need to incur startup I/O costs for a rarely expected case.
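A sketch of that "detect and nuke" approach using Python's standard sqlite3 module (not the mozStorage API; the schema, file name, and helper names are hypothetical). It assumes CPython's behavior of reporting SQLITE_CORRUPT and SQLITE_NOTADB as a plain sqlite3.DatabaseError rather than one of its subclasses.

```python
import os
import sqlite3
import tempfile

# Hypothetical schema standing in for whatever the component stores.
SCHEMA = "CREATE TABLE IF NOT EXISTS hits (uri TEXT PRIMARY KEY, n INTEGER)"

def _open(path):
    """Open the database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(SCHEMA)
        conn.commit()
    except sqlite3.Error:
        conn.close()
        raise
    return conn

def open_or_reset(path):
    """Open the database; if SQLite reports corruption, delete and recreate it.

    Returns (connection, was_reset). Subclasses such as OperationalError
    (locked database, SQL errors) are re-raised, since they do not indicate
    a damaged file.
    """
    try:
        return _open(path), False
    except sqlite3.DatabaseError as exc:
        if type(exc) is not sqlite3.DatabaseError:
            raise
        os.remove(path)  # the data can be rebuilt, so start from scratch
        return _open(path), True

# Simulate a damaged file by writing bytes that are not a SQLite database.
db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite")
with open(db_path, "wb") as f:
    f.write(b"this is not a sqlite database" * 20)

conn, was_reset = open_or_reset(db_path)
row_count = conn.execute("SELECT count(*) FROM hits").fetchone()[0]
print("reset:", was_reset, "rows:", row_count)
conn.close()
```

The same handler could wrap later queries as well, so corruption that only surfaces mid-session is caught without paying for a check at startup.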
Crash Signature: [@ sqlite3VdbeSerialGet] → [@ sqlite3VdbeSerialGet] [@ mozilla::net::Seer::TryPredict(mozilla::net::Seer::QueryType, mozilla::net::Seer::TopLevelInfo const&, __int64, nsMainThreadPtrHandle<nsINetworkSeerVerifier>&, mozilla::TimeStamp&)] [@ mozilla::net::Seer::GetDBFileSize()]
Here's another possibly related crash signature/bug: bug 987248
Crash Signature: [@ sqlite3VdbeSerialGet] [@ mozilla::net::Seer::TryPredict(mozilla::net::Seer::QueryType, mozilla::net::Seer::TopLevelInfo const&, __int64, nsMainThreadPtrHandle<nsINetworkSeerVerifier>&, mozilla::TimeStamp&)] [@ mozilla::net::Seer::GetDBFileSize()] → [@ sqlite3VdbeSerialGet] [@ mozilla::net::Seer::TryPredict(mozilla::net::Seer::QueryType, mozilla::net::Seer::TopLevelInfo const&, __int64, nsMainThreadPtrHandle<nsINetworkSeerVerifier>&, mozilla::TimeStamp&)] [@ mozilla::net::Seer::GetDBFileSize()] [@…
See Also: → 987248
juanb, lizzard, what makes you think that the Firefox 29 crashes could possibly be related to a 31-only regression? Please do NOT add any signatures here that appear before Nightly of 2014-03-19.
Kairo: Hi there. That's not particularly helpful or polite to either Juan or me. I've been doing my best over here to jump in and contribute without any particular information from you or anyone other than this wiki page: https://wiki.mozilla.org/CrashKill/Topcrash Please add more useful tips in that documentation, that would be awesome and would help the whole crashkill/stability team! Thanks.
(In reply to Robert Kaiser (:email@example.com) from comment #31) > juanb, lizzard, what makes you think that the Firefox 29 crashes could > possibly be related to a 31-only regression? > Please do NOT add any signatures here that appear before Nightly of > 2014-03-19. I was looking at the explosiveness report for beta, which had a couple of "Seer" signatures in it, and no bug associated with them. Should I file those separately? https://crash-analysis.mozilla.com/rkaiser/2014-03-25/2014-03-25.firefox.beta.explosiveness.html
(In reply to Liz Henry :lizzard from comment #32) > Kairo: Hi there. That's not particularly helpful or polite to either Juan or > me. Sorry if it came across that way, it was meant as a serious question of what did lead you to believe those signatures were related to the bug here. (In reply to juan becerra [:juanb] from comment #33) > I was looking at the explosiveness report for beta, which had a couple of > "Seer" signatures in it, and no bug associated with them. Should I file > those separately? Yes, please - from what I hear, one (new) bug for all of those signatures is OK for now. It's possible that there might be some relationship in what's triggering the crashes in here and what's causing the 29 crashes, but this bug here is specifically a very recent Nightly regression on a specific place in SQLite code and something the SQLite folks have probably created a patch and update for, so let's track this one separately from the 29 beta regressions.
It looks like the number of crashes fell right after the SQLite upgrade, so I'm calling this fixed by the SQLite 3.8.4.2 upgrade. Thanks very much to the SQLite team for the awesome support. I think the netwerk team should evaluate, in a follow-up bug, changes to handle database corruption more thoroughly in the future. I'm also clearing the needinfo since we don't need the minidump analysis anymore.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
This has dropped off significantly in the last week and is now at #49 accounting for 0.27% in Firefox 31 over the last week with 0 crashes reported against build IDs after the fix landed. I fully expect this to continue to fall off in the coming days. Marking this verified fixed based on crashstats.
Did a bug ever get filed to fix the corruption issue?
Ben - it is now. Bug 993031