Closed Bug 987248 Opened 10 years ago Closed 10 years ago

crash in mozilla::net::Seer::WouldRedirect(mozilla::net::Seer::TopLevelInfo const&, __int64, mozilla::net::Seer::UriInfo&)

Categories

(Toolkit :: Storage, defect)

All
Windows NT
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla31
Tracking Status
firefox29 + verified
firefox30 + verified
firefox31 + verified

People

(Reporter: lizzard, Assigned: u408661)

References

Details

(4 keywords)

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-25601d11-954a-4547-96f5-9a6db2140323.
=============================================================
This crash showed up in Firefox 29 with about 500 crashes total, dropped off almost completely in 30, and is showing up as a topcrasher now for Firefox 31. This looks possibly related to the sqlite upgrade in bug 981720 and the other crashes in bug 987228 and bug 986577. 

https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Anet%3A%3ASeer%3A%3AWouldRedirect%28mozilla%3A%3Anet%3A%3ASeer%3A%3ATopLevelInfo+const%26%2C+__int64%2C+mozilla%3A%3Anet%3A%3ASeer%3A%3AUriInfo%26%29&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&hang_type=any&date=2014-03-24+17%3A00%3A00&range_value=1#reports
See Also: → 986577
Liz, I am unsure if this affects 29? In the description, you say that it is the case but you haven't tagged 29 as affected. Thanks
Flags: needinfo?(lhenry)
Oh, thanks Sylvestre! Yes, it is affecting 29 -- I have tagged it now.
Flags: needinfo?(lhenry)
Thanks!
If 29 and 31 are affected, I guess 30 is affected too then :)
Nicholas, since you are the author of some recent changes on this code ( http://hg.mozilla.org/releases/mozilla-beta/rev/4bcbb58917c9 ), does it sound familiar?
Flags: needinfo?(hurley)
Sylvestre - the copies of this in 31 look a lot like the sqlite bug Liz mentioned in comment 0 - the stack is all kinds of messed up (to the point of being worthless), and includes sqlite functions, which is the hallmark of the other crashes, so I'm not particularly worried about the topcrashiness of this in 31.

As for the crashes on 29... I never saw those until now. It claims to be a divide-by-zero error, but unless a particular page has been loaded more than 4 billion times, I'm not seeing any way for the only division in that code path to have zero as the denominator. It should be easy enough to whip up a patch that catches the case where info.loadCount == 0, but I'd like to throw some telemetry in there to see if this is happening very often (perhaps also at the point where we update info.loadCount to see if we're actually overflowing the value).
Flags: needinfo?(hurley)
Attached patch patch (obsolete) — Splinter Review
Proposed patch that (1) protects against div-by-zero in all its possible locations, (2) protects against overflow, and (3) has telemetry for these cases that should never happen.

Running through try: https://tbpl.mozilla.org/?tree=Try&rev=4d9fcf08a7de
Assignee: nobody → hurley
Attachment #8400899 - Flags: review?(mcmanus)
Comment on attachment 8400899 [details] [diff] [review]
patch

Review of attachment 8400899 [details] [diff] [review]:
-----------------------------------------------------------------

nick are you really enamored with using telemetry here? It doesn't seem like a great use of the services infra to me unless we are really just trying to chase down whether a pattern is still happening (and its actionable if true). If you want to keep it I guess I would suggest you track the number of fails rather than the ratio of successes to failures.

Isn't something like DB corruption the more likely explanation here? perhaps when the counts get too big you should just toss the db and start over?
Attachment #8400899 - Flags: review?(mcmanus)
I'm not totally enamored with telemetry, but I did want to see if we were still somehow getting to this state, even with the guards in place. Tracking the number of fails makes sense.

DB corruption is certainly a possibility, though according to bug 986577 comment 29, if that's the case, I should be getting errors back from mozstorage, which would cause me to bail on the process before getting to the point of dividing by zero (presumably... perhaps there are cases where that wouldn't happen).

I went with the guards versus throwing out the db, because it's the simpler solution (which is more reasonable for uplifting to beta) and chances are I'll be throwing out sqlite for the seer in the near future, anyway, so I'm ok with band-aiding over the problem.
Nicholas, do you think this could also help with some of the related net::Seer signatures appearing in Fx29, which are haven't been associated with this bug? Such as:

mozilla::net::Seer::TryPredict(mozilla::net::Seer::QueryType, mozilla::net::Seer::TopLevelInfo const&, __int64, nsMainThreadPtrHandle<nsINetworkSeerVerifier>&, mozilla::TimeStamp&)

mozilla::net::Seer::GetDBFileSize()

If so, I will hold off on filing those after we get some crash data after this lands on beta.
My proposed patch would likely with the signatures in TryPredict (it has the same division as WouldRedirect), assuming they're also divide-by-zero crashes.

The patch would not help in GetDBFileSize, though there is nothing obvious in there that would crash in any circumstances (no division, all pointers are null-checked before use, I just double-checked), so I would go ahead and file that one so I can at least try to figure out what's crashing there.
(In reply to Nicholas Hurley [:hurley] from comment #10)
> My proposed patch would likely with the signatures in TryPredict

make that "would likely HELP with" :)
Attached patch patch v2Splinter Review
New patch, mostly same as the old one (still did not go with starting over - it's slightly more invasive), but with counter telemetry as opposed to proportion telemetry. Pat - do you feel better about this one?
Attachment #8400899 - Attachment is obsolete: true
Attachment #8401537 - Flags: review?(mcmanus)
(In reply to Nicholas Hurley [:hurley] from comment #8)
> DB corruption is certainly a possibility, though according to bug 986577
> comment 29, if that's the case, I should be getting errors back from
> mozstorage, which would cause me to bail on the process before getting to
> the point of dividing by zero (presumably... perhaps there are cases where
> that wouldn't happen).

To be clear, as I understand it SQLite sanity-checks most of what it reads from disk and declares corruption if the stuff is clearly wrong/inconsistent, but it's not like there are checksums/hashes on the records to provide high probability guarantees that the data is never corrupt.  Quickly refreshing my memory from http://www.sqlite.org/fileformat2.html it looks like explicit checksums only exist in the rollback journal and write-ahead log.
also, in some cases the corruption may not yet be reported up to the top level, for example http://www.sqlite.org/src/info/7fa85eaaaf6d211378620d728a759fdfe30a15b0 just fixed one of these cases. Those are bugs that should be reported upstream, if found.
Comment on attachment 8401537 [details] [diff] [review]
patch v2

Review of attachment 8401537 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think "what would the reviewer do" is the right standard. I think this approach is acceptable(ok, admittedly that's a weak form of what would the reviewer do) and we've talked about its pros and cons and if you conclude this is the best thing and you're going stand by it as the author - that's ok by me.
Attachment #8401537 - Flags: review?(mcmanus) → review+
(In reply to Patrick McManus [:mcmanus] from comment #15)
> I don't think "what would the reviewer do" is the right standard.

Agreed. It was less "is this what you would do" and more "are you willing to r+ this?" :)

I'll land as soon as inbound opens.
https://hg.mozilla.org/mozilla-central/rev/f6116de9c5f3
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Nicholas, are you planning an uplift of this patch? Thanks
Flags: needinfo?(hurley)
(In reply to Sylvestre Ledru [:sylvestre] from comment #19)
> Nicholas, are you planning an uplift of this patch? Thanks

Absolutely, I just wanted to make sure it didn't get backed out of m-i before requesting approval. I'm making sure the patch applies cleanly to m-a and m-b right now, then I'll request approval :)
Flags: needinfo?(hurley)
Comment on attachment 8401537 [details] [diff] [review]
patch v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 881804
User impact if declined: crashes according to the bug description
Testing completed (on m-c, etc.): on m-c
Risk to taking this patch (and alternatives if risky): low
String or IDL/UUID changes made by this patch: none
Attachment #8401537 - Flags: approval-mozilla-beta?
Attachment #8401537 - Flags: approval-mozilla-aurora?
Attachment #8401537 - Flags: approval-mozilla-beta?
Attachment #8401537 - Flags: approval-mozilla-beta+
Attachment #8401537 - Flags: approval-mozilla-aurora?
Attachment #8401537 - Flags: approval-mozilla-aurora+
QA Whiteboard: [qa+]
Looks like this is working on 29, we see numbers for both signatures significantly down in overall 29 data and nonexistent in the top 300 in 29.0b7 data.

Florin, is that enough to mark it verified, or do we usually gather more than that?
Crash Signature: [@ mozilla::net::Seer::WouldRedirect(mozilla::net::Seer::TopLevelInfo const&, __int64, mozilla::net::Seer::UriInfo&)] → [@ mozilla::net::Seer::WouldRedirect(mozilla::net::Seer::TopLevelInfo const&, __int64, mozilla::net::Seer::UriInfo&)] [@ mozilla::net::Seer::TryPredict(mozilla::net::Seer::QueryType, mozilla::net::Seer::TopLevelInfo const&, __int64, nsMainThreadPtrHandle…
Flags: needinfo?(florin.mezei)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #23)
> Looks like this is working on 29, we see numbers for both signatures
> significantly down in overall 29 data and nonexistent in the top 300 in
> 29.0b7 data.
> 
> Florin, is that enough to mark it verified, or do we usually gather more
> than that?

Yes, that should be enough to call it Verified. Did you check for 29 only? If yes, could you also check for 30 and 31, and mark them as Verified if OK (mark the whole bug as Verified if verified on 31, and remove "verifyme" keyword if Verified on all 3 versions: 29, 30, 31).
Flags: needinfo?(florin.mezei)
So far, I checked 29 data only, because it's easier to make out clearly. For 30 and 31, I'll need to dig into by-build data, which is a bit hairy with current tooling. But I'll take the QA here as I already did it for 29.
QA Contact: kairo
Doing queries like https://crash-stats.mozilla.com/search/?signature=~mozilla%3A%3Anet%3A%3ASeer%3A%3ATryPredict&version=31.0a1&date=%3E2014-04-01&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform and looking at the "build id" facet, I found those last build dates where those signatures happened on Nightly and Aurora:

mozilla::net::Seer::WouldRedirect
31.0a1 - 2014-04-07
30.0a2 - 2014-04-09

mozilla::net::Seer::TryPredict
31.0a1 - 2014-04-07
30.0a2 - 2014-04-09

Those match the checkin dates, which is expected, no crashes from later builds on either channel.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: