Closed
Bug 626953
Opened 14 years ago
Closed 14 years ago
Get orphaned reports into HBase
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
VERIFIED
FIXED
1.7.7
People
(Reporter: alqahira, Assigned: lars)
Details
Attachments
(1 file)
3.15 KB, patch
In the course of trying to do my weekly pre-meeting crash analysis tonight, I discovered that not a single crash--old, new, anything--was available via crash-stats.
*Every* report I tried--for several Camino versions and several Firefox versions--first tried to fetch the archived report, and then eventually ended up failing with "Oh Noes! This archived report could not be located."
Needless to say, you can't really do anything other than superficial browsing of composite reports if you can't look at individual reports to see stacks, sites, modules, etc.
Comment 1•14 years ago
HBase is down and being copied, preparing for the move to a new datacenter this weekend. It should be back up later tonight.
There have been several attempts over the past few days, and I am not sure if a downtime notice was posted this time around; I don't see one from a cursory glance.
I'll post a note in this bug when it's back up, please reopen sooner if this doesn't make sense.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → INVALID
Reporter
Comment 2•14 years ago
(In reply to comment #1)
> There have been several attempts over the past few days, and I am not sure if a
> downtime notice was posted this time around; I don't see one from a cursory
> glance.
For a downtime as extended as this one (going on 24 hours now) to a major system, it would really be nice to get a downtime notice on the IT blog/other relevant blog syndicated to planet ;)
Comment 3•14 years ago
(In reply to comment #2)
> (In reply to comment #1)
> > There have been several attempts over the past few days, and I am not sure if a
> > downtime notice was posted this time around; I don't see one from a cursory
> > glance.
>
> For a downtime as extended as this one (going on 24 hours now) to a major
> system, it would really be nice to get a downtime notice on the IT
> blog/other relevant blog syndicated to planet ;)
It should not be down now, thanks for bringing this up. I'll take a look.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Comment 4•14 years ago
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > There have been several attempts over the past few days, and I am not sure if a
> > > downtime notice was posted this time around; I don't see one from a cursory
> > > glance.
> >
> > For a downtime as extended as this one (going on 24 hours now) to a major
> > system, it would really be nice to get a downtime notice on the IT
> > blog/other relevant blog syndicated to planet ;)
>
> It should not be down now, thanks for bringing this up. I'll take a look.
Smokey, have you tried it? Can you please send me a link to one that's not working?
(In reply to comment #1)
> I'll post a note in this bug when it's back up, please reopen sooner if this
> doesn't make sense.
I did not actually do this, sorry if I kept you hanging.
Reporter
Comment 5•14 years ago
I guess when I checked last night (right before going to sleep), I only checked the crash I sent in on the 18th and then didn't check anything else. :(
I checked a smattering of reports now, and they're up--with the exception of the one I submitted on the 18th (while things *were* down): http://crash-stats.mozilla.com/report/index/e54c77bc-2f78-46f7-b679-5a0fe2110118
Is there a bug tracking getting those came-in-during-HBase-downtime crashes accessible, or are they just lost?
Sorry for the confusion here, and thanks for checking back.
Comment 6•14 years ago
Smokey, two possible factors:
1. We lost all incoming crashes during a 4 hour window on Sunday night. Was this the time when the orphan was submitted?
2. We had some crashes that came in during the HBase downtime and didn't get loaded into HBase immediately afterwards. We have been loading them in
https://bugzilla.mozilla.org/show_bug.cgi?id=627028
and it's basically done. There were a couple of orphans though. Let me see if yours was one of them.
Comment 7•14 years ago
(In reply to comment #6)
>
> 2. We had some crashes that came in during the HBase downtime and didn't get
> loaded into HBase immediately afterwards. We have been loading them in
> https://bugzilla.mozilla.org/show_bug.cgi?id=627028
> and it's basically done. There were a couple of orphans though. Let me see if
> yours was one of them.
Jabba: lars doesn't have access to see these, can you take a look?
Assignee: nobody → jdow
Comment 8•14 years ago
I found it. It is in pm-app-collector03:/opt/local_failed_hbase_crashes/20110118/name/e5/4c/
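(For anyone else digging through the fallback storage: a rough sketch of how a crash ID appears to map to that path, judging only from this one example. The root directory, the "name" component, and the two 2-character radix levels are inferred from comment 8, not taken from the Socorro source.)

import os

def fallback_dir_for(crash_id, submission_date,
                     root="/opt/local_failed_hbase_crashes"):
    # Inferred layout: <root>/<submission date>/name/<first 2 chars>/<next 2 chars>
    return os.path.join(root, submission_date, "name",
                        crash_id[0:2], crash_id[2:4])

print(fallback_dir_for("e54c77bc-2f78-46f7-b679-5a0fe2110118", "20110118"))
# -> /opt/local_failed_hbase_crashes/20110118/name/e5/4c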
Reporter
Comment 9•14 years ago
(In reply to comment #6)
> Smokey, two possible factors:
> 1. We lost all incoming crashes during a 4 hour window on Sunday night. Was
> this the time when the orphan was submitted?
>
> 2. We had some crashes that came in during the HBase downtime and didn't get
> loaded into HBase immediately afterwards. We have been loading them in
> https://bugzilla.mozilla.org/show_bug.cgi?id=627028
> and it's basically done. There were a couple of orphans though. Let me see if
> yours was one of them.
Not sure if comment 8 has answered this or not, but for completeness, the report was submitted at 11:43 PM EST on Tuesday 18 Jan.
Comment 10•14 years ago
So the action on this bug now is to find a way to get orphaned reports into HBase. Renaming bug, assigning to lars.
Assignee: jdow → lars
Severity: blocker → normal
Summary: No individual crash reports are accessible on crash-stats → Get orphaned reports into HBase
Target Milestone: --- → 1.7.7
Comment 11•14 years ago
Script is written, will run on Thursday.
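(The script itself isn't attached to this bug, so purely as an illustration of the approach: walk a collector's local fallback storage, and for each orphaned crash either log it (dry run) or push it to HBase. The file naming, the metadata/dump pairing, and the submit_to_hbase() callable are all assumptions, not the real orphanSubmit.py.)

import json
import os

def iter_orphans(root):
    # Assumed layout: each orphan is a <crash_id>.json metadata file plus a
    # matching <crash_id>.dump minidump somewhere under the fallback root.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".json"):
                crash_id = name[:-len(".json")]
                dump_path = os.path.join(dirpath, crash_id + ".dump")
                if os.path.exists(dump_path):
                    yield crash_id, os.path.join(dirpath, name), dump_path

def submit_all(root, submit_to_hbase, dry_run=True):
    for crash_id, meta_path, dump_path in iter_orphans(root):
        with open(meta_path) as meta_file:
            meta = json.load(meta_file)
        if dry_run:
            print("dry run - pushing", crash_id)
            continue
        with open(dump_path, "rb") as dump_file:
            submit_to_hbase(crash_id, meta, dump_file.read())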
Comment 12•14 years ago
(In reply to comment #11)
> Script is written, will run on Thursday.
Ran "orphanSubmit.py --dryrun=True" on all SJC servers, and counted the occurrence of "dry run - pushing" in the log:
pm-app-collector02: 349164
pm-app-collector03: 311257
pm-app-collector04: 134450
pm-app-collector05: 417001
pm-app-collector06: 133388
Total: 1,345,260 crashes to be submitted
Have not started pushing these to PHX yet. I think we are ready to start.
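(If anyone wants to re-derive those counts, something like this over the submitter's log works; the log path below is a placeholder, not the real location.)

# Count dry-run lines in a submitter log; the path is hypothetical.
count = 0
with open("/path/to/orphanSubmit.log") as log:
    for line in log:
        if "dry run - pushing" in line:
            count += 1
print(count)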
Comment 13•14 years ago
Lars, any suggestion for number of threads to set? Default is 4, these boxes have 8 cores each. We'll probably be blocked on disk and network most of the time, so might be worth experimenting with setting these higher than 8?
Comment 14•14 years ago
(In reply to comment #13)
> Lars, any suggestion for number of threads to set? Default is 4, these boxes
> have 8 cores each. We'll probably be blocked on disk and network most of the
> time, so might be worth experimenting with setting these higher than 8?
We are starting now. numberOfThreads for orphanSubmitter set to 16.
Comment 15•14 years ago
We currently have everything in SJC pointing to one hbase/thrift node in PHX, hp-node10.
Daniel has checked that this node is not hosting anything critical, and jabba is going to remove it from the zeus loadbalancer pool on the PHX side.
Currently pm-app-collector02 (1 of 5 total) is moving crashes over this connection; we are monitoring and holding off on starting any more until the above is done.
Comment 16•14 years ago
With 16 threads, we seem to be saturating the local disk IO, which is the expected bottleneck.
We could tune this further, but we seem to be pushing a reasonable amount of traffic (~2.4 MB/s), so I don't think we're losing too much to context switching, the GIL, etc. (we expect threads to spend most of their time blocking on disk or network IO).
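(For context on why 16 threads on 8 cores is reasonable here: the workers are IO-bound, so the GIL is mostly released while they wait on disk or the network. A generic sketch of that pattern, with do_submit() standing in for the real per-crash work:)

import queue
import threading

def run_submitters(crash_ids, do_submit, number_of_threads=16):
    work = queue.Queue()
    for crash_id in crash_ids:
        work.put(crash_id)

    def worker():
        while True:
            try:
                crash_id = work.get_nowait()
            except queue.Empty:
                return
            # Blocks on disk/network; the GIL is released during the IO waits,
            # so running more threads than cores mostly just keeps more IO in flight.
            do_submit(crash_id)

    threads = [threading.Thread(target=worker) for _ in range(number_of_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()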
Comment 17•14 years ago
pm-app-collector02 through 06 are running.
jabba reminded us that 01 also has some old crashes, from the time it used to be in service (we used it to start rolling out 1.7.6 but then decided not to go forward, so it's been out of production for a while).
Getting pm-app-collector01 ready now.
No longer blocks: 629798
Comment 18•14 years ago
All collectors running; pushing 8-10 MB/s total from MPT collectors -> PHX hp-node10.
Comment 19•14 years ago
The orphan submitter is retrying excessively for some expected failure modes (bug 619695). A lot of these crashes are degenerate (for example, missing or malformed required fields like "submitted_timestamp").
It'd be faster to stop this and fix the bug than continue in this state.
Halting for now; I'll try to get a count of what's been submitted and what's left to go.
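(Not the attached patch, just an illustration of the kind of up-front check that avoids retry loops on degenerate crashes: if the dump is missing or a required field is absent, log and skip rather than retry. The REQUIRED_FIELDS tuple is an assumption; only "submitted_timestamp" is named in this bug.)

import logging
import os

REQUIRED_FIELDS = ("submitted_timestamp",)  # assumed; extend as needed

def is_degenerate(crash_id, meta, dump_path, required=REQUIRED_FIELDS):
    """Return True if this crash should be skipped instead of retried."""
    if not os.path.exists(dump_path):
        logging.warning("skipping %s: missing dump file", crash_id)
        return True
    for field in required:
        if not meta.get(field):
            logging.warning("skipping %s: missing or empty %s", crash_id, field)
            return True
    return False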
Comment 20•14 years ago
After reviewing, testing and applying this patch, started over with "run2". Did a dry-run to get a starting point:
"""
pm-app-collector01: 84709
pm-app-collector02: 256709
pm-app-collector03: 238960
pm-app-collector04: 134448
pm-app-collector05: 353836
pm-app-collector06: 18193
"""
This continued until Friday 11:34 PM Pacific, when hp-node10 stopped responding.
The totals for tonight were:
"""
pm-app-collector01: 51622
pm-app-collector02: 7302
pm-app-collector03: 32078
pm-app-collector04: 29773
pm-app-collector05: 53652
pm-app-collector06: 18193
"""
Leaving it there for tonight.
Comment 21•14 years ago
hp-node10 is back up; started submitting at 2011-01-29 16:33:56 (PST).
Here is a dry-run for "run3", showing what's left to do:
pm-app-collector01: 33910
pm-app-collector02: 250194
pm-app-collector03: 207498
pm-app-collector04: 105472
pm-app-collector05: 300594
pm-app-collector06: 0
Total: 897,668
Comment 22•14 years ago
Finished around 2011-01-29 21:12:40
Here is a final dry-run ("run4"). I looked at the logs, and these all appear to be "degenerate" crashes that are either missing a dump file or a required field (such as "submitted_timestamp"):
pm-app-collector01: 42
pm-app-collector02: 34
pm-app-collector03: 34
pm-app-collector04: 23
pm-app-collector05: 42
pm-app-collector06: 23
I'll archive the logs and post them somewhere; I think we are done otherwise.
Comment 23•14 years ago
(In reply to comment #5)
> I checked a smattering of reports now, and they're up--with the exception of
> the one I submitted on the 18th (while things *were* down):
> http://crash-stats.mozilla.com/report/index/e54c77bc-2f78-46f7-b679-5a0fe2110118
Looks like this one is back now; calling this done.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Updated•13 years ago
Component: Socorro → General
Product: Webtools → Socorro