Closed Bug 626953 Opened 14 years ago Closed 14 years ago

Get orphaned reports into HBase

Categories: Socorro :: General (task)
Hardware: x86
OS: macOS
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: VERIFIED FIXED
People: Reporter: alqahira; Assigned: lars
Attachments: 1 file

In the course of trying to do my weekly pre-meeting crash analysis tonight, I discovered that not a single crash--old, new, anything--was available via crash-stats. *Every* report I tried--for several Camino versions and several Firefox versions--first tried to fetch the archived report, and then eventually ended up failing with "Oh Noes! This archived report could not be located." Needless to say, you can't really do anything other than superficial browsing of composite reports if you can't look at individual reports to see stacks, sites, modules, etc.
HBase is down and being copied, preparing for the move to a new datacenter this weekend. It should be back up later tonight. There have been several attempts over the past few days, and I am not sure if a downtime notice was posted this time around; I don't see one from a cursory glance. I'll post a note in this bug when it's back up, please reopen sooner if this doesn't make sense.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → INVALID
(In reply to comment #1)
> There have been several attempts over the past few days, and I am not sure
> if a downtime notice was posted this time around; I don't see one from a
> cursory glance.

For a downtime as extended as this one (going on 24 hours now) to a major system, it really would be nice to get a downtime notice on the IT blog/other relevant blog syndicated to planet ;)
(In reply to comment #2)
> (In reply to comment #1)
> > There have been several attempts over the past few days, and I am not sure
> > if a downtime notice was posted this time around; I don't see one from a
> > cursory glance.
>
> For a downtime as extended as this one (going on 24 hours now) to a major
> system, it really would be nice to get a downtime notice on the IT
> blog/other relevant blog syndicated to planet ;)

It should not be down now, thanks for bringing this up. I'll take a look.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > There have been several attempts over the past few days, and I am not
> > > sure if a downtime notice was posted this time around; I don't see one
> > > from a cursory glance.
> >
> > For a downtime as extended as this one (going on 24 hours now) to a major
> > system, it really would be nice to get a downtime notice on the IT
> > blog/other relevant blog syndicated to planet ;)
>
> It should not be down now, thanks for bringing this up. I'll take a look.

Smokey, have you tried it? Can you please send me a link to one that's not working?

(In reply to comment #1)
> I'll post a note in this bug when it's back up, please reopen sooner if this
> doesn't make sense.

I did not actually do this, sorry if I kept you hanging.
I guess when I checked last night (right before going to sleep), I only checked the crash I sent in on the 18th and then didn't check anything else. :(

I checked a smattering of reports now, and they're up--with the exception of the one I submitted on the 18th (while things *were* down):
http://crash-stats.mozilla.com/report/index/e54c77bc-2f78-46f7-b679-5a0fe2110118

Is there a bug tracking getting those came-in-during-HBase-downtime crashes accessible, or are they just lost?

Sorry for the confusion here, and thanks for checking back.
Smokey, two possible factors:

1. We lost all incoming crashes during a 4-hour window on Sunday night. Was this the time when the orphan was submitted?

2. We had some crashes that came in during the HBase downtime that didn't get loaded into HBase immediately afterwards. We have been loading them in https://bugzilla.mozilla.org/show_bug.cgi?id=627028 and it's basically done. There were a couple of orphans, though. Let me see if yours was one of them.
(In reply to comment #6)
> 2. We had some crashes that came in during the HBase downtime that didn't
> get loaded into HBase immediately afterwards. We have been loading them in
> https://bugzilla.mozilla.org/show_bug.cgi?id=627028
> and it's basically done. There were a couple of orphans, though. Let me see
> if yours was one of them.

Jabba: lars doesn't have access to see these, can you take a look?
Assignee: nobody → jdow
I found it. It is in pm-app-collector03:/opt/local_failed_hbase_crashes/20110118/name/e5/4c/
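For context, the directory layout above appears to encode the crash ID into radix subdirectories: "e5/4c" matches the first four hex characters of e54c77bc-2f78-46f7-b679-5a0fe2110118. A minimal sketch of that mapping, assuming a two-level radix; the helper name is hypothetical and this illustrates the observed path, not Socorro's actual storage code:

# Hypothetical helper illustrating the apparent layout of the fallback
# storage path above: a date directory, a "name" branch, and radix
# subdirectories taken from the leading hex pairs of the crash ID.
import os

def orphan_path(root, date_yyyymmdd, crash_id, depth=2):
    # e.g. "e54c77bc-..." at depth 2 yields radix components ["e5", "4c"]
    radix = [crash_id[i * 2:i * 2 + 2] for i in range(depth)]
    return os.path.join(root, date_yyyymmdd, 'name', *radix)

print(orphan_path('/opt/local_failed_hbase_crashes', '20110118',
                  'e54c77bc-2f78-46f7-b679-5a0fe2110118'))
# -> /opt/local_failed_hbase_crashes/20110118/name/e5/4c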
(In reply to comment #6)
> Smokey, two possible factors:
>
> 1. We lost all incoming crashes during a 4-hour window on Sunday night. Was
> this the time when the orphan was submitted?
>
> 2. We had some crashes that came in during the HBase downtime that didn't
> get loaded into HBase immediately afterwards. We have been loading them in
> https://bugzilla.mozilla.org/show_bug.cgi?id=627028
> and it's basically done. There were a couple of orphans, though. Let me see
> if yours was one of them.

Not sure if comment 8 has answered this or not, but for completeness, the report was submitted at 11:43 PM EST on Tuesday 18 Jan.
So the action on this bug now is to find a way to get orphaned reports into HBase. Renaming bug, assigning to lars.
Assignee: jdow → lars
Severity: blocker → normal
Summary: No individual crash reports are accessible on crash-stats → Get orphaned reports into HBase
Target Milestone: --- → 1.7.7
Script is written, will run on Thursday.
(In reply to comment #11)
> Script is written, will run on Thursday.

Ran "orphanSubmit.py --dryrun=True" on all SJC servers, and counted the occurrences of "dry run - pushing" in the log:

pm-app-collector02: 349164
pm-app-collector03: 311257
pm-app-collector04: 134450
pm-app-collector05: 417001
pm-app-collector06: 133388

Total: 1,345,260 crashes to be submitted

Have not started pushing these to PHX yet. I think we are ready to start.
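For reference, a minimal sketch of how a count like this can be taken from a log, assuming a log filename of orphanSubmit.log (equivalent to `grep -c "dry run - pushing" orphanSubmit.log` on each collector):

# Count the "dry run - pushing" marker lines in an orphanSubmit log.
# The log filename is an assumption; the marker string is taken from
# the comment above.
marker = "dry run - pushing"
with open("orphanSubmit.log") as log:
    count = sum(1 for line in log if marker in line)
print(count)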
Lars, any suggestion for the number of threads to set? The default is 4, and these boxes have 8 cores each. We'll probably be blocked on disk and network most of the time, so it might be worth experimenting with setting this higher than 8?
(In reply to comment #13)
> Lars, any suggestion for the number of threads to set? The default is 4, and
> these boxes have 8 cores each. We'll probably be blocked on disk and network
> most of the time, so it might be worth experimenting with setting this
> higher than 8?

We are starting now. numberOfThreads for orphanSubmitter set to 16.
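For background on why a thread count above the core count can help here: each worker spends most of its wall-clock time blocked on disk or network IO, during which Python's GIL is released, so other workers can make progress. A minimal sketch with hypothetical function and variable names (not the actual orphanSubmitter code):

# Hypothetical sketch of IO-bound submission with a thread pool. Threads
# block on disk reads and network writes, and the GIL is released while
# blocked, so 16 threads on an 8-core box can still improve throughput.
from concurrent.futures import ThreadPoolExecutor

def submit_crash(path):
    with open(path, "rb") as f:
        data = f.read()  # blocked on disk here
    # ... POST `data` to the PHX collector; blocked on network here ...
    return path

crash_paths = []  # paths of orphaned dumps, e.g. gathered by the dry run

with ThreadPoolExecutor(max_workers=16) as pool:  # numberOfThreads = 16
    for done in pool.map(submit_crash, crash_paths):
        pass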
We currently have everything in SJC pointing to one hbase/thrift node in PHX, hp-node10. Daniel has checked that this node is not hosting anything critical, and jabba is going to remove it from the zeus load-balancer pool on the PHX side. Currently pm-app-collector02 (1 of 5 total) is moving crashes over this connection; we are monitoring and holding off on starting any more until the above is done.
With 16 threads, we seem to be saturating the local disk IO, which is the expected bottleneck. We could tune this further, but we seem to be pushing a reasonable amount of traffic (~2.4 MB/s), so I don't think we're losing too much to context switching, the GIL, etc. (we expect threads to spend most of their time blocking on disk or network IO).
Blocks: 629798
pm-app-collector02 through 06 are running. jabba reminded us that 01 also has some old crashes, from the time it used to be in service (we used it to start rolling out 1.7.6 but then decided not to go forward, so it's been out of production for a while). Getting pm-app-collector01 ready now.
No longer blocks: 629798
All collectors running; pushing 8-10 MB/s total from MPT collectors -> PHX hp-node10.
The orphan submitter is retrying excessively for some expected failure modes (bug 619695). A lot of these crashes are degenerate (for example, missing or malformed required fields like "submitted_timestamp"). It'd be faster to stop this and fix the bug than continue in this state. Halting for now, I'll try to get a count of what's been submitted and what's left to go.
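For illustration, a minimal sketch of the kind of pre-flight check that would skip degenerate crashes instead of retrying them; the field list, file paths, and function name are assumptions, not the actual fix from bug 619695:

# Hypothetical pre-flight validation: permanently skip (rather than
# retry) crashes that can never be submitted successfully, e.g. a
# missing dump file or missing/malformed required metadata.
import json
import os

REQUIRED_FIELDS = ("submitted_timestamp",)  # assumed minimal set

def is_degenerate(json_path, dump_path):
    if not os.path.exists(dump_path):
        return True  # no dump file at all
    try:
        with open(json_path) as f:
            meta = json.load(f)
    except (IOError, ValueError):
        return True  # unreadable or malformed metadata
    return any(field not in meta for field in REQUIRED_FIELDS)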
After reviewing, testing, and applying this patch, started over with "run2". Did a dry run to get a starting point:

"""
pm-app-collector01: 84709
pm-app-collector02: 256709
pm-app-collector03: 238960
pm-app-collector04: 134448
pm-app-collector05: 353836
pm-app-collector06: 18193
"""

This continued until Friday 11:34 PM Pacific, when hp-node10 stopped responding. The totals for tonight were:

"""
pm-app-collector01: 51622
pm-app-collector02: 7302
pm-app-collector03: 32078
pm-app-collector04: 29773
pm-app-collector05: 53652
pm-app-collector06: 18193
"""

Leaving it there for tonight.
hp-node10 is back up; started submitting at 2011-01-29 16:33:56 (PST). Here is a dry run for "run3", showing what's left to do:

pm-app-collector01: 33910
pm-app-collector02: 250194
pm-app-collector03: 207498
pm-app-collector04: 105472
pm-app-collector05: 300594
pm-app-collector06: 0

Total: 897,668
Finished around 2011-01-29 21:12:40. Here is a final dry run ("run4"); I looked at the logs, and these all appear to be "degenerate" crashes which are either missing a dump file or a required field (such as "submitted_timestamp"):

pm-app-collector01: 42
pm-app-collector02: 34
pm-app-collector03: 34
pm-app-collector04: 23
pm-app-collector05: 42
pm-app-collector06: 23

I'll archive the logs and post them somewhere; I think we are done otherwise.
(In reply to comment #5)
> I checked a smattering of reports now, and they're up--with the exception of
> the one I submitted on the 18th (while things *were* down):
> http://crash-stats.mozilla.com/report/index/e54c77bc-2f78-46f7-b679-5a0fe2110118

Looks like this one is back now; calling this done.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
QA verified.
Status: RESOLVED → VERIFIED
Component: Socorro → General
Product: Webtools → Socorro