Closed Bug 626953 Opened 14 years ago Closed 14 years ago

Get orphaned reports into HBase

Categories: Socorro :: General (task)
Hardware: x86
OS: macOS
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: VERIFIED FIXED
People: Reporter: alqahira; Assigned: lars
Attachments: 1 file

In the course of trying to do my weekly pre-meeting crash analysis tonight, I discovered that not a single crash--old, new, anything--was available via crash-stats. *Every* report I tried--for several Camino versions and several Firefox versions--first tried to fetch the archived report, and then eventually ended up failing with "Oh Noes! This archived report could not be located." Needless to say, you can't really do anything other than superficial browsing of composite reports if you can't look at individual reports to see stacks, sites, modules, etc.
HBase is down and being copied, preparing for the move to a new datacenter this weekend. It should be back up later tonight. There have been several attempts over the past few days, and I am not sure if a downtime notice was posted this time around; I don't see one from a cursory glance. I'll post a note in this bug when it's back up, please reopen sooner if this doesn't make sense.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → INVALID
(In reply to comment #1)
> There have been several attempts over the past few days, and I am not sure
> if a downtime notice was posted this time around; I don't see one from a
> cursory glance.

For a downtime as extended as this one (going on 24 hours now) to a major system, it really would be nice to get a downtime notice on the IT blog/other relevant blog syndicated to planet ;)
(In reply to comment #2)
> (In reply to comment #1)
> > There have been several attempts over the past few days, and I am not sure
> > if a downtime notice was posted this time around; I don't see one from a
> > cursory glance.
>
> For a downtime as extended as this one (going on 24 hours now) to a major
> system, it really would be nice to get a downtime notice on the IT
> blog/other relevant blog syndicated to planet ;)

It should not be down now, thanks for bringing this up. I'll take a look.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > There have been several attempts over the past few days, and I am not
> > > sure if a downtime notice was posted this time around; I don't see one
> > > from a cursory glance.
> >
> > For a downtime as extended as this one (going on 24 hours now) to a major
> > system, it really would be nice to get a downtime notice on the IT
> > blog/other relevant blog syndicated to planet ;)
>
> It should not be down now, thanks for bringing this up. I'll take a look.

Smokey, have you tried it? Can you please send me a link to one that's not working?

(In reply to comment #1)
> I'll post a note in this bug when it's back up, please reopen sooner if this
> doesn't make sense.

I did not actually do this, sorry if I kept you hanging.
I guess when I checked last night (right before going to sleep), I only checked the crash I sent in on the 18th and then didn't check anything else. :(

I checked a smattering of reports now, and they're up--with the exception of the one I submitted on the 18th (while things *were* down):
http://crash-stats.mozilla.com/report/index/e54c77bc-2f78-46f7-b679-5a0fe2110118

Is there a bug tracking getting those came-in-during-HBase-downtime crashes accessible, or are they just lost?

Sorry for the confusion here, and thanks for checking back.
Smokey, two possible factors:

1. We lost all incoming crashes during a 4-hour window on Sunday night. Was this the time when the orphan was submitted?

2. We had some crashes that came in during the HBase downtime that didn't get loaded into HBase immediately afterwards. We have been loading them in https://bugzilla.mozilla.org/show_bug.cgi?id=627028 and it's basically done. There were a couple of orphans, though. Let me see if yours was one of them.
(In reply to comment #6)
> 2. We had some crashes that came in during the HBase downtime that didn't
> get loaded into HBase immediately afterwards. We have been loading them in
> https://bugzilla.mozilla.org/show_bug.cgi?id=627028
> and it's basically done. There were a couple of orphans, though. Let me see
> if yours was one of them.

Jabba: lars doesn't have access to see these, can you take a look?
Assignee: nobody → jdow
I found it. It is in pm-app-collector03:/opt/local_failed_hbase_crashes/20110118/name/e5/4c/
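For context, the directory layout above appears to encode the crash ID into radix subdirectories: "e5/4c" matches the first four hex characters of e54c77bc-2f78-46f7-b679-5a0fe2110118. A minimal sketch of that mapping, assuming a two-level radix; the helper name is hypothetical and this illustrates the observed path, not Socorro's actual storage code:

# Hypothetical helper illustrating the apparent layout of the fallback
# storage path above: a date directory, a "name" branch, and radix
# subdirectories taken from the leading hex pairs of the crash ID.
import os

def orphan_path(root, date_yyyymmdd, crash_id, depth=2):
    # e.g. "e54c77bc-..." at depth 2 yields radix components ["e5", "4c"]
    radix = [crash_id[i * 2:i * 2 + 2] for i in range(depth)]
    return os.path.join(root, date_yyyymmdd, 'name', *radix)

print(orphan_path('/opt/local_failed_hbase_crashes', '20110118',
                  'e54c77bc-2f78-46f7-b679-5a0fe2110118'))
# -> /opt/local_failed_hbase_crashes/20110118/name/e5/4c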
(In reply to comment #6)
> Smokey, two possible factors:
>
> 1. We lost all incoming crashes during a 4-hour window on Sunday night. Was
> this the time when the orphan was submitted?
>
> 2. We had some crashes that came in during the HBase downtime that didn't
> get loaded into HBase immediately afterwards. We have been loading them in
> https://bugzilla.mozilla.org/show_bug.cgi?id=627028
> and it's basically done. There were a couple of orphans, though. Let me see
> if yours was one of them.

Not sure if comment 8 has answered this or not, but for completeness, the report was submitted at 11:43 PM EST on Tuesday 18 Jan.
So the action on this bug now is to find a way to get orphaned reports into HBase. Renaming bug, assigning to lars.
Assignee: jdow → lars
Severity: blocker → normal
Summary: No individual crash reports are accessible on crash-stats → Get orphaned reports into HBase
Target Milestone: --- → 1.7.7
Script is written, will run on Thursday.
(In reply to comment #11)
> Script is written, will run on Thursday.

Ran "orphanSubmit.py --dryrun=True" on all SJC servers, and counted the occurrences of "dry run - pushing" in the log:

pm-app-collector02: 349164
pm-app-collector03: 311257
pm-app-collector04: 134450
pm-app-collector05: 417001
pm-app-collector06: 133388

Total: 1,345,260 crashes to be submitted

Have not started pushing these to PHX yet. I think we are ready to start.
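For reference, a minimal sketch of how a count like this can be taken from a log, assuming a log filename of orphanSubmit.log (equivalent to `grep -c "dry run - pushing" orphanSubmit.log` on each collector):

# Count the "dry run - pushing" marker lines in an orphanSubmit log.
# The log filename is an assumption; the marker string is taken from
# the comment above.
marker = "dry run - pushing"
with open("orphanSubmit.log") as log:
    count = sum(1 for line in log if marker in line)
print(count)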
Lars, any suggestion for the number of threads to set? The default is 4, and these boxes have 8 cores each. We'll probably be blocked on disk and network most of the time, so it might be worth experimenting with setting this higher than 8?
(In reply to comment #13)
> Lars, any suggestion for the number of threads to set? The default is 4, and
> these boxes have 8 cores each. We'll probably be blocked on disk and network
> most of the time, so it might be worth experimenting with setting this
> higher than 8?

We are starting now. numberOfThreads for orphanSubmitter set to 16.
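For background on why a thread count above the core count can help here: each worker spends most of its wall-clock time blocked on disk or network IO, during which Python's GIL is released, so other workers can make progress. A minimal sketch with hypothetical function and variable names (not the actual orphanSubmitter code):

# Hypothetical sketch of IO-bound submission with a thread pool. Threads
# block on disk reads and network writes, and the GIL is released while
# blocked, so 16 threads on an 8-core box can still improve throughput.
from concurrent.futures import ThreadPoolExecutor

def submit_crash(path):
    with open(path, "rb") as f:
        data = f.read()  # blocked on disk here
    # ... POST `data` to the PHX collector; blocked on network here ...
    return path

crash_paths = []  # paths of orphaned dumps, e.g. gathered by the dry run

with ThreadPoolExecutor(max_workers=16) as pool:  # numberOfThreads = 16
    for done in pool.map(submit_crash, crash_paths):
        pass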
We currently have everything in SJC pointing to one hbase/thrift node in PHX, hp-node10. Daniel has checked that this node is not hosting anything critical, and jabba is going to remove it from the zeus load-balancer pool on the PHX side. Currently pm-app-collector02 (1 of 5 total) is moving crashes over this connection; we are monitoring and holding off on starting any more until the above is done.
With 16 threads, we seem to be saturating the local disk IO, which is the expected bottleneck. We could tune this further, but we seem to be pushing a reasonable amount of traffic (~2.4 MB/s), so I don't think we're losing too much to context switching, the GIL, etc. (we expect threads to spend most of their time blocking on disk or network IO).
Blocks: 629798
pm-app-collector02 through 06 are running. jabba reminded us that 01 also has some old crashes, from the time it used to be in service (we used it to start rolling out 1.7.6 but then decided not to go forward, so it's been out of production for a while). Getting pm-app-collector01 ready now.
No longer blocks: 629798
All collectors running; pushing 8-10 MB/s total from MPT collectors -> PHX hp-node10.
The orphan submitter is retrying excessively for some expected failure modes (bug 619695). A lot of these crashes are degenerate (for example, missing or malformed required fields like "submitted_timestamp"). It'd be faster to stop this and fix the bug than continue in this state. Halting for now, I'll try to get a count of what's been submitted and what's left to go.
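For illustration, a minimal sketch of the kind of pre-flight check that would skip degenerate crashes instead of retrying them; the field list, file paths, and function name are assumptions, not the actual fix from bug 619695:

# Hypothetical pre-flight validation: permanently skip (rather than
# retry) crashes that can never be submitted successfully, e.g. a
# missing dump file or missing/malformed required metadata.
import json
import os

REQUIRED_FIELDS = ("submitted_timestamp",)  # assumed minimal set

def is_degenerate(json_path, dump_path):
    if not os.path.exists(dump_path):
        return True  # no dump file at all
    try:
        with open(json_path) as f:
            meta = json.load(f)
    except (IOError, ValueError):
        return True  # unreadable or malformed metadata
    return any(field not in meta for field in REQUIRED_FIELDS)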
After reviewing, testing, and applying this patch, started over with "run2". Did a dry run to get a starting point:

"""
pm-app-collector01: 84709
pm-app-collector02: 256709
pm-app-collector03: 238960
pm-app-collector04: 134448
pm-app-collector05: 353836
pm-app-collector06: 18193
"""

This continued until Friday 11:34 PM Pacific, when hp-node10 stopped responding. The totals for tonight were:

"""
pm-app-collector01: 51622
pm-app-collector02: 7302
pm-app-collector03: 32078
pm-app-collector04: 29773
pm-app-collector05: 53652
pm-app-collector06: 18193
"""

Leaving it there for tonight.
hp-node10 is back up; started submitting at 2011-01-29 16:33:56 (PST). Here is a dry run for "run3", showing what's left to do:

pm-app-collector01: 33910
pm-app-collector02: 250194
pm-app-collector03: 207498
pm-app-collector04: 105472
pm-app-collector05: 300594
pm-app-collector06: 0

Total: 897,668
Finished around 2011-01-29 21:12:40. Here is a final dry run ("run4"); I looked at the logs, and these all appear to be "degenerate" crashes which are either missing a dump file or a required field (such as "submitted_timestamp"):

pm-app-collector01: 42
pm-app-collector02: 34
pm-app-collector03: 34
pm-app-collector04: 23
pm-app-collector05: 42
pm-app-collector06: 23

I'll archive the logs and post them somewhere; I think we are done otherwise.
(In reply to comment #5)
> I checked a smattering of reports now, and they're up--with the exception of
> the one I submitted on the 18th (while things *were* down):
> http://crash-stats.mozilla.com/report/index/e54c77bc-2f78-46f7-b679-5a0fe2110118

Looks like this one is back now; calling this done.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
QA verified.
Status: RESOLVED → VERIFIED
Component: Socorro → General
Product: Webtools → Socorro