Closed Bug 599195 Opened 14 years ago Closed 14 years ago

Correlation reports broken the last couple of days...

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jst, Assigned: aravind)

Details

I noticed this at least a few days ago, and I don't know how long it had been broken before then, but we no longer get correlation reports (extensions, modules, core count, etc.) for crashes on crash-stats.mozilla.com.
This makes it hard to investigate the reasons for many of our crashes, so I'm raising the importance of this bug.
Severity: normal → blocker
Can you provide URLs or product/versions?

This feature depends on the files in http://people.mozilla.com/crash_analysis/20100923/

Example:
This Fx 4.0b6 crash has correlations
http://crash-stats.mozilla.com/report/index/01951b76-7a6e-4353-a446-1ecc42100923
This Fx 3.6.9 crash report does not
http://crash-stats.mozilla.com/report/index/86a3b4d7-0a6a-4d75-9904-42a7d2100917
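
To check whether a given crash has a matching dated directory, the directory URL can be derived from the crash ID. This is only a minimal sketch: it assumes the last six characters of the crash ID encode the date as YYMMDD, which is what the IDs quoted in this bug suggest, and the helper name `crash_analysis_url` is made up for illustration. The directory that matters may also be the day the report was generated rather than the day of the crash.

```python
from datetime import datetime

# Base directory quoted in this bug; dated subdirectories look like 20100923/.
BASE_URL = "http://people.mozilla.com/crash_analysis/"


def crash_analysis_url(crash_id: str) -> str:
    """Return the dated crash_analysis directory URL for a crash ID.

    Assumption: the last six characters of the ID are the date as YYMMDD.
    """
    day = datetime.strptime(crash_id[-6:], "%y%m%d")
    return BASE_URL + day.strftime("%Y%m%d") + "/"


print(crash_analysis_url("01951b76-7a6e-4353-a446-1ecc42100923"))
# prints http://people.mozilla.com/crash_analysis/20100923/
```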

This may be a simple matter of getting the Fx versions you need generated in people.mozilla.com/crash_analysis (IT bug, CC aravind).

A more robust system is under development (backend bug 554373).
Does/should this cover the empty correlations found at http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6.9, or should that be a separate bug?
(In reply to comment #3)
This type of issue has been reported before, but it's not really a bug.

Firefox 3.6.9 correlation reports weren't generated today, nor have they been for a while.
http://people.mozilla.com/crash_analysis/20101012/

The real bug is that we haven't replaced the crash-analysis hack with the Hadoop backend. I think that is in the works.
This still seems broken (3 weeks later). Correlation reports are *extremely* valuable; not having them for long periods of time is not acceptable.
(In reply to comment #5)
If this is for 3.6.9, then we just have to ask IT (or whoever runs the dbaron reports on people) to add that version.

I'm not sure of the bug number or schedule for adding bug 554373 (the proper Hadoop fix) to the frontend.
What I've been seeing is more 4.0 beta stuff than anything else, but that doesn't mean it's a problem only for 4.0 beta; that's just what I've run into many, many times recently. To name a few, have a look at:

http://crash-stats.mozilla.com/report/index/e0b5a37b-7e76-41b4-8a49-020e02100927
http://crash-stats.mozilla.com/report/index/8280a8ff-067d-45c2-9c7d-ee6792100922
Yes, I don't see 4.0b7pre in http://people.mozilla.com/crash_analysis/20101013/. I'll ping IT to find out who can fix this.
@aravind: please add 4.0b7pre and 3.6.9 to http://people.mozilla.com/crash_analysis/20101013/
Assignee: nobody → server-ops
Component: Socorro → Server Operations
Product: Webtools → mozilla.org
QA Contact: socorro → mrz
Version: Trunk → other
Assignee: server-ops → aravind
The problem here is that the HBase connection used to pull out the crashes is flaky. Here is the log from the Python script:

DEBUG Ooid: "2f53ad4d-74d8-43df-8c2c-08aa82101013"
DEBUG MainThread - retry_wrapper: get_processed_json_as_string, try number 1
DEBUG MainThread - retry_wrapper: handled exception, timed out
DEBUG MainThread - retry_wrapper: about to retry connection
DEBUG make_connection, timeout = 5000
DEBUG connection successful
DEBUG MainThread - retry_wrapper: get_processed_json_as_string, try number 2
DEBUG MainThread - retry_wrapper: handled exception, timed out
Traceback (most recent call last):
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 889, in ?
    connection.export_jsonz_tarball_for_ooids(*args)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 493, in export_jsonz_tarball_for_ooids
    json = self.get_processed_json_as_string(ooid)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 143, in f
    result = fn(self, *args, **kwargs)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 401, in get_processed_json_as_string
    listOfRawRows = self.client.getRowWithColumns('crash_reports',row_id,['processed_data:json'])
  File "/data/breakpad/processor/thirdparty/hbase/hbase.py", line 1116, in getRowWithColumns
    return self.recv_getRowWithColumns()
  File "/data/breakpad/processor/thirdparty/hbase/hbase.py", line 1129, in recv_getRowWithColumns
    (fname, mtype, rseqid) = self._iprot.readMessageBegin()
  File "/data/breakpad/processor/thirdparty/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
    sz = self.readI32()
  File "/data/breakpad/processor/thirdparty/thrift/protocol/TBinaryProtocol.py", line 203, in readI32
    buff = self.trans.readAll(4)
  File "/data/breakpad/processor/thirdparty/thrift/transport/TTransport.py", line 58, in readAll
    chunk = self.read(sz-have)
  File "/data/breakpad/processor/thirdparty/thrift/transport/TTransport.py", line 155, in read
    self.__rbuf = StringIO(self.__trans.read(max(sz, self.DEFAULT_BUFFER)))
  File "/data/breakpad/processor/thirdparty/thrift/transport/TSocket.py", line 92, in read
    buff = self.handle.recv(sz)
__main__.FatalException: the connection is not viable.  retries fail: 


I increased the hbase timeout to 60s.
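
For readers unfamiliar with the pattern in the log above (try, time out, reconnect, try again, then give up), here is a minimal sketch of a retry-with-reconnect helper. It is not the code from hbaseClient.py: the names `retry_with_reconnect`, `fetch` and `reconnect` are hypothetical, and the two-attempt limit is an assumption taken from the "try number 1" and "try number 2" lines in the log.

```python
import socket


class FatalException(Exception):
    """Raised when every attempt has failed."""


def retry_with_reconnect(fetch, reconnect, max_tries=2):
    """Call fetch(); on a socket timeout, reconnect and try again.

    fetch and reconnect are caller-supplied callables (hypothetical names).
    After max_tries timeouts, FatalException is raised.
    """
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except socket.timeout as exc:
            last_error = exc
            if attempt < max_tries:
                reconnect()  # open a fresh connection before the next try
    raise FatalException("the connection is not viable; retries failed") from last_error


# Demo: a fetch that times out once, then succeeds after a reconnect.
calls = {"fetch": 0, "reconnect": 0}


def flaky_fetch():
    calls["fetch"] += 1
    if calls["fetch"] == 1:
        raise socket.timeout("timed out")
    return '{"ok": true}'


def count_reconnect():
    calls["reconnect"] += 1


print(retry_with_reconnect(flaky_fetch, count_reconnect))  # prints {"ok": true}
```

With a helper like this, a longer connection timeout makes each attempt more likely to succeed, while the retry limit bounds how long one slow row can stall the whole export.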

Also, one thing to note here is that in the past choffman asked me to generate reports for the two most active Firefox beta versions. Here are the counts from the last 24 hours.

 version  | counts 
----------+--------
 4.0b6    |  25632
 4.0b4    |   1857
 4.0b8pre |   1834
 4.0b5    |   1647
 4.0b1    |   1220
 4.0b2    |   1185
 4.0b3    |    869
 4.0b7pre |    776
 3.1b3    |    706
 3.6b4    |    542
(10 rows)


Did we want to change the script to instead generate reports for specific versions?
When is this process scheduled to run?
(In reply to comment #11)
> When is this process scheduled to run?

5:00 AM.
Increasing the timeout seems to have helped.  I also added a manual override to include 4.0b7pre and 3.6.9 in the generated reports.
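
The selection described here (the most active versions by crash count, plus a manual override list) can be sketched as follows. This is an illustration only, not the actual report script: the function name `versions_to_report` and its parameters are hypothetical, and the counts are copied from the table earlier in this bug.

```python
def versions_to_report(counts, top_n=2, overrides=()):
    """Pick the top_n versions by crash count, then append any overrides.

    counts maps version string -> crash count; overrides are versions
    that must be included even if they are not among the most active.
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    selected = ranked[:top_n]
    for version in overrides:
        if version not in selected:
            selected.append(version)
    return selected


# Counts from the last-24-hours table in this bug.
last_24h_counts = {
    "4.0b6": 25632, "4.0b4": 1857, "4.0b8pre": 1834, "4.0b5": 1647,
    "4.0b1": 1220, "4.0b2": 1185, "4.0b3": 869, "4.0b7pre": 776,
    "3.1b3": 706, "3.6b4": 542,
}

print(versions_to_report(last_24h_counts, overrides=["4.0b7pre", "3.6.9"]))
# prints ['4.0b6', '4.0b4', '4.0b7pre', '3.6.9']
```

Keeping the override list separate from the ranking means a low-volume version such as 4.0b7pre stays in the reports without changing how the most active versions are chosen.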
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard