Closed Bug 598752 Opened 14 years ago Closed 14 years ago

Socorro staging should be running 1.7

Categories: mozilla.org Graveyard :: Server Operations (task)
Platform: All / Other
Priority: Not set
Severity: critical
Tracking: (Not tracked)
Status: VERIFIED FIXED
Reporter: laura
Assignee: jabba
This is the branch we need to run in staging:
https://socorro.googlecode.com/svn/branches/1.7reversion/

This branch was copied from the production tag, and is where development for 1.7.4 is taking place.
Assignee: server-ops → aravind
Please cc me on bugs like this in the future; thanks.
Assignee: aravind → jdow
This is now blocking forward progress; do you think you could please take care of it today, Jabba?
Severity: major → critical
Yes, it is first on my list this morning as soon as I get to the office.
I think I've got everything set up. There is one issue, though. When starting the monitor, it fails with this:

2010-09-28 17:18:42,623 DEBUG - MainThread - creating crashStorePool
2010-09-28 17:18:42,624 INFO - priorityLoopingThread - priorityJobAllocationLoop starting.
2010-09-28 17:18:42,625 INFO - jobCleanupThread - jobCleanupLoop starting.
2010-09-28 17:18:42,625 INFO - jobCleanupThread - beginning jobCleanupLoop cycle.
2010-09-28 17:18:42,626 DEBUG - jobCleanupThread - dealing with completed and failed jobs
2010-09-28 17:18:42,627 DEBUG - MainThread - creating crashStore for MainThread
2010-09-28 17:18:42,627 INFO - connecting to hbase
2010-09-28 17:18:42,627 DEBUG - make_connection, timeout = 10000
2010-09-28 17:18:42,629 DEBUG - connection fails: Could not connect to thrift-socorro-hadoop-stg.mozilla.org:9090
2010-09-28 17:18:42,632 DEBUG - connection fails: Could not connect to thrift-socorro-hadoop-stg.mozilla.org:9090
2010-09-28 17:18:42,632 CRITICAL - MainThread - hbase is gone! hbase is gone!
2010-09-28 17:18:42,632 CRITICAL - MainThread Caught Error: socorro.hbase.hbaseClient.NoConnectionException
2010-09-28 17:18:42,632 DEBUG - priorityLoopingThread - outer detects quit
2010-09-28 17:18:42,633 CRITICAL - the connection is not viable.  retries fail: No connection was made to HBase (2 tries): thrift.transport.TTransport.TTransportException-Could not connect to thrift-socorro-hadoop-stg.mozilla.org:9090
2010-09-28 17:18:42,633 DEBUG - jobCleanupThread - starting deletion
2010-09-28 17:18:42,633 INFO - priorityLoopingThread - priorityLoop done.
2010-09-28 17:18:42,635 CRITICAL - trace back follows:
  File "/data/breakpad/processor/socorro/monitor/monitor.py", line 332, in standardJobAllocationLoop
    crashStorage = self.crashStorePool.crashStorage()
  File "/data/breakpad/processor/socorro/collector/crashstorage.py", line 508, in crashStorage
    self[name] = c = self.crashStorageClass(self.config)
  File "/data/breakpad/processor/socorro/collector/crashstorage.py", line 261, in __init__
    self.hbaseConnection = hbaseClient.HBaseConnectionForCrashReports(config.hbaseHost, config.hbasePort, config.hbaseTimeout, logger=self.logger)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 324, in __init__
    mutation,logger)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 247, in __init__
    self.make_connection(timeout=self.timeout)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 266, in make_connection
    self.transport.open()
  File "/data/breakpad/processor/thirdparty/thrift/transport/TTransport.py", line 145, in open
    return self.__trans.open()
  File "/data/breakpad/processor/thirdparty/thrift/transport/TSocket.py", line 89, in open
    raise TTransportException(type=TTransportException.NOT_OPEN, message=message)

2010-09-28 17:18:42,635 CRITICAL - cannot continue - quitting
2010-09-28 17:18:42,635 DEBUG - MainThread - waiting to join.
2010-09-28 17:18:42,650 DEBUG - jobCleanupThread - end of this cleanup iteration
2010-09-28 17:18:42,650 DEBUG - jobCleanupThread - got quit message
2010-09-28 17:18:42,650 INFO - jobCleanupThread - jobCleanupLoop done.
2010-09-28 17:18:42,651 DEBUG - MainThread - calling databaseConnectionPool.cleanup().
2010-09-28 17:18:42,651 DEBUG - MainThread - killing database connections
2010-09-28 17:18:42,651 DEBUG - MainThread - connection jobCleanupThread closed
2010-09-28 17:18:42,651 DEBUG - MainThread - connection priorityLoopingThread closed
2010-09-28 17:18:42,651 INFO - done.


Also, the cronjob that submits jobs to staging is currently not enabled.
Is thrift-socorro-hadoop-stg.mozilla.org:9090 the correct address?  Is it up and accepting connections?  Is the network configured appropriately to allow this connection?
http://cm-hadoop01:9091/thrift/health reports all is fine:
cm-hadoop02:9090 - OK
cm-hadoop03:9090 - OK
cm-hadoop04:9090 - OK
cm-hadoop05:9090 - OK
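The "is it up and accepting connections?" question from comment 6 can be answered from the affected host with a plain TCP check against the Thrift port. This is a hedged sketch, not part of Socorro; the hostname and port below are taken from the log above and are otherwise unverified.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection handles DNS resolution and the timeout for us.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False

# Hypothetical usage, with the endpoint from the monitor log:
#   can_connect("thrift-socorro-hadoop-stg.mozilla.org", 9090)
```

If this returns False from the monitor host while the hadoop-side health page reports OK, the problem is most likely network ACLs or DNS between the two, rather than the Thrift service itself.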
I restarted the monitor this morning and submitted new crashes and it seemed to be working now. Perhaps there was a temporary network issue last night. Should I enable the cronjob to submit crashes to stage regularly?
(In reply to comment #7)
> I restarted the monitor this morning and submitted new crashes and it seemed to
> be working now. Perhaps there was a temporary network issue last night. Should
> I enable the cronjob to submit crashes to stage regularly?

This is fine with WebQA (and needed, in fact), but I don't know whether there are any potential issues with doing so on the developer side.

Laura?
I've got the new collector staging box up and I *think* everything is working, but I'd still need to do an end-to-end test to confirm. Could someone post a step-by-step of what to look for, and in which logs, to verify whether things are working?
I just tried to submit a crash from khan and was unable to connect with this URL: http://crash-reports.stage.mozilla.com/submit. This is the traditional URL for submitting to staging, so there is still something wrong.
Ah, indeed, I only set up the SSL portion of that: https://crash-reports.stage.mozilla.com/submit. Is http also required? I'll have to add that config to the Netscaler as well.
Probably not required; I'd forgotten that the default in the test submitter was to not use https. I tried it with https and it worked. Now going to follow my submission through the system...
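For reference, a Breakpad-style crash submission is a multipart/form-data POST carrying annotation fields plus an `upload_file_minidump` file part. The sketch below only builds such a body; the example field names and the staging URL in the comment are assumptions drawn from the thread, not a verified Socorro test submitter.

```python
import uuid

def build_crash_payload(fields, dump_bytes, dump_name="upload_file_minidump"):
    """Build a multipart/form-data body for a Breakpad-style crash report.

    fields: dict of string annotations (e.g. product name, version).
    dump_bytes: raw minidump contents.
    Returns (content_type, body) suitable for an HTTP POST.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            '--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n%s\r\n'
            % (boundary, name, value)
        )
    # The minidump goes in a file part; collectors key off its field name.
    parts.append(
        '--%s\r\nContent-Disposition: form-data; name="%s"; '
        'filename="%s.dmp"\r\nContent-Type: application/octet-stream\r\n\r\n'
        % (boundary, dump_name, dump_name)
    )
    body = ("".join(parts).encode("utf-8")
            + dump_bytes
            + ("\r\n--%s--\r\n" % boundary).encode("utf-8"))
    return "multipart/form-data; boundary=%s" % boundary, body

# Hypothetical usage against the staging collector discussed above:
#   ctype, body = build_crash_payload(
#       {"ProductName": "Firefox", "Version": "3.6.9"}, b"MDMP...")
# then POST body to https://crash-reports.stage.mozilla.com/submit
# with the Content-Type header set to ctype.
```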
Lars and Aravind tracked down a few config errors, and staging is now working.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #13)
> Lars and Aravind tracked down a few config errors and now the staging is
> working.

But not yet populated with much data; can we set up the cronjob (comment 7), or is there some reason why we wouldn't, Laura/Lars?
I just enabled the cronjob as I closed this bug. Let me know if data doesn't start coming in.
reopened to get staging updated to the latest from this branch: https://socorro.googlecode.com/svn/branches/1.7reversion/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Everything is up to date on that branch.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #17)
> Everything is up to date on that branch.

Justin, is the cron re-enabled too?  If so, I'll wait a while for data to show up before reopening.  Thx.
Yes, the cron is enabled and I removed a stuck lock file that was preventing it from running. I'm watching it right now to make sure it will run.
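The stuck lock file described above is a common hazard with plain touch-file locks: if a run crashes, the lock stays behind and silently blocks every later run. One way to avoid that is an OS-level advisory lock, which the kernel releases automatically when the holding process exits. This is a hedged sketch of the pattern, not the actual Socorro cron wrapper:

```python
import fcntl

def run_exclusively(lock_path, fn):
    """Run fn() only if an exclusive flock on lock_path can be taken.

    flock locks are released by the kernel when the process exits,
    so a crashed run cannot leave a stale lock behind.
    Returns fn()'s result, or None if another instance holds the lock.
    """
    with open(lock_path, "w") as lockfile:
        try:
            fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # another run is in progress; skip this cycle
        return fn()

# Hypothetical cron entry point (paths are illustrative):
#   run_exclusively("/var/run/submit-crashes.lock", submit_crashes)
```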
It looks like it is running consistently now.
In the past, I'm pretty sure we've had staging look pretty much like prod; do I need to file a separate bug to get a dump, or something?

http://crash-stats.stage.mozilla.com/products/Firefox is pretty sparse, with only http://crash-stats.stage.mozilla.com/topcrasher/byurl/Firefox/3.6.9 having even one piece of data.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is finally working properly.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Seems to be fine; thanks.

Verified FIXED on http://crash-stats.stage.mozilla.com/products/Firefox.
Status: RESOLVED → VERIFIED
Product: mozilla.org → mozilla.org Graveyard