Closed
Bug 598752
Opened 14 years ago
Closed 14 years ago
Socorro staging should be running 1.7
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: laura, Assigned: jabba)
Details
This is the branch we need to run in staging: https://socorro.googlecode.com/svn/branches/1.7reversion/ This branch was copied from the production tag, and is where development for 1.7.4 is taking place.
Updated•14 years ago
Assignee: server-ops → aravind
Comment 1•14 years ago
Please cc me on bugs like this in the future; thanks.
Assignee | ||
Updated•14 years ago
Assignee: aravind → jdow
Reporter | ||
Comment 2•14 years ago
This is now blocking forward progress; do you think you could take care of it today please Jabba?
Severity: major → critical
Assignee | ||
Comment 3•14 years ago
Yes, it is first on my list this morning as soon as I get to the office.
Assignee | ||
Comment 4•14 years ago
I think I've got everything set up. There is one issue, though. When starting the monitor, it fails with this:

2010-09-28 17:18:42,623 DEBUG - MainThread - creating crashStorePool
2010-09-28 17:18:42,624 INFO - priorityLoopingThread - priorityJobAllocationLoop starting.
2010-09-28 17:18:42,625 INFO - jobCleanupThread - jobCleanupLoop starting.
2010-09-28 17:18:42,625 INFO - jobCleanupThread - beginning jobCleanupLoop cycle.
2010-09-28 17:18:42,626 DEBUG - jobCleanupThread - dealing with completed and failed jobs
2010-09-28 17:18:42,627 DEBUG - MainThread - creating crashStore for MainThread
2010-09-28 17:18:42,627 INFO - connecting to hbase
2010-09-28 17:18:42,627 DEBUG - make_connection, timeout = 10000
2010-09-28 17:18:42,629 DEBUG - connection fails: Could not connect to thrift-socorro-hadoop-stg.mozilla.org:9090
2010-09-28 17:18:42,632 DEBUG - connection fails: Could not connect to thrift-socorro-hadoop-stg.mozilla.org:9090
2010-09-28 17:18:42,632 CRITICAL - MainThread - hbase is gone! hbase is gone!
2010-09-28 17:18:42,632 CRITICAL - MainThread Caught Error: socorro.hbase.hbaseClient.NoConnectionException
2010-09-28 17:18:42,632 DEBUG - priorityLoopingThread - outer detects quit
2010-09-28 17:18:42,633 CRITICAL - the connection is not viable. retries fail: No connection was made to HBase (2 tries): thrift.transport.TTransport.TTransportException-Could not connect to thrift-socorro-hadoop-stg.mozilla.org:9090
2010-09-28 17:18:42,633 DEBUG - jobCleanupThread - starting deletion
2010-09-28 17:18:42,633 INFO - priorityLoopingThread - priorityLoop done.
2010-09-28 17:18:42,635 CRITICAL - trace back follows:
  File "/data/breakpad/processor/socorro/monitor/monitor.py", line 332, in standardJobAllocationLoop
    crashStorage = self.crashStorePool.crashStorage()
  File "/data/breakpad/processor/socorro/collector/crashstorage.py", line 508, in crashStorage
    self[name] = c = self.crashStorageClass(self.config)
  File "/data/breakpad/processor/socorro/collector/crashstorage.py", line 261, in __init__
    self.hbaseConnection = hbaseClient.HBaseConnectionForCrashReports(config.hbaseHost, config.hbasePort, config.hbaseTimeout, logger=self.logger)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 324, in __init__
    mutation,logger)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 247, in __init__
    self.make_connection(timeout=self.timeout)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 266, in make_connection
    self.transport.open()
  File "/data/breakpad/processor/thirdparty/thrift/transport/TTransport.py", line 145, in open
    return self.__trans.open()
  File "/data/breakpad/processor/thirdparty/thrift/transport/TSocket.py", line 89, in open
    raise TTransportException(type=TTransportException.NOT_OPEN, message=message)
2010-09-28 17:18:42,635 CRITICAL - cannot continue - quitting
2010-09-28 17:18:42,635 DEBUG - MainThread - waiting to join.
2010-09-28 17:18:42,650 DEBUG - jobCleanupThread - end of this cleanup iteration
2010-09-28 17:18:42,650 DEBUG - jobCleanupThread - got quit message
2010-09-28 17:18:42,650 INFO - jobCleanupThread - jobCleanupLoop done.
2010-09-28 17:18:42,651 DEBUG - MainThread - calling databaseConnectionPool.cleanup().
2010-09-28 17:18:42,651 DEBUG - MainThread - killing database connections
2010-09-28 17:18:42,651 DEBUG - MainThread - connection jobCleanupThread closed
2010-09-28 17:18:42,651 DEBUG - MainThread - connection priorityLoopingThread closed
2010-09-28 17:18:42,651 INFO - done.

Also, the cronjob that submits jobs to staging is currently not enabled.
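For anyone triaging a similar monitor failure, a rough sketch of pulling only the CRITICAL entries out of a log like the one in comment 4 might look like the following. It assumes the log file has one entry per line and that every entry starts with "<date> <time>,<ms> <LEVEL> - ", which is inferred from the excerpt only; the function name `critical_entries` is made up for illustration and is not part of Socorro.

```python
import re

# Line layout inferred from the excerpt in comment 4 (an assumption):
#   "<YYYY-MM-DD> <HH:MM:SS>,<ms> <LEVEL> - <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) - (?P<message>.*)$"
)


def critical_entries(lines):
    """Return (timestamp, message) pairs for every CRITICAL log line.

    Lines that do not match the assumed layout (for example traceback
    lines) are skipped rather than treated as errors.
    """
    entries = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "CRITICAL":
            entries.append((match.group("timestamp"), match.group("message")))
    return entries


if __name__ == "__main__":
    sample = [
        "2010-09-28 17:18:42,627 INFO - connecting to hbase",
        "2010-09-28 17:18:42,632 CRITICAL - MainThread - hbase is gone! hbase is gone!",
        "2010-09-28 17:18:42,635 CRITICAL - cannot continue - quitting",
    ]
    for timestamp, message in critical_entries(sample):
        print(timestamp, message)
```

Run against the excerpt, this would surface the "hbase is gone!" and "cannot continue - quitting" entries while skipping the DEBUG and INFO noise.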
Reporter | ||
Comment 5•14 years ago
Is thrift-socorro-hadoop-stg.mozilla.org:9090 the correct address? Is it up and accepting connections? Is the network configured appropriately to allow this connection?
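One quick way to answer the "is it up and accepting connections" question from the staging box is a plain TCP connect test. The sketch below is only that: it uses the standard library, does not speak Thrift, and the helper name `can_connect` and the 10-second default timeout are assumptions for illustration.

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This only proves that something is listening and reachable over the
    network; it says nothing about whether the Thrift service behind the
    port is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    host, port = "thrift-socorro-hadoop-stg.mozilla.org", 9090
    state = "reachable" if can_connect(host, port) else "NOT reachable"
    print("%s:%d is %s" % (host, port, state))
```

A False result from the processor host alongside a True result from another host would point at network configuration rather than at the Thrift servers themselves.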
Comment 6•14 years ago
http://cm-hadoop01:9091/thrift/health reports all is fine:
cm-hadoop02:9090 - OK
cm-hadoop03:9090 - OK
cm-hadoop04:9090 - OK
cm-hadoop05:9090 - OK
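If that health output ever needs to be checked by a script rather than by eye, a small parser along these lines could work. This is a sketch only: the "host:port - STATUS" layout is taken from the output quoted in comment 6, the real page may contain more than that, and the function names and the "DOWN" status used in the example are hypothetical.

```python
import re

# Matches entries shaped like "cm-hadoop02:9090 - OK" (layout assumed from
# the output quoted in comment 6).
ENTRY = re.compile(r"(?P<node>[\w.-]+:\d+)\s+-\s+(?P<status>\S+)")


def parse_health_report(text):
    """Map each "host:port" found in `text` to its reported status string."""
    return {m.group("node"): m.group("status") for m in ENTRY.finditer(text)}


def unhealthy_nodes(text):
    """Return the sorted list of nodes whose status is anything other than OK."""
    report = parse_health_report(text)
    return sorted(node for node, status in report.items() if status != "OK")


if __name__ == "__main__":
    # "DOWN" is a made-up status for illustration; only "OK" appears in the bug.
    sample = "cm-hadoop02:9090 - OK cm-hadoop03:9090 - DOWN"
    print(parse_health_report(sample))
    print(unhealthy_nodes(sample))
```

Note that a clean health page only covers the Thrift servers themselves; it does not rule out a DNS or network problem between the processor host and thrift-socorro-hadoop-stg.mozilla.org.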
Assignee | ||
Comment 7•14 years ago
I restarted the monitor this morning and submitted new crashes and it seemed to be working now. Perhaps there was a temporary network issue last night. Should I enable the cronjob to submit crashes to stage regularly?
Comment 8•14 years ago
(In reply to comment #7)
> I restarted the monitor this morning and submitted new crashes and it seemed to
> be working now. Perhaps there was a temporary network issue last night. Should
> I enable the cronjob to submit crashes to stage regularly?
This is fine with WebQA (and needed, in fact), but I don't know of any potential issues with doing so for the developer side. Laura?
Assignee | ||
Comment 9•14 years ago
I've got the new collector staging box up and I *think* everything is working, but I'd still need to do an end-to-end test to make sure everything is working as expected. Could someone post a step-by-step of what to look for in which logs to confirm things are working or not working?
Comment 10•14 years ago
I just tried to submit a crash from khan and I was unable to connect with this URL: http://crash-reports.stage.mozilla.com/submit This is the traditional URL to submit to staging. So there is still something wrong.
Assignee | ||
Comment 11•14 years ago
Ah, indeed, I only set up the SSL portion of that: https://crash-reports.stage.mozilla.com/submit. Is HTTP also required? If so, I'll have to add that config to the netscaler as well.
Comment 12•14 years ago
Probably not required; I'd forgotten that the default in the test submitter was to not use HTTPS. I tried it with HTTPS and it worked. Now going to follow my submission through the system...
Assignee | ||
Comment 13•14 years ago
Lars and Aravind tracked down a few config errors and now staging is working.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #13)
> Lars and Aravind tracked down a few config errors and now the staging is
> working.
But not yet populated with much data; can we set up the cronjob (comment 7), or is there some reason why we wouldn't, Laura/Lars?
Assignee | ||
Comment 15•14 years ago
I just enabled the cronjob as I closed this bug. Let me know if data doesn't start coming in.
Comment 16•14 years ago
Reopened to get staging updated to the latest from this branch: https://socorro.googlecode.com/svn/branches/1.7reversion/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 17•14 years ago
Everything is up to date on that branch.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
(In reply to comment #17)
> Everything is up to date on that branch.
Justin, is the cron re-enabled too? If so, I'll wait a while for data to show up before reopening. Thx.
Assignee | ||
Comment 19•14 years ago
Yes, the cron is enabled and I removed a stuck lock file that was preventing it from running. I'm watching it right now to make sure it will run.
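A stuck lock file like the one mentioned above can be spotted automatically by comparing its age against how long the job should reasonably take. The following is a minimal sketch under stated assumptions: the bug does not say where the lock lives or how the cron job creates it, so the path and the one-hour threshold in the example are placeholders, and `lock_is_stale` is a hypothetical helper.

```python
import os
import time


def lock_is_stale(lock_path, max_age_seconds):
    """Return True if `lock_path` exists and was last modified more than
    `max_age_seconds` ago; return False if it is fresh or does not exist.
    """
    try:
        age = time.time() - os.path.getmtime(lock_path)
    except FileNotFoundError:
        return False
    return age > max_age_seconds


if __name__ == "__main__":
    # Placeholder path and threshold; neither value comes from the bug.
    path = "/tmp/example-submitter.lock"
    if lock_is_stale(path, max_age_seconds=3600):
        print("stale lock file, the job is probably stuck: %s" % path)
```

This only reports the condition; deleting the lock automatically is riskier, since an age check alone cannot tell a hung job from a slow one.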
Assignee | ||
Comment 20•14 years ago
It looks like it is definitely running consistently now.
In the past, I'm pretty sure we've had staging look pretty much like prod; do I need to file a separate bug to get a dump, or something? http://crash-stats.stage.mozilla.com/products/Firefox is pretty sparse, with only http://crash-stats.stage.mozilla.com/topcrasher/byurl/Firefox/3.6.9 having even one piece of data.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 22•14 years ago
This is finally working properly.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Seems to be fine; thanks. Verified FIXED on http://crash-stats.stage.mozilla.com/products/Firefox.
Status: RESOLVED → VERIFIED
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard