Closed Bug 598752 Opened 14 years ago Closed 14 years ago

Socorro staging should be running 1.7

Categories: mozilla.org Graveyard :: Server Operations (task)
Platform: All / Other
Priority: Not set
Severity: critical
Tracking: (Not tracked)
Status: VERIFIED FIXED
Reporter: laura
Assignee: jabba
This is the branch we need to run in staging:
https://socorro.googlecode.com/svn/branches/1.7reversion/

This branch was copied from the production tag, and is where development for 1.7.4 is taking place.
Assignee: server-ops → aravind
Please cc me on bugs like this in the future; thanks.
Assignee: aravind → jdow
This is now blocking forward progress; do you think you could please take care of it today, Jabba?
Severity: major → critical
Yes, it is first on my list this morning as soon as I get to the office.
I think I've got everything set up. There is one issue, though. When starting the monitor, it fails with this:

2010-09-28 17:18:42,623 DEBUG - MainThread - creating crashStorePool
2010-09-28 17:18:42,624 INFO - priorityLoopingThread - priorityJobAllocationLoop starting.
2010-09-28 17:18:42,625 INFO - jobCleanupThread - jobCleanupLoop starting.
2010-09-28 17:18:42,625 INFO - jobCleanupThread - beginning jobCleanupLoop cycle.
2010-09-28 17:18:42,626 DEBUG - jobCleanupThread - dealing with completed and failed jobs
2010-09-28 17:18:42,627 DEBUG - MainThread - creating crashStore for MainThread
2010-09-28 17:18:42,627 INFO - connecting to hbase
2010-09-28 17:18:42,627 DEBUG - make_connection, timeout = 10000
2010-09-28 17:18:42,629 DEBUG - connection fails: Could not connect to thrift-socorro-hadoop-stg.mozilla.org:9090
2010-09-28 17:18:42,632 DEBUG - connection fails: Could not connect to thrift-socorro-hadoop-stg.mozilla.org:9090
2010-09-28 17:18:42,632 CRITICAL - MainThread - hbase is gone! hbase is gone!
2010-09-28 17:18:42,632 CRITICAL - MainThread Caught Error: socorro.hbase.hbaseClient.NoConnectionException
2010-09-28 17:18:42,632 DEBUG - priorityLoopingThread - outer detects quit
2010-09-28 17:18:42,633 CRITICAL - the connection is not viable.  retries fail: No connection was made to HBase (2 tries): thrift.transport.TTransport.TTransportException-Could not connect to thrift-socorro-hadoop-stg.mozilla.org:9090
2010-09-28 17:18:42,633 DEBUG - jobCleanupThread - starting deletion
2010-09-28 17:18:42,633 INFO - priorityLoopingThread - priorityLoop done.
2010-09-28 17:18:42,635 CRITICAL - trace back follows:
  File "/data/breakpad/processor/socorro/monitor/monitor.py", line 332, in standardJobAllocationLoop
    crashStorage = self.crashStorePool.crashStorage()
  File "/data/breakpad/processor/socorro/collector/crashstorage.py", line 508, in crashStorage
    self[name] = c = self.crashStorageClass(self.config)
  File "/data/breakpad/processor/socorro/collector/crashstorage.py", line 261, in __init__
    self.hbaseConnection = hbaseClient.HBaseConnectionForCrashReports(config.hbaseHost, config.hbasePort, config.hbaseTimeout, logger=self.logger)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 324, in __init__
    mutation,logger)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 247, in __init__
    self.make_connection(timeout=self.timeout)
  File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 266, in make_connection
    self.transport.open()
  File "/data/breakpad/processor/thirdparty/thrift/transport/TTransport.py", line 145, in open
    return self.__trans.open()
  File "/data/breakpad/processor/thirdparty/thrift/transport/TSocket.py", line 89, in open
    raise TTransportException(type=TTransportException.NOT_OPEN, message=message)

2010-09-28 17:18:42,635 CRITICAL - cannot continue - quitting
2010-09-28 17:18:42,635 DEBUG - MainThread - waiting to join.
2010-09-28 17:18:42,650 DEBUG - jobCleanupThread - end of this cleanup iteration
2010-09-28 17:18:42,650 DEBUG - jobCleanupThread - got quit message
2010-09-28 17:18:42,650 INFO - jobCleanupThread - jobCleanupLoop done.
2010-09-28 17:18:42,651 DEBUG - MainThread - calling databaseConnectionPool.cleanup().
2010-09-28 17:18:42,651 DEBUG - MainThread - killing database connections
2010-09-28 17:18:42,651 DEBUG - MainThread - connection jobCleanupThread closed
2010-09-28 17:18:42,651 DEBUG - MainThread - connection priorityLoopingThread closed
2010-09-28 17:18:42,651 INFO - done.


Also, the cronjob that submits jobs to staging is currently not enabled.
Is thrift-socorro-hadoop-stg.mozilla.org:9090 the correct address?  Is it up and accepting connections?  Is the network configured appropriately to allow this connection?
http://cm-hadoop01:9091/thrift/health reports all is fine:
cm-hadoop02:9090 - OK
cm-hadoop03:9090 - OK
cm-hadoop04:9090 - OK
cm-hadoop05:9090 - OK
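The "is it up and accepting connections?" question from comment 6 can be answered from the affected host with a plain TCP check against the Thrift port. This is a hedged sketch, not part of Socorro; the hostname and port below are taken from the log above and are otherwise unverified.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection handles DNS resolution and the timeout for us.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False

# Hypothetical usage, with the endpoint from the monitor log:
#   can_connect("thrift-socorro-hadoop-stg.mozilla.org", 9090)
```

If this returns False from the monitor host while the hadoop-side health page reports OK, the problem is most likely network ACLs or DNS between the two, rather than the Thrift service itself.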
I restarted the monitor this morning and submitted new crashes and it seemed to be working now. Perhaps there was a temporary network issue last night. Should I enable the cronjob to submit crashes to stage regularly?
(In reply to comment #7)
> I restarted the monitor this morning and submitted new crashes and it seemed to
> be working now. Perhaps there was a temporary network issue last night. Should
> I enable the cronjob to submit crashes to stage regularly?

This is fine with WebQA (and needed, in fact), but I don't know whether there are any potential issues with doing so on the developer side.

Laura?
I've got the new collector staging box up and I *think* everything is working, but I'd still need to do an end-to-end test to confirm. Could someone post a step-by-step of what to look for, and in which logs, to verify whether things are working?
I just tried to submit a crash from khan and was unable to connect with this URL: http://crash-reports.stage.mozilla.com/submit. This is the traditional URL for submitting to staging, so there is still something wrong.
Ah, indeed, I only set up the SSL portion of that: https://crash-reports.stage.mozilla.com/submit. Is http also required? I'll have to add that config to the Netscaler as well.
Probably not required; I'd forgotten that the default in the test submitter was to not use https. I tried it with https and it worked. Now going to follow my submission through the system...
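For reference, a Breakpad-style crash submission is a multipart/form-data POST carrying annotation fields plus an `upload_file_minidump` file part. The sketch below only builds such a body; the example field names and the staging URL in the comment are assumptions drawn from the thread, not a verified Socorro test submitter.

```python
import uuid

def build_crash_payload(fields, dump_bytes, dump_name="upload_file_minidump"):
    """Build a multipart/form-data body for a Breakpad-style crash report.

    fields: dict of string annotations (e.g. product name, version).
    dump_bytes: raw minidump contents.
    Returns (content_type, body) suitable for an HTTP POST.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            '--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n%s\r\n'
            % (boundary, name, value)
        )
    # The minidump goes in a file part; collectors key off its field name.
    parts.append(
        '--%s\r\nContent-Disposition: form-data; name="%s"; '
        'filename="%s.dmp"\r\nContent-Type: application/octet-stream\r\n\r\n'
        % (boundary, dump_name, dump_name)
    )
    body = ("".join(parts).encode("utf-8")
            + dump_bytes
            + ("\r\n--%s--\r\n" % boundary).encode("utf-8"))
    return "multipart/form-data; boundary=%s" % boundary, body

# Hypothetical usage against the staging collector discussed above:
#   ctype, body = build_crash_payload(
#       {"ProductName": "Firefox", "Version": "3.6.9"}, b"MDMP...")
# then POST body to https://crash-reports.stage.mozilla.com/submit
# with the Content-Type header set to ctype.
```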
Lars and Aravind tracked down a few config errors, and staging is now working.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #13)
> Lars and Aravind tracked down a few config errors and now the staging is
> working.

But not yet populated with much data; can we set up the cronjob (comment 7), or is there some reason why we wouldn't, Laura/Lars?
I just enabled the cronjob as I closed this bug. Let me know if data doesn't start coming in.
reopened to get staging updated to the latest from this branch: https://socorro.googlecode.com/svn/branches/1.7reversion/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Everything is up to date on that branch.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #17)
> Everything is up to date on that branch.

Justin, is the cron re-enabled too?  If so, I'll wait a while for data to show up before reopening.  Thx.
Yes, the cron is enabled and I removed a stuck lock file that was preventing it from running. I'm watching it right now to make sure it will run.
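The stuck lock file described above is a common hazard with plain touch-file locks: if a run crashes, the lock stays behind and silently blocks every later run. One way to avoid that is an OS-level advisory lock, which the kernel releases automatically when the holding process exits. This is a hedged sketch of the pattern, not the actual Socorro cron wrapper:

```python
import fcntl

def run_exclusively(lock_path, fn):
    """Run fn() only if an exclusive flock on lock_path can be taken.

    flock locks are released by the kernel when the process exits,
    so a crashed run cannot leave a stale lock behind.
    Returns fn()'s result, or None if another instance holds the lock.
    """
    with open(lock_path, "w") as lockfile:
        try:
            fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # another run is in progress; skip this cycle
        return fn()

# Hypothetical cron entry point (paths are illustrative):
#   run_exclusively("/var/run/submit-crashes.lock", submit_crashes)
```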
It looks like it is running consistently now.
In the past, I'm pretty sure we've had staging look pretty much like prod; do I need to file a separate bug to get a dump, or something?

http://crash-stats.stage.mozilla.com/products/Firefox is pretty sparse, with only http://crash-stats.stage.mozilla.com/topcrasher/byurl/Firefox/3.6.9 having even one piece of data.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is finally working properly.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Seems to be fine; thanks.

Verified FIXED on http://crash-stats.stage.mozilla.com/products/Firefox.
Status: RESOLVED → VERIFIED
Product: mozilla.org → mozilla.org Graveyard