Bug 1307968 - Stage submitter dies
Status: RESOLVED FIXED (opened 9 years ago, closed 9 years ago)
Component: Socorro :: Infra (task)
Tracking: Not tracked
People: Reporter: jschneider, Assigned: willkg
We're seeing this error. The next time the submitter dies, I'll verify it's the same error it throws right before dying (I can't find the old bug or issue).
2016-10-05 20:23:14,467 DEBUG - submitter_app - - QueuingThread - RabbitMQCrashStorage acking fbb6f5a9-76d4-4ee5-a339-783bc2161005 with delivery_tag 7
2016-10-05 20:23:14,467 DEBUG - submitter_app - - QueuingThread - received (('54660a77-b97e-4453-8c88-8d6fa2161005',), {'finished_func': <functools.partial object at 0x3160260>})
2016-10-05 20:23:14,555 ERROR - submitter_app - - Thread-1 - Error in processing a job
Traceback (most recent call last):
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/socorrolib/lib/threaded_task_manager.py", line 352, in run
    function(*args, **kwargs)  # execute the task
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/socorrolib/app/fetch_transform_save_app.py", line 257, in transform
    self._transform(crash_id)
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/socorro-master-py2.7.egg/socorro/collector/submitter_app.py", line 235, in _transform
    crash_id
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/socorro-master-py2.7.egg/socorro/collector/breakpad_submitter_utilities.py", line 53, in save_raw_crash_with_file_dumps
    submission_response = urllib2.urlopen(request).read().strip()
  File "/usr/lib64/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
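For context: urllib2 raises HTTPError for any non-2xx response, so a single HTTP 500 from the collector surfaces as the exception above. A minimal debugging sketch for inspecting the collector's error response (the URL and payload here are placeholders, not the real collector endpoint):

import urllib2

# Hypothetical debugging snippet; the collector URL and payload are placeholders.
# The HTTPError object doubles as a response, so the error body can be read.
request = urllib2.Request('https://stage-collector.example.com/submit', data='...')
try:
    response_body = urllib2.urlopen(request).read().strip()
except urllib2.HTTPError as exc:
    print 'collector returned HTTP %s: %s' % (exc.code, exc.read()[:200])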
Comment 1•9 years ago
It seems to be dead again, according to datadog.
Assignee
Comment 2•9 years ago
Looking at this error, the problem is that the stage collector is kicking up an HTTP 500 and this script doesn't know how to deal with that.
Possibilities:
1. maybe this only happens during deployments?
2. maybe the stage collector isn't stable for reasons? do we have anything that monitors responses from the stage collector?
Regardless, I can fix this. I'll have the stage submitter handle HTTP 500s and retry after a minute or something like that.
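A minimal sketch of that retry idea, assuming the urllib2 call from the traceback above; the function name, attempt count, and delay are illustrative, not actual Socorro code:

import time
import urllib2

# Sketch of a retry wrapper around the urllib2 submission call; the name and
# the backoff values are placeholders rather than the real submitter code.
def submit_with_retry(request, max_attempts=5, delay_seconds=60):
    for attempt in range(1, max_attempts + 1):
        try:
            return urllib2.urlopen(request).read().strip()
        except urllib2.HTTPError as exc:
            # If the collector answers with a 5xx, back off and retry instead
            # of letting the exception bubble up and take out the worker.
            if exc.code >= 500 and attempt < max_attempts:
                time.sleep(delay_seconds)
                continue
            raise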
Assigning this to me.
Adrian: I'd toss a "the submitter is down" issue into another bug. Then someone can restart it and solve that bug.
Assignee: nobody → willkg
Assignee
Updated•9 years ago
Summary: Stage submitter bug → Stage submitter dies when it gets an HTTP 500
Assignee
Comment 3•9 years ago
I added a graph to the Stage Perf dashboard that shows HTTP 200 vs. 500 errors for the stage collector. It's pretty interesting--we get a lot more HTTP 500 errors than I feel comfortable with.
Also, if you look at the last two weeks, the stage submitter spends roughly half its time down.
Looking at the graphs, I see a lot of HTTP 500s, so my "HTTP 500 kills the submitter" theory may not be valid. I'll poke around some more.
Assignee
Comment 4•9 years ago
I poked around some more.
The threaded task manager "handles" this error and spits the above out to the log, but the submitter should continue. I don't think that traceback is related to the cause of the submitter process dying. That makes sense, because other people have looked into this, and if it were that simple, it'd probably be fixed by now.
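To illustrate why that traceback alone shouldn't be fatal, the worker loop in threaded_task_manager follows roughly this pattern (a simplified sketch, not the actual socorrolib code):

import logging

logger = logging.getLogger('submitter_app')

# Simplified sketch of the threaded task manager's worker loop: a failing task
# is logged ("Error in processing a job") and the thread moves on to the next
# task instead of dying.
def worker_loop(task_iter):
    for function, args, kwargs in task_iter:
        try:
            function(*args, **kwargs)  # execute the task
        except Exception:
            logger.exception('Error in processing a job')
            # keep consuming tasks; one failed submission is not fatal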
A while back, I offered to look into the problem more after the submitter got reified. I'll re-up on that offer. Once it's reified, it'll be a lot easier to tinker with and debug than it is in its current state.
Assignee
Comment 5•9 years ago
Making this depend on the reifying bug and fixing the title.
Depends on: 1289466
Summary: Stage submitter dies when it gets an HTTP 500 → Stage submitter dies
Assignee
Comment 6•9 years ago
Earlier this week, I fixed the problem in the collector that was causing the HTTP 500 errors JP was seeing. The stage collector is working much better now.
However, that work didn't affect the reliability of the stage submitter.
Today, JP walked me through the stage submitter and I spent some quality time with it afterwards. I did some cleanup on the node and then rearranged how it runs. Over the course of that work, I reaped dozens of hung sendmail, submitter and other processes. Otherwise, I didn't see any obvious errors or indications of why it's been so flaky lately.
I'm going to keep an eye on the submitter for the next few days to see if it dies again. If it does, I should be in a better spot to see what happened.
Keeping this bug open for a bit.
Status: NEW → ASSIGNED
Comment 7•9 years ago
> I did some cleanup on the node and then rearranged how it runs.
For posterity, could you write down how it is now set up here or elsewhere?
Flags: needinfo?(willkg)
Assignee
Comment 8•9 years ago
I wrote up a README in /home/centos/README.submitter, figuring it's right there for people logging in to the submitter to fiddle with it.
I can copy that to mana, too, if you think that's helpful. Theoretically, it'd go with the infrastructure docs, but I don't think we have those right now.
Comment 9•9 years ago
This works for me. It's enough to be able to follow a pointer to it on the box.
Flags: needinfo?(willkg)
Assignee
Comment 10•9 years ago
I added a section in the mana page for the "Stage submitter" with a rough explanation of what sordid things it does and where the README is for it.
Assignee
Comment 11•9 years ago
As an update, the stage submitter seems to be working well as a cron job based on "tail -f"ing the log and watching the graph in datadog.
I'll keep tabs on it tomorrow. If we haven't had any issues by the end of tomorrow, I'll close this out with the theory being that cleaning up the processes and reworking how it runs "fixed it".
Assignee
Comment 12•9 years ago
We have a monitor on the number of incoming crashes saved to S3 in the -stage environment, which we use to determine whether the stage submitter is working. It was set to notify if it hadn't seen data for 2 minutes, and it raised an alert just now.
I checked the stage submitter and it's running fine--no problems in the logs and it's run every 2 minutes since I fiddled with it earlier.
Because the cron job runs every 2 minutes and Datadog batches data, I figured it's prudent to raise the threshold so it notifies us only if it hasn't seen data for 5 minutes. That covers two cron cycles and seems like a better number for now.
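For reference, a rough sketch of that monitor change using the datadogpy client; the monitor id, API keys, and anything else not mentioned in this comment are placeholders, and only the 5-minute no-data window comes from the text above:

from datadog import initialize, api

# Hypothetical sketch: bump the monitor's no-data window to 5 minutes.
# MONITOR_ID and the keys are placeholders, not the real -stage monitor.
MONITOR_ID = 1234567

initialize(api_key='XXX', app_key='XXX')
api.Monitor.update(
    MONITOR_ID,
    options={
        'notify_no_data': True,
        'no_data_timeframe': 5,  # minutes; covers two 2-minute cron cycles
    },
)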
Assignee
Comment 13•9 years ago
The stage submitter has been submitting for almost 24 hours straight without hiccups. I think it's probably fine. It'll send alerts if it's not fine and we can open new bugs and whatever then.
Closing as FIXIFIED!
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED