Closed Bug 1307968 Opened 9 years ago Closed 9 years ago

Stage submitter dies

Categories

(Socorro :: Infra, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jschneider, Assigned: willkg)

References

Details

We're seeing this error. I'll validate it's the same one thrown before it dies next time it dies (can't find the old bug or issue).

2016-10-05 20:23:14,467 DEBUG - submitter_app - - QueuingThread - RabbitMQCrashStorage acking fbb6f5a9-76d4-4ee5-a339-783bc2161005 with delivery_tag 7
2016-10-05 20:23:14,467 DEBUG - submitter_app - - QueuingThread - received (('54660a77-b97e-4453-8c88-8d6fa2161005',), {'finished_func': <functools.partial object at 0x3160260>})
2016-10-05 20:23:14,555 ERROR - submitter_app - - Thread-1 - Error in processing a job
Traceback (most recent call last):
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/socorrolib/lib/threaded_task_manager.py", line 352, in run
    function(*args, **kwargs)  # execute the task
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/socorrolib/app/fetch_transform_save_app.py", line 257, in transform
    self._transform(crash_id)
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/socorro-master-py2.7.egg/socorro/collector/submitter_app.py", line 235, in _transform
    crash_id
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/socorro-master-py2.7.egg/socorro/collector/breakpad_submitter_utilities.py", line 53, in save_raw_crash_with_file_dumps
    submission_response = urllib2.urlopen(request).read().strip()
  File "/usr/lib64/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
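For context on where the exception comes from: urllib2 raises HTTPError for any non-2xx status, so a collector-side 500 surfaces as an exception in the worker thread rather than as a response. A stripped-down sketch of that behavior follows; this is not the real save_raw_crash_with_file_dumps (which builds a multipart POST), and the function name here is illustrative only.

    import urllib2

    def post_crash(url, payload):
        # Illustrative only: urllib2.urlopen() raises HTTPError for any
        # non-2xx status, so an HTTP 500 from the collector comes back as
        # a raised exception instead of a response object.
        request = urllib2.Request(url, data=payload)
        return urllib2.urlopen(request).read().strip()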
It seems to be dead again, according to datadog.
Looking at this error, the problem is that the stage collector is kicking up an HTTP 500 and this script doesn't know how to deal with that. Possibilities:

1. Maybe this only happens during deployments?
2. Maybe the stage collector isn't stable for reasons? Do we have anything that monitors responses from the stage collector?

Regardless, I can fix this. I'll have the stage submitter handle HTTP 500s and retry after a minute or something like that. Assigning this to me.

Adrian: I'd toss a "the submitter is down" issue into another bug. Then someone can restart it and solve that bug.
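As an illustration of the retry approach proposed here (not the change that ultimately resolved the bug, per the later comments), the submission could be wrapped so a 5xx triggers a delayed retry instead of an unhandled exception. Names and parameters below are illustrative, not Socorro code.

    import time
    import urllib2

    def submit_with_retry(request, attempts=5, delay=60):
        # Sketch of the proposed behavior: on a 5xx from the collector,
        # wait a bit and retry instead of letting the exception escape.
        for attempt in range(attempts):
            try:
                return urllib2.urlopen(request).read().strip()
            except urllib2.HTTPError as exc:
                # Re-raise client errors and the final failed attempt.
                if exc.code < 500 or attempt == attempts - 1:
                    raise
                time.sleep(delay)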
Assignee: nobody → willkg
Summary: Stage submitter bug → Stage submitter dies when it gets an HTTP 500
I added a graph to the Stage Perf dashboard that shows HTTP 200 vs. 500 errors for the stage collector. It's pretty interesting--we get a lot more HTTP 500 errors than I feel comfortable with. Also, if you look at the last two weeks, the stage submitter spends roughly half its time down. Looking at the graphs, I see a lot of HTTP 500s, so my "HTTP 500 kills the submitter" theory may not be valid. I'll poke around some more.
I poked around some more. The threaded task manager "handles" this error and spits the above out to the log, but the submitter should continue. I don't think that traceback is related to the cause of the submitter process dying. That makes sense because other people have looked into this, and if it were that simple, it'd probably be fixed by now. A while back, I offered to look into the problem more after the submitter got reified. I re-up on that offer. Once it's reified, it'll be a lot easier to tinker with and debug than it is in its current state.
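The pattern being described is roughly this: the task manager wraps each job in its own try/except, logs the failure, and moves on, so one failed submission shouldn't take the thread down. A simplified sketch, not the actual threaded_task_manager internals:

    import logging

    logger = logging.getLogger('submitter_app')

    def worker_loop(task_queue):
        # Simplified sketch: each task gets its own try/except, so an
        # HTTPError from one crash submission is logged ("Error in
        # processing a job") and the loop continues with the next task.
        while True:
            function, args, kwargs = task_queue.get()
            try:
                function(*args, **kwargs)
            except Exception:
                logger.exception('Error in processing a job')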
Making this depend on the reifying bug and fixing the title.
Depends on: 1289466
Summary: Stage submitter dies when it gets an HTTP 500 → Stage submitter dies
Earlier this week, I fixed the problem that was causing the HTTP 500 errors that JP was seeing that were coming from the collector. The stage collector is working much better now. However, that work didn't affect the reliability of the stage submitter. Today, JP walked me through the stage submitter and I spent some quality time with it afterwards. I did some cleanup on the node and then rearranged how it runs. Over the course of that work, I reaped dozens of hung sendmail, submitter and other processes. Otherwise, I didn't see any obvious errors or indications of why it's been so flaky lately. I'm going to keep an eye on the submitter for the next few days to see if it dies again. If it does, I should be in a better spot to see what happened. Keeping this bug open for a bit.
Status: NEW → ASSIGNED
> I did some cleanup on the node and then rearranged how it runs.

For posterity, could you write down how it is now set up, here or elsewhere?
Flags: needinfo?(willkg)
I wrote up a README in /home/centos/README.submitter figuring it's right there for people logging in to the submitter to fiddle with it. I can copy that to mana, too, if you think that's helpful. Theoretically, it'd go with the infrastructure docs, but I don't think we have those right now.
This works for me. It's enough to be able to follow a pointer to it on the box.
Flags: needinfo?(willkg)
I added a section in the mana page for the "Stage submitter" with a rough explanation of what sordid things it does and where the README is for it.
As an update, the stage submitter seems to be working well as a cron job based on "tail -f"ing the log and watching the graph in datadog. I'll keep tabs on it tomorrow. If we haven't had any issues by the end of tomorrow, I'll close this out with the theory being that cleaning up the processes and reworking how it runs "fixed it".
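For reference, a cron setup like the one described would look roughly like the entry below. The 2-minute interval comes from the following comment; the script path and log location are hypothetical.

    */2 * * * * /home/centos/run-stage-submitter.sh >> /home/centos/submitter.log 2>&1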
We have a monitor on the number of incoming crashes saved to S3 for the -stage environment, which we use to determine whether the stage submitter is working. It was set to notify if it hadn't seen data for 2 minutes, and it raised an alert just now. I checked the stage submitter and it's running fine--no problems in the logs, and it's run every 2 minutes since I fiddled with it earlier. Because the cron job runs every 2 minutes and datadog batches data, I figured it'd be prudent to raise the threshold so it only notifies us if it hasn't seen data for 5 minutes. That covers 2 cron cycles and seems like a better number for now.
The stage submitter has been submitting for almost 24 hours straight without hiccups. I think it's probably fine. It'll send alerts if it's not fine and we can open new bugs and whatever then. Closing as FIXIFIED!
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED