Closed Bug 980086 Opened 12 years ago Closed 8 years ago

self-serve agents' error handling needs more care

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dustin, Unassigned)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2025] )

Attachments

(1 file, 2 obsolete files)

While debugging a while back, I added some invalid messages. They're still circulating:

2014-03-05 14:38:29,457 Channel open
2014-03-05 14:40:04,426 Received {u'action': u'reprioritize', u'body': {u'priority': 4, u'brid': -1}, u'who': u'dmitchell@mozilla.com (testing)'}
2014-03-05 14:40:04,427 Loading masters from https://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-masters.json
2014-03-05 14:40:05,067 Loading branches from https://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-03-05 14:40:05,572 reprioritizing request by dmitchell@mozilla.com (testing) of request -1 to priority 4
2014-03-05 14:40:06,851 No request with id -1, giving up
2014-03-05 14:40:06,852 Error processing message
Traceback (most recent call last):
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 236, in receive_message
    msg['request_id'] = message_data['body']['request_id']
KeyError: 'request_id'
2014-03-05 14:40:06,853 Received {u'action': u'reprioritize', u'body': {u'priority': 2, u'brid': -1}, u'who': u'dmitchell@mozilla.com (testing)'}
2014-03-05 14:40:06,854 reprioritizing request by dmitchell@mozilla.com (testing) of request -1 to priority 2
2014-03-05 14:40:07,023 No request with id -1, giving up
2014-03-05 14:40:07,023 Error processing message
Traceback (most recent call last):
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 236, in receive_message
    msg['request_id'] = message_data['body']['request_id']
KeyError: 'request_id'
2014-03-05 14:40:11,031 Received {u'action': u'reprioritize', u'body': {u'priority': 1, u'brid': -1}, u'who': u'dmitchell@mozilla.com (testing)'}
2014-03-05 14:40:11,032 reprioritizing request by dmitchell@mozilla.com (testing) of request -1 to priority 1
2014-03-05 14:40:11,201 No request with id -1, giving up
2014-03-05 14:40:11,201 Error processing message
Traceback (most recent call last):
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 236, in receive_message
    msg['request_id'] = message_data['body']['request_id']
KeyError: 'request_id'
2014-03-05 14:40:11,202 Received {u'action': u'reprioritize', u'body': {u'priority': 4, u'brid': -1}, u'who': u'dmitchell@mozilla.com (testing)'}
2014-03-05 14:40:11,202 reprioritizing request by dmitchell@mozilla.com (testing) of request -1 to priority 4
2014-03-05 14:40:11,371 No request with id -1, giving up
2014-03-05 14:40:11,372 Error processing message
Traceback (most recent call last):
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 236, in receive_message
    msg['request_id'] = message_data['body']['request_id']
KeyError: 'request_id'

The problem is, every time a self-serve agent starts up, it gets some of these messages, then just leaves them un-acked. So when that agent disconnects, rabbit re-queues the messages and the whole thing starts over again.

Instead, self-serve should log and ack any messages that cannot be processed, preferably sending a response message with error information.
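For illustration only, here is a rough Python sketch of the "log and ack" behaviour proposed above. It is not the agent's actual code: `REQUIRED_BODY_KEYS`, `validate`, `handle`, `RecordingMessage`, and the `process` and `send_reply` callbacks are all hypothetical names, and the validation rules (required keys, non-negative `brid`) are assumptions drawn only from the log excerpt. The one thing assumed about the real message object is that it exposes an `ack()` method.

```python
import logging

log = logging.getLogger("selfserve_agent_sketch")

# Hypothetical: body keys each action needs before it can be processed.
REQUIRED_BODY_KEYS = {"reprioritize": ("brid", "priority")}


class RecordingMessage:
    """Stand-in for a broker message; only counts acks (hypothetical)."""

    def __init__(self):
        self.acked = 0

    def ack(self):
        self.acked += 1


def validate(data):
    """Return an error string if the message can never work, else None."""
    if not isinstance(data, dict) or not isinstance(data.get("body"), dict):
        return "malformed message"
    action, body = data.get("action"), data["body"]
    if action not in REQUIRED_BODY_KEYS:
        return "unknown action %r" % (action,)
    missing = [k for k in REQUIRED_BODY_KEYS[action] if k not in body]
    if missing:
        return "missing body keys: " + ", ".join(missing)
    # Assumption: a negative brid, as in the log above, is never valid.
    if not isinstance(body["brid"], int) or body["brid"] < 0:
        return "invalid brid %r" % (body["brid"],)
    return None


def handle(data, message, process, send_reply):
    """Process valid messages; log, optionally reply to, and ack invalid ones."""
    error = validate(data)
    if error is None:
        process(data)
        message.ack()
        return True
    log.warning("Discarding unprocessable message %r: %s", data, error)
    # Use .get() so a missing request_id cannot raise KeyError here.
    body = data.get("body") if isinstance(data, dict) else None
    request_id = body.get("request_id") if isinstance(body, dict) else None
    if request_id is not None:
        send_reply(request_id, error)
    message.ack()  # acked, so the broker will not re-queue it on disconnect
    return False
```

With this shape, the messages from the log (no `request_id`, `brid` of -1) would be logged and acked once without any reply being attempted.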
Comment on attachment 8386433: fix leaks of VoiceEngines in getUserMedia
I think this is the wrong bug, Randell.
Attachment #8386433 - Attachment is obsolete: true
Attached patch ss.diff (obsolete)
Something like this maybe?
Comment on attachment 8390837: ss.diff
That's a very specific fix for these "trapped" messages. I think we could address this specific issue much more directly by checking for and discarding such messages before calling action_func.

More generally, though, we need a way to distinguish "this will never work" errors (which should be acked and, if a request_id is present, replied to), transient failures (which should result in a msg.requeue() followed by a brief pause), connection errors (which should be raised to the ReliableConsumer for reconnect), and other Python exceptions (which shouldn't be handled at all). The code's doing a pretty terrible job of all of that right now.
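The four cases in this comment could be sketched roughly as follows. This is a hedged illustration, not the patch under review: `PermanentError`, `TransientError`, `StubMessage`, `dispatch`, and the `send_reply` callback are hypothetical names, and Python's built-in `ConnectionError` stands in for whatever connection exception the real AMQP client raises. Only `ack()` and `requeue()` on the message are assumed, as named in the comment.

```python
import time


class PermanentError(Exception):
    """'This will never work' failure (hypothetical name)."""


class TransientError(Exception):
    """Failure that may succeed on retry (hypothetical name)."""


class StubMessage:
    """Stand-in for a broker message; records the calls made on it."""

    def __init__(self):
        self.calls = []

    def ack(self):
        self.calls.append("ack")

    def requeue(self):
        self.calls.append("requeue")


def dispatch(message, data, action_func, send_reply, pause=1.0, sleep=time.sleep):
    """Run action_func and settle the message according to the failure class."""
    try:
        action_func(data)
    except PermanentError as exc:
        # Never going to work: ack, and reply only if there is a request_id.
        message.ack()
        request_id = (data.get("body") or {}).get("request_id")
        if request_id is not None:
            send_reply(request_id, str(exc))
        return "acked-error"
    except TransientError:
        # Might work later: put it back and pause briefly before continuing.
        message.requeue()
        sleep(pause)
        return "requeued"
    except ConnectionError:
        # Written out for clarity: left to the reconnecting consumer.
        raise
    # Any other exception is deliberately not caught and propagates.
    message.ack()
    return "acked"
```

The action function decides which class a failure belongs to by choosing which exception to raise; for example, "No request with id -1" would be a `PermanentError` under this scheme.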
Attachment #8390837 - Flags: review-
Assignee: nobody → dustin
Attachment #8390837 - Attachment is obsolete: true
Attachment #8391170 - Flags: review?(rail)
Attachment #8391170 - Flags: review?(rail) → review+
Attachment #8391170 - Flags: checked-in+
Summary: self-serve doesn't ack invalid messages → self-serve agents' error handling needs more care
Let's try to fix this up while moving buildapi into relengapi.
Depends on: 1026110
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2025]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2025] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2032]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2032] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2025]
Assignee: dustin → nobody
Component: Tools → General
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX