self-serve agents' error handling needs more care

NEW
Unassigned

Status

Release Engineering
Tools
3 years ago
2 years ago

People

(Reporter: dustin, Unassigned)

Tracking

(Depends on: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2025] )

Attachments

(1 attachment, 2 obsolete attachments)

(Reporter)

Description

3 years ago
While debugging a while back, I added some invalid messages.  They're still circulating:

2014-03-05 14:38:29,457 Channel open
2014-03-05 14:40:04,426 Received {u'action': u'reprioritize', u'body': {u'priority': 4, u'brid': -1}, u'who': u'dmitchell@mozilla.com (testing)'}
2014-03-05 14:40:04,427 Loading masters from https://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-masters.json
2014-03-05 14:40:05,067 Loading branches from https://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-03-05 14:40:05,572 reprioritizing request by dmitchell@mozilla.com (testing) of request -1 to priority 4
2014-03-05 14:40:06,851 No request with id -1, giving up
2014-03-05 14:40:06,852 Error processing message
Traceback (most recent call last):
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 236, in receive_message
    msg['request_id'] = message_data['body']['request_id']
KeyError: 'request_id'
2014-03-05 14:40:06,853 Received {u'action': u'reprioritize', u'body': {u'priority': 2, u'brid': -1}, u'who': u'dmitchell@mozilla.com (testing)'}
2014-03-05 14:40:06,854 reprioritizing request by dmitchell@mozilla.com (testing) of request -1 to priority 2
2014-03-05 14:40:07,023 No request with id -1, giving up
2014-03-05 14:40:07,023 Error processing message
Traceback (most recent call last):
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 236, in receive_message
    msg['request_id'] = message_data['body']['request_id']
KeyError: 'request_id'
2014-03-05 14:40:11,031 Received {u'action': u'reprioritize', u'body': {u'priority': 1, u'brid': -1}, u'who': u'dmitchell@mozilla.com (testing)'}
2014-03-05 14:40:11,032 reprioritizing request by dmitchell@mozilla.com (testing) of request -1 to priority 1
2014-03-05 14:40:11,201 No request with id -1, giving up
2014-03-05 14:40:11,201 Error processing message
Traceback (most recent call last):
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 236, in receive_message
    msg['request_id'] = message_data['body']['request_id']
KeyError: 'request_id'
2014-03-05 14:40:11,202 Received {u'action': u'reprioritize', u'body': {u'priority': 4, u'brid': -1}, u'who': u'dmitchell@mozilla.com (testing)'}
2014-03-05 14:40:11,202 reprioritizing request by dmitchell@mozilla.com (testing) of request -1 to priority 4
2014-03-05 14:40:11,371 No request with id -1, giving up
2014-03-05 14:40:11,372 Error processing message
Traceback (most recent call last):
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 236, in receive_message
    msg['request_id'] = message_data['body']['request_id']
KeyError: 'request_id'

The problem is that every time a self-serve agent starts up, it receives some of these messages and then leaves them un-acked.  When that agent disconnects, RabbitMQ re-queues the messages and the whole thing starts over again.

Instead, self-serve should log and ack any messages that cannot be processed, preferably sending a response message with error information.
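
A minimal sketch of the behaviour proposed above, not buildapi's actual code. It assumes the message object exposes an `ack()` method and that responses go out through a `reply` callable; `FakeMessage`, `handle`, and `reply` are hypothetical names used only for illustration. Treating `KeyError` and `ValueError` as unprocessable is also an assumption, chosen to mirror the `KeyError: 'request_id'` in the log.

```python
import logging

log = logging.getLogger("selfserve_sketch")


class FakeMessage:
    """Hypothetical stand-in for a broker message; records whether it was acked."""

    def __init__(self, payload):
        self.payload = payload
        self.acked = False

    def ack(self):
        self.acked = True


def handle(message, action_func, reply):
    """Run action_func on the payload; ack unprocessable messages instead of stranding them."""
    data = message.payload
    try:
        result = action_func(data)
    except (KeyError, ValueError) as exc:
        # Assumed to mean "this can never succeed": log it, report it if the
        # sender can be identified, then ack so the broker does not redeliver.
        log.error("Discarding unprocessable message %r: %r", data, exc)
        request_id = data.get("body", {}).get("request_id")
        if request_id is not None:
            reply({"request_id": request_id, "error": repr(exc)})
        message.ack()
        return None
    message.ack()
    return result
```

With this shape, a message like the `brid: -1` ones in the log is acked once and dropped, with no reply sent because it carries no `request_id`.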
Created attachment 8386433 [details] [diff] [review]
fix leaks of VoiceEngines in getUserMedia

WIP patch
(Reporter)

Comment 2

3 years ago
Comment on attachment 8386433 [details] [diff] [review]
fix leaks of VoiceEngines in getUserMedia

I think this is the wrong bug, Randell.
Attachment #8386433 - Attachment is obsolete: true
Created attachment 8390837 [details] [diff] [review]
ss.diff

Something like this maybe?
(Reporter)

Comment 4

3 years ago
Comment on attachment 8390837 [details] [diff] [review]
ss.diff

That's a very specific fix for these "trapped" messages.  I think we could address this specific issue much more directly by checking for and discarding such messages before calling action_func.

More generally, though, we need a way to distinguish "this will never work" errors (which should be acked and, if a request_id is present, replied to), transient failures (which should result in a msg.requeue() followed by a brief pause), connection errors (which should be raised to the ReliableConsumer for reconnect), and other Python exceptions (which shouldn't be handled at all).  The code's doing a pretty terrible job of all of that right now.
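
The four cases above could be sketched roughly as follows. This is only an illustration under stated assumptions: `PermanentError`, `TransientError`, `FakeMessage`, and `dispatch` are hypothetical names, Python's built-in `ConnectionError` stands in for whatever the broker library raises, and the message is assumed to offer `ack()` and `requeue()`.

```python
import time


class PermanentError(Exception):
    """Hypothetical: the message can never be processed."""


class TransientError(Exception):
    """Hypothetical: processing may succeed if retried later."""


class FakeMessage:
    """Hypothetical stand-in for a broker message with ack() and requeue()."""

    def __init__(self, payload):
        self.payload = payload
        self.acked = False
        self.requeued = False

    def ack(self):
        self.acked = True

    def requeue(self):
        self.requeued = True


def dispatch(message, action_func, reply, pause=1.0, sleep=time.sleep):
    """Handle one message, treating each class of failure differently."""
    data = message.payload
    try:
        action_func(data)
    except PermanentError as exc:
        # "This will never work": ack, and reply only if a request_id is present.
        request_id = data.get("body", {}).get("request_id")
        if request_id is not None:
            reply({"request_id": request_id, "error": str(exc)})
        message.ack()
        return "discarded"
    except TransientError:
        # Put the message back on the queue, then pause briefly before continuing.
        message.requeue()
        sleep(pause)
        return "requeued"
    # ConnectionError and all other exceptions are deliberately not caught here:
    # they propagate to the caller, which may reconnect or let the error surface.
    message.ack()
    return "ok"
```

The design choice is that only errors the handler explicitly classifies are swallowed; anything else leaves the message un-acked and visible to the surrounding consumer.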
Attachment #8390837 - Flags: review-
(Reporter)

Comment 5

3 years ago
Created attachment 8391170 [details] [diff] [review]
bug980086-2.patch
Assignee: nobody → dustin
Attachment #8390837 - Attachment is obsolete: true
Attachment #8391170 - Flags: review?(rail)
Attachment #8391170 - Flags: review?(rail) → review+
(Reporter)

Updated

3 years ago
Attachment #8391170 - Flags: checked-in+
(Reporter)

Updated

3 years ago
Summary: self-serve doesn't ack invalid messages → self-serve agents' error handling needs more care
(Reporter)

Comment 6

3 years ago
Let's try to fix this up while moving buildapi into relengapi.
Depends on: 1026110

Updated

3 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2025]

Updated

3 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2025] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2032]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2032] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2025]
(Reporter)

Updated

2 years ago
Assignee: dustin → nobody