pulse_actions - Performance analysis of pulse_actions

RESOLVED FIXED

Status

Testing
General
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: armenzg, Assigned: armenzg)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

It seems that we can't keep up processing messages.
This changed in the last few days.

I've increased the number of workers (right now I tripled our cost), however, I would like to analyze why this is happening and address it.
2 minutes and 39 seconds is unnaceptable:
> Nov 01 13:43:03 pulse-actions app/worker1.1: mozci	 INFO:	 ==> We want to trigger 'Windows 7 32-bit mozilla-inbound talos other-e10s' a total of 1 time(s). 
> Nov 01 13:43:03 pulse-actions app/worker1.1: mozci	 INFO:	  
> Nov 01 13:45:42 pulse-actions app/worker1.1: mozci	 INFO:	 We found a job that finished, however, it did not produced files. 


[1]
Nov 01 13:42:57 pulse-actions app/worker1.1: root	 INFO:	 #### New user request ####. 
Nov 01 13:42:57 pulse-actions app/worker1.1: root	 INFO:	 This log was produced by https://github.com/mozilla/pulse_actions in case you want to help us out! :D 
Nov 01 13:42:58 pulse-actions app/worker1.1: treeherder_job_action	 INFO:	 Backfill action requested by jmaher@mozilla.com for 'Windows 7 32-bit mozilla-inbound talos other-e10s' 
Nov 01 13:42:58 pulse-actions app/worker1.1: treeherder_job_action	 INFO:	 Request for https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=b555a8f33edb7ac95555ab341d5b8281ae5b234f&selectedJob=38409929 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	 We're *aiming* to backfill; note that we ignore the revision that you request to backfill from (b555a8f33edb) up to 12.0 pushes (seta skip: 8; factor: 1.5) and we backfill up to the last green if found. 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	 https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=Windows 7 32-bit mozilla-inbound talos other-e10s&tochange=54a28d6fbed4db85e0f40ebeaf52d563c8226509&fromchange=0a1486cf11b228e41d27499b79b1e3ea6be3e28f 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	 We want to have 1 job(s) of Windows 7 32-bit mozilla-inbound talos other-e10s on the following revisions:  
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - 0a1486cf11b228e41d27499b79b1e3ea6be3e28f 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - 33c2241696c07a203f5355771a40a8c28a0767b5 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - 7b5270bac41b49133cb44af4851f120c1de7cdd4 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - b62b6568d65e63bd2f514f7a5099135065431123 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - 5c834894e4b46cf628237789b900996e6525a729 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - c6587fbc2dbe99f0fabc25454dbe8a02e1eeccf4 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - 72764ba31b81562e3069f6935da33cae06fe6b00 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - 8c08e1aca9ea4953d610ead9b7d1cd4eb0d822d3 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - 2777fabbe62a3a55a3514d724303977bac7260fb 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - b171726c465e305411429b29012f48e890598b3f 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - ac4628bf99126783da833e1641d673c92ee4b1a7 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  - 54a28d6fbed4db85e0f40ebeaf52d563c8226509 
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:43:00 pulse-actions app/worker1.1: mozci	 INFO:	 === 0a1486cf11b228e41d27499b79b1e3ea6be3e28f === 
Nov 01 13:43:01 pulse-actions app/worker1.1: mozci	 INFO:	 We have found 0 potential job(s) matching 'Windows 7 32-bit mozilla-inbound talos other-e10s' on 0a1486cf11b228e41d27499b79b1e3ea6be3e28f. We need to trigger more. 
Nov 01 13:43:01 pulse-actions app/worker1.1: buildapi	 INFO:	 We're going to re-trigger an existing completed job with request_id: 129750737 1 time(s). 
Nov 01 13:43:01 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:43:01 pulse-actions app/worker1.1: mozci	 INFO:	 === 33c2241696c07a203f5355771a40a8c28a0767b5 === 
Nov 01 13:43:03 pulse-actions app/worker1.1: mozci	 INFO:	 We have found 0 potential job(s) matching 'Windows 7 32-bit mozilla-inbound talos other-e10s' on 33c2241696c07a203f5355771a40a8c28a0767b5. We need to trigger more. 
Nov 01 13:43:03 pulse-actions app/worker1.1: buildapi	 INFO:	 We're going to re-trigger an existing completed job with request_id: 129750707 1 time(s). 
Nov 01 13:43:03 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:43:03 pulse-actions app/worker1.1: mozci	 INFO:	 === 7b5270bac41b49133cb44af4851f120c1de7cdd4 === 
Nov 01 13:43:03 pulse-actions app/worker1.1: mozci	 INFO:	 We have found 0 potential job(s) matching 'Windows 7 32-bit mozilla-inbound talos other-e10s' on 7b5270bac41b49133cb44af4851f120c1de7cdd4. We need to trigger more. 
Nov 01 13:43:03 pulse-actions app/worker1.1: mozci	 INFO:	 ==> We want to trigger 'Windows 7 32-bit mozilla-inbound talos other-e10s' a total of 1 time(s). 
Nov 01 13:43:03 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:42 pulse-actions app/worker1.1: mozci	 INFO:	 We found a job that finished, however, it did not produced files. 
Nov 01 13:45:42 pulse-actions app/worker1.1: mozci	 INFO:	 Status of job: failure 
Nov 01 13:45:42 pulse-actions app/worker1.1: mozci	 INFO:	 The build job WINNT 5.2 mozilla-inbound build failed on revision 7b5270bac41b49133cb44af4851f120c1de7cdd4 without generating the necessary files. We will not trigger anything. 
Nov 01 13:45:42 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:42 pulse-actions app/worker1.1: mozci	 INFO:	 === b62b6568d65e63bd2f514f7a5099135065431123 === 
Nov 01 13:45:43 pulse-actions app/worker1.1: mozci	 INFO:	 We have 1 job(s) for 'Windows 7 32-bit mozilla-inbound talos other-e10s' which is enough for the 1 job(s) we want. 
Nov 01 13:45:43 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:43 pulse-actions app/worker1.1: mozci	 INFO:	 === 5c834894e4b46cf628237789b900996e6525a729 === 
Nov 01 13:45:43 pulse-actions app/worker1.1: mozci	 INFO:	 We have 1 job(s) for 'Windows 7 32-bit mozilla-inbound talos other-e10s' which is enough for the 1 job(s) we want. 
Nov 01 13:45:43 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:43 pulse-actions app/worker1.1: mozci	 INFO:	 === c6587fbc2dbe99f0fabc25454dbe8a02e1eeccf4 === 
Nov 01 13:45:44 pulse-actions app/worker1.1: mozci	 INFO:	 We have 1 job(s) for 'Windows 7 32-bit mozilla-inbound talos other-e10s' which is enough for the 1 job(s) we want. 
Nov 01 13:45:44 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:44 pulse-actions app/worker1.1: mozci	 INFO:	 === 72764ba31b81562e3069f6935da33cae06fe6b00 === 
Nov 01 13:45:45 pulse-actions app/worker1.1: mozci	 INFO:	 We have found 0 potential job(s) matching 'Windows 7 32-bit mozilla-inbound talos other-e10s' on 72764ba31b81562e3069f6935da33cae06fe6b00. We need to trigger more. 
Nov 01 13:45:45 pulse-actions app/worker1.1: mozci	 INFO:	 ==> We want to trigger 'Windows 7 32-bit mozilla-inbound talos other-e10s' a total of 1 time(s). 
Nov 01 13:45:45 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:46 pulse-actions app/worker1.1: mozci	 INFO:	 We found a job that finished, however, it did not produced files. 
Nov 01 13:45:46 pulse-actions app/worker1.1: mozci	 INFO:	 Status of job: failure 
Nov 01 13:45:46 pulse-actions app/worker1.1: mozci	 INFO:	 The build job WINNT 5.2 mozilla-inbound build failed on revision 72764ba31b81562e3069f6935da33cae06fe6b00 without generating the necessary files. We will not trigger anything. 
Nov 01 13:45:46 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:46 pulse-actions app/worker1.1: mozci	 INFO:	 === 8c08e1aca9ea4953d610ead9b7d1cd4eb0d822d3 === 
Nov 01 13:45:47 pulse-actions app/worker1.1: mozci	 INFO:	 We have 1 job(s) for 'Windows 7 32-bit mozilla-inbound talos other-e10s' which is enough for the 1 job(s) we want. 
Nov 01 13:45:47 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:47 pulse-actions app/worker1.1: mozci	 INFO:	 === 2777fabbe62a3a55a3514d724303977bac7260fb === 
Nov 01 13:45:47 pulse-actions app/worker1.1: mozci	 INFO:	 We have 1 job(s) for 'Windows 7 32-bit mozilla-inbound talos other-e10s' which is enough for the 1 job(s) we want. 
Nov 01 13:45:47 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:47 pulse-actions app/worker1.1: mozci	 INFO:	 === b171726c465e305411429b29012f48e890598b3f === 
Nov 01 13:45:48 pulse-actions app/worker1.1: mozci	 INFO:	 We have 6 job(s) for 'Windows 7 32-bit mozilla-inbound talos other-e10s' which is enough for the 1 job(s) we want. 
Nov 01 13:45:48 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:48 pulse-actions app/worker1.1: mozci	 INFO:	 === ac4628bf99126783da833e1641d673c92ee4b1a7 === 
Nov 01 13:45:48 pulse-actions app/worker1.1: mozci	 INFO:	 We have 2 job(s) for 'Windows 7 32-bit mozilla-inbound talos other-e10s' which is enough for the 1 job(s) we want. 
Nov 01 13:45:49 pulse-actions app/worker1.1: mozci	 INFO:	  
Nov 01 13:45:49 pulse-actions app/worker1.1: mozci	 INFO:	 === 54a28d6fbed4db85e0f40ebeaf52d563c8226509 === 
Nov 01 13:45:49 pulse-actions app/worker1.1: mozci	 INFO:	 We have 1 job(s) for 'Windows 7 32-bit mozilla-inbound talos other-e10s' which is enough for the 1 job(s) we want. 
Nov 01 13:45:49 pulse-actions app/worker1.1: root	 WARNING:	 The handler did not give us an exit_code 
Nov 01 13:45:49 pulse-actions app/worker1.1: root	 INFO:	 Message {u'action': u'backfill', u'job_id': 38409929, u'requester': u'jmaher@mozilla.com', u'version': 1, u'project': u'mozilla-inbound', u'build_system_type': u'buildbot', u'job_guid': u'c74adb69274fad5d88b93f673ff25ea3b317c7da'}, took 172 seconds to execute 
Nov 01 13:45:50 pulse-actions app/worker1.1: root	 INFO:	 Log uploaded to https://tc-gp-public-31d.s3-us-west-2.amazonaws.com/ateam/pulse-action-dev/196ac91e-c0cf-4a78-a06e-d13d82e04d67 
Nov 01 13:45:50 pulse-actions app/worker1.1: root	 INFO:	 Created Treeherder 'Sch' job. 
Nov 01 13:45:50 pulse-actions app/worker1.1: root	 INFO:	 #### End of user request ####.
The number of workers processing the requests seems to be irrelevant. I brought the number down to 2 workers since we can have workers without processing pulse messages.
See worker 1.3 [1], it did not process any messages for 6 hours.

I thought that we can have many workers listening to the same queue.

rail, have you faced something similar before?
Have you noticed any issues with Pulse in the last week?

[1]
Nov 01 11:12:04 pulse-actions heroku/worker1.3: Restarting 
Nov 01 11:12:04 pulse-actions heroku/worker1.3: State changed from up to starting 
Nov 01 11:12:06 pulse-actions heroku/worker1.3: Stopping all processes with SIGTERM 
Nov 01 11:12:07 pulse-actions heroku/worker1.3: Process exited with status 143 
Nov 01 11:12:26 pulse-actions heroku/worker1.3: Starting process with command `python pulse_actions/worker.py --config-file configs/manual_requests.json --memory-saving` 
Nov 01 11:12:27 pulse-actions heroku/worker1.3: State changed from starting to up 
Nov 01 11:12:30 pulse-actions app/worker1.3: root	 INFO:	 Console output logs INFO level messages. 
Nov 01 11:12:30 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill) 
Nov 01 11:12:30 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-actions, #.#) 
Nov 01 11:12:30 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) 
Nov 01 11:14:06 pulse-actions app/worker1.3: root	 INFO:	 Message {u'_meta': {u'sent': u'2016-11-01T11:14:02+01:00', u'exchange': u'exchange/build/normalized', u'routing_key': u'build.mo 
Nov 01 11:16:07 pulse-actions heroku/worker1.3: State changed from up to down 
Nov 01 11:16:10 pulse-actions heroku/worker1.3: Stopping all processes with SIGTERM 
Nov 01 11:16:10 pulse-actions heroku/worker1.3: Process exited with status 143 
Nov 01 14:00:12 pulse-actions heroku/worker1.3: Starting process with command `python pulse_actions/worker.py --config-file configs/manual_requests.json --memory-saving` 
Nov 01 14:00:13 pulse-actions heroku/worker1.3: State changed from starting to up 
Nov 01 14:00:17 pulse-actions app/worker1.3: root	 INFO:	 Console output logs INFO level messages. 
Nov 01 14:00:18 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill) 
Nov 01 14:00:18 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) 
Nov 01 14:00:18 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-actions, #.#) 
Nov 01 14:24:18 pulse-actions heroku/worker1.3: Restarting 
Nov 01 14:24:18 pulse-actions heroku/worker1.3: State changed from up to starting 
Nov 01 14:24:20 pulse-actions heroku/worker1.3: Stopping all processes with SIGTERM 
Nov 01 14:24:21 pulse-actions heroku/worker1.3: Process exited with status 143 
Nov 01 14:24:22 pulse-actions heroku/worker1.3: Starting process with command `python pulse_actions/worker.py --config-file configs/manual_requests.json --memory-saving` 
Nov 01 14:24:22 pulse-actions heroku/worker1.3: State changed from starting to up 
Nov 01 14:24:25 pulse-actions app/worker1.3: root	 INFO:	 Console output logs INFO level messages. 
Nov 01 14:24:25 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill) 
Nov 01 14:24:25 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-actions, #.#) 
Nov 01 14:24:25 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) 
Nov 01 14:32:09 pulse-actions app/worker1.3: root	 INFO:	 Message {u'times': 1, u'version': 1, u'action': u'cancel_all', u'requester': u'jorgk@jorgk.com', u'resultset_id': u'2223', u'pro 
Nov 01 14:35:38 pulse-actions heroku/worker1.3: Restarting 
Nov 01 14:35:38 pulse-actions heroku/worker1.3: State changed from up to starting 
Nov 01 14:35:42 pulse-actions heroku/worker1.3: Stopping all processes with SIGTERM 
Nov 01 14:35:42 pulse-actions heroku/worker1.3: Process exited with status 143 
Nov 01 14:35:44 pulse-actions heroku/worker1.3: Starting process with command `python pulse_actions/worker.py --config-file configs/manual_requests.json --memory-saving` 
Nov 01 14:35:45 pulse-actions heroku/worker1.3: State changed from starting to up 
Nov 01 14:35:47 pulse-actions heroku/worker1.3: State changed from up to down 
Nov 01 14:35:48 pulse-actions app/worker1.3: root	 INFO:	 Console output logs INFO level messages. 
Nov 01 14:35:48 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) 
Nov 01 14:35:48 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill) 
Nov 01 14:35:48 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-actions, #.#) 
Nov 01 14:35:49 pulse-actions app/worker1.3: root	 INFO:	 #### New user request ####. 
Nov 01 14:35:49 pulse-actions app/worker1.3: root	 INFO:	 This log was produced by https://github.com/mozilla/pulse_actions in case you want to help us out! :D 
Nov 01 14:35:50 pulse-actions app/worker1.3: treeherder_job_action	 INFO:	 Backfill action requested by jmaher@mozilla.com for 'Windows 7 32-bit autoland talos other-e10s' 
Nov 01 14:35:50 pulse-actions app/worker1.3: treeherder_job_action	 INFO:	 Request for https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=aadb57a9ddf83910a604f6be3b787fee661249db&selectedJob=5887512 
Nov 01 14:35:51 pulse-actions heroku/worker1.3: Starting process with command `python pulse_actions/worker.py --config-file configs/manual_requests.json --memory-saving` 
Nov 01 14:35:51 pulse-actions heroku/worker1.3: Stopping all processes with SIGTERM 
Nov 01 14:35:51 pulse-actions heroku/worker1.3: State changed from starting to up 
Nov 01 14:35:52 pulse-actions heroku/worker1.3: Process exited with status 143 
Nov 01 14:35:53 pulse-actions app/worker1.3: root	 INFO:	 Console output logs INFO level messages. 
Nov 01 14:35:53 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) 
Nov 01 14:35:53 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-actions, #.#) 
Nov 01 14:35:53 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill) 
Nov 01 16:19:19 pulse-actions heroku/worker1.3: Restarting 
Nov 01 16:19:19 pulse-actions heroku/worker1.3: State changed from up to starting 
Nov 01 16:19:23 pulse-actions heroku/worker1.3: Stopping all processes with SIGTERM 
Nov 01 16:19:23 pulse-actions app/worker1.3: root	 INFO:	 #### New user request ####. 
Nov 01 16:19:23 pulse-actions app/worker1.3: root	 INFO:	 This log was produced by https://github.com/mozilla/pulse_actions in case you want to help us out! :D 
Nov 01 16:19:23 pulse-actions app/worker1.3: treeherder_push_action	 INFO:	 trigger_missing_jobs action requested by philringnalda@gmail.com on repo_name autoland with resultset_id: 3753 
Nov 01 16:19:24 pulse-actions heroku/worker1.3: Process exited with status 143 
Nov 01 16:19:25 pulse-actions heroku/worker1.3: Starting process with command `python pulse_actions/worker.py --config-file configs/manual_requests.json --memory-saving` 
Nov 01 16:19:26 pulse-actions heroku/worker1.3: State changed from starting to up 
Nov 01 16:19:30 pulse-actions app/worker1.3: root	 INFO:	 Console output logs DEBUG level messages. 
Nov 01 16:19:30 pulse-actions app/worker1.3: mozci	 DEBUG:	 Disabling validations. 
Nov 01 16:19:30 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill) 
Nov 01 16:19:30 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-actions, #.#) 
Nov 01 16:19:30 pulse-actions app/worker1.3: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) 
Nov 01 17:39:40 pulse-actions heroku/worker1.3: State changed from up to down 
Nov 01 17:39:44 pulse-actions heroku/worker1.3: Stopping all processes with SIGTERM 
Nov 01 17:39:45 pulse-actions heroku/worker1.3: Process exited with status 143
Keep in mind that people might re-trigger the pulse_actions job also multiple times because they do not appear on Treeherder now. With submitting the pending status we might gain a lot. I filed bug 1314623 for that.
See Also: → bug 1314623
Depends on: 1314930
Created attachment 8807204 [details]
Profiling of a backfill request - does not include POST calls

This problem is multi dimensional. All steaming from either bugs that were hard to figure out without spending a lot of time or a not good enough design (increasing number of workers instead of dispatching to TaskCluster - dynamic scaling).

* The load and usage might have been increasing for a while and we've reached a tipping point
* Heroku workers could have become slower
** I don't see any metrics being fed into New Relic even though the app seems to be connected
* Having multiple consumers is not working as expected
** Increasing the number of workers does not mean that all will be consuming
** It could be that I'm doing something silly and only work well when there's not much in the queue
** I will look into using what rail recommends [1]
* Beefing up the Heroku workers barely has any impact because:
** We're currently using Buildapi/buildjson for job status (TH lacks some data) which is network and disk bound
** See attachment to see that half the time of a backfill request is spent on network and json manipulation
* Change mozci to do operation concurrently
* Schedule some actions via TaskCluster to reduce calls on Buildapi (reduce tip risk when increasing number of workers)
** e.g. talos jobs for build

Ideas to try:
* Try rail's suggestion:
> The client fetches only one message from the queue, works on it, acks or rejects, then fetches another one.
> By default it may fetch multiple messages and steal them from other clients. If this client is slow, it will
> slow down the whole process
> I don't know how to set the QoS in mozillapulse, https://wiki.mozilla.org/Auto-tools/Projects/Pulse talks
> about prefetch count, but it's not clear how to set it using mozillapulse 
> I use plain kombu http://kombu.readthedocs.io/en/latest/reference/kombu.mixins.html#kombu.mixins.ConsumerMixin this one
> It gives you access to the channel object, where you can set the QoS
* Instead of writing buildjson files to disk and then reading back; uncompress on the fly
** Unfortunately we have memory restrictions on the Heroku instances
* Switch completely away from Buildapi/buildjson (memory and performance gain)
** We need to fix the dependent bug
* Improve the performance of each type of request
** Take one by one all

[1] https://github.com/rail/funsize/blob/master/funsize/worker.py#L101
Summary: Performance analysis of pulse_actions → pulse_actions - Performance analysis of pulse_actions
So far I've determined that pulse_actions is processing the Pulse messages in the queue but it is *not* acknowledging the messages.
I've also noticed that we can't have two consumers.
Until I kill one the other one will *not* see any messages to be consumed.
It seems I inadvertedly introduced a regression [1] that caused the acknowledge global flag to be set to False (*facepalm*)
I fixed this in [2] It's a brittle and complicated piece of code that needs to be redone.

While the backlog clears, I will check if using multiple consumers works as expected.

[1] https://github.com/mozilla/pulse_actions/commit/f5098052892f90ff15bd94d394327ab905448983
[2] https://github.com/mozilla/pulse_actions/commit/2022354f8bd6661835e80b20caf1ea1387fd9996
This is now fixed and queue has cleared.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.