Closed Bug 1048737 Opened 10 years ago Closed 8 years ago

addons cron on mxr-processor1 takes too long to run

Categories

(Developer Services :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ashish, Unassigned)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3289])

History: https://bugzilla.mozilla.org/buglist.cgi?list_id=10923699&short_desc=mxr-processor1.private.scl3.mozilla.com&query_format=advanced&short_desc_type=allwordssubstr&component=Server%20Operations%3A%20MOC&product=mozilla.org

> root      3467  0.0  0.0 106100  1164 ?        S    Jul31   0:00 /bin/sh /data/www/mxr.mozilla.org/update-full-onetree.sh addons
> root      7149  0.0  0.0 106100   436 ?        S    Aug01   0:00  \_ /bin/sh /data/www/mxr.mozilla.org/update-full-onetree.sh addons
> root      7150  0.0  0.0 141156  2152 ?        S    Aug01   0:00      \_ perl update-xref.pl addons
> root      7159  0.0  0.0 106092  1024 ?        S    Aug01   0:00          \_ sh -c time   /data/www/mxr.mozilla.org/genxref /data/mxr-data/addons/addons >> /data/mxr-data/addons/genxref.log 2>&1
> root      7160 99.4  0.8 327552 139744 ?       R    Aug01 5690:11              \_ /usr/bin/perl /data/www/mxr.mozilla.org/genxref /data/mxr-data/addons/addons

genxref always takes a very long time (more than a week) to complete. I wonder when the last time it actually completed a full run was, because each time the check alerts, the oncalls kill the process (as per documentation); the current run from July 31 was definitely spawned by hand in bug 1047212. So, to determine how long a run actually takes, I'm not killing genxref this time.
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/678]
<nagios-scl3:#sysadmins> Tue 05:27:21 PDT [5435] 
  mxr-processor1.private.scl3.mozilla.com:File Age - /var/lock/mxr/long is OK: 
  OK: 1dir(s) -- /var/lock/mxr/long: 0 files 
  (http://m.mozilla.org/File+Age+-+/var/lock/mxr/long)
:jakem - Do you have any suggestions here?
Flags: needinfo?(nmaul)
Summary: mxr-processor1.private.scl3 takes too long to run → addons cron on mxr-processor1 takes too long to run
No great ideas, sorry. Our best bet will be to work with the AMO folks to develop a better way for this to work. Or perhaps we could reduce it to running monthly. @jorgev might be the best person to talk to here... not sure. I've CC'd him. Failing that, @oremj or @jason would be my next guesses. Also CC'd.

I do know that this is a fairly unusual tree by MXR standards. MXR expects to check out code from a repo and then scan it... there is no such repo in this case, so instead there's a script that queries a database to find the paths/names of all current add-ons, then fetches them. It has some intelligence about minimizing the amount of data it has to fetch (reusing already-fetched content), but I'm not convinced that works properly in all cases. It then has to unpack every XPI it fetched so their contents can be scanned. Only after all that can MXR scan the files as normal.
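To make that flow concrete, here's a rough sketch of the fetch-and-unpack stage as described above (illustrative only -- the DB query, table and column names, download URL, and temp paths are my assumptions, not the real MXR script):

    # 1) Ask the AMO database which add-on files are current (schema is hypothetical).
    mysql -N -e "SELECT filename FROM files WHERE status = 'public'" amo_db > /tmp/addon-files.txt

    # 2) Fetch anything not already on disk, then unpack it (XPIs are just zip archives).
    while read -r f; do
        dest="/data/mxr-data/addons/addons/${f%.xpi}"
        [ -d "$dest" ] && continue                      # reuse already-fetched content
        curl -sf -o /tmp/addon.xpi "https://addons.mozilla.org/files/$f" || continue
        mkdir -p "$dest" && unzip -qo /tmp/addon.xpi -d "$dest"
    done < /tmp/addon-files.txt

    # 3) Only then does genxref scan the unpacked tree like any other source tree.

Note that the process listing above shows the genxref scan itself pegging a CPU, so the fetch/unpack stages are only part of the picture.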

It could also be that we've simply outgrown this design altogether. I imagine the sheer volume of AMO Addons has only increased over time, so maybe it simply doesn't complete in a reasonable time frame anymore.

MXR is on the way out and DXR is the new hotness. I don't know how soon that might be usable, but it might be worth considering a stopgap solution here, with the proper solution being engineered for use with DXR, not MXR.

It's theoretically possible to spin up another processor node that attempts to process only the Addons tree. This won't really make addons faster, but we could set separate schedules/alerts on it, and it would have less of an effect on the other trees.
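For the alerting side of that, a longer threshold on a dedicated lock file might be enough. As a hedged example (the lock file path and thresholds are made up, and I'm assuming the stock check_file_age plugin rather than whatever the current check actually uses):

    # Warn after 14 days, go critical after 28, on a hypothetical addons-only lock file.
    check_file_age -w 1209600 -c 2419200 -f /var/lock/mxr/addons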

I also suspect several pieces of the current addons tree processing are serial in nature. If they could be parallelized, they might run much faster overall (given enough cores and I/O throughput). Doing this probably requires a separate box, though... parallelizing addons on the existing system would choke off CPU from the other trees. It might run in days (or even hours) instead of weeks, but that's way too long for everything else to be stalled. :)
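If the per-XPI work really is independent, the cheapest experiment is probably just fanning the unpack step out, something like this (paths and worker count are hypothetical):

    # Unpack XPIs across 8 workers instead of one at a time.
    find /data/mxr-data/addons/incoming -name '*.xpi' -print0 |
        xargs -0 -n1 -P8 sh -c 'unzip -qo "$0" -d "/data/mxr-data/addons/addons/$(basename "$0" .xpi)"'

The genxref scan itself is a single Perl process (the one at ~100% CPU in the listing above), so parallelizing the surrounding stages only helps so much unless the scan can also be split per subtree.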
Flags: needinfo?(nmaul)
I can't find the bug at the moment, but I found a case where two add-ons were causing genxref to loop for days. I managed to "fix" it, but the regexps are so bewildering that I couldn't tell why it worked or whether it broke something else. The issue went away before I pushed anything, though.

MXR is going away, so significant time or cost should not be spent on this. If someone wants to debug and provide a patch, I'm happy to apply it. Otherwise, I would actually suggest we create a new semi-monthly cronjob that just runs addons, and tweak nagios if needed. 
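For reference, that could be as small as a cron entry along these lines (the script path is the one from the process listing above; the schedule and file location are guesses):

    # /etc/cron.d/mxr-addons (hypothetical): run the addons tree alone on the
    # 1st and 15th of each month at 02:00 instead of the regular cadence.
    0 2 1,15 * * root /data/www/mxr.mozilla.org/update-full-onetree.sh addons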

Also, Dev Services, not WebOps. :-)
Component: WebOps: Other → WebOps: Source Control
I'm not familiar with the implementation of the Add-ons MXR, but I would be happy to help make it more efficient. An obvious first question is whether we need to process all add-ons every time, or whether we can update the index incrementally with only the ones that have changed since the last run (a couple hundred per week).
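As a rough sketch of what the incremental approach could look like on the fetch side (purely illustrative -- the timestamp file, table/column names, and how the indexer would consume the shorter list are all assumptions):

    # Only fetch add-ons modified since the last successful run.
    stamp=/var/lock/mxr/addons-last-run
    since=$(date -r "$stamp" '+%Y-%m-%d %H:%M:%S' 2>/dev/null || echo '1970-01-01 00:00:00')
    mysql -N -e "SELECT filename FROM files WHERE modified >= '$since'" amo_db > /tmp/changed-addons.txt
    # ...fetch/unpack only the files listed above, then re-index...
    touch "$stamp"

Whether genxref's index can actually be updated in place rather than rebuilt from scratch is the real question, and I don't know the answer to that.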
This just came up again. Anything to help here before it goes away?
Group: infra
Component: WebOps: Source Control → General
Product: Infrastructure & Operations → Developer Services
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/678] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3279] [kanban:https://kanbanize.com/ctrl_board/4/678]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3279] [kanban:https://kanbanize.com/ctrl_board/4/678] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3284] [kanban:https://kanbanize.com/ctrl_board/4/678]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3284] [kanban:https://kanbanize.com/ctrl_board/4/678] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3289] [kanban:https://kanbanize.com/ctrl_board/4/678]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3289] [kanban:https://kanbanize.com/ctrl_board/4/678] → [kanban:https://kanbanize.com/ctrl_board/4/678]
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/678] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3289]
Assignee: server-ops-webops → nobody
QA Contact: nmaul
Service decommissioned.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED