Closed Bug 718632 Opened 8 years ago Closed 8 years ago

TBPL should prefetch logs

Categories

(Tree Management Graveyard :: TBPL, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: justin.lebar+bug, Assigned: Swatinem)

References

Details

(Whiteboard: [sheriff-want])

Attachments

(7 files, 5 obsolete files)

14.63 KB, patch
mstange
: review+
Details | Diff | Splinter Review
1.78 KB, patch
emorley
: review+
Details | Diff | Splinter Review
3.11 KB, patch
Swatinem
: review+
Details | Diff | Splinter Review
3.66 KB, patch
Swatinem
: review+
Details | Diff | Splinter Review
979 bytes, patch
emorley
: review+
Details | Diff | Splinter Review
893 bytes, patch
Swatinem
: review+
Details | Diff | Splinter Review
1.30 KB, patch
Details | Diff | Splinter Review
In bug 717005 comment 31, I wrote:

> Is there any reason we can't aggressively prefetch all the logs, as soon
> as the build finishes?  Or, if not, just the non-green logs, as soon as we
> know the build is non-green?
> 
> If not, a "fetch all teh orange!" button might be helpful.  We can, of
> course, do this as a separate bug.

This change would make TBPL many times more pleasant to use!
Depends on: 717005
Attached patch patch (obsolete) — Splinter Review
This has been a lot easier than I thought.
I need some comments on the IPC though.
Assignee: nobody → arpad.borsos
Status: NEW → ASSIGNED
Attachment #595478 - Flags: review?(mstange)
Attachment #595478 - Flags: review?(ehsan)
Comment on attachment 595478 [details] [diff] [review]
patch

Review of attachment 595478 [details] [diff] [review]:
-----------------------------------------------------------------

Neat!
Attachment #595478 - Flags: review?(ehsan) → review+
Comment on attachment 595478 [details] [diff] [review]
patch

Looks great to me. How much longer does the script now take to run, usually?
Attachment #595478 - Flags: review?(mstange) → review+
Well that is a problem, really.
I had started the process on my server, and after 2 hours it was still not finished prefetching the logs for ~5000 runs.
Things got worse: I have no lock file for the cron job, so every 5 minutes a new job was spawned, blocked on the log processing of the first one (each import job spawned a PHP job that was busy-waiting for the first job to finish processing).
I was lucky that I checked on my server, otherwise it would have ended up in swap-hell.

philor measured how long it takes to generate one summary. Doing 5000 serially is an extreme performance killer.


So as far as I remember from a previous bug, the production tbpl actually has a lock on the importer so it’s not possible to start it while the first process is not finished yet. That would avoid the problem that I had. However, we still need to speed the process up a bit.
The subprocess.call as it is now is blocking, so only a single PHP worker runs at a time. We could just use subprocess.Popen and leave it be; as far as I understand, that’s non-blocking. But that could leave us with 5000 parallel PHP workers, which would swamp the machine just as easily.


I could try creating a configurable number (say 10) of python threads, each blocking on a php worker. I don’t know python that well, are lists threadsafe?
Any other ideas?
> Any other ideas?

You could prefetch only logs from orange builds?

> I don’t know python that well, are lists threadsafe?

No, although deques are.  http://docs.python.org/library/collections.html#deque-objects
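The worker-pool idea under discussion can be sketched roughly like this (a minimal illustration, not TBPL's actual importer code; the jobs here are placeholder values rather than real PHP commands):

```python
import collections
import threading

def run_workers(jobs, worker_count=10):
    """Drain a shared deque of jobs with a fixed pool of threads.

    deque.popleft() and deque.append() are atomic, so the queue itself
    needs no explicit lock. In TBPL the body of the loop would be a
    blocking subprocess call to a PHP worker; here we just record each
    job to keep the sketch self-contained.
    """
    queue = collections.deque(jobs)
    results = collections.deque()

    def worker():
        while True:
            try:
                job = queue.popleft()
            except IndexError:
                return  # queue drained, this worker is done
            results.append(job)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

The key point is that each thread blocks on its own subprocess, so the number of concurrent PHP workers is capped at `worker_count` regardless of how many jobs are queued.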
Attached patch v2 with worker threads (obsolete) — Splinter Review
(In reply to Justin Lebar [:jlebar] from comment #5)
> > Any other ideas?
> 
> You could prefetch only logs from orange builds?

You are right. Clicking a green run is not very likely, so there is no use in prefetching the tinderbox_print for it.

> > I don’t know python that well, are lists threadsafe?
> 
> No, although deques are. 
> http://docs.python.org/library/collections.html#deque-objects

Thanks for the pointer.
I now tried 10 python threads, each spawning and blocking on a php worker. I also added a timeout to the php scripts as I observed one of the workers going rogue and just sitting there for 15 minutes.

I did an incremental import with ~500 new runs (no idea how many of them were != 'success') and the import finished in 6 minutes with 10 workers. This should be better with more workers and better hardware (my vserver just has 512 MB RAM + 512 MB swap).

Note to self: make the worker count configurable

So we are slowly getting there. And I’m starting to like Python, although I wish tbpl had been written in lovely non-blocking node.js :-)
Attachment #595478 - Attachment is obsolete: true
Attachment #595558 - Flags: review?(mstange)
Attachment #595558 - Flags: review?(ehsan)
Comment on attachment 595558 [details] [diff] [review]
v2 with worker threads

Rather than using multiple Python threads, you could open multiple subprocesses in one thread and then use select() on them.

This shouldn't perform any worse, because Python has this Global Interpreter Lock which prevents multiple threads from running Python code concurrently.

That may or may not be easier than the current state, but might have less overhead if you want to bump up the concurrency.
I think the python code is really concise and readable this way.

But I’m still having problems with long-running php workers that delay the importer for too long. Any ideas?
> But I’m still having problems with long-running php workers that delay the importer for too long. 
> Any ideas?

Well, if you use select(), it's easy to time out a subprocess which has been running for too long.  :)
Can't we just spawn the subprocesses in the background and finish without waiting on them?  I can't see why we need to wait on the subprocesses.
(In reply to Ehsan Akhgari [:ehsan] from comment #10)
> Can't we just spawn the subprocesses in the background and finish without
> waiting on them?  I can't see why we need to wait on the subprocesses.

That way we can’t control how many there are. Like I said in comment 4, depending on the import size, this could be as many as 5000.

And here are a few numbers before I go to sleep:
Note: I have no idea how many of them are != 'success'

Inserted 632 new run entries.   
real    5m48.689s
user    2m29.890s
sys     0m23.040s

Inserted 200 new run entries.   
real    2m28.915s
user    0m36.920s
sys     0m4.550s
Keep in mind, while planning on fetching up the log the second that builds-4hr tells us about a build, that buildbot makes absolutely no promises about the log existing prior to telling us about the build.
(In reply to Phil Ringnalda (:philor) from comment #12)
> Keep in mind, while planning on fetching up the log the second that
> builds-4hr tells us about a build, that buildbot makes absolutely no
> promises about the log existing prior to telling us about the build.

That’s not a problem; we just throw when the log is not available on the FTP.
So I’ve added some instrumentation which logs something like this:

generating raw log (9203783): 2461ms
generating tinderbox_print log (9203783): 2706ms
fetching/waiting for raw log (9203783): 0ms
fetching bugs for "pseudo-element-of-native-anonymous.html" (9203889): 10995ms
generating general_error log (9203783): 617ms
fetching bugs for "spellcheck-textarea-attr-inherit.html" (9204044): 12022ms
fetching bugs for "browser_permissions.js" (9204008): 14296ms
fetching bugs for "poster-15.html" (9203921): 13554ms
generating annotatedsummary log (9203921): 99778ms
fetching bugs for "spellcheck-textarea-property-dynamic-override-inherit.html" (9204145): 12775ms
generating annotatedsummary log (9204145): 164062ms
(full output at http://tbpl.swatinem.de/dataimport/log.txt )

So fetching the log from ftp and log parsing in php is 2-3s.

The worst offender by far is the Bugzilla queries; I see 8-19 seconds there. According to the PHP manual, max_execution_time only refers to script execution itself, not to stream operations. That explains why the processes run for longer than 30s.

So what we can do here is to:
1. extend the ignore list of bug searches. I see "plain,", "html,<body>", "Main app process exited normally" and other things as search terms.
2. make a real bug cache. The one we have now, AnnotatedSummaryGenerator->bugsCache, is per-process, so not a real cache at all. We could save the results in a new database table and expire entries that are older than, say, 2 days and re-fetch them.
or 3. delegate this problem to bzapi or server ops to make bz queries faster and do the caching on their end.
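Option 2 (a persistent bug cache with expiry) could look roughly like this; the class and table names are made up for illustration, and sqlite3 stands in for TBPL's MySQL database, which is actually written from PHP:

```python
import json
import sqlite3
import time

class BugsCache:
    """Sketch of a db-backed bug-search cache with a time-to-live.

    Results are stored as JSON keyed by the failing filename; entries
    older than `ttl` seconds are purged before every lookup, forcing a
    re-fetch from Bugzilla.
    """

    def __init__(self, path=":memory:", ttl=2 * 24 * 3600):
        self.ttl = ttl
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS bugscache "
            "(filename TEXT PRIMARY KEY, bugs TEXT, fetched_at REAL)")

    def get(self, filename):
        # Expire stale entries first, then look up the filename.
        self.db.execute("DELETE FROM bugscache WHERE fetched_at < ?",
                        (time.time() - self.ttl,))
        row = self.db.execute(
            "SELECT bugs FROM bugscache WHERE filename = ?",
            (filename,)).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, filename, bugs):
        self.db.execute(
            "INSERT OR REPLACE INTO bugscache VALUES (?, ?, ?)",
            (filename, json.dumps(bugs), time.time()))
```

Unlike the per-process bugsCache, this survives across importer runs, so repeated oranges on the same test only hit Bugzilla once per TTL window.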
Attached patch v3 (obsolete) — Splinter Review
I’m quite happy with the patch now:
- Nice logging ( http://tbpl.swatinem.de/dataimport/log.txt )
- A real db-based bug-cache
- select() and hopefully no deadlocks :-)
- configurable worker count and timeout

It’s still very slow on my machine, but with more workers it should be fine.
Attachment #595558 - Attachment is obsolete: true
Attachment #595558 - Flags: review?(mstange)
Attachment #595558 - Flags: review?(ehsan)
Attachment #595726 - Flags: review?(mstange)
Attachment #595726 - Flags: review?(ehsan)
Attachment #595726 - Flags: feedback?(justin.lebar+bug)
Comment on attachment 595726 [details] [diff] [review]
v3

Looks pretty sensible to me.

I dunno how happy the log server will be to be hammered like this on the occasion that a build turns everything orange.  I guess the oranges won't all come in at the same time.

Also, what about comment 12 -- will we retry in case the logs aren't there?
Attachment #595726 - Flags: feedback?(justin.lebar+bug) → feedback+
Comment on attachment 595726 [details] [diff] [review]
v3

Review of attachment 595726 [details] [diff] [review]:
-----------------------------------------------------------------

To be honest, this patch goes way beyond my Python knowledge.  Anyone else wanna give this a shot?
Attachment #595726 - Flags: review?(ehsan)
(In reply to Ehsan Akhgari [:ehsan] from comment #16)
> Comment on attachment 595726 [details] [diff] [review]
> v3
> 
> Review of attachment 595726 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> To be honest, this patch goes way beyond my Python knowledge.  Anyone else
> wanna give this a shot?

This looks pretty reasonable to me (calling out to PHP to do processing seems a bit odd, but it's good reuse of existing code).

The use of lambda and passing functions makes this a bit more tricky, but I don't think it's too bad.

I can review this and also test it in my local install if you'd like, lmk.
Comment on attachment 595726 [details] [diff] [review]
v3

(In reply to Robert Helmer [:rhelmer] from comment #17)
> I can review this and also test it in my local install if you'd like, lmk.

I would appreciate it, thanks.


(In reply to Justin Lebar [:jlebar] from comment #15)
> I dunno how happy the log server will be to be hammered like this on the
> occasion that a build turns everything orange.  I guess the oranges won't
> all come in at the same time.

It should not be that bad when we load smaller incremental changes via the cron job.

> Also, what about comment 12 -- will we retry in case the logs aren't there?

We throw in that case, and thanks to the change I made in ParallelLogGenerating.php, we rethrow and die() when control reaches getLogExcerpt.php.
Before, we would just time out after 60s. The assumption was that maybe a parallel script would fetch a valid log, but that assumption wasn’t a good one.
Attachment #595726 - Flags: review?(rhelmer)
Comment on attachment 595726 [details] [diff] [review]
v3


>+    class PrefetchJob(object):
>+        def __init__(self, job):
>+            self.process = subprocess.Popen(job, stdout = subprocess.PIPE, stderr = open('/dev/null', 'w'))


Why not:
stderr = None

Also it'd be nice to catch OSError and print the failing command (I didn't have the PHP CLI installed and the traceback isn't very informative :) )

Code looks ok to me (and seems to work fine) otherwise, although I do occasionally see:
PHP Notice:  Undefined property: stdClass::$logDescription in /var/www/tbpl/php/inc/AnnotatedSummaryGenerator.php on line 110

This is likely a pre-existing issue and goes to the Apache error logs (that nobody regularly looks at)
Attachment #595726 - Flags: review?(rhelmer) → review+
(In reply to Robert Helmer [:rhelmer] from comment #19)
> Why not:
> stderr = None

That way a lot of php startup spew (failed loading extension, ...) goes to stderr, I just wanted to suppress that.

> Also it'd be nice to catch OSError and print the failing command (I didn't
> have the PHP CLI installed and the traceback isn't very informative :) )

Done.

> Code looks ok to me (and seems to work fine) otherwise, although I do
> occasionally see:
> PHP Notice:  Undefined property: stdClass::$logDescription in
> /var/www/tbpl/php/inc/AnnotatedSummaryGenerator.php on line 110
> 
> This is likely a pre-existing issue and goes to the Apache error logs (that
> nobody regularly looks at)

Thanks for catching that. That was my copy&paste fail :-)
Attachment #595726 - Attachment is obsolete: true
Attachment #595726 - Flags: review?(mstange)
Attachment #596127 - Flags: review?(mstange)
Comment on attachment 596127 [details] [diff] [review]
v4, comments addressed

I don't completely understand the python part, but I trust rhelmer and you on that. The rest looks great.
Attachment #596127 - Flags: review?(mstange) → review+
In case the Python part causes problems, we can just return after the db.commit().
That way we at least get the real bug cache, which should be a massive improvement by itself.
Depends on: 728203
Pushed: http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/a266f04ed4ec
Let’s see how the staging server handles the load. It should run a lot smoother once the bug cache is primed.
At the moment, it seems to be saying "Log not available" as a summary for everything, which won't be much fun to debug.
https://tbpl-dev.allizom.org/php/getLogExcerpt.php?id=9492362&tree=Firefox&type=annotated <- this works (the "filename" is blacklisted in the bug search)

For all the others, we are catching some kind of Exception: http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/a266f04ed4ec/php/getLogExcerpt.php#l65
The bugscache table was created in the wrong database. Should work fine now.
Suppose we had the once-promised QA team. What would you tell them to look for on tbpl-dev, to tell whether or not this is working as designed?

Near as I can tell just from gut feeling of how long things should take, we're never successfully prefetching logs (everything always spins for a fetching-length time "loading results" and "retrieving summary"), and we might be caching bug searches (the permaorange 10.7 failures are handy for loading the same failures over and over).
Well, I would just ask how full the database is, or how high the load on the server is.
A log from the Python importer would be good; maybe there is something wrong with the importer, or maybe there is no php-cli installed, like the issue rhelmer was seeing?
Depends on: 730677
Sorry for the noise, going to temporarily back this out so we can get bug 733556 tested on -dev and out into production.

Maybe we should use a feature branch for this?
(In reply to Robert Helmer [:rhelmer] from comment #29)
> Sorry for the noise, going to temporarily back this out so we can get bug
> 733556 tested on -dev and out into production.
> 
> Maybe we should use a feature branch for this?

Backed out - http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/54c6550a89c4
Not sure how we could manage it on a feature branch: so far as Swatinem knows it is finished and working just fine on his server, and what's left is to find out whether or not it works on MoCo's server.

I stared at tbpl-dev for several days without being able to guess whether it was working at all, asked how I could tell, then Swatinem filed bug 730677 to find out, and we'd sure like to find out, what with the whole "this implements a bug cache, and gerv is yelling at us about hitting bzapi too much, probably because someone else is yelling at him about bzapi hitting bmo too much."
(In reply to Robert Helmer [:rhelmer] from comment #30)
> Backed out -
> http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/
> 54c6550a89c4

The newly added Timer.php wasn't backed out/removed, so have done so here:
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/68a941894bf2
Whiteboard: [sheriff-want]
It’s been a long time since there was any progress here.
Since we have had importer logs for some time now, I pushed the patch again (http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/4caf05b3d84d); let’s see how it behaves on tbpl-dev.
Depends on: 785616
This broke the initial data import when using Vagrant:

err: /Stage[main]/Tbpl-base/Exec[/usr/bin/python /var/www/tbpl/dataimport/import-buildbot-data.py -d 2]/returns: change from notrun to 0 failed: /usr/bin/python /var/www/tbpl/dataimport/import-buildbot-data.py -d 2 returned 1 instead of one of [0] at /tmp/vagrant-puppet/manifests/classes/tbpl-base.pp:90
So what does that error message mean exactly?
(In reply to Ed Morley [:edmorley] from comment #34)
> This broke the initial data import when using Vagrant:
> 
> err: /Stage[main]/Tbpl-base/Exec[/usr/bin/python
> /var/www/tbpl/dataimport/import-buildbot-data.py -d 2]/returns: change from
> notrun to 0 failed: /usr/bin/python
> /var/www/tbpl/dataimport/import-buildbot-data.py -d 2 returned 1 instead of
> one of [0] at /tmp/vagrant-puppet/manifests/classes/tbpl-base.pp:90

You might want to have this set somewhere (e.g. in init.pp):

Exec { logoutput => on_failure }
(In reply to Robert Helmer [:rhelmer] from comment #36)
> You might want to have this set somewhere (e.g. in init.pp):
> 
> Exec { logoutput => on_failure }

Thank you, I didn't know that Vagrant trick :-)

(In reply to Arpad Borsos (Swatinem) from comment #35)
> So what does that error message mean exactly?

More useful output:

notice: import-buildbot-data.py -d 2]/returns: Fetching http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz ...
notice: import-buildbot-data.py -d 2]/returns: Traversing runs and inserting into database...
notice: import-buildbot-data.py -d 2]/returns: Inserted 3387 new run entries.
notice: import-buildbot-data.py -d 2]/returns: Traversing builders and updating database...
notice: import-buildbot-data.py -d 2]/returns: Updated 1269 builders.
notice: import-buildbot-data.py -d 2]/returns: Fetching http://builddata.pub.build.mozilla.org/buildjson/builds-2012-09-03.js.gz ...
notice: import-buildbot-data.py -d 2]/returns: Traceback (most recent call last):
notice: import-buildbot-data.py -d 2]/returns:   File "/var/www/tbpl/dataimport/import-buildbot-data.py", line 382, in <module>
notice: import-buildbot-data.py -d 2]/returns:     main()
notice: import-buildbot-data.py -d 2]/returns:   File "/var/www/tbpl/dataimport/import-buildbot-data.py", line 315, in main
notice: import-buildbot-data.py -d 2]/returns:     inserted_runs.extend(do_date(today - datetime.timedelta(i), db, options.overwrite))
notice: import-buildbot-data.py -d 2]/returns: TypeError: 'NoneType' object is not iterable
err: import-buildbot-data.py -d 2]/returns: change from notrun to 0 failed: /usr/bin/python /var/www/tbpl/dataimport/import-buildbot-data.py -d 2 returned 1 instead of one of [0] at /tmp/vagrant-puppet/manifests/classes/tbpl-base.pp:90
(Filed bug 787922 to add that logoutput line to init.pp)
Attached patch Fix daily import (obsolete) — Splinter Review
Ah yeah, it broke the daily import somehow, in the case that the URL is invalid.
Which it is, for some reason. Does the exporter only export past days and not the current day anymore?
Attachment #657848 - Flags: review?(bmo)
Comment on attachment 657848 [details] [diff] [review]
Fix daily import

This fixes that error, however the Vagrant TBPL instance still times out when accessed (works with the initial landing in this bug backed out).

(In reply to Arpad Borsos (Swatinem) from comment #39)
> Ah yeah, it broke the daily import somehow. In case that the url is invalid.
> Which it is for some reason. Does the exporter only export past days and not
> the current day anymore?

(If I've understood correctly,) yeah builds-YYYY-MM-DD.js.gz only get updated once a day, meaning we have a gap between the last complete builds-YYYY-MM-DD.js.gz and builds-4hr.js.gz :-( (See bug 765451).
Attachment #657848 - Flags: review?(bmo) → review+
(In reply to Ed Morley [:edmorley] from comment #40)
> This fixes that error, however the Vagrant TBPL instance still times out
> when accessed (works with the initial landing in this bug backed out).

Presume it's trying to pre-fetch the last two days' worth of logs (given Vagrant's '-d 2'). Perhaps we should only pre-fetch logs from builds-4hr.js.gz and not for the full range being imported when using -d?

ie not do the inserted_runs.extend for the options.days and s/inserted_runs/inserted_recent_runs/ or similar?
This is without the .extend(); I guess we really don’t need to prefetch for older entries. Also, you can try running the job with -t 2 to set a timeout of 2 seconds per job; -w sets the number of PHP workers (default is 10).
That should bring the run time down a bit...
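The -t/-w/-d flags mentioned here might be wired up roughly like this (a hypothetical reconstruction using argparse; the real import-buildbot-data.py's option handling and defaults may differ):

```python
import argparse

def parse_options(argv):
    """Parse the importer's command-line flags as described in the thread:
    -t is the per-job timeout, -w the PHP worker count (default 10),
    -d the number of past days to import."""
    parser = argparse.ArgumentParser(
        description="TBPL buildbot data import (sketch)")
    parser.add_argument("-t", "--timeout", type=int, default=30,
                        help="timeout in seconds per PHP prefetch job")
    parser.add_argument("-w", "--workers", type=int, default=10,
                        help="number of parallel PHP workers")
    parser.add_argument("-d", "--days", type=int, default=0,
                        help="number of past days to import")
    return parser.parse_args(argv)
```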
Attachment #657848 - Attachment is obsolete: true
Attachment #657860 - Flags: review?(bmo)
Only pre-fetch logs on runs imported from builds-4hr.js.gz, to avoid fetching all the things.
Attachment #657861 - Flags: review?(arpad.borsos)
Attachment #657861 - Attachment is obsolete: true
Attachment #657861 - Flags: review?(arpad.borsos)
Comment on attachment 657860 [details] [diff] [review]
Dont prefetch for old logs

:-)
Attachment #657860 - Flags: review?(bmo) → review+
Depends on: 787950
With the Vagrant logging patch from bug 787950, I get logs like:
http://pastebin.mozilla.org/1800918
Yep, that’s what it’s supposed to look like. Once the bug cache is primed, generating the summaries should be a lot faster...
(In reply to Arpad Borsos (Swatinem) from comment #46)
> Yep, thats what its supposed to look like. Once the bug cache is primed,
> generating the summaries should be a lot faster...

Sorry I had hit submit too soon...

(In reply to Ed Morley [:edmorley] from comment #45)
> With the Vagrant logging patch from bug 787950, I get logs like:
> http://pastebin.mozilla.org/1800918

...which I'm presuming is now correct.

However, in order to see if the prefetching is actually working (given that I need to find a recent result from builds-4hr.js.gz), it would help to have the tree/revision/test name in the log output so I don't have to manually look it up from builds-4hr.js.gz. We can also ditch the |$str.= ' ('.$_GET['id'].')';| in Timer.php unless you have any objections.

Patch coming up.
Turns:
{
php /var/www/tbpl/dataimport/../php/getLogExcerpt.php id=14924778&type=annotated:
generating raw log (14924778): 1012ms
generating general_error log (14924778): 1103ms
generating annotatedsummary log (14924778): 1ms
completed after 1234ms
}

Into something like:
{
Prefetching: TB Rev3 Fedora 12 comm-release opt test mozmill on f884da2e73ac
php /var/www/tbpl/dataimport/../php/getLogExcerpt.php id=14927314&type=annotated
generating raw log: 1496ms
generating general_error log: 1863ms
generating annotatedsummary log: 6ms
completed after 1999ms
}
Attachment #657919 - Flags: review?(arpad.borsos)
Attachment #657919 - Flags: review?(arpad.borsos) → review+
Comment on attachment 657919 [details] [diff] [review]
Output the buildername and revision

Thank you for the quick review :-)

http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/64c28d46e682
Comment on attachment 657960 [details] [diff] [review]
Temporary extra logging

To try and help debug bug 785616 comment 4.
Attachment #657960 - Attachment description: Temporary ex → Temporary extra logging
Attachment #657960 - Attachment is patch: true
Attachment #657960 - Flags: review?(arpad.borsos)
Comment on attachment 657960 [details] [diff] [review]
Temporary extra logging

>+          if (PHP_SAPI === 'cli')
>+            echo 'exception generating log: ' . $e . "\n";
>           // rethrow, so we actually kill the script instead of timing out
>           throw $e;

With display_errors now set to On in this version of the logging patch, this hunk is redundant I guess. I won't bother reattaching, but will qref locally.
Comment on attachment 657960 [details] [diff] [review]
Temporary extra logging

Review of attachment 657960 [details] [diff] [review]:
-----------------------------------------------------------------

I don’t quite like the `if (PHP_SAPI === 'cli')` everywhere.
Feel free to leave it for now, but the better solution would be something like Timer, which can enable logging if we need it.
Attachment #657960 - Flags: review?(arpad.borsos) → review+
Comment on attachment 657960 [details] [diff] [review]
Temporary extra logging

(In reply to Arpad Borsos (Swatinem) from comment #53)
> I don’t quite like the `if (PHP_SAPI === 'cli')` everywhere.
> Feel free to leave it for now, but the better solution would be something
> like Timer, which can enable logging if we need it.

Yeah I almost added a switch, but I intend on backing this all out once we have seen the output on tbpl-dev.

http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/e3325a59f257
Depends on: 788197
The import logs in https://tbpl-dev.allizom.org/cache/ are now 0 bytes.

Does the MYSQL_PASSWORD variable in the cron file need updating to the new password too? (At least for the Vagrant project, the sql password is specified in both the cron file & also php/config.php).
Bah, wrong bug sorry.
```
Warning: fclose() expects parameter 1 to be resource, string given in /data/genericrhel6-dev/src/tbpl-dev.allizom.org/tbpl/php/inc/RawGzLogDownloader.php on line 45
```
I must have missed that when I wrote the original patch. We use file_get_contents() now, so the fclose is not necessary.
Attachment #658228 - Flags: review?(bmo)
Comment on attachment 658228 [details] [diff] [review]
Remove leftover fclose()

:-)
Attachment #658228 - Flags: review?(bmo) → review+
(In reply to Ed Morley [:edmorley] from comment #54)
> Comment on attachment 657960 [details] [diff] [review]
> Temporary extra logging
> http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/
> e3325a59f257

Backed out in:
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/435be348c285

Thanks for doing that Arpad :-)
(In reply to Ed Morley [:edmorley] from comment #58)
> Comment on attachment 658228 [details] [diff] [review]
> Remove leftover fclose()
> 
> :-)

http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/750c0f7d20ab
Blocks: 788518
In bug 728203, the new bugscache table schema was revised before landing on tbpl production. This patch makes the in-tree copy match reality.
Attachment #659281 - Flags: review?(arpad.borsos)
Attachment #659281 - Flags: review?(arpad.borsos) → review+
I've played about with this on tbpl-dev and it seems to work pretty well.

The only regression per se is for failures where there were already results returned from the sw:orange search, but none of them matched, so a bug was filed. It now takes 24 hours before the new annotated summaries will show the new bug, due to the bugscache. This doesn't affect failures where the original sw:orange search returned nothing (which from personal experience is >50% of new oranges), since in that case we don't insert the filename into the bugscache.

However, I think that 24 hours may be a bit long to have to manually star (some) new failures, so we may need to resolve this before pushing to prod.

Short term, perhaps we can just reduce the 24 hour expiration to every couple of hours maybe? (Which given that we don't have any caching at all on prod at the moment, will still mean less of a hit on bzapi than at present)

Longer term:
* Add a reload link to the annotated summary panel which refreshes that cache entry.
* We could make the (eventual) solution to bug 779529 clear the cache entry for us (however this doesn't help for bugs filed directly).
* ...any other ideas?

As for pushing to production:
* We'll need to check that bug 785616 (php-cli) & bug 788197 (php-sql) were done for production too (if it is on a different machine).
* The bugscache table is already on production (bug 728203 comment 5), so no DB changes needed.
* ...have I missed anything else?
Depends on: 789506
Swatinem/philor/mbrubeck, any thoughts on comment 62? 

Latest TBPL push to prod is blocked on this, would like to not have to backout if possible - but things like the marionette changes are needed pretty soon :-)
You can change the time range in the importer script as you want. Forcing a reload, well, we could maybe add that later...
If we can refresh every 2-3 hours (or more frequently) then I think we'll be fine.  Pasting the bug manually for the first couple hours isn't too bad, since it's still fresh in the mind of the person who filed it.
Yeah, I don't think I have any problem with 3 hours, if you're starring something that much in the first 3 hours you should be backing out the cause instead of starring it.
Only keeps entries for 3 hours instead of 24.
Depends on: 790559
We may need to increase timeouts; will leave as is for now & we can look on prod:

{
Prefetching: Rev5 MacOSX Mountain Lion 10.8 mozilla-aurora debug test mochitests-5/5 on 8a8505859cd1
php /data/genericrhel6-dev/src/tbpl-dev.allizom.org/tbpl/dataimport/../php/getLogExcerpt.php id=15154383&type=annotated
generating raw log: 755ms
generating general_error log: 1088ms
fetching bugs for "test_handlerApps.xhtml": 6183ms
fetching bugs for "test_unsafeBidiChars.xhtml": 4615ms
fetching bugs for "test_bug760802.html": 6689ms
generating annotatedsummary log: 17782ms
timed out...

Prefetching: Rev5 MacOSX Mountain Lion 10.8 mozilla-aurora debug test mochitests-1/5 on 8a8505859cd1
php /data/genericrhel6-dev/src/tbpl-dev.allizom.org/tbpl/dataimport/../php/getLogExcerpt.php id=15154402&type=annotated
generating raw log: 1105ms
generating general_error log: 2881ms
fetching bugs for "test_webgl_conformance_test_suite.html": 5542ms
fetching bugs for "plugin process 841": 8218ms
fetching bugs for "tab process 839": 4951ms
generating annotatedsummary log: 19140ms
timed out...

Prefetching: Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-4/5 on fe96a330ddd8
php /data/genericrhel6-dev/src/tbpl-dev.allizom.org/tbpl/dataimport/../php/getLogExcerpt.php id=15154400&type=annotated
generating raw log: 1056ms
generating general_error log: 1870ms
fetching bugs for "test_bug500328.html": 5420ms
fetching bugs for "test_bug586662.html": 6578ms
fetching bugs for "test_bug332636.html": 5228ms
fetching bugs for "test_bug410986.html": 4666ms
fetching bugs for "test_bug414526.html": 5313ms
fetching bugs for "test_bug417418.html": 5232ms
fetching bugs for "TestRunner.js)": 5969ms
generating annotatedsummary log: 38706ms
timed out...
}

Alternatively, switching from whiteboard [orange] to a keyword to improve the bzapi call times (which b.m.o admins would prefer anyway) & then bug 788518, may mean we hit the timeouts less anyway.
The timeouts are not a real problem, I would say. Due to the bugscache, processing just continues where it left off once someone requests the log again, in which case it is retrieved faster than it is currently.
Depends on: 732433
(In reply to Ed Morley [:edmorley UTC+1] from comment #70)
> Alternatively, switching from whiteboard [orange] to a keyword to improve
> the bzapi call times (which b.m.o admins would prefer anyway) 

Filed bug 788518.

(In reply to Arpad Borsos (Swatinem) from comment #71)
> The Timeouts are not a real problem I would say. Due to the bugscache,
> processing just continues where it left of once someone requests the log
> again, in which case it is retrieved faster than it is currently.

Yeah agreed it's not a problem, I just meant more that summaries that take ages to load drive sheriffs up the wall, so it would be nice if we could prefetch more of them in the future. As you say, with this bug they will still load quicker than currently - and with bug 788518 + bug 788518 we should just hit the timeout less :-)
In production.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Depends on: 790889
Depends on: 790895
Depends on: 794483
Product: Webtools → Tree Management
Product: Tree Management → Tree Management Graveyard