Closed Bug 790889 Opened 9 years ago Closed 9 years ago

getParsedLog.php & getLogExcerpt.php?type=annotated timing out ("Timeout waiting for log generation.")

Categories

(Tree Management Graveyard :: TBPL, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

(Depends on 1 open bug)

Details

(Whiteboard: [sheriff-want])

Attachments

(2 files)

Both the initial prefetch and attempts to load the annotated summary, or even the brief log time out; presumably due to the changes in ParallelLogGenerating.php or AnnotatedSummaryGenerator.php

Examples:
(Win7 m-oth)
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=36d865c2f094
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=6b6dc0f7f82a

Also occurs on the same logs on tbpl-dev:
https://tbpl-dev.allizom.org/?tree=Mozilla-Inbound&rev=36d865c2f094

Output of prefetch import:
https://tbpl-dev.allizom.org/cache/import-log-1347519301.txt
{
Prefetching: Rev3 WINNT 6.1 mozilla-inbound debug test mochitest-other on 36d865c2f094
php /data/genericrhel6-dev/src/tbpl-dev.allizom.org/tbpl/dataimport/../php/getLogExcerpt.php id=15183344&type=annotated
generating raw log: 984ms
completed after 17360ms
}

Compared to other logs:
{
Prefetching: Rev3 WINNT 6.1 elm opt test xpcshell on ea384fb8d2c3
php /data/genericrhel6-dev/src/tbpl-dev.allizom.org/tbpl/dataimport/../php/getLogExcerpt.php id=15183336&type=annotated
generating raw log: 185ms
generating general_error log: 283ms
generating annotatedsummary log: 39ms
completed after 371ms
}

or

{
Prefetching: Rev3 WINNT 6.1 try debug test reftest on fa6e314ed62c
php /data/genericrhel6-dev/src/tbpl-dev.allizom.org/tbpl/dataimport/../php/getLogExcerpt.php id=15183340&type=annotated
generating raw log: 550ms
generating general_error log: 1748ms
fetching bugs for "linear-repeat-1c.html": 4660ms
fetching bugs for "repeating-linear-1a.html": 5779ms
fetching bugs for "repeating-linear-1b.html": 5065ms
generating annotatedsummary log: 15571ms
completed after 17371ms
}

The problematic logs are:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-debug/1347510795/mozilla-inbound_win7-debug_test-mochitest-other-bm16-tests1-windows-build3717.txt.gz
  and
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-debug/1347511815/mozilla-inbound_win7-debug_test-mochitest-other-bm15-tests1-windows-build3897.txt.gz

Not sure if we need to back out the prefetching or not at this point :-( At least now we have bug 656155 in production, opening the raw log is less of a hassle, so might stick with it for a bit to give us a chance to debug.
For the other failure:
{
Prefetching: Rev3 WINNT 6.1 mozilla-inbound debug test mochitest-other on 6b6dc0f7f82a
php /data/genericrhel6-dev/src/tbpl-dev.allizom.org/tbpl/dataimport/../php/getLogExcerpt.php id=15183298&type=annotated
generating raw log: 947ms
completed after 12215ms
}

We're dying at:
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/23ba106f6d47/php/inc/ParallelLogGenerating.php#l30

Trying to reproduce locally with some more logging.
Locally with extra logging, I get...

{
Prefetching: Rev3 WINNT 6.1 mozilla-inbound debug test mochitest-other on 36d865c2f094
php /var/www/tbpl/dataimport/../php/getLogExcerpt.php id=15183344&type=annotated
starting general_error log generation
starting raw log generation
generating raw log: 7228ms
generating general_error log: 13646ms
starting annotatedsummary log generation
fetching bugs for "browser_586068-cascaded_restore.js": 5730ms
generating annotatedsummary log: 6177ms
completed after 20108ms
}

So my guess is that we get stuck with content as NULL, and never delete it to let another process try processing the log?
Swatinem: edmorley: you are right with content being NULL
Swatinem: so prefetch starts the processing, php writes NULL to synchronize, then prefetch kills the php process and it has no chance to clear that NULL flag
Swatinem: im not at home right now so i cant patch that myself
Swatinem: just stick a DELETE in after the prefetch step in the importer
Swatinem: should be fine running that every 5 minutes or so
edmorley: for all content NULL ?
Swatinem: and a manual DELETE should also be a short term solution until the change is deployed
Swatinem: using DELETE WHERE content IS NULL will surely delete some genuinely running jobs
Swatinem: but its better to have duplicated processing then none at all
edmorley: we could just delete the individual NULL entry as part of that worker's timeout I guess? (won't fix the cases that are already in the DB)
edmorley: the latter we could get a DBA to fix
edmorley: or else patch and revert
Swatinem: we could do that when we kill the job, yes: http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/b3f9daa143ce/dataimport/import-buildbot-data.py#l377
edmorley: although the output would indicate we never got the "Timed out..." for some reason?
Swatinem: i dont think php has a sig_kill handler? so it cant clean up after itself...
Swatinem: which output?
edmorley: https://tbpl-dev.allizom.org/cache/import-log-1347519301.txt
edmorley: generating raw log: 984ms
edmorley: completed after 17360ms
edmorley: (from comment 0)
Swatinem: edmorley: hm in that case i have no idea, it should say `generating annotatedsummary log`
Swatinem: so python thinks the process exited normally, but it did not?
Swatinem: so i think just DELETEing all NULL rows would be the safest thing to do, also in case the php script died for some other reason and was not supervised by the python importer
edmorley: yeah agree
edmorley: I think we need a bunch more logging too
Swatinem: we do :-D
Swatinem: i mean the NULL based synchronization has been there for some time now, i wonder why it hasnt been a problem so far
Swatinem: apart from the fact that we now show the symptoms for every log that needs > 30s to generate
edmorley: rather than 60
edmorley: yeah
edmorley: well we have hit it before iirc
edmorley: months ago, when we had a spell of horrid oranges that would dump loads of stuff into the log
edmorley: and so presumably time out in the bzapi calls
edmorley: in fact, should we just remove the null at the die('Timeout waiting for log generation.'); line?
edmorley: (I'll just attach both patches, save pestering you any more for now :-))
Attached patch Crude solutionSplinter Review
The simple but crude solution discussed on IRC. Will mean that every 5 minutes we clear out all in-progress jobs, so potentially risk processing some logs twice (albeit hopefully not many, since prefetch will take care of most logs people are looking at on TBPL) - but at least mean we don't get stuck on "Timeout waiting for log generation".
Assignee: nobody → bmo
Status: NEW → ASSIGNED
Attachment #660782 - Flags: review?(arpad.borsos)
Whilst the issue in this bug isn't caused by hitting the die() (since we'd see it in the logs), I think it would be good to clean up properly here too.
Attachment #660785 - Flags: review?(arpad.borsos)
So whilst both of these patches are helpful, they don't fix the root cause: which is why was the worker exiting early during log generation for those specific logs? (there was no "Timed out..." from the python side). Ideally we should check and output the exit code of the worker (and switch away from die() since it exits with code 0).

We also need more logging. Will file some bugs shortly.
Comment on attachment 660782 [details] [diff] [review]
Crude solution

With this on production, for one of the affected runs I now get:

"Service Unavailable

The service is temporarily unavailable. Please try again later."

in the annotated summary box - which is presumably our root cause.
philor: edmorley: and it's not debug Windows Moth we can't load logs for, it's debug Moth logs over 2MB, there's a 2.9MB Linux64 one on aurora

I'll take a look at this & the fact that failures of the type bug 681138 no longer given suggestions, tomorrow :-)
Attachment #660785 - Flags: review?(arpad.borsos) → review+
Attachment #660782 - Flags: review?(arpad.borsos) → review+
For the reftest_with_caret failures not getting suggestions, I found a test_webgl_conformance_test_suite.html failure (which gets 30 or so bugs, to reftests_with_caret's 25) which also no longer gets suggestions, so I think we're tripping over more suggestions than we expect to ever get, somehow.
Depends on: 791259
We currently have the 30 second timeout in import-buildbot-data.py as well as the cli-only ini_set('max_execution_time', 30); 

I presume in some cases we're hitting the latter, and thus why there is no "Timed out..." on the python side? I think we should get rid of one of the other; if only because someone using "-t 60" from the python script is going to wonder what is going on.

Anyway; comparing tbpl-dev with prod, shows the import-buildbot-data.py patch seems to help at least for some of the cases, eg:

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Rev3%20WINNT%206.1%20mozilla-inbound%20debug%20test%20mochitest-other&rev=8522498a85a3

vs

https://tbpl-dev.allizom.org/?tree=Mozilla-Inbound&onlyunstarred=1&jobname=Rev3%20WINNT%206.1%20mozilla-inbound%20debug%20test%20mochitest-other&rev=8522498a85a3

So I think we should push to production again whilst I keep on working on those that still timeout/no longer give suggestions. (After bug 791259, I'm now working on adding more logging & getting the it to also show on non-cli when using getLogExcerpt.php?&debug=1 etc... I'm just away tomorrow, so won't be able to look again until Sunday).
Depends on: 791282
Depends on: 790571
Depends on: 793147
Depends on: 793159
Depends on: 793163
Depends on: 793630
Whiteboard: [sheriff-want]
Depends on: 794116
Depends on: 791389, logspam
Depends on: 794483
I haven't seen any perma-timeouts for a while now - and the "failed to prefetch mochitest-other" situation on tbpl-dev is already a lot better than prod - I believe due to the |ini_set('memory_limit', "2048M")| addition for php-cli, to bring it in line with that in .htaccess (in bug 791259).

However, I'm going to continue working on the dependant bugs in case this comes back / to improve overall performance + ability to debug.
Interesting that our experience would be so different: the only reason I don't see perma-timeouts on pretty much every Moth is because I always open the log in a separate tab without selecting the job, before selecting the job. I just tried selecting the job without opening the log on the Win debug Moth in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=490abcc57fec, which did time out.
Hmmm... some do take a while to load for me; I had put that down to them not prefetching, and use j/k to skip them and come back to them at the end - at which point they eventually do load. I guess perhaps this is another form of pre-loading the log, similar to your open in a new tab trick?
Do you still get timeouts if you try tbpl-dev? (You may need to wait for some new runs, since I cycled through the failures on tbpl-dev ~10mins ago)

Things definitely seem better on dev for me, so I might request a push to prod in case it improves things for you, whilst I work on this more.
Depends on: 795460
Depends on: 799637
I haven't seen timeouts for ages.

Think we're done enough here that we can close this out.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Webtools → Tree Management
Product: Tree Management → Tree Management Graveyard
You need to log in before you can comment on or make changes to this bug.