Closed
Bug 1054050
Opened 10 years ago
Closed 9 years ago
rsync should have a timeout in new vcs sync
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: pmoore, Assigned: mcc.ricardo)
References
Details
(Whiteboard: [good first bug][lang=python])
In bug 1053960 we had vcs sync stuck due to an rsync getting stuck. We should have a timeout for the rsync commands used in vcs sync (and indeed any other operation that potentially can get stuck).
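A minimal sketch of what a per-command timeout could look like, assuming Python 3 and the `timeout` parameter of `subprocess.run`. The helper name `run_with_timeout` and the 600-second default are illustrative assumptions, not taken from the vcs sync code.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds=600):
    """Run cmd (a list of arguments); return its exit code, or None on timeout.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it has not finished within timeout_seconds.
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

A caller could then treat `None` from something like `run_with_timeout(["rsync", "-a", "src/", "dest/"], timeout_seconds=1800)` as "the command hung" and decide whether to retry. The paths and the 1800-second value here are placeholders.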
Reporter | ||
Updated•10 years ago
|
Whiteboard: [good first bug][lang=python]
Assignee | ||
Comment 1•10 years ago
|
||
Hi Pete. I would like to take a shot at this, but I would require some guidance.
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(pmoore)
Reporter | ||
Comment 2•10 years ago
|
||
That's fantastic! I'm very happy to help out. I'm tied up at a planning workweek this week, but I can certainly answer any questions you have now, and next week I can meet with you to give you, e.g., a face-to-face overview over Skype, if that helps.
It might also be worth contacting Kartik from bug 1020613, as he is a new contributor who has just set up a local vcs sync dev environment, so he might be able to help you out with starter questions. Next week my calendar is much emptier so I should have more time. :)
Pete
Flags: needinfo?(pmoore)
Assignee | ||
Comment 3•10 years ago
|
||
Absolutely, Pete. I'm pretty swamped myself this week too. But I'll contact Kartik and get the ball rolling by setting up a local dev environment.
Reporter | ||
Comment 4•10 years ago
|
||
Hi Ricardo,
Things are a bit quieter now - let me know if you'd still like to hook up. I'm on build duty next week, but I'm sure we'll be able to find for example a 30 min slot to skype. I am around usually between 00:00 PDT and 09:00 PDT - does this suit your timezone?
Thanks,
Pete
Flags: needinfo?(mcc.ricardo)
Assignee | ||
Comment 5•10 years ago
|
||
Hi Pete,
That would be great. I was also busy, but next week I'll have more time. 00:00 PDT and 09:00 PDT correspond to 08:00 BST and 17:00 BST. I think we can book a Skype call for 08:30 BST (00:30 PDT). What do you think?
I will be on holiday from 27/09 to 12/10. I would like to get the ball rolling before that.
Flags: needinfo?(mcc.ricardo)
Reporter | ||
Comment 6•10 years ago
|
||
That sounds good. Monday I'm not around - shall we say Tuesday 23 Sep at 08:30 BST? My skype name is pete-rwe.
Thanks,
Pete
Assignee | ||
Comment 7•10 years ago
|
||
Perfect Pete. 23/09 8:30 BST.
My Skype name is odracirortsac.
Assignee | ||
Comment 8•10 years ago
|
||
Hi Pete,
Sorry this is taking longer than I anticipated. I've just been swamped with work.
We've been chatting the past few days but I just wanted to leave a note here for reference, for future progress.
I'll be on holiday for the next two weeks, so it might take a little bit longer for this to get completed.
Ricardo
Reporter | ||
Comment 9•10 years ago
|
||
No problem at all Ricardo! Many thanks for your support on this topic! =)
Pete
Assignee | ||
Comment 10•10 years ago
|
||
Hi Pete,
I'm back from holiday. I'll be resuming work on this ticket.
Ricardo
Assignee | ||
Comment 11•10 years ago
|
||
Hey Pete,
I just ran vcs_sync and synced it to this GitHub test repo. Can you please check it out?
https://github.com/mccricardo/mozharness-test
Ricardo
Reporter | ||
Comment 12•10 years ago
|
||
Hey Ricardo!
It looks pretty good!
While you were on holiday, I discovered https://github.com/mozilla/build-puppet/blob/23aa7ab10f14f81c09c8caf67536fa65329b393b/modules/aws_manager/templates/kill_long_running.sh.erb in our puppet repo, which might help us solve this bug more generically... maybe.
Basically we have a mechanism in our puppet code to set up cron jobs that automatically have watcher cron jobs monitoring them, and killing them if they take too long. This would capture any hangs from any part of the process, I believe, so we wouldn't have to special-case every possible hg command that could hang in the vcs sync code.
For example, see the process_timeout entries in https://github.com/mozilla/build-puppet/blob/23aa7ab10f14f81c09c8caf67536fa65329b393b/modules/aws_manager/manifests/cron.pp where the cron jobs are configured.
We could potentially set this up for our vcs sync jobs too.
I have to go now - but maybe something to think about as a more generic solution to this problem.
Disadvantage: no special handling of problems, such as retry - that said, if vcs sync fails due to a hang, on the next iteration it should retry anyway.
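To illustrate the watcher idea, here is a rough Python sketch. It is not the logic of kill_long_running.sh.erb: the function names are hypothetical, and it assumes a POSIX `ps` that supports `-o etime=` (elapsed time printed as `[[dd-]hh:]mm:ss`).

```python
import os
import signal
import subprocess


def parse_etime(etime):
    """Convert ps 'etime' output ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    etime = etime.strip()
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    # Pad to [hours, minutes, seconds] when hours are omitted.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def kill_if_too_old(pid, max_seconds):
    """Send SIGTERM to pid if it has run longer than max_seconds.

    Returns True if a signal was sent. Hypothetical helper for illustration.
    """
    result = subprocess.run(
        ["ps", "-o", "etime=", "-p", str(pid)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return False  # the process no longer exists
    if parse_etime(result.stdout) <= max_seconds:
        return False
    os.kill(pid, signal.SIGTERM)
    return True
```

A watcher cron job could call something like `kill_if_too_old(pid, 3600)` for the pid of the monitored job. How the pid is discovered (pid file, process name match, etc.) is left open here.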
Assignee | ||
Comment 13•10 years ago
|
||
Hey Pete,
Just checking out some of the code.
So the basic idea would be that each job is monitored by a cron job that kills it after a certain timeout period.
Since I'm not yet totally at ease with vcs_sync, let me fire off some questions.
Since I'm running vcs_sync similarly to:
python scripts/vcs-sync/vcs_sync.py -c configs/vcs_sync/build-repos.py
the idea would be to have a cron job watch that execution and kill it after a timeout period?
I believe vcs_sync has to execute its actions (defined in the config file), and then we should monitor via the cron job whether it has hung or not.
Assignee | ||
Comment 14•10 years ago
|
||
Hey Pete,
I added the script you mentioned above to our code. I tested it, and with this script in place, each time vcs_sync is launched, a cron job can be set up to monitor its elapsed time, terminating it if it's taking too long.
I ran vcs_sync on my local machine and it took roughly 25 minutes, so I set a 1-hour timeout. I don't know if that's the most sensible value. You probably have a better notion of what it should be.
You can find the code here: https://github.com/mccricardo/build-mozharness/compare/vcs-sync-timeout
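For comparison, a whole-job deadline can also be sketched as a small Python wrapper instead of a separate watcher. This is only an illustration under stated assumptions (POSIX, Python 3, a hypothetical function name `run_job_with_deadline`), not the code in the branch above. It starts the job in its own process group so that child processes such as hg or rsync are killed together with it.

```python
import os
import signal
import subprocess


def run_job_with_deadline(cmd, deadline_seconds):
    """Run cmd in its own process group; kill the whole group on timeout.

    Returns the exit code, or None if the deadline was exceeded.
    POSIX only. The function name is hypothetical.
    """
    # start_new_session=True makes the child a session and group leader,
    # so processes it spawns share its process group by default.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        # The child's process group id equals its pid.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed process
        return None
```

With the 1-hour value mentioned above, the call would pass the vcs_sync command line as a list of arguments and `deadline_seconds=60 * 60`.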
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → mcc.ricardo
Reporter | ||
Comment 15•10 years ago
|
||
(In reply to Ricardo Castro from comment #13)
> Hey Pete,
>
> Just checking out some of the code.
>
> So the basic idea would be that each job is monitored by a cron job that
> kills it after a certain timeout period.
Yes, that's right.
>
> Since I'm not yet totally at ease with vcs_sync, let me fire off some
> questions.
>
> Since I'm running vcs_sync similarly to:
>
> python scripts/vcs-sync/vcs_sync.py -c configs/vcs_sync/build-repos.py
>
> the idea would be to have a cron job watch that execution and kill it after
> a timeout period?
Yes.
>
> I believe vcs_sync has to execute its actions (defined in the config file),
> and then we should monitor via the cron job whether it has hung or not.
Yes, exactly.
Reporter | ||
Comment 16•10 years ago
|
||
(In reply to Ricardo Castro from comment #14)
> Hey Pete,
>
> I added the script you mentioned above to our code. I tested it, and with
> this script in place, each time vcs_sync is launched, a cron job can be
> set up to monitor its elapsed time, terminating it if it's taking too long.
>
> I ran vcs_sync on my local machine and it took roughly 25 minutes, so I set
> a 1-hour timeout. I don't know if that's the most sensible value. You
> probably have a better notion of what it should be.
>
> You can find the code here:
> https://github.com/mccricardo/build-mozharness/compare/vcs-sync-timeout
Thanks for doing this! This works out quite well: Callek is currently working on puppetizing vcs sync over in bug 927199, so when that work is done, we will be in a position to update the cron jobs in puppet to use this mechanism. I will therefore add bug 927199 as a dependency of this bug, since we should first get vcs sync running in puppet in order to take advantage of the puppet module we already have for creating monitoring cron jobs, as per e.g. https://github.com/mozilla/build-puppet/blob/23aa7ab10f14f81c09c8caf67536fa65329b393b/modules/aws_manager/manifests/cron.pp
Assignee | ||
Comment 17•10 years ago
|
||
Excellent Pete!
Assignee | ||
Updated•10 years ago
|
Status: NEW → ASSIGNED
Reporter | ||
Comment 18•10 years ago
|
||
Hey Ricardo,
While we are waiting for bug 927199, would you be interested in getting involved with other bugs? Sorry this one has to stall for a little bit.
Pete
Flags: needinfo?(mcc.ricardo)
Assignee | ||
Comment 19•10 years ago
|
||
Hey Pete,
I'm currently working on Pulse and Mozbench. Do you have any bugs you think I might be able to give you a hand with?
Flags: needinfo?(pmoore)
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(mcc.ricardo)
Reporter | ||
Comment 20•10 years ago
|
||
Hey Ricardo, that's very kind. We have some bugs here:
https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=RelEng%20good%20first%20bugs
For example https://bugzilla.mozilla.org/show_bug.cgi?id=1072026 might be quite straightforward, if you fancy a stab!
If you prefer working on Pulse and Mozbench, no worries. Any contributions you are making to Mozilla are gratefully received. Thanks again for your support! =)
Pete
Flags: needinfo?(pmoore)
Assignee | ||
Comment 21•10 years ago
|
||
Cool :) I'll give it a look!
Reporter | ||
Updated•9 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Updated•8 years ago
|
Component: Tools → General