Closed Bug 1044515 Opened 10 years ago Closed 10 years ago

Legacy vcs sync not able to update legacy mapper quick enough for b2g_build.py, causing b2g builds calls to mapper to timeout

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
Gonk (Firefox OS)
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: pmoore)

References

()

Details

Attachments

(1 file)

b2g nightly's failed on satusday morning with:

https://tbpl.mozilla.org/php/getParsedLog.php?id=44630761&tree=Mozilla-Central

16:42:57  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
16:42:57     INFO - Sleeping 30 seconds before retrying
16:43:27  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
16:43:27     INFO - Sleeping 30 seconds before retrying
16:43:57  WARNING - Error: Mapper returned a revision of None; maybe it needs more time.
16:43:57    FATAL - Giving up on gecko git revision for e07264876182.
Pete, any thoughts on what's going on here ?
Flags: needinfo?(pmoore)
Quick summary
=============

tl;dr: mozilla central hg repo (https://hg.mozilla.org/mozilla-central) was not converted to git within the 14m30s time limit of B2G build reaching the stage where it needed to know the git commit SHA of the converted mozilla central hg changeset on http://git.mozilla.org/?p=releases/gecko.git - probably an hg performance issue.

Explanation of workflow
=======================

So this commit happened in mozilla central:
https://hg.mozilla.org/mozilla-central/graph/196157

This triggered B2G builds, such as:
https://tbpl.mozilla.org/php/getParsedLog.php?id=44630761&tree=Mozilla-Central

For B2G to build, gecko needs to be built from git (since we are using https://code.google.com/p/git-repo/ to build B2G which does not work with hg repos).

To take care of this, legacy vcs sync pushes changes from https://hg.mozilla.org/mozilla-central into http://git.mozilla.org/?p=releases/gecko.git

Shortly after the mozilla central hg repository updated, the legacy vcs sync cron job would have run using this config:
https://hg.mozilla.org/users/hwine_mozilla.com/repo-sync-configs/file/9f59333750d9/releases-gecko

This would have converted this particular hg changeset:
https://hg.mozilla.org/mozilla-central/rev/e07264876182

to this particular git commit:
http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=5315b9b3cccabcda1d177c57b05e6108dff2f990

The B2G build patiently waits for this conversion to happen, querying the "mapper" service every 30 seconds, for a maximum of 30 times (thus 14m30s), to see if the git commit SHA (5315b9b3cccabcda1d177c57b05e6108dff2f990) is known yet, for the given hg changeset SHA (e07264876182) (see http://hg.mozilla.org/build/mozharness/file/c7c9a537ecff/mozharness/mozilla/mapper.py#l18). In other words, from this point, the race is on. Normally it would be done in the allotted 14m30s, but if we are having hg performance issues, it could take longer; therefore probably poor hg performance is the cause of the failure to get the git SHA back in time.

The url it queries, to get the git SHA, is: http://cruncher.build.mozilla.org/mapper/gecko/git/e07264876182 - and as you see, it is now available.

This is the code in b2g_build.py that is querying the git SHA from the hg SHA, and updating the repo manifest (so that it can build b2g using the "repo" tool):
http://hg.mozilla.org/build/mozharness/file/c7c9a537ecff/scripts/b2g_build.py#l561

And this is what the b2g_build.py output should look like (the mozharness build script for building B2G) when it is successful (this working example taken from http://ftp.mozilla.org/pub/mozilla.org/b2g/nightly/2014/07/2014-07-25-16-02-05-mozilla-central/b2g_mozilla-central_nexus-4_eng_nightly-bm70-build1-build12.txt.gz):

21:07:54     INFO - #####
21:07:54     INFO - ##### Running update-source-manifest step.
21:07:54     INFO - #####
21:07:54     INFO - Running main action method: update_source_manifest
21:07:54     INFO - Reading from file /builds/slave/b2g_m-cen_nexus-4_eng_ntly-000/build/sources.xml.original
21:07:54     INFO - Found git_base_url of https://git.mozilla.org/releases/ in manifest.
21:07:54     INFO - Mapping gecko revision to git using http://cruncher.build.mozilla.org/mapper/gecko/git/e07264876182
21:07:54     INFO - Writing to file /builds/slave/b2g_m-cen_nexus-4_eng_ntly-000/build/sources.xml
21:07:54     INFO - Running command: ['diff', '-u', '/builds/slave/b2g_m-cen_nexus-4_eng_ntly-000/build/sources.xml.original', '/builds/slave/b2g_m-cen_nexus-4_eng_ntly-000/build/sources.xml']
21:07:54     INFO - Copy/paste: diff -u /builds/slave/b2g_m-cen_nexus-4_eng_ntly-000/build/sources.xml.original /builds/slave/b2g_m-cen_nexus-4_eng_ntly-000/build/sources.xml
21:07:54     INFO -  --- /builds/slave/b2g_m-cen_nexus-4_eng_ntly-000/build/sources.xml.original	2014-07-25 21:05:35.453884769 -0700
21:07:54     INFO -  +++ /builds/slave/b2g_m-cen_nexus-4_eng_ntly-000/build/sources.xml	2014-07-25 21:07:54.423904421 -0700
21:07:54     INFO -  @@ -141,4 +141,8 @@
21:07:54     INFO -     <project name="platform/hardware/qcom/sensors" path="hardware/qcom/sensors" revision="07c5bcdb36158e22d33bac02eecd83d4ff1fb2f8"/>
21:07:54     INFO -     <project name="platform/hardware/qcom/wlan" path="hardware/qcom/wlan" revision="8bff0c0323c62c34621a818dc1e2f863589aab00"/>
21:07:54     INFO -     <project name="platform/hardware/ril" path="hardware/ril" revision="c57b3eb525db241bd3e6092453f48388bc880447"/>
21:07:54     INFO -  +  <!--Mozilla Info-->
21:07:54     INFO -  +  <!--Mercurial-Information: <remote fetch="https://hg.mozilla.org/" name="hgmozillaorg">-->
21:07:54     INFO -  +  <!--Mercurial-Information: <project name="https://hg.mozilla.org/mozilla-central" path="gecko" remote="hgmozillaorg" revision="e07264876182"/>-->
21:07:54     INFO -  +  <project name="gecko.git" path="gecko" remote="mozillaorg" revision="5315b9b3cccabcda1d177c57b05e6108dff2f990"/>
21:07:54     INFO -   </manifest>
21:07:54     INFO -  \ No newline at end of file
21:07:54     INFO - Return code: 1

Here you see, the repo manifest is updated with the path to the gecko repository, and the revision 5315b9b3cccabcda1d177c57b05e6108dff2f990 is inserted:
    <project name="gecko.git" path="gecko" remote="mozillaorg" revision="5315b9b3cccabcda1d177c57b05e6108dff2f990"/>

I can't see that this repo manifest change gets committed at any point to the b2g manifest repository https://github.com/mozilla-b2g/b2g-manifest - it looks like it exists only as a local diff on the file system in the build directory. But that is no doubt fine, since the hg version is known and recorded, and the git commit SHA is explicitly output to the log.

In short, the B2G build depends on the conversion from hg to git of gecko, and cannot proceed until gecko has been converted and pushed. If 14m30s pass (30 attempts with 30 second interval waits between) and the mapper service has not been updated with the git commit SHA for the given hg changeset SHA, the B2G build fails.

The mapper service itself is updated by the (legacy) vcs sync conversion process, which I believe rsync's the git-mapfile generated by the hggit mercurial plugin during the vcs sync conversion process, over to cruncher. I believe the legacy mapper service reads this map file and is therefore able to respond to e.g. the url request to http://cruncher.build.mozilla.org/mapper/gecko/git/e07264876182.

Please note when bug 946019 lands, the vcs sync conversion process and the mapper service will be switched over to the new mozharness based system. However, this uses the same principles for conversion, so this probably would not help in this case - I believe here the issue was just the time delay to convert the repo, which is done in the same way in old and new vcs sync (using the hggit mercurial plugin - http://hg-git.github.io/).
Flags: needinfo?(pmoore)
Wow, that's a lot of detail, thanks! One thing you didn't cover is that it takes the build some time to need the mapping. In this case we have

hg push at 15:59:01 Pacific
https://hg.mozilla.org/mozilla-central/rev/e07264876182

build starts at 16:14:16
'starttime: 1406330056.2' in log

first mapping attempt at 16:29:18
INFO - Mapping gecko revision to git using http://cruncher.build.mozilla.org/mapper/gecko/git/e07264876182

give up at 16:43:57   
FATAL - Giving up on gecko git revision for e07264876182.

So 45 minutes all up. Can we look at the logs to determine how long it took e07264876182 to get to git.m.o ? And if hg was slow then ?

1600 is a relatively bad time-of-day for load on hg.m.o, as there can still be plenty of jobs from US based devs, and quite a few b2g nightlies firing at 1600. And it's a merge from m-i so it's a relatively large amount of changes to convert. If hg was well, perhaps it is possible to adjust the timings and configuration to do more short cycles to avoid the lag here.
Summary: B2g Nightlys failed due to wrapper error → Legacy vcs sync not able to update legacy mapper quick enough for b2g_build.py, causing b2g builds calls to mapper to timeout - possible hg performance issue
(In reply to Nick Thomas [:nthomas] from comment #3)
> Wow, that's a lot of detail, thanks! One thing you didn't cover is that it
> takes the build some time to need the mapping.

I think I did, but it got a bit buried in the tl;dr ("time limit of B2G build reaching the stage where it needed to know the git commit SHA") and I also didn't realise it was such a significant length of time. Thanks for looking that up - 45 min should really be enough.

> In this case we have
> 
> hg push at 15:59:01 Pacific
> https://hg.mozilla.org/mozilla-central/rev/e07264876182
> 
> build starts at 16:14:16
> 'starttime: 1406330056.2' in log
> 
> first mapping attempt at 16:29:18
> INFO - Mapping gecko revision to git using
> http://cruncher.build.mozilla.org/mapper/gecko/git/e07264876182
> 
> give up at 16:43:57   
> FATAL - Giving up on gecko git revision for e07264876182.
> 
> So 45 minutes all up. Can we look at the logs to determine how long it took
> e07264876182 to get to git.m.o ? And if hg was slow then ?
> 
> 1600 is a relatively bad time-of-day for load on hg.m.o, as there can still
> be plenty of jobs from US based devs, and quite a few b2g nightlies firing
> at 1600. And it's a merge from m-i so it's a relatively large amount of
> changes to convert. If hg was well, perhaps it is possible to adjust the
> timings and configuration to do more short cycles to avoid the lag here.

Unfortunately we only keep last 100 logs, and it is cron'd every minute - so we typically have a little less than two hours' worth of log files... I guess we can look at increasing this. Bug 929336 is on file to put these log files on s3 storage - at that point, we might be able to bump it up. Currently disk space is a little limited on vcs sync machines. The log files are also sync'd to my people repo, until we have them publishing to S3: http://people.mozilla.org/~pmoore/vcs2vcs/gecko-dev/logs/
Pete, are you ok with "owning" (being assigned) this bug, to get it out of the buildduty daily queue until a root cause/remediation is found?
Flags: needinfo?(pmoore)
Sure! :)
Assignee: nobody → pmoore
Flags: needinfo?(pmoore)
So I spoke to :hwine about this yesterday, and he advised me that the process of rsyncing the mapfiles over to cruncher, and then the mapper service importing the data is somewhat staggered, and dependent on running times of a chained set of cronjobs across machines. Therefore, it may be resolvable entirely by tuning the timings of the associated cron jobs.

Hal, can you provide the details in this ticket about which cron jobs on which boxes (and where those cron jobs are defined - /var/spool/cron/ or /etc/cron.d/ or under a user's crontab (cron -l) etc) and then we can look at tuning those timings, or at least investigating how long the data takes to flow from start to finish.

Thanks,
Pete
Status: NEW → ASSIGNED
Flags: needinfo?(hwine)
Summary: Legacy vcs sync not able to update legacy mapper quick enough for b2g_build.py, causing b2g builds calls to mapper to timeout - possible hg performance issue → Legacy vcs sync not able to update legacy mapper quick enough for b2g_build.py, causing b2g builds calls to mapper to timeout
I propose for the short term, while we look into fine tuning the cron timings, that we increase the timeout in b2g_build.py for waiting for the mapping. The problem seems to be with nightlies, which is no doubt exacerbated by the fact that typically a merge is now done just before the nightly runs, which means vcs sync has more work to do (and thus probably takes longer).

Will attach a patch shortly.
Trivial change, but untested. If you want me to set up a test environment, let me know.
Attachment #8464645 - Flags: review?(catlee)
Comment on attachment 8464645 [details] [diff] [review]
bug1044515_mozharness_v1.patch

Review of attachment 8464645 [details] [diff] [review]:
-----------------------------------------------------------------

looks ok...but 60m is a long time. do we have a reasonable upper bound on how long we should wait for this to finish? remember that if something is actually wrong, we'll spend 60m just sitting idle.
Attachment #8464645 - Flags: review?(catlee) → review+
(In reply to Chris AtLee [:catlee] from comment #11)
> Comment on attachment 8464645 [details] [diff] [review]
> bug1044515_mozharness_v1.patch
> 
> Review of attachment 8464645 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> looks ok...but 60m is a long time. do we have a reasonable upper bound on
> how long we should wait for this to finish? remember that if something is
> actually wrong, we'll spend 60m just sitting idle.

In the short term we can land this just to avoid retriggers - but in the real case there was a problem (e.g. vcs sync is broken) we would already get notification emails and alerts to say it is broken - so the build could not complete without vcs sync, so I think we don't lose anything. In other words - worse case - vcs sync totally breaks. We get alerts, and it takes us the full 60 mins to fix plus the lead time before mapper is needed by the build - let's say 90m. In this case, it would still be right for the build to wait for vcs sync, because it can't complete without it. If it failed out, this would probably just cause more "noise" and interrupts, whereas we should already have nagios alerts if vcs sync server goes awry or vcs sync notification emails if something is totally broken.

In any case, we are very soon migrating to the new mapper, where there are not the chained set of cron jobs we have in legacy mapper, so when new mapper goes live this summer, the interface will be changed and this 60m wait will no longer apply.

If you're not happy with this - let me know.
(In reply to Pete Moore [:pete][:pmoore] from comment #10)
> Created attachment 8464645 [details] [diff] [review]
> bug1044515_mozharness_v1.patch
> 
> Trivial change, but untested. If you want me to set up a test environment,
> let me know.

Since this could potentially break b2g, I set up a test environment locally, and tested it. All fine. Will land shortly...

Pete
60 minutes does seem long, here's the current intervals for generating a mapfile
entry:
    lag in conversion hg -> git:
        current cycle time is minimum 10 min (no update); 15-16 min when changes

    lag in propagation of map file to cruncher:
        map files are combined on gd3 every 5 minutes (cron job on gd3)
        map files are pulled to cruncher every 5 minutes (cron job on cruncher)
        map files are pulled into old mapper (very quick iirc)

So, if we "just miss" every cron window, and there's a large conversion needed (per comment 9 -- allowing 20 minutes), we'd be taking 40 minutes the way things are currently tuned.

The following tuning options are available on legacy vcs-sync:
 - run mapfile cron jobs every minute (saves 8 minutes)
 - reduce cycle time on repository polling for key repositories (saves 9 minutes)

Implementing both of those would made the "expected max" time approx 23 minutes.
Flags: needinfo?(hwine)
(In reply to Hal Wine [:hwine] (use needinfo) from comment #15)
> 60 minutes does seem long, here's the current intervals for generating a
> mapfile
> entry:
>     lag in conversion hg -> git:
>         current cycle time is minimum 10 min (no update); 15-16 min when
> changes
> 
>     lag in propagation of map file to cruncher:
>         map files are combined on gd3 every 5 minutes (cron job on gd3)
>         map files are pulled to cruncher every 5 minutes (cron job on
> cruncher)
>         map files are pulled into old mapper (very quick iirc)
> 
> So, if we "just miss" every cron window, and there's a large conversion
> needed (per comment 9 -- allowing 20 minutes), we'd be taking 40 minutes the
> way things are currently tuned.
> 
> The following tuning options are available on legacy vcs-sync:
>  - run mapfile cron jobs every minute (saves 8 minutes)
>  - reduce cycle time on repository polling for key repositories (saves 9
> minutes)
> 
> Implementing both of those would made the "expected max" time approx 23
> minutes.

Thanks Hal.

Since this has resolved the production problems, and legacy mapper should die soon, I'll close this now, as I think more effort on this would not make sense at this point. We can reopen if this resurfaces, or problems occur due to the 60m timeout.

Thanks,
Pete
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: