Closed
Bug 1462323
Opened 7 years ago
Closed 7 years ago
Intermittent Exception: Could not find any candidate pushheads in the last 50 revisions.
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: intermittent-bug-filer, Assigned: gps)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell disable-recommended])
Filed by: nbeleuzu [at] mozilla.com
https://treeherder.mozilla.org/logviewer.html#?job_id=178963580&repo=mozilla-inbound
https://queue.taskcluster.net/v1/task/BxHpemi3SI26PQUNtJZjGg/runs/0/artifacts/public/logs/live_backing.log
Failures started on a backout from inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/786865568ed76134a3f9724956949e5d48f34210
[task 2018-05-17T12:20:14.836Z] 12:20:14 INFO - Ignoring exception unpickling cache file /builds/worker/.mozbuild/package-frontend/artifact_url-cache.pickle: IOError(2, 'No such file or directory')
[task 2018-05-17T12:20:14.836Z] 12:20:14 INFO - hg suggested 500 candidate revisions
[task 2018-05-17T12:20:14.836Z] 12:20:14 INFO - Ignoring exception unpickling cache file /builds/worker/.mozbuild/package-frontend/pushhead_cache-cache.pickle: IOError(2, 'No such file or directory')
[task 2018-05-17T12:20:14.836Z] 12:20:14 INFO - Attempting to find a pushhead containing 786865568ed76134a3f9724956949e5d48f34210 on mozilla-central.
[task 2018-05-17T12:20:14.837Z] 12:20:14 INFO - Attempting to find a pushhead containing 786865568ed76134a3f9724956949e5d48f34210 on integration/mozilla-inbound.
[task 2018-05-17T12:20:14.837Z] 12:20:14 INFO - Attempting to find a pushhead containing 786865568ed76134a3f9724956949e5d48f34210 on releases/mozilla-beta.
[task 2018-05-17T12:20:14.837Z] 12:20:14 INFO - Attempting to find a pushhead containing 786865568ed76134a3f9724956949e5d48f34210 on integration/autoland.
[task 2018-05-17T12:20:14.837Z] 12:20:14 INFO - Error running mach:
[task 2018-05-17T12:20:14.837Z] 12:20:14 INFO - ['--log-no-times', 'artifact', 'install']
[task 2018-05-17T12:20:14.838Z] 12:20:14 INFO - The error occurred in code that was called by the mach command. This is either
[task 2018-05-17T12:20:14.838Z] 12:20:14 INFO - a bug in the called code itself or in the way that mach is calling it.
[task 2018-05-17T12:20:14.838Z] 12:20:14 INFO - You should consider filing a bug for this issue.
[task 2018-05-17T12:20:14.838Z] 12:20:14 INFO - If filing a bug, please include the full output of mach, including this error
[task 2018-05-17T12:20:14.838Z] 12:20:14 INFO - message.
[task 2018-05-17T12:20:14.838Z] 12:20:14 INFO - The details of the failure are as follows:
[task 2018-05-17T12:20:14.838Z] 12:20:14 INFO - Exception: Could not find any candidate pushheads in the last 50 revisions.
[task 2018-05-17T12:20:14.838Z] 12:20:14 INFO - Search started with 786865568ed76134a3f9724956949e5d48f34210, which must be known to Mozilla automation.
[task 2018-05-17T12:20:14.838Z] 12:20:14 INFO - see https://developer.mozilla.org/en-US/docs/Artifact_builds
Comment hidden (Intermittent Failures Robot)
Assignee
Updated • 7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Comment 3 • 7 years ago
It's not a dupe of bug 1382982. The other one was fixed two months ago and this is still happening.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Comment 4 • 7 years ago
This is happening pretty frequently on try. The failing jobs are finding the try push head to be public and basing the artifact search on that, which fails. It even happens on some jobs but not others on certain pushes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa22e5da77fbc431c5ec537f086bce9554fd27e4
The failing jobs are finding aa22e5da77fbc431c5ec537f086bce9554fd27e4 returned from this revset: https://searchfox.org/mozilla-central/rev/dc6d85680539cb7e5fe2843d38a30a0581bfefe1/python/mozbuild/mozbuild/artifacts.py#997 which would appear to indicate that the repos on those builders are inadvertently marking those changesets as public.
Comment 5 • 7 years ago
gps, are you aware of any recent change on the hg side that might be causing the behavior in comment 4?
Flags: needinfo?(gps)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 7 • 7 years ago
We upgraded the ssh://hg.mozilla.org endpoint from Mercurial 4.2 this week: first to 4.3, then to 4.4 a day or two later.
Mercurial 4.4 did provide a new facility in the wire protocol for exchanging phases via bundle2. But we disabled this feature on the SSH servers. No meaningful upgrades were performed on the HTTP machines this week.
According to https://hg.mozilla.org/try/json-rev/aa22e5da77fbc431c5ec537f086bce9554fd27e4, that changeset is draft. The first public changeset in its ancestry is 9055d9d89a4b, which seems correct. I've confirmed all servers are advertising it as draft.
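For reference, checking what the server itself advertises is straightforward; this is a minimal sketch against the json-rev API linked above (assuming, as this comment implies, that the response carries a `phase` field; the changeset is the Try push head from comment 4):

```python
import json
import urllib.request

# Try push head from comment 4; the server should report it as draft.
NODE = "aa22e5da77fbc431c5ec537f086bce9554fd27e4"
URL = "https://hg.mozilla.org/try/json-rev/" + NODE

with urllib.request.urlopen(URL) as resp:
    rev = json.load(resp)

# "draft" is the expected answer; a build worker whose local clone reports
# "public" for the same node is exhibiting this bug.
print(rev.get("phase"))
```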
The decision task for the try push linked in comment #4 looks like it has sane phases:
[task 2018-05-24T15:07:57.340Z] Querying version control for metadata: https://hg.mozilla.org/try/json-automationrelevance/aa22e5da77fbc431c5ec537f086bce9554fd27e4
[task 2018-05-24T15:07:57.340Z] attempt 1/10
[task 2018-05-24T15:07:57.341Z] retry: calling get_automationrelevance, attempt #1
[task 2018-05-24T15:08:02.026Z] 3 commits influencing task scheduling:
[task 2018-05-24T15:08:02.026Z] e223c9ad093b Bug 1463682 [wpt PR 11116] - Run autopep8 on *.py in cors, fetch and xhr, a=testonly
[task 2018-05-24T15:08:02.026Z] 43f6c7166cc1 Bug 1463682 [wpt PR 11116] - Update wpt manifest
[task 2018-05-24T15:08:02.031Z] aa22e5da77fb try: -b do -p win32,win64,linux64,linux ...
So the bug appears to be that phases in the clones on the build workers are being bumped to public when they shouldn't be. That's very odd. And I'm not sure how upgrading Mercurial on the SSH servers could have triggered that. We do generate the pre-generated bundles on hgssh. But those bundles only contain the base revisions (which are public).
It should not be possible for a phase to revert to draft after it has been set to public. So unless the replication system on hg.mo is somehow triggering that, I can't explain how machines are getting public phases for changesets on Try that are clearly draft on the server.
I'll keep the needinfo to test some things...
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 11 • 7 years ago
I suspect the failure in comment #0 is not related to the bulk of the other failures.
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-05-20&endday=2018-05-29&tree=all&bug=1462323 clearly indicates that mass failures started on May 24, with the first batch of failures being reported at 2018-05-24 13:40:18.
Mercurial 4.4.2 was deployed to hgssh at ~16:55 on May 24. However, as I wrote previously, this *shouldn't* have impacted phases of changesets. Plus, failures started happening *before* any code changes. So surely there's no cause and effect there.
Bundles for mozilla-unified were regenerated at ~0722 and ~1816 UTC on May 24. So I don't think those are related.
For a moment, I thought that this was only occurring on tasks that used a pre-existing store and only performed an incremental pull. That could indicate a kind of "cache poisoning" issue where a Try push that performed `hg phase --public` or some such contaminated the store, leading to this issue. However, failures like https://treeherder.mozilla.org/logviewer.html#?job_id=180693489&repo=try&lineNumber=1162 seem to be a counterexample since they involve a fresh clone.
I was thinking that this might be a race condition in the replication system somehow. Maybe the artifact task runs before replication has finished. But on the slowest machines, replication doesn't take any longer than ~30s in the worst case. And a decision task seems to always take longer than the worst replication delay. So I don't think there's a race here.
I was thinking that maybe we were writing a message into the replication log that forcefully reverted phases from public to draft. But I can find no evidence of any such messages in the replication system. I'm quite confident that phases on the hg.mo servers are pushed as draft and staying as draft. i.e. the phases are never advertised as public.
It's worth emphasizing that these issues seem to predate any code deployments to hg.mo. I want to believe that a service deployment changed behavior. But the timelines just don't line up.
The logs and the artifact code seem to indicate that `hg log -r 'last(public() and ::., {num})'` and therefore Mercurial are saying the push head is public instead of draft. I just have no friggin clue how the phase is getting set to public, especially in the case of fresh clones. If we didn't have logs with fresh clones, I'd say this was a cache poisoning issue of sorts. But since it reproduces on fresh clones with a modern Mercurial 4.5.2 client, something is *really* amiss. Arrows seem to point to the server somehow. But, again, the timelines don't seem to line up to any code changes. This is most weird.
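For anyone trying to confirm the discrepancy from inside an affected clone, here is a rough diagnostic sketch (using the revision and the 50-revision window from the log in comment 0; this is not the actual artifacts.py code):

```python
import subprocess

# Push head from the failure in comment 0.
REV = "786865568ed76134a3f9724956949e5d48f34210"

def hg(*args):
    return subprocess.run(["hg", *args], capture_output=True, text=True,
                          check=True).stdout

# Roughly the query the artifact code runs: recent public ancestors of '.'.
candidates = hg("log", "-r", "last(public() and ::., 50)", "-T", "{node}\n").split()

# The phase of the push head itself, according to this clone.
phase = hg("log", "-r", REV, "-T", "{phase}").strip()

print("candidate pushheads found:", len(candidates))
print("phase of %s: %s" % (REV[:12], phase))  # "public" here reproduces the bug
```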
Assignee
Comment 12 • 7 years ago
Since OrangeFactor is complaining about this being a frequent failure, this is more important than a P5.
Priority: P5 → P2
Assignee
Comment 13 • 7 years ago
I was also thinking this could be a race condition involving non-atomic writes of the .hg/store/phaseroots file. I.e. a writer could be updating the file at the same time a reader pulls it, so the reader doesn't see an entry in the file and thinks the changeset is public. But Mercurial writes this file to a temporary file in the same directory and then does an atomic rename. So there should be no file-level reader inconsistency involved.
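For illustration, the write-to-a-temp-file-then-rename pattern described above looks roughly like this (a generic sketch, not Mercurial's actual implementation):

```python
import os
import tempfile

def atomic_write(path, data):
    # Write into a temporary file in the same directory, then rename() it over
    # the destination. On POSIX, rename() within one filesystem is atomic, so a
    # concurrent reader sees either the complete old file or the complete new
    # file -- never a partially written one.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-phaseroots-")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```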
Assignee
Comment 14 • 7 years ago
Augie: does this issue ring a bell to you?
tl;dr it looks like 4.5.2 clients are pulling a draft changeset from a 4.4.2 server via hgweb on mod_wsgi but local `hg log -r 'public()'` queries say that changeset is public. There's no obvious trigger from a code deploy and I'm really grasping at straws trying to explain this.
Note that this is happening on our Try repository, which has thousands of draft heads. The .hg/store/phaseroots file is 7,566,822 bytes and growing. The repo has 65,126 heads and growing.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 17 • 7 years ago
This is badly affecting the wpt sync, which tends to conservatively assume that broken builds are a non-ignorable problem. In some cases e.g. https://treeherder.mozilla.org/#/jobs?repo=try&revision=120dc1e5b11a7d7bb3c080965f66e0bf1be23df5&selectedJob=180987268 it has required multiple retriggers to get a working build, which suggests that whatever the problem here is, it isn't a race condition happening at the time of the initial push.
In the absence of some idea about the underlying problem, how difficult would it be to make the revision selection code depend less on the phase metadata?
Assignee
Comment 18 • 7 years ago
Adding needinfo flag, which I apparently missed in comment 14.
Flags: needinfo?(raf)
Comment 19 • 7 years ago
That doesn't ring a bell, but 65,126 heads makes me wonder if there's some integer rounding problems in some revlog C code...
(I'm in the middle of some other debugging at work, but can unwind my stack and make time for you if you want my undivided attention)
Flags: needinfo?(raf)
Assignee
Comment 20 • 7 years ago
I somehow managed to reproduce this locally with a local clone of the try repo. I suspect it is a Mercurial bug of some sort. I'll keep digging this afternoon.
Assignee: nobody → gps
Status: REOPENED → ASSIGNED
Component: General → Mercurial: hg.mozilla.org
Flags: needinfo?(gps)
Product: Firefox Build System → Developer Services
Version: Version 3 → unspecified
Assignee
Comment 21 • 7 years ago
I was able to somewhat reliably reproduce the failure by performing an `hg pull` which pulled down new changesets at the same time the server was also processing an `hg pull` to replicate changesets. However, I wasn't able to reproduce it if the server wasn't performing an `hg pull` to replicate a new push.
This steered me in the direction of a race condition.
I was never able to trigger the race condition when all the servers active in the load balancer were identical. But as soon as I put the slower servers in the load balancer configuration, I was able to hit the race condition.
After staring at Mercurial's code for a while and adding debug statements in a number of places, I found that the client was seeing a snapshot of the repository that shouldn't have been possible.
I poked around in the load balancer configs and found the settings related to HTTP/1.1 Keep-Alive. Two settings were relevant:
1) keepalive: Whether or not the pool should maintain HTTP keepalive connections to the nodes.
2) keepalive!non_idempotent: Whether or not the pool should maintain HTTP keepalive connections to the nodes for non-idempotent requests.
keepalive was enabled. keepalive!non_idempotent was not.
What this meant was that HTTP GET and HEAD requests (the "idempotent" methods as defined by the HTTP spec) resulted in HTTP keepalive connections between the load balancer and the node running Mercurial. However, other HTTP methods (notably POST) would kill the persistent HTTP connection.
We have our Mercurial servers configured to use HTTP POST for wire protocol communication.
Mercurial clients do establish a persistent keepalive HTTP connection to the server and attempt to issue all HTTP requests against that single connection. And from the perspective of the clients, this was working: the TCP connection between the Mercurial client and the load balancer was persisted and reused.
However, as soon as an HTTP POST was issued, the load balancer killed the connection between it and the hgweb server. The next HTTP POST from the Mercurial client resulted in a *new* HTTP connection being established between the load balancer and a *random* hgweb server.
So essentially what was happening was the Mercurial client issued a POST to ask for data. That request was serviced fine. The next POST had a >50% chance of hitting a *different* hgweb server. And if the subsequent request hit a different server that wasn't exactly in sync with the original server/request, then the server would advertise inconsistent data. This confused the Mercurial client in such a way that it promoted a bunch of changesets from draft to public.
Why exactly the Mercurial client reacted in this way, I'm not sure. I suspect it has something to do with seeing data for a node that it doesn't know about or not seeing data for a node it should be receiving data for. I never made it that far down the rabbit hole.
I don't think the behavior of the Mercurial client is worth investigating too much. At least not by me (at this time). The reason is that the load balancer was behaving nonsensically by routing requests to different servers and therefore exposing an inconsistent snapshot of a repository to a client's persistent HTTP connection.
I flipped the load balancer setting for keepalive!non_idempotent so non-idempotent HTTP requests maintain a keepalive HTTP connection. This should result in all HTTP requests from clients maintaining a persistent connection being routed to the same origin server. And that should eliminate the race condition that led to this bug.
There is an outside chance that flipping keepalive!non_idempotent will have non-desirable consequences (e.g. connections getting in a bad state). If that occurs, I think we should change the load balancer so it treats connections as TCP streams instead of HTTP connections. My experience with load balancers over the years is that attempting to speak OSI Layer 7 protocols (like HTTP) only results in hard-to-debug wonkiness (like this bug). Unless you need the network appliance to do nice things at OSI Layer 7 (like HTTP caching and filtering), it's better to eliminate the surface area of bugs that speaking OSI Layer 7 protocols brings to the table.
I'm calling this bug closed. Please reopen if we still see issues. We /may/ see some latent failures on retriggers and on pushes on top of previous pushes where there were draft changesets. But all new pushes where the first parent of the first pushed changeset is public should be fine.
Setting a needinfo on fubar so he can read the comment and weigh in as appropriate. (I'm not sure if there is history behind why keepalive!non_idempotent wasn't enabled.)
Status: ASSIGNED → RESOLVED
Closed: 7 years ago → 7 years ago
Flags: needinfo?(klibby)
Resolution: --- → FIXED
Assignee
Comment 22 • 7 years ago
https://taskcluster-artifacts.net/ObGB2aqiTA-P3fGW9kcKTA/0/public/logs/live_backing.log appears to be a failure after the load balancer change where a full clone is involved. So this appears to not be fully fixed and apparently my assertions about the load balancer's keepalive settings are not grounded in reality :(
Can we force the load balancer to route all HTTP requests from a given connection to a single node? We can obviously do this by disabling OSI Layer 7 on the load balancer. Is there a way to enable server persistence at the connection level? I see a bunch of options for enabling session persistence via things like cookies. We don't want or need that complexity: we just want simple connection persistence.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 23 • 7 years ago
(In reply to Gregory Szorc [:gps] from comment #21)
>
> There is an outside chance that flipping keepalive!non_idempotent will have
> non-desirable consequences (e.g. connections getting in a bad state). If
> that occurs, I think we should change the load balancer so it treats
> connections as TCP streams instead of HTTP connections. My experience with
> load balancers over the years is that attempting to speak OSI Layer 7
> protocols (like HTTP) only results in hard-to-debug wonkiness (like this
> bug). Unless you need the network appliance to do nice things at OSI Layer 7
> (like HTTP caching and filtering), it's better to eliminate the surface area
> of bugs that speaking OSI Layer 7 protocols brings to the table.
>
> Setting a needinfo on fubar so he can read the comment and weigh in as
> appropriate. (I'm not sure if there is history behind why
> keepalive!non_idempotent wasn't enabled.)
It's likely that keepalive!non_idempotent was just left at the default when keepalive was enabled. Enabling should be fine, modulo your followup...
(In reply to Gregory Szorc [:gps] from comment #22)
>
> Can we force the load balancer to route all HTTP requests from a given
> connections to a single node? We can obviously do this by disabling OSI
> Layer 7 on the load balancer. Is there a way to enable server persistence at
> the connection level? I see a bunch of options for enabling session
> persistence via things like cookies. We don't want want or need that
> complexity: we just want simple connection persistence.
It offers straight IP-based session persistence (catalogs->persistence). I think we shouldn't jump too quickly into using the streaming model; ISTR that we lose some ability to use TrafficScript rules and other things, which would make dealing with issues more difficult, but it's been a while.
Flags: needinfo?(klibby)
Assignee
Comment 24 • 7 years ago
My concern with IP-based session persistence is that clients behind a proxy all come from the same IP and therefore all get routed to the same server. This can result in unbalanced traffic weighting: e.g. if you get a flood of requests from the same IP, you have only one server to process them instead of N.
Consumers like Taskcluster's Node.js-based pushlog polling client literally issue a few dozen requests effectively simultaneously from the same IP. I'm pretty confident that this client can overwhelm the number of worker slots on an individual server.
Assignee
Comment 25 • 7 years ago
Because it is a Friday and I don't want to be making significant operational changes and because this is an ongoing issue that is adversely impacting people, I reactivated hgweb11 & 12 in the load balancer and took out hgweb15 & 16. All servers now active in the load balancer are using identical hardware. The window where replicated data is inconsistent will now be much smaller and the chances of us encountering this issue should be significantly diminished.
Comment hidden (Intermittent Failures Robot)
Comment 27 • 7 years ago
It looks like the problem no longer exists. The last failure happened at 2018-06-01 14:22:46, and there were many more try pushes over the weekend.
Comment 28 • 7 years ago
(In reply to Gregory Szorc [:gps] from comment #25)
> Because it is a Friday and I don't want to be making significant operational
> changes and because this is an ongoing issue that is adversely impacting
> people, I reactivated hgweb11 & 12 in the load balancer and took out hgweb15
> & 16. All servers now active in the load balancer are using identical
> hardware. The window where replicated data is inconsistent will now be much
> smaller and the chances of us encountering this issue should be
> significantly diminished.
We need to be able to run on those old systems, so what do we need to do to make that happen (before the end of the all-hands, as the hardware has to move the week of the 18th)?
Flags: needinfo?(gps)
Assignee
Comment 29 • 7 years ago
I switched the load balancer to use IP-based session persistence. I removed hgweb11 and 12 and added hgweb15 and 16 back in.
Please be on the lookout for the failure in the bug summary again.
Flags: needinfo?(gps)
Assignee
Comment 30 • 7 years ago
For the record, I'm reasonably certain I uncovered a bug in the load balancer's handling of IP-based session persistence. Bug 1467553 has been filed.
Comment 31 • 7 years ago
It started again for my try build here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=72304f0e021277cd684dc8ee6514a83e48fb4fe5
Flags: needinfo?(gps)
Assignee
Comment 32 • 7 years ago
Also saw it recently in e.g. https://treeherder.mozilla.org/logviewer.html#?job_id=182384324&repo=try.
So, it appears things are still broken even when apparently using IP-based session persistence. Ugh.
I've disabled IP session persistence and have reverted back to hgweb11-14 in the pool. This was the known working state from over the weekend and the beginning of this week.
Flags: needinfo?(gps)
Assignee
Comment 33 • 7 years ago
We're unfortunately at the end of our ability to keep the physical servers backing hg.mozilla.org consistent, as hgweb11 and 12 need to be physically moved to another datacenter this week.
Earlier today, I removed hgweb11 and hgweb12 from the load balancer and reinserted hgweb15 and hgweb16. So, this intermittent failure will start reappearing again.
We've landed bug 1469351 to add a workaround for the issue on Try for Linux and macOS builds. If you have 64e9597a6e97 in your repository history, tasks encountering this issue on Try should abort and retry automatically.
https://github.com/mozilla-releng/OpenCloudConfig/pull/155 is open against OpenCloudConfig so the `hg robustcheckout` changes are deployed to *all* Windows builders in CI. However, because TaskCluster's workers don't have feature parity (bug 1469402), Windows tasks encountering this issue will abort and *not* automatically retry. But at least the error will occur sooner and the error will contain a message that more clearly identifies the problem as residing with VCS.
I'm optimistic that the status quo is "good enough" and we can ride out the remaining intermittent failures until after the hg.mozilla.org datacenter move. I believe that is scheduled to complete by July 22.
There are some potential solutions we can explore to further mitigate the problem. I believe all solutions to fix the problem require substantial time investment and/or have drawbacks, which is why we aren't actively pursuing them now. I am keen to improve the workarounds in CI, especially on Windows.
emorley: since apparently Windows tasks in TC can't automatically retry by specifying an exit code, is there something we could add to Treeherder to trigger automatic retries? Treeherder could perhaps look for the new text added in 64e9597a6e97?
Flags: needinfo?(emorley)
Comment 34 • 7 years ago
Hi! Other than user-initiated, client-side task actions, Treeherder doesn't do anything related to task management currently -- and since we intentionally don't have any TC credentials server-side, it's not something we could support. This sounds like something that needs to be handled in Taskcluster or some other automation.
Flags: needinfo?(emorley)
Assignee
Comment 35 • 7 years ago
Boo :( Thanks for the info though: now it looks like all the answers are in the realm of Taskcluster.
bstack: could you please read comment 33 and comment 34 and assess the accuracy of my statements regarding retrying on exit codes on non-docker-worker workers, and whether any easy alternatives are available to us in generic-worker land?
Flags: needinfo?(bstack)
Comment 36 • 7 years ago
Afaiu you are correct in thinking that workers do not offer retry-on-failure. I think classically we've relied on in-tree configuration to handle this sort of thing. It would probably be possible to build a service that retriggers tasks based on failure notifications from pulse, but I'm not sure that's the route we want to go down. I'm assuming we want to fix this outside of gecko's task configuration because we want to fix it for all branches?
Flags: needinfo?(bstack)
Assignee
Comment 37 • 7 years ago
Building a service to retrigger tasks based on failure notifications (which must be produced by the task itself or from something parsing its results/artifacts) feels a lot more complicated than having the TC platform retrigger tasks based on an exit code. Maybe there are cases where the task itself can not emit the proper exit code and we need such a higher-level service. But the things I have in mind can all be addressed by the "simple" solution [which is already implemented on docker-worker]. So I think bug 1469402 is the proper thing for TC to focus on.
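For context, the docker-worker feature referred to here is declared in the task payload. A hedged sketch of its rough shape follows (the image, command, and exit code are illustrative placeholders, not the values used in-tree):

```python
# Illustrative only: a docker-worker payload that asks the worker to retry the
# run when the task command exits with a designated exit code.
task_payload = {
    "image": "taskcluster/example-build:latest",  # placeholder image name
    "command": ["/bin/bash", "-c", "run-task && ./build.sh"],
    "maxRunTime": 7200,
    "onExitStatus": {
        # Exit codes the worker should treat as "retry this run"; 72 is a
        # made-up placeholder, not the code used by `hg robustcheckout`.
        "retry": [72],
    },
}
```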
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 40 • 7 years ago
(In reply to Gregory Szorc [:gps] from comment #37)
> Building a service to retrigger tasks based on failure notifications (which
> must be produced by the task itself or from something parsing its
> results/artifacts) feels a lot more complicated than having the TC platform
> retrigger tasks based on an exit code. Maybe there are cases where the task
> itself can not emit the proper exit code and we need such a higher-level
> service. But the things I have in mind can all be addressed by the "simple"
> solution [which is already implemented on docker-worker]. So I think bug
> 1469402 is the proper thing for TC to focus on.
Is there an advantage to doing this via the onExitStatus feature rather than in the in-tree code that performs the hg robustcheckout?
There should be less overhead in implementing an exponential backoff in the in-tree code than triggering new task runs.
Using the onExitStatus feature will require updating gecko's task generation to include this in the task payload, so if we are touching /taskcluster in gecko to fix this, it should be possible to update it in the code that calls hg robustcheckout, right?
Another option (not sure if it was discussed above) is that we only listen for hg changesets once the webhead syncs have all completed. This would be a change to the way we interrogate hg for changesets. Bug 1225243 seems related.
Flags: needinfo?(gps)
See Also: → 1225243
Comment 41 • 7 years ago
Not sure if hg already sends pulse events when changesets have sync'd across webheads, but if it did, bug 1225243 might be the way to go to integrate with taskcluster.
Assignee
Comment 42 • 7 years ago
We already have backoff and retry code in `hg robustcheckout`. It generally works.
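As a rough illustration of that kind of retry loop (a generic sketch with made-up parameters, not the actual robustcheckout code):

```python
import random
import subprocess
import time

def pull_with_backoff(repo_url, dest, attempts=5):
    """Retry `hg pull` with exponential backoff plus jitter."""
    for attempt in range(attempts):
        result = subprocess.run(["hg", "--cwd", dest, "pull", repo_url])
        if result.returncode == 0:
            return True
        delay = min(60, 2 ** attempt) + random.uniform(0, 1)
        print("pull failed (attempt %d of %d); sleeping %.1fs"
              % (attempt + 1, attempts, delay))
        time.sleep(delay)
    return False
```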
The main problem is that we end up soft-corrupting the local repo when we hit this failure. That's why we want to blow away the local caches and retry the task completely. It is very much a hack.
Anyway, I'm actively working on bug 1470606 to hopefully resolve the "mirrors are inconsistent for a period of time" issue once and for all. It should hopefully eliminate this entire class of failures. This is my #1 priority right now and I hope to get it deployed to production next week. Although with all the holidays abound, that may not happen.
Even though the inconsistency window will go away, Taskcluster should still change bug 1225243 to use Pulse or SNS for primary scheduling. Its pushlog polling of a few dozen repos every few seconds is a bit annoying. Keying off notifications from Pulse or SNS will result in lower latency for job scheduling and drastically reduce query volume towards hg.mozilla.org.
Flags: needinfo?(gps)
Comment 44 • 7 years ago
Bug 1464219 has some discussion of retrying decision tasks on certain kinds of failures.
Assignee
Comment 45 • 7 years ago
With bug 1470606 being resolved, changesets won't become visible on https://hg.mozilla.org/ until they have been replicated to all active servers, even if the underlying replication completes at different times. The inconsistency window between when one server in the cluster exposes a new push and when all other servers do is now <10ms; ~3ms is common. This should be shorter than the RTT + HTTP request processing time for nearly every client/request, which means that for all intents and purposes consumers of https://hg.mozilla.org/ should now see atomic repository state.
i.e. the underlying issue causing this bug has been fixed and we can close this bug.
We are tracking one remaining issue with the "automationrelevance" API. I'll file a new bug against 1470606 after lunch and try to debug it today. Since decision tasks retry after json-automationrelevance errors, I'm pretty confident the impact will be minimal. There's a chance we may see a small percentage of decision tasks fail if one gets unlucky and hits the failure N times. Retriggers should eventually work.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Assignee
Comment 46 • 7 years ago
I temporarily disabled the changes made in this bug because they were causing other issues (bug 1475656) and I perceived those other problems to be worse than the problems this was causing.
I'll track the lingering issues in bug 1475656. Let's keep this bug closed. Just know we may start seeing the artifact build failures in the bug summary start occurring again.
Comment hidden (Intermittent Failures Robot)