Closed Bug 1617552 Opened 1 year ago Closed 1 month ago

aws t-linux-large testers: increased rate of machines failing all assigned tasks with timeouts

Categories

(Taskcluster :: Workers, defect)


Tracking

(Not tracked)

RESOLVED INACTIVE

People

(Reporter: aryx, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

There is an increased number of aws t-linux-large machines which fail all tasks assigned to them due to timeouts (the machines are likely too slow, for reasons unknown at the moment). Often several machines are affected at the same time, e.g. 2020-02-22 00:00 UTC til 03:00 UTC, and the same time the day before.

List of some identified machines: https://sql.telemetry.mozilla.org/queries/65355/source#166269

Enter the machine name at https://sql.telemetry.mozilla.org/queries/64421/source?p_machine_name_64421=i-07ee49671dae1e1ce#164380 to get the list of tasks they ran.
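For scripting, the per-machine URL can be assembled like this (a small sketch; the query and parameter names are taken from the link above):

```python
# Build the Redash URL that lists the tasks a given machine ran.
BASE = "https://sql.telemetry.mozilla.org/queries/64421/source"

def tasks_url(machine_name):
    """Return the query URL with the machine name filled into the parameter."""
    return f"{BASE}?p_machine_name_64421={machine_name}#164380"
```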

The failures are often classified as bug 1414495 or bug 1411358. There are ~100 known test failures for the last week, with ~40 more in these bugs, plus an unknown number in Try pushes.

Can anybody investigate and reach out to AWS if necessary? Thank you.

Flags: needinfo?(dustin)

Looking at the query, it appears that most, if not all, of these are Ubuntu 18.04.

Bob: Thanks for the pointer. Edwin, can you check the first runs on failing machines to see if anything in the logs is unexpected?

Flags: needinfo?(egao)

Edwin is on PTO until next week. I did enable Ubuntu 18.04 browser-chrome tests late last week in bug 1613983.

Flags: needinfo?(dustin)

I looked at linux.*64 testfailed for 2020-02-14 to 2020-02-25. Assuming I did things correctly, Ubuntu 16.04 is actually less likely to suffer a Task timeout and more likely to suffer an application timeout than Ubuntu 18.04. It does not appear that the issue is specific to Ubuntu 18.04.

Flags: needinfo?(egao)

That makes it more likely that the machines are provided with degraded performance from the start (also because the issue is not distributed in proportion to testing load). coop, can you open a ticket with AWS to find out what's going on with t-linux-large (and ideally have them fix the issue or stop provisioning these machines)?

Flags: needinfo?(coop)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #5)

That makes it more likely that the machines are provided with degraded performance from the start (also because the issue is not distributed in proportion to testing load). coop, can you open a ticket with AWS to find out what's going on with t-linux-large (and ideally have them fix the issue or stop provisioning these machines)?

I will open a ticket with them, and will link it here once I do.

The underlying m5.large instance type seems too new for AWS to be deprecating it, but we have gone through this dance with AWS before.

Flags: needinfo?(coop)

(In reply to Chris Cooper [:coop] pronoun: he from comment #6)

I will open a ticket with them, and will link it here once I do.

I still plan to do this, however before I do, I felt it prudent to terminate the ~40 t-linux-large instances (across all regions) mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1600071#c4 to make sure they're not the cause of this.

If we're still seeing these failures on Thursday, I'll follow-up with AWS.

(In reply to Chris Cooper [:coop] pronoun: he from comment #7)

If we're still seeing these failures on Thursday, I'll follow-up with AWS.

I'm looking at the updated query results from comment #0 (https://sql.telemetry.mozilla.org/queries/65355/source#166269).

Aryx: is this better, worse, or the same in your estimation?

Flags: needinfo?(aryx.bugmail)

We are still seeing 20-40 failures/day from machines which failed at least 3 tasks on production branches or 3 tasks in a row. The maximum number of tasks a machine ran was ~13, and there were no huge gaps between the executions of the tasks.
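The "failed at least 3 tasks in a row" heuristic described above can be sketched as follows (the field names are illustrative, not taken from the actual Redash query):

```python
from collections import defaultdict

def flag_bad_machines(task_runs, threshold=3):
    """Flag machines with >= threshold consecutive failed tasks.

    task_runs: iterable of (machine_id, started_iso, succeeded) tuples,
    in any order; runs are sorted chronologically per machine.
    """
    by_machine = defaultdict(list)
    for machine, started, succeeded in task_runs:
        by_machine[machine].append((started, succeeded))

    bad = set()
    for machine, runs in by_machine.items():
        streak = 0
        for _, succeeded in sorted(runs):  # chronological order
            streak = 0 if succeeded else streak + 1
            if streak >= threshold:
                bad.add(machine)
                break
    return bad
```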

(In reply to Chris Cooper [:coop] pronoun: he from comment #7)

If we're still seeing these failures on Thursday, I'll follow-up with AWS.

I filed an AWS case today: https://console.aws.amazon.com/support/home?region=us-east-1#/case/?displayId=6855470431&language=en

I'll be on PTO next week, so I cc-ed :bstack on the AWS case (and on this bug) for follow-up.

Here's the initial response from AWS support:

Hello Coop,

My name is Luke and I'm from the AWS Linux Support team and will be looking into your case.

I understand you are seeing a lot of timeouts running your test jobs on m5.large instance types. You would like a review to see if there are any commonalities among the instances that you provided.

I focused my response on the m5.large instances (i-03bb3c305fe29770f, i-04dc5b3e0f42c3a6f, i-03d8d270819160d2c, i-0eb0734fd390d4dc0). On review of these instances I found no issue with the underlying hardware or networking (no dropped packets or packet loss). The console on these instances looked good, and the CloudWatch metrics look good too. However, I did see a few large spikes in Network In/Out traffic for these nodes. For example, NetworkOut[1] for i-03d8d270819160d2c shows spikes at the following times:

  • 2020-03-02T05:33:00.000Z
  • 2020-03-02T06:33:00.000Z
  • 2020-03-02T08:38:00.000Z

There are a couple of recommendations I would suggest from reviewing these instances. These updates will improve the stability of the operating system and help rule out issues:

  1. Kernel 4.4.0-1014-aws on Ubuntu 14 is a few revisions behind the latest one available. The latest Ubuntu 14 kernel is 4.4.0-1044.47, released on 2019-05-16. We also suggest migrating to a newer version of Ubuntu with a more up-to-date kernel.
  2. The ENA driver is version 1.3.0K; the current ENA driver is past version 2.

So to get to the bottom of what is happening, we will need some more information. I would like the following from 1 or 2 current instances that are facing the issue:

  1. Instance IDs

  2. Timestamp + timezone where you see the failures.

  3. I see that the m5 instance types also have instance store volumes. At the time of your issue, can you please run iostat every 1 second to see how the instance is working with the volumes that are attached to it:
    $ iostat -mydtxz 1

  4. Can you provide logging from the application and operating system side (e.g. the file /var/log/messages) at the time of the issue?

  5. Do you have any instances running your workload that do not require termination? If so, can you provide us with a couple of instance IDs and we can review those too.

Finally, I took a look at https://bugzilla.mozilla.org/show_bug.cgi?id=1617552 and there was a mention of this issue occurring on Ubuntu 18 images. From the console I see that the instances provided are using Ubuntu 14 images from AMI ami-03788cad4724efdbc.

In summary, this review is not conclusive as to the issue you are facing, so we would like some more information to assist with further troubleshooting. Please let me know if you have any further questions or concerns and we will be happy to help. Have a great day!

References:
[1] i-03d8d270819160d2c - NetworkOut:
https://console.aws.amazon.com/cloudwatch/home?region=us-west-2#metricsV2:graph=~%28region~%27us-west-2~metrics~%28~%28~%27AWS*2fEC2~%27NetworkOut~%27InstanceId~%27i-03d8d270819160d2c%29%29~period~300~stat~%27Sum~start~%272020-03-01T03*3a58*3a59Z~end~%272020-03-06T03*3a58*3a59Z%29

There's a lot in that previous comment, so let me summarize:

  1. We should move off of Ubuntu 14.04: Already planned (bug 1580575), but also non-trivial due to the need to establish new baselines for everything post-migration.
  2. Whether we stay on 14.04 or not, we should make sure our ENA driver is up-to-date.
  3. If we think this is also affecting instances running 18.04, we'll need to find some specific examples of 18.04 timing out for AWS to look at. All the recent examples I provided were running 14.04.
  4. Catching an instance in the act will help a lot. Maybe the sheriffs could be asked to respond immediately to timeouts that look like this and log the relevant data? The longer we wait for a single instance to accumulate a dodgy history, the greater the chance that instance gets recycled out from under us.
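For item 2, the kernel and ENA driver versions can be checked on a running instance with something like the following sketch (the `modinfo` output format is an assumption; the expected kernel string comes from the AWS response above):

```shell
# Show the running kernel, e.g. 4.4.0-1014-aws per the AWS response.
uname -r
# Show the loaded ENA driver version; degrade gracefully if the module
# is not present on the machine running this check.
modinfo ena 2>/dev/null | grep '^version' || echo "ena module not found"
```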

(In reply to Chris Cooper [:coop] pronoun: he from comment #13)

  4. Catching an instance in the act will help a lot. Maybe the sheriffs could be asked to respond immediately to timeouts that look like this and log the relevant data? The longer we wait for a single instance to accumulate a dodgy history, the greater the chance that instance gets recycled out from under us.

Yes, sheriffs have been notified to ping if they see a machine failing with bug 1580652 (task timeout) or bug 1414495 (application timed out). I will reach out to 'taskcluster' on Matrix once an affected running machine has been identified.

(In reply to Chris Cooper [:coop] pronoun: he from comment #13)

  3. If we think this is also affecting instances running 18.04, we'll need to find some specific examples of 18.04 timing out for AWS to look at. All the recent examples I provided were running 14.04.

There were a bunch of failures on 18.04 today, so I sent a list of 4 instance IDs to AWS. It would still be great to catch one of these in the act though.

:aryx linked some instance IDs in matrix today, and was even able to quarantine one that had just failed: i-0494bcc5f8c72bae

I've updated the AWS case with those new IDs, highlighting the quarantined one.
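Quarantining a worker as done above can be sketched with the Taskcluster queue API. This is a hedged sketch: the client call shape, root URL, provisioner ID, and worker group below are assumptions, not taken from this bug; only the instance ID comes from the comment above.

```python
# Build the quarantineUntil payload for queue.quarantineWorker.
from datetime import datetime, timedelta, timezone

def quarantine_payload(hours=24):
    """Quarantine a worker for the given number of hours from now."""
    until = datetime.now(timezone.utc) + timedelta(hours=hours)
    return {"quarantineUntil": until.strftime("%Y-%m-%dT%H:%M:%S.000Z")}

# Assumed usage with the taskcluster Python client (identifiers hypothetical):
# import taskcluster
# queue = taskcluster.Queue({"rootUrl": "https://firefox-ci-tc.services.mozilla.com"})
# queue.quarantineWorker("aws-provisioner-v1", "t-linux-large",
#                        "us-west-2", "i-0494bcc5f8c72bae",
#                        quarantine_payload(hours=24))
```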

We received a very detailed response from AWS over the weekend with a bunch of potential avenues for exploration. It's long, so I've added it as an attachment.

There are 30 total failures in the last 7 days.

debug: linux1804-64, linux1804-64-qr, 
opt: linux1804-64-asan, linux1804-64-shippable, linux1804-64-shippable-qr, windows7-32-shippable

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293385281&repo=autoland&lineNumber=1367

Flags: needinfo?(aryx.bugmail)

The failures stopped 2 weeks ago; the last known failed task started on Sun, Mar 29, 21:17:44 UTC.

3 bad machines yesterday.

Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → INACTIVE