Open Bug 1581693 Opened 6 years ago Updated 3 years ago

Intermittent [ FAILED ] DatagramFragment13/TlsFragmentationAndRecoveryTest.DropFirstHalf/0, where GetParam() = true (1202 ms)

Categories

(NSS :: Libraries, defect, P3)

Tracking

(Not tracked)

REOPENED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Regression)

Details

(Keywords: regression)

Marcus, can you poke at why this test is now failing?

Assignee: nobody → marcus.apb
Severity: normal → critical
Priority: P5 → P1
Regressed by: 1579290

I took a brief look and couldn't reproduce it. One place to start would be to add -v to the invocation of ssl_gtest so that we get more diagnostic output from these test runs.
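
One way to act on that suggestion locally is to rerun only the failing case many times with verbose output. The sketch below just builds the command line. `--gtest_filter` and `--gtest_repeat` are standard googletest flags and `-v` is the flag named above, but the binary path, the repeat count, and `extra_args` (for any setup arguments the binary needs) are assumptions, not taken from the NSS CI configuration.

```python
import shlex

# Test name copied from the bug summary.
TEST_NAME = "DatagramFragment13/TlsFragmentationAndRecoveryTest.DropFirstHalf/0"


def build_gtest_command(binary, extra_args=(), test_filter=TEST_NAME, repeat=100):
    """Return an argv list that reruns one gtest case repeatedly with -v.

    `binary`, `extra_args` and `repeat` are hypothetical placeholders;
    adjust them to the local build.
    """
    cmd = [binary, *extra_args]
    cmd.append("-v")  # verbose output, as suggested in the comment above
    cmd.append(f"--gtest_filter={test_filter}")  # run only the failing case
    cmd.append(f"--gtest_repeat={repeat}")  # repeat to catch an intermittent failure
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command; "./ssl_gtest" is an assumed path.
    print(shlex.join(build_gtest_command("./ssl_gtest")))
```

The resulting list can be passed to `subprocess.run` unchanged if running it from Python is more convenient than pasting it into a shell.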

I couldn't reproduce it either, but I generated some debug information to analyse and understand what happened. I will try to post an update here soon.

I tried many ways to reproduce this problem:

  • Changing the timeouts;
  • Increasing the number of fragments;
  • Manipulating the handshake;
  • Increasing the load on my local machine.

The closest I got to reproducing this problem was manually breaking the handshake so that one agent never sends the last ACK.
At that point I was fairly sure that it was an infrastructure problem.
Analysing these failed tasks and many successful tasks running the same code, I found that the Mac instances looked heavily loaded during the failures, with long gtest completion times.

So I compared the performance of the tests on the Mac instances over an interval before and after Bug 1579290.
I couldn't find any relationship with the patch.
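
A comparison of that kind could be sketched as follows. This is a minimal illustration, not the analysis actually performed: it assumes the gtest durations have already been collected as `(run_date, duration_seconds)` pairs, and both `compare_durations` and the `cutoff` date (the day the suspected patch landed) are hypothetical names that the caller must supply.

```python
from datetime import date
from statistics import mean, median


def summarise(durations):
    """Return basic statistics for a list of durations, or None if it is empty."""
    if not durations:
        return None
    return {
        "count": len(durations),
        "mean": mean(durations),
        "median": median(durations),
        "max": max(durations),
    }


def compare_durations(runs, cutoff):
    """Split (run_date, duration_seconds) pairs at `cutoff` and summarise each side.

    Runs dated on or after `cutoff` count as "after" the patch.
    """
    before = [secs for day, secs in runs if day < cutoff]
    after = [secs for day, secs in runs if day >= cutoff]
    return {"before": summarise(before), "after": summarise(after)}
```

If the "before" and "after" summaries are similar while the failing runs sit far above both, that points at machine load rather than at the patch.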

My conclusion is that these failures were caused by an infrastructure problem related to high load on the Mac instances.
The problem now appears to have stopped. We can leave this bug open for a few days or weeks to confirm that it was temporary. I will keep following it.

Thanks

Looks stable so far. I will wait until next Tuesday before concluding.

One more report, 10 days after the first.
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-09-01&endday=2019-10-01&tree=all&bug=1581693

The second failure looks very similar to the first. I still believe the cause is external.
To be sure of that, some analysis of those specific instances is necessary.
I will keep following this.

Almost 20 days without the problem.
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-09-14&endday=2019-10-14&tree=all&bug=1581693

I am closing this bug as WORKSFORME because it couldn't be reproduced and no evidence was found pointing to a bug in the code.
It looks like a temporary infrastructure problem.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Assignee: marcus.apb → nobody
Status: RESOLVED → REOPENED
Priority: P1 → P3
Resolution: WORKSFORME → ---
See Also: → 1473245
Has Regression Range: --- → yes

In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.

Severity: critical → --
Severity: -- → S3