Closed Bug 1296819 Opened 3 years ago Closed 2 years ago

Intermittent valgrind-test | Invalid write of size 4 at mozilla:: / _pt_root / start_thread / clone

Categories

(Core :: General, defect, P3)

defect

Tracking

()

VERIFIED FIXED
mozilla61
Tracking Status
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- wontfix
firefox-esr60 --- fixed
firefox60 --- wontfix
firefox61 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jseward)

References

Details

(Keywords: bulk-close-intermittents, intermittent-failure, Whiteboard: [ele:1b][stockwell fixed:product])

Attachments

(1 file, 1 obsolete file)

Attachment #8783427 - Attachment is obsolete: true
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Am I reading the stack right that this looks like an issue with the hang monitor?
Flags: needinfo?(jseward)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #5)
> Am I reading the stack right that this looks like an issue with the hang
> monitor?

Well, I don't know if it is an issue with the hang monitor per se.
The invalid write to 0x0 is this, at the end of "void RunWatchdog(void* arg)"
(the watchdog thread main function):

    // Shutdown is apparently dead. Crash the process.
    MOZ_CRASH("Shutdown too long, probably frozen, causing a crash.");

So the question that comes to mind is: why is this timing out now when it
didn't previously?
Flags: needinfo?(jseward)
Mass wontfix for bugs affecting firefox 52.
Hello, I'm asking your help with an experiment with making decisions on bugs. You've been needinfo'ed on this bug. I'd like you to take one action to help this bug make progress toward a decision. The things you can do include:

* If you know or have a good guess of which product and component this bug belongs to, change the product and component of the bug
* If you know of the right person to ask about this bug, redirect the needinfo to them
* If you cannot reproduce the bug, close it

All we need you to do is one thing that will help us make a decision on the bug or resolve it.

Thank you for your help with this. If you have questions, please contact emma@mozilla.com.
Flags: needinfo?(valentin.gosu)
Whiteboard: [ele:1b]
Flags: needinfo?(valentin.gosu) → needinfo?(hchang)
Not sure why I am ni'ed but I am trying to give feedback as far as I know:

1) There seems to be a wrong review commit (from me) pushed to this bug but it shouldn't
   be relevant to this bug.

2) I didn't find the stack but that reminds me a safebrowsing "long hang crash" which
   has been resolved for a while.
Flags: needinfo?(hchang)
Thanks for that. It also seemed to me that this hasn't been reproduced in a while, but I thought I would ask someone who looked at it previously. Seems I should have looked more carefully.

Closing as WORKSFORME
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME
Status: REOPENED → RESOLVED
Closed: 3 years ago2 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
This has become a frequent failure. First failure after a long break was on 2018-02-14.

Julian, can you take a look at this or redirect to an appropriate contact, please?
Flags: needinfo?(jseward)
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #21)
> Julian, can you take a look at this or redirect to an appropriate contact,
> please?

Looking.  At first glance this looks like a shutdown hang which 
results in a MOZ_CRASH, causing Valgrind to report an invalid
write at address zero.  I wonder if we need to increase some
timeout value for the running-on-valgrind case.
Flags: needinfo?(jseward)
There have been 63 failures in the last 7 days.
Almost all the failures occur on Linux x64 /opt.

Here is a relevant log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=166912132

[task 2018-03-09T03:14:39.963Z] 03:14:39     INFO -  14:09.27 --31681-- WARNING: Serious error when reading debug info
[task 2018-03-09T03:14:39.963Z] 03:14:39     INFO -  14:09.27 --31681-- When reading debug info from /dev/shm/org.chromium.GwrzjR (deleted):
[task 2018-03-09T03:14:39.963Z] 03:14:39     INFO -  14:09.27 --31681-- failed to stat64/stat this file
[task 2018-03-09T03:15:56.146Z] 03:15:56     INFO -  15:25.45 ==31681== Thread 63 Shutdow~minator:
[task 2018-03-09T03:15:56.146Z] 03:15:56     INFO -  15:25.45 TEST-UNEXPECTED-FAIL | valgrind-test | Invalid write of size 4 at mozilla:: / _pt_root / start_thread / clone
[task 2018-03-09T03:15:56.146Z] 03:15:56     INFO -  15:25.45 ==31681== Invalid write of size 4
[task 2018-03-09T03:15:56.146Z] 03:15:56     INFO -  15:25.45 ==31681==    at 0x114EC2CD: mozilla::(anonymous namespace)::RunWatchdog(void*) (nsTerminator.cpp:206)
[task 2018-03-09T03:15:56.146Z] 03:15:56     INFO -  15:25.45 ==31681==    by 0x642CB80: _pt_root (ptthread.c:201)
[task 2018-03-09T03:15:56.146Z] 03:15:56     INFO -  15:25.45 ==31681==    by 0x4E3AB4F: start_thread (pthread_create.c:304)
[task 2018-03-09T03:15:56.146Z] 03:15:56     INFO -  15:25.45 ==31681==    by 0x5D51FBC: clone (clone.S:112)
[task 2018-03-09T03:15:56.147Z] 03:15:56     INFO -  15:25.45 ==31681==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

:jseward any updates on this?
Flags: needinfo?(jseward)
(In reply to Tiberius Oros[:tiberius_oros] from comment #30)
> :jseward any updates on this?

I have been looking at this in detail this week (sorry, I should
have mentioned this earlier).  It's not easy to figure out the root
cause.  I currently have 2 bugs in progress: bug 1442603 and bug 1443499
which I believe should improve the situation.  I hope to have them
landed by early next week at the latest.
Flags: needinfo?(jseward)
(In reply to Julian Seward [:jseward] from comment #31)
> (In reply to Tiberius Oros[:tiberius_oros] from comment #30)
> > :jseward any updates on this?
> [..]
> I currently have 2 bugs in progress: bug 1442603 and bug 1443499

These have both now landed.  They reduce the worst-case BHR resource
usage, which I suspect may be a factor here.  Let's see if that
makes a difference.  If not, I will look into increasing the shutdown
hang timeouts when running on valgrind.
Whiteboard: [ele:1b][stockwell disable-recommended] → [ele:1b][stockwell fixed:product]
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → INCOMPLETE
This bug is back on mozilla-release and mozilla-esr60, failing on recent pushes 40% of all runs (?).

Log https://treeherder.mozilla.org/logviewer.html#?job_id=176808362&repo=mozilla-release

[task 2018-05-03T18:57:12.162Z] 18:57:12     INFO -  14:49.54 TEST-UNEXPECTED-FAIL | valgrind-test | Invalid write of size 4 at mozilla:: / _pt_root / start_thread / clone
[task 2018-05-03T18:57:12.162Z] 18:57:12     INFO -  14:49.54 ==30231== Invalid write of size 4
[task 2018-05-03T18:57:12.163Z] 18:57:12     INFO -  14:49.54 ==30231==    at 0x11111EED: mozilla::(anonymous namespace)::RunWatchdog(void*) (nsTerminator.cpp:206)
[task 2018-05-03T18:57:12.163Z] 18:57:12     INFO -  14:49.54 ==30231==    by 0x642B60C: _pt_root (ptthread.c:201)
[task 2018-05-03T18:57:12.163Z] 18:57:12     INFO -  14:49.54 ==30231==    by 0x4E3AB4F: start_thread (pthread_create.c:304)
[task 2018-05-03T18:57:12.163Z] 18:57:12     INFO -  14:49.54 ==30231==    by 0x5D51FBC: clone (clone.S:112)
[task 2018-05-03T18:57:12.163Z] 18:57:12     INFO -  14:49.54 ==30231==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
[task 2018-05-03T18:57:12.163Z] 18:57:12     INFO -  14:49.54 ==30231==
[task 2018-05-03T18:57:12.163Z] 18:57:12     INFO -  14:49.54 {
[task 2018-05-03T18:57:12.163Z] 18:57:12     INFO -  14:49.54    <insert_a_suppression_name_here>
[task 2018-05-03T18:57:12.163Z] 18:57:12     INFO -  14:49.54    Memcheck:Addr4
[task 2018-05-03T18:57:12.164Z] 18:57:12     INFO -  14:49.54    fun:_ZN7mozilla12_GLOBAL__N_111RunWatchdogEPv
[task 2018-05-03T18:57:12.164Z] 18:57:12     INFO -  14:49.54    fun:_pt_root
[task 2018-05-03T18:57:12.164Z] 18:57:12     INFO -  14:49.54    fun:start_thread
[task 2018-05-03T18:57:12.164Z] 18:57:12     INFO -  14:49.54    fun:clone
[task 2018-05-03T18:57:12.164Z] 18:57:12     INFO -  14:49.54 }

This sounds similar to comment 33. Julian, can you take a look at this, please?
Status: RESOLVED → VERIFIED
Flags: needinfo?(jseward)
Increases the shutdown timeout by a factor of 3 when built --enable-valgrind
*and* actually running on Valgrind.  Otherwise no effect.
Flags: needinfo?(jseward)
Attachment #8981448 - Flags: review?(amarchesini)
Comment on attachment 8981448 [details] [diff] [review]
bug1296819-increase-shutdown-timeout-1.diff

Review of attachment 8981448 [details] [diff] [review]:
-----------------------------------------------------------------

looks good to me. Thanks.
Attachment #8981448 - Flags: review?(amarchesini) → review+
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/17ad383212f4
Intermittent valgrind-test | Invalid write of size 4 at mozilla:: / _pt_root / start_thread / clone.  r=amarchesini@mozilla.com.
Comment on attachment 8981448 [details] [diff] [review]
bug1296819-increase-shutdown-timeout-1.diff

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:

 For automation runs on Valgrind, Valgrind slows down progress so much
 that the Shutdown Hang Reporter sometimes kills the process unnecessarily,
 leading the test to fail.  This patch increases the SHR timeouts
 only for builds which are configured --enable-valgind and which are actually
 running on Valgrind.

User impact if declined: 

 Zero, because it requires (1) config with --enable-valgind and (2) actually
 running on Valgrind.

Fix Landed on Version:

 62

Risk to taking this patch (and alternatives if risky): 

  Very low risk (no end-user effect)

String or UUID changes made by this patch: 

 None.

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #8981448 - Flags: approval-mozilla-esr60?
Comment on attachment 8981448 [details] [diff] [review]
bug1296819-increase-shutdown-timeout-1.diff

Per bug 1443499 comment 9, taking for ESR60.
Attachment #8981448 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
Assignee: nobody → jseward
Resolution: INCOMPLETE → FIXED
Target Milestone: --- → mozilla61
See Also: → 1469002
You need to log in before you can comment on or make changes to this bug.