Add telemetry reporting of late writes

RESOLVED FIXED in mozilla19

Status: RESOLVED FIXED (defect)
People

(Reporter: espindola, Assigned: espindola)

Tracking

mozilla19 / x86_64 Linux / Depends on 1 bug
Bug Flags: in-testsuite -
Firefox Tracking Flags: (Not tracked)


Attachments

(1 attachment, 11 obsolete attachments)
Final patch: 7.45 KB, review+ from justin.lebar+bug
We have support for poisoning writes on OS X. We currently use it only in debug builds, where it causes a crash. To have it on nightly builds too, we should be able to do a telemetry report instead of crashing.
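A minimal sketch of that behavior split (hypothetical names and a counter standing in for a real telemetry report; the actual patches in this bug use ValidWriteAssert and the Telemetry API):

```cpp
#include <cassert>

// Hypothetical sketch of the change this bug asks for: keep crashing
// (asserting) on a late write in debug builds, but on nightly builds
// record the event for telemetry instead and let the write proceed.
static int gLateWritesRecorded = 0;

// Returns true if the caller should forward the write, false if the
// caller should abort (debug-build behavior).
bool CheckWrite(bool ok, bool isDebugBuild) {
    if (ok)
        return true;
    if (isDebugBuild)
        return false;          // debug: crash on the bad write, as today
    ++gLateWritesRecorded;     // release: stand-in for a telemetry report
    return true;               // forward the write instead of crashing
}
```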
Posted patch Early patch (obsolete) — Splinter Review
Hi Nathan,
I am sending this for an early review since there are some design decisions I would like comments on. In particular:

* What do you think of the idea of naming these files with checksums? I only thought of this after implementing the current patch with tmpnam, but I really like it, as it also avoids the possibility of cluttering the profile directory too much. Maybe we also need to store object offsets to avoid getting different results on each run because of address space layout randomization.

* Should we filter to only include the DSOs that are used? I guess so, especially if we go with the idea of recording offsets.

* Do you see any big advantage to using a single file?
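For illustration, naming the dump file after a checksum of its contents could look like the following sketch (hypothetical helpers; FNV-1a is used here only as a stand-in for the MD5 discussed in this bug):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Stand-in content hash (FNV-1a); the bug discusses using MD5 instead.
uint64_t Fnv1a(const std::string &data) {
    uint64_t h = 1469598103934665603ULL;   // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;             // FNV prime
    }
    return h;
}

// Hypothetical: derive a stable file name from the dump contents so that
// identical stacks from different runs collapse into a single file.
std::string ChecksumFileName(const std::string &profileDir,
                             const std::string &dumpContents) {
    char hex[17];
    snprintf(hex, sizeof(hex), "%016llx",
             (unsigned long long) Fnv1a(dumpContents));
    return profileDir + "/Telemetry.LateWriteFinal-" + hex;
}
```

Two runs producing byte-identical stack dumps then write to the same name, so duplicates deduplicate themselves without any extra bookkeeping.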
Assignee: nobody → respindola
Status: NEW → ASSIGNED
Attachment #646596 - Flags: review?(nfroyd)
Comment on attachment 646596 [details] [diff] [review]
Early patch

Review of attachment 646596 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Rafael Ávila de Espíndola (:espindola) from comment #1)
> * What do you think of the idea of naming these files with checksums? I only
> thought of this after implementing the current patch with tmpnam, but I
> really like it as it also avoid the possibility of cluttering the profile
> directory too much. Maybe we need to also store object offsets to avoid
> getting different results each run because of address space randomization.

If we can uniquify the stacks on the client side cheaply like this, I think we should do it.  I do wonder how easy it will be to get an easy-to-use md5-esque checksum routine (NSS need not apply) here, though.

If we're going to be storing a "lot" of data, maybe we should store these files in a subdirectory of the profile directory.

> * Should we filter to only include the DSOs that are used? I guess so,
> specially if we go with the idea of recording offsets.

You mean only listing the DSOs that appear in the stack trace, correct?  The only scenario I can imagine listing all DSOs would be useful is a binary addon triggering late writes through async SQL (or similar); I don't think addresses from the library would show up in the trace, but knowing the library was there would be useful for data analysis.  Whether that's a plausible scenario, I don't know.

I'd say just listing DSOs in the stack trace would be fine for now.

> * Do you see any big advantage to using a single file?

I do not understand the question.  A single file for all late write stack traces versus a file per late write stack trace, or something else?

::: xpcom/build/mozPoisonWriteMac.cpp
@@ +157,5 @@
>  
> +
> +static void RecordStackWalker(void *aPC, void *aSP, void *aClosure) {
> +    FILE *f = static_cast<FILE*>(aClosure);
> +    fprintf(f, "%lx\n", reinterpret_cast<uintptr_t>(aPC));

Any reason to not just use "%p\n" here?
Attachment #646596 - Flags: review?(nfroyd) → feedback+
> If we're going to be storing a "lot" of data, maybe we should store these
> files in a subdirectory of the profile directory.

If we do the merging I don't expect it to be a lot. We should only move the poison/exit point when our own tests are unable to find anything.


> I do not understand the question.  A single file for all late write stack
> traces versus a file per late write stack trace, or something else?
>

Yes, a single file for all stacks.

> Any reason to not just use "%p\n" here?

Just a tiny bit easier to parse back.

I will try to compute the offsets on the client side, uniquify the stacks and upload a new patch.

Thanks.
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #3)
> > I do not understand the question.  A single file for all late write stack
> > traces versus a file per late write stack trace, or something else?
> 
> Yes, a single file for all stacks.

I am ambivalent about this.
Vladan, you are the one responsible for the HangMonitor, right?

On the reporting of late writes I would like to uniquify the stacks on the client, which means turning the absolute PC values of the stack into offsets into the libraries.

I would also like to reuse as much of the ChromeHang reporting as possible. Ideally, the only difference between the two reports should be a bit saying if it is a hang or a late write.

So the question is, would it be ok to process the ChromeHang stacks locally too?  We already filter the libraries to keep only the used ones, so computing the offsets should be trivial.
Posted patch Don't use abort (obsolete) — Splinter Review
This patch is getting really big and noisy, so I started splitting it up.

This first part replaces the uses of abort(). Those are OK for debug builds, but since we don't want to crash release builds we have to forward all write calls.
Attachment #646596 - Attachment is obsolete: true
Attachment #648362 - Flags: review?
Attachment #648362 - Flags: review? → review?(nfroyd)
Whiteboard: [leave open]
Comment on attachment 648362 [details] [diff] [review]
Don't use abort

Review of attachment 648362 [details] [diff] [review]:
-----------------------------------------------------------------

This patch would have been easier to review if it hadn't involved a complete rewrite of how the functions were defined. :)  Then again, maybe shuffling things around was a necessary evil.

Regardless, I think this patch is mostly code motion, so I feel comfortable r+'ing even as a non-XPCOM peer.  You may want to get someone more qualified for future reviews, though.

::: xpcom/build/mozPoisonWriteMac.cpp
@@ +43,5 @@
> +
> +typedef ssize_t (*pwrite_t)(int fd, const void *buf, size_t nbyte, off_t offset);
> +template<FuncData &foo>
> +ssize_t wrap_pwrite_temp(int fd, const void *buf, size_t nbyte, off_t offset) {
> +    MOZ_ASSERT(0); // We have not seen it before, so just assert/report it.

Move this comment before the declaration, above, and make it parallel with the aio_write comment.

@@ +48,5 @@
> +    pwrite_t old_write = (pwrite_t) foo.Buffer;
> +    return old_write(fd, buf, nbyte, offset);
> +}
> +
> +// Define a FuncData for a pwrite like functions.

Nit: "for a pwrite-like function"

@@ +76,5 @@
> +    writev_t old_write = (writev_t) foo.Buffer;
> +    return old_write(fd, iov, iovcnt);
> +}
> +
> +// Define a FuncData for a writev like functions.

Nit: "for a writev-like function"
Attachment #648362 - Flags: review?(nfroyd) → review+
> This patch would have been easier to review if it hadn't involved a complete
> rewrite of how the functions were defined. :)  Then again, maybe shuffling
> things around was a necessary evil.

Yes, sorry about that. This is part of the reason why I split the patch.

The code was using just abort() for writes we had never seen, but since we don't want to crash nightlies, I had to add support for finishing up every write.
https://hg.mozilla.org/integration/mozilla-inbound/rev/c994ae546da0
Posted patch Record the last write (obsolete) — Splinter Review
This patch enables write poisoning in release builds, processes stacks and library information, and records them to disk.

With this patch only the information on the last write survives. In a follow-up patch I will change the name of the final file to use MD5, but that is better done in an independent review.

This patch also duplicates code from the chrome hang reporting. We already decided to move chrome hang to computing offsets locally too, so in a follow-up patch I will move the class that processes the stack and change chrome hang to use it.
Attachment #648446 - Flags: review?(nfroyd)
Posted patch Record the last write (obsolete) — Splinter Review
Sorry, there was a missing s/MOZ_ASSERT/ValidWriteAssert/ in the last patch.
Attachment #648446 - Attachment is obsolete: true
Attachment #648446 - Flags: review?(nfroyd)
Attachment #648476 - Flags: review?(nfroyd)
Comment on attachment 648476 [details] [diff] [review]
Record the last write

Review of attachment 648476 [details] [diff] [review]:
-----------------------------------------------------------------

I think the logic looks good here.  Since I'm not an XPCOM peer, I'm flagging Justin for review on this one.  (I guess Justin's not an official XPCOM peer either, but he's closer than I am!)

::: xpcom/build/mozPoisonWriteMac.cpp
@@ +83,5 @@
> +
> +        size_t moduleIndex = 0;
> +        size_t stackIndex = 0;
> +        bool unreferencedModule = true;
> +        while (stackIndex < mStack.size() && moduleIndex < mModules.GetSize()) {

I think this loop would be slightly clearer written like so:

  while (moduleIndex < mModules.GetSize()) {
    // Optional check for efficiency's sake.
    if (stackIndex == mStack.size()) {
      break;
    }

    uintptr_t moduleStart, moduleEnd = ...;
    bool moduleReferenced = false;
    while (stackIndex < mStack.size()) {
      uintptr_t pc = mStack[stackIndex].mPC;
      if (moduleStart <= pc && pc <= moduleEnd) {
        moduleReferenced = true;
        ...;
      } else if (pc > moduleEnd) {
        break;
      } else {
        // PC does not belong to any module.
        ...
      }
    }

    if (!moduleReferenced) {
      // No PCs in the address range of this module.
      mModules.RemoveEntries(...);
    } else {
      // Found all PCs in this module, move to the next one.
      moduleIndex++;
    }
  }

I think it is slightly less efficient, but the control flow seems more obvious.  WDYT?

@@ +117,5 @@
> +        if (moduleIndex + 1 < mModules.GetSize()) {
> +            mModules.RemoveEntries(moduleIndex + 1, mModules.GetSize());
> +        }
> +
> +        std::sort(mStack.begin(), mStack.end(), CompareByIndex);

Is it provable that we've assigned indices to every stack entry at this point?  We should -1-ize the relevant entries if not and maybe even if so. ;)

@@ +163,5 @@
> +    const char *tempSuffix = "/Telemetry.LateWriteTmpXXXXXX";
> +    name.insert(name.end(), tempSuffix, tempSuffix + strlen(tempSuffix) + 1);
> +    int fd = mkstemp(&name[0]);
> +    MozillaRegisterDebugFD(fd);
> +    FILE *f = fdopen(fd, "w+");

Nit: could get away with just "w" here.

@@ +172,5 @@
> +        const char *name = stack.GetModuleName(i);
> +        fprintf(f, "%s\n", name ? name : "");
> +    }
> +
> +    for (int i = 0, n = stack.GetStackSize(); i < n; ++i) {

I think it would not be a bad idea to dump n to the file here.  I realize that you don't have the same issues for stack frames that you do with modules--can't have empty entries for stack frames, whereas you can for modules--but data formats that are somewhat self-describing are a good thing.

@@ +407,5 @@
>      if (OldCleanup)
>          OldCleanup();
> +    if (ProfileDirectory)
> +        PL_strfree(ProfileDirectory);
> +    ProfileDirectory = NULL;

nullptr?  Or NULL for consistency with the rest of the file?

@@ +443,5 @@
> +      if (NS_SUCCEEDED(rv)) {
> +          ProfileDirectory = PL_strdup(nativePath.get());
> +      }
> +  }
> +

Nit: This entire hunk looks badly indented relative to surrounding code.
Attachment #648476 - Flags: review?(nfroyd)
Attachment #648476 - Flags: review?(justin.lebar+bug)
Attachment #648476 - Flags: review+
> I think the logic looks good here.  Since I'm not an XPCOM peer, I'm
> flagging Justin for review on this one.  (I guess Justin's not an official
> XPCOM peer either, but he's closer than I am!)

OK, thanks.


> I think this loop would be slightly clearer written like so:
> 
>   while (moduleIndex < mModules.GetSize()) {
>     // Optional check for efficiency's sake.
>     if (stackIndex == mStack.size()) {
>       break;
>     }
> 
>     uintptr_t moduleStart, moduleEnd = ...;
>     bool moduleReferenced = false;
>     while (stackIndex < mStack.size()) {
>       uintptr_t pc = mStack[stackIndex].mPC;
>       if (moduleStart <= pc && pc <= moduleEnd) {
>         moduleReferenced = true;
>         ...;
>       } else if (pc > moduleEnd) {
>         break;
>       } else {
>         // PC does not belong to any module.
>         ...
>       }
>     }
> 
>     if (!moduleReferenced) {
>       // No PCs in the address range of this module.
>       mModules.RemoveEntries(...);
>     } else {
>       // Found all PCs in this module, move to the next one.
>       moduleIndex++;
>     }
>   }
> 
> I think it is slightly less efficient, but the control flow seems more
> obvious.  WDYT?

We would then need a "tail loop" for the stack, but yes, it looks better. BTW, I found a bug: the range of a module is [moduleStart, moduleEnd), not [moduleStart, moduleEnd]. I am testing a patch with this change and will upload it in a minute.


> > +        std::sort(mStack.begin(), mStack.end(), CompareByIndex);
> 
> Is it provable that we've assigned indices to every stack entry at this
> point?  We should -1-ize the relevant entries if not and maybe even if so. ;)

It is: the struct is private, and it is created only in

    void AddStackFrame(uintptr_t aPC) {
        MOZ_ASSERT(!mProcessed);
        StackFrame Frame = {aPC, static_cast<uint16_t>(mStack.size()), 0 };
        mStack.push_back(Frame);
    }

> @@ +163,5 @@
> > +    const char *tempSuffix = "/Telemetry.LateWriteTmpXXXXXX";
> > +    name.insert(name.end(), tempSuffix, tempSuffix + strlen(tempSuffix) + 1);
> > +    int fd = mkstemp(&name[0]);
> > +    MozillaRegisterDebugFD(fd);
> > +    FILE *f = fdopen(fd, "w+");
> 
> Nit: could get away with just "w" here.

With this patch we can, but once we add MD5 the computation would have to be interleaved with the file writing, something like snprintf + md5update + fwrite. Plain "w" works for this patch, and I now actually think that is better, so I will switch to "w".

> I think it would not be a bad idea to dump n to the file here.  I realize
> that you don't have the same issues for stack frames that you do with
> modules--can't have empty entries for stack frames, whereas you can for
> modules--but data formats that are somewhat self-describing are a good thing.

Good point, it does make it easier to parse. Will do it.
 
> > +    ProfileDirectory = NULL;
> 
> nullptr?  Or NULL for consistency with the rest of the file?

nullptr for sure. Thanks!

> @@ +443,5 @@
> > +      if (NS_SUCCEEDED(rv)) {
> > +          ProfileDirectory = PL_strdup(nativePath.get());
> > +      }
> > +  }
> > +
> 
> Nit: This entire hunk looks badly indented relative to surrounding code.

Fixed. Thanks.
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #14)
> > I think this loop would be slightly clearer written like so:
> > ...
> > I think it is slightly less efficient, but the control flow seems more
> > obvious.  WDYT?
> 
> We would then need a "tail loop" for the stack, but yes, it looks better.
> BTW, I found a bug, the range of a module is [moduleStart, moduleEnd), not
> [moduleStart, moduleEnd]. I am testing a patch with this change and will
> upload in a minute.
> 
> 
> > > +        std::sort(mStack.begin(), mStack.end(), CompareByIndex);
> > 
> > Is it provable that we've assigned indices to every stack entry at this
> > point?  We should -1-ize the relevant entries if not and maybe even if so. ;)
> 
> It is, the struct is private and it is created only in
> 
>     void AddStackFrame(uintptr_t aPC) {
>         MOZ_ASSERT(!mProcessed);
>         StackFrame Frame = {aPC, static_cast<uint16_t>(mStack.size()), 0 };
>         mStack.push_back(Frame);
>     }

Sorry, I should have been clearer here.  We give everything a modIndex of 0 here, but we also assign -1 to all entries without an associated module in the loop.  So it's possible to wind up with frames that weren't explicitly associated with a module, but are implicitly associated with a module, having a modIndex of 0, even after we've processed all the frames.  (I think this is what you were referring to when you cited the need for a cleanup loop in the rewritten loop I suggested.)

Perhaps we should just let modIndex be -1 in the constructor and then not bother setting modIndex and mPC to -1 if the frame isn't associated with a module?  That way we don't lose information about the PC, but it's obvious that all frames will either get associated with a module (non-negative modIndex) or not be associated with a module (-1 modIndex).  Hopefully we don't ever load MAX_UINT16T modules, either!
> Sorry, I should have been clearer here.  We give everything a modIndex of 0
> here, but we also assign -1 to all entries without an associated module in
> the loop.  So it's possible to wind up with frames that weren't explicitly
> associated with a module, but are implicitly associated with a module,
> having a modIndex of 0, even after we've processed all the frames.  (I think
> this is what you were referring to when you cited the need for a cleanup
> loop in the rewritten loop I suggested.)
> 
> Perhaps we should just let modIndex be -1 in the constructor and then not
> bother setting modIndex and mPC to -1 if the frame isn't associated with a
> module?  That way we don't lose information about the PC, but it's obvious
> that all frames will either get associated with a module (non-negative
> modIndex) or not be associated with a module (-1 modIndex).  Hopefully we
> don't ever load MAX_UINT16T modules, either!

Oh, I see. We still need the code to set mPC to -1, but yes, it is better. Thanks.
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #16)
> Oh, I see. We still need the code to set mPC to -1, but yes, it is better.

Why is this?  Can't we just let a -1/0xffff modIndex seen while processing the file indicate that the frame should be tossed out (or information not looked up for it, or whatever)?  Or does the chromehang stuff already use a -1 PC to mean something special, and we're just following their lead?
Posted patch update patch with the review (obsolete) — Splinter Review
Attachment #648476 - Attachment is obsolete: true
Attachment #648476 - Flags: review?(justin.lebar+bug)
Attachment #648747 - Flags: review?(justin.lebar+bug)
Attachment #648747 - Flags: feedback?(nfroyd)
> Why is this?  Can't we just let a -1/0xffff modIndex seen while processing
> the file indicate that the frame should be tossed out (or information not
> looked up for it, or whatever)?  Or does the chromehang stuff already use a
> -1 PC to mean something special, and we're just following their lead?

It is for having stable addresses. These are probably from the JIT, so we don't want to get two different MD5s just because the JIT buffer was allocated at a different position in two runs.
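The normalization being described can be sketched like this (a hypothetical standalone helper; in the patch this logic lives inside the stack-processing loop):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Sketch: make a PC stable across runs. A PC inside a module becomes a
// module-relative offset, which is immune to ASLR; a PC outside every
// module (likely JIT-generated code) collapses to a fixed sentinel so
// the same logical stack hashes identically on every run. Module ranges
// are half-open, [modStart, modEnd), per comment 14.
uintptr_t StablePC(uintptr_t pc, uintptr_t modStart, uintptr_t modEnd) {
    if (pc >= modStart && pc < modEnd)
        return pc - modStart;
    return std::numeric_limits<uintptr_t>::max();
}
```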
(In reply to Rafael Ávila de Espíndola (:espindola) from comment #19)
> > Why is this?  Can't we just let a -1/0xffff modIndex seen while processing
> > the file indicate that the frame should be tossed out (or information not
> > looked up for it, or whatever)?  Or does the chromehang stuff already use a
> > -1 PC to mean something special, and we're just following their lead?
> 
> It is for having stable addresses. Theses are probably from the JIT, so we
> don't want to have two different md5's just because the jit buffer was
> allocated in a different position in two runs.

Ah yes, that's a good point.
Attachment #648747 - Flags: feedback?(nfroyd) → feedback+
Comment on attachment 648747 [details] [diff] [review]
update patch with the review

Review of attachment 648747 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/build/mozPoisonWriteMac.cpp
@@ +104,5 @@
> +                    mStack[stackIndex].mPC -= moduleStart;
> +                    mStack[stackIndex].mModIndex = moduleIndex;
> +                } else
> +                    // PC does not belong to any module
> +                    mStack[stackIndex].mPC = -1;

It might be worth noting why we do this, rather than just leaving it alone, given our earlier discussion.  Maybe also use std::numeric_limits here too, as you do in AddStackFrame?
Posted patch second review round (obsolete) — Splinter Review
Attachment #648747 - Attachment is obsolete: true
Attachment #648747 - Flags: review?(justin.lebar+bug)
Attachment #648768 - Flags: review?(justin.lebar+bug)
Attachment #648768 - Flags: feedback?(nfroyd)
Comment on attachment 648768 [details] [diff] [review]
second review round

Review of attachment 648768 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/build/mozPoisonWriteMac.cpp
@@ +105,5 @@
> +                    mStack[stackIndex].mModIndex = moduleIndex;
> +                } else
> +                    // PC does not belong to any module. It is probably from the JIT. Use a fixed
> +                    // mPC so that we don't get different stacks on different runs.
> +                    mStack[stackIndex].mPC = std::numeric_limits<uint16_t>::max();;

Watch the line length on your comments (can't measure in splinter, but looks rather long).  Remove the double semicolon.

Also you probably want uintptr_t instead of uint16_t here (even though it doesn't *really* matter)?

@@ +117,5 @@
> +        }
> +
> +        for (;stackIndex < stackSize; ++stackIndex)
> +            // These PCs are past the last module.
> +            mStack[stackIndex].mPC = std::numeric_limits<uint16_t>::max();

uintptr_t instead of uint16_t here too.
Attachment #648768 - Flags: feedback?(nfroyd) → feedback+
Posted patch third review round (obsolete) — Splinter Review
Thanks for catching the wrong type!
Attachment #648768 - Attachment is obsolete: true
Attachment #648768 - Flags: review?(justin.lebar+bug)
Attachment #648777 - Flags: review?(justin.lebar+bug)
Attachment #648777 - Flags: feedback?(nfroyd)
Where is telemetry involved here?
(In reply to Justin Lebar [:jlebar] from comment #25)
> Where is telemetry involved here?

It will be in a follow-up patch. Since we are detecting late writes, it is too late at this point to use telemetry to send the information. What we do is write the data to disk if telemetry is enabled; on the next run telemetry will read the data and report it.
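That write-now, report-next-run flow relies on a simple on-disk format (per the patch: a module count, module names, a frame count, then modIndex/offset pairs). A self-contained round-trip sketch, with hypothetical simplified types:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical simplified frame: index into the module list + offset.
struct Frame { int modIndex; uintptr_t offset; };

// Write the dump: counts first, so the reader on the next run knows how
// much to expect (the "self-describing format" point from the review).
void WriteDump(FILE *f, const std::vector<std::string> &modules,
               const std::vector<Frame> &frames) {
    fprintf(f, "%zu\n", modules.size());
    for (const std::string &m : modules)
        fprintf(f, "%s\n", m.c_str());
    fprintf(f, "%zu\n", frames.size());
    for (const Frame &fr : frames)
        fprintf(f, "%d %zx\n", fr.modIndex, (size_t) fr.offset);
}

static bool GetLine(FILE *f, std::string &out) {
    char buf[1024];
    if (!fgets(buf, sizeof(buf), f)) return false;
    out = buf;
    if (!out.empty() && out.back() == '\n') out.pop_back();
    return true;
}

// Parse the dump back, as the telemetry code would on the next startup.
bool ReadDump(FILE *f, std::vector<std::string> &modules,
              std::vector<Frame> &frames) {
    std::string line;
    if (!GetLine(f, line)) return false;
    size_t numModules = strtoul(line.c_str(), nullptr, 10);
    for (size_t i = 0; i < numModules; ++i) {
        if (!GetLine(f, line)) return false;
        modules.push_back(line);
    }
    if (!GetLine(f, line)) return false;
    size_t numFrames = strtoul(line.c_str(), nullptr, 10);
    for (size_t i = 0; i < numFrames; ++i) {
        if (!GetLine(f, line)) return false;
        int idx = 0; unsigned long long off = 0;
        if (sscanf(line.c_str(), "%d %llx", &idx, &off) != 2) return false;
        frames.push_back(Frame{idx, (uintptr_t) off});
    }
    return true;
}
```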
Please post patches with 8 lines of context.  git-bz will do this for you, if you're posting from git.

https://github.com/jlebar/git-bz-moz

This whole file does not follow one of Gecko's Holy Conventions, to start class and function braces on a new line.  I don't really care except for the sake of consistency; this is one rule that is pretty well followed.

>@@ -31,12 +38,183 @@ struct FuncData {
>+                if (pc >= moduleStart) {
>+                    // If the current PC is within the current module, mark
>+                    // module as used
>+                    moduleReferenced = true;
>+                    mStack[stackIndex].mPC -= moduleStart;
>+                    mStack[stackIndex].mModIndex = moduleIndex;
>+                } else
>+                    // PC does not belong to any module. It is probably from
>+                    // the JIT. Use a fixed mPC so that we don't get different
>+                    // stacks on different runs.
>+                    mStack[stackIndex].mPC =
>+                        std::numeric_limits<uintptr_t>::max();

Even though this is one statement, please put it inside braces; it's quite long.  (A less-followed but still common Gecko convention is to always brace if statements.)

>+            }
>+
>+            if (moduleReferenced)
>+                ++moduleIndex;
>+            else
>+                // Remove module if no PCs within its address range
>+                mModules.RemoveEntries(moduleIndex, moduleIndex + 1);

Nit: I'd prefer braces here, too, and, to match, on the if.

>+        }
>+
>+        for (;stackIndex < stackSize; ++stackIndex)
>+            // These PCs are past the last module.
>+            mStack[stackIndex].mPC = std::numeric_limits<uintptr_t>::max();

And braces here.

>+char *ProfileDirectory = NULL;
> +
>+bool ValidWriteAssert(bool ok) {
>+    // On a debug build, just crash.
>+    MOZ_ASSERT(ok);
>+
>+    if (ok || !ProfileDirectory || !Telemetry::CanRecord())
>+        return ok;
>+
>+    // Write the stack and loaded libraries to a file. We can get here
>+    // concurrently from many writes, so we use multiple temporary files.
>+    ProcessedStack stack;
>+    NS_StackWalk(RecordStackWalker, 0, reinterpret_cast<void*>(&stack), 0);
>+    stack.Process();
>+
>+    std::vector<char> name(ProfileDirectory,
>+                           ProfileDirectory + strlen(ProfileDirectory));
>+    const char *tempSuffix = "/Telemetry.LateWriteTmpXXXXXX";
>+    name.insert(name.end(), tempSuffix, tempSuffix + strlen(tempSuffix) + 1);
>+    int fd = mkstemp(&name[0]);

Can you just do
  
  nsPrintfCString tempfile("%s%s", ProfileDirectory, "/Telemetry.LateWriteTmpXXXXXX");
  int fd = mkstemp(tempfile.get());

?

>+    MozillaRegisterDebugFD(fd);
>+    FILE *f = fdopen(fd, "w");
>+
>+    size_t numModules = stack.GetNumModules();
>+    fprintf(f, "%zd\n", numModules);

%zu?  size_t is unsigned...

>+    for (int i = 0; i < numModules; ++i) {
>+        const char *name = stack.GetModuleName(i);
>+        fprintf(f, "%s\n", name ? name : "");
>+    }
>+
>+    size_t numFrames = stack.GetStackSize();
>+    fprintf(f, "%zd\n", numFrames);

%zu?

>+    for (int i = 0; i < numFrames; ++i) {

Comparison between signed and unsigned int?  Make this size_t instead?

>+        const ProcessedStack::ProcessedStackFrame &frame = stack.GetFrame(i);
>+        // FIXME: while the offset we write is correct, on OS X the atos tool
>+        // expects a value with the virtual address added. For example,
>+        // the firefox binary has
>+        //      cmd LC_SEGMENT_64
>+        //      cmdsize 632
>+        //      segname __TEXT
>+        //      vmaddr 0x0000000100000000
>+        // so to print the line matching the offset 123 one has to run
>+        // atos -o firefox 0x100000123. Should we try to include the
>+        // virtual adress in this dump? Probably not, as it is trivial to get it
>+        // on the server.

Let's resolve this one way or another before we check this in.  I'd prefer just to put the virtual address in the dump; it's one line, and means you can process the file without looking something up somewhere.

>+        fprintf(f, "%d %lx\n", frame.mModIndex, frame.mOffset);

%lx (long) isn't necessarily the same as uintptr_t (although of course it works fine on mac).  Does mac support %jx?
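One portable option (a sketch; the patch may settle on a different format) is the <cinttypes> macros, which expand to the correct printf conversion for uintptr_t on every platform:

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <cstring>

// PRIxPTR expands to the right printf length modifier for uintptr_t,
// avoiding the %lx-vs-uintptr_t mismatch flagged in the review.
// FormatFrame is a hypothetical helper, not part of the patch.
int FormatFrame(char *buf, size_t len, int modIndex, uintptr_t offset) {
    return snprintf(buf, len, "%d %" PRIxPTR "\n", modIndex, offset);
}
```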

>+    fclose(f);
>+    MozillaUnRegisterDebugFD(fd);

Unregister needs to come before fclose, otherwise it's a race condition on the fd.

>+    // FIXME: For now we just record the last write. We should write the files to
>+    // filenames that include the md5. That will provide a simple early deduplication if
>+    // the same bug is found in multiple runs.

I'd prefer if this was fixed in the version you land, because without this
change, we can see many fewer late-write bugs at a time, limiting our choices
of which bugs to fix.

>+    std::string finalName = ProfileDirectory;
>+    finalName += "/Telemetry.LateWriteFinal-last";

You could also use nsPrintfCString here.

>@@ -184,38 +362,47 @@ void AbortOnBadWrite(int fd, const void *wbuf, size_t count) {
> 
>     struct stat buf;
>     int rv = fstat(fd, &buf);
>-    MOZ_ASSERT(rv == 0);
>+    if (!ValidWriteAssert(rv == 0))
>+        return;
> 
>     // FIFOs are used for thread communication during shutdown.
>     if ((buf.st_mode & S_IFMT) == S_IFIFO)
>         return;
> 
>-    MyAutoLock lockedScope;
>+    {
>+        MyAutoLock lockedScope;
> 
>-    // Debugging FDs are OK
>-    std::vector<int> &Vec = getDebugFDs();
>-    if (std::find(Vec.begin(), Vec.end(), fd) != Vec.end())
>-        return;
>+        // Debugging FDs are OK
>+        std::vector<int> &Vec = getDebugFDs();
>+        if (std::find(Vec.begin(), Vec.end(), fd) != Vec.end())
>+            return;
>+    }
> 
>     // For writev we pass NULL in wbuf. We should only get here from
>     // dbm, and it uses write, so assert that we have wbuf.
>-    MOZ_ASSERT(wbuf);
>+    if (!ValidWriteAssert(wbuf))
>+        return;
> 
>     // As a really bad hack, accept writes that don't change the on disk
>     // content. This is needed because dbm doesn't keep track of dirty bits
>     // and can end up writing the same data to disk twice. Once when the
>     // user (nss) asks it to sync and once when closing the database.
>     void *wbuf2 = malloc(count);
>-    MOZ_ASSERT(wbuf2);
>+    if (!ValidWriteAssert(wbuf2))
>+        return;
>     off_t pos = lseek(fd, 0, SEEK_CUR);
>-    MOZ_ASSERT(pos != -1);
>+    if (!ValidWriteAssert(pos != -1))
>+        return;
>     ssize_t r = read(fd, wbuf2, count);
>-    MOZ_ASSERT(r == count);
>+    if (!ValidWriteAssert(r == count))
>+        return;
>     int cmp = memcmp(wbuf, wbuf2, count);
>-    MOZ_ASSERT(cmp == 0);
>+    if (!ValidWriteAssert(cmp == 0))
>+        return;

These early returns leak wbuf2.
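One way to fix the leak (a sketch, not necessarily the patch's actual fix) is to tie the buffer to scope, so every ValidWriteAssert early return frees it automatically:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <memory>

// Sketch: the malloc'd scratch buffer is owned by a unique_ptr with
// free() as its deleter, so any early return releases it. BuffersMatch
// is a hypothetical stand-in for the comparison in AbortOnBadWrite;
// onDisk stands in for the data that read(fd, ...) would produce.
bool BuffersMatch(const void *wbuf, const void *onDisk, size_t count) {
    std::unique_ptr<void, void (*)(void *)> wbuf2(malloc(count), free);
    if (!wbuf2)
        return false;                    // early return: no leak
    memcpy(wbuf2.get(), onDisk, count);  // stands in for read(fd, ...)
    return memcmp(wbuf, wbuf2.get(), count) == 0;
}
```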

>     free(wbuf2);
>     off_t pos2 = lseek(fd, pos, SEEK_SET);
>-    MOZ_ASSERT(pos2 == pos);
>+    if (!ValidWriteAssert(pos2 == pos))
>+        return;

>@@ -226,6 +413,9 @@ extern "C" void (*__cleanup)();
> void FinalCleanup() {
>     if (OldCleanup)
>         OldCleanup();
>+    if (ProfileDirectory)
>+        PL_strfree(ProfileDirectory);
>+    ProfileDirectory = nullptr;
>     delete &getDebugFDs();
>     PR_DestroyLock(MyAutoLock::getDebugFDsLock());
> }
>@@ -250,12 +440,18 @@ extern "C" {
> 
> namespace mozilla {
> void PoisonWrite() {
>-    // For now only poison writes in debug builds.
>-#ifndef DEBUG
>-    return;
>-#endif

This enables write poisoning on all builds with telemetry enabled.  That's not
what you want, right?
Comment on attachment 648777 [details] [diff] [review]
third review round

r- for some unresolved things in the patch that I'd like resolved.  We can address them in a follow-up patch in this bug, if you'd prefer.
Attachment #648777 - Flags: review?(justin.lebar+bug) → review-
(In reply to Justin Lebar [:jlebar] from comment #27)
> Please post patches with 8 lines of context.  git-bz will do this for you,
> if you're posting from git.
> 
> https://github.com/jlebar/git-bz-moz

Interesting. It looks like it doesn't quite match my workflow, since I keep a branch per bug and just merge (or amend) when I am ready to transfer it to hg. I will try to remember to use -U8.
 
> This whole file does not follow one of Gecko's Holy Conventions, to start
> class and function braces on a new line.  I don't really care except for the
> sake of consistency; this is one rule that is pretty well followed.

I will update the new code to follow this and the other conventions you noted. 

> Can you just do
>   
>   nsPrintfCString tempfile("%s%s", ProfileDirectory,
> "/Telemetry.LateWriteTmpXXXXXX");
>   int fd = mkstemp(tempfile.get());
> 

Maybe. I have been avoiding nsPrintfCString since bug 743650, but it looks like it no longer silently truncates; is that correct?

> 
> >+    MozillaRegisterDebugFD(fd);
> >+    FILE *f = fdopen(fd, "w");
> >+
> >+    size_t numModules = stack.GetNumModules();
> >+    fprintf(f, "%zd\n", numModules);
> 
> %zu?  size_t is unsigned...
> 
> >+    for (int i = 0; i < numModules; ++i) {
> >+        const char *name = stack.GetModuleName(i);
> >+        fprintf(f, "%s\n", name ? name : "");
> >+    }
> >+
> >+    size_t numFrames = stack.GetStackSize();
> >+    fprintf(f, "%zd\n", numFrames);
> 
> %zu?
> 
> >+    for (int i = 0; i < numFrames; ++i) {
> 
> Comparison between signed and unsigned int?  Make this size_t instead?
> 
> >+        const ProcessedStack::ProcessedStackFrame &frame = stack.GetFrame(i);
> >+        // FIXME: while the offset we write is correct, on OS X the atos tool
> >+        // expects a value with the virtual address added. For example,
> >+        // the firefox binary has
> >+        //      cmd LC_SEGMENT_64
> >+        //      cmdsize 632
> >+        //      segname __TEXT
> >+        //      vmaddr 0x0000000100000000
> >+        // so to print the line matching the offset 123 one has to run
> >+        // atos -o firefox 0x100000123. Should we try to include the
> >+        // virtual address in this dump? Probably not, as it is trivial to get it
> >+        // on the server.
> 
> Let's resolve this one way or another before we check this in.  I'd prefer
> just to put the virtual address in the dump; it's one line, and means you
> can process the file without looking something up somewhere.

Do you know an easy way to get it from within firefox? I think we currently only record the load address, the size, and the offset in the file. Note that the only file with a non-zero value is the firefox binary. All the libraries use 0.

> >+        fprintf(f, "%d %lx\n", frame.mModIndex, frame.mOffset);
> 
> %lx (long) isn't necessarily the same as uintptr_t (although of course it
> works fine on mac).  Does mac support %jx?
> 
> >+    fclose(f);
> >+    MozillaUnRegisterDebugFD(fd);
> 
> Unregister needs to come before fclose, otherwise it's a race condition on
> the fd.

fclose calls write. Looks like we need a MozillaUnRegisterDebugFDUnlocked or check if fsync avoids the write. I will give it a try.

> >+    // FIXME: For now we just record the last write. We should write the files to
> >+    // filenames that include the md5. That will provide a simple early deduplication if
> >+    // the same bug is found in multiple runs.
> 
> I'd prefer if this was fixed in the version you land, because without this
> change, we can see many fewer late-write bugs at a time, limiting our choices
> of which bugs to fix.

We are not even reading them back right now. The reason for wanting to commit this separately is that it gives us some coverage while we discuss where to get the md5 from.

> >+    std::string finalName = ProfileDirectory;
> >+    finalName += "/Telemetry.LateWriteFinal-last";
> 
> You could also use nsPrintfCString here.
> 
> >@@ -184,38 +362,47 @@ void AbortOnBadWrite(int fd, const void *wbuf, size_t count) {
> > 
> >     struct stat buf;
> >     int rv = fstat(fd, &buf);
> >-    MOZ_ASSERT(rv == 0);
> >+    if (!ValidWriteAssert(rv == 0))
> >+        return;
> > 
> >     // FIFOs are used for thread communication during shutdown.
> >     if ((buf.st_mode & S_IFMT) == S_IFIFO)
> >         return;
> > 
> >-    MyAutoLock lockedScope;
> >+    {
> >+        MyAutoLock lockedScope;
> > 
> >-    // Debugging FDs are OK
> >-    std::vector<int> &Vec = getDebugFDs();
> >-    if (std::find(Vec.begin(), Vec.end(), fd) != Vec.end())
> >-        return;
> >+        // Debugging FDs are OK
> >+        std::vector<int> &Vec = getDebugFDs();
> >+        if (std::find(Vec.begin(), Vec.end(), fd) != Vec.end())
> >+            return;
> >+    }
> > 
> >     // For writev we pass NULL in wbuf. We should only get here from
> >     // dbm, and it uses write, so assert that we have wbuf.
> >-    MOZ_ASSERT(wbuf);
> >+    if (!ValidWriteAssert(wbuf))
> >+        return;
> > 
> >     // As a really bad hack, accept writes that don't change the on disk
> >     // content. This is needed because dbm doesn't keep track of dirty bits
> >     // and can end up writing the same data to disk twice. Once when the
> >     // user (nss) asks it to sync and once when closing the database.
> >     void *wbuf2 = malloc(count);
> >-    MOZ_ASSERT(wbuf2);
> >+    if (!ValidWriteAssert(wbuf2))
> >+        return;
> >     off_t pos = lseek(fd, 0, SEEK_CUR);
> >-    MOZ_ASSERT(pos != -1);
> >+    if (!ValidWriteAssert(pos != -1))
> >+        return;
> >     ssize_t r = read(fd, wbuf2, count);
> >-    MOZ_ASSERT(r == count);
> >+    if (!ValidWriteAssert(r == count))
> >+        return;
> >     int cmp = memcmp(wbuf, wbuf2, count);
> >-    MOZ_ASSERT(cmp == 0);
> >+    if (!ValidWriteAssert(cmp == 0))
> >+        return;
> 
> These early returns leak wbuf2.
> 
> >     free(wbuf2);
> >     off_t pos2 = lseek(fd, pos, SEEK_SET);
> >-    MOZ_ASSERT(pos2 == pos);
> >+    if (!ValidWriteAssert(pos2 == pos))
> >+        return;
> 
> >@@ -226,6 +413,9 @@ extern "C" void (*__cleanup)();
> > void FinalCleanup() {
> >     if (OldCleanup)
> >         OldCleanup();
> >+    if (ProfileDirectory)
> >+        PL_strfree(ProfileDirectory);
> >+    ProfileDirectory = nullptr;
> >     delete &getDebugFDs();
> >     PR_DestroyLock(MyAutoLock::getDebugFDsLock());
> > }
> >@@ -250,12 +440,18 @@ extern "C" {
> > 
> > namespace mozilla {
> > void PoisonWrite() {
> >-    // For now only poison writes in debug builds.
> >-#ifndef DEBUG
> >-    return;
> >-#endif
> 
> This enables write poisoning on all builds with telemetry enabled.  That's
> not
> what you want, right?

It enables write poisoning on *all* builds, which is what we want for now. Eventually, releases will call _exit(0) at the point where nightly and debug builds poison writes.

I will upload a new patch in a minute.
> Interesting. Looks like it doesn't quite match my workflow since I keep a
> branch per bug and so just merge (or amend) when I am ready to transfer it
> to hg. I will try to remember to use -U8.

What is the problem, exactly?  Maybe I can fix it for you.

I also have a branch per bug in my workflow, although I'm not sure what you're merging/amending to.

> Maybe. I have been avoiding nsPrintfCString since bug 743650, but it looks
> like it doesn't silently truncate anymore, is that correct?

Correct.

> > >+        const ProcessedStack::ProcessedStackFrame &frame = stack.GetFrame(i);
> > >+        // FIXME: while the offset we write is correct, on OS X the atos tool
> > >+        // expects a value with the virtual address added. For example,
> > >+        // the firefox binary has
> > >+        //      cmd LC_SEGMENT_64
> > >+        //      cmdsize 632
> > >+        //      segname __TEXT
> > >+        //      vmaddr 0x0000000100000000
> > >+        // so to print the line matching the offset 123 one has to run
> > >+        // atos -o firefox 0x100000123. Should we try to include the
> > >+        // virtual address in this dump? Probably not, as it is trivial to get it
> > >+        // on the server.
> > 
> > Let's resolve this one way or another before we check this in.  I'd prefer
> > just to put the virtual address in the dump; it's one line, and means you
> > can process the file without looking something up somewhere.
> 
> Do you know an easy way to get it from within firefox? I think we currently
> only record the load address, the size, and the offset in the file. Note that
> the only file with a non-zero value is the firefox binary. All the libraries use 0.

No; maybe glandium can help?  If you can't get the virtual address, then that's a simple resolution to the problem.  :)

> > >+    // FIXME: For now we just record the last write. We should write the files to
> > >+    // filenames that include the md5. That will provide a simple early deduplication if
> > >+    // the same bug is found in multiple runs.
> > 
> > I'd prefer if this was fixed in the version you land, because without this
> > change, we can see many fewer late-write bugs at a time, limiting our choices
> > of which bugs to fix.
> 
> We are not even reading them back right now. The reason for wanting to
> commit this separately is that it gives us some coverage while we discuss
> where to get the md5 from.

Okay.

> > This enables write poisoning on all builds with telemetry enabled.  That's
> > not
> > what you want, right?
> 
> It enables write poisoning on *all* builds, which is what we want for now.

Except AbortOnBadWrite will do nothing if telemetry is disabled, correct?

In what way is the behavior of Firefox different with this patch on a machine with telemetry disabled?
> What is the problem, exactly?  Maybe I can fix it for you.

Thanks!

If I read the manual correctly, it is supposed to work by uploading one git commit to bugzilla, right? The way I work is by having a pr777122 branch where I commit small changes. For example, changing the open braces is a commit. To send a patch to bugzilla I then do git diff -U8 master pr777122.
 
> > Do you know an easy way to get it from within firefox? I think we currently
> > only record the load address the size and the offset in the file. Note that
> > the only file without 0 value is the firefox binary. All the libraries use 0.
> 
> No; maybe glandium can help?  If you can't get the virtual address, then
> that's a simple resolution to the problem.  :)

I guess the FIXME is not very clear. This number is a property of the file, not the process. To find the line, what the server has to do is:

* read an address from the dump
* run "otool -l file (XUL, firefox, etc)" and find the vmaddr line of the __TEXT segment.
* add the two and run "atos -o file (XUL, firefox, etc) 0xsum_in_hex"

> > 
> > It enables write poisoning on *all* builds, which is what we want for now.
> 
> Except AbortOnBadWrite will do nothing if telemetry is disabled, correct?
> 
> In what way is the behavior of Firefox different with this patch on a
> machine with telemetry disabled?

Hopefully it is not changed; the patch just gives the write poisoning more testing.
Posted patch fourth review round (obsolete) — Splinter Review
Things that are still missing:

* The FIXME about what to do with mach-o load commands with a non-zero vmaddr.
* The use of vector<char> was not replaced with nsPrintfCString since its get method returns a "const char *".
Attachment #648777 - Attachment is obsolete: true
Attachment #648777 - Flags: feedback?(nfroyd)
Attachment #648825 - Flags: review?(justin.lebar+bug)
> > >+    fclose(f);
> > >+    MozillaUnRegisterDebugFD(fd);
> > 
> > Unregister needs to come before fclose, otherwise it's a race condition on
> > the fd.
> 
> fclose calls write. Looks like we need a MozillaUnRegisterDebugFDUnlocked or
> check if fsync avoids the write. I will give it a try.

Looks like fflush works. I opened bug 780272 to track the existing cases in the codebase. Thanks!
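The resulting ordering can be sketched as below; MozillaUnRegisterDebugFD here is a hypothetical no-op stand-in for the real bookkeeping, kept only to show where the call belongs:

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// No-op stand-in for the real debug-FD bookkeeping discussed in this bug.
static void MozillaUnRegisterDebugFD(int) {}

// fflush performs the last write while the fd is still registered; by the
// time the fd is unregistered, fclose has nothing left to write.
static void CloseDebugFile(FILE* aFile, int aFd) {
    fflush(aFile);                  // final buffered write happens here
    MozillaUnRegisterDebugFD(aFd);  // safe: no more writes on aFd
    fclose(aFile);                  // closes without writing
}
```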
> If I read the manual correctly, it is supposed to work by uploading one git commit to bugzilla, 
> right? The way I work is by having a pr777122 branch where I commit small changes. For example, 
> changing the open braces is a commit.

Ah, I see.  I've run into this problem before too and worked around it as you have.

I've updated git-bz-moz with something that seems to work.  If your commit message starts with "FOLD", we'll fold that commit into the previous one when attaching to bugzilla.

Can you try it and tell me if it works for you?  It's the "attach-fold" branch in git-bz-moz, which is a subrepo of git-bz.
> * The use of vector<char> was not replaced with nsPrintfCString since its get method returns a 
> "const char *".

Ah, right.  Try GetMutableData().

> To find the line, what the server has to do is
>
> * read an address from the dump
> * run "otool -l file (XUL, firefox, etc)" and find the vmaddr line of the __TEXT segment.
> * add the two and run "atos -o file (XUL, firefox, etc) 0xsum_in_hex"

I see.  I guess since you need to have the file/lib in order to do atos, there
isn't much to be gained by having the vmaddr ahead of time.  Could you
un-todo-ify the comment, if you don't plan on addressing it?

>+char *ProfileDirectory = NULL;

sProfileDirectory, please.  (It's in an anonymous namespace, so it's effectively static.)

>     void *wbuf2 = malloc(count);
>-    MOZ_ASSERT(wbuf2);
>+    if (!ValidWriteAssert(wbuf2))
>+        return;
>     off_t pos = lseek(fd, 0, SEEK_CUR);
>-    MOZ_ASSERT(pos != -1);
>+    if (!ValidWriteAssert(pos != -1))
>+        return;
>     ssize_t r = read(fd, wbuf2, count);
>-    MOZ_ASSERT(r == count);
>+    if (!ValidWriteAssert(r == count))
>+        return;
>     int cmp = memcmp(wbuf, wbuf2, count);
>-    MOZ_ASSERT(cmp == 0);
>+    bool ok = ValidWriteAssert(cmp == 0);
>     free(wbuf2);
>+    if (!ok)
>+        return;
>     off_t pos2 = lseek(fd, pos, SEEK_SET);
>-    MOZ_ASSERT(pos2 == pos);
>+    if (!ValidWriteAssert(pos2 == pos))
>+        return;

This still leaks wbuf2 on the early returns before the free.
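One leak-free shape for these early returns can be sketched with a scoped deleter; this is an illustration of the idea, not the landed fix, and ContentsUnchanged plus the pass-through ValidWriteAssert are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>

// Stand-in for the patch's ValidWriteAssert: the real one records a stack;
// this one just passes the condition through.
static bool ValidWriteAssert(bool aOk) { return aOk; }

static void FreeBuf(void* aPtr) { std::free(aPtr); }

// Read-back comparison with a scoped deleter, so every early return frees
// the scratch buffer automatically.  aDiskData stands in for the read() call.
static bool ContentsUnchanged(const void* aWbuf, const void* aDiskData, size_t aCount) {
    std::unique_ptr<void, void (*)(void*)> wbuf2(std::malloc(aCount), FreeBuf);
    if (!ValidWriteAssert(wbuf2 != nullptr))
        return false;  // wbuf2 freed on every path out of this scope
    std::memcpy(wbuf2.get(), aDiskData, aCount);
    return ValidWriteAssert(std::memcmp(aWbuf, wbuf2.get(), aCount) == 0);
}
```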

>+    nsCOMPtr<nsIFile> mozFile;
>+    NS_GetSpecialDirectory(NS_APP_USER_PROFILE_50_DIR, getter_AddRefs(mozFile));
>+    if (mozFile) {
>+        nsCAutoString nativePath;
>+        nsresult rv = mozFile->GetNativePath(nativePath);
>+        if (NS_SUCCEEDED(rv)) {
>+            ProfileDirectory = PL_strdup(nativePath.get());

You could do StaticAutoPtr<nsCString> sProfileDirectory; and then do
ClearOnShutdown(&sProfileDirectory) here, if you wanted to avoid C strings.

r=me, but please fix the leak!
Attachment #648825 - Flags: review?(justin.lebar+bug) → review+
This patch changes the final filename to include the md5. It has one very visible problem: it uses the md5 implementation from breakpad.

On OS X the only MD5 that I could find in XUL is the breakpad one. I see three ways we can fix this:

* Use it from its current location. This is really ugly, but it works.
* Copy the code to xpcom.
* Move the code (i.e. have a local patch to breakpad) to a common place that can be used by breakpad and by xpcom.
Attachment #648362 - Attachment is obsolete: true
Attachment #648825 - Attachment is obsolete: true
Attachment #649906 - Flags: review?(justin.lebar+bug)
Attachment #649906 - Flags: feedback?(nfroyd)
> It has one very visible problem: it uses the md5 implementation from breakpad.

What about nsICryptoHash?
(In reply to Justin Lebar [:jlebar] from comment #38)
> > It has one very visible problem: it uses the md5 implementation from breakpad.
> 
> What about nsICryptoHash?

That is only exported via xpcom, no? I don't think I can call do_CreateInstance during ValidWriteAssert, as that would be way too late. I would call it early when poisoning writes, but destructing it late would still cause warnings about "XPCOM objects created/destroyed from static ctor/dtor".
Comment on attachment 649906 [details] [diff] [review]
Name files with the md5 of the content

Review of attachment 649906 [details] [diff] [review]:
-----------------------------------------------------------------

Doesn't feel right f+'ing a patch that's...well, ugly, but I think the basic idea is sound.

What about using Adler32 or CRC{32,64}?  Collisions are somewhat more likely (though I'd still expect them to be unusual) and I think it's better than making XPCOM depend on breakpad (!).  And I think nsICryptoHash is unusable for the reasons you cite as well.
Attachment #649906 - Flags: feedback?(nfroyd)
> What about using Adler32 or CRC{32,64}?  Collisions are somewhat more likely
> (though I'd still expect them to be unusual) and I think it's better than
> making XPCOM depend on breakpad (!).  And I think nsICryptoHash is unusable
> for the reasons you cite as well.

I think my preferences would be in order
* Move the md5 implementation to some directory that both xpcom and breakpad can depend on.
* Copy the md5 implementation
* Use a weaker hash
* This patch.

The only reason I implemented the last one is that it is the simplest and allows review of the rest of the code while we decide what to do.
> What about using Adler32 or CRC{32,64}?  Collisions are somewhat more likely (though I'd still 
> expect them to be unusual)

I would not be in favor of using a non-collision-resistant hash where we want collision resistance, personally.  I'd rather use breakpad's MD5, or just import a SHA1 implementation from the interwebs.  In fact, you could stick it in MFBT and make us all very happy...
OK, I will try to find a bsd/mit sha1 implementation and add it to mfbt.
Tip from IRC:

<glandium> jlebar, espindola: if you're going to add sha1 to mfbt, you should probably use Linus's from git, iirc, it's the fastest, and (iirc, bis) it has a liberal license
Actually, it looks like it is GPL. I will try to move security/nss/lib/freebl/sha_fast.c to mfbt. Can nss depend on mfbt? Should we have a local patch for it?
> I will try to move security/nss/lib/freebl/sha_fast.c to mfbt. Can nss depend on mfbt? Should we 
> have a local patch for it?

No, and I don't think so.

You could copy it from NSS, but I don't think you should move it.
(I understand that changing NSS in any way is basically impossible, because the code has been certified.)

(In reply to comment #48)
> (I understand that changing NSS in any way is basically impossible, because the
> code has been certified.)

It's not impossible but it's very hard.  If you can avoid having to change NSS, do that!  :-)
I will just copy it and switch it to C++ then. It is just sad that we have three copies of MD5 (search for fd987193) and will now have two copies of SHA-1 (search for c3d2e1f0).
Comment on attachment 649906 [details] [diff] [review]
Name files with the md5 of the content

Cancelling this review, since we're waiting on the SHA-1 business.
Attachment #649906 - Flags: review?(justin.lebar+bug)
Posted patch New version using sha1 (obsolete) — Splinter Review
Attachment #649906 - Attachment is obsolete: true
Attachment #670110 - Flags: review?(justin.lebar+bug)
Posted patch correct version (obsolete) — Splinter Review
Sorry, had attached a version with a really small buffer I was using for testing.
Attachment #670110 - Attachment is obsolete: true
Attachment #670110 - Flags: review?(justin.lebar+bug)
Attachment #670112 - Flags: review?(justin.lebar+bug)
8 lines of context, please.

I did modify git-bz so that if a commit message begins with "FOLD", it will be
folded into the commit above before exporting to bugzilla.  :)

https://github.com/jlebar/moz-git-tools/

>diff --git a/xpcom/build/mozPoisonWriteMac.cpp b/xpcom/build/mozPoisonWriteMac.cpp
>--- a/xpcom/build/mozPoisonWriteMac.cpp
>+++ b/xpcom/build/mozPoisonWriteMac.cpp
>@@ -39,6 +40,32 @@ struct FuncData {
>                            // 'Function' after it has been replaced.
> };
> 
>+class SHA1Stream

Please add a comment above briefly explaining what this class does.

>+{
>+public:
>+    SHA1Stream(FILE *aFile) : mFile(aFile) { }
>+    void printf(const char *aFormat, ...)

Printf(), please.  It's ugly, but at least it's consistent.

>+    {
>+        va_list list;
>+        va_start(list, aFormat);

>+        char buffer[32];
>+        nsFixedCString str(buffer, sizeof(buffer));

I think this is the same as nsAutoCString (only less idiomatic); can you use
nsAutoCString instead?  If not, please add a comment explaining why you can't.

>+        str.AppendPrintf(aFormat, list);
>+        va_end(list);
>+        const char* cstr = str.get();
>+        unsigned len = strlen(cstr);
>+        mSHA1.update(reinterpret_cast<const uint8_t*>(cstr), len);
>+        fwrite(cstr, 1, len, mFile);

I'd prefer

  mSHA1.update(reinterpret_cast<const uint8_t*>(str.get()), str.Length());
  fwrite(str.get(), 1, str.Length(), mFile);

Save you a strlen, too.

>+    }
>+    void Finish(uint8_t hash[20])
>+    {
>+        mSHA1.finish(hash);
>+    }

Can you add MOZ_ASSERTs that we never call printf() after Finish()?  You could
even null out mFile, I'd guess.

>+private:
>+    FILE *mFile;
>+    SHA1Sum mSHA1;
>+};

>@@ -46,6 +73,13 @@ void RecordStackWalker(void *aPC, void *aSP, void *aClosure)
>     int fd = mkstemp(name);
>     MozillaRegisterDebugFD(fd);
>     FILE *f = fdopen(fd, "w");
>+    SHA1Stream sha1Stream(f);

Can you please put a comment here reminding us not to write to |f| directly?

>     size_t numModules = stack.GetNumModules();
>-    fprintf(f, "%zu\n", numModules);
>+    sha1Stream.printf("%u\n", (unsigned)numModules);

Sucks that NSPR doesn't support %zu.  :(

>+    // Note: These files should be deleted by telemetry once it reads them. If
>+    // there were no telemetry runs by the time we shutdown, we just add files
>+    // to the existing ones instead of replacing them. Given that each of these
>+    // files is a bug to be fixed, that is probably the right thing to do.

s/shutdown/shut down/

>+
>+    // We append the sha1 of the contents to the file name. This provides a simple
>+    // client side deduplication.
>+    nsPrintfCString finalName("%s%s", sProfileDirectory, "/Telemetry.LateWriteFinal-");
>+    for (int i = 0; i < 20; ++i) {
>+        finalName.Append(HexChar((sha1[i] & 0xf0) >> 4));
>+        finalName.Append(HexChar(sha1[i] & 0x0f));
>+    }

Why can't you use AppendPrintf("%x") here?
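One likely answer: a bare "%x" drops leading zeros (0x0a prints as "a"), so fixed-width output needs either the HexChar nibble loop or a per-byte "%02x". A minimal sketch, with a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a 20-byte SHA-1 digest as 40 lowercase hex characters.  "%02x"
// keeps each byte at exactly two characters, which a bare "%x" would not.
static std::string DigestToHex(const uint8_t aDigest[20]) {
    std::string out;
    char byteBuf[3];
    for (int i = 0; i < 20; ++i) {
        std::snprintf(byteBuf, sizeof(byteBuf), "%02x", aDigest[i]);
        out += byteBuf;
    }
    return out;
}
```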
Comment on attachment 670112 [details] [diff] [review]
correct version

Mind if I take another look after you fix up this patch?
Attachment #670112 - Flags: review?(justin.lebar+bug)
Posted patch New patchSplinter Review
Sorry for the delay; fixing the NSS miscompilation was a lot more work than I was expecting :-(
Attachment #670112 - Attachment is obsolete: true
Attachment #673957 - Flags: review?(justin.lebar+bug)
Interestingly, I'm using SHA1Sum in bug 803684, and I had to write the same loop as you to convert the digest to a string.

It sounds like we should incorporate that into the SHA1Sum code somehow.
(In reply to Justin Lebar [:jlebar] from comment #57)
> Interestingly, I'm using SHA1Sum in bug 803684, and I had to write the same
> loop as you to convert the digest to a string.
> 
> It sounds like we should incorporate that into the SHA1Sum code somehow.

Do you want me to do it in this patch or should it be another bug?
> Do you want me to do it in this patch or should it be another bug?

Another bug would be good, so we can get Waldo's eyes on it.  It doesn't have to block your landing this; you've waited long enough here.
...actually, bug 804326 may be the right place; perhaps we should start there.
Oh, sorry; I did a review of this yesterday, but I must have forgotten to post it (or my Internet connection died, or something).  Let me see if I can dig it up, if not, I'll re-review right now.
Comment on attachment 673957 [details] [diff] [review]
New patch

>+// This is a wrapper over a file descriptor that provides a Printf method and
>+// computes the sha1 of the data that passes through it.
>+class SHA1Stream
>+{
>+public:
>+    SHA1Stream(int aFd) {
>+        MozillaRegisterDebugFD(aFd);
>+        mFile = fdopen(aFd, "w");
>+    }
>+    void Printf(const char *aFormat, ...)
>+    {
>+        MOZ_ASSERT(mFile);
>+        va_list list;
>+        va_start(list, aFormat);
>+        nsAutoCString str;
>+        str.AppendPrintf(aFormat, list);
>+        va_end(list);
>+        mSHA1.update(reinterpret_cast<const uint8_t*>(str.get()), str.Length());

I think this is going to hit bug 804272, which will cause problems (crashes?) on ARM.

>@@ -62,53 +97,61 @@ bool ValidWriteAssert(bool ok)
> 
>     NS_StackWalk(RecordStackWalker, 0, reinterpret_cast<void*>(&rawStack), 0);
>     Telemetry::ProcessedStack stack = Telemetry::GetStackAndModules(rawStack, true);
> 
>     nsPrintfCString nameAux("%s%s", sProfileDirectory,
>                             "/Telemetry.LateWriteTmpXXXXXX");
>     char *name;
>     nameAux.GetMutableData(&name);
>+
>+    // We want the sha1 of the entire file, so please don't write to fd
>+    // directly, use sha1Stream.

Nit: s/,/;/

r=me, but we need to figure out whether bug 804272 is a problem for you.
Attachment #673957 - Flags: review?(justin.lebar+bug) → review+
> r=me, but we need to figure out whether bug 804272 is a problem for you.

https://hg.mozilla.org/integration/mozilla-inbound/rev/597981ef2ef5

No, so far this is OS X only.
This bug is already really large; I removed the leave-open and will open new bugs for the follow-up patches.
Whiteboard: [leave open]
Backed both out in http://hg.mozilla.org/integration/mozilla-inbound/rev/791d6f9b8ec8 - I'm not sure just what it's complaining about, but Windows is complaining about "one or more multiply defined symbols found" a la https://tbpl.mozilla.org/php/getParsedLog.php?id=16415411&tree=Mozilla-Inbound
(In reply to Phil Ringnalda (:philor) from comment #66)
> Backed both out in
> http://hg.mozilla.org/integration/mozilla-inbound/rev/791d6f9b8ec8 - I'm not
> sure just what it's complaining about, but Windows is complaining about "one
> or more multiply defined symbols found" a la
> https://tbpl.mozilla.org/php/getParsedLog.php?id=16415411&tree=Mozilla-
> Inbound

Thanks. I am guessing that someone beat me to linking mfbt into xul. Adding it a second time should work on OS X and Linux (the linker ignores it), but not on Windows, it looks like.

I did a build only push to try without the changes to toolkit/library/Makefile.in:

https://tbpl.mozilla.org/?tree=Try&rev=0cadb5661ca0
Aha, I found the review I lost!

I landed bug 803692, so you don't need to the cast anymore in

>+        mSHA1.update(reinterpret_cast<const uint8_t*>(str.get()), str.Length());
https://tbpl.mozilla.org/?tree=Try&rev=ca63211ba50b

This one builds locally on OS X. If windows is happy I will push back to m-i.
Erm, MFBT_API() isn't kosher in C++98 or C89, because you're invoking a macro with the wrong number of arguments:

In file included from /home/jwalden/moz/slots/mfbt/SHA1.cpp:6:0:
./dist/include/mozilla/SHA1.h:47:14: warning: invoking macro MFBT_API argument 1: empty macro arguments are undefined in ISO C90 and ISO C++98 [enabled by default]
./dist/include/mozilla/SHA1.h:47:14: warning: invoking macro MOZ_EXPORT_API argument 1: empty macro arguments are undefined in ISO C90 and ISO C++98 [enabled by default]

mfbt/GuardObjects.h just puts MOZ_EXPORT_API() on the entire class.  Is there a reason that won't work here?
https://hg.mozilla.org/mozilla-central/rev/20c0647bdacc
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
> mfbt/GuardObjects.h just puts MOZ_EXPORT_API() on the entire class.  Is
> there a reason that won't work here?

Yes, the use of __attribute__((weak)); see bug 805096. I will try to refactor the code to avoid the macro with no arguments, but I am starting to think that just dropping -z,defs when we actually want a library to have undefs would be best.
I landed the mfbt/SHA1.h changes here on Aurora, because they're needed for bug 803684.

https://hg.mozilla.org/releases/mozilla-aurora/rev/5844c438d077
And backed out of Aurora because we're missing some mfbt build glue (see bug 803684).

https://hg.mozilla.org/releases/mozilla-aurora/rev/27e470b01865