Closed Bug 1909747 Opened 3 months ago Closed 2 months ago

Crash in helper.exe on post update firewall setup

Categories

(Firefox :: Installer, defect, P3)

Firefox 130
defect

Tracking

()

RESOLVED FIXED
131 Branch
Tracking Status
firefox130 - wontfix
firefox131 --- fixed

People

(Reporter: agashlin, Assigned: nipunshukla002)

References

Details

(Whiteboard: [fidedi-ope])

Attachments

(4 files)

For the last week or so on my Windows 10 machine (64-bit 10.0.19045), when Nightly updates on startup I have been getting the Just In Time Debugger popup informing me that helper.exe crashed. The command line is C:\Program Files\Firefox Nightly\uninstall\helper.exe /PostUpdate.

Digging into it today I saw that the crash is coming from this line in the liteFirewall NSIS plugin, which is attempting to print an error result in an exception catch block. But this is a bug: It is trying to print the return value of _com_error::Error() as a string ("%s"), but it is really an HRESULT. This explains why the crash is attempting to read address 0x800401F3, which is CO_E_CLASSSTRING.

I don't see any recent changes to how the installer/post-update helper deal with the firewall config, though I haven't been paying close attention. It's likely that something recently changed on my machine to cause this error recovery path to be reached only now. My Update History shows two update failures (20200722214643 and 20240723211328, patch apply failed), but then there are also successful patches after those, so maybe my install is corrupted somehow. I don't recall making any firewall configuration changes recently, though there was a big Windows update not long before this started happening.

Please let me know if there's any more info I can provide or tests I should try before I blow away my Nightly install and firewall settings. I can provide the minidump from helper.exe if it would be useful.

The bug is marked as tracked for firefox130 (nightly). We have limited time to fix this, the soft freeze is in 7 days. However, the bug still isn't assigned.

:Amir, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(ahabibi)
Duplicate of this bug: 1910033
Assignee: nobody → nshukla
Status: NEW → ASSIGNED

Thanks for your work on fixing this Nipun!

Interesting that this did affect someone else, hm. I'm still trying to figure out why this just started showing up. Some notes without a conclusion:

I did a semi-clean install to a different directory and let it update, it crashed the same way, so it isn't something specific to my install, though could be related to the shared maintenance service or other settings on the same machine. The firewall rule was set up correctly by the installer. Manually running the helper with /PostUpdate from an admin (or SYSTEM) command line works fine (it will recreate the firewall rule if it was missing), so the basic functionality is still working. But if I let it update itself, helper crashes. So far I haven't been able to attach to the crashing process to step through the failures leading to the crash.

I don't think it's related to the PostUpdate fixes in bug 1896944 and bug 1898792, though the call to CallFirewallEntries is suspiciously close to the change to $RegHive I don't see what difference it would make. And the timing isn't quite right, since that went into Nightly mid-June and I've been updating it mostly daily since then, though maybe I misremember how long I've been ignoring the debugger popup?

The report from bug 1910033 shows this happening in the 128.0 release build so I think being able to get into this catch block has been possible for a while. I took a quick look through some recent-ish PostUpdate work but couldn't find anything definitive. It could be that this just happens stochastically based on some weird system configuration intricacies but I haven't had the time to prove/disprove that yet.

Flags: needinfo?(ahabibi)

I used a filter on Windows Event Logs, and discovered that this issue started happening for me with the following information:

Faulting application name: helper.exe, version: 1.0.0.0, time stamp: 0x60fc9250
Faulting module name: msvcrt.dll, version: 7.0.20348.1, time stamp: 0x3e9086b0
Exception code: 0xc0000005
Fault offset: 0x0007d2c2
Faulting process id: 0x2289c
Faulting application start time: 0x01dadc2f08ad66f0
Faulting application path: C:\Program Files\Firefox Nightly\uninstall\helper.exe
Faulting module path: C:\Windows\System32\msvcrt.dll
Report Id: 56c20809-f127-4023-86a9-823da5951aed
Faulting package full name:
Faulting package-relative application ID:

and it started on 22 Jul 2024 8:22 PM UTC+0800.

Hope this helps narrow down what code to fix.

Severity: -- → S3
Priority: -- → P3
Whiteboard: [fidedi-ope]
Attached file liteFirewallW.dll

The liteFirewall plugin cannot be built with clang on the try server, as attempted here, due to its use of #import statements for included .tlb files. This is a known and longstanding bug as documented here. As such I am attaching a locally compiled dll for signing.

I've been seeing similar crashes since ~2024-07-20, at least that is the earliest crash dump I have. They all look similar:

msvcrt.dll!__output_l()
msvcrt.dll!_printf()
liteFirewallW.dll!10001292()
[Frames below may be incorrect and/or missing, no symbols loaded for liteFirewallW.dll]
[External Code]

I..e this is very likely the same issue. (I don't have debug symbols for helper.exe here so Visual Studio doesn't show much more info.)

(In reply to Dimitry Andric from comment #9)

(I don't have debug symbols for helper.exe here so Visual Studio doesn't show much more info.)

helper.exe is written in NSIS which I do not think has any kind of capability to generate debug symbols.

FWIW https://bitbucket.org/lsun/litefirewall/src/master/ seems to be the current canonical "upstream" for the NSIS liteFirewall plugin, but it looks like the last activity was ~12 years ago. I would consider it abandonware. (The links on https://nsis.sourceforge.io/LiteFirewall_Plugin are no longer working.)

That is, I would attempt to fix it in the simplest way possible, then move on. Alternatively, we could try to figure out which particular COM error is being thrown? In my Windows Event Log I see a number of Event ID 1140 error messages from Firefox Nightly Browser Agent, containing not much more than "The COM+ registry database detected a system error", and the following details:

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Firefox Nightly Default Browser Agent" />
    <EventID Qualifiers="32785">1140</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2024-08-15T09:13:16.5488257Z" />
    <EventRecordID>340480</EventRecordID>
    <Correlation />
    <Execution ProcessID="0" ThreadID="0" />
    <Channel>Application</Channel>
    <Computer>kilchoman</Computer>
    <Security />
  </System>
  <EventData>
    <Data>0x80110474 in RegisterTask:92</Data>
  </EventData>
</Event>

Indeed, 0x80110474 is COMADMIN_E_REGDB_SYSTEMERR, see https://learn.microsoft.com/en-us/windows/win32/com/com-error-codes-5; but it does not explain at all what the error means, unfortunately. There is a post in the Cygwin mailing list that refers to this error code: https://sourceware.org/pipermail/cygwin-apps/2009-October/025891.html, quoting from there:

I've read a bit more about COM threading. In mklink2.cc, it says:

/* Initialized in WinMain. This is required under Windows 7. If
CoCreateInstance gets called from here, it fails to create the
instance with an undocumented error code 0x80110474.
FIXME: I have no idea why this happens. */

Well, I have a theory. Moving the CoCreate call into WinMain guarantees it
is being called by the same thread that called CoInitEx, and therefore has an
apartment allocated to it - whether single or multi isn't the point here.

Now, when the CoCreate call was in make_link_2, it would be called from
whoever called DesktopSetupPage::OnFinish().

That ought to be the main message loop, which ought to be the main thread,
which ought to be the same thread that called CoInitEx. But if it was a
different thread, that had never called CoInitEx, the CoCreate call would fail.

So I guess this problem is something similar?

Pushed by nshukla@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/91ec2056c292 Modify liteFirewall plugin to prevent crashes upon catching error r=bytesized,bhearsum

(In reply to Pulsebot from comment #13)

Pushed by nshukla@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/91ec2056c292
Modify liteFirewall plugin to prevent crashes upon catching error
r=bytesized,bhearsum

Note that this diff looks strange: the actual modification to other-licenses/nsis/Contrib/liteFirewall/liteFirewall.cpp has:

-			printf("%s", e.Error());
+			/* Start Mozilla modification */
+			printf("0x%lx", e.Error());
+			/* End Mozilla modification */

while the other-licenses/nsis/Contrib/liteFirewall/mozilla_customizations.diff uses ErrorMessage() instead, which is incorrect:

+-			printf("%s", e.Error());
++			/* Start Mozilla modification */
++			printf("0x%lx", e.ErrorMessage());
++			/* End Mozilla modification */

I.e. ErrorMessage() gives you a string (as per https://learn.microsoft.com/en-us/cpp/cpp/com-error-errormessage?view=msvc-170), while Error() gives you a HRESULT, effectively a DWORD.

Thanks for catching that. It looks like an oversight when I changed the error output from a string representation of the error to a DWORD while it was still a WIP and neglected to update the diff to match it. I'll submit a follow up fixing it as soon as I get the chance.

Setting leave-open so this doesn't get resolved when the first patch merges to central. Note that we're building the Fx130 RC next week, so we'll want to get this uplifted to Beta ASAP.

Pushed by nshukla@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/949f7cf159da Fix liteFirewall mozilla_customizations.diff regarding error representation r=nrishel
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED

Bugs get resolved when they're merged to mozilla-central. What we should do here is remove the leave-open keyword from it once rev 91ec2056c292 gets to mozilla-central since that'll probably happen in a different merge than rev 949f7cf159da. Feel free to ping me on Slack or Matrix to discuss if it's unclear what I'm saying.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

No, that makes perfect sense. With respect to uplifting, I'm not sure how necessary that is. This looks like a problem that doesn't impede normal functioning of the feature (setting correct Windows firewall rules for Firefox) and one that doesn't seem to be extremely common. In my opinion I don't think there's much harm in letting it ride the train.

And after all that, they ended up merging together in the end! :-)

With respect to uplifting this to Beta, it got set to tracking+ for Fx130 about a month ago shortly after the bug was filed. But maybe that was before the scope was better understood. If the impact of this bug isn't as severe as originally feared, I'll defer to your judgement on whether to nominate or not.

Status: REOPENED → RESOLVED
Closed: 2 months ago2 months ago
Flags: needinfo?(nshukla)
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → 131 Branch

I don't think it's severe enough to warrant uplifting as it currently stands.

Flags: needinfo?(nshukla)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: