Open Bug 241081 Opened 20 years ago Updated 2 years ago

Cleanup for mozilla/security/nss/lib/freebl/mpi/mpi_x86.s

Categories

(NSS :: Libraries, defect, P3)

x86
Linux

Tracking

(Not tracked)

People

(Reporter: tenthumbs, Unassigned)

References

Details

(Whiteboard: awaiting consensus of interested parties)

Attachments

(3 files, 2 obsolete files)

It's missing some .size directives which help the linker. A patch is attached.

Also, if speed really is a consideration then s_mpv_div_2dx1d at least could be
a gcc inline function which would be even faster. Something like this.

static inline mp_err s_mpv_div_2dx1d (mp_digit nhi, mp_digit nlo, mp_digit den,
                                      mp_digit *qp, mp_digit *rp)
{
    __asm__ __volatile__
    (
        "divl   %4              \n\t"
        :"=a" (*qp), "=d" (*rp)
        : "a" (nlo), "d" (nhi), "rm" (den)
        : "cc"
    );

    return 0;
}
Attached patch patch (obsolete) — Splinter Review
Assignee: wchang0222 → MisterSSL
Comment on attachment 146587 [details] [diff] [review]
patch

>- # Magic indicating no need for an executable stack
>-.section .note.GNU-stack, "", @progbits
>-.previous

Did you intentionally delete these?  You did not
describe this change in the bug report.
> Did you intentionally delete these?

Yes I did. They would affect an entire executable/shared object that contained 
this file. Are you sure you want to do that in an obscure utility file? I 
would think you should allow the compiler to control this. Leave it in if you 
want. You can just delete the .previous line, it's uesless at the end of a 
file.
What is the net effect of this change?  smaller shared lib?  or ?

Is there any reason NOT to disable executable stacks?  Executable stacks
strike me as a vulnerability.  mpi is linked into nss's libsoftoken, 
which has no need of an executable stack.  
Priority: -- → P3
Target Milestone: --- → 3.10
The .size directive gives hints to ld which may or may not use the info to 
move the code to a more efficient location. It's highly platform dependent but 
gcc always emits it. There's no reason mot to and it's harmless.

As for GNU-stack, it's not at all clear it actually does anything. See 
<http://gcc.gnu.org/ml/gcc-patches/2003-06/msg00302.html>  and 
<http://gcc.gnu.org/ml/gcc-patches/2003-11/msg00975.html>. It's just a hint to 
the kernel. In fact, the linker in binutils-2.15.90.0.2-2.15.90.0.3 from 
kernel.org just discards the section. Since you have no control over any 
linker but your own it guarantess nothing. And can you guarantee that this 
won't adversely affect some gcc version out there that might want an 
executable stack?

And, of course, there's nothing preventing some other part of a particular 
process from changing the permissions on a peice of memory.

It's just a couple of lines that make you feel good without necessarily doing 
anything.
Comment on attachment 146587 [details] [diff] [review]
patch

># Magic indicating no need for an executable stack
>.section .note.GNU-stack, "", @progbits
>.previous

Just wanted to provide the history of these three
lines.	I received these three lines from a Red Hat
Linux developer (via Chris Blizzard).
note.GNU-stack is very much a Redhat thing so it's not surprising their
people think it's the usual thing to do. This is all part of the "exec
shield" package but that's not part of the official Linux kernel (at least
not yet). Other distributions may choose to do things differently so there's
the potential for conflicts.

Since the stack belongs to the application putting this is a shared library
is probably superfluous. Linking against a static library might be a problem
if the app needs an executable stack but your library says no.

This isn't worth getting too excited about. It's probably safe to leave it
if you want as long as you realize that it doesn't do much for you and that
you might have some problems.
Comment on attachment 146587 [details] [diff] [review]
patch


>- # Magic indicating no need for an executable stack
>-.section .note.GNU-stack, "", @progbits

My guess is that this doesn't force a non-executable stack, but 
rather avoids forcing an executable stack.  I'm inclined to 
leave it in if RH recommends it and it doesn't hurt anyone else.

Wan-Teh, if you approve this patch, I'll check it in on the trunk.
Attachment #146587 - Flags: review?(wchang0222)
Those 2 lines did hurt OS/2 (see bug 240775 ).
Comment on attachment 146587 [details] [diff] [review]
patch

r=wtc, assuming the executable stack stuff at
the end of this file is left alone.

I don't know the syntax of the .size directive,
so I don't know if this patch is correct.  I'll
trust the submitter of this patch.

If I compile the following C file using
"gcc -c -O2 -S foo.c":

int incr(int a)
{
    return ++a;
}

int decr(int a)
{
    return --a;
}

the compiler generates the following assembly code:

	.file	"foo.c"
	.text
	.p2align 2,,3
.globl incr
	.type	incr,@function
incr:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	incl	%eax
	leave
	ret
.Lfe1:
	.size	incr,.Lfe1-incr
	.p2align 2,,3
.globl decr
	.type	decr,@function
decr:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	decl	%eax
	leave
	ret
.Lfe2:
	.size	decr,.Lfe2-decr
	.ident	"GCC: (GNU) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)"

So my question is what "." means in a .size directive?
Attachment #146587 - Flags: review?(wchang0222) → review+
I think Julien would like to see those last 3 lines removed from the source, 
(as this patch presently does), and Wan-Teh would like to see them remain 
in the source.  I just wanna do whatever makes the .s usable on most platforms,
or else make another source file for platforms with odd requirements.
Please advise.
I am not really qualified to review this
patch, so you can ignore my comments.

I tried to find documentation of .size, but
what I found is not very helpful:

http://www.gnu.org/software/binutils/manual/gas-2.9.1/html_chapter/as_7.html#SEC123
Wan-Teh Chang: the '.' is the current location counter which, in this case, is 
the location immediately after the ret. The compiler always puts a local label 
at this point which always has the same value as '.'. The label is 
superfluous.

Nelson Bolyard: if you use a special make rule for this file you can use "gcc 
-x assembler-with-cpp" as I suggested and the wrap the two lines in an ifdef. 
You may just need to redefine make's AS variable.
That last message was confusing. I didn't mention gcc is this bug. The idea is 
that gcc will pass the file through the preprocessor so you can use the usual 
tricks.
Comment on attachment 146587 [details] [diff] [review]
patch

Thank you for the explanation, tenthumbs.

I agree the .size directives in this patch
are correct.  We should check in those changes.
As for removing the executable stack stuff,
it is a good idea to have Chris Blizzard find
a right person to review it.

Does anyone know why we had "nop" instructions
after "ret" instructions?
Wan-Teh :

Maybe the nops after ret were there to align the code of the next function on
some optimized boundary which causes faster execution ? They would probably have
to be changed whenever the asm code does.
In answer to comment 15:
If I recall correctly, I first wrote mpi_x86.s, and then derived 
mpi_x86.asm and (much later) mpi_i86.s from it.  

On the linux platform, I started by compiling a c implementation with 
optimization, and examining the compiler's assembly code output, to see 
what pseudo-ops were appropriate to use.  Then I replaced most of the 
assembly code, following the style of the original pseudo ops.  

When I was nearly done, I linked my assembly code into a shared library, 
and then disassembled the result to see if the assembler had reordered 
instructions or done anything else unexpected.  There I found those NOPs 
had been put in, even though they were not in my original code.  
So, figuring that whatever tool put them there knew what it was doing, 
I copied them into the source, so that the source exactly matched the 
assembled object code.  

I never did figure out why those NOPs were added (presumably by the 
linker).  I figured they didn't hurt and might even help in some way, 
(e.g. alignment) so I left them in.

Here are some guesses as to how/why those NOPs might have been put in:

a. Align the functions to begin on "paragraph" boundaries (cache line
boundaries), to minimize the number of cache line fetches required to 
run the function.

b. If the processor does instruction pre-fetching and partial decoding 
of instructions following the current instruction, NOPs minimize the
amount of work done there.  (it would be pretty bad branch-prediction 
logic that failed to notice a return!)

I was looking through some disassemblies of linux libc code today, and I
noticed that immediately after nearly every unconditional jump or return
there was some form of nop instruction.  among the opcodes I saw were:
    nop                 (1 byte)
    mov  %esi,%esi      (2 bytes)
    lea  0(%esi),%esi   (3 bytes)
In all cases, the instruction following the nop was aligned on a 4-byte 
boundary.  I found one case where the instruction immediately after a 
return was already on a 4-byte boundary, and there was no nop after 
that return.  Of course, the next non-nop instruction after a return or 
unconditional jump is a jump target.  So, I think the intent is clear.   
It is to put jump targets on 4-byte boundaries. 

Our assembler sources should probably do something to force alignment of
the first function, and then add nops to keep all subsequent jump targets
4-byte aligned.  
Attached patch patch with align directives (obsolete) — Splinter Review
Aligning is easy. This is the previous patch + suitable directives. These are
the ones gcc uses when tuning for an i486. If you tune for a pentium4 then gcc
doesn't emit them. They will increase the code size slightly but should be
otherwise harmless.
Attachment #146587 - Attachment is obsolete: true
The latest patch just aligns the functions. It's not hard to align loops but
you should really test if loop alignment actually does any good on your cpu of 
choice.
Attached patch patch 3Splinter Review
Yet more alignment. This reproduces what gcc does. There's no gurantee this
will actually improve the speed. You need to test it.
Attachment #146848 - Attachment is obsolete: true
Looking through the gcc source code shows that it jumps through hoops to 
figure out what the assembler is capable of. I doubt the alignment stuff will 
work on all platforms, like the BSD's. It may be better not to try.
Attached patch patch 4Splinter Review
If anyone still cares (and I don't mind if they don't) here's a patch with some
ifdefs to change the symbol names.
A completely different approach, a common file with the function bodies and
some small wrappers that change names and make them global. No ifdefs. This
works on Linux but that's the only one I can test.
tenthumbs: we have already figured out how to make
gcc use the same symbol names on OS/2.  So we only
need to delete the non-executable stack directive
in order to share mpi_x86.s between Linux and OS/2.
Folks,  Thank you all for your contributions to this bug.  
I'd like to see you all collaborate and produce a single patch that 
solves the issues of this bug (bug 241081) and of bug 240775.  
When we have one patch with which everyone here is satisfied, 
I will be happy to check it in.

I'm not clear at the moment about the resolutions of several aspects of 
these bugs:

1. Do we want to keep the non-executable stack directive at all?  
Perhaps only on certain platforms?  I was under the impression that 
Wan-Teh wanted to keep it at least for red hat.

2. Do we want the .size directives?  

3. Do we want any alignment directives?

4. Do we want any no-ops?
If you're using the same file on multiple platforms then I think you want to 
include as much information as possible so that the platform can perform as 
many optimizations as it can. That's why I suggested the .size directives.

The executable stack thing is really an question of whether a given platform 
supports arbitrary code sections, ELF does and OS/2 apparently does not. Given 
that you're using gcc for the rest of the code (which will insert such things 
if it wants to) it's probably not necessary for this one file.

Nops and alignment directives are really the same issue. Whether they have any 
effect depends on the run-time cpu (which you don't know). The gas docs say 
that different platform assemblers behave differently so it's difficult to do 
the right thing. I don't think it's worth the trouble. Presumably, the 
assembler code is already significantly faster than gcc code so I don't think 
it's worth getting too greedy.
Blocks: 240775
Tenthumbs - you hit a hot button issue here . The RSA code can always be faster,
especially in NSS server applications. I don't have time to benchmark the effect
of alignment on the various CPUs. However, presumably, alignment, whether taken
care of through NOPs or through alignment directives, would not hurt the
performance. It could only benefit for certain CPUs. So I would be in favor of
having it done one way or the other. Perhaps we would align on the largest
possible boundary needed for current CPUs to run code optimally (eg. maybe a 64
bit boundary ?), which wouldn't hurt on the other CPU that don't need it -
except for a few wasted NOP bytes of memory, which never get executed anyway
(might as well use DB FF ).

Since the alignment directives aren't portable, there are 2 solutions:

1. split the mpi_x86.s source files into two - one with directives, one with
manual alignment done through NOPs . This might also allow using the other
directives.

2. just use one source file with manual alignment done through NOPs, and don't
use any directives

I exchanged email with Chris Blizzard on Monday.
He said that we should keep the non-executable
stack directive for Linux.

So we need to either add #ifdef LINUX to mpi_x86.s
and assemble it using gcc (so we can use C
preprocessor macros), or make a copy of mpi_x86.s
for OS/2.
If speed is a major issue then measurements are essential. I've found that 
predictions are almost always wrong. People just don't think like machines.

As for alignment, yes nops can hurt. They take time and space. You
absolutely have to measure the performance difference on typical data. An
improvement on a loop of 1000 may have little (or a bad) effect a loop of
100.

Then there's the cpu. For a 32-bit Intel chip and clones, gcc-3.4.0 can tune
for 17 different flavors. Ultimate speed would mean a lot of files.
 
If you assume that the Intel Pentium4 is the most common chip sold and used
these days and if you believe gcc's optimizations for that chip then you
don't want any alignment at all.

Inserting manual nops is a bad idea. You wind up with code that's impossible
to maintain. If you put in big nasty warnings about why the code does what
it does then you wind up with code that no one will touch and it doesn't get
updated when it should. Or someone will ignore the warnings and screw up all
the careful work.

Obviously I think this is a bad idea. If performance is critical then you 
should consider gcc's inline assembly to avoid function calls but that's 
harder than you think.

I'm not a big fan of ifdefs. In one of the K&R books there's a caution 
noting that N ifdefs imply 2^N complexity. I would prefer separate files. 
Here you have assembler code for non-ELF, ELF on Linux, and ELF for everyone 
else. You may well need three different versions. See my last patch, though. 
On Linux you can include one assembler file in another so you can keep the 
Linux-specific stuff separate.

BTW, there are a few tweaks that could improve speed marginally; e.g., the 
usual idiom for testing whether a register contains 0 is to use the test 
instruction rather than comparing with 0. I can do that if anyone's 
interested.
tenthumbs, n ifdefs only imply 2^n complexity if each ifdef is for a different
condition. I don't like the ifdefs either much in this case, but I think we are
only talking about using them in order to hide the directives that aren't
supported on some platforms; ie. there would be only two states - directives
supported or not. But of course once you start doing this you open the door to
more states.

Your points on measurements are well taken. I think later this year I will have
cycles to do more performance measurements for RSA on x86 platforms. Most likely
x86-64 will be involved, and there will be another body of assembly code for it,
though.
*** Bug 240775 has been marked as a duplicate of this bug. ***
Folks, the reason this bug isn't progressing, is that the various 
contributors (y'all) seem to be in disagreement about how to proceed.  

The questions I asked in comment 26 are still unsettled.  
As soon as consensus emerges about those 4 questions, I will proceed.
Whiteboard: awaiting consensus of interested parties
All I wanted to do originally was to add size information as an aid to the 
linker. The question then arose whether this code could/should be shared 
among various platforms which, in turn, depends on whether a given platform 
handles arbitrary sections and the platform naming conventions. We then got 
diverted into an optimization discussion.

As I see it, there are just a few real issues here. Assuming this is a good 
idea then the first question is whether there should be separate files for 
each platform. If so, then we're essentially finished. If not, then the next 
question is how to mangle the files. The usual method is to use the C 
preprocessor. That's straightforward too.

Someone has to decide on which approach to use.

The other stuff probably belongs in another bug.
QA Contact: bishakhabanerjee → jason.m.reid
3.11 is designated as a performance enhancement release, so I'd like to
resolve this bug once and for all in that release.  
Target Milestone: 3.10 → 3.11
QA Contact: jason.m.reid → libraries
Any chance of getting a freebl/Makefile patch in? Perhaps in a separate bug?
I'm trying to get a 2.0 Firefox build on x64 and ran into this problem (and a few
others of course).
Assignee: nelson → nobody
Target Milestone: 3.11 → ---
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: