Closed Bug 968385 Opened 10 years ago Closed 10 years ago

sec review for python pip replacement peep

Categories

(mozilla.org :: Security Assurance: Review Request, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lonnen, Assigned: ygjb)

Details

https://github.com/erikrose/peep

peep is a library that wraps pip and provides extra assurances about downloaded libraries. WebDevs are looking at it as a way to handle dependencies while avoiding vendoring libraries or running a pip mirror. Before we push for wider adoption, potentially in playdoh, we'd like to have security check over the lib and our approach.
:kang please take a look at this but don't spend a ton of time, and when you're done ping :yvan for a further look
Assignee: nobody → gdestuynder
OS: Mac OS X → All
Hardware: x86 → All
I've seen this before :)
The main feature is that it adds sha256 hashes for each package to the requirements file for integrity checking. This ensures that you get the correct package and that you're getting the same package the author intended.
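To make that concrete, here is a minimal sketch (in Python, not peep's actual code) of what hash-pinned integrity checking boils down to; the pinned digest is a placeholder and peep's own hash encoding may differ:

import hashlib
import sys

# Illustrative only -- not peep's actual code; the pinned digest is a placeholder.
PINNED_SHA256 = "0" * 64

def verify_archive(path, expected_hex=PINNED_SHA256):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_hex:
        # Not byte-for-byte the archive that was originally vetted: stop.
        sys.exit(1)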

The assurance of a single sha256 isn't as strong as a gpg signature, for example, because it's not tied to an identity (what gpg really does is cryptographically tie a similar hash, i.e. a signature, to an identity).

That said, while it's not as strong as gpg signing (which other package managers often do), this is definitely a good improvement over the current model. Basically, while I would prefer gpg signatures (or similar) eventually, I don't see anything wrong with the design of the tool.

Ops should fail hard when this exits with status 1 (checksum failed) and figure out what went wrong.
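For example, a deployment step could look roughly like this (illustrative sketch only; paths and arguments are whatever your deploy script actually uses):

import subprocess
import sys

# Run peep and hard-fail the deploy on any non-zero exit (e.g. a checksum mismatch).
status = subprocess.call(["peep", "install", "-r", "requirements.txt"])
if status != 0:
    sys.exit("peep failed (possible hash mismatch); aborting the deploy")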

The code and the concept are straightforward, too.

Passing to yvan
Assignee: gdestuynder → yboily
Actually, this is intended to be better than a GPG signature. A GPG signature from a package author tells us only that the package author (or whoever stole his keys) vouches for a certain package. It does nothing for deployment repeatability. Peep is specifically designed to "stop the presses" in situations where a package author sneakily changes a package without updating its version number. It trusts the committers to the top-level project, not the authors of the packages it uses.
During the build phase of a project, any command exiting non-zero will abort the build. Peep getting grumpy about discrepancies between package and payload should abort builds, then.

Is it wise to start recommending peep widely to our python-based projects? There is another webdev superteam meeting coming up and it would be a good place to encourage adoption.
Can you have one of the opsec folks have a look at this as well?  I am going to spend some time on it this morning, but this impacts opsec more than appsec.
Flags: needinfo?(jstevensen)
(In reply to Yvan Boily [:ygjb][:yvan] from comment #5)
> Can you have one of the opsec folks have a look at this as well?  

See comment 2
Flags: needinfo?(jstevensen)
Yvan asked me to take a look at this request. Here are some thoughts.

I think this is a great addition to pip. Having hashes in your requirements.txt is useful to prove that you are actually getting what you asked for.

I find both the security and repeatable builds aspects of this project compelling.

As a developer I think the repeatable builds are the most interesting thing, especially if there are doubts about PyPI's version management.

As a security engineer I think there are some catches though.

The main problem is the initial installation of packages. Python devs install packages all day. I probably set up virtualenvs multiple times a day, both on my laptop and on servers, and I download code into those envs all the time, not just as part of projects but also ad hoc with `pip install` from the command line.

Personally I don't really read all package code that I install and use. And I don't think that is an uncommon thing. Those packages are essentially coming from sources that we only trust by reputation.

So, basically anything goes. The code in a library could be malicious or even the code in the setup.py could be doing something nasty.

Note that this is not a Python problem. For example NPM, CPAN and RubyGems suffer from the same problem.

I think that if you want to trust your dependencies more, without doing code reviews of every package, you simply have to stop downloading stuff from the package archives and rely on the packages that ship with your distro. It is more likely that those have had review, or at least broader usage, in a controlled environment where malicious activity would be noticeable. (Especially RHEL, I think - I know, outdated packages :-/)


Another option would be to set up a Mozilla-hosted subset of PyPI that only contains vetted and reviewed packages. For those packages we could have published hashes. This is a lot of work, but if we really think the risk is very high then we can consider it.


An additional thing to consider is to stop using source packages and start investigating Python Wheels. Wheels contain just the Python and native artifacts without setup scripts, so there would probably be less to review. Hashes can be applied in the same way. http://pythonwheels.com


Does anyone know of any cases of malicious code being put on package archives? I can't really remember a serious one happening recently.
Thanks for taking a look, Stefan!

> Personally I don't really read all package code that I install and use. And I
> don't think that is an uncommon thing. Those packages are essentially coming
> from sources that we only trust by reputation.

Yes, all peep handles is the repeatability. Notice how I left the how-to-vet-your-packages issue as an open question in the documentation. :-)

As with our current practice of checking vendor libs into source control, there's no magic, automatic way of vetting; there are heuristics, manual review, and, as you say, punting to reputation and the many-eyes hypothesis. We do some of each today. Maybe someday we'll have provably sandboxed code, but that'll be a different language, a different time, and a different conversation. What peep does now is eliminate the need to run private PyPI servers and deal with their ACLs; the vetting you'd need to do before uploading a package is conserved. Alternatively, it eliminates the difficulty of fooling around with version control to update a vendor package.

> I think if you want to trust your dependencies more, without doing code reviews
> of every package, that you simply have to stop downloading stuff from the
> package archives and just rely on those packages that ship with your distro. It
> is more likely that those have had review, or at least better usage, in a
> controlled environment. Where malicious activity would be noticeable.
> (Specially RHEL I think - I know .. outdated packages :-/)

I'm not sure a distro-shipped package has any more eyes on it than one off PyPI. It may even have fewer, since, as you say, they tend to get outdated fast, driving people to other sources. The upside is that somebody at the distro signs them, but, on the other hand, I bet they don't read them before signing.

> Another option would be to setup a Mozilla hosted subset of PyPi that only
> contains vetted and reviewed packages. For those packages we can have published
> hashes. This is a lot of work, but if we really think the risk is very high
> then we can consider that.

That's what I was trying to avoid, since it's a lot of overhead running a server, figuring out who has review access, etc. Peep enables a more decentralized workflow, letting each team determine their own risk tolerance. I think that's a good tradeoff.

That's not to say there aren't clever ways we could benefit from each other's review efforts: a really lightweight way might be to start a repo someplace and check in a big text file with hashes of various versions of libs that we had vetted. Then you could check that before doing your own vetting and duplicating work. We could even GPG-sign the commits if we liked. Better, we could have a little bot crawl all the Mozilla (or even other) project trees and freak out if two projects have different peep hashes for a lib; that would catch some targeted attacks. (It's the same idea as the EFF's SSL Observatory.)
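A rough sketch of what such a bot could do, assuming (for illustration) that requirements files carry the hashes on "# sha256: ..." comment lines next to each pin - check peep's docs for the exact format, and the parsing here is illustrative rather than peep's own code:

import collections
import re

# Hashes are expected on comment lines like "# sha256: <digest>" just above each pin;
# that layout is an assumption made for illustration.
HASH_LINE = re.compile(r"#\s*sha256:\s*(\S+)")
REQ_LINE = re.compile(r"^([A-Za-z0-9_.\-]+)==(\S+)")

def collect_pins(requirements_by_project):
    """Map (package, version) -> {digest: [projects pinning that digest]}."""
    pins = collections.defaultdict(lambda: collections.defaultdict(list))
    for project, path in requirements_by_project.items():
        pending = []
        with open(path) as f:
            for line in f:
                hash_match = HASH_LINE.search(line)
                if hash_match:
                    pending.append(hash_match.group(1))
                    continue
                req_match = REQ_LINE.match(line.strip())
                if req_match:
                    for digest in pending:
                        pins[req_match.groups()][digest].append(project)
                    pending = []
    return pins

def report_disagreements(pins):
    # Two projects pinning different digests for the same release deserves a look:
    # it might be an innocent re-upload, or an attack aimed at one project.
    for (name, version), by_hash in pins.items():
        if len(by_hash) > 1:
            print("%s==%s has conflicting hashes: %s" % (name, version, dict(by_hash)))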

> An additional thing to consider is to stop using source packages and start
> investigating Python Wheels. Wheels contains just the python and native
> artifacts without setup scripts. So there would probably be less to review.

Wheels are a good thing. They remove the possibility of malice at install time (which should be limited by server privs anyway). But they do nothing for runtime, nor do they lighten the review burden much. They are signed by the package authors, which is nice but just shifts trust onto them and their key- and version-management habits (which I actually *have* been bitten by in the wild).

> Does anyone know some history about malicious code being put on package
> archives? I can't really remember a serious case happening recently.

I know of no cases in the wild; I just want to sleep well, use recent packages, and avoid maintaining our own servers where possible.

In summary, peep currently gives us equivalent functionality to storing vendor libs in version control or running a private PyPI, with a much lighter maintenance burden. Vetting gets no harder and no easier. So, as a win on one dimension and a no-change on the other, I move we approve peep as a way of deploying Python dependencies. Who has the signet ring here? :-)
* Not being able to use whatever version of a package you want might be secure, but I suspect it's so detrimental to developer productivity that pretty soon developers will find desperate ways around it.

* A potential risk I can see is when the developer adds the very first checksum of a package. Primarily to you Erik: Is there a risk that a developer does an unsafe `pip install` and calculates the checksum of that, thus poisoning the well already at the inception of the new dependency?

* To be honest, when someone lands a patch that adds a new package to the vendor directory, I have never reviewed the package itself. I glance over it in the Pull Request and just focus on the new code that uses the new dependency.

* I'd LOVE to hear some real war stories where PyPI packages have been man-in-the-middled, leading to nasty hacks.
The potential risk is no different from what we have now. So, yes: that's a risk, just like a developer adding a poisoned package to vendor/ right now is. peep doesn't make that better or worse. All it tells you is that the package you downloaded today is the same as the package you downloaded originally.

I haven't had a MITM issue with PyPI, but I have had an issue with some jackass who pushed a package for version x, then a few weeks later made some changes and pushed a *new* package for the same version x which had bugs and sucked. So pulling down that package for version x from one month to the next yielded different code.

I know that sort of thing was brought up at one point during the discussions to fix PyPI last year, but I don't know offhand if that was fixed or not with the recent PyPI revamp.
What willkg said.

Also...

> Primarily to you Erik: Is there a risk that a developer does an unsafe `pip install` and calculates the checksum

pip is no more or less safe than peep the first time. In either case, you have the assurances SSL can provide (shaky in these days of CDNs and the NSA) and no other. It's the second time around that peep's guarantees kick in.
If a specific team feels that this is a proper tool for them then they should absolutely use it.

If nobody objects then I will close this bug with WONTFIX because I don't think there is any action that could or should be taken.
(In reply to Stefan Arentz [:st3fan] from comment #12)
> If a specific team feels that this is a proper tool for them then they
> should absolutely use it.
> 
> If nobody objects then I will close this bug with WONTFIX because I don't
> think there is any action that could or should be taken.

The objective of this bug is a blessing from you guys to say that we can run `peep install -r requirements.txt` as part of the deployment script on the IT-run servers. At the moment, we don't have that, so we put all dependencies into vendor/ and then we have hand-curated RPM builds for various Python binary packages.
(In reply to Peter Bengtsson [:peterbe] from comment #13)
> (In reply to Stefan Arentz [:st3fan] from comment #12)
> > If a specific team feels that this is a proper tool for them then they
> > should absolutely use it.
> > 
> > If nobody objects then I will close this bug with WONTFIX because I don't
> > think there is any action that could or should be taken.
> 
> The objective of this bug is a blessing from you guys to say that we can run
> `peep install -r requirements.txt` as part of the deployment script on the
> IT-run servers. At the moment, we don't have that, so we put all dependencies
> into vendor/ and then we have hand-curated RPM builds for various Python
> binary packages.

Reading back through the discussion in this bug, I think the conclusion is that this improves both build repeatability and security.

I don't think you need the blessing or approval of the security team to improve your processes.
Great! Thanks for the reviews, all.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
So, let's rewrite Playdoh!
(In reply to Peter Bengtsson [:peterbe] from comment #17)
> So, let's rewrite Playdoh!

Done. https://github.com/mozilla/sugardough