Bug 358591 (amo-mxr)

Ability to view and search source code of all extensions on addons.mozilla.org (MXR for add-ons)

RESOLVED FIXED

Status

mozilla.org Graveyard
Server Operations
P4
normal
RESOLVED FIXED
11 years ago
2 years ago

People

(Reporter: Jesse Ruderman, Assigned: justdave)

Tracking

Details

(Whiteboard: [sg:want P2])

Attachments

(2 attachments)

4.10 KB, application/octet-stream
Details
88.96 KB, application/octet-stream
Details
(Reporter)

Description

11 years ago
It would be nice to be able to search the JavaScript source code of all extensions on addons.mozilla.org.  I might use this to:

* Search for things that look like security holes, and notify authors about them.

* Search for explicit use of unsafe (non-wrapped) access to a page's DOM, so I can figure out if the API needs to be better to support specific patterns.  This could also improve the security of extensions.

* Get an idea of how many extensions will break when bug 47903 is fixed, and notify the authors of affected extensions.
(Reporter)

Updated

11 years ago
Whiteboard: [sg:want P5]
(Reporter)

Comment 1

11 years ago
This would be great for some quick memory-leak-pattern auditing, too.

Comment 2

11 years ago
Sync files from releases.mozilla.org, unpack, index with a tool like lxr? Might want to have some fancy stuff where it tells us what versions are the latest for that listing, or if an extension has been deleted from the listings or renamed or whatever.

Some extensions have binary components, so you're not going to be able to look inside those.

I don't think this is going to be integrated into AMO, it'd probably be a separate webtool. 
Component: Add-ons → Administration
QA Contact: add-ons → administration
Version: unspecified → 2.0
Yeah, this is a WONTFIX as an AMO bug.  If you're going to reopen, please pick another product as target!

(Can Google Code Search help us here?)
Status: NEW → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → WONTFIX
Just a note: I did write a page in Remora that allows for viewing the contents of an individual extension for the review queue, but in its current state it's not suitable for all extensions.

I would be willing to use that code in a new webtool for all extensions, but that would be after Remora launch.

Updated

10 years ago
Duplicate of this bug: 388391
So... I filed bug 388391 based on shaver telling Beltzner to do so.  Was there some miscommunication here somewhere?  ;)
We need this tool, this type of question comes up at least once or twice every security update as we try to guess at the real-world impact of various regressions or intentional behavior changes introduced by security restrictions.

Please do not WONTFIX based on the plans or workload of the AMO team. Although that seems like a natural fit since AMO knows when each extension is updated, it can certainly be built elsewhere nearly as easily (as noted in comment 2).

Status: RESOLVED → REOPENED
Component: Administration → Miscellaneous
Product: addons.mozilla.org → mozilla.org
Resolution: WONTFIX → ---
Whiteboard: [sg:want P5] → [sg:want P2]
Version: 2.0 → other
It wasn't WONTFIXed based on the workload of the AMO team, it was WONTFIXed because coupling it to AMO adds risk that is very undesirable.

Bug 388391 was to be a bug for running the grep, I believe, at least as I intended to instruct beltzner.  (I thought it would be an IT bug, in fact, but that's cool.)  It was WONTFIX-dup'd as a request for a new tool to be added to AMO.

Updated

10 years ago
Duplicate of this bug: 402566
As per recent discussions on release-drivers@, drivers want a high-powered VM or real machine running a mxr-like instance that processes all add-ons. Site should be behind LDAP authentication.
Assignee: nobody → server-ops
Severity: enhancement → normal
Status: REOPENED → NEW
Component: Miscellaneous → Server Operations
QA Contact: administration → mrz
public would be really nice, but we can figure that out later -- please let's just get this working!
Alias: amo-mxr
From internal discussions, preferable to use the same hardware that mxr is already using (or upgrade that host to be able to support both) instead of making a new host.

justdave - can you comment on which path is best and perhaps what additional resources the current host would need to support this?
Assignee: server-ops → justdave
The CPU is almost maxed out on that box right now, it'll definitely need about twice the CPU horsepower that it has now.  It's a DL360 G4.  64-bit may prove efficient since it's a lot of data manipulation, which tends to benefit from wider data paths.

Currently has about 130 GB hard drive, will need about twice that, too.

Comment 14

9 years ago
Created attachment 340775 [details]
Repo Creation Tool

I actually wrote up a little proposal for how the public version of this might work, see https://wiki.mozilla.org/User:Bhashem/ViewAddonSources. Also, created bug 457474 to track progress against the public version of this tool.


I'd like to make a couple of notes for whomever plans to do this since I've looked at this already. If you are starting with the full archive of the add-ons such as the one off releases (e.g. http://releases.mozilla.org/pub/mozilla.org/addons/), then this repo includes ALL versions of every add-on. This includes disabled, old, patched, sanboxed experiments along with the "latest".

For a typical grep, you want to just include the "latest" version of an add-on since we can assume that the majority of users have upgraded to the latest (not always true but generally an OK assumption). In order to determine which one is the latest version of an add-on, you can:

1) Attempt to parse the file name - not reliable at all since version numbers are arbitrary and assigned by the author.

2) Query the latest AMO production database (perhaps using the AMO API - but that only covers public add-ons). E.g. using the releases repo and an out-of-date copy of the AMO DB (such as the one on machine khan) will result in xpi files that do not include the latest info. Another approach is to offer a specialize query API to the production DB.

3) Perform a two-step process, export a list of the most current version of an add-on from the production db that is then used to select items from the repo or create a new repo that just has the files of interest.

4) Attempt to use the archive as the only source. Generally works by opening up each directory and xpi file and attempt to select the most up to date XPI based on the individual file dates. (You cannot rely on .XPI file dates since those are based on the date copied to the repo).

Once the "filtered" repo is created, you need to "explode" the files from .xpi or .jar -> directory structure. In many cases, those xpi's contain .jar's themselves, so you'll want to explode those recursively.

Finally, you have an exploded repo that you can now do a deep search for particular string patterns. 

OK, enough rambling - I'm attaching some scripts that I've used to implement Option 4 suggested above. But if I were to do it again, I would use Option 3 despite the two step process.
(In reply to comment #14)
> If you are starting with the full archive of the
> add-ons such as the one off releases (e.g.
> http://releases.mozilla.org/pub/mozilla.org/addons/), then this repo includes
> ALL versions of every add-on. This includes disabled, old, patched, sanboxed
> experiments along with the "latest".

That's a problem and needs to be fixed.  We've had mirrors get DMCA takedown notices more than once for addons that were copyright violations and had already been removed from the addons site for that reason, and never disappeared out of the addons directory.  They need to go away off the download server if they get disabled on the site.
Whiteboard: [sg:want P2] → [sg:want P2] waiting on hardware (comment 13)

Updated

9 years ago
Assignee: justdave → mrz
(In reply to comment #13)
> Currently has about 130 GB hard drive, will need about twice that, too.

So, 260GB?  Going to do this with iscsi.
(In reply to comment #16)
> (In reply to comment #13)
> > Currently has about 130 GB hard drive, will need about twice that, too.
> 
> So, 260GB?  Going to do this with iscsi.

How difficult is it to expand that in the future if needed? Can it be grown?
For justdave, dm-webtools04 up @ 10.2.74.220 waiting for work.
Assignee: mrz → justdave
Assignee: justdave → reed

Comment 19

9 years ago
fwiw, i created http://mxr-test.konigsberg.mozilla.org/mozdev/ for dveditz a couple weeks back.

is it possible to reach khan from konigsberg?
(In reply to comment #19)
> is it possible to reach khan from konigsberg?

No. What do you need from khan?

Updated

9 years ago
Depends on: 458794

Updated

9 years ago
Blocks: 458794
No longer depends on: 458794
Status: NEW → ASSIGNED
Whiteboard: [sg:want P2] waiting on hardware (comment 13) → [sg:want P2] building new box -- in progress
dm-webtools04 has been successfully set up, and mxr has been moved over to it.
Whiteboard: [sg:want P2] building new box -- in progress → [sg:want P2] need mxr code changes
Summary: Ability to view and search source code of all extensions on addons.mozilla.org → Ability to view and search source code of all extensions on addons.mozilla.org (MXR for add-ons)
Whiteboard: [sg:want P2] need mxr code changes → [sg:want P2] need mxr code changes from timeless
Just an update... timeless and I are working together to get this running. Should have it up this week, hopefully.
Moving to projects, gated on timeless (right?)
Component: Server Operations → Server Operations: Projects
This really would've been useful today as we were working on the Firefox 3.0.11 regression found during that cycle and trying to determine how many extensions this affects. I'm not sure where the hold up is now, but we did search using the setup in comment 19, but that's a bit out-of-date now.

Any chance we can get this moved up in priority? We don't *need* it too often, but when we do we really do *need* it. (And I'm sure there's lots of "want" times as well.)
Would have been helpful today for tracking the affect of bug 497098
Assignee: reed → nobody

Comment 26

8 years ago
can we get this moving again?  Moving out of projects...
Assignee: nobody → server-ops
Component: Server Operations: Projects → Server Operations
AMO is planning a taggable search code repository for Q1 as part of the Add-on Developer Hub's second phase. If this is set up by then, we will happily contribute to and/or re-use this bug's creation. Otherwise we will be starting from scratch on this bug's scope + more sometime around then.

Comment 28

8 years ago
seems like we should use similar tools and infra as the rest of the project unless there are some features missing.  any reason why we can't use mxr for this too?
Is this an addons bug? It seems like the addons devs could better integrate a code search with the preexisting developer panels.
@justin We haven't started evaluating tools yet, but we do plan on having more than mxr currently supports, most notably the ability to select certain lines of code and tag them as "opens a new tab" or "first run page" so that developers can see and search real, working code snippets.

I have no idea what mxr is written in or if we could easily build our additional functionality into it. I was just saying that we're planning on something similar for Q1 but haven't made any specific implementation plans yet, so if this bug is finished before then, that would certainly factor in to our plans.

Comment 31

8 years ago
how hard is this really?  If it's something that can be done in a few hours, seems like it would give a lot of people a tool they have been asking for for years.  The AMO part of Developer Hub sounds like a good idea, but an mxr would be a good stop gap in the mean time if it's trivial (or semi-trivial) to setup (in which case it would be an infra bug).
Resource constrained right now - it's on the q4 schedule.
Assignee: server-ops → mrz
Priority: -- → P4

Comment 33

8 years ago
given we are now in q4 - what's the eta?
Will have updated ETA on Monday.
So the main hangup with this is disk space.

There's currently 20 GB of addons stored in AMO's sandbox.  They're in xpi wrappers, which are pkzip-compressed internally.  pkzip has an average compression ratio between 2:1 and 3:1, which means on the upper bound we're looking at 60 GB of disk space just to unpack them all.  MXR's indexing usually eats up at least as much space as the source itself, and often more.  So I'd take a wild guess that we'd probably need about 150 GB of disk space to support the addons that are present currently, and that's not allowing for any growth.

There's currently 160 GB free on dm-webtools04 (the machine that currently hosts MXR).

Do we want each addon indexed individually, or all of them as a whole? (i.e. when you search for a variable name, do you want to search within all addons or only that specific one?).  Indexing them individually seems to make more sense, but greatly adds to the complication of setup because that means we need to automate adding/removing indexing jobs from MXR.

We would also need a script written to unpack all of the addons into a directory tree on a regular basis (probably daily).

dm-webtools04 already has a fairly-constrained CPU, with about 40 GB of total code to index on a multiple-times-daily basis.  It's possible, to avoid killing the machine, that we'd need to set up a separate instance of MXR on a different machine to run the indexing jobs then rsync the results back when it's done (which would also double the amount of required disk space in addition to adding more hardware).

To make a long story short: no, it's not trivial to hook this into MXR.
I see scripts that'll be usable attached to comment 14, so the only major hangup at this point would be CPU power.  Could just try it and see what happens.
> Do we want each addon indexed individually, or all of them as a whole? 

The main use case we have so far, I think, is "find all extensions that use a given interface or code pattern".  That argues for indexing as a whole, right?
I think we only need to consider the files associated with the latest version of an add-on. That should reduce the size considerably, but may require a script to read AMO's db and copy the latest files to another location for MXR.
(Reporter)

Comment 39

8 years ago
If you want to search within a specific extension you can use file-path matching in addition to contents matching.
(In reply to comment #36)
> I see scripts that'll be usable attached to comment 14, so the only major
> hangup at this point would be CPU power.  Could just try it and see what
> happens.

Do you still have sufficient space?
(In reply to comment #40)
> Do you still have sufficient space?

If we do as suggested in comment 38, yes.  That'll require additional code to do the "checkout" part though (which might be best written by someone familiar with the innards of AMO rather than IT).

Updated

8 years ago
Assignee: mrz → justdave
(In reply to comment #38)
> That should reduce the size considerably, but may require a
> script to read AMO's db and copy the latest files to another location for MXR.

Is that a "may" or "will" and is that on you guys?  ETA?
Whiteboard: [sg:want P2] need mxr code changes from timeless → [sg:want P2]
It's a "will". Wil, can we write a script that copies the files of the latest version of all viewsource=1 add-ons from REPO_PATH to a specified directory for source viewing?
Great.  Let me know some sort of ETA.
I don't like the idea of latest version only.  That means our URLs won't be reliably bookmarkable.  If this is just a temporary solution until webdev comes up with something taggable it's not a huge deal though.

Additionally, by only copying the add-ons who wish to share their source (viewsource=1) we're not really fulfilling the goals of comment 24, comment 25, etc. since some (most?) add-ons don't have that flag set.

I've filed bug 522480 for webdev to write a script and we can start on it next week and have it done in a few weeks.  If my above comments change the requirements, let me know asap.
Depends on: 522480
Yes, I think we should do all add-ons regardless of viewsource=1 and make it only available to a subset of people (admins, editors, and some security people if we don't want to just make them editors or admins).
I assumed this was for a publicly accessible site. It's certainly fine to have a second site that is not public and has all files, but add-on developers have explicitly expressed concern about honoring the viewsource setting when we've talked about a public source code search tool before.

As for copying only the latest version, I don't think bookmarkable URLs to a particular add-on is too important for the main use case. And if it is really important, we could just rename the file to something consistent when it's copied.
We need whoever the stakeholder of this bug is to decide which method we're going with so we can work on bug 522480 in our current AMO release.

The options are:
1. publicly-accessible repository with only add-ons that have enabled view source
2. private repository with all add-ons
3. both (may have space issues, judging from comment #35)

If this is mainly for the security team, it seems like it would be option 2 or 3.
(In reply to comment #48)
> We need whoever the stakeholder of this bug is to decide which method we're
> going with so we can work on bug 522480 in our current AMO release.

The current AMO milestone has begun and that bug is unassigned because we don't have a decision here.  If it misses this milestone the next one code freezes on Nov 24.

I talked to fligtar about this and it sounds like it doesn't fulfill the needs of the AMO tool-in-planning so it will only be for whoever needs it now.
I'll make a decision! Branch drivers (and developers working on branch releases) really just need #2. If someone else is interested in #1 or #3, then that's fine, but from a security/release perspective, we need #2.

Comment 51

8 years ago
Post def-con security talk makes this need very clear. jetpack also wants this for API analysis. brendan, chofmann, dveditz agree that this needs to move up in priority. Who can help? AMO team? Server Ops?
Please read the bug comments... this is already being worked on in the current AMO release.
Script is done thanks to davedash.  http://svn.mozilla.org/addons/trunk/bin/latest_addon_extractor.py
Pinging again- Shawn from the platform team would find this hugely helpful for clearing a gecko blocker.

Comment 55

8 years ago
I wish we could stop using terms like "stakeholders."   there are just people that are trying to get their jobs done and are blocked by not having this.

we need this for trying to construct a comprehensive list of jetpack apis that are going to cover the highest pct of use cases.

we need this for quick decisions about what 3.6 end-game changes might potencially break addons, and/or what changes already made in 3.6 might affect which set of addons in the effort to get compatibility levels up for 3.6, like shawn mentioned in the particular bug he is looking at. sam needs this for maintainence releases as well.

we need this for the security checks we need to look at for things asa mentioned in comment 51.

has this been disconnected from the AMO milestone?  can it be, so progress might move faster?

even a one time snap shot of where we are today would be extremely valuable. 

then we could get it updating with the latest data daily.

then we could see if its feasible to provide history.
> has this been disconnected from the AMO milestone?  can it be, so progress
> might move faster?

The AMO team's contributions were finished with comment 53.  This is an IT bug.
What Wil said and this won't be done until after Bugzilla 3.4 is pushed out.  That's justdave's only priority right now.

Comment 58

8 years ago
rough guess on when that will be?

Updated

8 years ago
Blocks: 519438

Comment 59

8 years ago
looks like Bug 519438 -  Frequent crashes in nsAppShell::ProcessNextNativeEvent after mozilla::storage::Connection::Close()  -- a fx 3.6 blocker depends on this bug.
It doesn't depend on this bug -- we manually grep for patterns upon request when people ask us to, usually by filing a bug in addons.mozilla.org::Administration.

Comment 61

8 years ago
I think many would disagree with that if 

a) few or none know about the need to file administration bugs and 

b) its too slow and inefficient to do many searches  

or c) do them interactively when the results from one search inspire other searches that need to be run to drill into a problem, or how widespread the problem might be.
Should be done with Bugzilla 3.4 this weekend and then start on this bug.
That's great to hear. We might want to figure out if we can do this every 2-4 months or so, actually, or if it's just as easy to do one-off requests, create a nice front end that lets us request that and get an email of the results?

Can't stress enough how much this matters as we come up to a release, minor or major.
Do what every 2-4 months? This bug is about setting up a regularly-updated mxr-like service. If the indexing mechanism is a full-AMO sweep then we'd want it updated at least weekly if not daily. If it's triggered by individual addon updates that's cool, but not if it complicates the code such that it delays the availability of a usable mxr-like service.

Actually let me take that in two parts
 1. mxr-like service for today's addons
 2. regularly updated.

I would be overjoyed to get #1 ASAP with a manually triggered one-time index while we work out #2.
I was planning to give it the usual MXR treatment...  if the pull script is sufficient, we'll get it going daily.  I'll be playing with it in the next couple days (probably have an initial deployment tonight, whether it updates depends on how good the pulls script works. :)
ok, the script wants a source repo, but doesn't explain much about it...  I'm assuming that's the "files" directory?
Assignee: justdave → dd
oops, didn't mean to reassign there.
Assignee: dd → justdave
> ok, the script wants a source repo, but doesn't explain much about it...  I'm
assuming that's the "files" directory?(In reply to comment #66)

All directories are "files" directories ;)

But yes, it's the root path of where the xpi file live.

Thanks,

-d
What's the ETA to wrap this up?
justdave,

Let me know if you need anything from me you can ping me on IRC.

Comment 71

8 years ago
can we please get a hard date to expect this done by?
I have no idea how long it's going to take because it involves basically experimenting with it until I find something that works.  That said, I did work on it a bunch last Tuesday, couldn't work on it Wednesday because of too many production issues I had to firefight, and had Thanksgiving weekend off.  Based on what I've figured out so far, I'm thinking the range is "in the next couple days" though, not weeks away or anything.
Blocks: 527843
This is all set up and running now, except that the extraction script isn't unpacking the source.  It's otherwise automated and hooked into MXR now however.

I've reopened bug 522480 to get the script updated, should be a matter of svn up to fix it once that's done.
Oh, it's at http://mxr.mozilla.org/addons/source/ fwiw.
So I assume this is #1 instead of #2?

> 1. publicly-accessible repository with only add-ons that have enabled view
> source
> 2. private repository with all add-ons
This shouldn't be publicly accessible, for the reasons in comment #48 and comment #50 -- the script copies all add-ons regardless of whether they have set to not have their source publicly viewable on the web.
(In reply to comment #76)
> This shouldn't be publicly accessible, for the reasons in comment #48 and
> comment #50 -- the script copies all add-ons regardless of whether they have
> set to not have their source publicly viewable on the web.

OK, MXR is now available over SSL as well, and the /addons directory will force you over to SSL and ask for LDAP authentication.  It's locked down so only folks who can also access the Intranet wiki can access it.  If there's a better group to lock it down to, or if we should create a new one for it, let me know.
I'd say the security group is a better group. Not everyone at Mozilla should have access to this site for the same reasons we don't give everyone admin access to crash-stats.
ok, done.

Comment 80

8 years ago
(In reply to comment #78)
> I'd say the security group is a better group.

Wait, why? What's sensitive in this information? I, for one, want to use this for regular development purposes, and I don't see why source code indexing for addons should be tied solely to security.
Certainly one of the reasons for having this index is to help us make decisions on how API changes and bugs will impact extensions, it would seem to me that only allowing the security group is a little too restrictive here though I'm not sure if there is a better one.
(In reply to comment #81)
> Certainly one of the reasons for having this index is to help us make decisions
> on how API changes and bugs will impact extensions, it would seem to me that
> only allowing the security group is a little too restrictive here though I'm
> not sure if there is a better one.

How about making it tied to mozilla-central commit access? Does that cover most people?
Maybe we need two copies of this, one with the "view-source disabled" extensions not included.  From my understanding, that's the main reason this got secured, because it includes extensions whose authors have asked that their source not be shared.
(In reply to comment #83)
> Maybe we need two copies of this, one with the "view-source disabled"
> extensions not included.  From my understanding, that's the main reason this
> got secured, because it includes extensions whose authors have asked that their
> source not be shared.

I'd rather just have one.  We're really trying to audit how the extensions framework is (mis)used and it has nothing to do with view source... perhaps a new ldap group is warranted for this?
Maybe we should remove the 'view source' option from AMO altogether since it's only cosmetic and not preventing anyone from viewing the source manually.  Perhaps it's even giving authors false assurance.
I think offering the "publically viewable source" checkbox for addon authors is providing a false sense of security, given that all of these addons are available on our FTP server (and on user's computers once installed)... Would it be contentious to remove the option and just make all addons' source publically viewable (on AMO and this MXR)?

I understand that this may make it easier for people to inadvertently violate the license terms of non-open-sourced addons (we allow those, right?), but I'm not sure that's actually a problem that we need to worry about...

Comment 87

8 years ago
Any decisions about access should be reached after a review of any agreements that we have with addon developers.   The "committer" and "security-group" both could be viewed as having "public" and "private" components.  Thats how the mozilla project rolls...

This agreement on the addon site
https://addons.mozilla.org/en-US/developers/docs/policies/agreement

with this clause...

"...you grant to Mozilla and its Affiliates a non-exclusive, worldwide, royalty-free, sublicensable license to distribute, transmit, reproduce, publish, publicly and privately perform and display and otherwise use the AMO Contribution, "

would seem to indicate that we have the ability to display or use the contributions.

Are there other agreements that come into play?

When I heard that we needed to protect some addons, I was thinking about obfuscated .js and source for binaries that comes out of band to jorge, and editors.    None of that is in this set up, correct?

We need to examine exactly what it is that we are protecting, and need to protect.

If these are the same .xpi files that we distribute of amo then just taking those and unzipping them and putting the same source that could be view by any users under mxr seems like we have not really exposed anything.
(In reply to comment #87)
> When I heard that we needed to protect some addons, I was thinking about
> obfuscated .js and source for binaries that comes out of band to jorge, and
> editors.    None of that is in this set up, correct?

Correct.

> If these are the same .xpi files that we distribute of amo then just taking
> those and unzipping them and putting the same source that could be view by any
> users under mxr seems like we have not really exposed anything.

They are the same files.

Comment 89

8 years ago
so, if necessary, it would be possible for the mxr search/ident/find and source scripts to be modified so that they read a file to decide whether certain paths are avaiable for an http://mxr.mozilla.org/addons/ view vs. an https://mxr.mozilla.org/addons/ view.

Roughly a file in the mxr index directory could contain a list of paths which are private and when search/ident/find prepare to display a path, they could check that list and if the url is http: instead of https:, they could bump a counter and skip displaying. At the bottom of the page, they could say:

X files skipped, they a <a href="https://$thispath">available to a restricted group</a>

For /source/ the rule would basically generate an error page which similar content.

-----
Personally, I think that restricting mxr is silly and that telling developers that by publishing their applications they can keep their source confidential is disingenuous. iirc when I setup an mxr for addons on konigsberg, i didn't didn't use private information, it was just an exercise in screen scraping, the files are as noted in comment 88 just files, and it seems that our licensing terms according to comment 87 don't preclude us from doing this.

that said, i understand there are political concerns to enabling for everyone to see the source because some vendors seem to want their code to be treated as a locked box.
This bug is about giving our application developers and the security team the tools they need to search the source of the add-ons we have available. If someone wants to file a separate bug in AMO about changing the way we handle the view source options in general, please do so so that discussion can be had there. Changes to that aren't required to meet the needs of this bug.

For this bug, I'd like anyone with mozilla-central commit access or within MoCo to be able to access the site. It should not be open to the whole world or crawlers, but anyone from the community who commits code should be able to use this tool.

As for the extraction piece, I don't think we realized the script needed to extract the add-ons too. My bash skills aren't good, but this is what I'd been using to extract add-ons copied from the script for manual greps on khan: http://fligtar.pastebin.mozilla.org/688344 Can we just use something like that, or do we really need AMO copy script updated?
fligtar,

I almost have a patch ready for the extraction script.  I'll update the bug and :justdave when it's finished.

Comment 92

8 years ago
what is the latest status on this. 

looks like things are being unziped, but search for something simple like Adblock
with https://mxr.mozilla.org/addons/search?string=Adblock  doesn't turn any results.
There's been a few hangups on my end regarding the unzipping, I'll take care of it hopefully by tomorrow.
I just manually indexed the little bit of content that did successfully unpack before the unzipper crashed, so you've got partial stuff anyway.

Comment 95

8 years ago
any updates on this?  looks like another run was attempted a few days ago, but no progress made since dec 7.

also, is there a way to drop a status file that indicates the intention of the addon author to show the source publicly?   I notice that for some addon's the license info is dropped in the root directory and that will help anyone looking at the source to make decisions about disclosing/not disclosing information found in the search in bugs or other places that the research made lead to postings.  

https://mxr.mozilla.org/addons/source/12/license.txt is an example.

having the "public source status" will add to this and also help in the cases where there is no license.

Comment 96

8 years ago
any update on this?   does the fact that bug 522480 has been fixed mean an update of the script need to be pushed?
Blocks: 528645
(In reply to comment #96)
> any update on this?   does the fact that bug 522480 has been fixed mean an
> update of the script need to be pushed?

That's quite likely...

/me pulls an svn update and gives it a try...
Well, it crashed before finishing, so I reopened bug 522480.

But it got quite a way before it crashed...  it looks like it's unpacking the .xpis.  browsing around a bit I suspect we probably want to also unpack the .jar files that are inside the .xpis, right? (It's not doing that yet)

Comment 99

8 years ago
you do...
> I suspect we probably want to also unpack the .jar files that are inside the
> .xpis, right?

We do, yes. Thought that was explicit in this bug but maybe it was in a mail thread (fligtar pasted the script he used, if that rings any bells).
justdave, please try this again, I rewrote the code so it unzips any jars in the xpi... if we need to do something more let me know, I'd like to have this out of my hands soon, since I'm out of town starting Thu.
started this running about an hour ago... it hasn't errored out yet, which is promising.  I'll post again when it either crashes or finishes.
It's still running, but we have visible results on the website already.

Looks mighty pretty :)

I do notice this though: https://mxr.mozilla.org/addons/source/10009/

Looks like we need to recursively unpack xpi, too.
(and if ddash is already gone I can probably figure that part out, because I imagine it's probably as simple as adding the xpi extension to wherever the existing comparison is being done looking for jar).
Actually, those are in the top-level directory, maybe I should sit and wait a bit.  It may be the script hasn't gotten to them yet after the initial copy.  I should probably wait until it's completely done before deciding if the results are accurate.
added embedded xpis to the list of things to unzip.

svn up
the run from before hasn't completed yet.  no output, wonder what it's doing. :)
Aborted and restarted with the new version.

Comment 108

8 years ago
dvedtiz spotted that we are grabbing personas and that is inflating the top level directory counts.  If there is an easy way to put personas under a different directory tree like https://mxr.mozilla.org/personas or just ignore them for now that will help them from clogging up the addon list and keep them from slowing down searches and navigation to from the top level directory.
(In reply to comment #108)
> dvedtiz spotted that we are grabbing personas and that is inflating the top
> level directory counts.  If there is an easy way to put personas under a
> different directory tree like https://mxr.mozilla.org/personas or just ignore
> them for now that will help them from clogging up the addon list and keep them
> from slowing down searches and navigation to from the top level directory.

I'm not seeing where we are grabbing personas.  I am specifically looking for addons that are "extensions" and not personas/themes/etc.

Is there a URL showing this, and I can drill down from there.
just grabbed the output from the last trial run of the script:

Updating addons...

real    16m38.033s
user    1m29.655s
sys     0m43.592s

real    609m15.564s
user    514m59.929s
sys     0m39.256s

real    134m33.566s
user    3m20.163s
sys     0m36.210s

real    0m1.439s
user    0m0.646s
sys     0m0.084s

The first time counts are from ddash's script unpacking everything, the other three are the three phases of generating indexes.  So it takes about 15 minutes to unpack it all and then 12 hours to index it.

Comment 111

8 years ago
haven't seen any personas yet in poking around.

I think I get https://mxr.mozilla.org/addons/source/15052/ as the highest addon.  Does that sound right?

I might be good to break up into directories of 1000 that would allow navigation.  having 15000 top level directories just draws a blank page for me.
Created attachment 419131 [details]
list o' add-ons


> I think I get https://mxr.mozilla.org/addons/source/15052/ as the highest
> addon.  Does that sound right?

No.  We had ~15000 add-ons or so, and then imported getpersonas which bumped us up to ~45000.  That was a couple months ago, so after that point there are a bunch more add-ons.  I'm attaching the output of "select id,addontype_id from addons where addontype_id in (1,2)".  1=extension, 2=theme.
What's outstanding on this bug?  Are we done yet (can I close it?)?

Comment 114

8 years ago
I think the original request has been met, but we probably need follow up bugs to make search work across so many addons like the suggestion in commment 111, some kind of alpha order break up, breaking up by ranking, or supported version (3.5.* etc.) might be better ways to organize the files so the searches could be more reliable and catch the most important results.

right now any interesting search for things like eval( or XMLHttpRequest to find possible abuse gets "Too many hits, displaying the first 1000"

We also need to figure out a way to get these sources public as suggested in comment 90 and other areas of this bug so we can apply open source development practices that have served us well with the core mozilla code and start to build a larger set of eyeballs on addon source.

If access is restricted to security group, one additional thing we might do in this bug is to get access for all the current AMO editors.  That would be a good way to start to build more eyeballs looking a the source across addons.
(In reply to comment #114)
> I think the original request has been met, but we probably need follow up bugs
> to make search work across so many addons like the suggestion in commment 111

As of now, this has been added to the daily cron job, so it will auto-update daily now.  Prior to tonight I'd been kicking it off manually every so often.  So with that done, we'll consider this resolved and you can file followup bugs for the individual issues that still need dealing with.
Status: ASSIGNED → RESOLVED
Last Resolved: 11 years ago8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.