Twitter stage has old data / broken links

Status

VERIFIED FIXED
Reported: 8 years ago
Last modified: 6 years ago

People

(Reporter: rbillings, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

8 years ago
Clicking a tweet's date link leads to a "page not found" error on http://twittercollage.allizom.org/en-US. All data is from March 12th or earlier; nothing from the past two days. Works fine on http://dev.twitterparty.quodis.com/en-US/.

1) go to twittercollage.allizom.org
2) click date link on a tweet: http://twitter.com/#!/alexdelpippo84/status/46617932222050300

Comment 1

8 years ago
Noah, could you check on the cron jobs and post any logs that might help diagnose? Thanks.

Comment 2

8 years ago
We believe these symptoms over at http://twittercollage.allizom.org/en-US are due to data having been collected and processed by evolving code.

This system has been running without a full reset for a couple of weeks now, and it has stored and processed data and images all along the code changes. The stored data and DB should now be a nice collection of artifacts from all the bugs we introduced, dug into, and fixed in the meantime.

It is advisable to:
- stop all running jobs
- recreate the tables (run \. schema/tables.sql)
- delete all data/processed/* data/original/* files
- restart memcache server
- clear log /var/log/twitterparty/*
- restart all jobs
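The reset procedure above could be scripted roughly as follows. This is a sketch, not the actual procedure used: the job-control commands, DB name, and restart method are assumptions pieced together from this thread, and the DRY_RUN guard (default on) is added so the script only prints what it would do.

```shell
#!/bin/sh
# Hypothetical reset script for the twitterparty staging backend.
# All service/job names are assumptions from this bug; review before use.
DRY_RUN=${DRY_RUN:-1}   # default: preview only, run nothing

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

run pkill -f 'image-process\|mosaic-build'            # stop all running jobs
run sh -c 'mysql twitterparty < schema/tables.sql'    # recreate the tables
run sh -c 'rm -rf data/processed/* data/original/*'   # delete data files
run service memcached restart                         # restart memcache server
run sh -c 'rm -f /var/log/twitterparty/*'             # clear logs
run sh -c 'nohup php image-process.php & nohup php mosaic-build.php &'  # restart jobs
```

With `DRY_RUN=0` the commands actually execute; run as root on the staging host only after verifying each line matches how the jobs are really launched there.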

Comment 3

8 years ago
Cleaning out the old data is probably a good idea, and I'll get to that in a few minutes.

At the moment, I'm not seeing anything in the jobs output that indicates any obvious problems. However, image-process is looking somewhat suspicious to me. It's taking a long time to run, and isn't outputting any log info while running. It's creating some really large sparse temp files in /tmp, e.g.
9.7G -rw------- 1 root root  40G Mar 14 15:23 /tmp/magick-XXHpw8ce

I haven't looked at the code at all, but I wonder if this is somehow related to the growing collections of files in the data directory that I asked about in bug 636520. I suspect that these files are filling up the disk. The filesystem is not particularly big.

Comment 4

8 years ago
I've confirmed that the disk is filling up:

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       28G   27G     0 100% /

This is likely the problem. This app shouldn't be creating such large files:
 12G -rw------- 1 root root  40G Mar 14 15:26 /tmp/magick-XXHpw8ce

Comment 5

8 years ago
Even after deleting all the data/processed and data/original content, recreating the 'tweets' db table, and restarting memcached, I'm still seeing similar problems. image-process runs until it fills up the disk with a giant file in /tmp:

[root@mrapp-stage04 twitterparty]# ls -lsht /tmp/magick-XX*
12G -rw------- 1 root root 40G Mar 14 16:10 /tmp/magick-XXwrROMR
68K -rw------- 1 root root 63K Mar 14 15:54 /tmp/magick-XXCqFof4
[root@mrapp-stage04 twitterparty]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       28G   27G     0 100% /

We could try adding more space to /tmp, but really, how much will it need? 12G seems excessive already.

Comment 6

8 years ago
(In reply to comment #5)
> We could try adding more space to /tmp, but really, how much will it need?  12G
> seems excessive already.

Andre, can you reduce how much space is being used on /tmp? And/or add garbage collection?

Comment 7

8 years ago
I'm unaware of the tmp strategies used by Imagick, but I suspect something's not going quite right.

On our dev server we find only 10 small files:

ll /tmp/magick-*
-rw------- 1 root root 555 2011-03-01 09:13 /tmp/magick-XX9TomdD
-rw------- 1 root root 684 2011-02-26 06:55 /tmp/magick-XXavfNej
-rw------- 1 root root 684 2011-02-26 18:23 /tmp/magick-XXBbxS0q
-rw------- 1 root root 684 2011-02-26 06:53 /tmp/magick-XXDEu9BW
-rw------- 1 root root 684 2011-02-26 18:22 /tmp/magick-XXFCKEwU
-rw------- 1 root root 684 2011-02-26 18:23 /tmp/magick-XXFdj1ob
-rw------- 1 root root 684 2011-03-03 17:08 /tmp/magick-XXNY2oFI
-rw------- 1 root root 684 2011-02-27 19:09 /tmp/magick-XXsf14BW
-rw------- 1 root root 555 2011-03-01 09:13 /tmp/magick-XXSLDwmR
-rw------- 1 root root 554 2011-03-08 18:14 /tmp/magick-XXxNrInj

As long as we only have one process processing images (that would be image-process.php), we could safely remove all /tmp/magick-* files from within that process after every completed iteration.

You'll have to make sure this process runs as root.

What do you think?
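A minimal sketch of the cleanup step suggested here, written as a shell helper rather than inside the PHP job itself (an assumption for illustration). As noted above, this presumes image-process.php is the sole writer of these files and runs as root, since the files are root-owned. (The next comment points out this wouldn't address the root cause.)

```shell
# Sketch of the proposed per-iteration cleanup: remove ImageMagick's
# leftover temp files once an image-process iteration completes.
# Takes the tmp directory as an argument so it can be exercised
# safely outside the real /tmp.
cleanup_magick_tmp() {
  dir=${1:-/tmp}
  rm -f "$dir"/magick-*   # -f: no error if nothing matches
}
```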

Comment 8

8 years ago
No, the issue is that ImageMagick is creating a single really huge file with each iteration. At the end of the image-process run, the temp file is deleted.

Comment 9

8 years ago
I'm not sure if this is at all related to the image-process issue, but I'm seeing the following from the mosaic-build script:

2011-03-14 17:07:40 | 1300147660 > mosaic-build.php > some slots empty: 1765
2011-03-14 17:07:40 | 1300147660 > mosaic-build.php > still some slots empty:1765 (give up)
2011-03-14 17:07:40 | 1300147660 > mosaic-build.php > SKIP! ...
2011-03-14 17:07:40 | 1300147660 > mosaic-build.php > OK! ... took 0 seconds
2011-03-14 17:07:40 | 1300147660 > mosaic-build.php > ... sleeping for 5 seconds ...

It seems to run like that for a long time without changing its output.  Those messages suggest to me that it's not actually doing anything.  Am I reading that right?

Comment 10

8 years ago
Yes you're reading that right!

It seems like only a few tiles have been processed... 1765 are still missing before we can complete a full logo and burn the mosaic.json + mosaic.jpg.

So, the image-process.php process is simply not working. 

Let's try to troubleshoot this over messenger.

Comment 11

8 years ago
Debug (thanks Noah!) at http://twittercollage.allizom.org/en-US revealed:

The script is writing an infinite file to /tmp while trying to read a particular image file into an Imagick object.

The cause is an ImageMagick security issue:

«The XWD Decoder in ImageMagick before 6.2.2.3, and GraphicsMagick before 1.1.6-r1, allows remote attackers to cause a denial of service (infinite loop) via an image with a zero color mask.»
http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2005-1739

The chances of coding around the issue to prevent it are really small. We would need to implement some alternative image decoding in order to check the "zero color mask" property.

Suggested solution: upgrade ImageMagick or run jobs in a dedicated VM.

VM Requirements:
- 256MB memory
- good network to memcache+mysql+internet
- sync'ed io to network storage (system logic will break with async io)

Comment 12

8 years ago
Staging at http://twittercollage.allizom.org/en-US is now running a debug branch. 

We have removed excess verbosity and hard-coded a simple protection against that particular file attack (yes, a filesize check, you guessed it).
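A filesize guard of this kind could look roughly like the sketch below. This is an illustration only: the actual check lives in the PHP debug branch, and the 1 MB cap, function name, and shell form are all assumptions, not the values used there.

```shell
# Hypothetical filesize guard: refuse to hand oversized files to
# ImageMagick. The 1 MB threshold is an assumed example value.
MAX_AVATAR_BYTES=1048576

safe_to_process() {
  size=$(wc -c < "$1")              # file size in bytes
  [ "$size" -le "$MAX_AVATAR_BYTES" ]   # true (0) if within the cap
}
```

The image job would then call something like `safe_to_process "$file" && process "$file"`, silently skipping anything over the cap.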

Until we sort out a way around the ImageMagick bug, please pull these small changes and resume the jobs in staging.

We need the system rolling so you guys can go on testing the frontend, localization, etc.

Thanks

Comment 13

8 years ago
OK, this issue appears resolved from our perspective. The content is fully up to date and image-process is running fine.

Andre, do you plan on merging your changes into master, or do you plan on implementing a different workaround for this imagemagick bug?

Comment 14

8 years ago
My changes were only meant to keep the system rolling past that particular image, so the plan is to totally discard them.

I don't believe we can seriously code our way out of the ImageMagick bug, at least within reasonable time.

The right solution would be to use an updated version of ImageMagick.

Comment 15

8 years ago
Andre, it looks like image-process hit the ImageMagick bug again somehow, despite your workaround. We were alerted that the staging host had filled up its disk. Investigation again revealed the huge temp files. It appears that image-process might have been looking at the following image when it failed:
http://people.mozilla.com/~nmeyerhans/9acbe5595be2f751208d471cdb.gif
(Reporter)

Comment 16

8 years ago
Not sure what the status of stage is at the moment, but I can confirm that tweets submitted to Twitter, and found in search, display almost immediately on the quodis site but are not displaying on http://twittercollage.allizom.org/en-US, even after 20+ minutes.

Comment 17

8 years ago
This is expected. Following comment 15, I shut down the backend jobs. Due to the ImageMagick bugs referenced previously, the entire deployment process is being re-worked. We cannot re-enable updates at this time. Best case, we have this site re-staged this evening.
Depends on: 636520
(Reporter)

Comment 18

8 years ago
Verified that stage on allizom is updating quickly with new data.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
(Reporter)

Updated

8 years ago
Status: RESOLVED → VERIFIED
(Assignee)

Updated

7 years ago
Component: www.mozilla.org/firefox → www.mozilla.org
Product: Websites → Websites
(Assignee)

Updated

6 years ago
Component: www.mozilla.org → General
Product: Websites → www.mozilla.org