Continuing from earlier tonight, let's see how you can use the HTTP Archive as a starting point for examining the Internet at large.
Task: figure out what % of the JPEGs out there on the web today are progressive vs baseline. Ann Robson has an article for the perfplanet calendar later tonight with all the juicy details.
Problemo: there's no such information in the HTTP Archive. However, there is a requests table
with a list of URLs, as you can see in the previous post.
Solution: Get a list of 1000 random JPEGs (mimeType='image/jpeg'), download them all, and run ImageMagick's identify
to figure out the percentage.
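Before diving in, a quick illustration of the identify output we'll rely on (some.jpg is just a stand-in for any local JPEG you have lying around): the Interlace line is what we'll grep for later; for a baseline image it reads "None", for a progressive one it reads "JPEG".
$ identify -verbose some.jpg | grep Interlace
  Interlace: None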
How?
You have a copy of the DB as described in the previous post. Now connect to mysql (assuming you have an alias by now):
$ mysql -u root httparchive
Now just for kicks, let's get one jpeg:
mysql> select requestid, url, mimeType from requests
    ->   where mimeType = 'image/jpeg' limit 1;
+-----------+--------------------------------------------+------------+
| requestid | url                                        | mimeType   |
+-----------+--------------------------------------------+------------+
| 404421629 | http://www.studymode.com/education-blog....| image/jpeg |
+-----------+--------------------------------------------+------------+
1 row in set (0.01 sec)
Looks promising.
Now let's fetch 1000 random image URLs and, at the same time, dump them into a file. For convenience, let's make this file a shell script so it's easy to run; the contents will be one curl command per line. Let's use MySQL to do all the string concatenation.
Testing with one image:
mysql> select concat('curl -o ', requestid, '.jpg "', url, '"') from requests
    ->   where mimeType = 'image/jpeg' limit 1;
+-----------------------------------------------------------+
| concat('curl -o ', requestid, '.jpg "', url, '"')          |
+-----------------------------------------------------------+
| curl -o 404421629.jpg "http://www.studymode.com/educ..."   |
+-----------------------------------------------------------+
1 row in set (0.00 sec)
All looks good. I'm using the requestid as file name, so the experiment is always reproducible.
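One possible caveat before running the full export: depending on how your MySQL server is configured, INTO OUTFILE may be limited to a specific directory or disabled entirely. You can check with:
mysql> SHOW VARIABLES LIKE 'secure_file_priv';
If it shows a directory, point the OUTFILE path there instead of /tmp; if it's NULL, file export is off and the server config needs changing first.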
mysql> SELECT concat('curl -o ', requestid, '.jpg "', url, '"')
    ->   INTO OUTFILE '/tmp/jpegs.sh' LINES TERMINATED BY '\n'
    ->   FROM requests
    ->   WHERE mimeType = 'image/jpeg'
    ->   ORDER BY rand()
    ->   LIMIT 1000;
Query OK, 1000 rows affected (2 min 25.04 sec)
Lo and behold, three minutes later, we have generated a shell script in /tmp/jpegs.sh
that looks like:
curl -o 422877532.jpg "http://www.friendster.dk/file/pic/user/SellDiablo_60.jpg"
curl -o 406113210.jpg "http://profile.ak.fbcdn.net/hprofile-ak-ash4/370543_100004326543130_454577697_q.jpg"
curl -o 423577106.jpg "http://www.moreliainvita.com/Banner_index/Cantinelas.jpg"
curl -o 429625174.jpg "http://newnews.ca/apics/92964906IMG_9424--1.jpg"
....
Now, nothing left to do but run this script and download a bunch of images:
$ mkdir /tmp/jpegs
$ cd /tmp/jpegs
$ sh ../jpegs.sh
The curl output flashes by, and some minutes later you have almost 1000 images, mostly NSFW. Not quite 1000 because of timeouts, unreachable hosts, and so on.
$ ls | wc -l
983
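By the way, if too many downloads hang on slow or dead hosts, one optional tweak is to bake curl's --max-time flag into each generated command. INTO OUTFILE won't overwrite an existing file, so write to a new name (and note that ORDER BY rand() will pick a different random sample):
mysql> SELECT concat('curl --max-time 15 -o ', requestid, '.jpg "', url, '"')
    ->   INTO OUTFILE '/tmp/jpegs-timeout.sh' LINES TERMINATED BY '\n'
    ->   FROM requests WHERE mimeType = 'image/jpeg'
    ->   ORDER BY rand() LIMIT 1000;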
Now back to the original task: how many baseline and how many progressive JPEGs:
$ identify -verbose *.jpg | grep "Interlace: None" | wc -l
XXX
$ identify -verbose *.jpg | grep "Interlace: JPEG" | wc -l
YYY
For the actual values of XXX and YYY, check Ann's post later tonight 🙂
It also turns out that 983 - XXX - YYY = 26, because some of the downloaded files were not really images, but 404 pages and other non-image responses.
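If you'd rather get all three buckets (baseline, progressive, broken downloads) counted in one pass, here's a rough shell sketch; it assumes your ImageMagick supports the %[interlace] format escape, and anything identify can't parse lands in the "broken" bucket:
baseline=0; progressive=0; broken=0
for f in *.jpg; do
  # identify prints the interlace scheme; on 404 pages and other
  # non-images it errors out, so the result is empty
  interlace=$(identify -format '%[interlace]' "$f" 2>/dev/null)
  case "$interlace" in
    None) baseline=$((baseline + 1)) ;;
    JPEG) progressive=$((progressive + 1)) ;;
    *)    broken=$((broken + 1)) ;;
  esac
done
echo "baseline: $baseline, progressive: $progressive, broken: $broken"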