Update: Easier way to get top X URLs: http://httparchive.org/urls.php, thanks @souders
Update: found and commented out an offensive try{}catch(e){throw e;} in zombie.js (q.js, line 126); now the script doesn't crash as often
Say you want to experiment or do some research with what's out there on the web. You need data, real data from the web's state of the art.
In the past I've scripted IE with the help of HTTPWatch's API and exported data from Fiddler. I've also fetched stuff from HTTPArchive. It doesn't have everything, but you can still manage to refetch the missing pieces.
Today I played with something called Zombie.js and thought I should share.
Task: fetch all HTML for Alexa top 1000 sites
Alexa
Buried in an FAQ is a link to download the top 1 million sites according to Alexa. Fetch, unzip, parse the CSV for the first 1000:
$ curl http://s3.amazonaws.com/alexa-static/top-1m.csv.zip > top1m.zip
$ unzip top1m.zip
$ head -1000 top-1m.csv | awk 'BEGIN { FS = "," } ; { print $2 }' > top1000.txt
Zombie.js
Zombie is a headless browser, written in JS. It's not webkit or any other real engine but for the purposes of fetching stuff is perfectly fine. Is it any better than curl? Yup, it executes javascript for one. Also provides DOM API and sizzling selectors to hunt for stuff on the page.
$ npm install zombie
Taking Zombie for a spin
You can just fiddle with it in the Node console:
$ node
> var Zombie = require("zombie");
undefined
> var browser = new Zombie();
undefined
> browser.visit("http://phpied.com/", function() {console.log(browser.html())})
[object Object]
> <html><head> <meta charset="UTF-8" />....
The html() method returns the generated HTML after JS has had a chance to run, which is pretty cool (there may be problems with document.write though, but, hell, document-stinking-write!?). You can also get a part of the HTML using a CSS selector, e.g. browser.html('head') or browser.html('#sidebar'). More APIs...
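For example, sticking with the same pattern as the console session above (the selectors here are just illustrations):

var Zombie = require("zombie");
var browser = new Zombie();
browser.visit("http://phpied.com/", function () {
  // the whole generated page vs. just a fragment picked by a CSS selector
  console.log(browser.html());           // everything
  console.log(browser.html('head'));     // only the <head>
  console.log(browser.html('#sidebar')); // only the matching element
});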
The script
var Zombie = require("zombie");
var read = require('fs').readFileSync;
var write = require('fs').writeFileSync;

read('top1000.txt', 'utf8').trim().split('\n').forEach(function (url, idx) {
  var browser = new Zombie();
  browser.visit('http://www.' + url, function () {
    if (!browser.success) {
      return;
    }
    write('fetched/' + (idx + 1) + '.html', browser.html());
  });
});
Next challenge: fetch all CSS and strip JS
The API provides browser.resources, which is an array of the JS resources the page loaded and is really useful - HTTP requests/responses/headers and everything.
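For instance, a quick sketch to dump what was loaded (relying on the same request/response shape the big script below uses; exact property names may vary between zombie versions):

var Zombie = require("zombie");
var browser = new Zombie();
browser.visit('http://www.example.com/', function () {
  browser.resources.forEach(function (r, i) {
    // each resource carries its request and, if it arrived, its response
    console.log(i, r.request.url,
      r.response ? r.response.body.length + ' bytes' : 'no response');
  });
});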
Since I'll need all this HTML and CSS for a visual test later on, I don't want the page to change from one run to the next, e.g. by including different ads. So let's strip all JavaScript. While at it, let's also get rid of other potentially changing content: iframes, embeds, objects. Finally, all images should become spacer GIFs, so there will be no surprises there either.
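Distilled from the full script at the end of the post, that cleanup boils down to something like this (assuming browser has already finished loading a page):

// nodes that can make the page differ from one run to the next
var stripem = 'style, iframe, object, embed, link, script';
[].slice.call(browser.querySelectorAll(stripem)).forEach(function (node) {
  if (node.parentNode) {
    node.parentNode.removeChild(node);
  }
});
// neutralize images too
[].slice.call(browser.querySelectorAll('img')).forEach(function (img) {
  img.src = 'spacer.gif';
});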
The bad
After some experimentation, I think I can conclude that Zombie.js is not (yet) suitable for this kind of "in the wild" downloading of sites. It's a jungle out there.
- It could be that my script is not good enough, but I couldn't find a way to catch errors gracefully (wrapping browser.visit() in a try-catch didn't help).
- All JS errors show up; some cause the script to hang.
- The script hangs on 404s.
- Sometimes it just exits.
- Relative URLs (e.g. to fetch a CSS file) don't seem to work.
- I couldn't get the CSS resources included (probably a bug in the version I'm using), so I had to hack the defaults to enable CSS.
- I had to kill the script when it hangs and restart it pretty often; that's why I write a file with the index of the next URL to fetch from the top X list (see the sketch after this list).
- All in all I had something like 25% errors fetching the HTML and CSS.
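The checkpoint trick from that last point is tiny; a minimal sketch (using the same idx.txt file as the full script below):

var fs = require('fs');
// read where we left off (0 on the first run), then immediately record the
// next index so a hang or crash on the current URL doesn't block a restart
var idx = parseInt(fs.readFileSync('idx.txt', 'utf8'), 10) || 0;
fs.writeFileSync('idx.txt', String(idx + 1));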
The good
Loading a page in a fast headless browser with DOM access - priceless!
And it worked! (Well, for 75% of the sites out there.)
I'm sure that in a more controlled environment with error-free JS and no 404s, it will behave way better.
The script
var Zombie = require("zombie");
var read = require('fs').readFileSync;
var write = require('fs').writeFileSync;

var urls = read('top1000.txt', 'utf8').trim().split('\n');

// where are we in the list of URLs
var idx = parseInt(read('idx.txt', 'utf8'), 10);
console.log(idx);

// go do it
download(idx);

function download(idx) {
  // remember the place in the likely scenario that
  // the script crashes
  write('idx.txt', String(idx + 1));

  var url = urls[idx];
  if (!url) {
    // we're done!
    console.log('yo!');
    process.exit();
  }

  var browser = new Zombie();
  browser.visit('http://www.' + url, function () {
    if (!browser.success) {
      return;
    }

    var map = {}; // need to match link hrefs to resource indices
    browser.resources.forEach(function (r, i) {
      map[r.request.url] = i;
    });

    // collect all CSS, external and inline
    var css = [];
    var sss = 'link[rel=stylesheet], style'; // Select me Some Styles
    [].slice.call(browser.querySelectorAll(sss)).forEach(function (e) {
      if (e.nodeName.toLowerCase() === 'style') {
        css.push('/**** inline ****/');
        css.push(e.textContent);
      } else {
        var i = map[e.href];
        if (i !== undefined && browser.resources[i].response) {
          css.push('/**** ' + e.href + ' ****/');
          css.push(browser.resources[i].response.body);
        }
      }
    });

    // remove style and other nodes that may cause the UI to change
    // from one run to the next
    var stripem = 'style, iframe, object, embed, link, script';
    [].slice.call(browser.querySelectorAll(stripem)).forEach(function (node) {
      if (node.parentNode) {
        node.parentNode.removeChild(node);
      }
    });

    // all images become spacer gifs
    [].slice.call(browser.querySelectorAll('img')).forEach(function (node) {
      node.src = 'spacer.gif';
    });

    // placeholder, probably useless
    browser.body.appendChild(
      browser.document.createComment('specialcommenthere'));

    // we got the stuffs!
    var html = browser.html();
    css = css.join('\n');
    if (html && css) { // do we? ... got? ... the stuffs?
      write('fetched/' + (idx + 1) + '.html', html);
      write('fetched/' + (idx + 1) + '.css', css);
      console.log(idx + " [OK]");
    } else {
      console.log(idx + " [SKIP]");
    }

    download(++idx); // next!
  });
}