Lessons learned from 222,557 font file subsets?

October 20th, 2024. Tagged: font-face, performance

Earlier this year I wondered how many KB is "normal" for a web font file size (spoiler: 20-ish KB). I finished that post questioning how much subsetting really helps, meaning: how much do you save by painstakingly choosing which characters stay in the subset, as opposed to just broad strokes (ASCII vs. Latin vs. Cyrillic, etc.)?

So here's a little follow-up study of file sizes where I subset 1009 font files found on Google Fonts' GitHub, one character at a time, ending up with 222,557 WOFF2 files.

Constraints

I had to set some constraints so that the study is not too broad, yet still representative. The previous post has the reasoning in detail, but here are the highlights:

  • Only inspect Google Fonts, as these are easier to download in bulk and quite popular
  • Only look at "regular" fonts (so no italics, no bolds)
  • Look at fonts that have at least 200 Latin characters

So. Now we have 1009 TTF files each containing 200 characters or more. The full list is available with the data later in the post.

Step 1: What's in a font?

We start with a Node script that takes each font and prints a text file with each character supported by the font on a new line. The results look like so:

$ less ZenAntique-Regular.txt
U+20
U+21
U+22
U+23
U+24
U+25
U+26
U+27
U+28
U+29
U+2A
U+2B
U+2C
U+2D
U+2E
U+2F
U+30
U+31
U+32
U+33
U+34
U+35
U+36
....

U+20 is a space, U+31 is the number 1 and so on...

The script uses the fontkit library for font introspection. Prrrrrty cool.
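Here's a rough sketch of the idea (not the exact script), using fontkit's openSync() and characterSet, with a made-up file name:

const fs = require('fs');
const fontkit = require('fontkit');

// open one font; characterSet is an array of all code points the font supports
const font = fontkit.openSync('ZenAntique-Regular.ttf');

// format each code point as U+XX, one per line
const lines = font.characterSet.map(
  (codePoint) => `U+${codePoint.toString(16).toUpperCase()}`
);

fs.writeFileSync('ZenAntique-Regular.txt', lines.join('\n'));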

In this directory you can see and inspect all the 1009 txt files.

Step 2: Subsetting

Using Glyphhanger we can now subset each font adding one character at a time. So the first subset is only a space character. The second subset is space and exclamation. The third subset is space, ! and ". And so on. The last subset contains all the characters supported by the font.

Here's an example of the 3-character subset of "Are you serious" font inspected and visualized in wakamaifondue.com

Scrolling further down we see the characters (space, ! and "):

The same subset font inspected by another wonderful tool, fontdrop.info, shows the glyphs. (Remember: glyph !== character)

Time to write the script to do the work! The full script is available here, but here's the gist: for each font file, read the corresponding txt file (full o' unicode characters) and keep running glyphhanger to subset the font, adding each new character to the new subset.

// dependencies
const fs = require('fs');
const path = require('path');
const { execSync } = require('child_process');

// wherever the TTFs and their .txt files live
const fontDirectory = './fonts';

// read the list of files in the font directory
fs.readdir(fontDirectory, (err, files) => {

  // process each one
  files.forEach((file) => {
    const fontPath = path.join(fontDirectory, file);

    // check if the file is a TTF
    if (file.toLowerCase().endsWith('.ttf')) {
      const fontName = path.basename(file, path.extname(file));
      const txtFilePath = path.join(fontDirectory, `${fontName}.txt`);

      // read the Unicode characters from the corresponding .txt file
      const unicodeCharacters = fs
        .readFileSync(txtFilePath, 'utf-8')
        .split('\n')
        .map((line) => line.trim())
        .filter(Boolean);

      // for each character in the txt, run glyphhanger,
      // growing the subset by one character every time
      const subsetList = [];
      unicodeCharacters.forEach((unicodeCharacter) => {
        subsetList.push(unicodeCharacter);
        const subsetString = subsetList.join(',');

        // glyph!hang!
        const command = `glyphhanger --formats=woff2 --subset="${fontPath}" --whitelist=${subsetString}`;
        execSync(command);
      });
    }
  });
});
Running the script is pretty intensive, by which I mean slow. I had to make sure I could run it in parallel, and at some point I had a bunch of instances going. I'm pretty sure it still took about a day.
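One low-tech way to split the work (a sketch of the idea, not part of the script above) is to give each instance a shard of the font list on the command line:

// hypothetical subset-shard.js: run as
//   node subset-shard.js 0 4 & node subset-shard.js 1 4 & ...
const fs = require('fs');

const [shard, totalShards] = process.argv.slice(2).map(Number);

const allFonts = fs
  .readdirSync('./fonts')
  .filter((file) => file.toLowerCase().endsWith('.ttf'));

// each instance only processes the files whose index matches its shard
const myFonts = allFonts.filter((file, index) => index % totalShards === shard);

console.log(`Instance ${shard} of ${totalShards}: ${myFonts.length} fonts`);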

Step 3: Wrapping up and verifying continuity

One feature request I have for glyphhanger is to let you choose the output file name (or maybe I've missed it). As far as I've tested, it always creates the output font file in the same directory as the source, with a "-subset" suffix. That's why my subsetting script copies the results to the results directory. At this stage the last thing to do is rename the WOFF2s so that the subset with one character lives in 1.woff2, the one with 23 characters in 23.woff2, and so on. And while at it, double-check that all the subsetting was successful and there are no gaps in the sequence, e.g. no missing 24.woff2. (I'm happy to report I found no gaps.)

Here's the full Node script; nothing in it is interesting enough to highlight.
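If you just want the gist of the gap check without reading the script, a minimal sketch (with a made-up directory name) could look like:

const fs = require('fs');
const path = require('path');

// hypothetical results directory for one font
const resultsDir = './results/AreYouSerious-Regular';

// collect the numeric file names, e.g. 23.woff2 -> 23
const numbers = fs
  .readdirSync(resultsDir)
  .filter((file) => file.endsWith('.woff2'))
  .map((file) => parseInt(path.basename(file, '.woff2'), 10))
  .sort((a, b) => a - b);

// the sequence 1..N should have no holes
for (let expected = 1; expected <= numbers.length; expected++) {
  if (numbers[expected - 1] !== expected) {
    console.log(`Gap found: ${expected}.woff2 is missing`);
    break;
  }
}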

At the end, we end up with a directory structure like so:

Step 4: to CSV

Time to get some data into a CSV file. The last script is available here. It just writes out a big ol' CSV file with one row per font and one column per subset.
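The gist is to stat every numbered WOFF2 and write one comma-separated line per font. A sketch, with made-up directory and file names:

const fs = require('fs');
const path = require('path');

// hypothetical layout: ./results/<FontName>/1.woff2, 2.woff2, ...
const resultsDir = './results';
const rows = [];

for (const fontName of fs.readdirSync(resultsDir)) {
  const fontDir = path.join(resultsDir, fontName);
  const sizes = [];

  // keep reading 1.woff2, 2.woff2, ... until the sequence ends
  for (let i = 1; fs.existsSync(path.join(fontDir, `${i}.woff2`)); i++) {
    sizes.push(fs.statSync(path.join(fontDir, `${i}.woff2`)).size);
  }

  rows.push([fontName, ...sizes].join(','));
}

// header row omitted for brevity
fs.writeFileSync('stats.csv', rows.join('\n'));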

The resulting stats CSV (1.3MB) is here, and here's a preview:

Name,1,2,3,4,5,6,7,8,9,10...
AreYouSerious-Regular,532,676,764,964,1424,1680,1964,1968,2060,...
Arizonia-Regular,2352,2360,2528,2624,3040,3356,3648,4100,4124,...
Armata-Regular,1632,1628,1652,1632,1632,1632,1644,1632,1652,1640,...
Arsenal-Regular,2184,2196,2196,2296,2348,2468,2668,2836,3040,3072,...
Artifika-Regular,2292,...
...

Analysis

That's a lot of data points, over 200 thousand. What to do, what to do... Some sort of median trend analysis is what I decided on.

(BTW, this is where I need your help, dear reader. Does looking at this data inspire some more analysis?)

I took the median of each column from 1 character to 200 characters and plotted it. Then I tried a few trendline options. The polynomial trendline seemed to make the most sense. Here's the result:
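If a spreadsheet is not your thing, computing the medians straight from the CSV is a few lines of Node. A sketch, assuming the stats file is named stats.csv and has a header row:

const fs = require('fs');

const median = (values) => {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
};

// skip the header row, drop the font name column, keep the numbers
const fonts = fs
  .readFileSync('stats.csv', 'utf-8')
  .trim()
  .split('\n')
  .slice(1)
  .map((line) => line.split(',').slice(1).map(Number));

// median file size for subsets of 1 to 200 characters
for (let chars = 1; chars <= 200; chars++) {
  const sizes = fonts
    .map((row) => row[chars - 1])
    .filter((size) => !isNaN(size));
  console.log(`${chars},${median(sizes)}`);
}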

This is my .numbers (Apple's Excel) file if that data inspires you.

What do we see? Adding more characters to a font makes the file size increase fairly linearly. And at the 100-130 character mark, the linear increase maaaybe slows down a bit. Maybe. If you squint hard enough.

Wild speculation follows... past the 95 characters of US-ASCII (see the previous post for where the 95 number comes from), some of the additional Latin characters may be easier to draw based on strokes already in the file. E.g. drawing À, Á, Â, Ã, Ä and Å can probably reuse some of the A work. This is just a guess, as I've yet to design a font.

Additional observations

  • (min) Of the 1009 fonts, the one that took the fewest bytes to draw a space used 408 bytes
  • (max) The one that used the most came in at 3.37K for a space character. Wild difference.
  • At the 100 characters point we have min 2.4K, max 327K, median 13K.
  • At the 200 characters point we have min 2.6K, max 333K, median 18K. That median agrees with the previous analysis: if your Latin font "costs" a lot more than 20K, you may want to take another look at it.

How much does one character cost?

This was my original question and I tried a few median/average ways to get to the answer (e.g. the average of the medians from 1 to 200 characters, the median cost of adding the 201st character, and so on). No matter how you slice it, it appears the cost of one character is usually about 0.1K.

As a takeaway for me... I think it's worth subsetting chunks of alphabets, e.g. Latin vs Cyrillic. If your UI is in English, go ahead and subset out all the Cyrillic characters. If there's user-generated content, still do the same, but use unicode-range in CSS to conditionally load the Cyrillic subset if the content requires it.

But if you're pondering whether to keep À but drop Å... just keep it. A 0.1K win is not worth the embarrassment of having an occasional Å look alien (in a fallback font).

In fact, sometimes it may be beneficial to have a font face collection (e.g. regular + bold) in the same file download. Sprite-style. This is not supported by WOFF or WOFF2, but I think it should be. (Maybe a different file type?) Because if you have a median, well-behaved Latin regular font at 20K and its italic at 20K, it may be better to download 40K at once rather than risk a Mitt Romney Font Problem.


Sorry, comments disabled and hidden due to excessive spam.

Meanwhile, hit me up on twitter @stoyanstefanov