moscardino.net

Pre-building a Lunr Index in Hexo

Or, how to make client side searching faster.

About a year back, I added search to this blog using Lunr.js. It was a small but nice addition, but it had a bit of a drawback. Allow me to explain.

The Old Setup

In order for Lunr to work, you need some data to search. I used a plugin for Hexo to generate a JSON file containing all of my posts. This file was kind of large because it contained everything that Lunr needed to index and everything that I needed to display a search result for each post. Because I wanted to index everything about a post, that meant the entire text of each post was in that JSON file. It was over 300 KB (uncompressed).

Once I had that file, I could load up Lunr in the browser, use fetch to get the JSON file, and then tell Lunr to index it. It would then build up an index in memory for you to search. Building that index was not always slow, but it was a bit intensive (since all the text of all the posts was being indexed).

If you’re following along, you can probably see the two problems:

  1. The JSON file was huge and had to be loaded with every search.
  2. Lunr had to redo all the work of building the index with every search.

At the end of my previous post, I said that pre-building the index would be a lot quicker and better suited to a growing number of posts, but that doing so required hooking into the Hexo build pipeline. At the time, I didn’t know how to do that. Well, I dug in and figured it out.

The New Setup

The new setup fixes both issues in one go. To understand it, we need to understand Hexo generators.

Hexo Generators

Here’s an example of a generator:

sample-generator.js
hexo.extend.generator.register("sample-generator", function (locals) {
  return {
    path: "generated.txt",
    data: "Hello from my generator!",
  };
});

Put that in a folder called scripts and Hexo will automatically pick it up and run it. The return object defines the name of the generated file and what its contents will be. You also get access to locals, which exposes the site’s data, including all the config values and post data.

Technically, any JS file in scripts will be run as part of the build pipeline. Using a generator here gives us two benefits over a plain script. First, we get easy access to locals. Second, writing the file to disk gets abstracted away for us. There are events that you can hook into using a similar syntax which will let you access site data, but then you need to write the file manually.

There’s much more that you can do with generators, and the documentation goes into some detail on some different things you can do. But for my situation, a generator gives me a hook into the Hexo pipeline and lets me create an arbitrary file without much fuss.
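For instance, the documentation describes generators that return an array of route objects to emit several files from one generator. A minimal sketch of that shape (the generator name and file names here are illustrative, not from my actual setup):

```javascript
// Illustrative sketch: a Hexo generator can return an array of route
// objects, each with its own path and data, to emit multiple files.
function multiFileGenerator(locals) {
  return [
    { path: "file-a.txt", data: "Contents of file A" },
    { path: "file-b.txt", data: "Contents of file B" },
  ];
}

// Registered the same way as a single-route generator:
// hexo.extend.generator.register("multi-file", multiFileGenerator);
```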

Building the Lunr Data

Now that I have a hook into the build pipeline, I can pre-build my Lunr index and generate a smaller version of the posts JSON. The search page would then only need to load one file which could contain both.

Part of me wishes there was an option for Lunr to hold onto the documents that it indexes so they can be returned from a search. This is what I’m used to dealing with in Solr and Azure Search, but alas Lunr only returns a reference to the document and leaves it to you to get that data for display.

Here’s the generator I ended up with after a lot of tweaking:

build-lunr-data.js
const lunr = require("lunr");

hexo.extend.generator.register("hexo-lunr-builder", function (locals) {
  // Get all the post data needed for the index and export
  let posts = locals.posts.map((post, i) => ({
    id: i,
    title: post.title,
    description: post.description,
    image: post.image,
    date: post.date.format("YYYY-MM-DD"),
    text: post.text,
    keywords: post.tags.map((tag) => tag.name),
    path: post.path,
  }));

  // Build the index
  let index = lunr(function () {
    let builder = this;

    // Define the index fields
    builder.ref("id");
    builder.field("title");
    builder.field("text");
    builder.field("description");
    builder.field("keywords");

    // Add the posts to the index
    posts.forEach(function (post) {
      builder.add({
        id: post.id,
        title: post.title,
        text: post.text,
        description: post.description,
        keywords: post.keywords,
      });
    });
  });

  // Generate the export data using the index and a smaller version of the posts array
  let exportData = {
    index: index,
    posts: posts.map((post) => ({
      id: post.id,
      title: post.title,
      description: post.description,
      image: post.image,
      date: post.date,
      path: post.path,
    })),
  };

  // Create the data file
  return {
    path: "lunr-data.json",
    data: JSON.stringify(exportData),
  };
});

As you can see, we grab all of the post data we need from locals.posts first. We then build the Lunr index using the npm package. Finally, we put the index together with a smaller version of the posts array and export lunr-data.json. The new file is 52 KB (uncompressed) and doesn’t require nearly as much client-side processing. 🎉
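If you want to check the payload size yourself, measuring the serialized output in Node is a one-liner. A quick sketch (the exportData object below is a tiny stand-in for the real export, not my actual data):

```javascript
// Stand-in for the real export object built by the generator.
const exportData = {
  index: {}, // the serialized Lunr index would go here
  posts: [{ id: 0, title: "Example post", path: "2024/example/" }],
};

// Byte length of the JSON string, as the browser would download it (uncompressed).
const bytes = Buffer.byteLength(JSON.stringify(exportData), "utf8");
console.log(`${(bytes / 1024).toFixed(1)} KB`);
```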

Updating the Search Code

The last part of this update required me to update the JS file I load up in the browser (along with Lunr). Since the index was already built, the old indexing code could be greatly simplified. Here’s the old code:

search-old.js
// Load the posts from the json file
let response = await fetch("/posts.json");
let posts = await response.json();

// Create the lunr index
let index = lunr(function () {
  this.ref("id");
  this.field("title");
  this.field("text");
  this.field("description");
  this.field("keywords");

  posts.forEach(function (post, i) {
    post.id = i;
    post.keywords = post.tags.map((tag) => tag.name);

    this.add(post);
  }, this);
});

And the new code:

search.js
// Load the pre-processed data from the json file
let lunrResponse = await fetch("/lunr-data.json");
let lunrData = await lunrResponse.json();

// Create the lunr index
let index = lunr.Index.load(lunrData.index);

Later in the file where I used to reference posts to render the results, I now reference lunrData.posts and get the same data I need.
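Resolving Lunr’s refs back to post data looks something like this. A minimal sketch, where the results array stands in for what index.search() would return and the posts array stands in for lunrData.posts (ids match array positions, just like in the generator above):

```javascript
// Stand-in for the posts array loaded from lunr-data.json.
const posts = [
  { id: 0, title: "First post", path: "2020/first/" },
  { id: 1, title: "Second post", path: "2021/second/" },
  { id: 2, title: "Third post", path: "2022/third/" },
];

// Stand-in for what index.search(query) returns: refs (as strings) plus scores.
const results = [
  { ref: "2", score: 1.4 },
  { ref: "0", score: 0.9 },
];

// Lunr only hands back refs, so map each one to the full post for rendering.
const matched = results.map((r) => posts[parseInt(r.ref, 10)]);
```

Because id was assigned from the array index at build time, the lookup is a direct array access rather than a search through the posts.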


Overall, the search should be a bit quicker, especially on lower-powered devices, and it loads far less data on every search. Hooking into the Hexo pipeline is possible, but the documentation around it is spotty and hard to follow. Hopefully my experience here can help others. Even if it doesn’t, I made my search better and that’s good enough for me.