Mr. 23

Quack!
Like this?

2 notes

Full Text Search Experiments

I’ve recently been interested in understanding how full text search works (i.e. search engines) so I’ve decided to write my own and will add it to the DeBakey Pulse once it’s working well. I’ve implemented an indexed searcher and a naive search both in javascript (well technically in CoffeeScript but that’s basically JavaScript and it’s faster and easier to write). I wanted to see how much of an improvement indexed searching could do over a simple approach and I really wanted to see how fast a JavaScript search engine could be.

The dataset that I decided to search was the DeBakey Pulse’s articles. It’s the school newspaper’s web site and has 111 articles and 330060 characters of text currently. It’s hosted on App Engine and written in Python. I added a URL handler that would load all of the articles and dump it as JSON data. The data is pulled in using Node.js’ HttpClient and timings are measured once the data is parsed into a Javascript object.

Results: The indexer takes between 97 and 102 milliseconds to index all of the articles. It is then able to search for a two word query and sort for relevance in 1ms. The naive approach, which doesn’t factor in relevance, takes 7ms to perform a two word query.

This is very much just an experiment and I need much more data to better understand the performance. I will write a more in depth article about each of these tomorrow and also (since I finally made a GitHub account) I’ll post the source code (a literate search engine? maybe).

Also the indexed search has a heap size of 10.8MB whereas the naive search has a heap size of 4.2MB. I’d like to see how much data the indexer can index before it’s heap becomes too big.

Filed under CompSci

  1. mr23 posted this