How to Search: Alpha Search Demo
I’ve been experimenting with full text in JavaScript (more precisely CoffeeScript which I encourage anyone who needs to write a good bit of Javascript to use). I’ve made a small (and I mean very small) demo using Node.js and my searcher at http://search.g23.co/ It searches the my school’s newspaper website, the DeBakey Pulse. The search demonstrates something very small. At the time of this post, the Pulse has 114 articles that are indexed by my searcher and only 338380 characters of text (330kb is VERY SMALL for a search engine). Though this isn’t a naive search and it demonstrates indexing so I have reason to believe it could handle much more data.
A trivial approach to searching through text is to break the search query into words and then for each article, go through every word and look for a word that is in the query. If you find an article with that word, but it in a “found” pile and move on to the next article. Once that is done, the “found” pile needs to be scored based on textual relevance and sorted by score. I find the score in a simple way, computing the ratio of instances of the query word over the total words, i.e. if you were searching for “Google” and a 500 word article mentioned “Google” two times, then score would be 2/500 = 0.004.
This isn’t a problem for a computer if there are only a few hundred articles to search through, but if you have several thousand or million articles / pages, you need a better way to search. Enter indexing.
The way an indexed search works is the “engine” must go through a process of indexing before it can be searched. The engine has a fast key/value store (in memory for my implementation) which basically allows fast (sub millisecond) lookup for data given a certain key. This is usually called a dictionary or hashmap. The gist of how it works is the key has a function applied to it, and then the result is the lookup address of the data. For example, if I request “LOL” the hashmap would MAP “LOL”, the key, to the value in the computer’s memory, “CATZ” for example. This provides a fast way to store data in memory and an indexer uses this to quickly find articles with certain words.
The indexer runs through each article and tokenizes (splits it into words) it. The the indexer puts the id (or url) of the article as the value in the hashmap for the word. For example, document(id=23) has the word “internet” in it. The indexer then puts into the hashmap the key of “internet” and the value of 23 (because the id of the document is 23). Then if the indexer finds another document with the word “internet” in it, it would update the value at the key “internet” (which until now has a value of 23), to something like 23,92.

The indexer adds the id and also the computed score (so that it doesn’t have to compute the score again).
edit: I will add the source code for the JavaScript search to my (new) GitHub (as my first repository!) tomorrow when I have time to clean up the code a bit. Right now all my classes are jumbled into one file but I’d like to put them in separate files and be more organized. (I believe this can be done with Cake files for CoffeeScript)