Posts tagged compsci
Posts tagged compsci
Yesterday I wrote a simple Sudoku solver. I’m thinking of using it to generate sudokus. I’ve run the program against a “very hard” puzzle at http://www.conceptispuzzles.com/index.aspx?uri=picture/1045 and it solves this in ~0.2 seconds on my Macbook. I think this is probably slow. Anyone else made a backtracking Sudoku solver that’s faster?
The brief idea behind a backtracking sudoku solver isn’t to use logic but rather “take a step” and set a square as a number and keep doing this until the puzzle is no longer valid (i.e. a certain square has no possible values). Then obviously the previous step was wrong and so it should be undone and another number should be chosen instead.
It’s essentially like how one would solve a maze. Walk in, make a choice about which way to go and keep doing that. If you get to a dead end, go back to the last choice you made and make the other choice, if that also leads to another dead end, then go back two choices, etc. until you finally get out of the maze.
As for how sudoku generation will work, it will take a valid grid, mix it up quite a bit, then start removing numbers. It’ll keep removing numbers until the puzzle is no longer solvable, then it’ll stop so that the minimum number of clues are there to solve the puzzle.
So I’ve finished my anagram finder thing and it’s fast. http://anagrams.heroku.com/
It hardly has any design to the page as I’ve not had time to do that. It’s backed with MongoDB and searches documents of the form { word: “LOLCATZ”, lookup: “acllotz” } with an index on “lookup” so it’s fast.
Some issues that I had was the permute function that I wrote had a slight flaw that turned out making it hardly generate any permutations. This took me a while to find because when the search only provided a few results I figure it was just because I wasn’t using a large wordlist. Now the permute function is fixed and many more words are returned.
Any feedback or suggestions would be appreciated.
Inserted 8,087 small documents in 6.1125 seconds from a small Sinatra ruby app.
I decided to get familiar with MongoDB since I plan to use it in a product that I’m working on (was going to go with Redis but realized that’s a hack and while it works I haven’t been able to figure out an easy geospacial thing with Redis). A while ago I made a post about how NoSQL isn’t always the way to go. I was trying to find anagrams of words by computing a score for each word based on the product of prime numbers found by mapping the letters to primes. Then anagrams are just words where the gcd is greater than one. An example might help, ‘abc’ -> 2*3*5 = 30, ‘cab’ -> 5*2*3 = 30, gcd(30,30) > 1; also ‘ba’ -> 3*2 = 6 and gcd(30, 6) > 1. This works well for an in-memory system (it’s what I use in my Scala OMGBot that plays Letterblox) but not so much for a database like MongoDB.
What I’m doing now is making an anagram lookup service with several improvements. Firstly, there won’t be so many random words as now I’m using the text of Sherlock Holmes as my wordlist rather than some arbitrary wordlist that I found on the internet. Also this will use MongoDB hosted at MongoHQ and will hopefully be more efficient and faster.
The plan to find anagrams with Mongo is to store documents in the form { word: “LOLCATZ”, lookup: “acllotz”} with an index on lookup, the alphabetically sorted version of the word. Then, to find an anagram, all permutations of a word will be made into an array and then mongo will be queried via { lookup: {$in: [‘a’, ‘c’, … ‘ac’, … ‘lolcatz’, …]}} This should be much faster than the previous method that made big use of the modulo operator, something that mongo can’t index.
I should be done with this lookup system sometime tonight and I’ll make a post explaining a few things about it.
Recently the senior class voted on prom themes and there was also an additional field requesting people to select their friends. This was going to be used as an attempt to forecast possible winners of prom king and queen. My theory was that by using PageRank, an algorithm that Google uses to rank importance of a website, to determine the “most popular” people and thereby predict who will win prom king and queen.
TL;DR
TL
Of the 209 seniors, 104 people voted on prom themes. There were three themes and each person was allowed to put a number for each theme between 1 and 10. The reason for this, rather than a first choice, second choice, third choice system was because some people feel adamantly about one about and really don’t want the other options, while some could careless. Those who careless generally give the themes about the same score, whereas those that really don’t want one theme can give it a low score relative to others. I had many 1,10,1 kinds of votes, and a number of 6,7,8 kinds of votes. Clearly some people felt stronger about this than others (I should’ve graphed participation between girls and boys, I’ll do that tomorrow). To determine the prom theme, the scores were averaged and the highest average was chosen.
Now while less than half the class voted (almost half), the results are representative of the class. There was a clear theme which had a higher average than the other two, thus a complete class participation wouldn’t likely change the outcome (also those who didn’t vote didn’t care enough about the theme anyway).
At the bottom of the voting form was something for the voter to select all of their friends. The form required at least 3 friends to be selected. AT LEAST, most people only selected 3 friends and then submitted their vote. On a side note this says something about how attentive people read when they are directed from Facebook, or it may be that people didn’t have much time to list their friends (though there was a quick search as you type box).
I ran PageRank on this (initially the R implementation and then I coded my own) friend graph. While I got results of who the “most popular” people were, the results aren’t very conclusive. While the prom voting is, because those are simple stats, doing PageRank on a graph requires most of the graph, ideally the whole graph. With only half of the graph, the results are inconclusive.
Further, PageRank is useful for determining “good quality” web pages because it uses link data, what pages other people link to, to determine good pages. I don’t know if this analogy can apply to people, or at least I’m not sure of how to interpret this. When the “links” are friendship lines, what is being determined by finding the highest ranking people with PageRank? Most friendly, most popular, most influential? I’m not sure and I’d like to know your thoughts on this.
Recently I’ve been writing F# code to process and determine if a tweet is interesting based on previous tweets that I give it. It uses what’s called a Bayesian classifier to determine if a tweet goes into the interesting or boring category. I’ve learned a lot by writing this and it still doesn’t work exactly as I’d like. The problem is finding the right feature set to train the model against. I’m going to use the top 50 words and their distributions from the interesting tweets as features now and see if that improves the model’s ability to predict tweet interestingness.
Recursion is when something is defined by referencing it’s self. Google has an amazing explanation for recursion, just Google search “recursion” (no need to click on any links even! it’s right there in Google).
Tail call recursion is when something references it’s self at the end. It’s hard to explain so I’ll just explain by example. Lack of proper tail call recursion leads to major slow downs as well. I had heard that but never put it to the test so I wrote a couple of programs to show how detrimental improper tail call recursion can be.
The simplest example is the factorial function, it has a nice recursive definition. (Note to self, add MathML magic to this Tumblr, until then take images from Wikipedia).

I wanted to see how bad implementing this naively would be so I wrote two programs in F# to find the factorial of 1000. (F# has a BigInteger class built in so that’s awesome).
#light let rec badFact n = if n <= 1I then 1I else n * (badFact n-1I);; (* bad because no tail call optimize *) let bigI = 1000I;;printfn "%A! has %A digits" bigI (string (badFact bigI)).Length;;
And I wrote another version that can be tail call optimized.
#light let rec goodFact n prod = if n <= 1I then prod else goodFact (n-1I) (prod*n);; let bigI = 1000I;; printfn "%A! has %A digits" bigI (string (goodFact bigI 1I)).Length;;
The difference is subtle but the performance difference is significant. As it turns out, the naive implementation stackoverflows. The good one finishes in less than a second.
I thought this may be because it’s running on .NET which isn’t exactly designed for lots of functional things (it’s still good though, as the proper implementation ran very fast). I decided to write these in Haskell to see Haskell with it’s trampoline-like calling system be able to handle the bad version.
The Haskell version successfully ran the bad version but so quite slowly, as expected. On my Macbook it ran in 0.413 seconds. That may seem fast but the better one only required 0.007 seconds. This may be a reason why people sometimes think Haskell’s performance is unpredictable, improper recursive calls that, if not for Haskell’s beastly function calling mechanism, would cause stack overflows.
Why?
I should probably explain why the “bad version” is bad and why the good version runs circles around it. This has to do with how functions are called. Down at the assembler level there is a “stack” used for function calling. When in one function and calling another, the function “pushes” it’s location onto the stack so that the function that it calls knows where to (jmp) return to. When the program does “n * (badFact (n-1))” it has to call badFact and then return and finally multiply with “n”. In the goodFact it returns the result of goodFact directly, no multiplying by “n” at the end. This allows the compiler to optimize and turn this into a loop, i.e. no need for the calling stack. This is why the good version doesn’t stackoverflow and the bad one does. (Also Haskell’s calling system might not overflow as fast but it certainly adds overhead to the calculation, making the bad version very slow).
OMGBot wins Letterblox
Source code at https://github.com/g23/OMGBot
I think this is largely the reason Haskell is hardly used and other languages like OCaml and Lisp are more widespread.
Functional programming languages aren’t very widespread, but of them, Haskell is hardly used outside of academia at all. I was thinking today why it’s not and why Lisp and ML based languages are (to some degree). Last night I was reading about a couple of Financial firms that use OCaml (an ML variant) and I realized why they might prefer that over Haskell, Haskell has lazy evaluation whereas OCaml has eager evaluation.
I know Haskell to some degree and just starting to learn F# (Microsoft’s version of OCaml, largely the same syntax plus benefits of .NET). Haskell is really eloquent but some things are just hard to do with it. Haskell is good at doing math stuff but I wouldn’t consider making a bot or webservice with it. I think the reason it becomes hard is because of it’s laziness.
Laziness is Haskell’s two sided sword, it’s a very powerful language feature (i.e. you can represent infinite sequences and lists without bounds) but it also makes certain things convoluted and hard to do. ML based languages seem mostly “pure” (purely functional) but they lose their “purity” to some extent to make things easier. I really need more experience to make a fair opinion, but I don’t think making infinite sequences happens too frequently (except maybe in math software).
I’ve done a small “survey” of publicly accessible tweets recently. I used Twitter’s Streaming API to download a small stream of tweets (about 1-2% of all tweets during that time) for a little over 3 hours the other day. The result was roughly 285 MB of JSON data, each tweet object is delimited by a newline.
I wrote a small Python program to process these tweets and give some stats about them. The program runs in about 15 seconds on my 2008 Macbook but I haven’t determined if the slowness is due to reading the file or processing the data, my guess is reading the file but I’ll need to time that later.
The program reads the file line by line and tries to parse the JSON encoded tweet. Then it determines if the tweet has Geo data, if the tweet is a 140 char tweet, and it also creates a “word” distribution of the tweet. I say “word” but really it just splits the text for every space, so that means words include links, hashtags, usernames, and punctuation i.e. “this.” is different than “this”.
The program was meant to just give a simple look on the data and here are the results:
Tweets: 139591 GEO: 996 Characters: 9228541 Average Length: 66.1112894098 140s: 4290 140%: 3.07326403565 Unique 'words': 338538 Total hapaxes: 278362
I’ve been experimenting with full text in JavaScript (more precisely CoffeeScript which I encourage anyone who needs to write a good bit of Javascript to use). I’ve made a small (and I mean very small) demo using Node.js and my searcher at http://search.g23.co/ It searches the my school’s newspaper website, the DeBakey Pulse. The search demonstrates something very small. At the time of this post, the Pulse has 114 articles that are indexed by my searcher and only 338380 characters of text (330kb is VERY SMALL for a search engine). Though this isn’t a naive search and it demonstrates indexing so I have reason to believe it could handle much more data.
A trivial approach to searching through text is to break the search query into words and then for each article, go through every word and look for a word that is in the query. If you find an article with that word, but it in a “found” pile and move on to the next article. Once that is done, the “found” pile needs to be scored based on textual relevance and sorted by score. I find the score in a simple way, computing the ratio of instances of the query word over the total words, i.e. if you were searching for “Google” and a 500 word article mentioned “Google” two times, then score would be 2/500 = 0.004.
This isn’t a problem for a computer if there are only a few hundred articles to search through, but if you have several thousand or million articles / pages, you need a better way to search. Enter indexing.
The way an indexed search works is the “engine” must go through a process of indexing before it can be searched. The engine has a fast key/value store (in memory for my implementation) which basically allows fast (sub millisecond) lookup for data given a certain key. This is usually called a dictionary or hashmap. The gist of how it works is the key has a function applied to it, and then the result is the lookup address of the data. For example, if I request “LOL” the hashmap would MAP “LOL”, the key, to the value in the computer’s memory, “CATZ” for example. This provides a fast way to store data in memory and an indexer uses this to quickly find articles with certain words.
The indexer runs through each article and tokenizes (splits it into words) it. The the indexer puts the id (or url) of the article as the value in the hashmap for the word. For example, document(id=23) has the word “internet” in it. The indexer then puts into the hashmap the key of “internet” and the value of 23 (because the id of the document is 23). Then if the indexer finds another document with the word “internet” in it, it would update the value at the key “internet” (which until now has a value of 23), to something like 23,92.

The indexer adds the id and also the computed score (so that it doesn’t have to compute the score again).
edit: I will add the source code for the JavaScript search to my (new) GitHub (as my first repository!) tomorrow when I have time to clean up the code a bit. Right now all my classes are jumbled into one file but I’d like to put them in separate files and be more organized. (I believe this can be done with Cake files for CoffeeScript)