
Jacky Tian

I sling code and barbells with similar ease

A Smarter Scrabble AI, Part 1: Lexical Representation

For a while now I’ve been very interested in the strategy behind Scrabble. While I’m not a very good Scrabble player at all, I enjoy watching skilled players. Apart from a broader knowledge of the words in the lexicon, advanced players often make plays with multiple crosses and know exactly where to place words to deny their opponents the chance to grab multipliers. After playing with one of these very good players a few months ago and working on a Scrabble game with a static AI (choosing the highest-scoring move every turn), I started thinking to myself: how difficult would it be to create a “smart” (lookahead) Scrabble AI?

Because this is going to be quite a lengthy project, I’ve decided to write about the process in multiple blog posts. I’ll be posting snippets of code in every post, and you can keep up with my progress or even contribute at the GitHub page for the project. For this post, I’ll be starting with lexical representation - the problem of representing the Scrabble lexicon in a data structure that makes move generation easier than iterating through a raw dictionary.

Common methods of representing a dictionary include Tries and DAWGs. For my static Scrabble AI, I chose to represent the lexicon with a Trie, which was fast enough for static move generation; but once blanks were added to the letter set, it was nowhere near fast enough to run more than a handful of simulations ahead in a reasonable amount of time.
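For reference, a Trie is just a tree of prefixes with a flag marking where complete words end. Here's a minimal sketch (not the implementation from my static AI, just the idea):

```python
class TrieNode:
    """A minimal Trie: one node per prefix, with a flag marking complete words."""
    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            # walk down, creating nodes for unseen prefixes as we go
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word
```

The catch is that every word gets its own full path, so common suffixes like "-ING" and "-S" are stored over and over.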

A DAWG was my next idea for lexical representation. DAWGs (Directed Acyclic Word Graphs) are essentially Tries with equivalent suffixes condensed into a single path through the automaton. Here’s an example of a DAWG representation of the lexicon consisting of the words “ear”, “eat”, “cat”, “car”, and their plural forms.
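Since the diagram doesn't translate well to text, here is the same tiny lexicon as a hand-encoded state table (an illustrative sketch, not how the real structure is built). Because "ear", "eat", "car", and "cat" all share the "-r/-t (+s)" suffix structure, the Trie's separate branches collapse into just five states:

```python
# Hand-encoded minimized DAWG for {ear, eat, car, cat} and their plurals.
# Each state is (is_final, {letter: next_state_id}).
DAWG = {
    0: (False, {'e': 1, 'c': 1}),   # 'e' and 'c' lead to identical futures
    1: (False, {'a': 2}),
    2: (False, {'r': 3, 't': 3}),   # '-r' and '-t' share a suffix state
    3: (True,  {'s': 4}),           # ear/eat/car/cat are words
    4: (True,  {}),                 # ears/eats/cars/cats
}

def accepts(word):
    """Check membership by walking the automaton from the start state."""
    state = 0
    for ch in word:
        _, arcs = DAWG[state]
        if ch not in arcs:
            return False
        state = arcs[ch]
    return DAWG[state][0]   # is the state we ended in final?
```

A Trie for these eight words needs 15 nodes; the DAWG accepts exactly the same eight words with 5 states.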

The main advantage of DAWGs over Tries is memory: shared suffixes are stored once instead of once per word. However, they still suffer from non-determinism when finding the set of prefixes for a given suffix, which is the main bottleneck when it comes to move generation speed.

To solve the problem of non-deterministic prefix searches, Steven A. Gordon proposed a structure called a GADDAG in his 1994 paper “A Faster Scrabble Move Generation Algorithm”. The GADDAG is essentially a “two-way” DAWG: for each word xy in the lexicon, it stores a path REV(x)|y for every non-empty prefix x, where ’|’ is a delimiter. Each node in the automaton is a state with a set of transitions and a letter set - the letters which, if encountered next, complete a word. Word construction is achieved by simply traversing the automaton and keeping track of the letters along the path. For example, here is the GADDAG for the word “care”:
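To make the REV(x)|y scheme concrete, here's a throwaway helper (not part of the project) that lists the entries a GADDAG stores for a word. The whole-word reversal carries no delimiter, since there are no letters left after the hook:

```python
def gaddag_entries(word):
    """List every REV(prefix)|suffix entry a GADDAG stores for `word`.

    There is one entry per non-empty prefix; the first reverses the
    whole word and has no suffix, so it needs no delimiter.
    """
    entries = [word[::-1]]                             # REV(word)
    for i in range(len(word) - 1, 0, -1):
        entries.append(word[:i][::-1] + '|' + word[i:])
    return entries

print(gaddag_entries("care"))
# → ['erac', 'rac|e', 'ac|re', 'c|are']
```

The payoff: to find plays through a letter already on the board, the move generator starts at that letter's state and extends leftward through REV(x), then crosses the delimiter and extends rightward through y - no backtracking over candidate prefixes.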

From the above diagram, it’s apparent that there are a lot of paths that can be compressed, particularly the 3 paths that end in ‘E’. In fact, a GADDAG can be partially minimized during construction, because many of the REV(x)|y paths for a word share suffixes. Here is a minimized version of the above GADDAG, for the word “care”:

Gordon’s paper gives pseudocode for minimized GADDAG construction, which is straightforward to translate to Python. First, consider the following representation of GADDAG states:

class GaddagState(object):
    def __init__(self):
        self.arcs = dict()
        self.letter_set = set()

    def add_arc(self, char):
        """
        Add an arc from the current state for char if no such arc exists.
        Returns the state this arc leads to.
        """
        if char not in self.arcs:
            self.arcs[char] = GaddagState()
        return self.arcs[char]

    def add_final_arc(self, c1, c2):
        """
        Add an arc from the current state for c1 if no such arc exists, and
        add c2 to the letter set at the state it leads to. Returns that state.
        """
        next_state = self.add_arc(c1)
        next_state.letter_set.add(c2)
        return next_state

    def force_arc(self, char, forced_state):
        """
        Add an arc from the current state for char to forced_state, raising an
        exception if an arc for char already exists leading to any other state.
        """
        if char in self.arcs and self.arcs[char] is not forced_state:
            raise Exception('Arc for forced character already exists.')
        self.arcs[char] = forced_state

With this representation, minimized GADDAG construction is done as follows:

class Gaddag(object):
    def __init__(self):
        self.root = GaddagState()

    def add_word(self, word):
        """Add a word (at least two letters long) to the GADDAG."""
        n = len(word)
        state = self.root

        # create the path for REV(word), i.e. word[n-1]...word[0]
        for i in range(n - 1, 1, -1):
            state = state.add_arc(word[i])
        state.add_final_arc(word[1], word[0])

        state = self.root

        # create the path for REV(word[:n-1])|word[n-1]
        for i in range(n - 2, -1, -1):
            state = state.add_arc(word[i])
        state = state.add_final_arc('|', word[-1])

        # partially minimize the remaining paths: force each one to share
        # the post-delimiter suffix states of the previously created path
        for m in range(n - 3, -1, -1):
            forced_state = state
            state = self.root
            for i in range(m, -1, -1):
                state = state.add_arc(word[i])
            state = state.add_arc('|')
            state.force_arc(word[m + 1], forced_state)
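As a sanity check, membership of a word can be read off the whole-word-reversal path: follow arcs for the last n-1 letters in reverse, then the first letter must be in the final state's letter set. The `contains` helper below is hypothetical (not part of the project, and like `add_word` it assumes words of at least two letters); condensed copies of the classes above are repeated so the snippet runs on its own:

```python
# Condensed copies of GaddagState / Gaddag from above, so this runs standalone.
class GaddagState(object):
    def __init__(self):
        self.arcs, self.letter_set = {}, set()

    def add_arc(self, char):
        return self.arcs.setdefault(char, GaddagState())

    def add_final_arc(self, c1, c2):
        state = self.add_arc(c1)
        state.letter_set.add(c2)
        return state

    def force_arc(self, char, forced_state):
        if self.arcs.setdefault(char, forced_state) is not forced_state:
            raise Exception('Arc for forced character already exists.')

class Gaddag(object):
    def __init__(self):
        self.root = GaddagState()

    def add_word(self, word):
        n = len(word)
        state = self.root
        for i in range(n - 1, 1, -1):
            state = state.add_arc(word[i])
        state.add_final_arc(word[1], word[0])
        state = self.root
        for i in range(n - 2, -1, -1):
            state = state.add_arc(word[i])
        state = state.add_final_arc('|', word[-1])
        for m in range(n - 3, -1, -1):
            forced_state = state
            state = self.root
            for i in range(m, -1, -1):
                state = state.add_arc(word[i])
            state = state.add_arc('|')
            state.force_arc(word[m + 1], forced_state)

def contains(gaddag, word):
    """Check membership via the REV(word) path (words of 2+ letters)."""
    state = gaddag.root
    for ch in word[:1:-1]:          # word[n-1] ... word[2]
        if ch not in state.arcs:
            return False
        state = state.arcs[ch]
    if word[1] not in state.arcs:
        return False
    return word[0] in state.arcs[word[1]].letter_set
```

Note that force_arc never raises here: words like "car" and "ear" simply reuse the suffix states that "care" already created, which is exactly the sharing the minimization step is after.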

This routine is run for each word in the lexicon, completing the GADDAG. The finished structure is about 500MB in size - significant, but it still fits comfortably into RAM on most modern systems. GADDAG construction with this algorithm takes about 11 seconds on my system (2.9GHz i7), which I think is fine for a task that only needs to run once every time the program starts up. I pickled the structure into a pickles/ directory in the project, but Python’s cPickle module is extremely inefficient at unpickling it, taking 16 seconds and twice the necessary RAM - I’m still looking for a binary serialization package that avoids cPickle’s overhead. For now, I just construct the structure from scratch every time I need it.
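One thing worth trying before reaching for a third-party serializer (a suggestion on my part, not something I've benchmarked on the full GADDAG): cPickle's default protocol 0 is a verbose ASCII format, and passing the highest binary protocol usually cuts both time and file size substantially. The `save_gaddag`/`load_gaddag` helpers below are hypothetical:

```python
try:
    import cPickle as pickle   # Python 2
except ImportError:
    import pickle              # Python 3

def save_gaddag(gaddag, path):
    # Protocol 0 (the Python 2 default) writes verbose ASCII; the highest
    # binary protocol is much faster and produces a smaller file.
    with open(path, 'wb') as f:
        pickle.dump(gaddag, f, pickle.HIGHEST_PROTOCOL)

def load_gaddag(path):
    with open(path, 'rb') as f:
        return pickle.load(f)
```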

That concludes the first section of this project, dealing with the problem of lexical representation. I chose a GADDAG over a DAWG or a Trie because it has some serious speed benefits in move generation (Gordon reports roughly double), and the foundations for both the structure and the move generation algorithm are already laid out. There’s more code for the GADDAG in the project than I’ve posted here (mainly unit tests), so if you want to take a look, check out the GitHub page for the project. The next section will be about completing the back-end API for the game, minus move generation strategy.

May 15, 2013

Copyright © 2014