Word Squares
From BenningtonWiki
Contents |
[edit] Code
The Ruby and C programs are available at http://svn.bennington.edu/svn/word-squares/. Start with word-squares.rb and word-squares.c.
[edit] Workshop
How I spent my weekend. Irony.
- THEME: Develop algorithm, then code.
- THEME: Work to make the algorithm optimal.
1. Intro, story
- Card: title, 5x5 word square (beach)
2. The problem.
- Card: empty 4x4
3. What is computing?
- Card: Computing, Programming, etc. definitions. Linux kernel, Science visualizations, Kat
3.5. Analysis. Tree.
- Card: 4x4s with AAAs, etc. Fold over with 26^16
- Card: 4x4 arranged as odometer. Ill. of word odometer
- Illustrator-letters.
4. Word approach.
- Card: AAAA, AAAB, AAAC, ... bar. 0.5%.
- Illustrator-letters with branches knocked out.
- Card: 4x4s with words. Fold over with 2691^4
- Illustrator-words.
5. Further culling vertical stems.
- Card: 4x4 with surf. start of surf, zone.
- Illustrator-words with branches knocked out.
- Also rule that words can't be repeated.
6. Implementation. No mention so far of the speed, etc.
- Walk through Ruby code.
- Run Ruby version on 4x4.
7. Recode in C.
10. Summary.
- Card: 5x5, 6x6, wiki URL
The solutions. Naive approach #1: 4x4 square with 1 of 26 letters in each square. Try each possibility. Test for wordiness of all four rows and all four columns. Reject all squares that do not contain eight words.
26^16 = 43608742899428874059776 = 4x10^22 possibilities.
Recognizing that instead of 26^16 it can be 2691^4 possibilities. = 52439047073361 = 5x10^13.
Taking advantage of the fact that although for a sequence of 4 letters there are 456976 possibilities, in reality only 2691 of them are words (0.5%).
New analysis of problem. New tree. Culling most of the branches that aren't words. (No aa.., first is abed, so cross out aba., abb., abc., abd., etc.)
Even further. Cull branches that don't lead to words vertically. Much harder to know the savings prior to running the program.
Ruby checks 1000/sec. Over 3000 years to check all 4x4 squares made up of 4 rows of actual 4-letter words.
Walk-through of algorithm in Ruby. Show words.txt. Show program to compile words.txt.
Recode in C. 1.5 million/sec.
Results:
- 304,740 3x3 squares
- 4,259,437 4x4 squares
- 56,986 5x5 squares at ~75%
- 2 6x6 squares
Postscript: If the Library of Babel restricted itself to English words... average length of English word... Total length of all words (if you filled a book with every word once, how long would the book be)... write code to sum length of all words.
Need: large pad of paper. Easel. Projector for laptop.
New Oxford American Dictionary, 2nd Edition
[edit] Solutions
s u r f 332 o p a l 764 o o z e 119 k n e w 682
s u r f 332 o p a l 764 o o z e 119 k n e e 760
m a r s h 3200 i r a t e 2403 d e v i l 991 s p e l l 2014 t a n t o 327
b a s i l 95 e l u d e 2128 a p p l e 730 c h e e r 2176 h a r r y 2861
t a s t e 756 a s h e n 2645 p a i n t 2424 i n f e r 398 r a t t y 3182
s a l t u s 4249 a n o i n t 2888 c l a m m y 1440 r a d i a l 2664 a g e n d a 6113 l e d g e r 3760
[edit] Emails
From: joe07734@gmail.com
Subject: words_of_length
Date: September 27, 2007 6:42:03 PM EDT
To: jzimba@gmail.com
here's the ruby code to walk through the words file and build a list of x letter words (excluding proper nouns):
def words_of_length(x) a = [] File.read("/usr/share/dict/words").each_line do |word| word.strip! a << word if word.length == x && word[0,1] >= "a" end return a end
From: jzimba@gmail.com
Subject: Re: words_of_length
Date: September 27, 2007 7:56:09 PM EDT
To: joe07734@gmail.com
Ah...a Rosetta Stone. I would so love to learn Ruby. Let me stare.
My thoughts in order:
You're defining a function with one argument. Interesting, you don't need to say what sort of thing x is.
Make an empty list/array/thing.
Hmm...those dots sure do a lot for you, don't they. So, "each_line" modifies "read" in some pre-defined way?
Does the "end" end the "do"? Interesting that you don't need an index for this.
Strip! Strip?
Does it really put the "then" before the "if" ??
Ruby looks awesome.
J
From: joe07734@gmail.com
Subject: Re: words_of_length
Date: September 27, 2007 7:59:28 PM EDT
To: jzimba@gmail.com
i'm not happy at all with the words in /usr/share/dict. can you send me your word list?
From: jzimba@gmail.com
Subject: Re: words_of_length
Date: September 27, 2007 8:18:23 PM EDT
To: joe07734@gmail.com
Here you go.
It seems tough to lay hold on a really good dictionary. My sister has the OED on a disk, but you can't just view it as clear text.
I found this dictionary online, it's well-known among word freaks; for google purposes, I think the filename on the download was fullable.txt.
In the version attached, I have gotten rid of some bogus words - mainly short ones - but it's far from clean.
When I create word games, I usually exclude proper nouns, acronyms, abbreviations, hyphenated words, foreign words, slang, and one-letter words; and I frown on jargon. This dictionary seems to work OK within those parameters. It does contain its share of jargon, but when analyzing any given game, you can weed those out of the results.
The words in the attached file are sorted by word length, alphabetical for each successive word length. (Many of my algorithms can be applied to one slice of the dictionary at a time.)
In case it helps you parse & reorganize, here are the counts of words of each length; e.g. there are zero words of length 1, sixty words of length 2, etc.
{{1, 0}, {2, 60}, {3, 654}, {4, 2679}, {5, 8476}, {6, 15290}, {7, 23208}, {8, 28558}, {9, 25011}, {10, 20404}, {11, 15581}, {12, 11382}, {13, 7835}, {14, 5134}, {15, 3198}, {16, 1938}, {17, 1125}, {18, 594}, {19, 328}, {20, 159}, {21, 62}, {22, 29}, {23, 13}, {24, 9}, {25, 2}, {26, 0}, {27, 2}, {28, 1}}
This list in itself is interesting. For example, word counts peak at eight letters, which is one reason Scrabble "works." You find yourself completely nonplussed when you have to search mentally for an eight-letter word.
Another thing you learn from playing with words all the time is which slices of the dictionary are good for different things. Five-letter words are "weird" - just look at them - whereas eight-letter words tend to be "cheap", lots of words of the form [shorter word]ED or RE[shorter word] or [shorter word]LY. The cheapness gets better, then gets worse again: words of length 12-16 are all about the "ALLY" suffix or the "UNDER" prefix.
Anyway. The license on the download was pretty permissive...although if Universal Press Syndicate wants to license any of my games, I am probably outside the acceptable use. Sue me.
Jason
From: joe07734@gmail.com
Subject: Re: words_of_length
Date: September 27, 2007 8:22:01 PM EDT
To: jzimba@gmail.com
not as critical any more. i wrote a program that compiles the words in the Dictionary application into a text file. i can send it to you if you'd like. 49,885 words. [heck, i'll attach it now.]
fyi the raw xml contents are in /Library/Dictionaries/New Oxford American Dictionary.dict/Contents/. that dir is also fun because it contains the images.
[just got your email, but i'm sending this anyway, and i'll read your email now]
From: joe07734@gmail.com
Subject: Re: words_of_length
Date: September 27, 2007 8:27:56 PM EDT
To: jzimba@gmail.com
thanks! this is awesome. the words.txt i just sent you was generated by this program:
#! /usr/bin/env ruby -w File.read("/Library/Dictionaries/New Oxford American Dictionary.dict/Contents/dict_body").each_line do |line| word_match = line.match(/<o:hw>(.*)<\/o:hw>/) next if !word_match word = word_match[1].gsub(/<.*?>/m, '') next if word.match(/[^a-z]/) print word, "\n" end
the program filters out all words that do not consist solely of the lowercase letters a-z. no two-letter words, nothing hyphenated, etc. i didn't do any manual culling.
i'll give you explanations of the ruby code line by line in a bit. right now i'm almost done with the word square generator.
joe
From: jzimba@gmail.com
Subject: Re: words_of_length
Date: September 27, 2007 8:37:54 PM EDT
To: joe07734@gmail.com
Anxious to understand the word square generator.
Attached are words that exist in the file you sent, but not in the file I sent.
Some are good, like TANGLEFOOT and ZOO (how is ZOO not in my list??). Some are bad, like AHOLEHOLE or BAZILLION
J
From: joe07734@gmail.com
Subject: Re: words_of_length
Date: September 27, 2007 8:38:15 PM EDT
To: jzimba@gmail.com
Just discovered that I failed to cull abbreviations. I added a check for that and it removed 163 words.
#! /usr/bin/env ruby -w File.read("/Library/Dictionaries/New Oxford American Dictionary.dict/Contents/dict_body").each_line do |line| word_match = line.match(/<o:hw>(.*)<\/o:hw>/) next if !word_match word = word_match[1].gsub(/<.*?>/m, '') next if word.match(/[^a-z]/) next if line.match(/<o:ps> abbreviation <\/o:ps>/) print word, "\n" end
From: joe07734@gmail.com
Subject: word squares
Date: September 27, 2007 8:58:47 PM EDT
To: jzimba@gmail.com
abed 0 baba 101 ebon 758 dang 568
is the first one it came up with. It's currently on:
abed 0 bear 162 ease 755 trey 2947
Each number on the right is the index of the word next to it within the list of four letter words. It gives me an idea of the scope of this. I'll send you the code with an explanation next. I just want to watch the squares scroll by for a sec.
Joe
From: jzimba@gmail.com
Subject: Re: word squares
Date: September 27, 2007 9:14:05 PM EDT
To: joe07734@gmail.com
Looks great!
To be a well-posed game from the player's perspective, I require that the horizontal words be distinct from the vertical words. (The first example has all of its words twice - evidence that your code works! - and the second example has "bear" and "ease" twice.)
J
From: joe07734@gmail.com
Subject: Re: word squares
Date: September 27, 2007 9:20:37 PM EDT
To: jzimba@gmail.com
I agree. Plus, I want "normal" words because they make the arrangement more impressive. "abed"? "baba"? bah!
From: joe07734@gmail.com
Subject: Re: word squares
Date: September 28, 2007 12:30:12 AM EDT
To: jzimba@gmail.com
It was harder than I thought ensuring that the words are distinct.
The code's up on the svn server. You can get it and the latest dictionary at:
http://svn.bennington.edu/svn/word-squares/
(The dictionary had another problem -- duplicates -- which I fixed.)
Run the program by saving it in your home directory and then within Terminal typing:
$ ruby word-squares 4
Where "4" is the dimensions.
puro 0 idol 474 lore 1762 into 2033
I'm learning a lot of 4-letter words! Also, I shuffle the words so it's not always the same predictable alphabetical sequence. This run happened to put "puro" first.
The basic outline of the algorithm is to assign words to each descending row starting at the top. As each word is added a check is made of the columns' word stems to ensure that all of them are indeed stems of one or more legitimate words. If you get to the bottom row you win.
Eg,
puro 0 idol 474
pi-, ud-, ro- and ol- are checked. All of them are stems of words in the dictionary.
puro 0 idol 474 lore 1762
pil-, udo-, ror- and ole- are checked. All are okay. Etc.
I build tables of all the 1 letter stems (obvious), 2 letter stems, etc. I'll annotate the code and tell you more about it another day. I'm too beat to do it now.
This is a great candidate for distributed processing. Each machine takes a slice of the search space and goes to work. Typically you have a task server that creates a queue of tasks from which the client machines receive their next task to work on. I might write this, with an OS X screen saver as the client. Cool?!
I think I'll teach a straightforward Ruby programming course next term.
Good night, and thanks for the fun, Joe
From: jzimba@gmail.com
Subject: Re: word squares
Date: September 28, 2007 12:51:13 AM EDT
To: joe07734@gmail.com
OK, time to abandon Mathematica for projects like this! It's so slow. If I'm understanding the approach you outlined, it's one I tried and abandoned as taking prohibitively long to execute.
Knowing Mathematica, strings are probably represented internally as matrices of quaternions or something.
I'll check out the ruby thing - after Ryan Moran's visit to WMWM, I resolved to learn how to use it.
J
From: joe07734@gmail.com
Subject: 5x5!
Date: September 28, 2007 1:52:55 AM EDT
To: jzimba@gmail.com
rishi 0 inter 606 varna 4005 enact 3911 tepee 1276
two hindi words...
From: joe07734@gmail.com
Subject: word squares in C
Date: September 29, 2007 12:26:53 PM EDT
To: jzimba@gmail.com
Cc: angelatraficante@gmail.com, tas.reachhigher@gmail.com
Hi Jason,
I finished a C implementation of the word squares program.
http://svn.bennington.edu/svn/word-squares/word-squares.c
I say that the two programming languages to know today are C and Ruby. Ruby because it works the way algorithms work, and C because it works the way computers work. If I'm in an algorithmic mood I'll code in Ruby. That's mostly what I do these days. But if I need maximum performance or I'm writing an operating system I code in C. The C version of word squares is 2000x faster than Ruby.
To try the C version, copy word-squares.c and make-word-squares to your home directory. You need to compile the C program so do this in Terminal:
$ make-word-squares
You need to have Xcode installed (C compiler, etc.). I can also send you a pre-compiled version if you need it. The reason I don't just upload a pre-compiled version is that there'd need to be a PPC version and an Intel version, plus versions with different compile options (different options for debugging versus high performance, etc.). Plus I'm lazy.
Run it like the Ruby version:
$ word-squares 5
c r a z e 1 r e l a x 639 a m a z e 2858 p a t e r 1798 s p e n t 1106
...
On my laptop it checks approx. 1,500,000 squares / second. It would take 25,000 years to run through all the 5x5 possibilities. The Ruby version checks about 800 / sec. I don't even want to think about how long that would take.
Notice how much longer and denser the C code is compared to Ruby. In C I'm in total control of the CPU. This is a Good Thing for me, but a challenging thing for many programmers.
I would like to remove all the cheat words from the dictionary.
Joe
P.S. Ang -- Jason and I have been discussing this back and forth via email, and I thought you might like to see. You can get all the files via svn with:
$ svn co http://svn.bennington.edu/svn/word-squares
From: joe07734@gmail.com
Subject: Re: word squares in C
Date: September 29, 2007 12:43:30 PM EDT
To: jzimba@gmail.com
Cc: angelatraficante@gmail.com, tas.reachhigher@gmail.com
P.S. It generated all 3x3 word squares in about 20 seconds. There are 304,740.
From: joe07734@gmail.com
Subject: Re: gdb
Date: September 29, 2007 6:14:04 PM EDT
To: tas.reachhigher@gmail.com
you should check out this code i just wrote. it creates n x n word squares. wrote it in C last night after writing it in Ruby the night before. it was a zimba problem.
[svn.bennington.edu/svn/word-squares/ svn.bennington.edu/svn/word-squares/]
i mention it because it's the first little C program i've written in a long time, and it shows how i like to write C. that is, if i can say so, a very mature style of C. in particular, the way i handle errors, including malloc failures. i would encourage you to adopt a similar style.
you might find it informative to compare the C and Ruby versions. they're identical, algorithmically speaking.
gdb is *sort of* like irb, only you can only play inside of an already-compiled C program. look at variables, change them, run the code a bit, stop it at interesting places, etc. i believe you can't know a program (including your own) unless you've stepped through its execution a line at a time, at least three times. i did this with every line in illustrator at the machine language level. that was fun. gdb is a great tool to learn C, but it's arcane. i can give you some quick pointers (ha!) that'll make all the difference, if you think you'll use it.
joe
From: joe07734@gmail.com
Subject: better code, better estimates
Date: October 1, 2007 8:56:56 AM EDT
To: jzimba@gmail.com
Cc: angelatraficante@gmail.com, svanryzin@bennington.edu
Hi Jason,
The stats I tried to add to word-squares quickly overflowed my paltry 64-bit integers (2^64 doesn't buy what it used to), so I found a C library that does arbitrary precision math and added that to the mix. Now I get stats like:
Dictionary contains 6,157 6-letter words. There are 54,477,218,036,268,343,883,449 possible word squares. ... 431,642,638,211,104,315,152,000 0.04937% 0 1,108,176,488,706,913,754,940 /sec 3 days remaining
The previous version I wrote about didn't count the pruned branches correctly. Now that it does, I've discovered the estimates are much shorter than I'd anticipated. The 6x6 space can be searched in about 3 days. I'm running these with my OS X-based dictionary (New Oxford American Dictionary, 2nd Edition). I'd be interested in running the same code with your dictionary.
These numbers are unreadable, but scientific notation is nearly as foreign to me.
My problem in life is that the algorithm interests me greatly, but the subject of the algorithm often doesn't. I'm mildly interested in the results of the word-squares program (hey, a 6x6 square would be a good pick up line), but mostly I just dig tinkering with the code. But I think it's done. Rather than write about it, we should have lunch.
Or maybe this should be a science workshop?
Joe
P.S. One of the better 4x4's:
s u r f o p a l b o n e a n t e
From: joe07734@gmail.com
Subject: Science Workshop - Word Squares and Algorithm Design
Date: October 1, 2007 8:56:39 AM EDT
To: amiejomcc@yahoo.com
Cc: jzimba@bennington.edu, sherman@bennington.edu, jbullock@bennington.edu, jkravitz@bennington.edu, jfoley@bennington.edu, amcintyre@bennington.edu, kwoods@bennington.edu
Hi Amie Jo,
Can I have the science workshop on Oct 10? According to the wiki that day's free.
I've been playing with an algorithm that creates n x n word squares, following a discussion with Jason. E.g.:
s u r f o p a l b o n e a n t e
In fact I haven't slept much since starting on it last Thursday. First I wrote it in Ruby, developing and optimizing the algorithm. The Ruby version was able to search through about 500 / sec. Then I spent two days re-writing it in C as an exercise, and because I know C like the back of my hand, and I knew it would be infinitely faster. The C version is able to search through 1.5 million / sec.
There are a lot of interesting things here about words and combinatorics and algorithm design. Especially how insights into the data set can lead to much-better-performing algorithms. For example, you could approach this problem by generating every possible letter in every possible square, and then testing if you have words across and down. That's 26^16 -- 43,608,742,899,428,874,059,776 squares to check. The Library of Babel approach. But realizing that you can limit the 'random' letters to only patterns that are words (eg, placing 'surf' in the top row) reduces the problem to 2,691^4 -- 52,439,047,073,361 (2,691 is the number of four-letter words in my dictionary). That's 0.000000001% of the babylon search space. I further reduce the searches by abandoning slices of the search space that are obviously unfruitful. For example, all squares with the top two lines:
s u r f z o n e . . . . . . . .
can be skipped because there are no words beginning with 'sz', which is the stem of the vertical word in the first column. The savings is difficult (impossible?) to calculate without running the algorithm (that's another interesting point, by the way).
Also, interesting things about the speed of Ruby vs. C. C works the way computers work, for better or ill when it comes to translating an algorithm into C. Ruby is a much newer language and works the way algorithms work. So it's very easy to code ideas up. But the downside is that it then has to do more work under the hood to fake it.
Whew. Okay, see what floats my boat?
Joe
