Greedy matching

From BenningtonWiki

Jump to: navigation, search

worksheet-thumbnail.gif

I've been working on an online schedule planner for students at Bennington College. In the worksheet I have a cool little ruby function that parses strings representing times for classes, in the format that the PDF curriculum uses. It handles times like "Monday, 2:00pm - 4:00pm". But it's extremely flexible and handles crazy expressions like "mwf 10-noon, tth 2-4" too. It's a cool bit of code. If you want to look more closely the source file is parse_time.rb on Bennington svn. And you can try it out in irb like this:

~/svn/curriculum $irb
irb(main):001:0> load 'parse_time.rb'
=> true

irb(main):002:0> parse_time("Monday, 2:00pm - 4:00pm")
=> [{"dtend"=>[16, 0], "byday"=>["MO"], "dtstart"=>[14, 0]}]

irb(main):003:0> parse_time("mwf 10-noon, tth 2-4")
=> [{"dtend"=>[12, 0], "byday"=>["MO", "WE", "FR"], "dtstart"=>[10, 0]}, {"dtend"=>[16, 0], "byday"=>["TU", "TH"], "dtstart"=>[14, 0]}]

(By the way, the keys 'dtstart', 'dtend' and 'byday' are named what they are because I'm following an internet standard for calendar event representations, called curiously enough iCalendar. Here's the RFC. There's a standard for everything, and it's worth the time to do a little research before (re)inventing something.)

(Also by the way, notice how hash keys aren't ordered. I create the keys in start, end, days order, but in the hash they come out 'dtend', 'byday', 'dtstart'. You can't rely on any kind of ordering of hash keys. You'll see more of this below.)

Anyway, my student Reid and I found a bug in parse_time yesterday, having to do with parsing the days of the week. To understand the bug, look at these expressions. They're all legit and equal:

"Monday, Wednesday"
"mon, wed"
"mon wed"
"monwed"
"mw"
"MW"

The way I approached parsing this was to have a list of legal day-of-week strings, and then lop off matches as they're found. The list of day names looks something like this (case doesn't matter):

su, sun, sunday, m, mo, mon, monday, t, tu, tues, tuesday, w, we, wed, wednesday, etc.

The algorithm would do this:

"Monday, Wednesday" -> matches "monday"
", Wednesday" -> matches "wednesday" (it doesn't have to be at the start)
"" -> done

"MW" -> matches "m"
"W" -> matches "w"
"" -> done

Notice how this approach makes it very flexible regarding delimiters (or the lack of, as in the second example). It doesn't matter whether the day names are separated by spaces, commas, commas followed by spaces, dashes, etc. The list of valid weekday strings is actually stored in a hash, like this:

$weekdays = { 'su' => "SU", 'sun' => "SU", 'sunday' => "SU",
'm' => "MO", 'mo' => "MO", 'mon' => "MO", 'monday' => "MO",
't' => "TU", 'tu' => "TU", 'tues' => "TU", 'tuesday' => "TU",
'w' => "WE", 'we' => "WE", 'wed' => "WE", 'wednesday' => "WE",
'th' => "TH", 'thur' => "TH", 'thurs' => "TH", 'thursday' => "TH",
'f' => "FR", 'fr' => "FR", 'fri' => "FR", 'friday' => "FR",
's' => "SA", 'sa' => "SA", 'sat' => "SA", 'saturday' => "SA",
}

The hash keys ('su', 'sun', 'sunday', etc.) are what I look for, and the hash values ("SU", "MO", etc.) are the days that the matches actually represent.

So it might be dawning on you that if I just look for each key in no particular order, I could get into trouble with keys that have identical character prefixes. 'T' Tuesday might get confused with 'TH' Thursday. If while parsing the time string "th 2-4" I happen to look for 'T' first, I'll chop it off and be left with 'h', which is ignored and the user is left with the wrong day.

In English day names this comes up twice, with Tuesday/Thursday and Saturday/Sunday. When Reid and I were playing with this yesterday I tried to create a time spec for every day of the week: "mtwthfssu" and when I used it to search for courses it missed Thursday and Sunday events. See why? The 't' in 'th' was interpreted as Tuesday, and the second 's' was taken as just another Saturday. The 'h' was ignored, as was the 'u'. Heck, they're just delimiters.

The solution was to sort the keys from longest to shortest and look for matches in that order. This is called "greedy" matching in regular expression terms, where you take as much as possible that matches, as opposed to as little as possible. Take 'TH' if possible, otherwise 'T'. Greedy. I actually do the matching with a single ruby regular expression that's built up from all of the keys, using alternation. (I assumed that alternation was automatically greedy, but instead it just matches in left-to-right order, stopping with the first match.) That's another topic I don't feel like getting into right now.

But check out the resulting fun little bit of ruby:

$weekdays.keys.sort {|a,b| b.length <=> a.length}

$weekdays is the hash, as above. The method 'keys' returns an array of just the keys of the hash. 'sort' sorts that array, using the following block to specify the sort order. Without the block, 'sort' uses the object class's default sort rule, which for strings is a normal alphabetical sort order. But I don't want that; I want to sort the array from longest to shortest. The '<=>' comparison operator is made for sort blocks: it evaluates to -1, 0 or +1 depending on how the two things compare (less, equal, greater). That's what the block has to return for 'sort' to work.

irb(main):004:0> $weekdays
=> {"wed"=>"WE", "fr"=>"FR", "w"=>"WE", "tu"=>"TU", "mon"=>"MO", "sun"=>"SU", "tuesday"=>"TU", "mo"=>"MO", "m"=>"MO", "su"=>"SU", "saturday"=>"SA", "sa"=>"SA", "we"=>"WE", "thursday"=>"TH", "wednesday"=>"WE", "tues"=>"TU", "monday"=>"MO", "sunday"=>"SU", "sat"=>"SA", "f"=>"FR", "thurs"=>"TH", "thur"=>"TH", "fri"=>"FR", "s"=>"SA", "friday"=>"FR", "th"=>"TH", "t"=>"TU"}

irb(main):005:0> $weekdays.keys
=> ["wed", "fr", "w", "tu", "mon", "sun", "tuesday", "mo", "m", "su", "saturday", "sa", "we", "thursday", "wednesday", "tues", "monday", "sunday", "sat", "f", "thurs", "thur", "fri", "s", "friday", "th", "t"]

irb(main):006:0> $weekdays.keys.sort
=> ["f", "fr", "fri", "friday", "m", "mo", "mon", "monday", "s", "sa", "sat", "saturday", "su", "sun", "sunday", "t", "th", "thur", "thurs", "thursday", "tu", "tues", "tuesday", "w", "we", "wed", "wednesday"]

irb(main):007:0> $weekdays.keys.sort {|a,b| b.length <=> a.length}
=> ["wednesday", "thursday", "saturday", "tuesday", "sunday", "friday", "monday", "thurs", "thur", "tues", "wed", "fri", "sat", "mon", "sun", "mo", "th", "su", "we", "fr", "tu", "sa", "f", "s", "w", "m", "t"]

Just noticed I should put 'tue' and 'thu' in there.

You can try out the curriculum worksheet. And you can check out the code (anonymous svn).

Personal tools