Grid Algorithm

Over the past couple of weeks, I’ve been writing a small lexer in Rust, motivated by the need for something simpler and easier to use than the default lexer offered by LALRPOP. The in-built lexer allows for simple regexes, but managing these regexes becomes difficult for more sophisticated situations such as double-quoted literals.

The core idea behind a lexer is to convert a stream of characters into a stream of tokens. Each token constitutes a distinct building block that can be used to construct a grammar for a language parser. As an analogy, the vocabulary and punctuation supporting the English language constitute a set of tokens. Of course, you could avoid the lexer altogether and deal directly with the underlying characters, but that makes the language grammar exponentially more complex to specify. Imagine parsing English sentences based on rules associated with each character of the alphabet!

How It Works

To recap, a lexer’s job is to convert a stream of characters into a stream of tokens. Tokens can be anything that’s useful to feed into your parser. The parser is generated from a set of productions declared in a grammar file. LALRPOP provides a bottom-up LR(1) parser (“LR(1)” expands to a left-to-right, rightmost-derivation parser with a single token of lookahead). A bottom-up parser uses recognized tokens to construct the leaf nodes of the parse tree. In contrast, a top-down parser (of which recursive-descent parsers are the most common) generates productions eagerly as matching symbols are encountered. Top-down parsers are often very convenient and easy to use, but they’re less flexible than bottom-up parsers, and require backtracking when they encounter production dead-ends.

One way to structure the lexer is to break it down into a number of processors, one per type of token that could potentially be produced. For example, a Sym processor processes symbols, which are arbitrarily long consecutive sequences of characters conforming to a class (such as letters, numbers and underscores). The processor abstraction is useful for encapsulating the logic associated with each token type in its own implementation. The idea is to iterate through each character in the stream, and feed it to each of the processors until one of them accepts it. If none of them accepts the character, the Err processor is sure to do so (which means the character stream is ill-formed).

This abstraction is represented by the following trait:

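/// A processor recognizes a single category of token. The lexer feeds
/// it one character at a time: `idx` is the character's position in the
/// stream, and `chr` is presumably `None` once the stream is exhausted.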
pub trait Processor {
    fn accept(
        &self,
        idx: usize,
        chr: Option<&char>) -> Result<LexRes, LexErr>;
}

To complete this definition, the supporting types LexRes and LexErr are provided below. When fully or partially successful, the lexer returns an ordered sequence of tokens, where each token declares its starting index (relative to the original character stream), the number of characters consumed, and the category of token (represented by the Cat enum). Notice that some of these enum values, such as Sym (for ‘symbol’) and Lit (for ‘literal’), capture a value, whereas others do not.

#[derive(Debug, Eq, PartialEq)]
pub struct LexRes {
    /// Tokens produced, if any
    pub tokens: Vec<Token>,
    /// Number of outstanding characters accepted
    pub accepted: usize,
    /// Whether the remaining outstanding characters were rejected
    pub rejected: bool,
}

#[derive(Debug, Eq, PartialEq)]
pub struct LexErr {
    pub idx: usize,
    pub msg: String,
}

#[derive(Debug, Eq, PartialEq)]
pub struct Token {
    pub idx: usize,
    pub len: usize,
    pub cat: Cat,
}

#[derive(Clone, Debug, Eq, PartialEq)]
pub enum Cat {
    Sym(String),
    Lit(String),
    Assgn,
    Eql,
    SemiColn,
    LParen,
    RParen,
    Spc,
}

What the simple definition of LexRes above belies is its underlying flexibility. When a processor receives a character, it can choose from three options:

  1. Accept k outstanding characters (accepted = k; k ≤ n)
  2. Reject all characters not accepted (rejected = true; k < n)
  3. Wait for more characters (rejected = false; k < n)

The number n represents the total number of outstanding characters; it increments on every iteration, and decreases as characters are accepted or rejected. Notice that these options are not mutually exclusive; in fact, the first option combines with either the second or the third, but the last two cannot be combined with each other.

When the processor chooses the third option, this acts as a signal to the lexer that the next character should be fed directly to this processor. In essence, the processor holds an exclusive lock over the remainder of the stream as long as characters remain pending (neither accepted nor rejected). This process can continue ad infinitum, or at least until the end of the character stream. In each iteration, the processor has the choice of accepting up to the total number of outstanding characters (in stream order), and either rejecting the rest (which ends the cycle and releases the exclusive lock) or continuing to wait for additional characters.
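
To make this concrete, here is a minimal sketch of what a whitespace processor might look like. This is purely illustrative, not the actual implementation: it assumes that chr is None at the end of the stream, and it uses Cell for internal bookkeeping, since accept takes &self.

use std::cell::Cell;

pub struct SpcProcessor {
    start: Cell<usize>, // index of the first outstanding space
    count: Cell<usize>, // number of outstanding spaces
}

impl Processor for SpcProcessor {
    fn accept(
        &self,
        idx: usize,
        chr: Option<&char>) -> Result<LexRes, LexErr> {
        match chr {
            // Option 3: a space arrives; hold the lock and wait.
            Some(c) if c.is_whitespace() => {
                if self.count.get() == 0 {
                    self.start.set(idx);
                }
                self.count.set(self.count.get() + 1);
                Ok(LexRes { tokens: vec![], accepted: 0, rejected: false })
            }
            // Options 1 and 2: anything else (or the end of the stream)
            // arrives; accept the outstanding run as a single Spc token,
            // reject the rest, and release the lock.
            _ => {
                let k = self.count.get();
                self.count.set(0);
                let tokens = if k > 0 {
                    vec![Token { idx: self.start.get(), len: k, cat: Cat::Spc }]
                } else {
                    vec![]
                };
                Ok(LexRes { tokens, accepted: k, rejected: true })
            }
        }
    }
}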

The declared ordering of processors has functional significance in certain situations. For instance, if you have a choice of detecting <, - or <- as different token types, you’d likely want <- to take precedence. Also, as a trivial example, the Err processor is always placed last, as the processor of last resort.
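
Putting the pieces together, the driving loop might look something like the sketch below. Again, this is illustrative and simplified rather than the actual implementation: it assumes that processors resolve their outstanding characters when fed None, and that the Err processor (tried last) always either accepts a character or returns a LexErr.

pub fn lex(
    input: &str,
    procs: &[&dyn Processor],
) -> Result<Vec<Token>, LexErr> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut start = 0; // index of the first outstanding character

    while start < chars.len() {
        // Restart at the first processor for each new start position.
        'outer: for p in procs {
            let mut i = start;
            loop {
                let res = p.accept(i, chars.get(i))?;
                tokens.extend(res.tokens);
                start += res.accepted; // consume accepted characters
                if res.rejected {
                    if res.accepted > 0 {
                        break 'outer; // progress made; rescan from start
                    }
                    break; // nothing accepted; try the next processor
                }
                i += 1; // option 3: the processor holds the lock
            }
        }
    }
    Ok(tokens)
}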

Play the short animation below for a concrete visual of what the lexer’s processing looks like for this simple character sequence:

let quote <- say("\"Hello world!\" --RRI");

In the example above, let, quote and say are interpreted as symbols using the Sym processor. The Seq processor looks for a specific sequence of characters (here, it is <-, mapping to Cat::Assgn). The Lit processor is distinctive because it internally tracks state associated with the escape character \, which can escape double-quotation marks inside a double-quoted literal.
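
If all goes well, the stream above should lex to a token sequence along the following lines (indices and lengths omitted, and the exact payload of the Lit token left unspecified):

Sym("let"), Spc, Sym("quote"), Spc, Assgn, Spc, Sym("say"),
LParen, Lit(…), RParen, SemiColn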

Complexity

If we assume that (a) we have a small, constant number of processors and (b) each processor performs its task in constant time, the overall algorithm is clearly O(n) in time complexity with respect to a character stream of length n. However, we can also evaluate how efficient the algorithm is with respect to a baseline. In theory, the only way to be certain that a character is part of a particular token is for its processor to accept it, which means that every character must be processed at least once, giving us a total of n processing calls as the ideal minimum to baseline our algorithm against.

From the visual above, it’s clear that the checkmarks (✓) represent the ideal path, whereas the crosses (☒) represent ‘misses’, or distance traveled before the character is accepted. Further, when a processor like Sym rejects the character immediately succeeding its final character (such as whitespace or punctuation, which is not part of the symbol), processing is restarted at the rejected character from the first processor, which, in the particular example, is again Sym. That means we have an extra character to be processed for every symbol recognized (note that ‘symbol’ in this context means the output token of the Sym processor such as let, quote and say). The efficiency of the algorithm with respect to the baseline can therefore be defined as:

\frac{count(\texttt{accept})}{count(\texttt{accept}) + count(\texttt{reject}) + count(\texttt{symbol})}

In essence, this is a number in (0, 1] that is the inverse of the average number of steps until a character is accepted by a processor. It varies based on the string processed by the lexer, so we’d likely only care about average statistics. Applying it to the example above gives us an efficiency of approximately 0.59: on average it takes 1.7 steps for each character to be accepted by a processor. Turning this into a useful metric would require us to normalize it, which we won’t attempt to do today.

Finally, there are further opportunities for refinement. First, each processor requires a different number of character comparisons to complete its task. Although we’ve assumed that each operates in constant time, we really ought to weight each processor, or otherwise measure the total number of comparisons directly. This would give us a more accurate metric than the one presented above. Second, we can order processors to maximize efficiency, basing the order on predictions that we derive from the statistics of the input being processed. We’re still bound by the need to process each character at least once, but we might be able to get really good at predicting which processor to try next, giving us an efficiency of close to 1.

That’s all for today, folks! 🖖

The Josephus Problem

The final problem presented in the first chapter of Concrete Mathematics: A Foundation for Computer Science is The Josephus Problem. While the fables about the origin of this problem are entertaining in their own right, the premise is simple. We have n objects, given sequential labels from 1 to n and arranged in a circle in that sequence. Starting from the object labeled 1, we eliminate every second object until only one survives. The objective is to determine the label of the surviving object.

The Josephus Problem for n = 12 (link to LaTeX source)

As an example, consider the Josephus number J(n) for n = 12, shown in the diagram above. Labels get eliminated in the following sequence:

2, 4, 6, 8, 10, 12, 3, 7, 11, 5, 1

…leaving 9 at the very end.

J(12)=9

In looking for a general solution to this problem, we first observe that the pattern of elimination differs slightly depending on whether n is even or odd. When n is even, the first round eliminates all even numbers. The second round begins again with 1, but with exactly half the original number of objects, and with larger labels: the object in position j of the new circle carries the label 2j - 1. We can establish the following relation:

J(2n) = 2J(n) - 1 \tag{1}

When n is odd, the first round again eliminates all even numbers, but the second round begins by eliminating 1. This leaves the objects labeled 3, 5, …, 2n + 1, where the object in position j carries the label 2j + 1. Similar to equation (1), we can establish the following relation:

J(2n + 1) = 2J(n) + 1 \tag{2}
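
As a quick sanity check, unrolling these relations for n = 12 reproduces the answer from the diagram above:

\begin{aligned}
J(12) &= 2J(6) - 1 \\
      &= 2\,(2J(3) - 1) - 1 \\
      &= 2\,(2\,(2J(1) + 1) - 1) - 1 \\
      &= 2\,(2 \cdot 3 - 1) - 1 = 9
\end{aligned}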

With these recurrence relations established, we can calculate a few sample values of the function and thereby derive a general proposition.

\begin{aligned}
J(1) &= 1 \\ \\
J(2) &= 1 \\
J(3) &= 3 \\ \\
J(4) &= 1 \\
J(5) &= 3 \\
J(6) &= 5 \\
J(7) &= 7 \\ \\
J(8) &= 1 \\
J(9) &= 3 \\
J(10) &= 5 \\
J(11) &= 7 \\
J(12) &= 9 \\
J(13) &= 11 \\
J(14) &= 13 \\
J(15) &= 15 \\ \\
J(16) &= 1
\end{aligned}

Within each group bracketed by consecutive powers of 2, the values begin at 1 and increase by 2, suggesting the following proposition:

\begin{aligned}
J(2^m + l) &= 2l + 1 & 0 \le m,\; 0 \le l \lt 2^m \tag{3}
\end{aligned}
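
For example, 51 = 2^5 + 19, so the proposition gives J(51) = 2 · 19 + 1 = 39.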

Proof (by induction on m)

We already know the value of the function when n=1, which we obtain when m = l = 0.

\begin{aligned}
J(1) = J(2^0 + 0) = 1 \tag{4}
\end{aligned}

Suppose that the proposition is true for some m \ge 0. We want to show that this implies that the proposition is also true for m + 1. To do so, we need to examine two cases separately: (a) when l is even, and (b) when l is odd.

\begin{aligned}
\text{(when $l$ is even)} \\ \\
J \big(2^{m+1}+l\big) &= J \big(2 \cdot [2^m+l/2]\big) \\
                        &= 2 \cdot J (2^m+l/2)-1 \\
                        &= 2 \cdot (2 \cdot l/2 + 1)-1 \\
                        &= 2l + 1 \\
\\
\text{(when $l$ is odd)} \\ \\
J \big(2^{m+1}+l\big) &= J \big(2 \cdot [2^m+(l-1)/2] + 1\big) \\
                        &= 2 \cdot J \big(2^m+(l-1)/2\big) + 1 \\
                        &= 2 \cdot \big(2 \cdot (l-1)/2 + 1\big) + 1 \\
                        &= 2l + 1 \\
\end{aligned}

These results follow from the earlier recurrence relations (1) and (2) and the induction hypothesis for m. Together with the base case in (4), they show that the proposition is indeed true for all m ≥ 0. QED.

There are a number of ways to implement this program in Rust; in this case, we take the most straightforward approach of directly evaluating the formula (3) that we just proved.

Preparation

  • Make sure cargo is installed on your system via rustup.
  • Create a new package called josephus to run your code.
$ cargo new josephus
$ cd josephus

P0: Basic Solution

Link to GitHub branch.

//
// src/lib.rs
//

/// Calculate the Josephus Number J(n) for any n > 0.
///
/// We use a simple formula here: express the number `n` as
/// 2^m + l for some m >= 0, 0 <= l < 2^m, then calculate the
/// value of 2 * l + 1, which is the result.
///
/// * `n` - The number to calculate J for
pub fn josephus(n: u32) -> u32 {
    let base: u32 = 2;

    let mut index: u32 = 0;
    let mut hbits: u32 = n;

    loop {
        hbits = hbits >> 1;
        if hbits == 0 {
            break;
        }
        index += 1;
    }

    let l = n - base.pow(index);

    2 * l + 1
}

#[cfg(test)]
mod tests {

    use super::*;

    #[test]
    fn test_one() {
        assert_eq!(josephus(1), 1);
    }

    #[test]
    fn test_three() {
        assert_eq!(josephus(3), 3);
    }

    #[test]
    fn test_twelve() {
        assert_eq!(josephus(12), 9);
    }

    #[test]
    fn test_fifty_one() {
        assert_eq!(josephus(51), 39);
    }

    #[test]
    fn test_sixty_four() {
        assert_eq!(josephus(64), 1);
    }
}
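
As an aside (hypothetical, and not part of the branch above), the same formula can be computed with a couple of bit operations: subtracting 2^m simply strips the highest set bit of n, so J(n) amounts to rotating the significant bits of n left by one position.

pub fn josephus_bits(n: u32) -> u32 {
    // floor(log2(n)); requires n > 0
    let m = 31 - n.leading_zeros();
    // (n - 2^m) * 2 + 1, i.e. rotate the low m + 1 bits left by one
    ((n ^ (1 << m)) << 1) | 1
}

For instance, josephus_bits(12) strips the high bit of 1100 (binary) and appends a 1, yielding 1001 (binary), or 9, as before.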

That’s all for today, folks! 🖖

The Tower of Hanoi

The first problem presented in Concrete Mathematics: A Foundation for Computer Science is The Tower of Hanoi. Although a relatively simple problem, it offers a useful foray into introductory programming with Rust. We are given a tower of n disks, initially stacked in decreasing size on one of three pegs. The objective is to transfer the entire tower to one of the other pegs, moving only one disk at a time and never moving a larger one onto a smaller. Once we satisfy ourselves that a solution to this problem exists, we can choose to determine the total number of moves T_n required to solve the problem, or an enumerated list of moves constituting this number.

In creating a solution for this, note that we are not pursuing a closed form solution; in essence, the computer has to ‘do the work’ of making each move, recording it in the process. We begin with the simplest possible version of the solution.

Preparation

  • Make sure cargo is installed on your system via rustup.
  • Create a new package called hanoi to run your code.
$ cargo new hanoi
$ cd hanoi

P0: Basic Solution

Link to GitHub branch.

//
// src/main.rs
//
use std::io;
use std::io::Write;

/// Solve the Tower of Hanoi problem
fn main() {
    let mut num_disks = 0;
    let mut num_moves = 0;

    get_num_disks(&mut num_disks);

    if num_disks >= 0 {
        println!("Number of disks = {}", num_disks);
        move_disk_set(num_disks, "A", "B", "C", &mut num_moves);
        println!("Number of moves = {}", &num_moves);
    } else {
        println!("Invalid number ({}) of disks!", &num_disks);
    }
}

/// Get the number of disks to solve the problem for
///
/// * `k` - Buffer to populate the result in
fn get_num_disks(k: &mut i64) {
    let mut buf = String::new();
    print!("Enter the number of disks: ");
    io::stdout().flush().unwrap();
    io::stdin()
        .read_line(&mut buf)
        .expect("Failed to read input!");
    *k = buf
        .trim()
        .parse::<i64>()
        .expect("Failed to parse input as an integer!");
}

/// Move a tower of the given size from source to destination.
///
/// * `num_disks` - Number of disks in the tower
/// * `src`       - Label for the source peg
/// * `dst`       - Label for the destination peg
/// * `buf`       - Label for the peg acting as a buffer
/// * `num_moves` - Number of moves executed so far
fn move_disk_set(
    num_disks: i64,
    src: &str,
    dst: &str,
    buf: &str,
    num_moves: &mut i64,
) {
    match num_disks {
        0 => return,
        n => {
            move_disk_set(n - 1, src, buf, dst, num_moves);
            move_disk(num_moves);
            move_disk_set(n - 1, buf, dst, src, num_moves);
        }
    }
}

/// Move a single disk from source to destination.
///
/// * `num_moves` - Number of moves executed so far
fn move_disk(num_moves: &mut i64) {
    *num_moves = *num_moves + 1;
}

The output of this program is quite straightforward.

$ cargo run
   Compiling hanoi v0.1.0 (/Users/rri/Code/play/hanoi)
    Finished dev [unoptimized + debuginfo] target(s) in 0.36s
     Running `target/debug/hanoi`
Enter the number of disks: 3
Number of disks = 3
Number of moves = 7

P1: Printed Moves

Link to GitHub branch.

The solution gets a bit more interesting when you print each move on the console, thus giving you a way to see the code in action and verify its output. (For brevity, only the updated functions are shown below.)

Update move_disk to print out the move when it is called.

fn move_disk(num_moves: &mut i64, src: &str, dst: &str) {
    *num_moves = *num_moves + 1;
    println!("{:>10}: {} -> {}", num_moves, src, dst);
}

Update move_disk_set to pass more parameters to move_disk.

fn move_disk_set(
    num_disks: i64,
    src: &str,
    dst: &str,
    buf: &str,
    num_moves: &mut i64,
) {
    match num_disks {
        0 => return,
        n => {
            move_disk_set(n - 1, src, buf, dst, num_moves);
            move_disk(num_moves, src, dst);
            move_disk_set(n - 1, buf, dst, src, num_moves);
        }
    }
}

The output of the modified program is slightly more involved.

$ cargo run
   Compiling hanoi v0.1.0 (/Users/rri/Code/play/hanoi)
    Finished dev [unoptimized + debuginfo] target(s) in 0.28s
     Running `target/debug/hanoi`
Enter the number of disks: 4
Number of disks = 4
         1: A -> C
         2: A -> B
         3: C -> B
         4: A -> C
         5: B -> A
         6: B -> C
         7: A -> C
         8: A -> B
         9: C -> B
        10: C -> A
        11: B -> A
        12: C -> B
        13: A -> C
        14: A -> B
        15: C -> B
Number of moves = 15

We would like to write automated tests for this solution. The need to test software usually forces us to change the structure of the program in certain ways, but this is a good thing, as it drives the design of the application to be much more modular. The problem with the solution above is that we can’t really test what output is printed on the console (at least, not easily). We need to modify our program to return a list of moves that we are free to inspect, rather than execute these moves directly. Modular software has the useful property that all effects on the environment external to the system under consideration are reified (given a concrete representation), as you see in the next section.

P2: Tested Moves

Link to GitHub branch.

The basic idea behind reification of effects is to take code that does some work, and represent that work in a structure or object instead of actually doing the work. In this case, first, instead of printing a move on the console, we return a structure that represents a move (complete with source and destination), and leave it up to the consumer of the structure to determine what to do with it. Second, we no longer need to maintain a count of moves, as the count is implied by the size of our move list. Finally, we add a few tests that validate our assumptions. (For brevity, only the updated functions and structures are shown below.)

Define a structure to represent a move.

/// Representation of a 'move' of a disk from one peg to another
struct Move<'a> {
    /// Label for the source peg
    src: &'a str,
    /// Label for the destination peg
    dst: &'a str,
}

Update the main function to consume and process (print) the moves.

fn main() {
    let mut num_disks = 0;
    let mut lst_moves: Vec<Move> = Vec::new();

    get_num_disks(&mut num_disks);

    if num_disks >= 0 {
        println!("Number of disks = {}", num_disks);
        move_disk_set(num_disks, "A", "B", "C", &mut lst_moves);
        println!("Number of moves = {}", lst_moves.len());
        for (pos, m) in lst_moves.iter().enumerate() {
            println!("{:>10}: {} -> {}", pos + 1, m.src, m.dst);
        }
    } else {
        println!("Invalid number ({}) of disks!", &num_disks);
    }
}

Update the move_disk function to add to a vector instead of printing the move.

fn move_disk<'a>(
    lst_moves: &mut Vec<Move<'a>>,
    src: &'a str,
    dst: &'a str,
) {
    lst_moves.push(Move { src, dst });
}

Add some tests to satisfy ourselves that the solution works. More interesting and useful tests may be added, of course.

#[cfg(test)]
mod tests {

    use super::*;

    #[test]
    fn test_nil() {
        test_disks(0, 0);
    }

    #[test]
    fn test_one() {
        test_disks(1, 1);
    }

    #[test]
    fn test_two() {
        test_disks(2, 3);
    }

    #[test]
    fn test_ten() {
        test_disks(10, 1023);
    }

    fn test_disks(num_disks: i64, exp_num_moves: i64) {
        let mut lst_moves: Vec<Move> = Vec::new();
        move_disk_set(num_disks, "A", "B", "C", &mut lst_moves);
        assert_eq!(lst_moves.len() as i64, exp_num_moves);
    }
}
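
As expected, the move counts asserted in these tests follow the closed form T_n = 2^n - 1 (for instance, 2^10 - 1 = 1023 for ten disks).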

One of the most interesting things about this code that has insidiously made its way into an otherwise simple example is the notion of lifetimes in Rust. The parameter 'a in the example represents a lifetime, denoting the scope for which the reference parameterized by it is available. The easiest way to think of this is as an extra parameter that gets threaded through whenever the reference is passed along to a function. For instance, the move_disk_set function now needs to be parameterized with 'a, with no other changes. In this case, the actual value of the parameter is inferred to be 'static as we are ultimately starting with string literals ("A", "B", "C").

fn move_disk_set<'a>(
    num_disks: i64,
    src: &'a str,
    dst: &'a str,
    buf: &'a str,
    lst_moves: &mut Vec<Move<'a>>,
) {
    match num_disks {
        0 => return,
        n => {
            move_disk_set(n - 1, src, buf, dst, lst_moves);
            move_disk(lst_moves, src, dst);
            move_disk_set(n - 1, buf, dst, src, lst_moves);
        }
    }
}

The output of running the tests is exactly as expected.

$ cargo test
    Finished test [unoptimized + debuginfo] target(s) in 0.00s
     Running target/debug/deps/hanoi-40e0f0eba95aed0c

running 4 tests
test tests::test_nil ... ok
test tests::test_one ... ok
test tests::test_two ... ok
test tests::test_ten ... ok

test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

That’s all for today, folks! 🖖