Codng.com
Trying to change the world one post at a time

Thursday, June 13, 2013
It's been a while

On the hobby side, hopefully this is the year that I finish my autonomous cubicle defence turret. I have most of the parts already, but I still need to check how fast a Raspberry Pi with OpenCV is at tracking faces (yay auto-targeting!). I'm using an Arduino to control the servos, and I still need to figure out what the weapon will be (I presume some sort of Nerf gun will be a good match). Anyway, I solemnly promise to try to post something meaningful this year.
Wednesday, September 07, 2011
Parallelism & Performance: Why O-notation matters
UPDATE: There was a bug in the way the suffix map was built in the original post, which lolplz found. I updated this post with that bug fixed; however, the content of the post has not changed significantly.
Aleksandar Prokopec presented a very interesting talk about Scala Parallel Collections at Scala Days 2011. I watched it today and it gave me some food for thought.
He opens his talk with an intriguing code snippet:
for {
  s <- surnames
  n <- names
  if s endsWith n
} yield (n, s)

Given what presumably are two sequences of strings, this code builds all the pairs (name, surname) where the surname ends with the name.
The algorithm used is pure brute force; its order is roughly O(N^2) (it's basically a Cartesian product being filtered). He then goes on to show that by just using parallel collections:
for {
  s <- surnames.par
  n <- names.par
  if s endsWith n
} yield (n, s)

he can leverage all the cores and cut the runtime by a factor of two or four, depending on the number of cores available in the machine where it's running. For example, the non-parallel version runs in 1040 ms, while the parallel one runs in 575 ms with two cores and in 305 ms with four. That is indeed very impressive for such a minor change.
What concerns me is that, no matter how many cores you add, the problem as presented is still O(N^2). It is true that many useful problems can only be sped up by throwing more hardware at them, but most of the time, using the right data representation can yield even bigger gains.
If we use this problem as an example, we can build a slightly more complex implementation that is hopefully a lot faster. The approach I'm taking is to build a suffix map for the surnames. There are more memory-efficient data structures for this, but for simplicity I'll use a Map:
val suffixMap = collection.mutable.Map[String, List[String]]().withDefaultValue(Nil)
for (s <- surnames; i <- 0 until s.length; suffix = s.substring(i))
  suffixMap(suffix) = s :: suffixMap(suffix)

Having built the suffix map, we can naively rewrite the loop to use it (instead of the Cartesian product):
for {
  n <- names
  s <- suffixMap(n)
} yield (n, s)

In theory, this loop is roughly O(N) (assuming the map is a HashMap), since it now does a constant number of operations on each name it processes, rather than processing all the names for each surname (I'm ignoring the fact that the map returns lists).
Note: the algorithmic order does not change much if we take the construction of the suffix map into account. Let's assume that we have S surnames and N names: the order of building the suffix map is O(S) and the order of building the pairs is O(N), so the total order of the algorithm is O(N+S). If we assume that N is approximately equal to S, then the order is O(N+N), which is the same as O(2N), which can be simplified to O(N).

So, let's see if this holds up in real life. For this I wrote the following Scala script, which runs a few passes of several different implementations:
#!/bin/sh
exec scala -deprecation -savecompiled "$0" "$@"
!#

def benchmark[T](name: String)(body: =>T) {
  val start = System.currentTimeMillis()
  val result = body
  val end = System.currentTimeMillis()
  println(name + ": " + (end - start))
  result
}

val surnames = (1 to 10000).map("Name" + _)
val names = (1 to 10000).map("Name" + _)

val suffixMap = collection.mutable.Map[String, List[String]]().withDefaultValue(Nil)
for (s <- surnames; i <- 0 until s.length; suffix = s.substring(i))
  suffixMap(suffix) = s :: suffixMap(suffix)

for( i <- 1 to 5 ) {
  println("Run #" + i)
  benchmark("Brute force") {
    for {
      s <- surnames
      n <- names
      if s endsWith n
    } yield (n , s)
  }
  benchmark("Parallel") {
    for {
      s <- surnames.par
      n <- names.par
      if s endsWith n
    } yield (n , s)
  }
  benchmark("Smart") {
    val suffixMap = collection.mutable.Map[String, List[String]]().withDefaultValue(Nil)
    for (s <- surnames; i <- 0 until s.length; suffix = s.substring(i))
      suffixMap(suffix) = s :: suffixMap(suffix)
    for {
      n <- names
      s <- suffixMap(n)
    } yield (n , s)
  }
  benchmark("Smart (amortized)") {
    for {
      n <- names
      s <- suffixMap(n)
    } yield (n , s)
  }
}

There are four implementations:
- Brute Force: the original implementation
- Parallel: same as before, but using parallel collections.
- Smart: using the suffix map (and measuring the map construction)
- Smart (amortized): same as before, but with the suffix map construction cost amortized.
Benchmark Results
Running this script on a four-core machine, I get the following results:

Run #1
Brute force: 2158
Parallel: 1355
Smart: 153
Smart (amortized): 27
Run #2
Brute force: 1985
Parallel: 899
Smart: 82
Smart (amortized): 7
Run #3
Brute force: 1947
Parallel: 716
Smart: 69
Smart (amortized): 5
Run #4
Brute force: 1932
Parallel: 714
Smart: 67
Smart (amortized): 6
Run #5
Brute force: 1933
Parallel: 713
Smart: 68
Smart (amortized): 5

As expected, the parallel version runs faster than the naive one (about 2.7 times as fast in the steady state), but the implementation using the "Smart" approach runs about 30 times faster than the naive one. If we are able to amortize the cost of building the suffix map, the speed-up is even more staggering, at a whopping 380 times faster (although amortizing is not always possible)!
What we can conclude from this, paraphrasing Fred Brooks¹, is that regarding performance there is no silver bullet. The basics matter; maybe now they matter more than ever.
Analyzing algorithmic order, even in its most basic form (making huge approximations), is a very practical tool for solving hard performance problems. And note that the simple approach used here is itself parallelizable, so for a large enough problem it might gain some additional speedup from parallel collections.
Don't get me wrong, I love Scala parallel collections and parallel algorithms are increasingly important, but they are by no means a magical solution that can be applied to any problem (and I think Aleksandar would agree with me on this one).
Footnotes
1 Misquoting is probably more accurate.
Tuesday, May 17, 2011
File locks in bash
I finally implemented a file lock for shell scripts using Bash's "noclobber" option. I don't know if it will work correctly on NFS, but it should work fine on most filesystems. Hopefully it will be useful to some of you.
#!/bin/bash
set -e

declare SCRIPT_NAME="$(basename $0)"

function usage {
  echo "Usage: $SCRIPT_NAME [options] <lock file>"
  echo "Options"
  echo "  -r, --retries"
  echo "     limit the number of retries before giving up the lock."
  echo "  -s, --sleeptime, -<seconds>"
  echo "     number of seconds between each attempt. Defaults to 8 seconds."
  exit 1
}

#Check that at least one argument is provided
if [ $# -lt 1 ]; then usage; fi

declare RETRIES=-1
declare SLEEPTIME=8 #in seconds

#Parse options
for arg; do
  case "$arg" in
    -r|--retries)
      shift; if [ $# -lt 2 ]; then usage; fi; RETRIES="$1"; shift
      echo "$RETRIES" | egrep -q '^-?[0-9]+$' || usage #check that it's a number
      ;;
    -s|--sleeptime)
      shift; if [ $# -lt 2 ]; then usage; fi; SLEEPTIME="$1"; shift
      echo "$SLEEPTIME" | egrep -q '^[0-9]+$' || usage #check that it's a number
      ;;
    --) shift ; break ;;
    -[[:digit:]]*)
      if [ $# -lt 1 ]; then usage; fi; SLEEPTIME=${1:1}; shift
      echo "$SLEEPTIME" | egrep -q '^[0-9]+$' || usage #check that it's a number
      ;;
    --*) usage;; #fail on other options
  esac
done

#Check that only one argument is left
if [ $# -ne 1 ]; then usage; fi
declare lockfile="$1"

for (( i=0; $RETRIES < 0 || i < $RETRIES; i++ )); do
  if ( set -o noclobber; echo "$$" > "$lockfile") 2> /dev/null; then
    exit 0
  fi
  #Wait a bit
  sleep $SLEEPTIME
done

#Failed
cat $lockfile
exit 1
Thursday, April 14, 2011
Content-aware image resizing
We will start from this image:
And shave 200 pixels off its width to turn it into this one:
Note that the image wasn't just resized: most of the detail is still there. The size reduction is rather aggressive, so there are some artifacts, but the results are quite good.
This algorithm works by repeatedly finding vertical seams of pixels (connected paths of pixels running from top to bottom, one pixel per row) and removing them. It chooses which seam to remove by finding the one with the minimal amount of energy.
The whole algorithm revolves around an energy function. In this case, I'm using a function suggested in the original paper, which is based on the luminance of the image: we compute the vertical and horizontal derivatives of the image, take the absolute value of each, and add them. The derivative is approximated by a simple subtraction.
The following code computes the energy of the image. The intensities image is basically the grayscale version of the image, normalized between 0 and 1.

private static FloatImage computeEnergy(FloatImage intensities) {
    int w = intensities.getWidth(), h = intensities.getHeight();
    final FloatImage energy = FloatImage.createSameSize(intensities);
    for(int x = 0; x < w-1; x++) {
        for(int y = 0; y < h-1; y++) {
            //I'm approximating the derivatives by subtraction
            float e = abs(intensities.get(x,y)-intensities.get(x+1,y))
                    + abs(intensities.get(x,y)-intensities.get(x,y+1));
            energy.set(x,y, e);
        }
    }
    return energy;
}
After applying this function to our image, we get the following:
You can observe that the edges are highlighted (i.e. have more energy). That is caused by our choice of energy function: since we're taking the derivatives and adding their absolute values, abrupt changes in luminance (i.e. edges) stand out.
The next step is where things start to get interesting. To find the minimal energy seam, we build an image with the accumulated minimal energy: the value of each pixel is the minimum of the three pixels above it, plus the energy of that pixel.
We do so with the following code:
final FloatImage energy = computeEnergy(intensities);
final FloatImage minima = FloatImage.createSameSize(energy);
//First row is equal to the energy
for(int x = 0; x < w; x++) {
    minima.set(x,0, energy.get(x,0));
}
//I assume that the rightmost pixel column in the energy image is garbage
for(int y = 1; y < h; y++) {
    minima.set(0,y, energy.get(0,y) + min(minima.get(0, y - 1), minima.get(1, y - 1)));
    for(int x = 1; x < w-2; x++) {
        final float sum = energy.get(x,y) +
                min(min(minima.get(x - 1, y - 1), minima.get(x, y - 1)), minima.get(x + 1, y - 1));
        minima.set(x,y, sum);
    }
    minima.set(w-2,y, energy.get(w-2,y) + min(minima.get(w-2, y - 1), minima.get(w-3, y - 1)));
}
Once we do this, each pixel in the last row contains the total energy of the cheapest seam ending at that pixel. With this, we search the last row for the position with the minimum total value:
//We find the minimum seam
float minSum = Float.MAX_VALUE;
int seamTip = -1;
for(int x = 1; x < w-1; x++) {
    final float v = minima.get(x, h-1);
    if(v < minSum) {
        minSum=v;
        seamTip=x;
    }
}
And backtrace the seam:
//Backtrace the seam
final int[] seam = new int[h];
seam[h-1]=seamTip;
for(int x = seamTip, y = h-1; y > 0; y--) {
    float left = x>0?minima.get(x-1, y-1):Float.MAX_VALUE;
    float up = minima.get(x, y-1);
    float right = x+1<w?minima.get(x+1, y-1):Float.MAX_VALUE;
    if(left < up && left < right) x=x-1;
    else if(right < up && right < left) x=x+1;
    seam[y-1]=x;
}
Having found the minimum energy seam, all that is left to do is remove it.
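This is the removal step as implemented in the full source at the end of this post: every pixel to the right of the seam is shifted one position to the left, yielding an image that is one pixel narrower.

private static BufferedImage deleteVerticalSeam(final BufferedImage input, final int[] seam) {
    int w = input.getWidth(), h = input.getHeight();
    //The output image is one pixel narrower than the input
    final BufferedImage out = new BufferedImage(w-1,h, BufferedImage.TYPE_INT_ARGB);
    for(int y = 0; y < h; y++) {
        //Copy the pixels to the left of the seam as-is
        for(int x = 0; x < seam[y]; x++) {
            out.setRGB(x,y,input.getRGB(x, y));
        }
        //Shift the pixels to the right of the seam one position left
        for(int x = seam[y]+1; x < w; x++) {
            out.setRGB(x-1,y,input.getRGB(x, y));
        }
    }
    return out;
}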
If we repeat this process several times, removing one seam at a time, we end up with a smaller image. Check the following video to see this algorithm in action:
If you want to reduce an image vertically, you have to find horizontal seams instead. If you want to do it both vertically and horizontally, you have to find which seam (the vertical or the horizontal one) has the least energy and remove that one.
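Horizontal seams are not implemented in the code below, but a simple (if wasteful) way to get them, my addition rather than part of the original program, is to transpose the image and reuse the vertical-seam code:

private static BufferedImage transpose(final BufferedImage in) {
    final BufferedImage out = new BufferedImage(in.getHeight(), in.getWidth(), BufferedImage.TYPE_INT_ARGB);
    for(int x = 0; x < in.getWidth(); x++) {
        for(int y = 0; y < in.getHeight(); y++) {
            out.setRGB(y, x, in.getRGB(x, y)); //swap coordinates
        }
    }
    return out;
}

//Removing one horizontal seam is then:
//transpose(deleteVerticalSeam(transpose(input)))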
This implementation is quick & dirty and very simplistic. Many optimizations can be made to speed it up, and it is also quite incomplete. By priming the energy image, you can influence the algorithm to avoid distorting certain objects in the image, or to pick them preferentially.
It is also possible to use this technique to enlarge an image (although I haven't implemented that), and by combining both methods one can selectively remove objects from an image.
The full source code for this demo follows. Have fun!
import javax.imageio.ImageIO;
import java.io.File;
import java.io.IOException;
import java.awt.image.BufferedImage;
import java.awt.*;

import static java.lang.Math.abs;
import static java.lang.Math.min;

public class SeamCarving {
    public static void main(String[] args) throws IOException {
        final BufferedImage input = ImageIO.read(new File(args[0]));
        final BufferedImage[] toPaint = new BufferedImage[]{input};
        final Frame frame = new Frame("Seams") {
            @Override
            public void update(Graphics g) {
                final BufferedImage im = toPaint[0];
                if (im != null) {
                    g.clearRect(0,0,getWidth(), getHeight());
                    g.drawImage(im,0,0,this);
                }
            }
        };
        frame.setSize(input.getWidth(), input.getHeight());
        frame.setVisible(true);

        BufferedImage out = input;
        for(int i = 0; i < 200; i++) {
            out = deleteVerticalSeam(out);
            toPaint[0]=out;
            frame.repaint();
        }
    }

    private static BufferedImage deleteVerticalSeam(BufferedImage input) {
        return deleteVerticalSeam(input, findVerticalSeam(input));
    }

    private static BufferedImage deleteVerticalSeam(final BufferedImage input, final int[] seam) {
        int w = input.getWidth(), h = input.getHeight();
        final BufferedImage out = new BufferedImage(w-1,h, BufferedImage.TYPE_INT_ARGB);
        for(int y = 0; y < h; y++) {
            for(int x = 0; x < seam[y]; x++) {
                out.setRGB(x,y,input.getRGB(x, y));
            }
            for(int x = seam[y]+1; x < w; x++) {
                out.setRGB(x-1,y,input.getRGB(x, y));
            }
        }
        return out;
    }

    private static int[] findVerticalSeam(BufferedImage input) {
        final int w = input.getWidth(), h = input.getHeight();
        final FloatImage intensities = FloatImage.fromBufferedImage(input);
        final FloatImage energy = computeEnergy(intensities);

        final FloatImage minima = FloatImage.createSameSize(energy);
        //First row is equal to the energy
        for(int x = 0; x < w; x++) {
            minima.set(x,0, energy.get(x,0));
        }
        //I assume that the rightmost pixel column in the energy image is garbage
        for(int y = 1; y < h; y++) {
            minima.set(0,y, energy.get(0,y) + min(minima.get(0, y - 1), minima.get(1, y - 1)));
            for(int x = 1; x < w-2; x++) {
                final float sum = energy.get(x,y) +
                        min(min(minima.get(x - 1, y - 1), minima.get(x, y - 1)), minima.get(x + 1, y - 1));
                minima.set(x,y, sum);
            }
            minima.set(w-2,y, energy.get(w-2,y) + min(minima.get(w-2, y - 1), minima.get(w-3, y - 1)));
        }

        //We find the minimum seam
        float minSum = Float.MAX_VALUE;
        int seamTip = -1;
        for(int x = 1; x < w-1; x++) {
            final float v = minima.get(x, h-1);
            if(v < minSum) {
                minSum=v;
                seamTip=x;
            }
        }

        //Backtrace the seam
        final int[] seam = new int[h];
        seam[h-1]=seamTip;
        for(int x = seamTip, y = h-1; y > 0; y--) {
            float left = x>0?minima.get(x-1, y-1):Float.MAX_VALUE;
            float up = minima.get(x, y-1);
            float right = x+1<w?minima.get(x+1, y-1):Float.MAX_VALUE;
            if(left < up && left < right) x=x-1;
            else if(right < up && right < left) x=x+1;
            seam[y-1]=x;
        }
        return seam;
    }

    private static FloatImage computeEnergy(FloatImage intensities) {
        int w = intensities.getWidth(), h = intensities.getHeight();
        final FloatImage energy = FloatImage.createSameSize(intensities);
        for(int x = 0; x < w-1; x++) {
            for(int y = 0; y < h-1; y++) {
                //I'm approximating the derivatives by subtraction
                float e = abs(intensities.get(x,y)-intensities.get(x+1,y))
                        + abs(intensities.get(x,y)-intensities.get(x,y+1));
                energy.set(x,y, e);
            }
        }
        return energy;
    }
}
import java.awt.image.BufferedImage;

public final class FloatImage {
    private final int width;
    private final int height;
    private final float[] data;

    public FloatImage(int width, int height) {
        this.width = width;
        this.height = height;
        this.data = new float[width*height];
    }

    public int getWidth() { return width; }

    public int getHeight() { return height; }

    public float get(final int x, final int y) {
        if(x < 0 || x >= width) throw new IllegalArgumentException("x: " + x);
        if(y < 0 || y >= height) throw new IllegalArgumentException("y: " + y);
        return data[x+y*width];
    }

    public void set(final int x, final int y, float value) {
        if(x < 0 || x >= width) throw new IllegalArgumentException("x: " + x);
        if(y < 0 || y >= height) throw new IllegalArgumentException("y: " + y);
        data[x+y*width] = value;
    }

    public static FloatImage createSameSize(final BufferedImage sample) {
        return new FloatImage(sample.getWidth(), sample.getHeight());
    }

    public static FloatImage createSameSize(final FloatImage sample) {
        return new FloatImage(sample.getWidth(), sample.getHeight());
    }

    public static FloatImage fromBufferedImage(final BufferedImage src) {
        final int width = src.getWidth();
        final int height = src.getHeight();
        final FloatImage result = new FloatImage(width, height);
        for(int x = 0; x < width; x++) {
            for(int y = 0; y < height; y++) {
                final int argb = src.getRGB(x, y);
                int r = (argb >>> 16) & 0xFF;
                int g = (argb >>> 8) & 0xFF;
                int b = argb & 0xFF;
                result.set(x,y, (r*0.3f+g*0.59f+b*0.11f)/255);
            }
        }
        return result;
    }

    public BufferedImage toBufferedImage(float scale) {
        final BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        for(int x = 0; x < width; x++) {
            for(int y = 0; y < height; y++) {
                final int intensity = ((int) (get(x, y) * scale)) & 0xFF;
                result.setRGB(x,y,0xFF000000 | intensity | intensity << 8 | intensity << 16);
            }
        }
        return result;
    }
}
Friday, April 08, 2011
Nerding with the Y-combinator
public class Combinators {
    interface F<A,B> { B apply(A x); }

    //Used for proper type checking
    private static interface FF<A, B> extends F<FF<A, B>, F<A, B>> {}

    //The Y-combinator
    public static <A, B> F<A, B> Y(final F<F<A, B>,F<A, B>> f) {
        return U(new FF<A, B>() {
            public F<A, B> apply(final FF<A, B> x) {
                return f.apply(new F<A, B>() {
                    public B apply(A y) {
                        return U(x).apply(y);
                    }
                });
            }
        });
    }

    //The U-combinator
    private static <A,B> F<A, B> U(FF<A, B> a) {
        return a.apply(a);
    }

    static F<F<Integer, Integer>, F<Integer, Integer>> factorialGenerator() {
        return new F<F<Integer, Integer>, F<Integer, Integer>>() {
            public F<Integer, Integer> apply(final F<Integer, Integer> fact) {
                return new F<Integer, Integer>() {
                    public Integer apply(Integer n) {
                        return n == 0 ? 1 : n * fact.apply(n-1);
                    }
                };
            }
        };
    }

    public static void main(String[] args) {
        F<Integer, Integer> fact = Y(factorialGenerator());
        System.out.println(fact.apply(6));
    }
}

Having the Y-combinator implemented in Java serves no practical purpose (Java supports recursion directly), but it was interesting to see whether it could be done with proper generics.
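To show that the combinator is not tied to factorial, here is another non-recursive generator it can fix. This one is my addition, not part of the original post; it computes Fibonacci numbers using the same F interface:

static F<F<Integer, Integer>, F<Integer, Integer>> fibonacciGenerator() {
    return new F<F<Integer, Integer>, F<Integer, Integer>>() {
        public F<Integer, Integer> apply(final F<Integer, Integer> fib) {
            return new F<Integer, Integer>() {
                public Integer apply(Integer n) {
                    //The "recursive" calls go through the fib parameter,
                    //which the Y-combinator wires back to this function
                    return n <= 1 ? n : fib.apply(n-1) + fib.apply(n-2);
                }
            };
        }
    };
}

//Y(fibonacciGenerator()).apply(10) returns 55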
Monday, April 04, 2011
Paper programming
We bought a "Segundamano" (lit. "second hand") classified-ads magazine and found a used Commodore 64 for US$ 200. My dad contacted the seller and we went to pick it up.
You have to keep in mind that this was 1989 and the social and economic landscape in Argentina was a mess. That year the inflation rate was over 3000% (that is not a typo), so those 200 dollars were a lot of money and my dad really made an effort.
If you are still paying attention, you might have noticed that I never mentioned a Datasette or a disk drive. All I got was a Commodore 64, plain and simple. But this was not going to stop me.
After we got it, my dad was concerned that I might get too obsessed with the computer, so he placed some additional constraints on when I could use it (the fact that we had only one TV might have also been a factor in his decision). I was allowed to use it only on Saturday mornings before noon.
In retrospect, I think that these two factors made me a better programmer.
At that time I had some programming experience, mostly in Logo: a Spanish-language Logo dialect that we used at school, where we had a Commodore 128 laboratory (about eight machines with disk drives and monitors). I started learning Logo when I was 8; by the time I was twelve I could also program a bit of BASIC, but not much, since literature was scarce.
One great thing about the Commodore 64 was that it came with BASIC but, most importantly, it came with a manual! The Commodore BASIC manual was my first programming book.
What happened was that I was forced to work as if I had punched cards. I would spend most of my spare time during the week writing programs in a notebook I had, thinking about ways to solve problems, and reading and re-reading the C64 manual.
On Saturday mornings I would wake up at 6 am, hook up the computer to the TV, and start typing whatever program I was working on that week: run it, debug it, improve it. Around noon my mom would start reminding me that time was up, so at that point I would list the BASIC source code and copy it back into my notebook.
It was during this time that I rediscovered things like the optimized Bubble sort, although I didn't know its name then (and wouldn't learn it for many more years). I still vividly remember the moment. One afternoon I was trying to figure out a way to sort an array, so I started playing with a deck of cards. I finally figured out that I could do several passes of comparisons and exchanges on adjacent cards, and that if I did exactly as many passes as the number of elements, the array would be sorted. I also noticed that the largest element would be at the end of the array after the first pass, the second largest would be in place after the second pass, and so on. This allowed me to save about half the number of comparisons.
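In code, that observation amounts to shrinking the inner loop's bound on every pass. A small Java sketch of the optimization (my reconstruction, of course; the original was C64 BASIC):

static void bubbleSort(int[] a) {
    for (int pass = 0; pass < a.length - 1; pass++) {
        //After each pass the largest remaining element settles at the end,
        //so the inner loop can stop one position earlier each time.
        //This saves roughly half the comparisons of the naive version.
        for (int i = 0; i < a.length - 1 - pass; i++) {
            if (a[i] > a[i + 1]) {
                int tmp = a[i];
                a[i] = a[i + 1];
                a[i + 1] = tmp;
            }
        }
    }
}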
The most valuable thing that I learned during this time is that when it comes to programming, thinking hard about the problem at hand before doing anything pays off, big time!
I eventually saved enough for a Datasette (it cost 40 dollars; my dad paid half of it) and I was able to build much larger programs.
The biggest one I wrote was an interpreter with a multitasking runtime. It was very basic: I had no notion of a lexer, let alone a parser. Commands were composed of a single letter (the first one on the line) and several arguments. You could declare procedures with names and call them. It was like pidgin BASIC (since my exposure was to BASIC), but the procedure structure resembled Logo.
Programs were stored in memory in several arrays. There was a main one that contained statements, and a couple more to hold procedure names and their offsets into the main array.
The runtime divided the screen into 4 areas (this was a 40x25 text screen, so not much room was left for each), and each window could run a separate program. The runtime would do a round robin over the running programs, executing a single statement of each on every pass. For this it had to keep a separate program counter and separate variables for each. I even planned to add windowing calls to create custom windows, but I never got to finish it.
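As a rough illustration of that scheduling scheme, here is a hypothetical reconstruction in Java; the names (Task, step) and the structure are mine, the original was BASIC:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RoundRobinRuntime {
    //One Task per screen window, each with its own program counter and variables
    static final class Task {
        int pc;                                             //index into the statement array
        final Map<String, Integer> vars = new HashMap<>();  //per-task variables
        boolean running = true;
    }

    final List<String> statements = new ArrayList<>(); //the "main array" of statements
    final List<Task> tasks = new ArrayList<>();

    void run() {
        boolean anyRunning = true;
        while (anyRunning) {
            anyRunning = false;
            //Round robin: execute exactly one statement of each task per pass
            for (Task t : tasks) {
                if (t.running) {
                    step(t);
                    anyRunning = true;
                }
            }
        }
    }

    void step(Task t) {
        //Fetch the next statement and advance this task's program counter
        String stmt = statements.get(t.pc++);
        //... decode the single-letter command and execute it against t.vars ...
        if (t.pc >= statements.size()) t.running = false;
    }
}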
It was also at this time that I got interested in electronics, so I built a few contraptions controlled by the C64, but that's a tale for another post.
Wednesday, March 30, 2011
Matching Regular Expressions using its Derivatives
Introduction
Regular expressions are expressions that describe a set of strings over a particular alphabet. We will begin with a crash course on simple regular expressions. You can assume that we're talking about text and characters, but in fact this can be generalized to any (finite) alphabet.

The definition of regular expressions is quite simple¹. There are three basic (i.e. terminal) regular expressions:
- The null expression (denoted as: ∅) which never matches anything
- The empty expression, which only matches the empty string (I will use ε to represent this expression, since it's customary²)
- The character literal expression (usually called 'c'), which matches a single character
These three basic building blocks can be combined using some operators to form more complex expressions:
- sequencing of regular expressions matches two regular expressions in sequence
- alternation matches one of the two sub-expressions (usually represented by a '|' symbol)
- the repetition operator (aka. Kleene's star) matches zero or more repetitions of the specified subexpression
Some examples will make this clearer:
- The expression 'a' will only match the character 'a'. Similarly, 'b' will only match 'b'. If we combine them by sequencing, 'ab' will match 'ab'.
- The expression 'a|b' will match either 'a' or 'b'.
- If we combine sequencing with alternation, as in '(a|b)(a|b)' (the parentheses are just for clarity), it will match: 'aa', 'ab', 'ba' or 'bb'.
- Kleene's star as mentioned before matches zero or more of the preceding subexpression. So the expression 'a*' will match: '', 'a', 'aa', 'aaa', 'aaaa', ...
- We can do more complex combinations, such as 'ab*(c|ε)', which will match things like 'a', 'ab', 'ac', 'abc', 'abb', 'abbc', ... that is, any string starting with an 'a', followed by zero or more 'b's, and optionally ending in a 'c'.
Typical implementations of regular expression matchers convert the regular expression to an NFA or a DFA (which are kinds of finite state machines).
Anyway, a few weeks ago I ran into a post about using the derivative of a regular expression for matching.
It is a quite intriguing concept and worth exploring. The original post gives an implementation in Scheme³, but leaves out some details that make it a bit tricky to implement. I'll try to walk you through the concept, up to a working implementation in Java.
Derivative of a Regular Expression
So, first question: what is the derivative of a regular expression?

The derivative of a regular expression with respect to a character 'c' is a new regular expression that matches what the original expression would match, assuming it had just matched the character 'c'.
As usual, some examples will (hopefully) help clarify things:
- The expression 'foo' derived with respect to 'f' yields the expression: 'oo' (which is what's left to match).
- The expression 'ab|ba' derived with respect to 'a', yields the expression: 'b'
Similarly, the expression 'ab|ba' derived with respect to 'b', yields the expression: 'a'
- The expression '(ab|ba)*' derived with respect to 'a', yields the expression: 'b(ab|ba)*'
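A quick sanity check: with the RegEx class we are about to build (the complete source is at the end of this post), these examples can be verified directly, since the toString of a derived expression prints what is left to match:

RegEx foo = RegEx.text("foo");
System.out.println(foo.derive('f'));               //prints: oo
System.out.println(foo.derive('f').matches("oo")); //prints: true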
To implement this, we will build everything around an abstract RegEx class. The skeleton of this class looks like this:

public abstract class RegEx {
    public abstract RegEx derive(char c);
    public abstract RegEx simplify();
    //...
    public static final RegEx unmatchable = new RegEx() { /* ... */ };
    public static final RegEx empty = new RegEx() { /* ... */ };
}

It includes constants for the unmatchable (null) and empty expressions, and derive and simplify methods which we will cover in detail (but not just now).
Before we go into detail about the rules of regular expression derivation, let's take a small -but necessary- detour and cover some details that will help us get a working implementation.
The formalization of the derivative of a regular expression depends on a set of simplifying constructors that are necessary for a correct implementation. We will define these a bit more formally and build the skeleton of their implementation at this point.
Let's begin with the sequencing operation, for which we define the following constructor (the spacing is just for readability):

seq( ∅, _ )   = ∅
seq( _, ∅ )   = ∅
seq( ε, r2 )  = r2
seq( r1, ε )  = r1
seq( r1, r2 ) = r1 r2

The first two definitions state that a sequence of the null expression (∅, which is unmatchable) with any other expression is the same as the null expression (i.e. it will not match anything).
The third and fourth definitions state that a sequence of the empty expression (ε, which matches only the empty string) with any other expression is the same as just having the other expression (the empty expression is the identity with respect to the sequence operator).
The fifth and last definition just builds a regular sequence.
With this, we can draft a first implementation of a sequence constructor (in gang-of-four parlance, it's a factory method):

public RegEx seq(final RegEx r2) {
    final RegEx r1 = this;
    if(r1 == unmatchable || r2 == unmatchable) return unmatchable;
    if(r1 == empty) return r2;
    if(r2 == empty) return r1;
    return new RegEx() {
        // ....
    };
}

I'm leaving out the details of the returned RegEx for the time being; we will come back to them soon enough.
The alternation operator also has a simplifying constructor, analogous to that of the sequence operator:

alt( ε, _ )   = ε
alt( _, ε )   = ε
alt( ∅, r2 )  = r2
alt( r1, ∅ )  = r1
alt( r1, r2 ) = r1 | r2

If you look closely, the first two definitions are rather odd: they reduce an alternation with the empty expression to the empty expression (ε). This is because the simplifying constructors are used as part of a simplification function that reduces a regular expression to the empty expression if it matches the empty string. We'll see how this works with the rest of it in a while.
The third and fourth definitions are fairly logical: an alternation with an unmatchable expression is the same as the other alternative (the unmatchable expression is the identity with respect to the alternation operator).
The last one is the constructor.
Taking these details into account, we can build two factory methods, one internal and one external:

private RegEx alt0(final RegEx r2) {
    final RegEx r1 = this;
    if(r1 == empty || r2 == empty) return empty;
    return alt(r2);
}

public RegEx alt(final RegEx r2) {
    final RegEx r1 = this;
    if(r1 == unmatchable) return r2;
    if(r2 == unmatchable) return r1;
    return new RegEx() {
        //.....
    };
}

The internal one, alt0, includes the first two simplification rules; the public one is user-facing. That is, it has to let you build something like 'ab*(c|ε)'.
Finally, the repetition operator (Kleene's star) has the following simplification rules:
rep( ∅ )  = ε
rep( ε )  = ε
rep( re ) = re*

The first definition states that a repetition of the unmatchable expression can still match the empty string (zero repetitions), so it reduces to ε.
The second definition states that a repetition of the empty expression is the same as matching the empty expression.
And as usual, the last one is the constructor for all other cases.
A skeleton for the rep constructor is rather simple:

public RegEx rep() {
    final RegEx re = this;
    if(re == unmatchable || re == empty) return empty;
    return new RegEx() {
        // ....
    };
}
Simplify & Derive
As hinted earlier on, derivation is based on a simplification function. This simplification function reduces a regular expression to the empty regular expression (ε) if it matches the empty string, or to the unmatchable expression (∅) if it does not. The simplification function is defined as follows:

s(∅) = ∅
s(ε) = ε
s(c) = ∅
s(re1 re2)   = seq(s(re1), s(re2))
s(re1 | re2) = alt(s(re1), s(re2))
s(re*) = ε

Note that this function depends on the simplifying constructors we described earlier on. Suppose that we want to check whether the expression 'ab*(c|ε)' matches the empty string; if we do all the substitutions:
- seq(s(ab*),s(c|ε))
- seq(s(seq(s(a), s(b*))),s(alt(s(c), s(ε))))
- seq(s(seq(∅, s(ε))),s(alt(∅, ε)))
- seq(s(seq(∅, ε)),s(ε))
- seq(s(∅),ε)
- seq(∅,ε)
- ∅
The result is ∅, so 'ab*(c|ε)' does not match the empty string. Compare this with an expression that does match the empty string, such as 'a*|b':

- alt(s(a*), s(b))
- alt(ε, ∅)
- ε
The derivative itself is defined by the following rules:

D( ∅, _ ) = ∅
D( ε, _ ) = ∅
D( c, x ) = if c == x then ε else ∅
D(re1 re2, x)   = alt( seq( s(re1), D(re2, x) ), seq( D(re1, x), re2 ) )
D(re1 | re2, x) = alt( D(re1, x), D(re2, x) )
D(re*, x) = seq( D(re, x), rep(re) )

The first two definitions state that deriving the unmatchable or the empty expression with respect to any character yields the unmatchable expression. The third definition states that a character matcher (for example 'a') derived with respect to the same character yields the empty expression, and otherwise yields the unmatchable expression. The fourth rule is a bit more involved, but trust me, it works. The fifth rule states that the derivative of an alternation is the alternation of the derivatives (suitably simplified). And the last one describes how to derive a repetition; for example, D('(ba)*', 'b') yields 'a(ba)*'. We now have enough information to implement the simplify and derive methods.
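These rules map almost one-to-one onto the derive and simplify methods of the anonymous classes returned by seq, alt and rep. The following excerpts are lifted from the complete class shown below:

//Inside seq: D(re1 re2, x) = alt( seq( s(re1), D(re2, x) ), seq( D(re1, x), re2 ) )
@Override
public RegEx derive(char c) {
    return r1.simplify().seq(r2.derive(c))
            .alt0(r1.derive(c).seq(r2));
}

//Inside alt: D(re1 | re2, x) = alt( D(re1, x), D(re2, x) )
@Override
public RegEx derive(char c) {
    return r1.derive(c).alt0(r2.derive(c));
}

//Inside rep: D(re*, x) = seq( D(re, x), rep(re) ), and s(re*) = ε
@Override
public RegEx derive(char c) {
    return re.derive(c).seq(re.rep());
}

@Override
public RegEx simplify() {
    return empty;
}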
Matching
If you haven't figured it out by now, matching works by walking the string we're checking character by character and successively deriving the regular expression, until we either run out of characters, at which point we simplify the derived expression and see if it matches the empty string, or we end up with the unmatchable expression, at which point it is impossible for the rest of the string to match. An iterative implementation of a match method is as follows:

public boolean matches(final String text) {
    RegEx d = this;
    String s = text;
    //The 'unmatchable' test is not strictly necessary, but avoids unnecessary derivations
    while(!s.isEmpty() && d != unmatchable) {
        d = d.derive(s.charAt(0));
        s = s.substring(1);
    }
    return d.simplify() == empty;
}

If we match 'ab*(c|ε)' against the text "abbc", these are the expressions being derived at each step, and the input remaining after each derivation:
- D(re, a) = ab*(c|ε) , rest: "bbc"
- D(re, b) = b*(c|ε) , rest: "bc"
- D(re, b) = b*(c|ε) , rest: "c"
- D(re, c) = b*(c|ε) , rest: ""

After this last derivation the result simplifies to ε, so the string matches.
Implementation
The following is the complete class with all methods implemented. I provide a basic implementation of the toString method (which is nice for debugging), and a helper text method, which is a shortcut to build an expression for a sequence of characters. This class is fairly easy to modify to match over a different alphabet, such as arbitrary objects and Iterables instead of Strings (it can easily be generified).
public abstract class RegEx {
    public abstract RegEx derive(char c);
    public abstract RegEx simplify();

    public RegEx seq(final RegEx r2) {
        final RegEx r1 = this;
        if(r1 == unmatchable || r2 == unmatchable) return unmatchable;
        if(r1 == empty) return r2;
        if(r2 == empty) return r1;
        return new RegEx() {
            @Override
            public RegEx derive(char c) {
                return r1.simplify().seq(r2.derive(c))
                        .alt0(r1.derive(c).seq(r2));
            }

            @Override
            public RegEx simplify() {
                return r1.simplify().seq(r2.simplify());
            }

            @Override
            public String toString() {
                return r1 + "" + r2;
            }
        };
    }

    private RegEx alt0(final RegEx r2) {
        final RegEx r1 = this;
        if(r1 == empty || r2 == empty) return empty;
        return alt(r2);
    }

    public RegEx alt(final RegEx r2) {
        final RegEx r1 = this;
        if(r1 == unmatchable) return r2;
        if(r2 == unmatchable) return r1;
        return new RegEx() {
            @Override
            public RegEx derive(char c) {
                return r1.derive(c).alt0(r2.derive(c));
            }

            @Override
            public RegEx simplify() {
                return r1.simplify().alt0(r2.simplify());
            }

            @Override
            public String toString() {
                return "(" + r1 + "|" + r2 + ")";
            }
        };
    }

    public RegEx rep() {
        final RegEx re = this;
        if(re == unmatchable || re == empty) return empty;
        return new RegEx() {
            @Override
            public RegEx derive(char c) {
                return re.derive(c).seq(re.rep());
            }

            @Override
            public RegEx simplify() {
                return empty;
            }

            @Override
            public String toString() {
                String s = re.toString();
                return s.startsWith("(") ? s + "*" : "(" + s + ")*";
            }
        };
    }

    public static RegEx character(final char exp) {
        return new RegEx() {
            @Override
            public RegEx derive(char c) {
                return exp == c ? empty : unmatchable;
            }

            @Override
            public RegEx simplify() {
                return unmatchable;
            }

            @Override
            public String toString() {
                return "" + exp;
            }
        };
    }

    public static RegEx text(final String text) {
        RegEx result;
        if(text.isEmpty()) {
            result = empty;
        } else {
            result = character(text.charAt(0));
            for (int i = 1; i < text.length(); i++) {
                result = result.seq(character(text.charAt(i)));
            }
        }
        return result;
    }

    public boolean matches(final String text) {
        RegEx d = this;
        String s = text;
        //The 'unmatchable' test is not strictly necessary, but avoids unnecessary derivations
        while(!s.isEmpty() && d != unmatchable) {
            d = d.derive(s.charAt(0));
            s = s.substring(1);
        }
        return d.simplify() == empty;
    }

    private static class ConstantRegEx extends RegEx {
        private final String name;

        ConstantRegEx(String name) { this.name = name; }

        @Override
        public RegEx derive(char c) { return unmatchable; }

        @Override
        public RegEx simplify() { return this; }

        @Override
        public String toString() { return name; }
    }

    public static final RegEx unmatchable = new ConstantRegEx("<null>");
    public static final RegEx empty = new ConstantRegEx("<empty>");

    public static void main(String[] args) {
        final RegEx regEx = character('a')
                .seq(character('b').rep())
                .seq(character('c').alt(empty));
        if(regEx.matches("abbc")) {
            System.out.println("Matches!!!");
        }
    }
}

Disclaimer: any bugs/misconceptions regarding this are my errors, so take everything with a grain of salt. Feel free to use the code portrayed here for any purpose whatsoever; if you do something cool with it I'd like to know, but no pressure.

Footnotes
1 Sometimes the simpler something is, the harder it is to understand. See lambda calculus for example.
2 I will not use ε (epsilon) to also represent the empty string, since I think that is confusing, even though it is also customary.
3 I think that the Scheme implementation in that article won't work if you use the repetition operator, but I haven't tested it. It might just as well be that my Scheme-foo is a bit rusty.