
Wednesday, September 07, 2011

Parallelism & Performance: Why O-notation matters

UPDATE: There was a bug, found by lolplz, in the way the suffix map was built in the original version of this post. I have updated the post with the fix; the content has not changed significantly.

Aleksandar Prokopec presented a very interesting talk about Scala Parallel Collections at Scala Days 2011. I watched it today and it gave me some food for thought.

He opens his talk with an intriguing code snippet:

for {
  s <- surnames
  n <- names
  if s endsWith n
} yield (n , s)
Given what presumably are two sequences of strings, this code builds all pairs (name, surname) in which the surname ends with the name.

The algorithm is pure brute force; its order is roughly O(N^2) (it is basically a filtered Cartesian product). He then goes on to show that, just by using parallel collections:

for {
  s <- surnames.par
  n <- names.par
  if s endsWith n
} yield (n , s)
He can leverage all the cores and reduce the runtime by a factor of two or four, depending on the number of cores available in the machine where it runs. For example, the non-parallel version runs in 1040ms, while the parallel one runs in 575ms with two cores and in 305ms with four. That is indeed very impressive for such a minor change.

What concerns me is that, no matter how many cores you add, the problem as presented is still O(N^2). It is true that many useful problems can only be sped up by throwing more hardware at them, but most of the time, using the right data representation can yield even bigger gains.

If we use this problem as an example, we can build a slightly more complex implementation that is hopefully a lot faster. The approach I'm taking is to build a suffix map for the surnames. There are more memory-efficient data structures for this, but for simplicity I'll use a Map:

val suffixMap = collection.mutable.Map[String, List[String]]().withDefaultValue(Nil)
for (s <- surnames; i <- 0 until s.length; suffix = s.substring(i)) 
    suffixMap(suffix) = s :: suffixMap(suffix)
Having built the suffix map, we can naively rewrite the loop to use it (instead of the Cartesian product):
for {
  n <- names
  s <- suffixMap(n)
} yield (n , s)
In theory, this loop is roughly O(N) (assuming the map is a HashMap), since it now does a constant number of operations for each name it processes, rather than processing all the names for each surname (I'm ignoring the fact that the map returns lists).
Note: The algorithmic order does not change much if we take the suffix map construction into account. Assume we have S surnames and N names; the order of building the suffix map is O(S) and the order of building the pairs is O(N), so the total order of the algorithm is O(N+S). If we assume that N is approximately equal to S, the order is O(N+N), which is the same as O(2N), which simplifies to O(N).
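The note above can be condensed into a line of algebra (under the same rough approximation that the per-string work is constant):

```latex
\begin{aligned}
T(N,S) &= \underbrace{O(S)}_{\text{build suffix map}} + \underbrace{O(N)}_{\text{look up names}} = O(N+S)\\
       &\approx O(N+N) = O(2N) = O(N) \qquad \text{assuming } S \approx N
\end{aligned}
```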
So, let's see if this holds up in real life. For this I wrote the following Scala script, which runs a few passes of several different implementations:
#!/bin/sh
exec scala -deprecation -savecompiled "$0" "$@"
!#
def benchmark[T](name: String)(body: => T): T = {
    val start = System.currentTimeMillis()
    val result = body
    val end = System.currentTimeMillis()
    println(name + ": " + (end - start))
    result
}

val surnames = (1 to 10000).map("Name" + _)
val names    = (1 to 10000).map("Name" + _)

val suffixMap = collection.mutable.Map[String, List[String]]().withDefaultValue(Nil)
for (s <- surnames; i <- 0 until s.length; suffix = s.substring(i)) 
    suffixMap(suffix) = s :: suffixMap(suffix)

for( i <- 1 to 5 ) {
    println("Run #" + i)
    benchmark("Brute force") {
        for {
            s <- surnames
            n <- names
            if s endsWith n
        } yield (n , s)
    }

    benchmark("Parallel") {
        for {
            s <- surnames.par
            n <- names.par
            if s endsWith n
        } yield (n , s)
    }

    benchmark("Smart") {
        val suffixMap = collection.mutable.Map[String, List[String]]().withDefaultValue(Nil)
        for (s <- surnames; i <- 0 until s.length; suffix = s.substring(i)) 
            suffixMap(suffix) = s :: suffixMap(suffix)
        for {
            n <- names
            s <- suffixMap(n)
        } yield (n , s)
    }

    benchmark("Smart (amortized)") {
        for {
            n <- names
            s <- suffixMap(n)
        } yield (n , s)
    }
}
There are four implementations:
  • Brute force: the original implementation.
  • Parallel: same as before, but using parallel collections.
  • Smart: using the suffix map (and measuring the map construction).
  • Smart (amortized): same as before, but with the suffix map construction cost amortized.

Benchmark Results

Running this script on a four-core machine, I get the following results:
Run #1
Brute force: 2158
Parallel: 1355
Smart: 153
Smart (amortized): 27
Run #2
Brute force: 1985
Parallel: 899
Smart: 82
Smart (amortized): 7
Run #3
Brute force: 1947
Parallel: 716
Smart: 69
Smart (amortized): 5
Run #4
Brute force: 1932
Parallel: 714
Smart: 67
Smart (amortized): 6
Run #5
Brute force: 1933
Parallel: 713
Smart: 68
Smart (amortized): 5
As expected, the parallel version runs about three times as fast as the naive one, but the implementation using the "Smart" approach runs almost 30 times faster than the naive one. If we are able to amortize the cost of building the suffix map, the speed-up is even more staggering, at a whopping 380 times (although amortizing is not always possible)!

What we can conclude from this, paraphrasing Fred Brooks1, is that when it comes to performance there is no silver bullet. The basics matter, maybe now more than ever.

Analyzing algorithmic order, even in its most basic form (making huge approximations), is a very practical tool for solving hard performance problems. Note also that the simple approach used here is itself parallelizable, so for a large enough problem it might gain additional speedup from parallel collections.
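To illustrate that the suffix-map approach parallelizes too, here is the same idea sketched in Java, with the lookup phase run through parallel streams. This is just a sketch of mine (class and variable names are not from the talk or the script above), not a benchmark:

```java
import java.util.*;
import java.util.stream.*;

public class SuffixJoin {
    public static void main(String[] args) {
        List<String> surnames = IntStream.rangeClosed(1, 10000)
                .mapToObj(i -> "Name" + i).collect(Collectors.toList());
        List<String> names = new ArrayList<>(surnames);

        // Build the suffix map sequentially, same idea as the Scala version.
        Map<String, List<String>> suffixMap = new HashMap<>();
        for (String s : surnames)
            for (int i = 0; i < s.length(); i++)
                suffixMap.computeIfAbsent(s.substring(i), k -> new ArrayList<>()).add(s);

        // The lookup phase parallelizes trivially: each name is independent
        // and the map is only read, never written.
        List<String[]> pairs = names.parallelStream()
                .flatMap(n -> suffixMap.getOrDefault(n, Collections.<String>emptyList())
                                       .stream()
                                       .map(s -> new String[]{n, s}))
                .collect(Collectors.toList());

        // With this synthetic data each name matches exactly one surname.
        System.out.println(pairs.size());
    }
}
```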

Don't get me wrong, I love Scala parallel collections and parallel algorithms are increasingly important, but they are by no means a magical solution that can be used for any problem (and I think Aleksandar is with me in this one).

Footnotes

1 misquoting is probably more accurate.

Friday, February 25, 2011

Performance Invariants (Part II)

A few days ago I wrote a post about performance invariants. The basic idea behind them is that there should be an easy way to declare performance constraints at the source code level, and that you should be able to check them every time you run your unit tests. To make a long story short, I have been a busy little bee for the last few days and managed to build a reasonable proof-of-concept.

Let's start with a simple example:
import performance.annotation.Expect;
...
class Test {
    @Expect("InputStream.read == 0")
    static void process(List<String> list) {
        //...
    }
}
What we're asserting here is that the number of calls to methods named read, defined in classes named InputStream, must be exactly zero.

If we want to exclude basically all IO, we can change the expectation to:
import performance.annotation.Expect;
...
class Test {
    @Expect("InputStream.read == 0 && OutputStream.write == 0")
    static void process(List<String> list) {
        //...
    }
}
Note that these are checked even for code that is called indirectly by the method process.

If we add an innocent looking println:
    @Expect("InputStream.read == 0 && OutputStream.write == 0")
    static void process(List<String> list) {
        System.out.println("Hi!");
        //...
    }

And run it with the agent enabled by using:
~>java -javaagent:./performance-1.0-SNAPSHOT-jar-with-dependencies.jar \
       -Xbootclasspath/a:./performance-1.0-SNAPSHOT-jar-with-dependencies.jar Test
You should get something like the following output:
Hi!
Exception in thread "main" java.lang.AssertionError: Method 'Test.process' did not fulfil: InputStream.read == 0 && OutputStream.write == 0
         Matched: [#OutputStream.write=7, #InputStream.read=0]
         Dynamic: []
        at performance.runtime.PerformanceExpectation.validate(PerformanceExpectation.java:69)
        at performance.runtime.ThreadHelper.endExpectation(ThreadHelper.java:52)
        at performance.runtime.Helper.endExpectation(Helper.java:61)
        at Test.process(Test.java:17)
        at Test.main(Test.java:39)
This is witchcraft, I say! ... well, kind of.

Let's stop for a moment and consider what's going on here. Notice the first line of the output: it contains the text "Hi!" that we printed. This happens because the check is performed after the method process finishes. The "Matched" line shows how many times each method matched during the execution of process. Ignore the "Dynamic" list for just a second.
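I'll go into the agent's internals in a follow-up post, but the core bookkeeping can be sketched independently of the bytecode weaving: keep per-thread counters keyed by a method pattern, bump them on every matched call, and evaluate the expectation when the annotated method exits. A minimal illustration of that idea (all names here are mine, not the library's actual API):

```java
import java.util.*;

public class CountingSketch {
    // Per-thread call counters, keyed by "Class.method" pattern.
    private static final ThreadLocal<Map<String, Integer>> COUNTS =
            ThreadLocal.withInitial(HashMap::new);

    // The agent would inject calls to this into instrumented methods.
    static void recordCall(String method) {
        COUNTS.get().merge(method, 1, Integer::sum);
    }

    static int countOf(String method) {
        return COUNTS.get().getOrDefault(method, 0);
    }

    public static void main(String[] args) {
        // Simulate seven writes happening inside the annotated method.
        for (int i = 0; i < 7; i++) recordCall("OutputStream.write");

        // At method exit, the expectation is evaluated against the counters.
        if (countOf("InputStream.read") != 0 || countOf("OutputStream.write") != 0)
            System.out.println("did not fulfil: InputStream.read == 0 && OutputStream.write == 0"
                    + " [#OutputStream.write=" + countOf("OutputStream.write")
                    + ", #InputStream.read=" + countOf("InputStream.read") + "]");
    }
}
```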

Let's try something a bit more interesting:
    class Customer { /*... */}
    //...
    @Expect("Statement.executeUpdate < ${customers.size}")
    void storeCustomers(List<Customer> customers) {
        //...
    }
Note the ${customers.size} in the expression: intuitively, it means that we want to take the size of the list as an upper bound. It's like the poor programmer's big-O notation. If we run this, but execute two updates per customer (instead of one, as asserted), we get:
Exception in thread "main" java.lang.AssertionError: Method 'Test.storeCustomers' did not fulfil: Statement.executeUpdate < ${customers.size}
         Matched: [#Statement.executeUpdate=50]
         Dynamic: [customers.size=25.0]
        at performance.runtime.PerformanceExpectation.validate(PerformanceExpectation.java:69)
        at performance.runtime.ThreadHelper.endExpectation(ThreadHelper.java:52)
        at performance.runtime.Helper.endExpectation(Helper.java:61)
        at Test.storeCustomers(Test.java:19)
        at Test.main(Test.java:42)
Check the third line: this time, the "Dynamic" list contains the length of the list. In general, expressions of the form ${a.b.c.d} are called dynamic values. They refer to arguments, instance variables, or static variables. For example:
  • ${static.CONSTANT} refers to a variable named CONSTANT in the current class.
  • ${this.instance} refers to a variable named 'instance' in the current object (only valid for instance methods).
  • ${n} refers to an argument named 'n' (this only works if the class has debug information)
  • ${3} refers to the fourth argument from the left (zero-based indexing)
All dynamic values MUST yield a numeric value; otherwise a failure is reported at runtime. Currently the library will also complain if any dynamic value is null.
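As a rough sketch of how such a resolver might work (this is my guess at the mechanism, not the library's actual code): walk the dotted path with reflection, trying a no-arg method and then a public field at each step, and require a Number at the end.

```java
import java.lang.reflect.*;
import java.util.*;

public class DynamicValueSketch {
    // Resolve a dotted path like "customers.size" against a root object.
    static double resolve(Object root, String path) throws Exception {
        Object current = root;
        for (String part : path.split("\\.")) {
            // Mirrors the library's rule: a null along the path is an error.
            Objects.requireNonNull(current, "dynamic value is null at '" + part + "'");
            Object next;
            try {
                Method m = current.getClass().getMethod(part); // try a no-arg method
                next = m.invoke(current);
            } catch (NoSuchMethodException e) {
                Field f = current.getClass().getField(part);   // fall back to a field
                next = f.get(current);
            }
            current = next;
        }
        if (!(current instanceof Number))
            throw new IllegalStateException(path + " did not yield a number");
        return ((Number) current).doubleValue();
    }

    public static void main(String[] args) throws Exception {
        List<String> customers = new ArrayList<>(Arrays.asList("a", "b", "c"));
        System.out.println(resolve(customers, "size")); // prints 3.0
    }
}
```

This also explains why the "Dynamic" list above shows customers.size=25.0 as a double: everything is coerced to a numeric value at the end of the path.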

Although this is an early implementation, it is enough to start writing performance invariants that can be checked every time you run your unit tests.
Enough for today; in a follow-up post I'll go into the internals of the agent. If you want to browse the source code or try it out, grab a copy from GitHub.

Tuesday, February 22, 2011

Performance Invariants

UPDATE: A newer post on this subject can be found here

Let's start with a problem: How do you make unit tests that test for performance?

It might seem simple, but consider that:
  • Tests must be stable across hardware/software configurations
  • Machine workload should not affect results (at least in normal situations)

A friend of mine (Fernando Rodriguez-Olivera, if you must know) thought of the following (among many other things):
For each test run, record interesting metrics, such as:
  • specific method calls
  • number of queries executed
  • number of I/O operations
  • etc.
Then, after the test run, assert that these values are under a certain threshold; if they're not, fail the test.
He even implemented a proof-of-concept that used BeanShell to record these stats to a file during the test run and check the constraints after the fact.

Yesterday I was going over these ideas while preparing a presentation on code quality and something just clicked: annotate methods with performance invariants.
The concept is similar to pre/postconditions. Each annotation is basically a postcondition on the method call, stating which performance "promises" the method makes.

For example you should be able to do something like:
@Ensure("queryCount <= 1")
public CustomerInfo loadCustomerInfo() {...}
Or maybe something like this:
@Ensure("count(java.lang.Comparable.compareTo) < ceil(log(this.customers.size()))")
public CustomerInfo findById(String id) {...}
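There is nothing to run yet, but a logarithmic promise like that one can already be checked by hand with a counting wrapper. A minimal sketch, assuming binary search and a budget of ceil(log2 n) + 1 comparisons (all names here are illustrative, not part of any proposed API):

```java
import java.util.*;

public class ComparisonBudget {
    // A Comparable wrapper that counts compareTo calls.
    // (A static counter is fine for this single-threaded sketch.)
    static final class Counted implements Comparable<Counted> {
        static int comparisons = 0;
        final int value;
        Counted(int value) { this.value = value; }
        public int compareTo(Counted other) {
            comparisons++;
            return Integer.compare(value, other.value);
        }
    }

    public static void main(String[] args) {
        List<Counted> customers = new ArrayList<>();
        for (int i = 0; i < 1024; i++) customers.add(new Counted(i));

        Counted.comparisons = 0;
        Collections.binarySearch(customers, new Counted(700));

        // The "promise": binary search stays within ~ceil(log2(n)) + 1 comparisons.
        int bound = (int) Math.ceil(Math.log(customers.size()) / Math.log(2)) + 1;
        System.out.println(Counted.comparisons <= bound);
    }
}
```

The annotation-based version would automate exactly this kind of bookkeeping, without having to wrap the domain objects by hand.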

These promises would be enabled only during testing, since checking them might be a bit expensive for a production system.

As this is quite recent, I don't have anything working (yet), but I think it's worth exploring.
If I manage to find some time to build it, I'll post updates here.