Archive for the 'TDD' Category

TDD in Clojure, part 2 (in which I recover fairly gracefully from a stupid decision)

Part 1
Part 3

I ended Part 1 saying that my next step would be to implement a function that counts the number of living neighbors a cell has. Given that we’re already pretending (through stubbing) that a living? function exists, living-neighbor-count is pretty trivial if we also pretend we’ve got a neighbors function:
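The snippet itself isn't shown; it might have looked something like the sketch below. (In the post, `neighbors` and `living?` were merely stubbed; the stand-in definitions here are assumptions, included only so the example is self-contained.)

```clojure
;; Stand-ins (assumptions): a world is a set of living [x y] locations.
(def world #{[0 0] [0 1] [0 2]})                 ; a vertical blinker
(defn living? [cell] (contains? world cell))
(defn neighbors [[x y]]
  (for [dx [-1 0 1], dy [-1 0 1]
        :when (not= [dx dy] [0 0])]
    [(+ x dx) (+ y dy)]))

;; The function under discussion: count the living cells around a cell.
(defn living-neighbor-count [cell]
  (count (filter living? (neighbors cell))))
```

With those stand-ins, `(living-neighbor-count [0 1])` counts the two living cells above and below the blinker's center.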

Following my “mapping, like accessors, is too simple to test” guideline, I almost didn’t write a test. But what the heck:

Once the test passes, we need to write neighbors. To implement it, we’re going to have to take cells apart (to get x and y coordinates) and put them back together (to create neighbors). So I don’t see any point to using stubs and dummy variables like ...cell... in this test:

Boldly, I will here use one test to define both cell-at and neighbors (as well as the test helper have-coordinates that checks a list of cells against a list of coordinates).
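That single test isn't reproduced here; it might have looked roughly like this reconstruction, using the `have-coordinates` helper named above:

```clojure
(example
 (neighbors (cell-at 1 1)) => (have-coordinates [[0 0] [0 1] [0 2]
                                                 [1 0]       [1 2]
                                                 [2 0] [2 1] [2 2]]))
```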

(If I were more sensitive to that small voice in my head that warns I’m going astray, I would have heard something around now, but I ignored it. So we will too.)

Enter the REPL

My thought about how to implement neighbors has three steps, so I’ll try them out in the REPL. First, I’ll make (x,y) pairs to add and subtract from the original cell’s coordinates:
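That first step might have been a `for` comprehension over the offsets, something like this reconstruction:

```clojure
(for [x [-1 0 1], y [-1 0 1]] [x y])
;; => ([-1 -1] [-1 0] [-1 1] [0 -1] [0 0] [0 1] [1 -1] [1 0] [1 1])
```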

That’s good, except (0, 0) shouldn’t be in there. (A cell can’t be its own neighbor.) So I need to delete that:

(remove #{[0 0]} product) is a Clojure idiom. remove returns its second (sequence) argument, omitting any element that the first argument (a function) returns truthy for. #{x} is the set containing x. In Clojure, sets act as functions that return something truthy iff their single argument is in the set. That is:
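The missing figure presumably demonstrated the idiom at the REPL; a reconstruction:

```clojure
(#{[0 0]} [0 0])   ;; => [0 0]  (truthy: the element is in the set)
(#{[0 0]} [3 4])   ;; => nil    (falsey: it isn't)

(remove #{[0 0]} (for [x [-1 0 1], y [-1 0 1]] [x y]))
;; => ([-1 -1] [-1 0] [-1 1] [0 -1] [0 1] [1 -1] [1 0] [1 1])
```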

Finally, I need a function that shifts a cell by an offset. For the REPL, I’ll pretend the cell is just an [x y] vector. (We have yet to define what it really is.)
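The shifting function might have been as small as this sketch (the name `shift` is an assumption):

```clojure
;; Shift an [x y] cell by a [dx dy] offset.
(defn shift [cell offset]
  (vec (map + cell offset)))

(shift [0 1] [1 1])   ;; => [1 2]
```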

I can build neighbors from what I’ve tried out. To make the test pass, I’ll continue to use vectors for cells, hiding them behind a simple functional interface of cell-at, x, and y.
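Assembled from the REPL experiments, the result might have looked like this sketch (the names come from the text; the bodies are my guesses):

```clojure
;; The small functional interface that hides the concrete representation.
(defn cell-at [x y] [x y])
(defn x [cell] (cell 0))
(defn y [cell] (cell 1))

(defn neighbors [cell]
  (let [offsets (remove #{[0 0]}
                        (for [dx [-1 0 1], dy [-1 0 1]] [dx dy]))]
    (for [[dx dy] offsets]
      (cell-at (+ (x cell) dx) (+ (y cell) dy)))))
```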

The concrete representation of the cell — and disaster

Here are the functions as yet undefined:

There’s no more escaping it. I’m going to have to decide what kind of thing border produces. That thing has to be a sequence for tick to map over:

border's result is also stored in world where living? will use it to decide whether a given cell is alive or dead.

My first thought was that I could use the set idiom I used above—the bordered world could just be the set of all living coordinates. Sneakily, any location not in the set would represent a dead cell. That would be great for implementing living?, but it wouldn’t work for tick, which has to process not only living cells, but also the dead cells that make up the border.

So my fallback was for border to produce a map, something like this:
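Something like this fragment (an illustration; locations map to their liveness):

```clojure
{[0 0] :dead    [1 0] :dead    [2 0] :dead
 [0 1] :living  [1 1] :living  [2 1] :living
 [0 2] :dead    [1 2] :dead    [2 2] :dead}
```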

Clojure maps are seqable, so you can map over them. But I don’t think I’ve ever actually tried it. What happens?…
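A quick REPL experiment answers the question (a reconstruction):

```clojure
(map identity {[0 1] :dead, [1 1] :living})
;; => ([[0 1] :dead] [[1 1] :living])
```

Each element comes out as a key/value pair that behaves like a two-element vector.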

OH GREAT. If I go down this route, we’ll have three different ways of representing cells:

  • as the original location in inputs like *vertical-blinker*: [0 1]
  • as part of a living/dead map: {... [0 1] :dead ...}
  • as a living/dead vector: [ [0 1] :dead ]

That’s intolerable. And yes, I bet at least half of my two readers thought I was mistaken not to think about data structures at the very beginning. However, my strategy with Clojure TDD has been to put off thinking about data structure as long as I can, and I’ve been surprised and pleased by how often I ended up with simpler data than it seemed I would. I’ve found that, given the use of globally-available immutable “background” data, much of what might have been explicit data structure (vectors of maps of vectors of…) ends up in the implicit structure of the computation. More about that, though, will have to wait for another post.

A recovery plan

The problem is here:

When I wrote that, I remember that the still small voice of conscience objected to the way I was both stashing the bordered-world away as background and simultaneously picking it apart with map. That just felt weird, but I argued myself into thinking it was harmless. It was not.

Really, since my whole program takes input [x y] pairs (such as *vertical-blinker*) and turns them into a different set of [x y] pairs, most of my work ought to be done with those pairs. I should be thinking about locations of cells, not cells themselves. In that way of thinking, border shouldn’t produce “cells”. It should take locations of living cells and produce locations that point to both living cells and adjacent dead cells.

Further, I shouldn’t repeat those locations in a world function. Instead, I need something that can answer questions about cells, given their locations. It should be a… (I’m bad with names)… an oracle about cells. I first imagined this:

using-cell-oracles-from should produce any wise and oracular functions we need. So far, that’s just living?.

I realized something more. Locations are flowing into the pipeline, locations are flowing out, and in this version, locations won’t be transformed into cells anywhere within the pipeline. That makes unborder, which was originally supposed to convert a mixture of living and dead cells into only living locations, seem kind of stupid. If tick produces only living locations, unborder can go away. (The name unborder always bugged me, because it didn’t really describe what the function would have to do. Once again, I should have paid attention.)

That leads to this top-level function:
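The figure isn't shown. Reconstructing from the description (the names are from the text; the exact body is my guess), it might have been something like:

```clojure
(defn next-world [living-locations]
  (using-cell-oracles-from living-locations
    (-> living-locations add-border-to tick)))
```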

That wasn’t so bad…

As it turns out, changing my mind about such a fundamental decision was easy.

What did I have to do to the code? I had to write using-cell-oracles-from. Here’s a test.

I won’t show the code that passes this test—it’s a somewhat grotty macro (but a simple transformation of the earlier against-background). You can see it in the complete source for this post.

I did a quick global-replace of “cell” with “location” and tweaked a couple of the resulting names. Although both you and I know that locations are just pairs, I retained the functions make-location (formerly cell-at), x, and y to keep the code insulated from the potential of another change of mind.

I had to convert the successor function to dead-in-next-generation?. That was pretty simple. I had to change two lines in the test. Here’s one:

To make that test pass, I had to rewrite successor. It used to be this:

Now it’s this:

That was just a matter of inverting the logic and deleting killed and vivified. (Before I ever got around to writing them!)
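A sketch of what the inverted predicate might look like. (The stand-in `living?` and `living-neighbor-count` here are assumptions, present only so the example is self-contained.)

```clojure
;; Stand-ins (assumptions): a set of living locations.
(def living-locations #{[0 0] [0 1] [0 2]})
(defn living? [location] (contains? living-locations location))
(defn living-neighbor-count [[x y]]
  (count (for [dx [-1 0 1], dy [-1 0 1]
               :when (and (not= [dx dy] [0 0])
                          (living? [(+ x dx) (+ y dy)]))]
           :living)))

;; The inverted logic: true when the location holds no living cell
;; in the next generation.
(defn dead-in-next-generation? [location]
  (if (living? location)
    (not (<= 2 (living-neighbor-count location) 3))
    (not (= 3 (living-neighbor-count location)))))
```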

The ease of this change makes me happy. Even though I blundered at the very beginning of my design, the way stub-heavy TDD lets me defer decisions—and forces me to encapsulate them so that I have something to stub—made the blunder a not-catastrophe. I wish I could say that I blundered deliberately to demonstrate that property of this style of TDD, but that would be a lie.

Enough for today

Only one function remains: add-border-to. That’ll be pretty easy, but this post is already too long. The next one will finish up the implementation and add whatever grand summary I can come up with.

TDD in Clojure: a sketch (part 1)

Part 2

I continue to use little experiments to help me think through TDD in Clojure. (I plan to begin a realistic experiment soon.) Right now, I’m mainly focused on three questions:

  • What would mocking or stubbing mean in a strict(ish) functional language?

  • What’d be a good mocking notation for Clojure?

  • How do you balance the outside-in style associated with mocks and the bottom-up style that the REPL (interpreter) encourages?

Here’s an example from Conway’s Game of Life. It begins with an implementation suggestion from Paul Blair and Michael Nicholaides at the Philly Code Retreat. Instead of thinking of the board as a two-dimensional array of cells, with some of them dead and some alive, think instead only of living cells, each of which knows its coordinates. Here’s an example that shows how “blinkers” blink from generation to generation.
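The example itself isn't reproduced here; in the notation described just below, it might have read roughly like this (the set-of-living-cells representation is my guess):

```clojure
(example
 (next-world #{[0 0] [0 1] [0 2]})   ; a vertical blinker...
 => #{[-1 1] [0 1] [1 1]})           ; ...becomes a horizontal one
```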

A couple of things have happened here:

  • This is my notation for a straightforward non-stubbing test. The value on the left is executed and it’s compared (for equality) to the value on the right.

  • I’ve started coding outside-in, and I’ve named the first function I need: next-world.

The Blair/Nicholaides approach advances the “world” to the next generation by (conceptually) adding dead cells around the edge of all the living cells, running the normal life rules that govern how cells change because of their neighbors, and then throwing away all the cells that end up dead. In other words:

  • The pending bit is just there because (sadly) Clojure makes you declare functions before mentioning them. pending just creates functions that print that they’ve not yet been implemented.

  • The rest of the code flows the world argument through a pipeline of three functions. If you’re not familiar with the -> macro, the result is the same as this:

    I don’t feel the need to test this code now because it’s really declarative—it says what it means to produce a next world under this approach. (It will be tested in the very end by the “integration test” that shows a blinker working.)
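The pipeline and its expansion can be sketched like this (the stand-in bodies are assumptions, included only so the sketch loads; in the post they were `pending`):

```clojure
;; Stand-ins (assumptions) so the sketch is self-contained.
(defn border [world] world)
(defn tick [world] world)
(defn unborder [world] world)

;; The threading-macro version: border the world, run the rules, drop the dead.
(defn next-world [world]
  (-> world border tick unborder))

;; The -> macro just nests the calls, so the above is the same as:
(defn next-world-expanded [world]
  (unborder (tick (border world))))
```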

I can now implement any of the three new functions. I’ll pick tick because it seems to be the heart of the matter. Here’s a first implementation:

There are two odd things going on here.

First, stubbing function calls.

In object-oriented languages, I think of mock-driven-design as a way of teasing out collaborators for the object I’m building. I push responsibilities for work onto objects that I’ll implement later. Mocking lets me defer the implementation of those objects until I’m ready, and creating some examples of the API teaches me the (implicit) specification for the new object.

I’ve found that with pure functional programs that don’t modify state, it makes more sense to think of a function like (f 2) => 4 as a fact. What I’m doing as I test-drive a function is describing how facts about its inputs and outputs depend on other facts, in an almost Prolog-like way. For example, consider this code:
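The missing figure might have read something like this reconstruction (`f`, `g`, and `h` are the placeholder names the explanation below uses):

```clojure
(example
 (f ...cell...) => 10
 (provided
  (g ...cell...) => true
  (h) => 2))
```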

That says that, for any cell you care to provide, f of that cell will be 10, provided g of that cell is true and h is 2. If either of those latter two facts don’t apply to the cell, I’m not saying what f’s value is.

I use the funny ...cell... notation in the way that mathematicians use n to talk about any integer. (They call that universal quantification.) I don’t want to create a particular cell because I might need to specify properties that have nothing to do with the function I’m working on. This notation says that nothing about the cell is relevant except for what comes after the provided.

Here’s one way to write a Life rule in this notation:
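A reconstruction of what such a rule might have looked like (here, a living cell with a single living neighbor dies):

```clojure
(example
 (living? (successor ...cell...)) => falsey
 (provided
  (living? ...cell...) => true
  (living-neighbor-count ...cell...) => 1))
```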

The falsey bit in the first line is because Clojure has two distinct values that can mean “false”. falsey is a function that takes the result of the left-hand side and fails the test if that result is anything other than one of the two false values. I’m using it because I don’t want to overspecify living?. There’s no reason to care which of the two “false” values it returns.

There’s a problem with this test, though. Remember what I said above: the left-hand side gets evaluated and handed to falsey. That means living? has to have a definition—which means I’d have to settle on how the code knows whether a cell is alive or dead. I like doing one thing at a time and putting off decisions as long as I can, and right now I’d rather be focused on successor instead of cell representations.

Here’s a way to defer that decision:
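A reconstruction of the deferred version (per the explanation that follows, the right-hand side is mocked rather than evaluated):

```clojure
(example
 (successor ...cell...) =means=> (killed ...cell...)
 (provided
  (living? ...cell...) => true
  (living-neighbor-count ...cell...) => 1))
```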

Here I’m saying something subtly different than before. I’m saying that the result of successor is specifically that cell produced by calling killed on the original cell. The =means=> notation tells the framework to create a mock instead of evaluating the right-hand side for its value. In a more familiar mocking syntax (for Ruby), the whole test is equivalent to:

OK. The next figure gives the whole set of Life rules, expressed as executable tests. (Well, executable as soon as I implement the testing framework.) Notice that I called the outer wrapper know (a fact) instead of example. know seems more appropriate for rules. The two forms mean the same thing.

Notice also that I implemented a notation for saying “run this test for each value in a sequence”. The use of commas, as in [4,,,8], indicates that—conceptually—the fact is true for all values four through eight. Only the ones listed are actually tried. (Commas count as white space in Clojure.)

This isn’t the tersest possible format—a table would be better—but it’ll do. I think it’s reasonably readable. Do you?

Here, for reference, is code that passes the test:
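The code isn't reproduced here; its shape might have been roughly this sketch. (`killed` and `vivified` are the constructors the rules imply; everything is left declared-but-undefined, as the post's stubs would have it.)

```clojure
(declare living? living-neighbor-count killed vivified)

(defn successor [cell]
  (cond (and (living? cell)
             (<= 2 (living-neighbor-count cell) 3))
        cell

        (and (not (living? cell))
             (= 3 (living-neighbor-count cell)))
        (vivified cell)

        :else
        (killed cell)))
```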

We now have an expanded choice of functions to write:

I could go breadth-first—with border and unborder—or go depth-first with one of the functions on the second line. In this particular case, I’d rather go depth first. I’ve avoided deciding on a representation, so I don’t know yet what border should do.

If this installment meets your approval, I’ll add another one that begins work on—oh—probably living-neighbor-count is the most complicated, so it’s a good one to chip away at.

Mocks and legacy code

While on my grand European trip, I stopped in for a day at a nice company with a fun group of people doing good work on a legacy code base. They challenged me to improve an existing test using mocks. The test was typical of those I’ve seen in legacy code situations: there was a whole lot of setup code because you couldn’t instantiate any single object without instantiating a zillion of them, and the complexity of the test made figuring out its precise purpose difficult.

After some talk, we figured out that what the test really wanted to check was that when a Quote is recalculated because it’s out-of-date, you get a brand-new Quote.

Rather than morph the test, I tried writing it afresh in my mockish style. A lot of the complexity of the test was in setting things up so that an existing quote should be retrieved. Since my style these days is to push off any hard work to a mocked-out new object, I decided we should have a QuoteFinder object that would do all that lookup for us. The test (in Ruby) would look something like this:

quote_finder = flexmock("quote finder")
quote = flexmock("quote")

during {
   some function 
}.behold! {
  quote_finder.should_receive(:find_quote).once.
               with(…whatever…).
               and_return(quote)
}

Next, the new quote had to be generated. The lazy way to do that would be to add that behavior to Quote itself:

quote_finder = flexmock("quote finder")
quote = flexmock("quote")

during {
   some function 
}.behold! {
  quote_finder.should_receive(:find_quote).once.
               with(…whatever…).
               and_return(quote)
  quote.should_receive(:create_revised_quote).once.
        with(…whatever…).
        and_return("a new quote")
}

Finally, the result of the function-under-test should be the new quote:

quote_finder = flexmock("quote finder")
quote = flexmock("quote")

during {
   some function 
}.behold! {
  quote_finder.should_receive(:find_quote).once.
               with(…whatever…).
               and_return(quote)
  quote.should_receive(:create_revised_quote).once.
        with(…whatever…).
        and_return("a new quote")
}

assert { @result == "a new quote" }

I felt a bit of a fraud, since I’d shoved a lot of important behavior into tests that would need to be written by someone else (including the original purpose of the test, making sure the next Quote was a different object than the last one.) The team, though, gave me more credit than I did. They’d had two Aha! moments. First, the idea of “finding a quote” was spread throughout the code, and it would be better localized in a QuoteFinder object. Second, they decided it really did make sense to have Quotes make new versions of themselves (rather than leave that responsibility somewhere else). So this test gave the team two paths they could take to improve their code.

In the beginning, the QuoteFinder and Quote#create_revised_quote would likely just delegate their work to the existing legacy code, but there were now two new organizational centers that could attract behavior. So this looks a lot like Strangling an App, but it avoids that trick’s potential “then a miracle occurs” problem of needing a good architecture to strangle with: instead, by following the make-an-object-when-you-hesitate strategy that mocking encourages, you can grow one.

I’ve not seen any writeup on using mocks to deal with legacy code. Have you?

P.S. It’s possible I’ve gotten details of the story wrong, but I think the essentials are correct.

TDD & Functional Testing: from collections to scalars

I’ve been fiddling around with top-down (mock-style) TDD of functional programs off-and-on for a few months. I’ve gotten obsessed with deferring the choice of data structures as long as possible. That seems appropriate in a functional language, where we should be talking about functions more than data. (And especially appropriate in Clojure, my language of choice, since Clojure lets you treat maps/dictionaries as if they were functions from keys to values.)

That is, I like to write these kinds of tests:

(example-of "saturating a terrain"
   (saturated? (... terrain ...)) => true
   (because
      (span-between-markers (... terrain ...)) => (... sub-span ...)
      (saturated? (... sub-span ...)) => true))

… instead of committing to what a terrain or sub-span look like. That’s been working reasonably well for me.

I’ve also been saying that “maps are getters”. By that, I mean that—given that you’ve test-driven raise-position—it really makes no more sense to test-drive this:

(defn raise [terrain]
   (map raise-position terrain))

… than it does to test a getter: it’s too obvious. That leads to a nice flow of testing: I’m always testing the transformation of things to other things. I don’t have to worry, until the very end of test-driving, that the “things” are actually complex data.

The problem I’ve been running into recently, though, is handling cases where complex data structures are converted into single values. For example, I’ve been trying to show a top-down TDD of Conway’s Life. In that case, I have to reduce a set of facts about the neighborhood of a cell into a single yes-or-no decision: should that cell be alive or dead in the next iteration? But expressing that fact is rather awkward when you don’t want to say precisely what a “cell” is or how you know it’s “alive” or “dead” (other than that there’s some function from a cell and its environment to a boolean).

To be concrete, here’s something I want to claim: a cell is alive in the next iteration if (1) it is alive now and (2) exactly two of the cells in its neighborhood are alive. How do you say that while being not-specific? I’ve not found a way that makes me happy.

Part of the problem, I think, is that when you start talking about individual elements of collections, you’re moving from the Land of TDD, which is a land of functions-of-constants to a Land of Quantified Variables (like “there exists an element of the collection such that…”). That way lies madness.

A sort of thought about interaction (and perhaps state-based) tests

This here post is about making tests terse by specifying what has happened instead of (as in interaction tests) who did it or (as in state-based tests) the different kinds of things-it-has-happened-to.

I have a test that says that the Availability object should use the TupleCache object to get particular values for: all animals, animals that are still working, and animals that have been removed from service. If one wants to show animals that can be removed from service, it’s this:

all animals - animals still working - animals already removed from service

Here’s a mock-style test that describes how the Availability uses the TupleCache:

    should "use tuple cache to produce a list of animals" do
      @availability.override(mocks(:tuple_cache))
      during {
        @availability.animals_that_can_be_removed_from_service
      }.behold! {
        @tuple_cache.should_receive(:all_animals).once.
                     and_return([{:animal_name => 'out-of-service jake'},
                                 {:animal_name => 'working betsy'},
                                 {:animal_name => 'some…'},
                                 {:animal_name => '…other…'},
                                 {:animal_name => '…animals'}])
        @tuple_cache.should_receive(:animals_still_working_hard_on).once.
                     with(@timeslice.first_date).
                     and_return([{:animal_name => 'working betsy'}])
        @tuple_cache.should_receive(:animals_out_of_service).once.
                     and_return([{:animal_name => 'out-of-service jake'}])
      }
      assert_equal(['…animals', '…other…', 'some…'], @result)
    end

I’m not wild about the amount of detail in the test, but let’s leave that to the side. Notice that the results of the test imply that the Availability is turning the tuples (think of them as hashes or dictionaries) into a simple list of strings. Notice also that the list of strings is sorted. Noticing that brings a couple of questions to mind:

  • That sorting - does it use ASCII sorting, which sorts all uppercase characters in front of lowercase? or is it the kind of sorting the users expect (where case is irrelevant)?

  • Are duplicates stripped out of the result?

As it happens, I want the responsibility of converting tuples into lists to belong to another object. I’d prefer Availability to have only the responsibility of asking the right questions of the persistent data, not also of massaging the results. I’d like to put that responsibility into a Reshaper object. Here’s an expanded test that does that:

    should "use tuple cache to produce a list of animals" do
      @availability.override(mocks(:tuple_cache, :reshaper))
      during {
        @availability.animals_that_can_be_removed_from_service
      }.behold! {
        @tuple_cache.should_receive(:all_animals).once.
                     and_return(['…tuples-all…'])
        @tuple_cache.should_receive(:animals_still_working_hard_on).once.
                     with(@timeslice.first_date).
                     and_return(['…tuples-work…'])
        @tuple_cache.should_receive(:animals_out_of_service).once.
                     and_return(['…tuples-os…'])
        # New lines
        @reshaper.should_receive(:extract_to_values).once.
                  with(:animal_name, ['…tuples-work…'], ['…tuples-os…'], ['…tuples-all…']).
                  and_return([['working betsy'], ['out-of-service jake'],
                              ['working betsy', 'out-of-service jake',
                               'some…', '…other…', '…animals']])
        @reshaper.should_receive(:alphasort).once.
                  with(['some…', '…other…', '…animals']).
                  and_return(['…animals', '…other…', 'some…'])
      }
      assert_equal(['…animals', '…other…', 'some…'], @result)
    end

It shows that the Availability method calls Reshaper methods which we could see (if we looked) guarantee the properties that we want. But I don’t like this test. The relationship between Availability and Reshaper doesn’t seem to me nearly as fundamental as that between Availability and TupleCache. And I hate that the general notion of “convert a pile of tuples into a sensible list” is made so specific: it will make maintenance harder. And I’m not thrilled, throughout this test, by the way the human reader must infer claims about the code from the examples.

So how about this?:

    should "use tuple cache to produce a list of animals" do
      @availability.override(mocks(:tuple_cache))
      during {
        @availability.animals_that_can_be_removed_from_service
      }.behold! {
        @tuple_cache.should_receive(:all_animals).once.
                     and_return([{:animal_name => 'out-of-service jake'},
                                 {:animal_name => 'working betsy'},
                                 {:animal_name => 'some…'},
                                 {:animal_name => '…other…'},
                                 {:animal_name => '…animals'}])
        @tuple_cache.should_receive(:animals_still_working_hard_on).once.
                     with(@timeslice.first_date).
                     and_return([{:animal_name => 'working betsy'}])
        @tuple_cache.should_receive(:animals_out_of_service).once.
                     and_return([{:animal_name => 'out-of-service jake'}])
      }
      assert_equal(['…animals', '…other…', 'some…'], @result)
      assert { @result.history.alphasorted }
    end

The last line of the test claims that—at some point in the past—the result list has been “alphasorted”. A list that’s been alphasorted has the properties we want, which we can check by looking at the tests for the Reshaper#alphasort method.

In essence, we check whether at some point in the past the object we’re looking at has been “stamped” with an appropriate description of its properties. Therefore, we don’t have to construct test input that checks the various ways that description can become true - we simply trust earlier tests of what the stamp means.

Here’s code that adds the stamp:

    def result.history()
      @history = OpenStruct.new unless @history
      @history
    end
    result.history.alphasorted = true
    result.freeze

(Notice that I “freeze” the object. In Ruby, that makes the object immutable. That’s in keeping with my growing conviction that maybe programs should consist of functional code sandwiched between carefully-delimited bits of state-setting code.)

Having said all that, I suspect that the original awkwardness in the tests is a sign that I need a different factoring of responsibilities, rather than making up this elaborate solution. But I haven’t figured out what that factoring should be, so I offer the alternative for consideration.

A parable about mocking frameworks

Somewhere around 1985, I introduced Ralph Johnson to a bigwig in the Motorola software research division. Object-oriented programming was around the beginning of its first hype phase, Smalltalk was the canonical example, and Ralph was heavily into Smalltalk, so I expected a good meeting.

The bigwig started by explaining how a team of his had done object-oriented programming 20 years before in assembly language. I slid under the table in shame. Now, it’s certainly technically possible that they’d implemented polymorphic function calls based on a class tag–after all, that’s what compilers do. Still, the setup required to do that was surely far greater than the burden Smalltalk and its environment put on the programmer. I immediately thought that the difference in the flexibility and ease that Smalltalk and its environment brought to OO programming made the two programming experiences completely incommensurable. (The later discussion confirmed that snap impression.)

I suspect the same is true of mocking frameworks. When you have to write test doubles by hand, doing so is an impediment to the steady cadence of TDD. When you write a statement in a mocking framework’s pseudo-language, doing so is part of the cadence. I bet the difference in experience turns into a difference in design, just as Smalltalk designs were different from even the most object-oriented assembler designs (though I expect not to the same extent).

Mocks, the removal of test detail, and dynamically-typed languages

Simplify, simplify, simplify!
Henry David Thoreau

(A billboard I saw once.)

Part 1: Mocking as a way of removing words

One of the benefits of mocks is that tests don’t have to build up complicated object structures that have nothing essential to do with the purpose of a test. For example, I have an entry point to a webapp that looks like this:

get '/json/animals_that_can_be_taken_out_of_service', :date => '2009-01-01'

It is to return a JSON version of something like this:

{ 'unused animals' => ['jake'] }

Jake can be taken out of service on Jan 1, 2009 because he is not reserved for that day or any following day.

In typical object-oriented fashion, the controller doesn’t do much except ask something else to do something. The code will look something like this:

  get '/json/animals_that_can_be_taken_out_of_service' do
    # Tell the “timeslice” we are concerned with the date given.

    # Ask the timeslice: What animals can be reserved on/after that date?
    # (That excludes the animals already taken out of service.) 

    # Those animals fall into two categories:
    # - some have reservations after the timeslice date. 
    # - some do not.
    # Ask the timeslice to create the two categories.

    # Return the list of animals without reservations. 
    # Those are the ones that can be taken out of service as of the given date. 
  end

If I were testing this without mocks, I’d be obliged to arrange things so that there would be examples of each of the categories. Here’s the creation of a minimal such structure:

  jake = Animal.random(:name => 'jake')
  brooke = Animal.random(:name => 'brooke')
  Reservation.random(:date => Date.new(2009, 1, 1)) do
    use brooke
    use Procedure.random
  end

The random methods save a good deal of setup by defaulting unmentioned parameters and by hiding the fact that Reservations have_many Groups, Groups have_many Uses, and each Use has an Animal and a Procedure. But they still distract the eye with irrelevant information. For example, the controller method we’re writing really cares nothing for the existence of Reservations or Procedures–but the test has to mention them. That sort of thing makes tests harder to read and more fragile.

In contrast to this style of TDD, mocking lets the test ignore everything that the code can. Here’s a mock test for this controller method:

    should "return a list of animals with no pending reservations" do
      brooke = Animal.random(:name => 'brooke')
      jake = Animal.random(:name => 'jake')

      during {
        get '/json/animals_that_can_be_taken_out_of_service', :date => '2009-01-01'
      }.behold! {
        @timeslice.should_receive(:move_to).once.with(Date.new(2009,1,1))
        @timeslice.should_receive(:animals_that_can_be_reserved).once.
                   and_return([brooke, jake])
        @timeslice.should_receive(:hashes_from_animals_to_pending_dates).once.
                   with([brooke, jake]).
                   and_return([{brooke => [Date.new(2009,1,1), Date.new(2010,1,1)]},
                               {jake => []}])
      }
      assert_json_response
      assert_jsonification_of('unused animals' => ['jake'])
    end

There are no Reservations and no Procedures and no code-discussions of irrelevant connections amongst objects. The test is more terse and–I think–more understandable (once you understand my weird conventions and allow for my inability to choose good method names). That’s an advantage of mocks.

Part 2: Dynamic languages let you remove even more irrelevant detail

But I’m starting to think we can actually go a little further in languages like Ruby and Objective-J. I’ll use different code to show that.

When the client side of this app receives the list of animals that can be removed from service, it uses that to populate the GUI. The user chooses some animals and clicks a button. Various code ensues. Eventually, a PersistentStore object spawns off a Future that asynchronously sends a POST request and deals with the response. It does that by coordinating with two objects: one that knows about converting from the lingo of the program (model objects and so forth) into HTTP/JSON, and a FutureMaker that makes an appropriate future. The real code and its test are written in Objective-J, but here’s a version in Ruby:

should "coordinate taking animals out of service" do
  during {
    @sut.remove_from_service("some animals", "an effective date")
  }.behold! {
    @http_maker.should_receive(:take_animals_out_of_service_route).at_least.once.
                and_return("some route")
    @http_maker.should_receive(:POST_content_from).once.
                with(:date => "an effective date",
                     :animals => "some animals").
                and_return("post content")
    @future_maker.should_receive(:spawn_POST).once.
                  with("some route", "post content")
  }
end

I’ve done something sneaky here. In real life, remove_from_service will take actual Animal objects. In Objective-J, they’d be created like this:

  betsy = [[Animal alloc] initWithName: "betsy" kind: "cow"];

But facts about Animals (that, say, they have names and kinds) are irrelevant to the purpose of this method. All it does is hand an incoming list of them to a converter method. So, in such a case, why not use strings that describe the arguments instead of the arguments themselves?

    @sut.remove_from_service("some animals", "an effective date")

In Java, type safety rarely lets you do that, but why let the legacy of Java affect us in languages like Ruby?
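To make the trick concrete, here’s a hypothetical Ruby sketch of what `remove_from_service` itself might look like (the collaborator names mirror the test above, but this is my illustration, not the real Objective-J code). Because the method does nothing but hand its arguments off to collaborators, descriptive strings flow through it just as well as real Animals would:

```ruby
# Hypothetical sketch: the object under test coordinates two injected
# collaborators and never inspects the animals or the date itself.
class PersistentStore
  def initialize(http_maker, future_maker)
    @http_maker = http_maker
    @future_maker = future_maker
  end

  def remove_from_service(animals, effective_date)
    route = @http_maker.take_animals_out_of_service_route
    content = @http_maker.POST_content_from(:date => effective_date,
                                            :animals => animals)
    @future_maker.spawn_POST(route, content)
  end
end
```

Since `animals` is only ever passed along, the string `"some animals"` is as good a test value as a list of carefully constructed Animal objects.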

Now, I’m not sure how often these descriptive arguments are a good idea. One could argue that integration errors are a danger with mocks anyway, and that not using real examples of what flows between objects only increases that danger. Or that the increase in clarity for some is outweighed by a decrease for others: if you don’t understand what’s meant by the strings, there’s nothing (like looking at how test data was constructed) to help you. I haven’t found either of those to be a problem yet, but it is my own code after all.

(I will note that I do add some type hints. For example, I’m increasingly likely to write this:

    @sut.remove_from_service(["some animals"], "an effective date")

I’ve put “some animals” in brackets to emphasize that the argument is an array.)

If you’ve done something similar to this, let’s talk about it at a conference sometime. In the next few months, I’ll be at Speakerconf, the Scandinavian Developer Conference, Philly Emerging Tech, an Agile Day in Costa Rica, and possibly Scottish Ruby Conference.

Some preliminary thoughts on end-to-end testing in Growing Object-Oriented Software

I’ve been working through Growing Object-Oriented Software (henceforth #goos), translating it into Ruby. An annoyingly high percentage of my time has been spent messing with the end-to-end tests. Part of that is due to a cavalcade of incompatibilities that made me fake out an XMPP server within the same process as the app-under-test (named the Auction Sniper), the Swing GUI thread, and the GUI scraper. Threading hell.

But part of it is not. Part of it is because end-to-end tests just are awkward and fragile (which #goos is careful to point out). If such tests are worth it, it’s because some combination of these sources of value outweighs their cost:

  • They help clarify everyone’s understanding of the problem to be solved.

  • Trying to make the tests run fast, be less fragile, be easier to debug in the case of failure, etc. makes the app’s overall design better.

  • They detect incorrect changes (that is, changes in behavior that were not intended, as distinct from ones you did intend that will require the test to be changed to make it an example of the newly-correct behavior).

  • They provide a cadence to the programming, helping to break it up into nicely-sized chunks.

In working through #goos so far (chapter 16), the end-to-end tests have not found any bugs, so zero value there. I realized last night, though, that what most bugged me about them is that they made my programming “ragged”–that is, I kept microtesting away, changing classes, being happy, but when I popped up to run the end-to-end test I was working on, it or another one would break in a way that did not feel helpful. (However, I should note that it’s a different thing to try to mimic someone else’s solution than to conjure up your own, so some of the jerkiness is just inherent to learning from a book.)

I think part of the problem is the style of the tests. Here’s one of them, written with Cucumber:

   Scenario: Sniper makes a higher bid, but loses
       Given the sniper has joined an ongoing auction
       When the auction reports another bidder has bid 1000 (and that the next increment is 98)
       Then the sniper shows that it's bidding 1098 to top the previous price
           And the auction receives a bid of 1098 from the sniper

       When the auction closes
       Then the sniper shows that it's lost the auction

This test describes all the outwardly-observable behavior of the Sniper over time. Most importantly, at each point, it talks about two interfaces: the XMPP interface and the GUI. During coding, I found that context switching unsettling (possibly because I have an uncommonly bad short- and medium-term memory for a programmer). Worse, I don’t believe this style of test really helps to clarify the problem to be solved. There are two issues: what the Sniper does (bid in an auction) and what it shows (information about the known state of the auction). They can be talked about separately.

What the Sniper does is most clearly described by a state diagram (as on p. 85) or state table. A state diagram may not be the right thing to show a non-technical product owner, but the idea of the “state of the auction” is not conceptually very foreign (indeed, the imaginary product owner has asked for it to be shown in the user interface). So we could write something like this on a blackboard:
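For example, a single transition might be sketched like this (the numbers are the same ones that appear in the microtest later in this post):

```
When the auction is in the PENDING state
 and the auction reports a price of 1000 with a minimum increment of 98,
then the sniper bids 1098
 and the auction moves to the BIDDING state
     (last price 1000, last bid 1098).
```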

Just as in #goos, this is enough to get us started. We have an example of a single state transition, so let’s implement it! The blackboard text can be written down in whatever test format suits your fancy: Fit table, Cucumber text, programming language text, etc.

Where do we stand?

At this point, the single Cucumber test I showed above is breaking into at least three tests: the one on the blackboard, a similar one for the BIDDING to LOSING transition, and something as yet undescribed for the GUI. Two advantages to that: first, a correct change to the code should only break one of the tests. That breakage can’t be harder to figure out than breaking the single, more complicated test. Second, and maybe it’s just me, but I feel better getting triumphantly to the end of a medium-sized test than I do getting partway through a bigger end-to-end one.

The test on the blackboard is still a business-facing test; it’s written in the language of the business, not the language of the implementation, and it’s talking about the application, not pieces of it.

Here’s one implementation of the blackboard test. I’ve written it in my normal Ruby microtesting style because that shows more of the mechanism.

context "pending state" do

  setup do
    start_app_at(AuctionSnapshot.new(:state => PENDING))
  end

  should "respond to a new price by counter-bidding the minimum amount" do
    during {
      @app.receive_auction_event(AuctionEvent.price(:price => 1000,
                                                    :increment => 98,
                                                    :bidder => "someone else"))
    }.behold! {
      @transport_translator.should_receive(:bid).once.with(1098)
      @anyone_who_cares.should_receive_notification(STATE_CHANGE).at_least.once.
                        with(AuctionSnapshot.new(:state => BIDDING,
                                                 :last_price => 1000,
                                                 :last_bid => 1098))
    }
  end
end

Here’s a picture of that test in action. It is not end-to-end because it doesn’t test the translation to-and-from XMPP.

In order to check that the Sniper has the right internal representation of what’s going on in the auction, I have it fling out (via the Observer or Publish/Subscribe pattern) information about that. That would seem to be an encapsulation violation, but this is only the information that we’ve determined (at the blackboard, above) to be relevant in/to the world outside the app. So it’s not like exposing whether internal data is stored in a dictionary, list, array, or tree.
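That publish/subscribe arrangement can be sketched with Ruby’s stdlib `Observable` module. The class and method names here are illustrative stand-ins, not the book’s code or mine:

```ruby
require 'observer'

# The externally relevant state the Sniper is willing to publish.
AuctionSnapshot = Struct.new(:state, :last_price, :last_bid)

# Hypothetical sketch: on a price event, the Sniper counter-bids via
# its translator, then flings the new snapshot at any subscribers.
class Sniper
  include Observable

  def initialize(translator)
    @translator = translator
  end

  def price_arrived(price, increment)
    bid = price + increment
    @translator.bid(bid)
    changed                                             # mark state dirty
    notify_observers(AuctionSnapshot.new(:bidding, price, bid))
  end
end
```

Anyone who cares (a test, the GUI’s TableModel, a logger) just calls `add_observer` on the Sniper; the Sniper itself never knows who is listening.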

At this point, I’d build the code that passed this test and others like it in the normal #goos outside-in style. Then I’d microtest the translation layer into existence. And then I’d do an end-to-end test, but I’d do it manually. (Gasp!) That would involve building much the same fake auction server as in #goos, but with some sort of rudimentary user interface that’d let me send appropriately formatted XMPP to the Sniper. (Over the course of the project, this would grow into a more capable tool for manual exploratory testing.)

So the test would mean starting the XMPP server, starting the fake auction and having it log into the server, starting the Sniper, checking that the fake auction got a JOIN request, and sending back a PRICE event. This is just to see the individual pieces fitting together. Specifically:

  • Can the translation layer receive real XMPP messages?
  • Does it hand the Sniper what it expects?
  • Does the outgoing translation layer/object really translate into XMPP?

The final question (is the XMPP message’s payload in the right format for the auction server?) can’t really be tested until we have a real auction server to hook up to. As discussed in #goos, those servers aren’t readily available, which is why the book uses fake ones. So, in a real sense, my strategy is the same as #goos’s: test as end-to-end as you reasonably can and plug in fakes for the ends (or middle pieces) that are too hard to reach. We just have a different interpretation of “reasonably can” and “too hard to reach”.

Having done that for the first test, would I do it again for the BIDDING to LOSING transition test? Well, yeah, probably, just to see a two-step transition. But by the time I finished all the transitions, I suspect code to pass the next transition test would be so unlikely to affect integration of interfaces that I wouldn’t bother.

Moreover, having finished the Nth transition test, I would only exercise what I’d changed. I would not (not, not, not!) run all the previous tests as if I were a slow and error-prone automated test suite. (Most likely, though, I’d try to vary my manual test, making it different from both the transition test that prompted the code changes and from previous manual tests. Adding easy variety to tests can both help you stumble across bugs and–more importantly–make you realize new things about the problem you’re trying to solve and the product you’re trying to build.)

What about real automated end-to-end tests?

I’d let reality (like the reality of missed bugs or tests hard to do manually) force me to add end-to-end tests of the #goos sort, but I would probably never have anywhere near the number of end-to-end scenario/workflow tests that #goos recommends (as of chapter 16). While I think workflows are a nice way of fleshing out a story or feature, a good way to start generating tests, and a dandy conversation tool, none of those things require automation.

I could do any number of my state-transition tests, making the Sniper ever more competent at dealing with auctions, but I’d probably get to the GUI at about the same time as #goos.

What do we know of the GUI? We know it has to faithfully display the externally-relevant known state of the auction. That is, it has to subscribe to what the Sniper already publishes. I imagine I’d have the same microtests and implementation as #goos (except for having the Swing TableModel subscribe instead of being called directly).

Having developed the TableModel to match my tests, I’d still have to check whether it matches the real Swing implementation. I’d do that manually until I was dragged kicking and screaming into using some GUI scraping tool to automate it.

How do I feel?

Nervous. #goos has not changed my opinion about end-to-end tests. But its authors are smarter and more experienced than I am. So why do they love–or at least accept–end-to-end tests while I fear and avoid them?

Unthrilled

Here’s a test for the Cappuccino app I’m working on. It’s about what happens when, for example, you click on “blood collection for transfusion” in the right table here:


- (void)testPutBackAProcedure
{
  [scenario
   previousAction: function() {
      [self procedure: "Betical"
            hasBeenSelectedFrom: ["alpha", "Betical", "order"]];
    }
  during: function() {
      [self putBackProcedure: "Betical"];
    }
  behold: function() {
      [self listenersWillReceiveNotification: ProcedureUpdateNews
            containingObject: []];
      [self tablesWillReloadData];
    }
  andSo: function() {
      [self unchosenProcedureTableWillContain: ["alpha", "Betical", "order"]];
      [self chosenProcedureTableWillContain: []];
    }
   ];
}

Here’s the code to pass the test (after inlining one method):

- (void)unchooseProcedure: (id) sender
{
  [self moveProcedureAtIndex: [chosenProcedureTable clickedRow]
                        from: chosenProcedures
                          to: unchosenProcedures];

  [NotificationCenter postNotificationName: ProcedureUpdateNews
                                    object: chosenProcedures];
  [chosenProcedureTable reloadData];
  [unchosenProcedureTable reloadData];
}

Did the test clarify my design thinking? No, not really. Will it be useful for regression? I doubt it. Is it good documentation for the app’s UI behavior? No.

Something seems wrong here.

Erasing history in tests

Something I say about the ideal of Agile design is that, at any moment when you might ship the system, the code should look as if someone clever had designed a solution tailored to do exactly what the system does, and then implemented that design. The history of how the system actually got that way should be lost.

An equivalent ideal for TDD might be that the set of tests for an interoperating set of classes would be an ideal description-by-example of what they do, of what their behavior is. For tests to be documentation, the tests would have to be organized to suit the needs of a learner (most likely from simple to complex, with error cases deferred, and - for code of any size - probably organized thematically somehow).

That is, the tests would have to be more than what you’d expect from a history of writing them, creating the code, rewriting tests and adding new ones as new goals came into view, and so forth. They shouldn’t be a palimpsest with some sort of random dump of tests at the top and the history of old tests showing through. (“Why are these three tests like this?” “Because when behavior X came along, they were the tests that needed to be changed, and it was easiest to just tweak them into shape.”)

I’ve seen enough to be convinced that, surprisingly, Agile design works as described in the first paragraph, and that it doesn’t require superhuman skill. The tests I see - and write - remind me more of the third paragraph than the second. What am I missing that makes true tests-as-documentation as likely as emergent design is?

(It’s possible that I demand too much from my documentation.)