Thu, 29 Jan 2004
Order of tests
Suppose you have a set of tests, A through Z. Suppose you have N teams
and have each team implement code that passes the tests, one at a time,
but each team receives the tests in a different order. How
different would the final implementations be? Would some orders lead
to less backtracking?
I decided to try a small version of such an experiment at the
Master of
Fine Arts in Software trial run. Over Christmas vacation, I
collaborated with my
wife to create five files of FIT tests that begin to describe a
veterinary clinic.
(She's head of the
Food Animal Medicine
and Surgery section of the University of Illinois veterinary
teaching hospital - that's her in the top middle picture at the bottom of
the page.) I implemented each file's worth of tests before we created
the next file.
- 000-typical-animal-progress.html
- 1A-orders.html
- 1B-when-payment-ends.html
- 1C-state-pays.html
- 1D-accounting.html
An interesting thing happened. When I got to the fifth file (1D), I
had to do a lot of backtracking. One key class emerged, sucking code
out from a couple of other classes. I think a class
disappeared. After cruising through the first four files, I felt
like I'd hit a wall. I'd made some bad decisions with the second file (1A),
stuck with them too long, and was only forced to
undo them with the fifth file. (Had I been attentive to the still,
small voice of conscience in my head, I might have done better. Or
maybe not.)
At the trial run, we spent four or five hours implementing. Sadly,
only one of the teams finished. They did 1D before 1A. (Their order
was 000-1D-1C-1B-1A.) What was interesting was that they
thought 1D was uneventful but 1A was where they had to do some
serious thinking. I got the feeling that their reaction upon
hitting 1A was somehow similar to - though not the same as - my
reaction upon hitting 1D. That's interesting.
Here are some choice quotes:
Brian: Am I right in remembering that D was no problem, but that
things got interesting at A (which is the opposite of what I observed
while taking them in the other order)?
Avi: That's right.
'A' changed some of the "ground rules" that we had been assuming about
the system. I think the biggest deal was that, up to that point, all
"orders" had been linear transitions from one status to another - from
intensive care to normal boarding to dead, for example. Suddenly,
there were all different kinds of orders that interacted in complex
ways, some of which could be active simultaneously, and they had an
effect on far more things than just the daily rate. At this point,
both the state of the system and the conditional behavior based on the
current state became complex enough that many more things that had
previously gotten away with being simple data types needed to be
modelled as classes. It was the first time the code was threatening to
become anything like the kind of OO design you would have done if you
had sat down and drawn UML diagrams from the start.
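To make Avi's point a little more concrete, here's the general shape of
that shift as I picture it. This is a sketch with invented names and
made-up rates, not the code from the workshop:

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of the shift Avi describes; names and numbers are invented.

// Before 1A: a patient's care is one status that moves linearly
// from value to value (intensive care to normal boarding to dead).
enum Status { INTENSIVE_CARE, NORMAL_BOARDING, DEAD }

class SimplePatient {
    Status status = Status.NORMAL_BOARDING;

    double dailyRate() {
        if (status == Status.DEAD) return 0.0;
        return status == Status.INTENSIVE_CARE ? 100.0 : 30.0;  // made-up rates
    }
}

// After 1A: different kinds of orders can be active at the same time,
// and each affects more than the daily rate, so orders become objects.
interface Order {
    double dailyRateAdjustment();
}

class Patient {
    private final List<Order> activeOrders = new ArrayList<>();

    void place(Order order) { activeOrders.add(order); }

    double dailyRate() {
        double rate = 30.0;  // base boarding rate, also made up
        for (Order o : activeOrders) {
            rate += o.dailyRateAdjustment();
        }
        return rate;
    }
}
```

Once several orders can be active at once, the natural place to answer
"what does this order change?" is the order itself, which is the pull
toward classes that Avi describes.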
Chad: It felt to me like that feeling I get when I'm doing
something in Excel and I run into a scenario where pivot tables just
aren't cutting it. Suddenly, I need a multi-dimensional view of the data,
and I realize that the tool I have isn't going to work. So, it was kind
of a flat to multi-dimensional transition.
Since we were intentionally avoiding the creation of new classes or
abstractions of any kind (as an experiment), we were facing a rewrite to
move further.
Given the fact that our brittle code was starting to take
the shape of classes that *wanted* to spring into existence, I wonder how
much better the code would have been if we had done classic
test-driven development without the forced stupidity. Unfortunately, it's
impossible to conduct a valid experiment to test this without a
prohibitively large sample size. Who knows--you may have found an example
that will generally cause developers to box themselves into a corner.
If Avi and I could forget the exercise completely, it would be fun to go
back and try to do TDD while overly abstracting everything to see if we
ran into the same issues.
Another pair had an experience slightly similar to mine. They did
000-1C-1B-1A and then
started on 1D. One of them says:
The only discontinuity we felt was at D where we realised we needed to
have an enhanced accounting mechanism. The rest of the tests exhibit
the expected feeling of tension and then release as we added stuff to
the fixture and then refactored it out. D felt different to me because
unlike the others (in our ordering) D did two things:
- It was a significant increment in requirements above and
beyond the simple balance model, and a larger step in code
complexity than the others.
- It broke an assumption that was woven through the accounting code.
What occurred to me at the time was that this is an example of change
that you'd like not to happen in a real system. We didn't finish D but
it would have been easy to fix. If that had happened in the last
iteration before UAT it would have been a lot scarier.
Interestingly, I didn't feel we had made a mistake. We had decided not
to look ahead and to do the trivialest thing; we had just learnt
something new and needed to deal with it.
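I can only guess at their code, but the move from a "simple balance
model" to an "enhanced accounting mechanism" probably has roughly this
shape: one running number per client versus individual entries that
later rules (who pays, when payment ends) can be applied to. Again, a
sketch with invented names, not their code:

```java
import java.util.ArrayList;
import java.util.List;

// A guess at the shape of the change, not their actual code.

// The simple balance model: one running number, adjusted as charges arrive.
class SimpleAccount {
    private double balance = 0.0;
    void charge(double amount) { balance += amount; }
    double owed() { return balance; }
}

// The enhanced mechanism: keep individual entries so later rules
// (who pays, when payment ends) can be applied entry by entry.
class LedgerAccount {
    static class Entry {
        final String description;
        final double amount;
        final String payer;   // e.g. "owner" or "state"
        Entry(String description, double amount, String payer) {
            this.description = description;
            this.amount = amount;
            this.payer = payer;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void charge(String description, double amount, String payer) {
        entries.add(new Entry(description, amount, payer));
    }

    double owedBy(String payer) {
        double total = 0.0;
        for (Entry e : entries) {
            if (e.payer.equals(payer)) total += e.amount;
        }
        return total;
    }
}
```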
What do I conclude from this? Well, nothing, except that it's
a topic I want to pay attention to. I don't think we'll ever see a
convincing experiment, but perhaps through discussion we'll
develop some lore about ways to get smoother
sequences
of tests.
If anyone wants to play with the tests, you can
download
them all. You'll also want the FIT jar file; it has a fixture I
use in the tests. Warning: you will need to ask clarifying questions
of your on-site customer with expertise in running a university
large animal clinic. Oh, you haven't got one? Mail me.
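If you haven't used FIT before: the HTML tables drive Java code through
fixture classes. The fixture in the jar is my own; what follows is only
the canonical column-fixture shape, with made-up column names and rates,
so you know what to expect:

```java
import fit.ColumnFixture;

// Not the fixture from the download, just the canonical FIT shape:
// public fields are filled from the table's input columns, and a
// method named in the header with trailing parentheses, such as
// "total charge()", supplies the value FIT checks.
public class DailyChargeFixture extends ColumnFixture {
    public String careType;   // input column "care type"
    public int days;          // input column "days"

    public double totalCharge() {
        double rate = careType.equals("intensive care") ? 100.0 : 30.0;  // made-up rates
        return rate * days;
    }
}
```

In the HTML, that corresponds to a table whose first row names the
fixture class and whose second row holds the column headers "care type",
"days", and "total charge()".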
## Posted at 11:32 in category /mfa