Monday, March 2, 2009

Architectural Tipping Points

My company is in the midst of a major merger, literally doubling in size. We've acquired other companies before, usually in adjacent markets to grow our footprint. Those companies were typically between 3% and 4% of our size, and the entire integration effort took about four months on average. The current project involves a company that was 100% of our previous size, and the integration will take upwards of two years. In the simplest terms, the prevailing philosophy in those earlier acquisitions was that we'd add more rows to our existing databases when we moved the customer records into our existing systems. It was never quite that easy, but that was the general idea. Adding rows to a database is not considered an architectural change, so architecture reviews were not typically part of merger mania.

What follows is true, although the scenario is a composite of several applications; that is, the names have been changed to protect, basically, me. Let's say you have a web-based application that likes to inform the user how many of their social network friends are online, and to do that the application sequentially traipses through the repository looking for connections with each login. I am not claiming that this is the best way to accomplish this; I am only suggesting that, given the time constraints to which we are constantly held, it is not an unreasonable place to end up.
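To make the scenario concrete, here is a minimal sketch of the "traipse on every login" approach next to one possible alternative. Everything in it is hypothetical: the function names, the (user, friend) tuple layout, and the in-memory index are illustrative assumptions, not the design of any of the real applications.

```python
from collections import defaultdict

def friends_online_scan(user, connections, online):
    """Sequential scan: reads every connection record on every login."""
    return sum(1 for owner, friend in connections
               if owner == user and friend in online)

def build_index(connections):
    """One-time pass that groups each user's friends together (assumed alternative)."""
    index = defaultdict(set)
    for owner, friend in connections:
        index[owner].add(friend)
    return index

def friends_online_indexed(user, index, online):
    """Looks only at this user's friends, however large the repository is."""
    return len(index.get(user, set()) & online)

# Hypothetical data: (user, friend) pairs and the set of users currently online.
connections = [("ann", "bob"), ("ann", "cy"), ("bob", "ann")]
online = {"bob", "dee"}
index = build_index(connections)
```

Both functions give the same answer; the difference is that the cost of the scan grows with the total number of records, which is exactly what a merger changes.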

Naturally, as the file grows, it takes more time to traipse through it and collect all of the friends' names. At some point the amount of time it takes to traipse exceeds the slack time in the system and login requests start to build up. Managing a backlog of requests takes system resources, which causes the traipsing to take even longer, and this is the tipping point that causes the whole system to fail. Now of course there are ways around this, and that is exactly the point. A solution to a problem may be completely satisfactory, as ours was when the application was originally designed. Normally, the mere addition of more records would not cause us to think a re-architecture of the application was in order, but sometimes it does.
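The feedback loop described above can be sketched as a toy simulation. All of the numbers are assumptions chosen only for illustration (one login per second, a scan cost of 0.00003 seconds per record, 0.002 seconds of overhead per queued request); they are not measurements from any real system.

```python
def simulate_backlog(records, seconds=600, arrivals_per_sec=1.0,
                     scan_cost_per_record=3e-5, overhead_per_queued=0.002):
    """Return the number of login requests still queued after `seconds`."""
    backlog = 0.0
    for _ in range(seconds):
        backlog += arrivals_per_sec
        # Time to serve one login: the linear scan plus the cost of
        # managing whatever is already waiting in the queue.
        service_time = (records * scan_cost_per_record
                        + backlog * overhead_per_queued)
        # Logins that can be completed during this one-second tick.
        served = min(backlog, 1.0 / service_time)
        backlog -= served
    return backlog

small = simulate_backlog(26_000)   # scan takes under a second per login
large = simulate_backlog(34_000)   # scan takes just over a second per login
```

Under these assumptions the smaller repository never accumulates a queue, while the larger one falls a little further behind every second, and the growing queue makes each subsequent login slower still.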

Take a look at the napkin drawing here. Note that the application started out with about 26,000 records and grew through multiple mergers to 28,000, then to 32,000. As the number of records grew, there was a slight increase in the response time of the login code, from just under a second to just over. Then, as the number of records crossed the 32,000 mark, something strange happened: the response time jumped twofold. It happened again when the number of records crossed 40,000. These are tipping points, and they occur because the application's ability to process a single execution thread is impacted by the overhead of managing the threads.

In another example, there is a batch process that reads a file looking for exception records, which it spits out into another file. Exactly one hour after the batch process starts, an exception-handling batch process begins. This sequence has worked for a decade and is timed so that all records and exceptions are processed before the start of business the next day. Well, what happens when the first batch process has to trudge through more records than it can handle in one hour? Again, this is not an insurmountable problem, but it typifies why mergers and acquisitions should involve a careful review of application architectures. (It might also be a pretty good argument for event-driven systems, rather than time-based processes, but I digress.)
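As a rough illustration of the event-driven idea, here is a minimal sketch in which the exception handler waits for a "first stage finished" signal instead of a fixed one-hour offset. The record layout, the function name, and the use of threads in place of real batch jobs are all assumptions made for the example.

```python
import threading

def run_pipeline(records):
    """Run both stages; stage 2 starts only when stage 1 signals completion."""
    exceptions = []                  # stands in for the exception file
    handled = []                     # ids processed by the exception handler
    stage_one_done = threading.Event()

    def find_exceptions():
        for record in records:
            if record.get("exception"):
                exceptions.append(record)
        stage_one_done.set()         # the event replaces the one-hour timer

    def handle_exceptions():
        stage_one_done.wait()        # blocks however long stage 1 takes
        handled.extend(record["id"] for record in exceptions)

    handler = threading.Thread(target=handle_exceptions)
    finder = threading.Thread(target=find_exceptions)
    handler.start()
    finder.start()
    finder.join()
    handler.join()
    return handled

# Hypothetical input: three records, two flagged as exceptions.
sample = [{"id": 1, "exception": True},
          {"id": 2, "exception": False},
          {"id": 3, "exception": True}]
```

Because the second stage is triggered by completion rather than by the clock, doubling the number of records lengthens the overall run but cannot cause the handler to start on a half-written exception file.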
