Choosing “a” Database, not “the” Database

Chris Oldwood from The OldWood Thing

One thing I’ve run across a few times over the years is the notion that an application or system has one, and only one, database product. It’s as if the answer to the question about where we should store our data must be about where we store “all” our data.

Horses for Courses

I’ve actually touched on this topic before in “Deferring the Database Choice”, where our team tried to put off the question for as long as possible because of a previously myopic mindset. There was also a strong possibility that we would need two different styles of database – relational and document-oriented – because we had two different types of data to store with very different constraints.

In that instance, after eventually working out what we really needed, we decided to look at a traditional relational database for the transactional data [1], while we looked towards the blossoming NoSQL crowd for the higher-volume non-transactional data. While one might have sufficed for both purposes, the organisational structure and lack of operational experience at the time meant we didn’t feel comfortable putting all our eggs in that one NoSQL basket up front.

As an aside, the Solution Architect [2] who was assigned to our team by the client definitely seemed out of their comfort zone with the notion that we might want to use different products for different purposes.

Platform Investment

My more recent encounter with this “one size fits all” fallacy was while doing some consulting at a firm in the insurance sector, an area where mainframes and legacy systems pervade the landscape.

In this particular case I had been asked to help advise on the architecture of a few new internal services they were planning. Two were really just caches of upstream data, designed to reduce the per-call cost of third-party services, while the third would serve up flood-related data which was due to be incorporated into insurance pricing.

To me they all seemed like no-brainers. Even the flood data service felt like it was probably a simple web service (maybe REST) that looks up data in a document-oriented database keyed by postcode. The volume of requests and the size of the dataset did not seem remarkable in any way, and neither did the other caches. The only thing I felt deserved any real thought was the versioning of the data, if that was even a genuine consideration. (I was mostly trying to think of any potential risks that might vaguely add to the apparent lack of complexity.)

Given the company already called out from its mainframe to other web services they had built, this was a solved problem, so I felt there was no reason not to start knocking up the flood data service. Given its simplicity it could be done outside-in, so that they’d have their first microservice built TDD-style (an approach they wanted to try out anyway). They could even plug it in pretty quickly and simply ignore the responses back to the mainframe in the short term, so that they could start getting a feel for the operational aspects. In essence it seemed the perfect opportunity for the department to learn many new skills.

An Undercurrent

However, while I saw this as a low-risk venture, there were questions from further up, effectively about choosing “the” database. I suspected the concerns were about cost, but some rudimentary calculations comparing a three-node cluster with redundant disks against storage for the mainframe showed that they weren’t even in the same ballpark, and that’s without even talking about SSDs. (This also ignores the fact that they were close to maxing out the mainframe anyway.)

One of the great things about databases in these modern times is that you can download the binaries, fire one up and just get playing. Given that the dataset fitted the document-oriented paradigm and there were no transactions to speak of, I suggested they pick either MongoDB or Couchbase and get started, as it was the paradigm they most needed to get acquainted with; the specific vendor was (to me) less of a concern in the shorter term because the data model was simple.

Nevertheless, rather than build something first and get a feel for what makes most sense, they wanted to invite the various big NoSQL vendors in and discuss contracts and products up-front. So I arranged for the three main contenders at the time to visit the company’s offices and give a pitch, followed by some Q&A time for the management to ask any burning questions. It was during the first of these three pitches that I began to realise where the disconnect lay between my vision and theirs.

While I had always been working on the assumption that the company was most comfortable with mainframes and relational databases and that they wanted to step outside that and move to a less monolithic architecture, perhaps using the Strangler Pattern to break out the peripheral services into independent self-contained ones, they still saw a single database product sitting at the heart. Yes, the services might be built separately, and the data may well be partitioned via namespaces or collections or whatever, but fundamentally the assumption was that the data storage was still effectively monolithic.

A False Economy

In retrospect I shouldn’t really have been that surprised. The reason the mainframe had survived for so long was probably that the data was seen as the crown jewels, and the problems of redundancy and backup had been solved long ago and were pretty robust. In fact, if anything went wrong the vendor could helicopter some experts in (which they had done in the past). This was not the level of service offered by the new kids on the block, and the company was still far from getting comfortable with the cloud hosting and managed service providers which were starting to spring up.

Hence, where I was looking at the somewhat disposable nature of the new services purely as an opportunity for learning, others higher up were looking at it as a stepping stone to moving all their data across to another platform. Coupled with this was the old-fashioned view that the decision needed to be made up-front and needed to be the right one from the off [3].

A Different Investment

Even with this misconception acknowledged, and the clear cost savings to be had, there was still a heavy reluctance to go with something new. I believe that in the end they put their investment into more mainframe storage instead of investing in their people and the organisation’s longer-term future.

 

[1] There was definitely an element of “availability bias” here as the organisation had a volume licensing agreement with a relational database vendor.

[2] A role which highlighted their Ivory Tower approach at the time but has since fallen away as architecture has thankfully started leaning more towards shared ownership.

[3] Some of the impetus for “Don’t Fail Fast, Learn Cheaply” came from conversations I had with this organisation about their approach to career development.

Abstraction with Database Views

Chris Oldwood from The OldWood Thing

After being away from the relational database world for a few years it’s been interesting coming back and working on a mature system with plenty of SQL code. It’s been said that SQL is the assembly language of databases and when SQL code is written only using its primitives (types and tables) it’s easy to see why.

Way back in 2011 I wrote “The Public Interface of a Database” which was a distillation of my thoughts at the time about what I felt was generally wrong with much of the database code I saw. One aspect in particular which I felt was sorely underutilised was the use of views to build a logical model over the top of the physical model to allow a more emergent design to unfold. This post documents some of the ways I’ve found views to be beneficial in supporting a more agile approach to database design.

Views for Code Reuse

The first thing that struck me about the recent SQL code I saw was how much of it there was. Most queries were pretty verbose and, as a consequence, you had to work hard to comprehend what was going on. Just as you see the same tired examples around Orders => OrderItems => Products, so the code repeated a similar three-table join over and over again as it formed the basis for so many queries.

One of the primary uses for database views is as a code reuse mechanism. Instead of copy-and-pasting the same bunch of joins everywhere:

FROM Orders o
INNER JOIN OrderItems oi
ON o.Id = oi.OrderId 
INNER JOIN Products p
ON oi.ProductId = p.Id

we could simply say:

FROM OrdersOrderItemsProducts

This one simplification reduces a lot of complexity and means that wherever we see that name we instantly recognise it without mentally working through the joins in our head. Views are composable too, meaning that we can implement one view in terms of another rather than starting from scratch every time.
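As a rough sketch, the view behind that name might be defined something like this; the exposed column list is purely illustrative, with OrderDate, Name and Quantity being assumed columns rather than anything from a real schema:

CREATE VIEW OrdersOrderItemsProducts
AS
    -- expose the classic three-table join under a single name
    SELECT  o.Id        AS OrderId,
            o.OrderDate AS OrderDate,
            p.Id        AS ProductId,
            p.Name      AS ProductName,
            oi.Quantity AS Quantity
    FROM Orders o
    INNER JOIN OrderItems oi
    ON o.Id = oi.OrderId
    INNER JOIN Products p
    ON oi.ProductId = p.Id;

Any query that previously spelled out the join can now simply select from the view, and further views can be layered on top of this one where a more refined set is needed.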

Naming

However, if the name OrdersOrderItemsProducts makes you wince then I don’t blame you because it’s jarring due to its length and unnaturalness. It’s a classic attempt at naming based on how it’s implemented rather than what it means.

I suspect the difficulty of naming views is part of the reason for their lack of use in some cases. For our classic example above I would probably go with OrderedProducts or ProductsOrdered. The latter is probably preferable as the point of focus is the Products “set”, with the use of Orders being a means to qualify which products we’re interested in, like “users online”. Of course one could just as easily say “unread messages”, and so we quickly remember why naming is one of the two hardest problems in computer science.

Either way it’s important that we do spend the time required to name our views appropriately as they become the foundation on which we base many of our other queries.

Views for Encapsulation

Using views as a code reuse mechanism is definitely highly beneficial, but where I think they start to provide more value is as a mechanism for revealing new, derived sets of data. The name ProductsOrdered is not radically different from the more long-winded OrdersOrderItemsProducts and therefore it still heavily reflects the physical relationship of the underlying tables.

Now imagine a cinema ticketing system where you have two core relationships: Venue => Screen => SeatingPlan and Film => Screening => Ticket => Seat. By navigating these two relationships it is possible to determine the occupancy of the venue, screen, showing, etc. and yet the term Occupancy says nothing about how that is achieved. In essence we have revealed a new abstraction (Occupancy) which can be independently queried and therefore elevates our thinking to a higher plane instead of getting bogged down in the lengthy chain of joins across a variety of base tables.
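A hedged sketch of what such a view might look like follows; the table and column names are guesses at a plausible ticketing schema rather than anything definitive:

CREATE VIEW Occupancy
AS
    -- how full each screening is, without consumers needing to know the
    -- chain of joins that produces the answer
    SELECT  sc.Id            AS ScreeningId,
            COUNT(sp.SeatId) AS TotalSeats,
            COUNT(t.Id)      AS SeatsSold
    FROM Screenings sc
    INNER JOIN Screens s
    ON s.Id = sc.ScreenId
    INNER JOIN SeatingPlans sp
    ON sp.ScreenId = s.Id
    LEFT JOIN Tickets t
    ON t.ScreeningId = sc.Id
    AND t.SeatId = sp.SeatId
    GROUP BY sc.Id;

Consumers can then query Occupancy directly, and the view can be reworked without them caring how the underlying relationships are navigated.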

Views for Addressing Uncertainty

We can also turn this thinking upside down, so that rather than creating something new by hiding the underlying existing structure, we can start with something concrete and re-organise how things work underneath. This is the essence of refactoring – changing the design without changing the behaviour.

When databases were used as a point of integration this idea of hiding the underlying schema from “consumers” made sense as it gave you more room to change the schema without breaking a bunch of queries your consumers had already created. But even if you have sole control over your schema there is still a good reason why you might want to hide the schema, nay implementation, even from much of your own code.

Imagine you are developing a system where you need to keep daily versions of your customers’ details easily accessible because you regularly perform computations across multiple dates [1] and need to use the correct version of each customer’s data for the relevant date. When you start out you may not know the most appropriate way to store them because you do not know how frequently they change, what kinds of changes are made, or how the data will be used in practice.

If you assume that most attributes change most days you may well plump for just storing them daily, in full, e.g.

| Date       | Name      | Valuation | ... | 
| 2019-03-01 | Company A | £102m     | ... |  
| 2019-03-01 | Company B | £47m      | ... |  
| 2019-03-02 | Company A | £105m     | ... |  
| 2019-03-02 | Company B | £42m      | ... |  
| 2019-03-03 | Company A | £105m     | ... |  
| 2019-03-03 | Company B | £42m      | ... |

Conversely, if the attributes rarely change from day to day then maybe we can version the data instead:

| Name      | Version | Valuation | ... |
| Company A | 1       | £147m     | ... |
| Company A | 2       | £156m     | ... |
| Company B | 1       | £27m      | ... |

So far so good, but how do we track which version belongs to which date? Once again I can think of two obvious choices. The first is much like the original verbose table and we record it on a daily basis:

| Date       | Name      | Version |
| 2019-03-01 | Company A | 1       |
| 2019-03-01 | Company B | 1       |
| 2019-03-02 | Company A | 1       |
| 2019-03-02 | Company B | 2       |

The second is to coalesce dates with the same version creating a much more compact form:

| From       | To         | Name      | Version |
| 2019-03-01 | (null)     | Company A | 1       |
| 2019-03-01 | 2019-03-01 | Company B | 1       |
| 2019-03-02 | (null)     | Company B | 2       |

Notice how we have yet another design choice to make here – whether to use NULL to represent “the future”, or whether to put today’s date as the upper bound and bump it on a daily basis [2].
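To make that choice a little more concrete, here is a hedged sketch of looking up the version in force on a given date under the coalesced form, with NULL standing for “the future” (the CustomerVersionRange table name is an assumption, and From/To are bracketed because they clash with SQL keywords):

SELECT Version
FROM CustomerVersionRange
WHERE Name = @name
AND @date >= [From]
AND ([To] IS NULL OR @date <= [To])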

So, with all those choices how do we make a decision? What if we don’t need to make a decision, now? What if we Use Uncertainty as a Driver and create a design that is easily changeable when we know more about the shape of the data and how it’s used?

What we do know is that we need to process customer data on a per-date basis, therefore, instead of starting with a Customer table we start with a Customer view which has the shape we’re interested in:

| Date | Name | Valuation | ... | 

We can happily use this view wherever we like knowing that the underlying structure could change without us needing to fix up lots of code. Naturally some code will be dependent on the physical structure, but the point is that we’ve kept it to a bare minimum. If we need to transition from one design to another, but can’t take the downtime to rewrite all the data up-front, that can often be hidden behind the view too.
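As a minimal sketch, assuming the daily Date-to-Version mapping shown earlier lives in a (hypothetically named) CustomerVersionDate table and the versioned attributes in a CustomerVersion table, the view might be little more than a join:

CREATE VIEW Customer
AS
    -- present the per-date shape consumers want while hiding how the
    -- versions are actually stored underneath
    SELECT  m.Date,
            v.Name,
            v.Valuation
    FROM CustomerVersionDate m
    INNER JOIN CustomerVersion v
    ON v.Name = m.Name
    AND v.Version = m.Version;

If we later switch to the coalesced From/To representation, only the body of the view changes, for example by joining the date onto the range lookup sketched earlier, and everything written against Customer carries on untouched.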

Views as Interfaces

It’s probably my background [3] but I can’t help but notice a strong parallel in the latter two examples with the use of interfaces in object-oriented code. George Box reminds us that “all models are wrong, but some are useful” and so we should be careful not to strain the analogy too far but I think there is some value in considering the relationship between views and tables as somewhat akin to interfaces and classes, at least for the purposes of encapsulation as described above.

On a similar note we often strive to create and use the narrowest interface that solves our problem and that should be no different in the database world either. Creating narrower interfaces (views) allows us to remain more in control of our implementation by leaking less.

One final type-related comparison I think worthy of mention is that it’s easier to spot structural problems when you have a “richer type system”, i.e. many well-named views. For example, if a query joins through ProductsOrdered to get to UserPreferences you can easily see that something funky is going on.

Embracing Change

Working alongside a database where the SQL code and schema get refactored almost as heavily as the services that depend on them is a pleasurable experience [4]. Scott Ambler wrote a couple of books over a decade ago (Refactoring Databases: Evolutionary Database Design and Agile Database Techniques) which convinced me long ago that it was possible to design databases that could embrace change. Making judicious use of views certainly helped achieve that, in part by keeping the accidental complexity down.

Admittedly performance concerns, still something of a dark art in the world of databases, get in the way every now and then, but I’d rather try to make the database a better place for my successors than assume it can’t be done.

 

[1] In investment banking it’s common to re-evaluate trades and portfolios on historical dates both for regulatory and analytical purposes.

[2] Some interesting scenarios crop up here when repeatability matters and you have an unreliable upstream data source.

[3] I’m largely a self-taught, back-end developer with many years of writing C++ and C# based services.

[4] Having a large suite of database unit tests, also written in T-SQL, really helped as we could use TDD on the database schema too.

Optimistic SQL

Chris Oldwood from The OldWood Thing

One of the benefits of learning other programming languages is the way it teaches you about other paradigms and idioms. This is the premise behind the “Seven Xxx in Seven Weeks” range of books. Although I have the database one on my bookshelf I’ve only ever skimmed it, as at the time I bought it I suddenly found myself leaving the world of the classic RDBMS behind and working with other types of database for real, most notably the document-oriented kind.

Although some of these products like MongoDB and Couchbase have come a long way from their early beginnings as highly available key-value stores they often still lack the full-on transaction support of the old stalwarts like SQL Server and PostgreSQL. Coupled with a high-availability service you have to think differently about how you react to concurrency conflicts as explicit locking is almost certainly never the answer [1].

The impetus for this post was going back into the world of SQL databases and being slightly bemused by a stored procedure that appeared to implement an “upsert” (an UPDATE or INSERT depending on whether the row already exists) as I realised it wasn’t how I’d approach it these days.

The Existing SQL Approach

Initially I was somewhat flummoxed as to why it was even written the way it was, as there appeared to be no concurrency issues in play at all (it was a single service doing the writing). I later discovered that an accident of the implementation meant there were two writers internally competing, and they had chosen to resolve this in the database rather than remove the root source of concurrency in the service.

The upsert was basically written like this:

  • Try SELECTing the existing row.
  • If it exists, UPDATE it.
  • If it doesn’t exist, INSERT it.

In the service code there were a number of comments describing why the transaction level was being bumped up to “serializable” – it was effectively to deal with the concurrency they had introduced within the code by creating two competing writers. On top of that the initial SELECT statement in the upsert applied a HOLDLOCK which also effectively makes the transaction serializable because it puts a range lock on the row’s key (even if that key doesn’t exist yet).
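Rendered as T-SQL, the original upsert was shaped roughly like this; the Items table, its columns and the @itemId/@value parameters are hypothetical stand-ins for the real schema:

BEGIN TRANSACTION

DECLARE @exists bit

-- HOLDLOCK takes a key-range lock even when the row does not exist yet,
-- so a competing writer blocks until this transaction commits
SELECT @exists = 1
FROM Items WITH (HOLDLOCK)
WHERE ItemId = @itemId

IF @exists = 1
    UPDATE Items
    SET Value = @value
    WHERE ItemId = @itemId
ELSE
    INSERT INTO Items (ItemId, Value)
    VALUES (@itemId, @value)

COMMIT TRANSACTION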

The Document DB Approach

The last few years away from the relational world meant that I was used to dealing with these kinds of conflicts at a slightly lower level. Also, dealing with document updates in the service rather than writing them as SQL meant that updates were done in a loop in the service code rather than pushing the concurrency issue down into the database, hence it would look more like this:

  • Try selecting the document.
  • If it exists, update it and try writing it back.
  • If it doesn’t exist, try creating it.
  • If any write fails start over from the beginning.

Due to the lack of transactions and locking, write conflicts are commonly detected by using a version number attribute that gets used in the update predicate [2]. (A write failure, via a “document not found” error, means the predicate failed to match the specific document and version and therefore a conflict has occurred.)
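For comparison, the same optimistic check can be expressed in SQL terms as a version predicate on the update; the Documents table, its columns and the parameters below are assumptions, and a real document database would express this through its own API rather than SQL:

UPDATE Documents
SET Body = @newBody,
    Version = Version + 1
WHERE DocumentId = @documentId
AND Version = @expectedVersion

-- @@ROWCOUNT = 0 means the predicate missed, i.e. another writer got in
-- first, so the caller re-reads the document and starts over
IF @@ROWCOUNT = 0
    RAISERROR('write conflict, please retry', 16, 1)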

Another SQL Approach

So what does all this have to do with upserts in SQL?

Well, what I found interesting was that my gut reaction was to question why the initial SELECT was there at all, as I would have written it as:

  • Try to UPDATE the row.
  • If no rows were updated, then INSERT it.

This particular order assumes that updates are more prevalent than inserts, and as a rule I’d say that checking @@ROWCOUNT to see if anything was written is far less ugly than adding a TRY…CATCH block in T-SQL and attempting to verify that the insert failed due to a primary key violation.

That all seemed fairly obvious, but I had forgotten that with the document DB approach you tend to expect, and handle, write failures as part of handling concurrency. In this case, if two connections both attempted the upsert concurrently it’s theoretically possible that they could both miss on the UPDATE step, and then one of the INSERTs would succeed and the other would fail with a primary key violation. However, the code in the service was not written to detect this and retry the operation (as you would with a document DB), which is why the initial SELECT is there – to lock the “unwritten row” up front so that another transaction is blocked until the row is inserted or updated. This way no client logic needs to handle the concurrency problem.

However I believe we can still achieve the same effect by adding the same HOLDLOCK hint to our initial UPDATE so that if the row does not exist other writers will be blocked by the range lock until the subsequent INSERT goes through. Hence the initial SELECT is, I believe, redundant.
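A minimal sketch of that update-first shape, reusing the hypothetical Items table from before (UPDLOCK is paired with the HOLDLOCK mentioned above, as is commonly advised, to reduce the chance of two concurrent writers deadlocking on the shared range lock):

BEGIN TRANSACTION

-- with HOLDLOCK the UPDATE takes a key-range lock even when the row is
-- missing, so a concurrent upsert for the same key waits rather than racing
UPDATE Items WITH (UPDLOCK, HOLDLOCK)
SET Value = @value
WHERE ItemId = @itemId

IF @@ROWCOUNT = 0
    INSERT INTO Items (ItemId, Value)
    VALUES (@itemId, @value)

COMMIT TRANSACTION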

The MERGE Approach

At this point I remembered that way back in the past SQL Server introduced the MERGE statement, which effectively allows you to write an upsert as a single statement by factoring both the hit and miss logic into different branches. This caused me to go looking to see what the state of the art in upsert techniques was, possibly with performance comparisons to see how much faster this must be (given that SQL Server clearly has the potential to optimise the query plan as it better knows our intent).
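For reference, a MERGE-based upsert on the same hypothetical table might look roughly like this; note that a HOLDLOCK hint is still generally needed because MERGE on its own does not protect two concurrent upserts from racing on a missing key:

MERGE Items WITH (HOLDLOCK) AS target
USING (SELECT @itemId AS ItemId, @value AS Value) AS source
ON target.ItemId = source.ItemId
WHEN MATCHED THEN
    UPDATE SET Value = source.Value
WHEN NOT MATCHED THEN
    INSERT (ItemId, Value)
    VALUES (source.ItemId, source.Value);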

I started digging and was somewhat surprised when I came across the page “Performance of the SQL MERGE vs. INSERT/UPDATE”. I was expecting to have my hypothesis validated but discovered that the answer was far from clear cut. Naturally I then googled “SQL Server upsert performance” to see what else had been written on the subject and I discovered this wasn’t an anomaly so much as a misunderstanding about what problem the MERGE statement is really intended to solve.

You should of course never take performance improvements at face value but “measure, measure, measure” for yourself. I wasn’t avoiding doing that; I was looking to see if there might be any pitfalls I needed to be wary of when benchmarking the approach.

At this point I haven’t gone any further with this as it’s more of a personal investigation (there is no actual performance issue to solve) but it just goes to show that writing SQL is as much an art as it’s always been.

 

[1] Some document databases, such as Couchbase, do support locking of documents, but there are heavy restrictions so you tend to find another way.

[2] In the particular example I was looking at no version number was needed in the SQL predicate because the data had a total ordering independent of the write order (it was tracking the minimum and maximum of a value over the day).

Weak and Strong Idempotency

Chris Oldwood from The OldWood Thing

In my recent post “PUT vs POST and Idempotency” I touched on the different degrees of idempotency. In retrospect I think it’s better to describe our idempotency guarantee as weak or strong. Either way this post looks at why you might choose one over the other.

Weak Idempotency

At one end of the spectrum we can choose to classify our unique operations with a single scalar value such as a large number, UUID or complex string value (with other embedded values like the date & time). In this scenario we assume that a simple value represents the identity of the operation in much the same way that a reference (or pointer) is the unique address of a value in languages like C# and Java (or C++).

In SQL terms your write with an idempotency check might look something like this:

insert into Operations
select @operationId, ...
where not exists
(
  select 1
  from Operations
  where OperationId = @operationId
)

For a document oriented database like MongoDB where you might store everything in a single object to achieve atomic updates you could do an equivalent find-and-modify-where style query. If the data model requires multiple writes you might still be able to order them carefully to achieve the desired effect (See “Observable State versus Persisted State”).

Strong Idempotency

In contrast, at the other end, we define an idempotent operation based not only on some unique operation identifier but also on the arguments to the operation itself, i.e. the request body in HTTP. In this instance we are not willing to accept that a replay can be distinguished solely by its handle; we also wish to ensure that the actual details match, just like a “deep” value equality comparison.

In SQL terms we potentially now have two steps, one to attempt the write and, if a conflict occurs, a subsequent read to validate it:

insert into Operations
select @operationId, ...
where not exists
(
  select 1
  from Operations
  where OperationId = @operationId
)

select *
from Operations
where OperationId = @operationId

If replays are rare (and I’d expect them to be) you should find this adds only a little extra overhead but also allows you to report broken requests properly.

Identity and Equality

Although it might seem like these weak and strong guarantees are directly analogous to the concept of identity and equality in normal programming there is a subtle difference.

Whilst in the weak case we are happy for the unique operation identifier to act as our sole definition of equality, in the strong guarantee both the identifier and the input values [1] take part in the comparison. In a programming language we tend to choose either the reference-based comparison or the value-based approach, but not both [2].

Failure Modes

In theory it should be adequate for us to take the simpler approach, and as the provider of a service it’s almost certainly slightly cheaper for us to do so. However, the downside is that it makes it harder to diagnose missed updates caused by the service erroneously treating them as replays. But that’s just an edge case, right?

During development when your clients are coding against your API for the first time they may make a simple mistake and not correctly generate unique requests. This is not quite as uncommon as you may think.

Another potential source of mistaken uniqueness can come from transforming requests, e.g. from a message queue. If you’re integrating with a legacy system, what you think is unique might in fact be an artefact of the small data set you sampled, and could turn out to be only mostly unique [3].

Answering these kinds of support queries (“what happened to my request?”) can easily chip away at a team’s time. The more effort you put into making it harder for consumers to use your API incorrectly, the less time you’ll need to spend supporting them.

Debug Switch

Although the need to verify a replay should be pretty infrequent you could easily wrap the validation logic in a feature switch so that development and test environments provide better diagnostics whilst production has less work to do. But you should measure before considering dropping such a low-cost feature.

Monitoring

What will help both the service and its consumers is if you can afford to capture when replays do occur, so that the monitoring can alert if there appears to be a sudden rise in them. For example, a client may go into a failure state that keeps replaying the same message over and over again. Alternatively, the rise might suggest some configuration problem where upstream traffic is being duplicated. From a functional point of view the entire system could be behaving correctly, but wasting resources in the process in a way that could lead to a more catastrophic failure state.

Although a client should never need to behave differently from a functional perspective when a replay occurs, the flip side to the service tracking the event is for the client to report it as well. They could do this if, say, the return code changes in the replay case. In HTTP, for example, there are 200 OK and 201 Created, which could be used to tell the client what happened.

Middle Ground

As I described in my previous post there can also be points between the two extremes. One of the problems with modern document oriented databases is that they have limits on the size of a single document. Depending on the anticipated depth of history this may exceed the practical document limit and therefore require some compromises to be made.

If no false replays can be tolerated then perhaps the model needs splitting across documents, which removes the ability to do simple atomic updates. If the history grows slowly and data retention is negotiable then maybe the fine details of much older operations can be discarded to make room for the new.

Brave New World

The new era of databases continues to bring interesting challenges and trade-offs due to their differing goals from a traditional big-iron RDBMS. Pushing back on some requirements (e.g. “Don’t Be Afraid to Throw Away Data”) is essential if we are to balance their constraints with correctness.

 

[1] You may store server derived properties as well, like a “created at” timestamp. These are of course ignored in any comparison.

[2] Of course breaking an invariant like the relationship between Equals() and GetHashCode() is a popular mistake which can make objects appear to go missing in maps & sets.

[3] I know of one organisation where many people mistakenly believed their customer identifiers were unique, but we discovered they were only truly unique if you took their legacy back-office sources into account too.