Database design at scale

When developing large scale enterprise applications, the choice and the design of the database is crucial to the financial and operational success of the system.

There are three factors that come into play when choosing a database.

  1. The data model
  2. Operational capabilities of the database in question
  3. Tooling around the database

These three factors are not independent of each other. Complications arise from the interdependencies between them. For example, a relational data model is much more suited adhoc aggregation and querying than any other data model. Consequently, there a number of analytic tools around relational databases. This makes it difficult for a systems engineer to analyse and compare the different database solutions available and choose the one that would suite his/her need the most.

Single node

The simplest approach to database design is to have a single server with all the data. With this approach, you would avoid a whole class of problems that come along with multi node approaches.

At large scales, though, your data and access requirements could quickly outrun the capability of the machine. When this happens, you would have no choice but to throw a lot of resources like memory, network bandwidth etc at the machine which could be expensive.

The single node approach suffers from another problem that the node forms a single point of failure for the database system. Ideally we would like to have redundant nodes which can take over should the node fail.

Replica nodes

To avoid the problem of single point of failure when having one node with all the data, we create “hot standby” nodes. The nodes are plain replicas of the single node to which all read and write operations are directed.

Though conceptually the idea of replicas are simple, it is fraught with operational issues which should be solved. Some of the problems are listed below.

New Node Warming

Imagine that we have a system operational in production with one node (which we would call master) and we would like to add a replica. For the replica to be considered valid, it should have all the data present in the master. If there is a large amount of data in the master, this could be a huge logistical issue.

Replication of meta data

In any database system, we would to create, modify and remove meta data like schemas, indexes etc. We would want this meta data, which could be non trivial in large systems to be replicated automatically as well.

Master election and automatic fail over

In the case of the failure of master node, we would like one of the replica nodes to take over automatically so that the system is always available to the end users. If there are many replicas, we call this the problem of “electing a new master”.

Once a new master is elected, all reads and writes are should directed to this node. This could be achieved using a middle ware which becomes aware of the new election and starts directing traffic to the newly elected node.

Another class of problems arise when the old master  comes back up again. Decisions should be made whether the newly elected master continues to be the master or whether the old master takes over. In either case, we would have to bring the old master up to date with the latest data before it could join the cluster.

Split brain problem

Automatic fail over demands the existence of a middle ware which routes all traffic. Now which middle ware becomes a single point of failure for the system. To avoid this, we would like to have redundancy in the middle ware level as well. Redundant systems with state is a huge time bomb.

Situations might arise when one part of the middle ware thinks that system A is the master and another part thinks that system B is the master. When this happens, the middle ware could start routing traffic to both the nodes; in which case, the nodes would diverge and could require manual intervention to bring them back into sync again.

Transaction overhead

To ensure that either all nodes see a change, or no node sees it is a problem in itself. This would require all the nodes to take part in a transaction which could be wasteful effort, since the chances that more than two nodes would fail is rare anyway.

Low utilization

Low utilization is the most fundamental problem with replica nodes. We would have invested in (number of replicas) times the resources that a single node would consume and all that we are gaining is mere automatic fail over of a node which is not likely to fail.

Multi master

A multi master configuration just treats all the nodes in the cluster as a master capable of handling reads and writes. This solves the problem of low utilization but comes with problems of its own


Remember the split brain problem? A multi master configuration is split brain problem on steroids. At this scale, it is no longer possibly to manually reconcile divergent nodes and there should be an automated way of doing that.

Eventual consistency over Transactionality

Implementing distributed transactions in replica nodes which have data flowing in only one direction (from master to replicas) is hard enough. Implementing the same in a multi master set up could lead to very poor utilization of resources due to pessimistic locks usually associated with transactions.


The most fundamental problem with a multi master setup though, is the fact that it could not be applied to all domains and all data models. A relational data model with its referential integrity constraint for example is not naturally suited for a multi master setup. Also some domains like banking which have data that should be globally consistent might not be inherently suited for eventual consistency and reconciliation.


Shards are the most complicated designs in a database system. Shards avoid the problem of low utilization by partitioning data across multiple nodes. They are designed around the fact that “not all data are created equal”.

There are some kinds of data that should be more consistent and available and some data like audits that need not be available and could be eventually consistent. In a domain specific manner, certain entities need not have any relation to other entities. For example, personal accounts in the banking systems could be treated separately from corporate accounts.

All these factors give us an opportunity to store data in separate data sources there by managing the sheer size of the database. As with any approach, sharding is not without its share of pitfalls.

Complex client/middleware

Once you have sharded the data, reads and write are more complicated since the client needs to know which shard to write to or read from. This could be abstracted away from the client using a middle ware, but still this is a problem for the system designer.


Imagine that you have a database of people who sell items. If one person has too many items, you would want to move the person to a different shard. When you do this, you would have to move not only the person’s record but also all the items that belong to the person. Ideally we would like to do this in an automated fashion.


Certain portions of the database might not be available for read or write when automatic resharding is in progress.

Cross shard data movement

When your usage patterns change and you would have to combine data from across the shards, the amount of data movement over the network could be inhibitively expensive to say the least. In these cases, you would have to juggle around with your sharding scheme to avoid this data movement.

Distributed map reduce

Given that sharding is something that we would have to live with in large systems, distributed map reduce helps us reduce the processing time and data movement by partitioning the queries, distributing the work load and transporting only the result over the network. Such databases are said to follow the ACID 2.0 (Associative, commutative, Idempotent and distributed)  paradigm, for these group operations help in distributing and combining the results. But that is a topic for another post.




Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s