Sharding (10)

Is sharding not as popular or more difficult with Relational/SQL databases?

Monitor MySQL metrics with Datadog.
Graph and set alerts on MySQL performance, plus data from the rest of your apps + infrastructure.
Janko Jerinic
Janko Jerinic, MSc Electrical Engineering & Computer Science, University of Belgrade, School of Electrical Engineering (2009)

Well, yes and no.

Most traditional RDBMS’s, like Oracle, SQL Server, MySql, Postgres, et al, are designed to be standalone, single servers and, as such, they do not have internalmechanisms that provide sharding functionality by default.

That doesn’t mean that application-level sharding is impossible with them and, in fact, many large distributed systems have done exactly that – companies like Quora, or Facebook. So, you can indeed horizontally scale storage and load across multiple RDBMS’s, it just doesn’t come out of the box, and the performance is fine. You can’t do joins across different “shards”, or servers, but neither can you using a NoSql, naturally sharded, database.

However, there actually are RDBMS’s who support sharding naturally, to some extent. Think about it – SQL is basically just an interface on top of a query optimizer, query executor, and a storage engine. There is nothing to prevent a SQL database from implementing internal sharding.

Two examples that come to my mind are – Amazon Redshift and Microsoft APS Parallel Data Warehouse. The first one I use quite a bit and the second one I actually worked on. Both of these systems will allow you to define partition keys which they will use to distribute data across so-called compute nodes. SQL queries are executed by collocating that distributed data as necessary and aggregating results of execution from multiple nodes. Now, these systems aren’t infinitely scalable – they come in predefined cluster sizes, but the data is, indeed, partitioned and there is nothing to prevent a relational system from being fully distributed. In fact, we do have one – Google’s Spanner.

Greg Kemnitz
Greg Kemnitz, Postgres internals, embedded device db internals, MySQL user-level

In order to support relational joins, sharding in distributed relational data-worlds is usually implemented at the application or “policy” level, as cross-instance joins are generally very expensive and slow.

A simple example would be that users whose names start with ‘A’ go to Instance 1, those with ‘B’, go to instance 2, etc. All the data associated with each group of users would live on the separate instances, with the idea that during normal production operations, there’s no need to execute cross-instance queries.

Most NoSQL databases essentially skip the whole concept of joins, and store all the data associated with a particular thing in a Document. Since all the data you need is in a particular document, you can shard based on one of the main attributes of your documents, and using this Shard Key, you can know which instance in your dataworld the document you care about happens to be in.

This data model works very well for certain types of applications, but less well for applications where the data is basically tabular and joining is routine. So, in very large data-worlds, you often end up with a hybrid of NoSQL and SQL databases.

But it is worth knowing that it is very important to choose shard keys carefully in both distributed relational and in multi-instance NoSQL deployments.

Daniel Kuck-Alvarez
Daniel Kuck-Alvarez, I hold a BS degree in Computer Science

Relational databases do not usually have a mechanism for sharding.

Sharding refers to a data storage practice in which groups of data are kept on separate systems. For example, you may choose to keep data on half of your users on one database and keep the data about the other half on a separate database stored on a separate machine.

The objective of sharding is to reduce the work each machine has to do. Many NoSQL databases have sharding built in and it is transparent to the software accessing the data. The software has one endpoint to reach the data and doesn’t know which database it is actually accessing.

Most relational databases do not have this capability built in (maybe none of them do). If software wants to shard across two relational databases, it has to keep track of the connections to each database separately.

The major relational databases have master-slave configuration options so that many separate databases may be kept in sync. In these configurations, each database has a full copy of the data, not just a portion of it. The software still has the task of deciding which database to access, but whichever it chooses, the database will have the complete data set. 

David Brower
David Brower, Husband, dad, programmer/architect, occasional blogger, onetime sound engineer.

Sharding is one form of partitioning, and most DBMS’s have partitioning. What’s interesting about sharding is use across multiple machines for horizontal scaling. Horizontal scaling in relational DBMS’s is difficult. Only a few do it well, and they are commercial and people don’t like to pay for it, so they’ll settle for the more relaxed semantics of sharded, eventually consistent distributed databases that don’t do transactions as well, or do joins nicely.

There’s a rich development field going right now trying to put decent enough SQL on top of weaker data storage engines.

David Kittrell
David Kittrell, Sr Consultant, E-discovery & Data Visualization

Read up on Teradata. While I agree with other responses regarding popularity and generic DBMS, Teradata technology from the 80s implemented physical sharding methods to support large-scale DBs (the “terabyte” size implied by the name) at a time when a few hundred megabytes were big commercial data stores. After bouncing around from AT&T, NCR. etc., they are still found in high-end analytics domains. Nice tech and a true DB innovator.