What’s all the fuss about in-memory databases for IoT?
So I thought I’d look at some of the novel trends in in-memory data processing: in-memory databases as well as data fabrics and data streaming engines.
The use of memory in computing is not new. But while memory is faster than disk by an order of magnitude, it is also an order of magnitude more expensive. That has for the most part left memory relegated to acting as a caching layer, while nearly all of the data is stored on disk. However, in recent years the cost of memory has been falling, making it possible to put far larger datasets in memory for data processing tasks, rather than use it simply as a cache.
It’s not just that it is now possible to store larger datasets in memory for rapid analytics, it is also that it is highly desirable. In the era of IoT, data often streams into the data centre or the cloud – the likes of sensor data from anything from a production line to an oil rig. The faster the organization is able to spot anomalies in that data, the better the quality of predictive maintenance. In-memory technologies are helping firms see those anomalies close to, or in, real-time – certainly much faster than storing data in a disk-based database and having to move packets of data to a cache for analytics.
I expect take-up of in-memory data processing to accelerate dramatically, as companies come to grips with their data challenges and move beyond more traditional data analytics in the era of IoT. In-memory databases are 10 to 100 times faster than traditional databases, depending on the exact use case. When one considers that some IoT use cases involve the collection, processing and analysis of millions of events per second, you can see why in-memory becomes so much more appealing.
There’s another big advantage with in-memory databases. Traditionally, databases have been geared toward one of two main uses: handling transactions, or enabling rapid analysis of those transactions – analytics. The I/O limitations of disk-based databases meant that those handling transactions would slow down considerably when also being asked to return the results of data queries. That’s why data was often exported from the transactional database into another platform – a data warehouse – where it could more rapidly be analyzed without impacting the performance of the system.
Hybrid operational and analytical databases
With in-memory databases, it’s becoming increasingly common for both operational and analytic workloads to be able to run in memory rather than on disk. With an in-memory database, all (or nearly all) of the data is held in memory, making reads and writes an order of magnitude faster – so much so that both transactional duties and analytic queries can be handled by the same database.
There are a number of in-memory database players vying for what has become an even more lucrative market in the era of IoT. The largest incumbent database vendors such as Oracle, IBM and Microsoft have added in-memory capabilities to their time-tested databases. SAP has spent many millions of dollars educating the market about the benefits of its in-memory HANA database, saying it will drop support for all other third party databases under its enterprise software by 2025. There are also smaller vendors vying for market share such as Actian, Altibase, MemSQL and VoltDB.
Data grids & fabrics
Then there is the in-memory data grid (sometimes known as a data fabric) segment. This is an in-memory technology that you ‘slide’ between the applications and the database, thereby speeding up the applications by keeping frequently-used data in memory. It acts as a large in-memory cache, but using clustering techniques (hence being called an in-memory grid) it’s possible to store vast amounts of data on the grid.
In recent years their role has evolved beyond mere caching. They still speed up applications and reduce the load on the database, and have the advantage of requiring little or no rewriting of applications, or interference with the original database. But now as well as caching, they are being pressed into action as data platforms in their own right: they can be queried (very fast, in comparison with a database), they add another layer of high availability and fault tolerance – possibly across data centers – and they are increasingly being used as a destination for machine learning.
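The caching role described above boils down to a read-through pattern: the application asks the grid for data, and only on a miss does the grid fall through to the underlying database. A minimal sketch of that pattern, with a plain dict standing in for the clustered in-memory store and a hypothetical loader standing in for the database, might look like this:

```python
# Minimal read-through cache sketch. The dict here stands in for a
# clustered in-memory grid; the loader stands in for the database.

class ReadThroughCache:
    def __init__(self, loader):
        self.loader = loader          # called only on a cache miss
        self.store = {}               # stands in for the in-memory grid
        self.db_reads = 0             # how often we had to hit the "database"

    def get(self, key):
        if key not in self.store:     # miss: fall through to the database
            self.db_reads += 1
            self.store[key] = self.loader(key)
        return self.store[key]        # hit: served straight from memory

# Hypothetical backing database.
db = {"user:1": "Alice", "user:2": "Bob"}
cache = ReadThroughCache(lambda k: db[k])

cache.get("user:1")   # first access loads from the database
cache.get("user:1")   # second access is served from memory
print(cache.db_reads)
```

The appeal for existing applications is exactly what the text notes: the grid slides between application and database, so the application code changes little while repeated reads stop touching the disk-based store.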
There are data grid offerings from a handful of vendors, amongst them Oracle, IBM, Software AG, Amazon Web Services, Pivotal, Red Hat, Tibco, GigaSpaces, Hazelcast, GridGain Systems and ScaleOut Software.
Data streaming engines
The third category, streaming, is also notable in the context of the Internet of Things. Data streaming involves the rapid ingestion and movement of data from one source to another data store. It employs in-memory techniques to give it the requisite speed. Streaming engines ingest data, potentially filter some of it, and also perform analytics on it. They can raise alerts, help to detect patterns, and start to form a level of understanding of what is actually going on with the data (and hence with the sensors, actuators or systems that are being monitored).
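To make the "ingest, filter, raise alerts" idea concrete, here is a toy sketch (not any particular vendor's API) of windowed anomaly detection over a sensor stream: keep a small sliding window in memory and flag readings that deviate sharply from the recent norm.

```python
from collections import deque

def stream_alerts(readings, window=5, threshold=2.0):
    """Flag readings that deviate from the recent window's mean by more
    than `threshold` times the window's mean absolute deviation."""
    recent = deque(maxlen=window)     # in-memory sliding window
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            mad = sum(abs(x - mean) for x in recent) / window
            if mad and abs(value - mean) > threshold * mad:
                alerts.append((i, value))   # anomaly: raise an alert
        recent.append(value)
    return alerts

# Hypothetical sensor feed with one spike at index 6.
sensor = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 25.0, 10.0]
print(stream_alerts(sensor))   # [(6, 25.0)]
```

A real streaming engine does the same kind of work continuously and in parallel across many streams, which is where the in-memory techniques mentioned above earn their keep.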
While streaming was once largely confined to the lowest-latency environments, such as algorithmic trading in the financial sector, more and more use cases in the IoT space are latency sensitive: e-commerce, advertising, online gaming and gambling, sentiment analysis and more.
There are relatively few vendors with data streaming technology. But they include IBM with Streams, Amazon Web Services’ Kinesis in the cloud, Informatica with its Ultra Messaging Streaming Edition, SAS’ Event Stream Processing (ESP), Impetus Technologies with its StreamAnalytix and also TIBCO, Software AG and SAP (which bought StreamBase Systems, Apama and Aleri, respectively).
Smaller competitors include DataTorrent, which has a stream processing application that sits on a Hadoop cluster and can be used to analyze the data as it streams in, and SQL-based event-processing specialist SQLstream. Another young company is Striim.
In the open source space, Apache Spark Streaming and Apache Storm both offer streaming – most vendors have added support for Spark rather than Storm. But that, as they say, is a story for another day.
Is sharding not as popular or more difficult with Relational/SQL databases?
Well, yes and no.
Most traditional RDBMSs, like Oracle, SQL Server, MySQL, Postgres, et al, are designed to be standalone, single servers and, as such, they do not provide internal sharding mechanisms by default.
That doesn’t mean that application-level sharding is impossible with them; in fact, many large distributed systems – at companies like Quora and Facebook – have done exactly that. So you can indeed horizontally scale storage and load across multiple RDBMSs with good performance; it just doesn’t come out of the box. You can’t do joins across different “shards”, or servers, but neither can you in a naturally sharded NoSQL database.
However, there actually are RDBMSs that support sharding natively, to some extent. Think about it: SQL is basically just an interface on top of a query optimizer, a query executor, and a storage engine. There is nothing to prevent a SQL database from implementing sharding internally.
Two examples that come to my mind are Amazon Redshift and Microsoft APS Parallel Data Warehouse. The first one I use quite a bit and the second one I actually worked on. Both of these systems will allow you to define partition keys, which they use to distribute data across so-called compute nodes. SQL queries are executed by collocating that distributed data as necessary and aggregating the results of execution from multiple nodes. Now, these systems aren’t infinitely scalable – they come in predefined cluster sizes – but the data is indeed partitioned, and there is nothing to prevent a relational system from being fully distributed. In fact, we do have one: Google’s Spanner.
In practice, sharding in distributed relational data-worlds is usually implemented at the application or “policy” level, as cross-instance joins are generally very expensive and slow.
A simple example would be that users whose names start with ‘A’ go to Instance 1, those with ‘B’, go to instance 2, etc. All the data associated with each group of users would live on the separate instances, with the idea that during normal production operations, there’s no need to execute cross-instance queries.
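That routing policy can be written down in a few lines. The sketch below uses hypothetical instance names and a three-way split of the alphabet, following the name-based example above:

```python
# Hypothetical range-based shard routing: a user's instance is chosen
# by the first letter of their name.
INSTANCES = ["instance-1", "instance-2", "instance-3"]

def instance_for(name):
    """Map names A-I to instance 1, J-R to instance 2, S-Z to instance 3."""
    first = name[0].upper()
    if "A" <= first <= "I":
        return INSTANCES[0]
    if "J" <= first <= "R":
        return INSTANCES[1]
    return INSTANCES[2]

print(instance_for("Alice"))   # instance-1
print(instance_for("Karen"))   # instance-2
print(instance_for("Sam"))     # instance-3
```

One caveat worth noting: range-based splits like this are simple but can load shards unevenly (far more surnames start with 'S' than with 'Q'), which is one reason hash-based shard keys are common.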
Most NoSQL databases essentially skip the whole concept of joins, and store all the data associated with a particular thing in a single document. Since all the data you need is in that one document, you can shard based on one of the main attributes of your documents, and using that shard key, you can know which instance in your data-world holds the document you care about.
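A common way to turn a document attribute into a shard assignment is to hash it, so documents spread evenly across instances. A minimal sketch, with a hypothetical four-shard cluster:

```python
import hashlib

NUM_SHARDS = 4   # hypothetical cluster size

def shard_for(key):
    """Deterministically map a document's shard key to a shard number."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All the data for one "thing" lives in a single document,
# keyed here by user_id (an illustrative attribute).
doc = {"user_id": "u-1234", "name": "Alice", "orders": [101, 102]}
print(shard_for(doc["user_id"]))   # same key always routes to the same shard
```

Because the mapping is deterministic, any application instance can compute which shard holds a document without consulting a central directory.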
This data model works very well for certain types of applications, but less well for applications where the data is basically tabular and joining is routine. So, in very large data-worlds, you often end up with a hybrid of NoSQL and SQL databases.
But it is worth knowing that shard keys must be chosen carefully, in both distributed relational and multi-instance NoSQL deployments.
Relational databases do not usually have a mechanism for sharding.
Sharding refers to a data storage practice in which groups of data are kept on separate systems. For example, you may choose to keep data on half of your users on one database and keep the data about the other half on a separate database stored on a separate machine.
The objective of sharding is to reduce the work each machine has to do. Many NoSQL databases have sharding built in and it is transparent to the software accessing the data. The software has one endpoint to reach the data and doesn’t know which database it is actually accessing.
Most relational databases do not have this capability built in (maybe none of them do). If software wants to shard across two relational databases, it has to keep track of the connections to each database separately.
The major relational databases have master-slave configuration options so that many separate databases may be kept in sync. In these configurations, each database has a full copy of the data, not just a portion of it. The software still has the task of deciding which database to access, but whichever it chooses, the database will have the complete data set.
Sharding is one form of partitioning, and most DBMSs have partitioning. What’s interesting about sharding is its use across multiple machines for horizontal scaling. Horizontal scaling in relational DBMSs is difficult. Only a few do it well, and they are commercial and people don’t like to pay for them, so they settle for the more relaxed semantics of sharded, eventually consistent distributed databases that don’t handle transactions as well or do joins as nicely.
There’s a rich development field going right now trying to put decent enough SQL on top of weaker data storage engines.
Read up on Teradata. While I agree with other responses regarding popularity and generic DBMSs, Teradata technology from the 80s implemented physical sharding methods to support large-scale DBs (the “terabyte” size implied by the name) at a time when a few hundred megabytes was a big commercial data store. After bouncing around between AT&T, NCR, etc., they are still found in high-end analytics domains. Nice tech and a true DB innovator.