Essential questions about Time Series
Today, the bulk of Big Data is temporal. We see it in the media and among our customers: smart meters, banking transactions, smart factories, connected vehicles … IoT and Big Data go hand in hand.
We have dealt with time series several times in the past, through comparative studies, the production of PoCs (Proofs of Concept), or by putting some of the available solutions into production. The technologies we have used include Grafana + OpenTSDB + HBase, Warp 10, Druid, Cassandra, …
Are these solutions essential when dealing with Time Series? At what volume of data do they provide a decisive advantage? Are they mature?
The answer to the question “when should we switch to a specific database?” is similar to how we would justify a transition from a traditional database to a Big Data platform: as soon as the current solution is no longer able to fulfill its purpose, whether due to costs, difficulty in maintaining an operational service, or volume. Volume here does not only mean the amount of data stored; it also covers the profile of the queries, their number (concurrency), their speed of execution (latency), the write load, etc.
In recent years, several databases dedicated to managing large quantities of time-series data have appeared, most of which are Open Source. The solutions for storing Time Series are diverse. They include:
- Sorted key-value stores (including column-family databases) such as HBase with OpenTSDB, Cassandra with KairosDB, and InfluxDB.
- Specific solutions like Druid and Prometheus.
- New entrants like TimescaleDB and TileDB.
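To make the first family more concrete, here is a minimal Python sketch of how a sorted key-value store can hold time series. The key layout (a series id followed by a big-endian timestamp) and the names `make_key`, `SortedKV` and `read_series` are illustrative assumptions; this is not the actual schema of HBase, OpenTSDB, Cassandra or KairosDB.

```python
import bisect
import struct


def make_key(series_id: int, ts: int) -> bytes:
    """Encode (series_id, timestamp) so that byte order matches (series, time) order."""
    # Big-endian unsigned integers compare lexicographically like the numbers themselves.
    return struct.pack(">IQ", series_id, ts)


class SortedKV:
    """Toy in-memory stand-in for a sorted key-value store (hypothetical)."""

    def __init__(self):
        self._keys = []    # always kept sorted
        self._values = {}  # key -> value

    def put(self, key: bytes, value: float) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start: bytes, stop: bytes):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


def read_series(store: SortedKV, series_id: int, t_start: int, t_stop: int):
    """Return [(timestamp, value)] for one series over [t_start, t_stop)."""
    rows = store.scan(make_key(series_id, t_start), make_key(series_id, t_stop))
    return [(struct.unpack(">IQ", key)[1], value) for key, value in rows]
```

Because all points of a series are contiguous and ordered by time, reading one sensor over a period becomes a single range scan, which is the access pattern these stores are good at.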
Almost all of these solutions can be considered mature and are running in production. For example, we first tested OpenTSDB in 2011, the prehistory of Big Data. Because they are specialized, a good understanding of the use cases and comparative PoCs are necessary before validating a choice.
All of these solutions are Open Source, apart from InfluxDB. More precisely, its core is Open Source, but the code for the scalability features (clustering) is proprietary and accessible only with the Enterprise offer. This may conflict with the “Open Source First” vision of many companies.
There are many aspects to compare, including performance, ecosystem, functionality and operability. It is also necessary to decide which functionalities should be supported by the tool and which by the application itself. Should we consider the database as raw storage, in which case HBase and Cassandra are excellent candidates, or as embedding Time Series specific features (e.g. GroupBy and TopN queries), which requires a specialized solution such as Druid or InfluxDB?
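As an illustration of what moves into the application when the database is treated as raw storage, here is a minimal Python sketch of a GroupBy (average per sensor and time bucket) and a TopN (sensors ranked by total value) computed client-side. The function names and the `(sensor, timestamp, value)` tuple layout are assumptions made for this example; specialized databases expose equivalent operations natively and execute them close to the data.

```python
from collections import defaultdict


def group_by_bucket(points, bucket_seconds):
    """Average value per (sensor, time bucket).

    points: iterable of (sensor, timestamp_seconds, value) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (sensor, bucket) -> [total, count]
    for sensor, ts, value in points:
        bucket = ts - ts % bucket_seconds  # start of the bucket containing ts
        acc = sums[(sensor, bucket)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def top_n(points, n):
    """Return the n sensors with the highest total value, largest first."""
    totals = defaultdict(float)
    for sensor, _ts, value in points:
        totals[sensor] += value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Both functions need every raw point to be transferred to the application first, which is acceptable for small ranges but becomes the bottleneck as volume and concurrency grow.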
Beware of doubtful comparisons. For example, a customer asked for our opinion on a comparison between InfluxDB and ElasticSearch. Such a comparison is not relevant if the subject is performance over several hundred terabytes. It comes down to comparing the performance of a family car with that of a race car: we do not buy a family car for its speed. ElasticSearch is not intended to be a database optimized for the storage of Time Series but, like a relational database, its model allows it to store them at the expense of performance.
The data can also be stored as files and processed in batch mode, either on Hadoop (e.g. HDFS + ORC + Hive) or in the Cloud (e.g. S3 + Spot Instances + Spark). These architectures are technically simple to implement and are justified for certain Analytics and Data Science workloads, or when the use cases are not yet known.
Ultimately, more ambitious solutions require the collection and processing of data as a continuous flow (near real-time), as well as interactive queries with very low latency and/or very high concurrency. This type of request requires a storage solution that trades flexibility for strong specialization.
To choose between a batch architecture and a streaming architecture, there are several possible approaches:
- Questions are still outstanding: the use cases are not yet very clear; the components, the hardware and the quantity to be ordered are not yet defined; we want a little more time for reflection or experimentation; the data must be stored as soon as possible in order to build the future dataset and deliver it to Data Analysts and Data Scientists. In this case, the wisest choice is to keep a batch architecture with, e.g., storage in HDFS or S3.
- The constraints around the data are strong: one or more use cases require targeted accesses, for example 1 or N sensors over a given period of time, with very low latency and in large numbers; the data is streamed and must be made available in near real time throughout the ingestion chain; the system has to sustain a high write load. In this case, a Time Series specific database is required.
These two approaches can be complementary to each other:
- Because they address different queries: low latency and high concurrency (OLTP) versus high read throughput (OLAP).
- Because they cover different retention periods: for example, the file system stores long-term data while a specialized in-memory database holds short-term data.
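The retention split can be sketched as a small query planner that routes each part of a time range to the tier holding it. This is a minimal Python sketch under stated assumptions: the class `TieredReader`, the tier labels `"hot"` and `"cold"`, and the single retention boundary are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass
class TieredReader:
    # How far back (in seconds) the short-term specialized store goes (assumed).
    hot_retention_seconds: int

    def plan(self, t_start: int, t_stop: int, now: int):
        """Split [t_start, t_stop) into (tier, start, stop) sub-ranges."""
        if t_start >= t_stop:
            return []
        boundary = now - self.hot_retention_seconds
        steps = []
        if t_start < boundary:
            # Older data lives in the long-term file storage (e.g. HDFS or S3).
            steps.append(("cold", t_start, min(t_stop, boundary)))
        if t_stop > boundary:
            # Recent data is served by the specialized time-series database.
            steps.append(("hot", max(t_start, boundary), t_stop))
        return steps
```

For instance, with one hour of hot retention, a query covering the last three hours would be planned as one batch read on the file system and one low-latency read on the specialized database, and the application merges the two results.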