Real-time analytics over large datasets has become a widespread need across Internet companies. Minimizing the time gap between data production and data analysis enables data-driven companies to generate insights and make decisions in a timelier manner, ultimately allowing them to move faster. In order to provide real-time analytics, a database system needs to be able to continuously ingest streams of data (commonly generated by web logs) and answer queries only a few seconds after the data has been generated. The ingestion of these highly dynamic datasets becomes even more challenging at scale, given that some real-time streams can emit several million records per second.
Moreover, in order to unleash the full potential of the data, all database queries should finish within hundreds of milliseconds to provide a truly interactive experience. Realistically, though, scanning large datasets in such a short time frame requires massive parallelization and thus a vast amount of resources. However, over the last few years at Facebook we have observed several use cases where all queries are heavily filtered and interested only in a very particular subset of a much larger dataset. For example, a query might request one metric for a particular demographic (such as only people living in the US, or of a particular gender), measure the volume of conversation around a particular topic, or look for a specific group or mentions of a particular hashtag. Since the filters applied depend on which aspects of the dataset an analyst is interested in, they are mostly ad-hoc, making traditional one-dimensional pre-defined indexes less effective.
Cubrick is a distributed multidimensional in-memory DBMS developed from the ground up at Facebook to address this scenario. In order to provide interactive analytics over highly dynamic datasets, Cubrick implements a new strategy for organizing columnar in-memory data that allows indexed filtering on every dimension of the dataset while remaining very efficient to update. Coupled with a specialized and optimized query engine, this makes Cubrick uniquely suitable for interactive real-time analytics and enables it to reach a scale previously unseen in current database solutions.
In our paper Cubrick: Indexing Millions of Records per Second for Interactive Analytics, being presented at VLDB in New Delhi, India, this week, we describe Cubrick's new data organization technique, called Granular Partitioning; detail Cubrick's internal data structures, distributed model, and query execution engine; and offer insights into its current implementation at Facebook.
State of the Art
Traditional database techniques for skipping unnecessary data to improve filter performance are based either on maintaining indexes (auxiliary data structures) or on pre-sorting the dataset. Maintaining auxiliary indexes such as B+Trees to speed up access to particular records is a well-known technique implemented by most DBMSs and leveraged by virtually every OLTP DBMS. However, the logarithmic overhead of keeping indexes updated is usually prohibitive in OLAP workloads as table sizes and ingestion rates scale. Most types of indexes (notably secondary indexes) also increase the storage footprint by storing intermediate nodes and pointers to the data, to the point that creating one index per column may double the storage usage. Moreover, correctly deciding which columns to index is challenging in the presence of ad-hoc queries.
A second approach to efficiently skipping data at query time is pre-sorting the dataset. Column-oriented databases based on the C-STORE architecture maintain multiple copies of the dataset ordered by different keys (also called projections) that can be leveraged to efficiently evaluate filters over the columns within the sort key. Even though a structure similar to an LSM-Tree (Log-Structured Merge-Tree) is used to amortize the computational cost of insertions, a large amount of data reorganization is required to keep all projections updated as the ingestion rate scales. In addition, one has to decide beforehand which projections to create and their corresponding sort keys, which can be difficult for ad-hoc workloads. Ultimately, since every new projection is a copy of the entire dataset, this approach is not appropriate for memory-based DBMSs, where the system tries to fit as much of the dataset in memory as possible to avoid burdensome disk accesses.
A New Approach
Instead of sorting the dataset or maintaining secondary indexing data structures, we took a different approach. We extended the traditional notion of database partitioning by assuming that all tables in the system are range partitioned by every dimension column. Combined with the fact that the cardinality of each dimension column is known (or estimated) beforehand, this allows us to view the dataset as a big hypercube composed of smaller hypercubes, much like an n-dimensional Rubik's cube. Each smaller hypercube (or brick, in Cubrick's terminology) is assigned an id by a numbering function and stores data sparsely, column-wise, and in an unordered, append-only fashion. Lastly, we assume that all string values are dictionary encoded and internally represented as monotonically increasing integers. This assumption enabled us to develop a very optimized and lean database engine that operates only over primitive data types.
Just like other multidimensional database systems, Cubrick defines every column as either a metric or a dimension: dimensions are commonly used for filtering and grouping, while metrics are used in aggregate functions. Figure 1 illustrates how records are assigned to bricks on an example dataset composed of two dimensions, Region and Gender, both having cardinality 4 and range sizes 2 and 1, respectively, and two metrics, Likes and Comments.
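Using the example dataset above, the numbering function can be sketched as below. This is a hedged illustration of the general scheme (row-major numbering over per-dimension ranges), not necessarily Cubrick's exact formula:

```python
# Sketch of a numbering function mapping a record to its brick. With Region
# and Gender both of cardinality 4, and range sizes 2 and 1 respectively,
# there are (4/2) * (4/1) = 8 bricks in total.

CARDINALITIES = [4, 4]   # Region, Gender
RANGE_SIZES   = [2, 1]

def brick_id(coords):
    """Map encoded dimension values to a brick id in constant time."""
    bid = 0
    for card, size, value in zip(CARDINALITIES, RANGE_SIZES, coords):
        num_ranges = card // size        # number of ranges on this dimension
        bid = bid * num_ranges + value // size
    return bid

# Record with Region=3 (falls in the second Region range) and Gender=1
# (second Gender range):
print(brick_id([3, 1]))   # -> 5  (= 1 * 4 + 1)
```

Because this function is constant time and bricks are append-only, ingesting a record reduces to one arithmetic computation plus an append.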
This simple yet effective data organization allows for very efficient record insertions, since a constant-time function maps each record to its corresponding target brick and data inside a brick is unordered. In addition, at query time bricks can be easily matched against a query's filters and pruned out when they do not intersect the search space.
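The pruning step can be sketched as follows, reusing the two-dimension example above (Region with cardinality 4 and range size 2, Gender with cardinality 4 and range size 1). This is assumed logic for illustration, not Cubrick's actual implementation:

```python
# Sketch of brick pruning at query time: a brick's id determines the value
# ranges it covers on every dimension, so a brick can be skipped whenever
# one of those ranges does not intersect the query's filter.

CARDINALITIES = [4, 4]   # Region, Gender
RANGE_SIZES   = [2, 1]

def brick_ranges(bid):
    """Recover, per dimension, the [lo, hi) value range a brick covers."""
    ranges = []
    # Decode in reverse dimension order, the inverse of row-major numbering.
    for card, size in reversed(list(zip(CARDINALITIES, RANGE_SIZES))):
        num_ranges = card // size
        idx = bid % num_ranges
        bid //= num_ranges
        ranges.append((idx * size, (idx + 1) * size))
    return list(reversed(ranges))

def matches(bid, filters):
    """filters: per-dimension (lo, hi) bound, or None for 'no filter'."""
    for (blo, bhi), f in zip(brick_ranges(bid), filters):
        if f is not None and (bhi <= f[0] or blo >= f[1]):
            return False   # no intersection on this dimension: prune
    return True

total_bricks = 8  # 2 Region ranges * 4 Gender ranges
# Filter: Region in [0, 2) (the first range), any Gender.
surviving = [b for b in range(total_bricks) if matches(b, [(0, 2), None])]
print(surviving)   # -> [0, 1, 2, 3]: half the bricks are pruned
```

Only the surviving bricks are then scanned, which is why filtered queries get faster regardless of which dimension the filter is on.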
In order to evaluate how fast records can be ingested by Cubrick and how much CPU the ingestion pipeline uses, Figure 8 shows the number of records ingested per second on a single-node cluster, compared to CPU usage. The experiment shows that CPU usage remains modest (below 20%) even when ingesting about 1 million records per second.
Caption: Single-node cluster CPU utilization when ingesting streams with different volumes.
In order to evaluate how effective our indexing strategy is, Figure 7 presents the latency of multiple queries over a 10TB dataset running on a 72-node cluster. The X-axis represents the column over which a filter was applied, while the color scale illustrates the restrictiveness of the filter, i.e., the percentage of the dataset that matches it. The experiments show a clear correlation between the color and the Y-axis position, independently of the X-axis position. In other words, queries get faster when filters are applied, no matter on which column.
Caption: Latency of queries being filtered by different dimensions over a 10TB dataset on a 72-node cluster.
For the complete set of experiments and results please refer to our VLDB’16 paper.
Cubrick has been adopted by multiple real-time (and batch) interactive analysis use cases within Facebook over the last few years, and it is quickly maturing into a full-featured DBMS offering. Ongoing work is primarily focused on better handling datasets with different data distribution characteristics and on making the cube schema more dynamic. We believe Cubrick is a first step forward, but there are still several unexplored and interesting opportunities for research in this direction.
Additional Facebook Work at VLDB’16
Also being presented at VLDB this week is the paper The shortest path is not always a straight line: Leveraging semi-metricity in graph analysis, authored by Dionysios Logothetis.