And Why a Database is not Enough
Companies and organizations are increasingly using existing data to generate additional value. Traditionally the domain of business analysts, data analysis has become indispensable to organizations. Companies need efficient long-term data architectures to stay on top of current data-related challenges. In what follows, we discuss the aspects and technical intricacies that need to be addressed when building a data architecture.
A data platform is not a database. While databases are the foundation of data platforms, a database alone does not equip you to handle analytics. A data platform acts as an additional layer on top of a database, optimized for exactly that purpose. In what follows, we illustrate the amount of functionality required to sustain a basic long-term data strategy within a company.
What is a Data Platform?
A data platform is a service that allows you to ingest, process, store, access, analyze, and present data. These are the defining features of what we call a data platform. We can break down a data platform into the following parts:
- Data Warehouse: ingest, process, store, access
- Intelligence: analyze and present
- Data Science: Statistics and Artificial Intelligence (a special form of analysis)
Storing and processing data is at the heart of a data platform. However, this is only the beginning. For an overview of the most common data management tasks, we have compiled a list of 18 criteria covering the data engineering and analytics cycle:
- Data Architecture (infrastructure, scaling, database)
- Performance
- Import Interfaces
- Data Transformation (ETL)
- Process Automation
- Monitoring / Alerting
- Data Historization
- Data Versioning
- Surrogate Key Management
- Analysis / Reporting
- Data Science Work Area
- External Access / API
- Usability
- Multi-User Workflow
- In-Platform Documentation
- Platform Documentation
- Security
- Cost Structure
In the following sections, we provide a short introduction to each of these criteria without going into technical details.
1. Data Architecture
The core database is a key aspect of a data platform architecture, but far from the only one. To find a storage and database solution that meets your requirements, you can choose from three basic options:
Relational Database
Mature database systems with built-in intelligence for the efficient handling of large datasets offer the most expressive analysis tools. However, they are more complex to maintain. Today, these systems are also available as distributed systems and can handle extremely large datasets. Examples include PostgreSQL with the Citus or Greenplum cluster solutions, MariaDB/MySQL with Galera Cluster, Amazon Redshift, Oracle, MSSQL, etc.
NoSQL Sharding Database
Also designed to handle extremely large datasets, these systems sacrifice some of the classical features of relational databases for more power in other areas. They offer much less analytical power but are easier to manage, have high availability, and allow for easy backups. There are efforts to mimic the analytical power of SQL that is available in the relational database world with additional tools such as Impala or Hive. If you have huge amounts of data, specific requirements for streaming, or real-time data, you should take a look at these specialized data systems. Examples include Cassandra, the Hadoop Ecosystem, Elasticsearch, Druid, MongoDB, Kudu, InfluxDB, Kafka, neo4j, Dgraph, etc.
File-Based Storage
It is possible to design a data strategy solely on files. File formats such as Parquet enable you to use affordable storage to accommodate very large datasets distributed over a multitude of storage nodes or on a cloud object store like Amazon S3. The main advantage is that the data storage system alone is sufficient to respond to data queries. With the two options described above, by contrast, you need to run services on extra compute nodes to respond to data queries. With a solution such as Apache Drill, you can query Parquet files with SQL-like comfort.
When looking for the hardware architecture to support your data architecture, you have a few basic options:
- You can build your platform relying on services offered by major cloud vendors. On cloud platforms such as AWS, Azure, or Google Cloud, you can plug together a selection of simple services to create a data platform that covers our list of criteria. This might look simple and cheap on a small scale but can turn out to be quite complex and expensive when you scale up and need to customize.
- In contrast, there are platforms based on self-operated hardware (including cloud virtual machines) and individual software stacks. Here you have maximum flexibility, but you also need to address many of the criteria on our list with your own code and custom solutions.
- Finally, complete independent cloud data platforms such as Repods, Snowflake, Panoply, or Qubole cover, to a greater or lesser extent, all items from our list of 18 criteria for the data engineering and analytics cycle.
2. Performance
Performance is a key criterion in platform choice and is mostly influenced by the database subsystem you choose. Our rule of thumb: the higher your performance requirements, the more specialized your database system choice should be.
3. Import Interfaces
We categorize import interfaces into four different sections.
I. Files
Files remain the most common form of data today.
II. Web Services
Plenty of web services with relevant data are available online.
III. Databases
Although many organizations store their data in traditional databases, direct database access is in most cases not exposed to the internet and therefore remains unavailable to cloud data platforms. Web services can be placed between on-premise databases and cloud services to handle security aspects and access control. Another alternative is SSH tunneling over secure jump hosts.
IV. Real-time Streams
Real-time data streams, delivered for example by message brokers such as Kafka, are still a comparatively rare source for analytical platforms, but their importance is growing.
4. Data Transformation (ETL)
Data imported into the data platform usually has to undergo some transformation before it can be used for analysis. This process is traditionally called ETL (Extract, Transform, Load). Data transformation processes typically create tables from the raw data, assign data types, filter values, join existing data, create derived columns/rows, and apply all kinds of custom logic to the raw data. Creating and managing ETL processes is sometimes called data engineering, and it is typically the most time-consuming task in any data environment, often taking up to 80% of the overall human effort. Larger data warehouses can contain thousands of ETL processes with different stages, dependencies, and processing sequences.
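To make this concrete, here is a minimal sketch of a single transformation step in Python: raw string records are typed, invalid rows are filtered, and a derived column is added. All field names and the VAT rate are made up for illustration.

```python
# Minimal ETL transformation sketch: raw strings in, typed and filtered rows out.
# Field names and the 19% VAT factor are illustrative, not from a real system.
raw_rows = [
    {"order_id": "1", "amount": "19.99", "country": "DE"},
    {"order_id": "2", "amount": "n/a",   "country": "FR"},   # broken value, filtered out
    {"order_id": "3", "amount": "5.00",  "country": "DE"},
]

def transform(row):
    """Assign data types, filter broken values, derive a column."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None  # filter rows that cannot be typed
    return {
        "order_id": int(row["order_id"]),
        "amount": amount,
        "country": row["country"],
        "amount_gross": round(amount * 1.19, 2),  # derived column
    }

clean_rows = [r for r in (transform(row) for row in raw_rows) if r is not None]
```

A real ETL process would read from staging tables and write to target tables, but the core steps (type assignment, filtering, derivation) look much the same.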
5. Process Automation
When you have many sources, targets, and multiple data transformation processes in between, you also have numerous dependencies, and with them a certain run-schedule logic. The automation of these processes is part of every data platform and involves a variety of tasks of increasing complexity. For process scheduling alone, a variety of dedicated tools such as Apache Airflow, Automate, Control-M, or Luigi are available.
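At its core, run-schedule logic is a topological ordering of a dependency graph. As a minimal sketch (with hypothetical process names), Python's standard library can compute a valid execution order directly:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical process dependency graph: each process runs only after
# the processes it depends on have finished.
dependencies = {
    "import_orders": set(),
    "import_customers": set(),
    "join_orders_customers": {"import_orders", "import_customers"},
    "monthly_report": {"join_orders_customers"},
}

run_order = list(TopologicalSorter(dependencies).static_order())
```

Dedicated schedulers such as Airflow add retries, calendars, and monitoring on top, but the underlying ordering problem is exactly this one.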
Process automation also requires you to manage the selection of data chunks that are to be processed. For instance, in an incremental load scenario, every process execution needs to incrementally pick specific chunks of source data to pass on to the target. Data Scope Management is usually implemented by a metadata-driven approach. There are dedicated metadata tables that keep track of the process state of each chunk and can be queried to coordinate the processing of all chunks.
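A metadata-driven approach to data scope management can be sketched as follows; the metadata "table" is a plain dictionary here, and the chunk identifiers are made up:

```python
# Data scope management sketch: a metadata table records which source chunks
# (here: daily partitions) an incremental load has already processed.
process_state = {  # chunk_id -> state, standing in for a metadata table
    "2024-01-01": "done",
    "2024-01-02": "done",
    "2024-01-03": "pending",
    "2024-01-04": "pending",
}

def pending_chunks(state):
    """Select the chunks the next incremental run still has to pick up."""
    return sorted(chunk for chunk, s in state.items() if s != "done")

to_process = pending_chunks(process_state)
for chunk in to_process:
    # ... run the ETL process for this chunk, then record success ...
    process_state[chunk] = "done"
```

In a real platform the state would live in dedicated metadata tables that are queried and updated transactionally by each process run.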
6. Monitoring / Alerting
Larger data warehouse systems can easily contain hundreds of tables with hundreds of automated ETL processes managing the data flow. Errors at runtime are almost unavoidable, and many of them have to be handled manually. With this amount of complexity, you need a way to monitor the processes on the platform.
7. Data Historization
The need to manage longer histories of data is at the core of every data platform effort. The data warehousing task itself could be summarized as merging separate chunks of data into a homogeneous data history. Since data is naturally generated over time, an existing data stock needs to be supplemented with new data. Technically speaking, time ranges in tables are tracked using dedicated time-range columns, and data historization is the efficient management of these time ranges as new data arrives. Historization differs from data versioning in that it is concerned with real-life timestamps, whereas versioning usually deals with technical insert timestamps (see the next section).
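A minimal sketch of such time-range columns, with made-up entities: when a new value arrives for a key, the open validity range is closed and a new one is opened.

```python
import datetime as dt

# Historization sketch: each row carries a real-life validity range.
# valid_to = None marks the currently valid row.
history = [
    {"customer": 1, "city": "Berlin", "valid_from": dt.date(2020, 1, 1), "valid_to": None},
]

def apply_change(history, customer, city, as_of):
    """Close the open range for this key (if the value changed) and open a new one."""
    for row in history:
        if row["customer"] == customer and row["valid_to"] is None:
            if row["city"] == city:
                return  # no change, nothing to historize
            row["valid_to"] = as_of  # close the current range
    history.append({"customer": customer, "city": city,
                    "valid_from": as_of, "valid_to": None})

apply_change(history, 1, "Hamburg", dt.date(2022, 6, 1))
```

Real systems implement this with set-based SQL over large tables, but the range bookkeeping is the same.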
8. Data Versioning
By versioning data, you can track data corrections over time for the later recovery of old analyses. Versioning allows you to apply non-destructive corrections to existing data. When comparing versioning capabilities, you have to consider the ease of creating versions and the ease of recovering or querying versions. Versioning can be handled on different system levels:
a) Create version snapshots on the storage subsystem (similar to backups).
b) The underlying database system might come with support for version tracking.
c) Versioning might be handled by the data warehouse system.
d) Versioning can be implemented as a custom transformation logic in userspace.
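Option d) can be sketched in a few lines, with made-up rows: corrections are appended with a technical insert timestamp instead of overwriting, so an old analysis can be reproduced by querying "as of" a point in time.

```python
import datetime as dt

# Userspace versioning sketch: append-only corrections with insert timestamps.
versions = [
    {"order_id": 1, "amount": 100.0, "inserted_at": dt.datetime(2024, 1, 1)},
    {"order_id": 1, "amount": 95.0,  "inserted_at": dt.datetime(2024, 2, 1)},  # correction
]

def as_of(versions, order_id, when):
    """Return the version of a row that was current at a given point in time."""
    candidates = [v for v in versions
                  if v["order_id"] == order_id and v["inserted_at"] <= when]
    return max(candidates, key=lambda v: v["inserted_at"]) if candidates else None

january_view = as_of(versions, 1, dt.datetime(2024, 1, 15))  # before the correction
current_view = as_of(versions, 1, dt.datetime(2024, 3, 1))   # after the correction
```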
9. Surrogate Key Management
Data platforms are used to consolidate data from many sources with different identifiers for the respective objects. This creates the need for new key ranges for the imported objects and the need to maintain them throughout consecutive imports. These new keys are called surrogate keys. Creating and maintaining these keys efficiently is no simple task.
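The essence of the task can be sketched as a persistent mapping from (source system, source key) pairs to platform-wide keys; the source names here are hypothetical:

```python
# Surrogate key sketch: map (source system, source key) pairs to stable
# platform-wide integer keys; the mapping must be persisted across imports.
key_map = {}  # (source, source_key) -> surrogate key

def surrogate_key(source, source_key):
    """Return the existing surrogate key or assign the next free one."""
    pair = (source, source_key)
    if pair not in key_map:
        key_map[pair] = len(key_map) + 1
    return key_map[pair]

k1 = surrogate_key("crm", "A-17")
k2 = surrogate_key("shop", "17")   # another system's id gets its own key
k3 = surrogate_key("crm", "A-17")  # repeated import: key stays stable
```

The hard parts in practice are doing this efficiently in bulk and keeping the mapping consistent under concurrent imports, which is why it is no simple task.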
10. Analysis / Reporting
The purpose of a data platform is to prepare raw data for analysis and store this data for longer histories. Analyses can be conducted in a variety of ways.
A variety of Business Intelligence Tools (BI Tools) are concerned solely with the task of creating analytical and human-readable data extracts. To prepare data chunks for presentation, a data platform provides features to create data extracts and aggregates from the larger data stock.
Answering specific business questions by intelligently querying the data stores requires a great deal of user proficiency in analytical query languages. BI tools aim to simplify these tasks by providing point-and-click interfaces to answer basic questions such as "number of visitors per month in-store" or "sum of revenue in region X". These tools also enable users to visualize the information via comprehensive graphics. In almost all cases, power users still want to be able to bypass these tools and conduct their own queries. Popular examples of BI tools include Tableau, Qlik, Looker, Chartio, and Superset, among many others.
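Under the hood, a question like "sum of revenue in region X" is a plain aggregate query. A minimal sketch against an in-memory SQLite table with made-up data:

```python
import sqlite3

# The kind of aggregate a BI tool issues, written out as SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("X", 100.0), ("X", 50.0), ("Y", 70.0)])

revenue_by_region = dict(con.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
).fetchall())
```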
11. Data Science
Training machine learning models is a requirement that today's data platforms have to serve. Most sophisticated methods are implemented not in SQL but in Python or R, together with a wide variety of specialized libraries such as NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, or even more specialized libraries for natural language processing or image recognition.
Since these tasks can be computationally demanding, extra compute hardware is often required in addition to the existing analytics hardware. While this opens up a large variety of tools to pick from, you also face the challenge of hosting and managing the compute resources behind potentially demanding machine learning jobs.
12. External Access / API
All the data collected on the platform is there to be used for different purposes. The channels to consider are:
a) SQL access for direct analysis, e.g., by BI tools
b) API access (REST requests) as a service for websites or apps
c) Notifications via text push or email for end users or administrators
d) File exports for further processing or data delivery to other parties
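Channel d) is the simplest to sketch: a report extract written as CSV, here into an in-memory buffer (a real platform would write to a file share or an object store; the rows are made up):

```python
import csv
import io

# File-export sketch: serialize a report extract as CSV.
report_rows = [
    {"region": "X", "revenue": 150.0},
    {"region": "Y", "revenue": 70.0},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["region", "revenue"])
writer.writeheader()
writer.writerows(report_rows)
csv_export = buffer.getvalue()
```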
13. Usability
The usability of the platform depends on the targeted audience. The main concern here is how easy it is to create and manage objects (such as users, data warehouses, tables, transformations, reports, etc.) in the platform. There is often a trade-off between the level of control a user gets and the level of simplicity. Here we have to distinguish between the functionality the platform provides and the user-generated content inside the platform. In most cases, the user-generated content requires the use of code, since data engineering and analysis are by nature complex and require a high level of expressiveness.
14. Multi-User Workflow
This category evaluates the support for user interactions and the sharing of work and data. This aspect involves real-time updates of user actions, collaborative work, sharing, and role assignments, as well as a way to discuss and comment on the platform.
15. In-Platform Documentation
A data platform is used to implement a high degree of custom complexity with a multitude of participating users over a longer period.
This requires detailed documentation of the user-provided content. Here we assess how the platforms support this task. Documentation can always be prepared outside of the platform, but this implies the risk of information divergence as external documentation quickly becomes outdated.
16. Platform Documentation
Data platforms require a certain degree of user proficiency. Proper documentation detailing the platform's features is therefore required for the professional use of a platform.
17. Security
Data platform security can be separated into security of storage (data at rest), interaction (data in transit), and access control.
18. Cost Structure
We identify three major cost drivers of a data platform:
- Licenses (Software)
- Infrastructure (Hardware)
- Personnel
Today, most of the software stack can be implemented in high quality using open-source software. Licensed software, however, usually requires less maintenance effort and less low-level system know-how.
Compute hardware can be rented from cloud providers on a pay-per-use basis. The same applies to storage infrastructure.
To estimate your hardware costs, you need to consider an infrastructure covering the following components:
- Database
- Data Transformations
- Data Science
- Hosting of Content
Even though the database is usually the largest component of a data platform, it is by far not the only one.
Data platforms can in no way be reduced to their underlying core database. Instead, data platforms can be seen as ecosystems of services that require perpetual balancing.
Our proposed list of 18 criteria provides a basic entry point for evaluating whether a data platform is suitable as a long-term, manageable solution. These criteria are primarily relevant for organizations aiming to aggregate longer data histories for more comprehensive statistics and forecasts.