How to Evaluate Data Platforms for Your Organization

And Why a Database is not Enough

Introduction

A data platform is not a database. Whereas databases are the foundation of data platforms, they do not equip you to handle analytics. A data platform, on the other hand, acts as an additional level on top of a database that is optimized to serve that purpose. In what follows, we illustrate the amount of functionality that is required to sustain a basic long-term data strategy within a company.

What is a Data Platform?

  • Data Warehouse: ingest, process, store, access
  • Intelligence: analyze and present
  • Data Science: Statistics and Artificial Intelligence (a special form of analysis)

Storing and processing data is at the heart of a data platform. However, this is only the beginning. For an overview of the most common data management tasks, we have compiled a list of 18 criteria concerning the data engineering and analytics cycle:

  1. Data Architecture (infrastructure, scaling, database)
  2. Import Interfaces
  3. Data Transformation (ETL)
  4. Performance
  5. Process Automation
  6. Monitoring
  7. Data Historization
  8. Data Versioning
  9. Surrogate Key Management
  10. Analysis / Reporting
  11. Data Science Work Area
  12. External Access / API
  13. Usability
  14. Multi-User Development Process
  15. In-Platform Documentation
  16. Documentation
  17. Security
  18. Costs

In the following sections, we provide a short introduction to each of these criteria without going into technical details.

In Detail

Image by Max Ostrozhinskiy

1. Data Architecture

Relational Database
Mature database systems with built-in intelligence for the efficient handling of large datasets have the most expressive analysis tools. However, these are more complex when it comes to maintenance. Today, these systems are also available as distributed systems and can handle extremely large datasets. Examples include PostgreSQL with Citus or Greenplum cluster solution, MariaDB/MySQL with Galera Cluster, Amazon Redshift, Oracle, MSSQL, etc.

NoSQL Sharding Database
Also designed to handle extremely large datasets, these systems sacrifice some of the classical features of relational databases for more power in other areas. They offer much less analytical power but are easier to manage, have high availability, and allow for easy backups. There are efforts to mimic the analytical power of SQL that is available in the relational database world with additional tools such as Impala or Hive. If you have huge amounts of data, specific requirements for streaming, or real-time data, you should take a look at these specialized data systems. Examples include Cassandra, the Hadoop Ecosystem, Elasticsearch, Druid, MongoDB, Kudu, InfluxDB, Kafka, neo4j, Dgraph, etc.

File-based Systems
It is possible to design a data strategy solely on files. File structures such as the Parquet file standard enable you to use affordable storage to accommodate very large data sets distributed over a multitude of storage nodes or on a Cloud Object Store like Amazon S3. The main advantage is that the data storage system alone is sufficient to respond to data queries. In the two examples described above, on the contrary, you need to run services on extra compute nodes to respond to data queries. With a solution such as Apache Drill, you can query parquet files with similar comfort to SQL.

When looking for the hardware architecture to support your data architecture, you have a few basic options:

  1. You can build your platform relying on services offered by major cloud vendors. On cloud platforms such as AWS, Azure, or Google Cloud, you can plug together a selection of simple services to create a data platform that covers our list of criteria. This might look simple and cheap on a small scale but can turn out to be quite complex and expensive when you scale up and need to customize.
  2. In contrast, there are platforms based on self-operated hardware including cloud virtual machines and individual software stacks. Here you have a maximum of flexibility but also need to address many of the criteria on our list by creating your code and custom solutions.
  3. Finally, complete independent cloud data platforms such as Repods, Snowflake, Panoply, or Qubole cover, to a greater or lesser extent, all items from our list of 18 criteria for the data engineering and analytics cycle.

2. Performance

3. Import Interfaces

I. Files

Files remain the most common form of data today.

II. Web Services

Plenty of web services with relevant data are available online.

III. Databases

Although many organizations store their data in traditional databases, in most cases direct database access is not exposed to the internet and therefore remains unavailable to cloud data platforms. Web services can be placed in between on-premise databases and cloud services to handle security aspects and access control. Another alternative is the use of ssh-tunneling over secure jump hosts.

IV. Real-time Streams

Real-time data streams as delivered by messaging routers (speaking WAMP, MQTT, AMQP, etc.) are still underutilized today but are gaining in significance with the rise of IoT.

4. Data Transformation ETL

5. Process Automation

Image by Joel Filipe

When you have many sources, targets, and multiple data transformation processes in between, you also have numerous dependencies. This comes with a certain run schedule logic. The automation of processes is part of every data platform and involves a variety of processes of increased complexity. Alone for the scheduling of processes, a variety of dedicated tools such as Apache Airflow, Automate, Control-M or Luigi have been made available.

Process automation also requires you to manage the selection of data chunks that are to be processed. For instance, in an incremental load scenario, every process execution needs to incrementally pick specific chunks of source data to pass on to the target. Data Scope Management is usually implemented by a metadata-driven approach. There are dedicated metadata tables that keep track of the process state of each chunk and can be queried to coordinate the processing of all chunks.

6. Monitoring

7. Data Historization

8. Data Versioning

a) Create version snapshots on the storage subsystem (similar to backups).

b) The underlying database system might come with support for version tracking.

c) Versioning might be handled by the data warehouse system.

d) Versioning can be implemented as a custom transformation logic in userspace.

9. Surrogate Key Management

10. Analysis / Reporting

Image by Ned Dorman

The purpose of a data platform is to prepare raw data for analysis and store this data for longer histories. Analyses can be conducted in a variety of ways.

A variety of Business Intelligence Tools (BI Tools) are concerned solely with the task of creating analytical and human-readable data extracts. To prepare data chunks for presentation, a data platform provides features to create data extracts and aggregates from the larger data stock.

Answering specific business questions by intelligently querying the data stores requires a great deal of user proficiency in analytical query languages. BI Tools aim to simplify these tasks by providing point-and-click interfaces to answer basic questions such as “Number of visitors per month in-store” or “Sum of revenue in region X”. These tools also enable users to visualize the information via comprehensive graphics. In almost all cases, power users still want to be able to bypass these tools and conduct their own queries. Popular examples for BI-Tools include Tableau, Qlik, Looker, Chartio, and Superset, among many others.

11. Data Science

Since these tasks can be computationally demanding, extra compute hardware is often required in addition to the existing analytics hardware. While this opens up a large variety of tools for you to pick from, you are also facing the challenge of hosting and managing compute resources to back the potentially demanding machine learning jobs.

12. External Access / API

a) SQL Access for direct analysis also by e.g. BI tools

b) API Access (REST Request) as a service for websites or apps

c) Notifications via Text Push or Email for end-users or administrators

d) File Exports for further processing or data delivery to other parties

13. Usability

14. Multi-User Workflow

15. In-Platform Documentation

Image by Joel Filipe

A data platform is used to implement a high degree of custom complexity with a multitude of participating users over a longer period.

This requires detailed documentation of the user-provided content. Here we assess how the platforms support this task. Documentation can always be prepared outside of the platform, but this implies the risk of information divergence as external documentation quickly becomes outdated.

16. Documentation

17. Security

18. Cost Structure

  1. Licenses
  2. Infrastructure (Hardware)
  3. Staff

Today, most of the software stack can be implemented in high quality using open-access software. Licensed software, however, usually require less maintenance effort and low-level system know-how.

Compute hardware can be used by cloud providers on a pay-per-use basis. The same also applies to storage infrastructure.

To estimate your hardware costs, you need to consider an infrastructure covering the following components:

  • Database
  • Data Transformations
  • Analytics
  • Data Science
  • Automation
  • Monitoring
  • Hosting of Content

Even though the database is usually the largest component of a data platform, it is by far not the only one.

Conclusion

Data platforms can in no way be reduced to their underlying core database. Instead, data platforms can be seen as ecosystems of services that require perpetual balancing.

Our proposed list of 18 criteria has provided a basic entry point for evaluating data platforms in terms of their suitability as long-term manageable data platforms. These criteria are primarily relevant for organizations aiming toward the aggregation of longer data histories for more comprehensive statistics and forecasts.

Phd in Mathematics and data lover. CEO of Record Evolution, the creators of REPODS.