In my previous post, Hitchhiker’s guide to Big Data, we explored the differences and similarities between Big Data and Business Intelligence, and why the two are orthogonal to each other.
Based on this knowledge, this post proposes an architecture that suits the majority of use cases. It emphasizes flexibility, but also considers the cost of development and operations.
The hallmark diagram of the Big Data space is FirstMark’s “Big Data Landscape” (below). It lists all the major players in the various aspects of the Big Data space. (Full text is found here)
This overview leaves it to us to understand all the tools – their individual pros and cons and challenges – and then figure out which to use when and how to combine them.
To begin tackling this landscape, let’s first narrow down some terminology.
Analyzing data is, mathematically speaking, nothing more than taking a set of data from the source, applying a function and visualizing the result: y=f(x).
Example: The dataset “x” is the sales data in the ERP database, the function is a sum grouped by the customer’s region, and the result “y” is visualized as a bar chart.
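As a minimal sketch in Python (the rows and column names are made up for illustration), the y=f(x) of this example could look like:

```python
# Hypothetical sales rows as they might come out of the ERP database.
sales = [
    {"customer": "A", "region": "EMEA", "revenue": 100.0},
    {"customer": "B", "region": "APAC", "revenue": 250.0},
    {"customer": "C", "region": "EMEA", "revenue": 50.0},
]

def revenue_by_region(rows):
    """f(x): sum the revenue, grouped by the customer's region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["revenue"]
    return totals

result = revenue_by_region(sales)  # "y", ready to be rendered as a bar chart
```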
This is pretty much what happens in reporting – with one important issue: The ERP system gets new data – fine – but data is updated frequently as well. My classic example is when a customer moves from one region to another and suddenly the entire past(!) sales revenue is attributed to the new region. Not a good thing when two reports of the same data show different results.
The solution is to build a Data Warehouse, where the data is maintained and historically correct; thus there would be two versions of the customer record with valid-from/valid-to dates. One positive side effect of copying the data is that the function f() can be split into a common part that is always used and a visualization-specific part. An example could be the calculation of sales revenue: why would every single query execute the if-then-else decision tree to figure out what contributes to sales revenue and what does not? Much better to store the sales revenue as an additional value; then the BI query is just a sum(revenue). Another advantage, although a purely technical one, is that a Data Warehouse can have a simplified data model to improve query speed. The data is well structured, easy to find (thanks to the simplified data model) and fast to query, and queries do not slow down the ERP system as they run on a different server.
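The valid-from/valid-to versioning can be sketched as follows; the record layout is hypothetical, but it shows why past revenue stays attributed to the old region:

```python
from datetime import date

# Two versions of the same customer record, as a Data Warehouse would keep them.
# valid_to = None marks the currently valid version.
customer_versions = [
    {"customer_id": 42, "region": "EMEA",
     "valid_from": date(2015, 1, 1), "valid_to": date(2017, 6, 30)},
    {"customer_id": 42, "region": "APAC",
     "valid_from": date(2017, 7, 1), "valid_to": None},
]

def region_at(versions, day):
    """Return the region that was valid on a given day."""
    for v in versions:
        if v["valid_from"] <= day and (v["valid_to"] is None or day <= v["valid_to"]):
            return v["region"]
    return None
```

A sale from 2016 is still attributed to EMEA, while a 2018 sale goes to APAC — two reports of the same data no longer contradict each other.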
That is the value proposition of a Data Warehouse and it is still valid.
Both of these approaches assume that the source dataset “x” is known upfront and that its structure changes only little over the years. While this assumption is valid for databases, in the grand scale of things it is a severe limitation – starting with Excel sheets that should match but don’t, and ending at completely unstructured text where everything is different. In the Reporting and Data Warehouse world, somebody has to create a common, agreed-on data model and the data is then squeezed into it – semantics first, content later.
The approach of a Data Lake is the reverse, first the content and then apply semantics to it. Or in more practical terms, store the raw data and provide tools for the users to play around with it.
In practice, due to the large number of information types, the large volumes and the number of rules each consumer of the data has to know, many Data Lakes are more an accumulation of mud. Of course there is gold buried somewhere deep inside and of course any tool can be used, but finding it requires a lot of digging.
This does not necessarily mean that a Data Lake is a failed proposition, though it does require more resources. It needs more metadata to guide the user to the interesting places, more powerful tools, more knowledge and overall experience by the consumer. Not to mention, more processing power as both content and structure transformations are done at query time. In short, a Data Lake is typically for the Data Scientists out there, rather than the average user.
Another difference is the cost of storage between the two – storing a TB of data in a Data Warehouse database is way more expensive. And storing a PB of data is virtually impossible. No problem for a Big Data system, however.
Add the fact that even Data Warehouse systems often have a staging area, and it makes sense to combine the Data Warehouse as the Business Intelligence platform with the Big Data storage serving as Data Lake and DWH staging area at the same time.
To recap, in a classic Data Warehouse architecture there would be structured data sources like the ERP system, Excel sheets and more. This data is read on a daily basis from the source systems by an ETL tool and transformed into the Data Warehouse model to gain all the advantages of a DWH. For sources where the data volume would be too large – web logs, for example – the data is aggregated by a predefined logic and stored in the DWH as well.
Unstructured data is usually not loaded. If there are predefined rules – e.g. the source is a web forum and the text analytics process extracts the sentiment using a dictionary – then there is no problem. On the other hand, if the text analytics configuration or its dictionary has to be changed to suit individual queries better, this cannot be achieved.
The next logical step is to preserve all the data and make it available. Here, all sources contribute their data into the Data Lake – including the weblog and unstructured data – and there are two consumers: the Data Warehouse and the Data Scientist.
Note: The Data Lake is not a copy of the source systems; data is only ever appended to it. Hence it is historically correct as well.
The Data Lake contains the historical data and the Data Warehouse can be completely rebuilt from it. In that sense, the Data Warehouse somewhat resembles a Data Mart. An example would be deleting a customer in the ERP system: this information would be captured in the Data Lake by appending the record with an additional column holding the delete flag. To be more precise, for each record there should be two additional columns, the change_timestamp (= the timestamp when the data was received by the Data Lake) plus the change_type (= deleted yes/no).
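A minimal sketch of this append-only pattern – the record layout is invented, but the two technical columns are the ones described above:

```python
from datetime import datetime, timezone

data_lake = []  # append-only: records are never updated or removed

def append_change(record, deleted=False):
    """Append a source record together with the two technical columns."""
    data_lake.append({
        **record,
        "change_timestamp": datetime.now(timezone.utc),  # when the Data Lake received it
        "change_type": "D" if deleted else "U",          # deleted yes/no
    })

append_change({"customer_id": 42, "name": "Acme"})
# The ERP delete arrives as just another appended record:
append_change({"customer_id": 42, "name": "Acme"}, deleted=True)
```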
To transform the data into easier-to-understand Data Warehouse structures, we can, of course, use the same ETL tool. But assuming the Data Scientist finds interesting algorithms, it would be beneficial to be able to use these algorithms when loading the Data Warehouse as well. The current go-to tool among Data Scientists is Apache Spark (not as user friendly as ETL tools, but very powerful), hence this architecture diagram assumes it is used for both. But everything is possible here and should be decided on a case-by-case basis.
As a reminder, an ETL tool is about Extraction, Transformation and Loading. Its “E” part could be used to extract the change data from the source system and write it into the Data Lake. This data would then be read from the Data Lake and the regular Data Warehouse transformation would happen.
Another limitation of the classic Data Warehouse approach is its inherent batch-oriented thinking. While this does have upsides – some say a Data Warehouse has to be loaded only once a day so users get the same data when refreshing a query – there are some obvious limitations as well. Either way, in today’s fast-paced world, real-time is key. The term “real-time” is used here as a synonym for “low latency”, meaning the time between a change in the source and that change becoming visible in the Data Lake and Data Warehouse should be seconds (or less).
From a technical point of view this has a couple of issues. In the Data Lake…
This is where Apache Kafka comes into the picture. Kafka is not so much a messaging system, not even a queuing system, but best described as a distributed transaction log. Databases also have transaction logs, but those are neither distributed nor easy to consume. Their use case is totally different: databases use transaction logs for their strong transactional guarantees and for recovery, hence they cannot be distributed.
In Kafka, incoming data is stored to disk in a distributed manner and consumers can decide to either read through all changes of the last days, read from the last successful read onward, or subscribe to new incoming data and be notified within sub-seconds. Because the data is distributed, the throughput can be orders of magnitude higher than with databases – millions of change messages per second, in fact. But there are limitations in payload size – say, one message should be less than 1 MB. Combining Kafka and the Data Lake gives the best of both worlds, each compensating for the other’s downsides.
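The offset-based consumption semantics – the key difference to a classic queue – can be illustrated with a toy, single-partition, in-memory log (real Kafka is distributed and persistent, this only mimics the reading model):

```python
class MiniLog:
    """Toy version of Kafka's log semantics: messages are appended
    and consumers read by offset instead of popping from a queue."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # offset of the new message

    def read_from(self, offset):
        """Replay everything from a given offset, e.g. the last successful read."""
        return self.messages[offset:]

log = MiniLog()
log.append({"customer_id": 42, "region": "EMEA"})
last_read = log.append({"customer_id": 42, "region": "APAC"})
# A consumer that crashed can resume exactly where it left off:
replay = log.read_from(last_read)
```

Nothing is consumed destructively, so any number of consumers can read the same log at their own pace.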
The architecture diagram asks all sources to push their data to a central http endpoint which routes the data. Structured data is validated, put into Apache Kafka and thereby offered to all consumers immediately. Files with unstructured data are put into the Data Lake and the file metadata goes into Kafka as well. For example, a .jpeg file is uploaded and stored at a certain path in a Data Lake directory; the file location plus image metadata like width and height is put into Kafka. Consumers therefore know right away when a new file was uploaded and where it is located, and in addition queries like “only large files” can be executed.
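The routing decision of that endpoint can be sketched as a single function; the directory layout and the metadata fields are assumptions for illustration:

```python
def ingest(payload, is_file, kafka, data_lake):
    """Sketch of the central endpoint: structured data goes into Kafka,
    files go into the Data Lake with only their metadata going into Kafka."""
    if is_file:
        path = f"/datalake/uploads/{payload['name']}"  # hypothetical directory layout
        data_lake[path] = payload["content"]           # store the binary content
        kafka.append({"location": path, "size": len(payload["content"])})
    else:
        kafka.append(payload)                          # validated structured record

kafka, data_lake = [], {}
ingest({"customer_id": 42}, False, kafka, data_lake)
ingest({"name": "photo.jpeg", "content": b"\xff\xd8"}, True, kafka, data_lake)
```

A consumer watching the Kafka messages sees both the structured record and the file notification, and can filter on metadata like the size without ever touching the binary content.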
One Kafka consumer is the Data Lake writer, materializing all Kafka data into the Data Lake as the long term persistent store.
The consumers of data have the option to read the raw data from the Data Lake for batch orientated processing. And can register themselves to listen on Kafka messages to get notified about changed data right away. In the case of the Data Warehouse loader, the initial load would be done from the Data Lake reading its tables. And the program would listen on changes on the various Kafka messages to incorporate them into the Data Warehouse.
With this combination of Kafka and Hadoop, the problems above are solved without doing much. The files in the Data Lake are large because the individual messages are collected in Kafka and written to the Data Lake only once per day, or whenever there is enough data. This also solves the problem of immutable files, as no changes need to be merged into existing data. And the file count remains low – one file per day in the worst case. The change_timestamp and change_type columns are added by the http endpoint, hence the timestamp is strictly ascending – no side effects from delayed commits in databases.
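The Data Lake writer from above is essentially a buffer-and-flush loop; a minimal sketch (the flush trigger – once per day, or size-based – is left out):

```python
buffer = []  # messages collected from Kafka since the last flush

def on_kafka_message(msg):
    buffer.append(msg)

def flush_to_data_lake(files):
    """Write all buffered messages as one large, immutable file."""
    if buffer:
        files.append(list(buffer))  # one new file; existing files are never touched
        buffer.clear()

files = []
for i in range(3):
    on_kafka_message({"id": i})
flush_to_data_lake(files)  # e.g. triggered once per day
```

Because each flush creates one new file and never rewrites an old one, the immutability requirement and the low file count come for free.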
Not to mention, it works for structured and unstructured data, as both are offered in real-time. And large files do not pollute the Kafka transaction logs, as the Kafka message contains the file location but not the file’s binary content.
What would the perfect Data Lake system look like? Coming from the relational world, the answer seems obvious: it would be some sort of database where data is inserted, updated or deleted. For every type of information there is a table with well-defined structures. In other words, there is an active process controlling the table structure, the data consistency and the data distribution. Some common downsides of databases are:
In contrast, the Data Lake is simply a large file system, where none of the above problems come into play. All that’s required is that each file contains the table structure and indexing internally, because then query performance will be database-like. There is no write concurrency as the file never changes, no data dictionary is needed as each file contains the table structure itself and no indexes are needed as the file is indexed internally. So by giving up on the insert/update/delete functionality and supporting insert only – something needed for historical data anyhow – the overhead of an active database can be avoided without giving up anything else.
To be more precise, in Hadoop/Spark a single table is represented by a directory(!), and not a single Parquet file. The table structure is the summary of all file structures.
Example: In Spark the user would execute a query like select firstname, lastname, newfield from /home/data/customer_master;. The columns of this directory are the union of all file structures, e.g. customer_master_2017_10_01 has firstname and lastname, but the customer_master_2017_10_02 file also has the column newfield, hence the directory has all three fields – with newfield being null for rows from the older file. See schema evolution for this feature.
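The schema union that Spark performs over the directory can be illustrated in plain Python (the file names and columns follow the example above; real Parquet readers do this from the file footers):

```python
# Two "files" of the customer_master directory with different schemas.
file_2017_10_01 = [{"firstname": "Ada", "lastname": "Lovelace"}]
file_2017_10_02 = [{"firstname": "Alan", "lastname": "Turing", "newfield": "x"}]

def read_directory(*files):
    """The table structure is the union of all file structures;
    columns missing in older files come back as None (null)."""
    columns = []
    for f in files:
        for row in f:
            for col in row:
                if col not in columns:
                    columns.append(col)
    return [{col: row.get(col) for col in columns} for f in files for row in f]

rows = read_directory(file_2017_10_01, file_2017_10_02)
```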
This gives the best of both worlds – no active database process is needed and adding columns is as simple as writing a new file into that directory.
To validate the above thoughts, a quick look into the current IoT Reference Architecture might be interesting.
The ultimate source here is a device, a Thing shown at the very left side. This sensor data is preprocessed by an Edge Component, that is a computing device responsible for all its local sensors, e.g. all sensors of a windmill. The goal here is twofold:
This Edge Component usually implements an MQTT communication protocol which the cloud (https) endpoint (= Device Connectivity) understands. It persists the data immediately in a Message Broker, which is Apache Kafka in SAP IoT.
For Kafka, two consumers are defined. The first is the Stream Processor, which materializes the data in the Data Lake in different formats for different use cases. The Raw Data is the original message as received; the Time Series data optimizes the raw data for better querying, e.g. it receives an event timestamp but stores a valid-from/valid-to timestamp. And it loads something we would have called a Data Warehouse with aggregated data.
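The event-timestamp to valid-from/valid-to transformation mentioned here can be sketched as follows (field names are invented; the rule is simply that each reading stays valid until the next one arrives):

```python
def to_time_series(events):
    """Turn per-event timestamps into valid-from/valid-to intervals:
    each reading is valid until the next reading arrives."""
    result = []
    for i, ev in enumerate(events):
        nxt = events[i + 1]["ts"] if i + 1 < len(events) else None  # None = still valid
        result.append({"value": ev["value"], "valid_from": ev["ts"], "valid_to": nxt})
    return result

readings = [{"ts": 100, "value": 21.5}, {"ts": 160, "value": 22.0}]
series = to_time_series(readings)
```

Queries like “what was the sensor value at time t?” then become a simple interval lookup instead of a search for the closest preceding event.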
The second consumer (= Real-Time Processing) is for events, to show alarms right away and the like.
The only difference to the above architecture is that the Data Lake is not a file system with Parquet files but an active database from the Big Data space. And that Apache Spark is used internally but not made available to a Data Scientist (as of today).