Today, massive amounts of data are being created at ever-increasing speed, and this data holds a lot of value if it is understood correctly. Analyzing these enormous amounts of data requires a new kind of big data infrastructure. That infrastructure is best run in scalable cloud environments, and we call this modern version a cloud data platform.
Complex and abstract things, such as vast computer systems, are often described by means of real-world analogies. For cloud data platforms, the real world offers a fitting one: a cloud data platform is essentially a (data) logistics network. Suppliers are the various information systems generating data.
Distributors, in turn, are the data pipelines consisting of integrations and automated data transformation tasks delivering the data to stores ready for consumption. Stores are the data warehouses or Data APIs that serve the (data) products. They can be browsed through a catalogue — a data catalogue.
Finally, the consumers are the data reporting tools and solutions, various applications or data analysts and scientists who use data to train machine learning models, for example.
Ok, hopefully that sounds understandable. But what has happened in the industry to make cloud data platforms possible? What can be done now that was not possible before? Well, let’s get into it.
First of all, we are working in a field where probably three of the biggest megatrends in the whole IT industry meet: cloud, big data and AI. It is a field where the world’s largest companies fight fiercely and where 18-month-old technology may already be deprecated.
The speed of change and development is just massive. It is also a field where enormous open source platforms live in symbiosis with vendor-specific offerings.
In fact, big companies commit to open source projects with hundreds of dedicated developers; Apache Spark is a good example. Because of this, the business logic is somewhat different from that of the traditional software product industry. It would be pure stupidity to compete with the largest players in the game, so the key is to use the technology they provide and continuously adapt to the change.
The reason these companies put such a huge effort into big data and AI technologies and tools is, in the end, very clear: they ultimately earn a lot of revenue from the usage of their clouds. The best services and tools attract users, and really the only way to create great services for big data is by scaling.
Managing and analyzing big data is all about scaling, and in this area frameworks have taken massive leaps during the last ten years. Distributed file systems have enabled us to manage massive amounts of unstructured data, with the Hadoop Distributed File System (HDFS) being the initial game changer.
The data lakes that lie at the center of data platforms are essentially distributed file systems. The management and control, as well as the interfaces for accessing the files, have improved a lot, but the basic idea is the same: to provide storage where the user (the engineer) doesn’t need to understand or implement the tedious and complex problem of replicating data without losing it, while still seeing a single consistent view of it.
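As a hedged illustration of that abstraction, the PySpark sketch below writes and reads a small dataset against a storage path. The path and schema are made-up placeholders (in a real platform the path would point to cloud object storage, for example an abfss:// location); the point is that the code never mentions blocks, replicas or durability, because the storage layer handles them.

```python
# Minimal sketch: the engineer sees only a path, not the replication machinery.
# The path and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

events = spark.createDataFrame(
    [(1, "page_view"), (2, "purchase")],
    ["user_id", "event_type"],
)

# In a real cloud data platform this would be an object storage location,
# for example "abfss://raw@<account>.dfs.core.windows.net/events/".
lake_path = "/tmp/lake/raw/events/"

# Writing to the lake: the storage layer decides where the data physically
# lives and how many copies are kept; the code never deals with replicas.
events.write.mode("append").parquet(lake_path)

# Reading it back gives one consistent view of the data, regardless of
# how many physical copies exist underneath.
spark.read.parquet(lake_path).show()
```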
Another framework-level enabler has been the data processing frameworks, which again take away the need to understand the complexity of parallel processing. Or, to put it the other way around, they let the user (again the engineer) focus on data analytics or machine learning algorithm development. The framework worries about compute cluster initialization, node-to-node communication, task scheduling and optimization, and the many other tasks needed to make a cluster of multiple compute nodes look as if the algorithm were running on a single server.
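Here is a minimal hedged sketch of what that looks like in PySpark. The data and column names are invented for the example; the key point is that the engineer describes the aggregation once, and the framework decides how to split the work across whatever nodes the cluster happens to have.

```python
# Minimal sketch of framework-managed parallelism with PySpark.
# The data, columns and numbers are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parallel-aggregation-sketch").getOrCreate()

orders = spark.createDataFrame(
    [("FI", 120.0), ("SE", 80.0), ("FI", 40.0), ("NO", 60.0)],
    ["country", "amount"],
)

# The engineer states *what* to compute; Spark handles task scheduling,
# shuffles and node-to-node communication behind the scenes.
revenue_per_country = (
    orders.groupBy("country")
          .agg(F.sum("amount").alias("total_revenue"))
)

# The same code runs unchanged on a laptop or on a hundred-node cluster.
revenue_per_country.show()
```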
In the early days of Hadoop, MapReduce was the framework, but quite soon the in-memory processing framework Spark took over. Since then, the usability of these tools has improved, and big data tooling is now offered as readily usable services such as Databricks.
The beauty of these frameworks is basically the same: the framework genuinely takes care of initializing and scaling a whole compute cluster for the time compute is needed and runs it down when it is no longer needed. For the people working in the industry, this is already everyday life. But seriously, this is pretty cool, and advanced.
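As a rough, hedged illustration of that lifecycle, the snippet below sketches a cluster definition in the spirit of the Databricks Clusters API, with autoscaling bounds and an auto-termination timeout. The field names follow that API as commonly documented, but the cluster name, runtime version and VM size are made-up examples, and the exact schema should be checked against current Databricks documentation.

```python
# Illustrative sketch only: a cluster that scales with the workload and
# shuts itself down when idle. Values are hypothetical; verify field names
# against the current Databricks Clusters API documentation.
cluster_spec = {
    "cluster_name": "nightly-etl",           # hypothetical name
    "spark_version": "13.3.x-scala2.12",     # hypothetical Databricks runtime
    "node_type_id": "Standard_DS3_v2",       # hypothetical Azure VM size
    "autoscale": {
        "min_workers": 2,    # scale down to this when the load is light
        "max_workers": 20,   # scale up to this when the job demands more
    },
    # Tear the whole cluster down after 30 idle minutes, so compute only
    # exists (and costs money) while it is actually needed.
    "autotermination_minutes": 30,
}
```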
In addition to these beautiful frameworks, the cloud has genuine and tangible benefits as a platform. For example, Microsoft Azure offers Databricks (as well as the rest of the needed cloud data platform services) as a managed service, which means that the resources Databricks runs on are automatically kept up to date with the latest operating system versions and security patches. That is a major maintenance burden and cost for solutions relying on self-hosted virtual machine infrastructure.
One of the biggest trends in organizations is to become data-driven. Becoming data-driven means being able to analyze data and base decisions on the analyses.
Again, being able to analyze the data means that both the data and the analytics tools need to be available. A typical obstacle in organizations is that the data has become siloed in various systems and is not available for analytics. It might also be that the data is managed manually, so adding new, even publicly available, data sources is slow and expensive.
This often results in newly hired data analysts or data scientists losing valuable time finding, cleaning and understanding the data. And in the worst case, if a common platform is missing, the machine learning model that finally gets built is deployed only on the data scientist’s laptop, with no way to leverage it more widely and no one else to keep it updated.
Here is where the cloud data platform comes into play. It is a system that orchestrates the data for analytics or third-party applications. Typical logical parts of a cloud data platform are:

- data integrations that ingest data from the source systems (the suppliers)
- a data lake that stores the raw and refined data
- data pipelines that transform the data into a consumable form
- a data warehouse and Data APIs that serve the data products
- a data catalogue that makes the data products findable
- the reporting, analytics and machine learning tools that consume the data
However, there are emerging technologies that are about to change this somewhat stable structure. One of them is the lakehouse concept, which we will discuss in another blog article soon.
Ok, hopefully all of you who had no previous experience of cloud data platforms now understand that they are essentially data logistics networks, which orchestrate the data for analytics or applications.
And by the way, if the parties working in the data logistics network speak different languages, the work quickly becomes slow and frustrating. That is why we need a data glossary.
If the suppliers are allowed to deliver their goods with or without a description of the contents (metadata), if they use whatever names they wish for their products (master data), or if nobody has any control over the quality of the goods they deliver (data quality), the logistics network will soon grind to a halt. We obviously need someone to govern the whole thing, which means governing the data.
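To give a rough feel for what such governance can mean in code, here is a hedged sketch of a simple data quality gate in PySpark: before the goods move further in the logistics network, they are checked against a couple of agreed rules. The data, column names, allowed values and threshold are all hypothetical.

```python
# Minimal sketch of a data quality gate; all names and values are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-quality-sketch").getOrCreate()

customers = spark.createDataFrame(
    [(1, "FI"), (2, "XX"), (None, "SE")],
    ["customer_id", "country"],
)

total = customers.count()

# Rule 1: the business key must never be missing.
missing_keys = customers.filter(F.col("customer_id").isNull()).count()

# Rule 2: country codes must come from the agreed master data list.
allowed_countries = ["FI", "SE", "NO", "DK"]
bad_countries = customers.filter(~F.col("country").isin(allowed_countries)).count()

# Halt this part of the logistics network instead of shipping bad goods.
if missing_keys > 0 or bad_countries / max(total, 1) > 0.01:
    raise ValueError(
        f"Data quality gate failed: {missing_keys} missing keys, "
        f"{bad_countries} rows with an unknown country code"
    )
```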