Modern big data platform – technology stack overview (part 1)
Big data covers a broad range of technologies and use cases. For clarity, I want to first highlight the areas I cover in this blog entry. I come from the world of medical data modalities, which generally makes me a person who processes batches rather than streams. So let me make the first assumption here – we cover batch processing scenarios. Also, we all know that modern AI-based applications are often part of big data use cases; however, they are out of the scope of this post. We won’t focus on the specialized requirements of complex machine learning computations. At least, not today.
Recent trends in the big data stack
When it comes to batch processing of big data, there are in fact two major phenomena that have changed the game:
The first one is obviously the cloud computing revolution and all the related beauties, like Infrastructure as Code (IaC), DevOps in its various flavors, and a general trend of making hardware much “softer”. It is great that nowadays you can provision a lot of infrastructure without touching any bare metal (see the first sketch below).
The second major phenomenon is the evolution and growing adoption of Kubernetes (K8s) as the resource scheduler for data processing use cases. K8s is best known as a host for complex microservices, with high availability and advanced load balancing in place. These days, however, I’ve noticed that Kubernetes can also replace YARN and take over the big data workloads that have traditionally run on Hadoop clusters (see the second sketch below).
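To make the “hardware becoming software” point concrete, here is a minimal IaC sketch using Pulumi’s Python SDK. Pulumi is just one of several IaC tools (Terraform would do equally well), and the resource name and tags below are hypothetical:

```python
# A minimal Infrastructure-as-Code sketch with Pulumi's Python SDK.
# Assumes the pulumi and pulumi-aws packages are installed and AWS
# credentials are configured; the name and tags are hypothetical.
import pulumi
from pulumi_aws import s3

# Declaring the bucket in code is all it takes; running `pulumi up`
# creates it (or updates it to match the declaration).
raw_data = s3.Bucket(
    "raw-data",
    acl="private",
    tags={"purpose": "batch-ingest"},
)

# Export the bucket name so other stacks or scripts can reference it.
pulumi.export("raw_data_bucket", raw_data.id)
```

The point is that the whole environment becomes versioned, reviewable code: tearing it down, recreating it, or cloning it for a test run is a single command.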
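And to illustrate the second trend: a Spark batch job can target Kubernetes simply by pointing the master URL at the K8s API server. A minimal sketch, assuming a reachable cluster and a prebuilt Spark container image; the URL, image, paths, and column name below are placeholders:

```python
# A sketch of running a Spark batch job on Kubernetes instead of YARN.
# The API server URL, container image, and S3 paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://my-cluster.example.com:6443")  # K8s API server
    .appName("nightly-batch")
    # Executors are launched as pods built from this image:
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.5.1")
    .config("spark.kubernetes.namespace", "batch-jobs")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# From here on, the job code is identical to what would run on YARN.
df = spark.read.parquet("s3a://my-bucket/input/")
df.groupBy("site_id").count().write.parquet("s3a://my-bucket/output/")
```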
It is worth mentioning that both of the above trends are interconnected. Kubernetes is considered a cloud-native technology. “Cloud nativeness” is another buzzword that has spread across the IT industry. My understanding of the term is that a cloud-native technology is well integrated with the cloud ecosystem and able to take advantage of the cloud provisioning model: it can dynamically react to changing conditions, for example by scaling up during periods of higher usage or adapting to increased resource demands. Kubernetes is a perfect candidate to operate in the cloud; however, it is not the only option.
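As a small illustration of that elasticity, this is roughly what “scaling up during periods of higher usage” looks like when expressed against the Kubernetes API. A sketch using the official Kubernetes Python client; the deployment name, namespace, and thresholds are hypothetical:

```python
# A sketch of declaring a HorizontalPodAutoscaler with the official
# Kubernetes Python client. The deployment name, namespace, and
# replica limits are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="batch-worker-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="batch-worker"
        ),
        min_replicas=2,
        max_replicas=20,
        # Add pods when average CPU utilization stays above 70%:
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="batch-jobs", body=hpa
)
```

Once this is applied, the cluster itself watches the load and adds or removes pods; nobody has to order, rack, or cable a new machine.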
What cloud brings to the table
I think the most important cloud feature with a huge impact on big data use cases is almost limitless scalability at a reasonable cost. Let’s be clear: cloud computing is not more expensive than on-premises data centers. Provided that the infrastructure is configured and managed properly, according to best practices, you can get huge computational power at a similar cost.
But does that really mean everybody should shut down their on-premises data centers and move to the cloud?
The devil is in the details, of course. In practice, various technologies fit the cloud to different degrees, so it depends on the actual technology stack, the type of data, and the processing patterns. In my other article, I discuss Hadoop as one of the candidate technologies to migrate from on-premises (in-house) data centers to cloud-based infrastructure and services.
Hadoop is actually a great example of a powerful technology that suffers from certain limitations when used in the cloud. It is therefore increasingly replaced by Kubernetes-based platforms or cloud-native offerings, and each case requires an adequate analysis. That is just the technological aspect; other aspects should also be considered, such as compliance with local laws or specific security and privacy requirements related to the physical location of the data you manage and process.
Conclusion
Apart from scalability, there are other significant advantages of using the public cloud. I have a developer background (though I haven’t been developing that much lately), and what I really like about the cloud is that it allows you to stay on the cutting edge. Public cloud providers like Amazon, Google, or Microsoft are constantly competing and releasing newer, better, and more advanced versions of their software and tools. This has a tremendous impact on the quality and efficiency of software development. It varies from company to company, of course, but developing on-premises usually means working with technology that is at least a few years old. That creates a technical debt trap and makes people dependent on technologies that are about to be deprecated. It is quite common that internal infrastructure and platform teams have limited capabilities to keep their users on the cutting edge, and this is a major issue even for big tech companies. If they want to gain a significant advantage over other companies, they should strive to stay on the side of novelty as much as possible.