Modern Big Data Platform – why do enterprises choose Databricks?
Reflecting on my decade-long journey navigating Big Data's evolution, advocating for the transition from legacy Hadoop to efficient, cloud-native solutions like Databricks.
I started my adventure with Big Data around 10 years ago. At that time, Hadoop was a game changer for dealing with significant data volumes. My first Big Data project was about indexing millions of documents and loading them into Apache Solr. It was one of the very first projects implemented on Cloudera's Distribution including Apache Hadoop (CDH), and our indexing jobs were written using the MapReduce programming model. Later, the pipeline was rewritten on the Apache Spark engine (Spark 2.x, to be precise).
Legacy
In many organizations, Cloudera was, and still is, one of the most widely used Hadoop distribution platforms in the enterprise setting. What was great about CDH was that it was a whole package of Big Data tools and components. It included Hadoop itself, with its YARN scheduler and HDFS, as well as HBase, Hive, Oozie, and even Apache Solr. Monitoring and log aggregation were built in and accessible through the user-friendly, web-based Cloudera Manager application. Security components enabled integration with the enterprise authentication and authorization ecosystem. HUE (Hadoop User Experience) was another key platform component, providing convenient interaction with HDFS as well as browsing, scheduling, and monitoring of jobs and SQL scripts executed against Hive and Impala. At that time, it was an outstanding and modern platform for dealing with huge data volumes.
However, the data ecosystem kept evolving. The cloud revolution, together with further Spark development and the maturing of Kubernetes, changed the rules of the game. Read about why modern infrastructure outperforms Hadoop and what the modern Big Data ecosystem looks like in my previous articles.
The reality is that Hadoop is steadily being replaced by alternative platforms like Databricks or Snowflake. The primary reasons are as follows:
the cloud-native architecture of alternatives
the ease of infrastructure management
cost-effectiveness
Why many enterprises select Databricks
Databricks is one of the first choices for replacing traditional Hadoop distributions, and the move is usually part of a broader cloud adoption strategy. The choices are limited when it comes to migrating on-premises Big Data workloads to the cloud.
The common approach is to simply use one of the cloud-based Hadoop distributions. Every major public cloud has its own Hadoop offering: AWS goes with EMR, GCP has Dataproc, and Azure provides HDInsight. However, Hadoop, built around YARN, is not a cloud-native technology, and it cannot fully benefit from breakthrough cloud features such as elastic, on-demand scaling and the separation of storage and compute. For a better understanding, I encourage you to read my other article.
If your organization has been using CDH, an obvious option is migrating to Cloudera's latest offering, the Cloudera Data Platform (CDP). It is a much more modern approach to handling Hadoop workloads: it integrates with the public cloud for better scalability and can use Kubernetes as an alternative scheduler. For many organizations, especially those whose compliance constraints prevent them from moving entirely to the cloud, this is a sensible way to refresh the Big Data ecosystem. However, maintaining on-premises data centers in a hybrid cloud setup can end up more expensive than a full transition to the cloud.
For companies that access their data mostly over SQL-based interfaces, typically through Hive (or Impala), an interesting option is adopting a platform such as Snowflake, Redshift, or BigQuery. The problem starts when most Big Data workloads are coded in Spark. In that scenario, rewriting Scala or Python code in SQL could require enormous effort and quickly becomes a no-go, as the sketch below illustrates.
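To make the point concrete, here is a minimal, hypothetical PySpark snippet (the paths, column names, and cleanup rules are invented for illustration). The imperative logic inside the UDF has no direct SQL equivalent, which is exactly what makes a wholesale rewrite for a SQL-only warehouse so costly.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("legacy-style-etl").getOrCreate()

def normalize_device_id(raw: str) -> str:
    # Imperative cleanup rules accumulated over years of production fixes --
    # the kind of code that resists a clean translation into SQL.
    if raw is None:
        return "unknown"
    cleaned = raw.strip().lower().replace(":", "").replace("-", "")
    return cleaned if len(cleaned) == 12 else "invalid"

normalize_udf = udf(normalize_device_id, StringType())

events = spark.read.parquet("/data/raw/events")  # hypothetical input path
cleaned = events.withColumn("device_id", normalize_udf(col("device_id")))
cleaned.write.mode("overwrite").parquet("/data/curated/events")
```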
Finally, we get to Databricks. Suppose your organization uses traditional Hadoop and runs Spark jobs coded in Scala or Python to populate Hive or Impala tables, enabling downstream applications to access the data through convenient SQL-based interfaces. You want to move from on-premises infrastructure to modern cloud infrastructure to keep the cost of your workloads under control. If so, you should definitely look at Databricks. It's not that I am a particular advocate of Databricks; it is simply the most convenient place to land when moving away from an old Hadoop cluster.
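For illustration, the same pattern on Databricks could look roughly like the sketch below (the database, table, and storage paths are made up for the example): the familiar PySpark transformations now land in a Delta table, and downstream consumers keep querying it over plain SQL.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

# On Databricks, a SparkSession is provided by the runtime.
spark = SparkSession.builder.getOrCreate()

orders = (
    spark.read.format("json")
    .load("/mnt/raw/orders")  # hypothetical cloud storage location
    .withColumn("order_date", to_date(col("order_ts")))
    .filter(col("status") == "COMPLETED")
)

# Writing a managed Delta table replaces the old "populate a Hive table" step.
(
    orders.write.format("delta")
    .mode("overwrite")
    .saveAsTable("sales.completed_orders")
)

# Downstream applications keep their convenient SQL-based interface.
spark.sql(
    "SELECT order_date, count(*) AS orders "
    "FROM sales.completed_orders GROUP BY order_date"
).show()
```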
I appreciate Databricks for three major reasons. I like its development environment, which makes building Big Data jobs efficient. I like Delta Lake, which is a better alternative to Hive. And above all, I value its approach to infrastructure management, or more precisely, how much of that management it simply removes.
Let me elaborate on each of these in the next part.