Poor implementation will strike back at you
There are many reasons why big data pipelines drift toward a nightmare; in this article, I'll highlight a few key factors.
It is a well-known refrain: the architecture of your big data platform and the implementation of your pipelines are extremely important. How many times have I heard this? The reality, however, is that there are no perfect pipelines out there, and each one has room for improvement. Of course, this is a matter of balance between perfection and practicality. You want to keep your pipelines’ flaws to a minimum. If you end up with poorly designed pipelines, you will have trouble maintaining them, testing them, and developing on top of them. There are many reasons why big data pipelines drift toward a nightmare, and it is hard to cover them all. In this article, I'll highlight a few key factors.
Architecture vs implementation
Before we begin, let's first draw a line between architecture and implementation. Architecture operates at a higher level and focuses on the overall structure of the system; it is common across multiple pipelines. A good example is the Medallion Architecture from Databricks. It describes overall concepts such as how source data should enter the platform, where the data should be conformed and cleaned, and how the data should be exposed to analysts or downstream systems. Architecture focuses on the scalability of the system, the selection of key components and technologies, and the integration points.
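To make the idea more tangible, here is a minimal PySpark sketch of the medallion flow: raw data lands in a bronze table, gets conformed and cleaned into silver, and is aggregated into gold for consumption. The table names, paths, and cleaning rules are illustrative assumptions, not part of any particular platform.

```python
# A minimal sketch of the medallion flow, assuming hypothetical table names,
# paths, and cleaning rules.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land the source data as-is, only adding ingestion metadata.
bronze = (
    spark.read.json("/landing/orders/")  # hypothetical landing path
    .withColumn("_ingested_at", F.current_timestamp())
)
bronze.write.mode("append").saveAsTable("bronze.orders")

# Silver: conform and clean the data once, in one place.
silver = (
    spark.table("bronze.orders")
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("amount") >= 0)  # drop obviously invalid rows
)
silver.write.mode("overwrite").saveAsTable("silver.orders")

# Gold: expose business-level aggregates to analysts and downstream systems.
gold = (
    spark.table("silver.orders")
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_revenue"))
)
gold.write.mode("overwrite").saveAsTable("gold.daily_revenue")
```

The point is not the specific code but the separation of concerns: each layer has a single, well-defined responsibility, which is exactly what the architecture prescribes.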
What is important to mention is that architecture does not touch implementation details. Implementation is the detailed way you meet specific requirements: the configuration of your software components and the specific code that performs transformations and loads your data into target data structures. If you think about it, you can have a perfectly valid and correct architecture and still have incorrect, poorly performing, and simply badly designed pipelines. Unfortunately, even great architecture does not protect you from bad implementation. In fact, it is quite common for well-designed platforms built from state-of-the-art components to host solutions that are a nightmare to maintain and test.
Why things go wrong
There are many reasons why your implementation goes in the wrong direction. If you think about it, nobody really wants things to go wrong, yet they frequently do. Have you ever heard anybody say that they like to develop unmaintainable software? Probably not… Engineers, business analysts, product owners, and sponsors – all of them would like to build state-of-the-art solutions that follow best practices. In reality, the implementation often drifts into a state where further development becomes difficult, not to mention technological upgrades or migrations.
From my experience, these are the most common reasons:
Poor requirements management and constant scope change
Never-paid technological debt
Bad approach to pipeline testing
Resistance to change
Let me write a few sentences about each of these reasons. Keep in mind that these are my subjective observations, and I am sure your experience may be different. I would actually be very interested to hear your reasons.
Meeting requirements
The first trap in implementations is actually set by the people for whom the pipeline is developed – the customers. The first red light comes on when you, as a pipeline developer or designer, don’t understand a given requirement – or, more precisely, you don’t know why on earth anybody needs something that sounds like nonsense to you. Alarm bells should ring when you find the logic for “fixing” source data during processing questionable, encounter nondeterministic lookup algorithms, or realize you need to generate many alternative processing paths based on conditions found in the data instead of conforming the source datasets early in the process. Usually, such unconventional processing stems from the genuinely good intention of increasing data quality. The problem is that if you are fixing data between transformations and in various places across the pipeline, you end up with a result that is extremely difficult to verify. This, by the way, is exactly why it is best to fix quality issues directly at the source.
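To show what I mean by conforming early, here is a small, hypothetical PySpark sketch in which all the known fixes for a source dataset live in a single conform step at the start of the pipeline, so downstream transformations can assume clean input. The column names and rules are made up for illustration.

```python
# Hypothetical sketch: keep all data "fixes" in one conform step at the start
# of the pipeline instead of scattering them between transformations.
from pyspark.sql import DataFrame, functions as F


def conform_customers(raw: DataFrame) -> DataFrame:
    """Apply every known quality fix to the source dataset in one place."""
    return (
        raw
        .withColumn("country", F.upper(F.trim("country")))  # normalize codes
        .withColumn("email", F.lower("email"))
        .filter(F.col("customer_id").isNotNull())
        .dropDuplicates(["customer_id"])
    )

# Downstream transformations receive already-conformed data and stay free of
# ad-hoc fixes, which keeps the end result much easier to verify.
```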
The other thing about requirements is that they change frequently, especially in the era of Scrum. Don’t get me wrong; I am a big advocate of Scrum and agile development. The thing is, being agile is often misunderstood, and a good product owner is a valuable part of the whole team. Being able to understand and question your product owner’s requirements is also an important skill. The Scrum team is a single team, and good discussion and mutual understanding between developers and product owners results in a better product. If a requirement is inconvenient to develop but truly important, that should motivate the development team to make sure it is fulfilled. But if a requirement is difficult to implement and its value is questionable, the need for it should be refined. As you can see, the responsibility sits on both sides.
Never-paid technological debt
This is actually connected to requirements – more specifically, to the rush to implement them. In an ideal world, we know the requirements upfront, they are not contradictory, and they fit the reality we can see in the datasets. However, the world is far from ideal. As a result, we as developers constantly trade off between the solution we could be proud of and a working implementation that is not perfect but does the job. The trick is to keep this balance acceptable.
Workarounds and shortcuts are fine as long as you have a plan to get rid of them. Your debts have to be paid. But to pay them off, you need sufficient resources – in our big data projects, that means the time your team could spend fixing the implementation. The reality is that right after meeting one rough deadline, projects enter another sprint or release and once again fall into chasing the next requirements or fixing issues caused by poorly considered features. The debt grows, and it becomes much harder to pay. At some point, you realize it is simply impossible to pay it off without redesigning the whole thing from scratch.
As a developer, you need to fight for the time needed to pay down the tech debt. As a product owner or sponsor, you need to spend some resources on your debts. Otherwise, it will strike back at you. Your solution could become simply unmanageable and excessively expensive. Testing, upgrading, and further development will become extremely challenging, potentially leading straight to bankruptcy…
Pipeline testing
I have worked with many smart colleagues. Once, during a discussion about testing our genomic pipelines, a fellow QA automation engineer shared a brilliant analogy. He likened testing a complex pipeline to checking a plane before takeoff: you don’t have to take a flight to know that the plane is ready to fly. I found this analogy extremely apt.
Testing is a complex process and should happen at various levels. When it comes to data processing pipelines, end-to-end testing is often considered the most important and reliable strategy, and there are many good reasons for this. The problem is keeping end-to-end tests working while the pipelines themselves are being developed. There is a continuous struggle to keep these tests functional, which requires constant, never-ending updates. Some changes to the pipeline result in test failures, and you need to analyze why they occur. Usually, the tests have to be updated to fit the changed pipeline (incorrect assertions, different processing orders, changed configuration, and so on).
End-to-end tests are often based on a significant portion of the data we expect in a production setup. This increases execution times and the required compute resources, which in turn increases the cost of these tests. In fact, they are often much more expensive than the actual production runs. Moreover, end-to-end tests are sequential by design, and after a failure they are often rerun from the beginning. It sounds like a pure waste of time and money.
You don’t have to take a flight to know that the plane is ready to fly
End-to-end tests are like flying a plane to prove that it can fly. Of course, it’s crucial to confirm that the plane flies, and you need to test that somehow. However, just as planes should mostly fly when transporting passengers, pipelines should mostly run end-to-end when delivering production workloads. There are other strategies for keeping your pipelines tested and for designing your tests in a less monolithic way, and they are worth considering.
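As one example of a less monolithic approach, here is a hypothetical unit test for a single transformation. It runs on a handful of hand-crafted rows instead of a production-sized dataset, so it is cheap to execute and easy to update when the logic changes; the function and test names are assumptions for illustration.

```python
# Hypothetical pytest sketch: test one transformation in isolation on a few
# hand-crafted rows instead of flying the whole pipeline end-to-end.
import pytest
from pyspark.sql import SparkSession, functions as F


def deduplicate_orders(df):
    """Transformation under test: keep one row per order_id, drop bad amounts."""
    return df.dropDuplicates(["order_id"]).filter(F.col("amount") >= 0)


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def test_deduplicate_orders_keeps_one_valid_row(spark):
    raw = spark.createDataFrame(
        [("o1", 10.0), ("o1", 10.0), ("o2", -5.0)],
        ["order_id", "amount"],
    )
    result = deduplicate_orders(raw)

    assert result.count() == 1
    assert result.first()["order_id"] == "o1"
```

Tests like this will not replace the final “test flight”, but they catch most regressions long before an expensive end-to-end run.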
Resistance to change
I was uncertain whether to list resistance to change as a reason for poor implementation. However, the more I think about it, the more convinced I become that it is a valid one. With so many cool technologies being released every day, things could become much easier if you know how to benefit from these novelties. Technologies such as cloud infrastructure, resource management with Kubernetes, and newer versions of big data engines like Spark are changing the whole big data ecosystem. Failing to adopt these innovations can slow you down, create barriers, or even leave your pipelines unsupported.
Developers need to keep up with the big data ecosystem and understand how new technologies can improve their work. Of course, it is often easier, and sometimes natural, to implement things the way we did in the past. However, we have no guarantee that this is the optimal approach. This is another tradeoff we face. It doesn’t mean we should try every exotic new technology we come across; we need to be reasonable and use our expertise to make the right choices.
Summary
There are no perfect implementations. Every big data platform and every complex system can be built in a better, more efficient way. Nevertheless, you can end up with a maintainable solution that is reasonably priced and runs reliably in production – or you can end up with a monster that is impossible to debug and expensive to test and develop. In this article, I’ve described a few subjective reasons why we end up with poor implementations. They are worth keeping in mind and avoiding in your data products.