Trevor Barnes · 2024-10-05 · 9 min read
"Data engineer" has been a very tricky definition to narrow down at the companies I have worked for. What would be considered a "good" data engineer at some companies wouldn't even get in the door at others, but not necessarily because of a skill or knowledge gap. The difference comes down to the lack of a solid definition of a data engineer. One company might look for an understanding of Scala programming, AWS EMR, Hadoop, Lambda, and Kinesis, while another's needs focus on Linux servers, Python, Airflow, and shell scripting.
Fundamentally, a "data engineer" to me is anyone with the knowledge to implement a common and robust set of software disciplines to move information from source to consumer. "Fundamentals of Data Engineering" pretty much aligns with this philosophy. While their definition is a little more thorough, it captures the gist of what a data engineer can expect at a fundamental level. Data engineers are data lifecycle managers. They need to make sure that, from generation to storage to ingestion to transformation to serving, security, data management, DataOps, data architecture, orchestration, and software engineering best practices are followed.
One of the topics that really stuck with me when I took the University of Washington's "Big Data" program was the back and forth between data systems that parallelize processing (MPP) across multiple compute nodes and Moore's law making it easier to process data on a single node. 1980-2000 focused on MPP, 2000-2010 swung back to highly capable data warehouses, and 2010-2020 focused on data lakes. Now we are in a mix: we still use MPP platforms like AWS EMR, Databricks, and Redshift, but we can also do more with less, with architectures like Data Mesh that let tools like DBT aggregate data and version control it across multiple endpoints. Couple that with compute-efficient ETL from Python packages like Polars running on AWS Lambda, and you can build rather robust architectures that handle large amounts of data very quickly at minimal cost.
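To make that concrete, here is a rough sketch of the kind of lazy, streaming Polars job I am picturing inside a Lambda handler. The bucket, columns, and paths are made up, and the exact streaming flag has moved around between Polars versions, so treat it as a shape rather than copy-paste code:

```python
# Sketch of a lazy Polars ETL step that could run inside an AWS Lambda handler.
# Bucket, paths, and column names are illustrative only.
import polars as pl

def handler(event, context):
    result = (
        pl.scan_parquet("s3://my-bucket/raw/orders/*.parquet")  # lazy scan, nothing loaded yet
        .filter(pl.col("status") == "complete")                 # predicate pushed down to the scan
        .group_by("customer_id")
        .agg(pl.col("amount").sum().alias("total_spend"))
        .collect(streaming=True)  # process in chunks to stay inside Lambda's memory limits
    )
    result.write_parquet("/tmp/daily_totals.parquet")
    return {"rows": result.height}
```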
One thing I do not agree with in "Fundamentals of Data Engineering" is the authors' belief that using several different tools at each step of the lifecycle, thus creating "type A" data engineers (those that focus on tech that abstracts away a lot of the code), will be the optimal solution for most situations. While in theory this sounds like a good idea, connecting all the different toolsets together has not always been an easy task, and it creates a whole host of new problems that require a different type of DevOps knowledge to manage.
It is best to choose tools as components of an ecosystem rather than on the merits of each tool on its own. As an example, if your data stack uses Airbyte, DBT, and Databricks, it might be tempting to choose a tool like Airflow as your orchestrator, especially when several of your data engineers have experience building pipelines in Airflow. In reality, it might be significantly better long term to go with a tool like Dagster instead, because of the out-of-the-box integrations with Airbyte, DBT, and Databricks that Dagster offers.
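As a rough sketch of what those out-of-the-box integrations buy you, this is roughly what wiring dbt models into Dagster looks like with the dagster-dbt package. The project path and asset name are made up, and the exact API surface varies by version, so check the docs for yours:

```python
# Sketch of Dagster's off-the-shelf dbt integration instead of hand-rolled glue.
# Project path and asset names are hypothetical.
from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = "/opt/dbt/my_project"  # hypothetical location of the dbt project

@dbt_assets(manifest=f"{DBT_PROJECT_DIR}/target/manifest.json")
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Every dbt model becomes a Dagster asset, with lineage, schedules,
    # and retries handled by the orchestrator rather than custom code.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[my_dbt_assets],
    resources={"dbt": DbtCliResource(project_dir=DBT_PROJECT_DIR)},
)
```

The same pattern exists for Airbyte and Databricks through their own Dagster integration packages, which is exactly the glue you would otherwise be writing and debugging yourself.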
I have heard that you should "stand on the shoulders of giants" and avoid writing bespoke code, and I agree with this. For the most part, tools like Airbyte, DBT, Dagster, and the like let users get ETL up and running without needing to write UIs and connectors from scratch. I think they can certainly help shape an architecture, but, like I mentioned earlier, it gets to a point where you are just managing log files from several different applications, trying to figure out which one is the culprit when something goes wrong. So now you have to become an expert in how each open source tool functions and what underlying tech it uses. Deploying Airbyte? Hopefully your Kubernetes, containerization, Python, PyArrow, and DBT skills are good, or else you are going to be shrugging your shoulders when your first error happens. Databricks? Better learn how to read messy Java errors and pick up a basic understanding of Scala too, or you might as well not deploy at all.
I have witnessed these types of deployment errors in the past and have fallen victim to them myself. Early in my learning, I tried to deploy a bunch of Singer pipelines without fundamentally understanding Linux, cron, the CLI, or how Python connects to REST APIs and databases. I could get things up and running by following the tutorials, but had no clue how to debug the errors. Was it a Postgres error? A DBFS error? A Python error?
At this point you might think I am leading into the statement "Just write your own tools!", but you can fall into an even deeper trap creating bespoke tools.
There is nothing more satisfying than writing custom pipelines from scratch and watching them, hour by hour and day by day, operate without a problem. You spent countless hours reading docs for modules and libraries, writing tests, and watching as data from your program moves from point A to point B flawlessly. But then you share your code. Everything you thought made sense and reflected good coding habits, you now realize, is a spaghetti mess that is useless to anyone but you. You try to get your peers to understand it, but even with your code comments you have no idea why you wrote that function that way. Sure, it cuts run times by 90% compared to Airbyte and uses 95% less compute than that heavy open source tool you deployed for moving a bunch of data from simple RDBs, but is it any good if no one else can use it?
I have seen this at pretty much every job I have worked, and unfortunately I have been guilty of perpetuating these bad habits. It is important to recognize when your programs are worse than the alternatives, whether open source or another colleague's code. The main inhibitor of quality data engineering I have seen is people getting overly attached to their outdated, overly complex spaghetti code.
As analysts and engineers, we need to constantly evaluate our systems and verify that they are still optimal, but what do we look at? I propose the following table for grading:
| Field | Description | Example score (1-5, 5 is best) |
|---|---|---|
| Cost | Is this solution cost effective? | 1 |
| Agility | Is this developed using agile best practices? | 3 |
| Scalability | Can this scale to the largest datasets you expect in the future? | 5 |
| Simplicity | Is it simple to use, contribute code to, or work with via the API/SDK? | 2 |
| Reuse | How easy is it to deploy on other systems? | 3 |
| Interoperability | Can it work with other systems? | 4 |
The minimum possible score is 6 and the maximum is 30. If any field scores a 1, I would not proceed unless there is a plan to improve that field, and I would want a total score of at least 20 before proceeding with the system.
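As a toy illustration of that gate, here is a small helper using the example scores from the table above (the function and threshold are just my own convenience, not any standard tool):

```python
# Toy helper for the grading table above: flags any field scored 1 and
# enforces the minimum total of 20 before a system is worth proceeding with.
SCORES = {
    "Cost": 1,
    "Agility": 3,
    "Scalability": 5,
    "Simplicity": 2,
    "Reuse": 3,
    "Interoperability": 4,
}

def should_proceed(scores: dict[str, int], minimum_total: int = 20) -> bool:
    blockers = [field for field, score in scores.items() if score == 1]
    if blockers:
        print(f"Blocked on {blockers}: plan to improve these before proceeding.")
        return False
    return sum(scores.values()) >= minimum_total

print(should_proceed(SCORES))  # False: Cost scored 1, and the total of 18 is below 20 anyway
```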
I will use my Rust ETL as an example.
| Field | Score | Reason |
|---|---|---|
| Cost | 5 | It can't get any cheaper: the system is extremely compute efficient (a Rust quality more than anything to do with my code), and it deploys easily to Lambda, one of the cheapest ways to buy compute |
| Agility | 2 | While it is version controlled, there is no roadmap or planned features and very few comments; I aim to improve this score in the future |
| Scalability | 5 | I am using tokio, async, and parallelism, which lets the deployment scale out nearly without limit, and Polars "streaming" lets me spill larger-than-memory datasets to disk |
| Simplicity | 3 | It has a CLI and pipelines can be defined in YAML, but bugs will be very tricky for a non-Rust dev to debug. Using expect() instead of unwrap() should help give clearer errors, though |
| Reuse | 4 | The code can be deployed on nearly every platform once compiled for that platform |
| Interoperability | 3 | Connectors have to be written by hand if they don't already exist, but where a connector does exist it lets you pull and push data as needed |
My Rust-based ETL scores high on performance and cost but lacks ease of use; it totals 22 with no field at 1, so it clears the bar I set above. And, like I mentioned, Rust is the main reason the code is efficient and scalable. This means that if I were to build a data engineering team around this tool, or if it were in the selection process for an existing data engineering team, the team would need to focus on being highly agile and on making sure the code is clear, concise, and well documented.
Data engineering from an abstract perspective is relatively easy to define; it is when you try to pin it down precisely, the way you would try to define something as abstract as the color blue, that it becomes difficult. Data engineering means a lot of different things to different people and companies. If you find yourself moving data from point A to point B on a daily basis, and you are also responsible for the security, quality, and control of the tools or code that run those pipelines, then you are a data engineer.