<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Trevor’s Blog: Data Engineering]]></title><description><![CDATA[Post about Data Engineering]]></description><link>https://trevor-barnes.com/s/data-engineering</link><image><url>https://substackcdn.com/image/fetch/$s_!4Ems!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5a60147-95d5-428a-8ced-88ab3a250b8a_608x608.png</url><title>Trevor’s Blog: Data Engineering</title><link>https://trevor-barnes.com/s/data-engineering</link></image><generator>Substack</generator><lastBuildDate>Sat, 11 Apr 2026 06:05:33 GMT</lastBuildDate><atom:link href="https://trevor-barnes.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Trevor Barnes]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[trevorbarnes91@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[trevorbarnes91@substack.com]]></itunes:email><itunes:name><![CDATA[Trevor Barnes]]></itunes:name></itunes:owner><itunes:author><![CDATA[Trevor Barnes]]></itunes:author><googleplay:owner><![CDATA[trevorbarnes91@substack.com]]></googleplay:owner><googleplay:email><![CDATA[trevorbarnes91@substack.com]]></googleplay:email><googleplay:author><![CDATA[Trevor Barnes]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Stop using Python for Data Engineering!]]></title><description><![CDATA[Start writing today.]]></description><link>https://trevor-barnes.com/p/stop-using-python-for-data-engineering</link><guid isPermaLink="false">https://trevor-barnes.com/p/stop-using-python-for-data-engineering</guid><dc:creator><![CDATA[Trevor 
Barnes]]></dc:creator><pubDate>Sun, 13 Oct 2024 20:00:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe88c5e-e1ac-4745-b4f3-29cc591f685d_1174x1176.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Why? Because I am going to say the meme "Python Slow", but in all honesty it is much more than that. If executives knew how much money is thrown out the window because we move terabytes to petabytes of data each day in a language optimized for nothing except letting people who have never programmed before run their first program in under a minute, heads would roll. Now, I am not saying that you can't use a Python API to do data engineering; you just need to make sure that the query engine under the hood is not Python based. 
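</p><p>As a minimal, self-contained illustration of this pattern (my own example, not one of the engines discussed in this post): the standard library's sqlite3 module is exactly a "Python API over a non-Python engine". The Python layer only describes the query; SQLite's C engine does the scanning and aggregation.</p>

```python
import sqlite3

# Python only *describes* the work here; SQLite's C engine executes the
# scan and the aggregation outside the Python interpreter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)
totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM events "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(totals)  # [(1, 15.0), (2, 7.5)]
conn.close()
```

<p>The same division of labor, Python describing the plan and native code executing it, is what the modern query engines provide at data-warehouse scale.</p><p>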
With each passing day it seems a new query engine is being built, whether it's Apache Spark, DuckDB, Polars, DataFusion, or one of many smaller projects rapidly gaining traction. If you aren't actively working to refactor your pandas (and potentially Dask) dataframes onto an engine that manages the data via C++, C, Java, or Rust under the hood, you aren't doing anyone any favors.</p><h2>Anecdotes gonna Anecdote</h2><p>Recently, I was tasked with refactoring a Python script that was taking about 20 hours to run. I wondered how much data had to be processed for a job to take that long: it must have been dozens of terabytes, or highly complex computations running on a small EC2 instance somewhere. Well, come to find out, it was running on an i4i.16xlarge machine. That's 64 vCPUs and 512 GB of RAM! So the data source they were pulling from must be I/O constrained, right? Again, not the issue. It was a sharded, read-replica Postgres instance that could eat up any load thrown at it, especially given its production criticality. So the source would let you read data as fast as you asked for it, and queries filtered on the unique ID were blazingly fast. My next thought was "whoever wrote the script must really not know what they are doing". So I dove into the code base... pure Python, written by two very talented Software Engineers, using optimizations that looked straight out of the "Fluent Python" book; it was far better than anything I could have hoped to write. So: the machine is large enough, the source is fast enough, the code is optimized enough. How the hell am I supposed to speed up this code?</p><h2>"Just refactor to Rust, bro!"</h2><p>Enter stage right, meme 2. I needed to move this data, which was growing at an exponential rate, to our analytics warehouse quickly. But how? 
I had been learning some Rust and C in my spare time out of curiosity and decided, "Screw it, I'll meme this project; we are refactoring to Rust". I was prepared for this to take weeks: fighting the borrow checker, dealing with lifetime errors, "map error, trait does not exi...", refactoring my shoddily written newbie Rust code so it could maybe see a 20% speed boost over its Python counterpart. Writing Rust actually wasn't that bad. I know I wasn't writing it as "idiomatically" as I should, but hey, I could always clean it up later.</p><p>Well, roughly 1.2k lines of poorly written, beginner Rust code later... that Python script that took 20 hours to complete ran in... drum roll... 45 minutes in Rust. How? I don't know. Clones everywhere, poorly optimized data structures, the sqlx and Tokio crates used incorrectly, but Rust was just faster. My boss was impressed, but I wasn't. I felt like some kind of fraud; could it really be that easy? I asked to spend more time cleaning up the code, but everyone was just happy that they could get the data quickly.</p><h2>What's going on?</h2><p>As someone who had only a passing interest in systems languages, I was now hooked on systems programming for Data Engineering. It just kind of clicked: "Why not use the most performant languages to support the largest and most cumbersome data processes within a business?" Yet the Data Engineering community seems stuck on Python and Pandas.</p><p>Before I explain what I think is happening: I mean no disrespect to Business Intelligence, Data Analysts, and Data Engineers. This has just been my experience working at several different companies, one of them a FAANG. I am not from a Computer Science or Software Development background. My degree was in Aviation Technology; I have more credentials to be a pilot and aircraft mechanic than a Software Engineer. But I have noticed that a lot of my colleagues from the BI space follow a trajectory of Analyst -&gt; BI -&gt; Data Engineer. 
Meaning, most Data Engineers I have run into have only ever worked within Python and have never really needed to understand the machine at a more granular level than "code goes in, number comes out".</p><p>I believe modern Data Engineering is suffering from the same issues as modern web dev: too many people from non-CS backgrounds who take a Udemy course, go to a bootcamp, or read "Automate the Boring Stuff with Python" (one of the best programming books ever, btw), then do some leetcode to learn how to sort a Python list, pass the coding test for an entry-level Data Engineering position, and work their way up from there. There is no shame in this; it is what I did years ago to get an entry-level position at Amazon!</p><h2>How to fix it?</h2><p>Even at Amazon, I worked with Data Engineers who did the same thing as me: not from CS backgrounds and never having worked as Software Developers. Then I met Dennis... I call Dennis the first <em>true </em>Data Engineer I met. Dennis was a former Software Developer and, more importantly, a former Marine. He didn't take bullshit, but he didn't give it either. When he talked about Data Engineering, it was on a different level. Before this time, data was discussed as tables and dataframes being pushed from one place to the next; as long as the data landed from point A to point B, who cared about anything else? The customer was happy, we were happy.</p><p>Then Dennis explained that this wasn't good enough: data will grow, your pipelines will clog, and failing jobs will cause other pipelines to fail in turn. He was right. Our data was growing rapidly, and the cracks in our queries on Redshift were starting to give. Python pipelines doing CSV copies into Redshift had to go. 
We were now worried about column compression, Z-ordering, partitioning, when to use Parquet vs Avro vs JSON vs CSV, "can we move this pipeline to EMR to reduce the load on our cluster", "we need to start utilizing Spectrum", separating compute from storage. Dennis gave me my first real lesson in Data Engineering after years of doing "Data Engineering".</p><p>Upon reflection, I was in the middle of the first big data revolution. We were moving from a Data Engineering world of "just ship the data" to understanding that if we optimized our data pipelines and storage, we could easily save hundreds of thousands if not millions of dollars a year.</p><h2>Big Data Revolution 2: Electric Boogaloo</h2><p>There has been a rise in Big Data applications for Data Engineering that focus on optimizing compute. In the first big data revolution we optimized storage, with new compression algorithms and formats like Parquet and Delta. This big data revolution will focus on using those formats to do more with less in systems-level languages. We are already starting to see the shift:</p><ul><li><p>Polars -&gt; Rust</p></li><li><p>Databricks Photon (for Spark) -&gt; C++</p></li><li><p>Apache DataFusion Comet (for Spark) -&gt; Rust</p></li><li><p>cuDF -&gt; C++/CUDA</p></li><li><p>Apache DataFusion -&gt; Rust</p></li><li><p>DuckDB -&gt; C++</p></li><li><p>Apache Arrow (the in-memory format used by several of the technologies above)</p></li></ul><p>These are the future tools of Data Engineering. So should you go out and learn one of these languages in order to stay relevant in Data Engineering? Absolutely not; you can be an extremely effective Data Engineer knowing only Python.</p><p>So what's the deal with the clickbait title? Well, I truly believe the future of Data Engineering is Python APIs <em>for </em>systems languages. Every single technology in the list above has a Python API, and using it will instantly speed up your data pipelines. 
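</p><p>As a toy sketch of the constraint (my own illustration, using only the standard library): CPU-bound work written in pure Python gains nothing from threads, because the GIL lets only one thread execute Python bytecode at a time. The native engines above sidestep this by doing the heavy lifting in compiled code that releases the GIL.</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n: int) -> int:
    # Pure-Python arithmetic: the GIL serializes threads running this.
    return sum(i * i for i in range(n))

start = time.perf_counter()
serial = [cpu_bound(300_000) for _ in range(4)]
serial_secs = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(cpu_bound, [300_000] * 4))
threaded_secs = time.perf_counter() - start

# Both runs produce identical results, but on CPython the threaded run
# takes roughly as long as the serial one: the threads interleave on one
# core instead of running in parallel.
assert serial == threaded
```

<p>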
If I am using the Python version of Polars, will it be faster than Rust*? No. That is because Python is limited by its runtime and the GIL while running those jobs, whereas in Rust I can just slap <code>#[tokio::main]</code> on my main function and asynchronously run any number of threads and tasks in parallel with <code>tokio::task</code>.</p><h1>Conclusion</h1><p>All this is to say: while you don't need to learn a systems language to leverage highly performant Data Engineering programs, I think every programmer, whether a web dev, data scientist, data engineer, or anything else, will become better at their job by learning computer science fundamentals. You can still utilize Python; just use it as the wrapper around a systems-language-based library.</p><p>Footnote:</p><ul><li><p>I want to clarify that the Python implementation of Polars can run the Rust operations faster than the same code in pure Rust. The reason is that when Polars is built for the Python API, the Rust compiler can focus solely on optimizing the Polars library, whereas if you use the Polars crate inside a Rust application, the compiler has to optimize your other crates, too. 
More is explained by the creator of Polars: <a href="https://github.com/pola-rs/polars/issues/8391?ref=trevor-barnes.com">https://github.com/pola-rs/polars/issues/8391</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Reflections on Data Engineering Described]]></title><description><![CDATA[What is Data Engineering?]]></description><link>https://trevor-barnes.com/p/reflections-on-data-engineering-described</link><guid isPermaLink="false">https://trevor-barnes.com/p/reflections-on-data-engineering-described</guid><dc:creator><![CDATA[Trevor Barnes]]></dc:creator><pubDate>Fri, 05 Apr 2024 22:49:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1d0bc555-4f7b-4be5-b054-e5f729461cef_2000x1333.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What is Data Engineering?</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OA1g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OA1g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg 424w, https://substackcdn.com/image/fetch/$s_!OA1g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg 848w, https://substackcdn.com/image/fetch/$s_!OA1g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!OA1g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OA1g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Reflections on Data Engineering Described&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reflections on Data Engineering Described" title="Reflections on Data Engineering Described" srcset="https://substackcdn.com/image/fetch/$s_!OA1g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg 424w, https://substackcdn.com/image/fetch/$s_!OA1g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg 848w, https://substackcdn.com/image/fetch/$s_!OA1g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!OA1g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdbf5ec1-4fb7-487e-9284-32608c8dd124_2000x1333.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>This has been a very tricky definition to narrow down at the companies I have worked at. What would be considered a "good" data engineer at one company wouldn't even get in the door at another, and not necessarily because of a skill or knowledge gap. The difference comes down to the lack of a solid definition of a data engineer. One company might look for an understanding of Scala programming, AWS EMR, Hadoop, Lambda, and Kinesis, while another focuses on Linux servers, Python, Airflow, and shell scripting.</p><p>Fundamentally, a "data engineer" to me is anyone with the knowledge to apply a common and robust set of software disciplines to move information from source to consumer. <em>Fundamentals of Data Engineering</em> pretty much aligns with this philosophy. While its definition is a little more robust, it captures the gist of what a Data Engineer can expect at a fundamental level. Data Engineers are data lifecycle managers. They need to make sure that, from generation to storage to ingestion to transformation to serving, security, data management, DataOps, data architecture, orchestration, and software engineering best practices are followed.</p><h2>The Data Engineering &#8220;Pendulum&#8221;</h2><p>One of the topics that really stuck with me when I took the University of Washington&#8217;s &#8220;Big Data&#8221; program was the back and forth between data systems focused on parallelizing (MPP) across multiple compute nodes and Moore&#8217;s law making it easier to process data on a single compute node. 1980-2000 focused on MPP, 2000-2010 swung back to highly capable data warehouses, and 2010-2020 focused on data lakes. 
Now we are in a mix: we utilize MPP platforms like AWS EMR, Databricks, and Redshift, but we can also do more with less through architectures like Data Mesh, which let tools like DBT aggregate data and version control it across multiple endpoints. Couple that with the growth of compute-optimized ETL, with Python packages like Polars running on AWS Lambda, and you can build rather robust architectures that handle large amounts of data very quickly for minimal cost.</p><h2>Modern Data Engineering</h2><p>One thing I do not agree with in <em>&#8220;Fundamentals of Data Engineering&#8221;</em> is the authors&#8217; belief that utilizing several different tools at each step of the lifecycle, thus creating &#8220;<em>A</em> type Data Engineers&#8221; (those who focus on tech that abstracts away a lot of the code), will be the optimal solution for most situations. While <em>in theory</em> this sounds like a good idea, connecting all the different toolsets together has not always been an easy task, and it creates a whole host of new problems that require a different type of DevOps knowledge to manage.</p><p>It is best to choose tools as components of an ecosystem rather than on the merits of each tool on its own. As an example, if your data stack utilizes Airbyte, DBT, and Databricks, it might be tempting to choose a tool like Airflow as your orchestrator of choice, especially when several data engineers have experience creating pipelines in Airflow. In reality, it might be significantly better long term to go with a tool like Dagster instead of Airflow because of the out-of-the-box integrations with Airbyte, DBT, and Databricks that Dagster offers.</p><h3>Open Source and COTS Overload</h3><p>I have heard that you should &#8220;stand on the shoulders of giants&#8221; and avoid writing bespoke code, and I agree with this. 
For the most part, tools like Airbyte, DBT, Dagster, and the like allow users to get ETL up and running without needing to write UIs and connectors from scratch. I think they can certainly help shape architectures, but like I mentioned earlier, it gets to a point where you are just managing log files from several different applications, trying to figure out which one is the culprit when something goes wrong. So now you have to become an expert in how each open source tool functions and what underlying tech it uses. Deploying Airbyte? Hopefully your Kubernetes, containerization, Python, PyArrow, and DBT skills are good, or else you are going to be shrugging your shoulders when your first error happens. Databricks? Better learn how to read messy Java errors and pick up a basic understanding of Scala too, or you might as well not deploy at all.</p><p>I have witnessed these types of deployment errors in the past and have fallen victim to them myself. Early in my learning, I tried to deploy a bunch of Singer pipelines without fundamentally understanding Linux, cron, the CLI, and how Python connects to REST APIs and databases. I could get things up and running by following the tutorials, but had no clue how to debug the errors. Was it a Postgres error? A DBFS error? A Python error?</p><p>At this point you might think I am leading into the statement &#8220;Just write your own tools!&#8221;, but you can fall into an even deeper trap creating bespoke tools.</p><h3>Being Bespoke</h3><p>There is nothing more satisfying than writing custom pipelines from scratch and watching them, hour by hour and day by day, operate without problems. You spent countless hours reading docs for modules and libraries, writing tests, and watching data from your program move from point A to point B flawlessly. But then you share your code&#8230; 
Everything you thought made sense and was a good coding habit, you now realize is a spaghetti mess that won&#8217;t be useful to anyone but you. You try to get your peers to understand it, but you realize that even with your code comments, you have no idea why you wrote <em>that function that way</em>. I mean, sure, it cuts runtimes versus Airbyte by 90% and uses 95% less compute than that heavy open source tool you deployed for moving a bunch of data from simple RDBs, but is it any good if no one can use it?</p><p>I have seen this at pretty much every job I have worked, and unfortunately I have been guilty of perpetuating these bad habits myself. It is important to recognize when your programs are worse than the alternatives, whether Open Source or a coworker&#8217;s code. The main inhibitor of quality Data Engineering I have seen is people getting overly attached to their outdated and overly complex spaghetti code.</p><h3>Grading Data Engineering Systems</h3><p>As analysts and engineers, we need to be constantly evaluating our systems and verifying that they are still optimal. But what do we look at? I am proposing the following table for grading:</p><table><thead><tr><th>Field</th><th>Description</th><th>Score 1-5 (5 is best)</th></tr></thead><tbody><tr><td>Cost</td><td>Is this solution cost effective?</td><td>1</td></tr><tr><td>Agility</td><td>Is this developed using agile best practices?</td><td>3</td></tr><tr><td>Scalability</td><td>Can this scale to the largest datasets you expect in the future?</td><td>5</td></tr><tr><td>Simplicity</td><td>Is it simple to use, contribute code to, or use via the API/SDK?</td><td>2</td></tr><tr><td>Reuse</td><td>How easy is it to deploy on other systems?</td><td>3</td></tr><tr><td>Interoperability</td><td>Can it work with other systems?</td><td>4</td></tr></tbody></table><p>Your minimum score is 6 and maximum score is 30. I would assume that if any of these fields gets a 1, you would not want to proceed unless there is a plan to improve that field. 
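</p><p>As a hypothetical sketch (the function name and threshold parameter are my own choices, following the rubric above and the suggested cutoff of 20), the grading scheme is easy to mechanize:</p>

```python
FIELDS = {"Cost", "Agility", "Scalability", "Simplicity",
          "Reuse", "Interoperability"}

def grade_system(scores, threshold=20):
    """Return (total, proceed) for one system's six 1-5 field scores."""
    assert set(scores) == FIELDS
    assert all(1 <= s <= 5 for s in scores.values())
    total = sum(scores.values())  # minimum possible 6, maximum 30
    # A 1 in any field vetoes the system unless there is a plan to fix it.
    proceed = total >= threshold and min(scores.values()) > 1
    return total, proceed

# The example scores from the table above
example = {"Cost": 1, "Agility": 3, "Scalability": 5,
           "Simplicity": 2, "Reuse": 3, "Interoperability": 4}
print(grade_system(example))  # (18, False): under 20, and Cost is a veto
```

<p>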
I also assume that you would want a minimum score of 20 before proceeding with the system.</p><p>I will use my Rust ETL as an example.</p><table><thead><tr><th>Field</th><th>Score</th><th>Reason</th></tr></thead><tbody><tr><td>Cost</td><td>5</td><td>It can&#8217;t get any cheaper: the system is extremely compute efficient (a Rust quality more than anything to do with my code) and you can easily deploy it to Lambda, one of the cheapest ways to use compute</td></tr><tr><td>Agility</td><td>2</td><td>While it is version controlled, there is no roadmap or planned features and very few comments; I look to improve this score in the future</td></tr><tr><td>Scalability</td><td>5</td><td>I am using tokio, async, and parallelism, which allows near-infinite scaling when deployed; I am also utilizing Polars &#8220;streaming&#8221;, which lets me spill larger-than-memory datasets to disk</td></tr><tr><td>Simplicity</td><td>3</td><td>While it has a CLI interface and you can construct pipelines using YAML, any bugs will be very tricky for a non-Rust dev to debug. Using expect() instead of unwrap() should help give clearer errors, though</td></tr><tr><td>Reuse</td><td>4</td><td>The code can be deployed on nearly every platform once compiled for that platform</td></tr><tr><td>Interoperability</td><td>3</td><td>Connectors need to be written manually if they don&#8217;t already exist, but if a connector exists it should let you pull and push data as needed</td></tr></tbody></table><p>It looks like my Rust-based ETL scores high on performance and cost, but lacks ease of use. But like I mentioned, Rust is the main reason the code is efficient and scalable. This means that if I were to build a data engineering team around this tool, or if it were in the selection process for a current Data Engineering team, they would need to focus on being highly agile and make sure that code is clear, concise, and well documented.</p><h2>Conclusion</h2><p>Data Engineering from the abstract perspective is relatively easy to define; it is only when you have to define it precisely, the way you would define something as abstract as the color blue, that it becomes difficult. 
Data Engineering means a lot of different things to different people and companies. If you find yourself moving data from point A to point B on a daily basis and you are also responsible for the security, quality, and control of the tools or code that runs those pipelines, then you are a data engineer.</p>]]></content:encoded></item><item><title><![CDATA[Modern Data Engineering]]></title><description><![CDATA[Introduction]]></description><link>https://trevor-barnes.com/p/fpbp-modern-data-engineering</link><guid isPermaLink="false">https://trevor-barnes.com/p/fpbp-modern-data-engineering</guid><dc:creator><![CDATA[Trevor Barnes]]></dc:creator><pubDate>Thu, 14 Mar 2024 21:57:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/baf426b5-0599-458b-886e-503b88c1e60d_2000x1328.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><h3>My Past Experience</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lXaI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lXaI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lXaI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!lXaI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lXaI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lXaI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Modern Data Engineering&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Modern Data Engineering" title="Modern Data Engineering" srcset="https://substackcdn.com/image/fetch/$s_!lXaI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lXaI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!lXaI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lXaI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13976e9c-6c69-4d19-816e-e4f1a271ffa7_2000x1328.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>I have been working on setting up Data Platforms and Data Engineering tools for nearly a decade. At the start of my career in this field, I was working at a large enterprise (<em>cough</em> Boeing <em>cough</em>) where the entire Supply Chain analytics ecosystem ran on a remote desktop, using "Task Scheduler" on Windows machines to run <code>R</code> scripts that transformed data from source systems (mostly Oracle and MSSQL) and pushed the results into Teradata. While Boeing isn't the bastion of proper process (just check the current news), they were certainly not alone in having this as a standard process for their Data Engineering pipelines. Actually, my experience and knowledge building out this "Data Engineering Lifecycle" is what landed me a job at Amazon, doing pretty much the same thing, but swapping a physical remote desktop for a virtual machine in the cloud (AWS EC2) and Teradata for Redshift.</p><p>I eventually helped shift our team at Amazon from R scripts and ShinyR to AWS EMR and Apache Spark based applications. The throughput was astounding: we went from only being able to move a couple hundred MB at a time to GBs and even TBs of data. While Redshift is a distributed compute platform, it had no API that allowed for programmatic transforms the way Apache Spark does. With Apache Spark, we saw a simplification and greater throughput of Data Engineering pipelines. 
We could now write our code in Scala/Java/R/Python directly on top of the storage (S3) and scale compute nodes up and down as needed! No need to spin up a massively expensive Redshift cluster when we could share a pool of compute nodes with several other teams on an "as needed" basis. The savings were massive!</p><h3>Stepping into Modern Day</h3><p>Around this time, Databricks was becoming popular, largely by abstracting away the pain points of administering a large Apache Spark platform to run your scripts. I immediately fell in love with the tech and hold an "Apache Spark Developer" accreditation (even got a fancy coat for it, too). But it wasn't just Databricks; several tools were gaining popularity: Airflow, Dagster, Prefect, Airbyte, Singer, Stitch, Fivetran, Great Expectations, DBT, Flink, Kafka, Redis, Spark, Flume, Foundry, Tableau, Snowflake, Looker, Cassandra, Segment, Redash, and the list goes on. You no longer needed a team of Software Devs to manage the minutiae of pipelines, but you now needed a new team to manage all the licensing and debugging of tools that you were vendor locked into, preventing any future possibility of an easy transition to newer tech.</p><p>I read blog after blog talking about the proper way to implement and connect all of this tooling; while some led to success, a lot of experiences ended in quasi-failure states where Dr. Frankenstein's Data Engineering monster had destroyed the town. 
Type Tableau into Google (after Salesforce bought them and then fired their top developers) and you will see the horror stories of people trying to move to different visualization software but feeling too locked into Tableau to get out, even though these small-to-medium-sized companies are spending hundreds of thousands, if not a million, a year on licensing!</p><p>But looking at the cost of Databricks, I have started to consider a couple of questions:</p><ul><li><p>What is the next big thing in Data Engineering?</p></li><li><p>Do we need pipelines to be faster?</p></li><li><p>Do we need more efficient code to lower cost and impact on the environment?</p></li><li><p>Are we missing an opportunity to get the performance of C/C++, but with a memory-safe language?</p></li></ul><h3>So what's the Answer?</h3><p>I don't have one, and I don't think I ever will, but I am looking to gain <em>clarity</em> and <em>insight</em> into how to make things better.</p><p>I aim to explore not only the technical aspects of implementing new tooling for Data Engineering, but also an approach that looks at the process more holistically: one where tooling is modular but also aligned, where complexity is abstracted into simplicity but you can still understand the underlying fundamentals, and, more importantly, where the code itself can help you learn the fundamentals of data engineering.</p><h2>Defining Data Engineering</h2><p>Now we get to the crux of why I am writing this blog. I am operating on the theory that if we write Data Engineering tools in Rust, we can get better and more efficient compute performance than most Java/Scala/Python-based analytics tools. We get memory safety, near-C/C++ performance (and in some instances better), type safety, and a large community of developers to help build this project.
But first, I want to go over some definitions of Data Engineering and what it means when referenced in these posts.</p><h3>What is Data Engineering?</h3><p>I would point to my book notes from my <a href="https://trevor-barnes.com/notes-from-fundamentals-of-data-engineering-data-engineering-described/">Data Engineering Described</a> post for a more holistic perspective on what <em>data engineers</em> are and what they do. Mainly, Data Engineering is a set of technical and business practices and knowledge that moves information to another location where it can be aggregated, analyzed, processed, stored, and distributed to relevant users.</p><h3>What is the Goal?</h3><p>For brevity, and for the purposes of this blog series exploring a Rust program for Data Engineering, we will use the following definition:</p><blockquote><p>We are type B (builders) data engineers focusing on the Storage -&gt; Ingestion -&gt; Transformation phases of the data engineering lifecycle and on the undercurrents of DataOps, Orchestration, and Software Engineering, with the explicit goal of reducing the cost of the data engineering lifecycle in order to increase the scalability, simplicity, and interoperability of our data engineering software. We will be operating from the perspective of a Stage 1 (Starting with Data) organization that currently has no data engineering infrastructure.</p></blockquote><p>Each blog post will build on the last: we will start simple and increase the complexity as we move through each lesson. While the aim is not to teach Rust, I will do my best to explain concepts without diving too deep into details. I will also point you to my other blog, where I am collecting all my notes from learning Rust: <a href="https://rustycloud.org/?ref=trevor-barnes.com">rustycloud.org</a></p><h2>Why Rust?</h2><p>Why not?
It is fast, type-safe, and memory-safe; it supports functional programming and (sort of) object-oriented programming; and it has a large community of users and an active development community.</p><p>I originally started learning Go and had an idea to do something similar in Go, but one day a coworker I highly respect said, "Dude, learn Rust instead, it will pay off better in the long run." I wasn't going to argue with someone who has a Doctorate in Cyber Security and worked at Red Hat. That's it... That is why I started learning Rust, and I have not looked back.</p><h3>The Dreaded Rust Learning Curve</h3><p>The learning curve for Rust is steep, but hey, if I was able to pick up some of the basics, I am sure you can, too. I will admit I am still deep in the process of learning Rust, so some questions may be way out of my scope of knowledge, but I will do the best I can to answer them.</p><p>Realistically, the only two areas that <em>sometimes</em> really trip me up are Borrowing/Ownership and Lifetimes.</p><h4>Borrowing &amp; Ownership</h4><blockquote><p>Ownership is a set of rules that govern how a Rust program manages memory. All programs have to manage the way they use a computer&#8217;s memory while running. Some languages have garbage collection that regularly looks for no-longer-used memory as the program runs; in other languages, the programmer must explicitly allocate and free the memory. Rust uses a third approach: memory is managed through a system of ownership with a set of rules that the compiler checks</p></blockquote><ul><li><p>From "The Rust Programming Language", available from No Starch Press</p></li></ul><p>Essentially, Rust automatically frees a value as soon as its owner goes out of scope, with no garbage collector involved. For a better explanation, please visit <a href="https://rustycloud.org/ownership/index.html?ref=trevor-barnes.com">Ownership on Rusty Cloud</a></p><h4>Lifetimes</h4><p>Lifetimes tell the compiler how long references must remain valid, so that a reference can never outlive the data it points to.
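To make both ideas concrete, here is a minimal, self-contained sketch (the function and variable names are my own illustration, not from any particular crate): moving a <code>String</code> transfers ownership, borrowing lends access without a move, and an explicit lifetime ties a returned reference to its inputs.

```rust
// A minimal sketch of ownership, borrowing, and an explicit lifetime.
// (Illustrative names only -- not from any particular crate.)

// The lifetime 'a says the returned reference is only valid while
// BOTH input references are still valid.
fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() >= y.len() { x } else { y }
}

fn main() {
    // Ownership: `s` owns its heap buffer; assigning it to `t` MOVES it,
    // and the compiler rejects any later use of `s`.
    let s = String::from("data");
    let t = s;

    // Borrowing: `&t` lends read-only access without moving ownership,
    // so `t` remains usable afterwards.
    let winner = longest(&t, "engineering");
    assert_eq!(winner, "engineering");
    println!("t = {t}, longest = {winner}");
}
```

Try adding a use of <code>s</code> after the move to see the borrow checker's ownership error firsthand.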
I still have an in-progress page covering them on my Rust blog <a href="https://rustycloud.org/generics_traits_lifetimes/index.html?ref=trevor-barnes.com">here</a>, and I am looking to finish it as soon as possible.</p><h4>Other Rust 'Quirks'</h4><p>There are other steep learning-curve items I see people discuss in forums, like error handling (when to use <code>?</code> versus <code>unwrap</code>), implementing traits, zero-cost abstractions, the lack of encapsulation, etc.</p><h2>Conclusion</h2><p>While I can't promise a solution to all your Data Engineering needs, I am looking to improve my skills, and if other people are interested, they can pester me to keep up the posting pace or give me ideas for topics to cover in future posts. I am hoping we can grow our Data Engineering and Rust knowledge together!</p><p>Thanks for reading!</p>]]></content:encoded></item></channel></rss>