Trevor Barnes · 2024-10-10 · 9 min read
Why? Because I am going to say the meme "Python Slow", but in all honesty it is much more than that. If executives knew how much money is being thrown out the window because we are moving terabytes to petabytes of data each day in a language optimized for nothing other than letting people who have never programmed before run their first program in less than a minute, heads would roll. Now, I am not saying that you can't use a Python API to do data engineering; you just need to make sure that the query engine under the hood is not Python based. With each passing day, it seems like a new query engine is being built, whether it's Apache Spark, DuckDB, Polars, DataFusion, or one of the many smaller projects rapidly gaining traction. If you aren't actively working to refactor your pandas (and potentially Dask) dataframes onto an engine that manages the data via C++, C, Java, or Rust under the hood, you aren't doing anyone any favors.
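To make that concrete, here is a minimal sketch of the same aggregation in pandas versus Polars; the file name and columns (events.csv, status, customer_id, amount) are made up for illustration:

```python
import pandas as pd
import polars as pl

# pandas: eager and mostly single-threaded; the whole file is materialized up front
df = pd.read_csv("events.csv")
out_pd = df[df["status"] == "ok"].groupby("customer_id")["amount"].sum()

# Polars: Python only builds the query plan; the Rust engine optimizes it
# and executes it across all cores
out_pl = (
    pl.scan_csv("events.csv")                # lazy scan, nothing read yet
      .filter(pl.col("status") == "ok")
      .group_by("customer_id")
      .agg(pl.col("amount").sum())
      .collect()                             # runs in native Rust threads
)
```

Nearly the same API from the user's perspective, but only one of them leaves the Python interpreter to do the heavy lifting.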
Recently, I was tasked with refactoring a Python script that was taking about 20 hours to run. I wondered how much data was being processed for a job to take that long; it must have been dozens of terabytes, or highly complex computations running on some small EC2 instance out there. Well, come to find out, it was running on an i4i.16xlarge machine. That's 64 vCPUs and 512 GiB of RAM! So the data source they were pulling from must have been I/O constrained, right? Again, not the issue. It was a sharded, read-replica Postgres instance that could eat any load thrown at it, especially given its production criticality. So the source would let you read data as fast as you asked for it, and the queries were blazingly fast when you filtered on the unique ID. My next thought was "whoever wrote the script must really not know what they are doing". So I dove into the code base... pure Python, written by two very talented Software Engineers, using optimizations that looked like they came straight out of the "Fluent Python" book. This Python was far better than anything I could have hoped to write. So the machine is large enough, the source is fast enough, and the code is optimized enough. How the hell am I supposed to speed up this code?
Enter stage right, meme 2. I needed to move this data, which was growing at an exponential rate, to our analytics warehouse, and quickly. But how? I had been learning some Rust and C in my spare time out of curiosity and decided "Screw it, I'll meme this project, we are refactoring to Rust". I was prepared for this to take weeks: fighting the borrow checker, dealing with lifetime errors, "map error, trait does not exi...", refactoring my shoddily written newbie Rust code so it could maybe see a 20% speed boost over its Python counterpart. Writing Rust actually wasn't that bad. I know I wasn't writing it as "idiomatically" as I should have, but hey, I could always clean it up later.
Well, roughly 1.2k lines of poorly written, beginner Rust code later... that Python script that was taking 20 hours to complete ran in... drum roll... 45 minutes in Rust. How? I honestly didn't know. Clones everywhere, poorly optimized data structures, the sqlx and Tokio crates used incorrectly, but Rust was just faster. My boss was impressed, but I wasn't. I felt like some kind of fraud; could it really be that easy? I asked to spend more time cleaning up the code, but everyone was just happy that they could get the data quickly.
As someone who had only a passing interest in systems languages, I was now hooked on systems programming for Data Engineering. It just kind of clicked: why not use the most performant languages to support the largest and most cumbersome data processes within a business? Yet the Data Engineering community seems to be stuck on Python and Pandas.
Before I explain what I think is happening, I mean no disrespect to Business Intelligence, Data Analysts, and Data Engineers. This has just been my experience working at several different companies, one of them in FAANG. I am not from a Computer Science or Software Development background; my degree was in Aviation Technology, and I have more credentials to be a pilot and aircraft mechanic than a Software Engineer. But I have noticed that my colleagues from the BI space normally follow a trajectory of Analyst -> BI -> Data Engineer. Meaning, most Data Engineers I have run into have only ever worked within Python and have never really needed to understand the machine at a more granular level than "code goes in, number comes out".
I believe modern Data Engineering is suffering from the same issues as modern web dev. Too many people from non-CS backgrounds take a Udemy course, or go to a bootcamp, or read "Automate the Boring Stuff with Python" (one of the best programming books ever, btw), then do some leetcode to learn how to sort a Python array, pass the coding test for an entry-level Data Engineering position, and work their way up from there. There is no shame in this; it is what I did years ago to get an entry-level position at Amazon!
Even at Amazon, I worked with Data Engineers who had done the same thing as me: not from CS backgrounds, never having worked as Software Developers. Then I met Dennis... I call Dennis the first true Data Engineer I ever met. Dennis was a former Software Developer and, more importantly, a former Marine. He didn't take bullshit, but he didn't give it either. When he talked about Data Engineering, it was on a different level. Before then, data was discussed as tables and dataframes being pushed from one place to the next; as long as the data made it from point A to point B, who cared about anything else? The customer was happy, we were happy.
Then Dennis explained that this wasn't good enough: data will grow, your pipelines will clog, and jobs will start failing, which will cause other pipelines to fail in turn. And he was right. Our data was growing rapidly, and the cracks in our queries on Redshift were starting to show. Python pipelines doing plain CSV copies into Redshift had to go. We were now worried about column compression, Z-ordering, partitioning, when to use Parquet vs. Avro vs. JSON vs. CSV, "can we move this pipeline to EMR to reduce the load on our cluster?", "we need to start utilizing Spectrum", separating compute from storage. Dennis gave me my first real lesson in Data Engineering after years of doing "Data Engineering".
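To make one of those lessons concrete, here is a minimal sketch of landing data as partitioned, compressed Parquet instead of raw CSV; the file name, destination path, and event_date partition column are made up for illustration:

```python
# A minimal sketch, assuming a hypothetical export.csv with an event_date column.
import pyarrow.csv as pv
import pyarrow.dataset as ds

table = pv.read_csv("export.csv")  # the old pipeline's raw CSV dump

# Write hive-style partitions (event_date=.../part-0.parquet) with Parquet's
# built-in column compression, so engines like Spectrum can prune whole
# partitions instead of scanning every byte.
ds.write_dataset(
    table,
    "warehouse/events",
    format="parquet",
    partitioning=["event_date"],
    partitioning_flavor="hive",
)
```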
Upon reflection, I realize I was in the middle of the first big data revolution. We were moving from a Data Engineering world of "just ship the data" to understanding that if we optimized our data pipelines and storage, we could easily save hundreds of thousands, if not millions, of dollars a year.
There has been a rise in Big Data applications for Data Engineering that focus on optimizing compute. In the first big data revolution we optimized storage, with new compression algorithms and formats like Parquet and Delta. This big data revolution will be focused on utilizing those algorithms to do more with less through systems-level languages. We are already starting to see the shift:

- Apache Spark (JVM/Scala)
- DuckDB (C++)
- Polars (Rust)
- DataFusion (Rust)

These are the future tools of Data Engineering. So should you go out and learn one of these languages in order to stay relevant in Data Engineering? Absolutely not. You can be an extremely effective Data Engineer knowing only Python.
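Each of these is reachable from Python without giving up the native engine underneath. As a minimal sketch (reusing the hypothetical Parquet layout from the earlier example), this is DuckDB's Python API, where the scan and aggregation run entirely inside its C++ engine:

```python
# A minimal sketch querying hypothetical partitioned Parquet with DuckDB.
import duckdb

con = duckdb.connect()
top_customers = con.execute(
    """
    SELECT customer_id, SUM(amount) AS total
    FROM read_parquet('warehouse/events/*/*.parquet')
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
    """
).df()  # only the final, tiny result crosses back into Python
```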
So what's the deal with the clickbait title? Well, I truly believe the future of Data Engineering is Python APIs for systems languages. Every single bullet-point technology above has a Python API that you can use to instantly speed up your data pipelines. If I am using the Python version of Polars, will it be faster than Rust*? No. While running those jobs, Python is limited by its runtime and the GIL, while in Rust I can just slap a #[tokio::main] on my main function and asynchronously run X number of threads and tasks in parallel with tokio::task.
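If you want to see that ceiling for yourself, here is a small demonstration (timings are illustrative and machine-dependent) that CPU-bound Python gains nothing from threads because of the GIL:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def burn(_):
    # pure-Python CPU work; the GIL lets only one thread execute it at a time
    return sum(i * i for i in range(10_000_000))

t0 = time.perf_counter()
for i in range(4):
    burn(i)
serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as ex:
    list(ex.map(burn, range(4)))
threaded = time.perf_counter() - t0

# Expect roughly equal wall times: the threads serialize on the GIL,
# which is exactly the constraint a Rust binary running tokio tasks avoids.
print(f"serial: {serial:.1f}s, 4 threads: {threaded:.1f}s")
```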
All this is to say: while you don't need to learn a systems language to leverage highly performant Data Engineering programs, I think every programmer, whether a web dev, data scientist, data engineer, or anything else, will become better at their job by learning computer science fundamentals. You can still utilize Python; just use it as the wrapper around a systems-language-based library.
Footnote: