∵ Trevor Barnes ∴ 2024-10-01 ∞ 8'
I have been working on setting up Data Platforms and Data Engineering tools for nearly a decade. At the start of my career in this field I was working at a large enterprise (cough Boeing cough), and we had their entire Supply Chain analytics ecosystem running on remote desktops, using "Task Scheduler" on Windows machines to run R scripts that did data transforms on source systems (mostly Oracle and MSSQL) and pushed the results into Teradata. While Boeing isn't a bastion of proper process (just check the current news), they were certainly not alone in having this as a standard process for their Data Engineering pipelines. Actually, my experience and knowledge building out this "Data Engineering Lifecycle" is what landed me a job at Amazon, pretty much doing the same thing, but with the physical remote desktop swapped out for a virtual machine in the cloud (AWS EC2) and Teradata swapped out for Redshift.
I eventually helped shift our team at Amazon from R scripts and ShinyR to AWS EMR and Apache Spark-based applications. The throughput was astounding: we went from only being able to move a couple hundred MB at a time to GB and even TB of data. While Redshift is a distributed compute platform, it had no API that allowed for programmatic transforms the way Apache Spark does. With Apache Spark, we saw both a simplification of our Data Engineering pipelines and greater throughput. We could now write our code in Scala/Java/R/Python directly on top of the storage (S3) and scale up and kill compute nodes as needed! No need to spin up a massively expensive Redshift cluster when we could just share a pool of compute nodes with several other teams on an "as needed" basis. The savings were massive!
At this time, Databricks was becoming popular, largely by abstracting away a lot of the pain points of administering a large Apache Spark platform to run your scripts. I immediately fell in love with the tech and hold an "Apache Spark Developer" accreditation (even got a fancy coat for it, too). But it wasn't just Databricks; it seemed like several tools were starting to gain popularity: Airflow, Dagster, Prefect, Airbyte, Singer, Stitch, Fivetran, Great Expectations, DBT, Flink, Kafka, Redis, Spark, Flume, Foundry, Tableau, Snowflake, Looker, Cassandra, Segment, Redash, and the list goes on. You no longer needed a team of Software Devs to manage the minutiae of pipelines, but you did need a new team to manage all the licensing and debugging of tools you were now vendor locked into, preventing any future possibility of an easy transition to newer tech.
I read blog after blog talking about the proper way to implement and connect all of this tooling. While some led to success, a lot of experiences ended in quasi-failure states where Dr. Frankenstein's Data Engineering monster had destroyed the town. Type "Tableau" into Google after Salesforce bought them and then fired their top developers, and you will see the horror stories of people trying to move to different visualization software but feeling too locked into Tableau to get out, even though these small to medium sized companies are spending hundreds of thousands, if not a million, a year on licensing!
But looking at the cost of Databricks, I have started to consider a couple of questions. I don't have the answers and don't think I ever will, but I am looking to gain clarity and insight on how to make things better.
I aim to explore not only the technical aspects of implementing new tooling for Data Engineering, but also an approach that looks at the process holistically, as something connected. One where tooling is modular, but also aligned. Where complexity is abstracted into simplicity, but you can still understand the underlying fundamentals, and more importantly, where the code itself can help you learn more about the fundamentals of data engineering.
Now we get to the crux of why I am writing this blog. I am operating off the theory that if we write Data Engineering tools in Rust, we can get better and more efficient compute performance than most Java/Scala/Python based analytics tools. We get memory safety, near C/C++ performance (and in some instances better), type safety, and a large community of developers to help build this project. But first, I want to go over some definitions of Data Engineering and what it means when referenced in these posts.
I would point to my book notes from my [Data Engineering Described][4] post to get a more holistic perspective on what data engineers are and what they do. Mainly, Data Engineering is a set of technical and business practices and knowledge that moves information to another location where it can be aggregated, analyzed, processed, stored, and distributed to relevant users.
For brevity, and because this blog series is about implementing a Rust program for Data Engineering, we will be working from the following definition:
We are type B (builders) data engineers focusing on the Storage -> Ingestion -> Transformation phases of the data engineering lifecycle and on the undercurrents of DataOps, Orchestration, and Software Engineering, with the explicit goal of *reducing the cost of the data engineering lifecycle in order to increase the scalability, simplicity, and interoperability of our data engineering software*. We will be operating from the perspective of a Stage 1 (Starting with Data) organization that has no current data engineering infrastructure.
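As a purely hypothetical sketch (every trait and type name below is a placeholder I invented for illustration, not part of any existing library), those three phases might eventually map onto small Rust traits along these lines:

```rust
// Hypothetical sketch of the Storage -> Ingestion -> Transformation scope
// as plain Rust traits. All names here are invented for illustration.

/// A batch of records moving through the pipeline.
type Batch = Vec<String>;

/// Ingestion: pull raw records out of some source system.
trait Ingest {
    fn ingest(&mut self) -> Result<Batch, String>;
}

/// Transformation: turn one batch into another.
trait Transform {
    fn transform(&self, batch: Batch) -> Batch;
}

/// Storage: persist the transformed batch somewhere durable.
trait Store {
    fn store(&mut self, batch: Batch) -> Result<(), String>;
}

/// Wire the three phases together in the simplest possible way.
fn run_pipeline<I: Ingest, T: Transform, S: Store>(
    source: &mut I,
    transformer: &T,
    sink: &mut S,
) -> Result<(), String> {
    let raw = source.ingest()?;
    sink.store(transformer.transform(raw))
}
```

The point of a shape like this is that later posts could swap concrete sources and sinks in and out behind the traits without touching the orchestration code, which is exactly the kind of modular-but-aligned tooling described above.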
Each blog post will aim to increase slightly in complexity: we will start off simple and build on that as we move through each lesson. While the aim is not to teach Rust, I will do my best to explain concepts without diving too deep into the details, and I will point you to my other blog where I am collecting all my notes from learning Rust: [rustycloud.org][5]
Why not? It is fast, type safe, and memory safe; it supports functional programming and (sort of) object-oriented programming; and it has a large community of users and an active development community.
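To make that list a little more concrete, here is a tiny, made-up example of the functional, iterator-based style (the `Order` record and its values are invented purely for illustration):

```rust
// A tiny illustration of Rust's functional, iterator-based style.
// The `Order` record and its values are made up for this example.
#[derive(Debug)]
struct Order {
    id: u64,
    amount_cents: i64,
}

fn main() {
    let orders = vec![
        Order { id: 1, amount_cents: 1_250 },
        Order { id: 2, amount_cents: -300 }, // a refund
        Order { id: 3, amount_cents: 9_900 },
    ];

    // Filter out refunds and sum the rest: no explicit loops, no nulls,
    // and the compiler checks every type along the way.
    let revenue_cents: i64 = orders
        .iter()
        .filter(|o| o.amount_cents > 0)
        .map(|o| o.amount_cents)
        .sum();

    println!("first order: {:?}", orders[0]);
    println!("revenue: {revenue_cents} cents");
}
```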
I originally started learning Go and had an idea to do something similar in Go, but one day a coworker of mine whom I highly respect said, "Dude, learn Rust instead; it will pay off better in the long run." I wasn't going to argue with someone who has a Doctorate in Cyber Security and worked at Red Hat. That's it... That is why I started learning Rust and have not looked back.
The learning curve for Rust is steep, but hey, if I was able to pick up some of the basics, I am sure you can, too. I will admit I am still deep in the process of learning Rust, so some questions may be way out of my scope of knowledge, but I will do the best I can to answer them.
Realistically, the only two areas that still really trip me up sometimes are Borrowing/Ownership and Lifetimes.
Ownership is a set of rules that govern how a Rust program manages memory. All programs have to manage the way they use a computer's memory while running. Some languages have garbage collection that regularly looks for no-longer-used memory as the program runs; in other languages, the programmer must explicitly allocate and free the memory. Rust uses a third approach: memory is managed through a system of ownership with a set of rules that the compiler checks.
Essentially, Rust drops a value as soon as its owner goes out of scope, so memory that is no longer in use gets freed automatically and deterministically. For a better explanation please visit [Ownership on Rusty Cloud][6].
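A minimal sketch of the rule in action (nothing data-engineering specific here, just the ownership basics):

```rust
fn count_tables(tables: Vec<String>) -> usize {
    tables.len()
} // `tables` is dropped here, and the memory it owned is freed.

fn count_tables_borrowed(tables: &[String]) -> usize {
    tables.len()
} // only the borrow ends here; the caller still owns the data.

fn main() {
    // `names` owns its heap-allocated strings.
    let names = vec![String::from("orders"), String::from("customers")];

    // Passing `names` by value *moves* ownership into the function...
    let count = count_tables(names);
    println!("{count} tables");
    // ...so uncommenting the next line would not compile: `names` was moved.
    // println!("{:?}", names);

    // Borrowing with `&` lets a function read the data without taking ownership.
    let tables = vec![String::from("orders")];
    let count = count_tables_borrowed(&tables);
    println!("{count} table, still owned here: {:?}", tables);
}
```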
Lifetimes describe how long references are valid, so the compiler can guarantee that a reference never outlives the data it points to. I still have an in-progress page on my Rust blog [here][7], but I am looking to get it up as soon as possible.
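In the meantime, here is the classic minimal example: when a function takes two references and returns one of them, the compiler needs a lifetime annotation to know how long the result is allowed to live:

```rust
// `'a` says: the returned reference is valid only as long as *both*
// inputs are, so the caller can never end up with a dangling reference.
fn longer_name<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() > y.len() { x } else { y }
}

fn main() {
    let table = String::from("customer_orders");
    {
        let other = String::from("users");
        let longest = longer_name(table.as_str(), other.as_str());
        println!("longer name: {longest}");
        // If we tried to keep `longest` alive past this block, the compiler
        // would reject the program, because it might borrow from `other`.
    }
}
```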
There are other steep-learning-curve items that I see people talk about in forums, like error handling (when to use `?` versus `unwrap()`), implementing traits, zero-cost abstractions, the lack of encapsulation, etc.
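As a rough illustration of the `?` versus `unwrap()` question (the file name below is made up), the trade-off looks something like this:

```rust
use std::fs;
use std::io;

// `unwrap()` panics and crashes the program if the file is missing:
// fine for a throwaway script, painful in a long-running pipeline.
#[allow(dead_code)]
fn row_count_or_panic(path: &str) -> usize {
    let contents = fs::read_to_string(path).unwrap();
    contents.lines().count()
}

// `?` propagates the error to the caller instead, so the caller decides
// whether to retry, skip the file, or fail the whole job.
fn row_count(path: &str) -> Result<usize, io::Error> {
    let contents = fs::read_to_string(path)?;
    Ok(contents.lines().count())
}

fn main() {
    // Hypothetical file name, purely for illustration.
    match row_count("daily_orders.csv") {
        Ok(n) => println!("{n} rows"),
        Err(e) => eprintln!("could not read file: {e}"),
    }
}
```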
While I can't promise a solution to all your Data Engineering needs, I am looking to improve my skills, and if other people are interested, maybe they can pester me to keep up the pace on posting or give me ideas for topics to cover in future posts. I am hoping that we can grow our Data Engineering and Rust knowledge together!
Thanks for reading!