Execute A Multi-CSV Backfill From Google Cloud Storage In 50 Seconds
Leverage the Python Google Cloud Storage and BigQuery APIs to bulk download, transform and upload CSV files in < 1 minute.
My facial muscles still ache from the 3 years I spent plastering on a fake smile to work in hospitality. While I have fond memories of that period (which mostly consist of friends and me doing anything but the job), it’s a role I largely dreaded. Truthfully, I’m not built to interact with the general public for long stretches. Needless to say, I’ve found my data engineering job to be a better fit for my analytical mind and increasingly hermit-like personality.
This doesn’t mean, however, that there aren’t days and tasks I’d rather not endure. For my first year as a baby (junior) engineer, I got assigned (read: stuck with) a lot of grunt work: writing documentation, auditing table metadata by hand (until I eventually automated the task) and, worst of all, conducting time-consuming and mind-numbing backfills.
I don’t intend to imply that backfilling missing data is a grunt task. In fact, it’s critical to continually maintain both real-time (or near real-time) data and historic data to support time series analyses.
But man, is it a time suck.
One of my least favorite kinds of backfill is the one where I have no choice but to manually download, inspect and upload static files, typically CSV or JSON, to BigQuery. To avoid this scenario, many of my personal and professional pipelines feature a “backup” step in which a source file is stored in a cloud repository; in this case, Google Cloud Storage.
Assuming this is our starting point, there is an easy way to bulk download, process and upload files using the Cloud Storage and BigQuery APIs as well as a bit of Python scripting.
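To make that concrete, here is a minimal sketch of what such a script might look like. It is not the exact code from this post: the function names (`combine_csv_texts`, `backfill`) are hypothetical, and it assumes the backed-up CSV files share a header row, that the `google-cloud-storage` and `google-cloud-bigquery` packages are installed, and that default credentials are configured.

```python
import csv
import io


def combine_csv_texts(csv_texts):
    """Merge several CSV strings that share a header into one CSV string."""
    output = io.StringIO()
    writer = None
    header = None
    for text in csv_texts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty files
        if header is None:
            # The first non-empty file defines the header for the merged output.
            header = rows[0]
            writer = csv.writer(output, lineterminator="\n")
            writer.writerow(header)
        elif rows[0] != header:
            raise ValueError("CSV headers do not match")
        writer.writerows(rows[1:])
    return output.getvalue()


def backfill(bucket_name, prefix, table_id):
    """Download every CSV under a prefix, merge them, and append to a BigQuery table."""
    # Imported here so the merge helper works without the cloud SDKs installed.
    from google.cloud import bigquery, storage

    storage_client = storage.Client()
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix)
    csv_texts = [b.download_as_text() for b in blobs if b.name.endswith(".csv")]

    combined = combine_csv_texts(csv_texts)
    if not combined:
        return 0  # nothing to load

    bq_client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # the merged file has exactly one header row
        autodetect=True,  # assumption: let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = bq_client.load_table_from_file(
        io.BytesIO(combined.encode("utf-8")), table_id, job_config=job_config
    )
    job.result()  # block until the load job finishes
    return job.output_rows
```

A call might look like `backfill("my-backup-bucket", "exports/", "my-project.my_dataset.my_table")`, where all three values are placeholders. Merging the files into a single load job, rather than loading them one at a time, is one possible design choice; it keeps the number of BigQuery jobs low at the cost of holding the combined file in memory.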
Here’s the scenario
I’m a hypothetical data engineer who has just gotten back from vacation and, to my horror, I’ve discovered that the upload step in one of my pipelines failed for the entire duration of my absence. Since I’m an early riser and work remotely, I make this discovery about 30 minutes before the rest of my team logs on. Because I didn’t receive any messages while I was out, it doesn’t seem like anyone has noticed the outage… yet.