Flattening Your SQL Tables — Not As Easy As It Sounds
How to expose deeply nested SQL fields — and why you need to be careful when handling complex raw data.
Reshape And Declutter Your SQL Tables
Unlike practice datasets, much of the data you work with on the job as a data engineer, data scientist or data analyst will be quite messy.
Aside from containing NA values, duplicate values and other data eye sores, there is a good possibility your data could be structurally complex as well.
In fact, my very first pipeline I built professionally involved ingesting a data source that returned nested JSON objects.
In the nearly two years since that assignment I’ve worked with and created countless nested data sources for end users.
For individuals that deal with data upstream like data engineers, nesting data makes a ton of sense.
Nested data is neatly stored and, when ingested correctly, avoids creating duplicate records.
The issue is that this data is not always easy or practical to work with for downstream users.
For instance, if a data analyst needs to incorporate your table in a complex query with multiple joins or WITH blocks, then having to UNNEST() values becomes an additional hassle.
Therefore, to avoid passing the UNNEST baton further downstream, you may be tasked with creating a flattened version of your source table.
Although the SQL for this is just an extension of SELECT col FROM table, the unnest queries can become complex.
To illustrate the concept of flattening a SQL table I’ll use the BigQuery public dataset the_met, focusing on the vision_api_data table.
Build Your Pipeline To A Data Engineering Career
You’ve reached the limit of the public preview. The full version of this post includes the implementation details: The code, the edge cases, and the "why" behind the architecture.
When you join PipelineToDE, you get:
- The DA → DE Pathway Course: A structured roadmap to bridge the gap between analysis and engineering.
- Weekly Senior Deep Dives: Fresh, tactical insights on Python, Cloud (GCP/AWS), and modern orchestration delivered every week.
- Production-Ready Blueprints: Access to 80+ protected stories and code repos from my time in the trenches as a Senior DE
- The DE Job Board (Coming Soon): Exclusive access to a curated board of high-agency Data Engineering roles.