Approachable Array Aggregations

Solve a real-life SQL array problem in an easy-to-follow walkthrough.


One frequent source of disagreement between data engineers and data analysts is how data should be accessed and, more importantly, stored. This is especially true of complex data types like STRUCT and ARRAY because, in either case, these values have to be un-nested and transformed.

The question becomes: Who will do it?

Or, more accurately, who must do it?

Data engineers, like myself, argue that storing data in formats like JSON objects keeps data sources compact and, critically, avoids the storage costs and processing time of the additional rows that a “flattened” layout would require.

Further, visualization platforms that support SQL operations, like my go-to, Looker, let developers and users run those operations downstream, so it often makes sense to keep certain columns nested and deal with them case by case.

After all, depending on your stakeholders’ requests, there is a distinct possibility that you won’t even end up using all columns.

But let’s say, for the sake of example, that you’ve been firmly asked to unnest fields upstream. Let’s say you’re being nice and want to build out a view your BI and data analysts can use for dashboarding.

And let’s assume that, unfortunately, your data is a bit messy: it arrives as an array stored as a STRING type.

How do we break this out into data that can be used to arrive at actionable insights?
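
As a rough illustration of the general idea (not necessarily the approach this walkthrough takes), the Python sketch below mimics what un-nesting does: it parses each string back into a real array and emits one row per element. The sample rows, the column names `order_id`, `items`, `item`, and `item_position`, and the helper `unnest_string_array` are all hypothetical, and the sketch assumes the string holds a valid JSON array.

```python
import json

# Hypothetical sample data: each row keeps its items as an array
# that was serialized into a plain string.
raw_rows = [
    {"order_id": 1, "items": '["apple", "banana"]'},
    {"order_id": 2, "items": '["cherry"]'},
    {"order_id": 3, "items": "[]"},
    {"order_id": 4, "items": None},
]


def unnest_string_array(rows, key):
    """Yield one flat row per array element, similar in spirit to a SQL UNNEST."""
    for row in rows:
        raw = row.get(key)
        # Treat NULL or blank strings as empty arrays instead of failing.
        elements = json.loads(raw) if raw and raw.strip() else []
        for position, element in enumerate(elements):
            # Copy every column except the nested one, then add the element.
            flat = {k: v for k, v in row.items() if k != key}
            flat["item"] = element
            flat["item_position"] = position
            yield flat


flattened = list(unnest_string_array(raw_rows, "items"))
for row in flattened:
    print(row)
```

Note that rows with an empty or missing array produce no output rows at all, which matches the behavior of an inner join against the un-nested values. If those rows need to survive for reporting, the flattening step has to emit a placeholder row for them instead.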
