Auto-Tag Sensitive BigQuery Data & Never Touch The UI Again

Eliminate a tedious data governance chore by understanding privacy taxonomies and applying a simple BigQuery API implementation.

Private sign.
Photo by Tim Mossholder on Unsplash

For as many hours as you spend developing, testing and maintaining robust data pipelines, I bet you don’t think nearly as deeply about who your end user might be. This is excusable because data engineering isn’t a discipline with a significant focus on UX (user experience).

And, depending on the structure of your org, the next person you hand your data source off to might not even be your final stakeholder. For instance, I often build pipelines whose data sources become the domain of data analysts, who in turn build dashboards that stakeholders can access.

Along its meandering journey from third-party vendor to your IDE to data warehouse to stakeholder access, the data you’re ingesting can be viewed, queried and manipulated by many individuals.

And this usually isn’t a good thing.

Data engineers, and anyone else with write access to your database, spend an inordinate amount of time thinking about how the resulting data should be presented, and far less about who might have undue access to it. This is why implementing a data governance policy at the organization level is integral to data security, especially for sensitive data, a.k.a. PII (personally identifiable information).

Cloud vendors like my go-to, Google Cloud Platform, allow you to define and enforce data governance at the dataset, table and even column level.

I’ve been working on and off on PII audits and data governance automations for the past 2–3 years, and in that time I’ve struggled to find a resource that comprehensively explains how to programmatically batch-tag multiple columns across multiple tables and datasets.

I recently discovered how to apply PII tags to existing schemas via the BigQuery API using the policy_tags parameter, an approach based on a Stack Overflow answer.
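To make the idea concrete, here is a minimal sketch of tagging columns in an existing schema. It is not the exact code from the Stack Overflow answer. It assumes the schema is in the BigQuery REST shape, where each field can carry a "policyTags" object with a "names" list; the Python client exposes the same setting as the policy_tags argument of SchemaField. The tag resource name, project, table and column names below are hypothetical placeholders, and the helper only handles top-level columns, not nested RECORD fields.

```python
from copy import deepcopy

# Hypothetical policy tag resource name; a real one comes from your own taxonomy.
PII_TAG = "projects/my-project/locations/us/taxonomies/111/policyTags/222"


def tag_columns(schema_fields, column_tags):
    """Return a copy of a REST-style schema with policyTags set on matching columns.

    schema_fields: list of dicts shaped like BigQuery's TableFieldSchema.
    column_tags:   dict mapping a top-level column name to a policy tag resource name.
    """
    tagged = deepcopy(schema_fields)  # leave the caller's schema untouched
    for field in tagged:
        tag = column_tags.get(field["name"])
        if tag:
            field["policyTags"] = {"names": [tag]}
    return tagged


def update_table_schema(table_id, column_tags):
    """Sketch of pushing the tags with google-cloud-bigquery.

    Assumes the library is installed and credentials are configured;
    this function is not exercised in the example below.
    """
    from google.cloud import bigquery  # lazy import so tag_columns works without it

    client = bigquery.Client()
    table = client.get_table(table_id)
    current = [field.to_api_repr() for field in table.schema]
    table.schema = [
        bigquery.SchemaField.from_api_repr(field)
        for field in tag_columns(current, column_tags)
    ]
    client.update_table(table, ["schema"])  # only the schema is sent in the update


# Local example with a made-up schema; no API call is made here.
example_schema = [
    {"name": "email", "type": "STRING", "mode": "NULLABLE"},
    {"name": "order_total", "type": "FLOAT", "mode": "NULLABLE"},
]
tagged_schema = tag_columns(example_schema, {"email": PII_TAG})
```

Batch tagging across tables and datasets would then be a loop over a mapping of table IDs to column-to-tag dictionaries, calling update_table_schema once per table.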
