Automate Your BigQuery Schema Definitions With 5 Lines of Python

Tired of manually writing my BigQuery schemas, I wrote a function that makes schema definition less time-consuming.

Share
Humanoid robot.
Automation in human form. Photo by Possessed Photography on Unsplash

Optimizing BigQuery Schema Definitions

One thing I’ve learned while programming is that I’d rather do more work now to do less work later.

That’s why, if I find myself performing a redundant operation, I begin exploring the possibility of writing a function or implementing another method to avoid doing mundane, repetitive work.

Schema definition, while important, is one of the most mundane, repetitive tasks I do as a data engineer. While this task can be completed in the BigQuery UI, since I primarily use the Python client, I define my schemas manually to ensure that GCP doesn’t parse anything wrong.

In doing so, I often end up with some pretty ugly, lengthy code that, even with a config file, can become unruly.

If you’re dealing with nested fields, it can become downright unpleasant, like the below snippet:

from google.cloud import bigquery

schema = [bigquery.SchemaField("abstract", "STRING", mode="NULLABLE"),
bigquery.SchemaField("web_url", "STRING", mode="NULLABLE"),
bigquery.SchemaField("snippet", "STRING", mode="NULLABLE"),
bigquery.SchemaField("lead_paragraph", "STRING", mode="NULLABLE"),
bigquery.SchemaField("print_section", "STRING", mode="NULLABLE"),
bigquery.SchemaField("print_page", "INTEGER", mode="NULLABLE"),
bigquery.SchemaField("source", "STRING", mode="NULLABLE"),
bigquery.SchemaField("multimedia", "RECORD", mode="REPEATED", fields=[
bigquery.SchemaField("rank", "STRING", mode="NULLABLE"),
bigquery.SchemaField("caption", "STRING", mode="NULLABLE"),
bigquery.SchemaField("credit", "STRING", mode="NULLABLE"),
bigquery.SchemaField("type", "STRING", mode="NULLABLE"),
bigquery.SchemaField("url", "STRING", mode="NULLABLE"),
bigquery.SchemaField("height", "INTEGER", mode="NULLABLE"),
bigquery.SchemaField("width", "INTEGER", mode="NULLABLE"),
bigquery.SchemaField("legacy", "RECORD", mode="REPEATED", fields=[
bigquery.SchemaField("xlarge", "STRING", mode="NULLABLE"),
bigquery.SchemaField("xlargewidth", "INTEGER", mode="NULLABLE"),
bigquery.SchemaField("xlargeheight", "INTEGER", mode="NULLABLE"),
bigquery.SchemaField("crop_name", "STRING", mode="NULLABLE")
        ])
    ]),
bigquery.SchemaField("headline", "RECORD", mode="REPEATED",
          fields=[
          bigquery.SchemaField("main", "STRING", mode="NULLABLE"),
          bigquery.SchemaField("content_kicker", "STRING", mode="NULLABLE"),
          bigquery.SchemaField("print_headline", "STRING", mode="NULLABLE"),
          bigquery.SchemaField("name", "STRING", mode="NULLABLE"),
          bigquery.SchemaField("seo", "STRING", mode="NULLABLE"),
          bigquery.SchemaField("sub", "STRING", mode="NULLABLE"),
                                                ])]

I realized, recently, that writing ‘bigquery.SchemaField’ out every time is unnecessary. Remembering the right parentheses and comma combination was annoying and prone to syntax errors.

So I took some time and made my life (and, hopefully, now yours) easier.

Below is my process. Feel free to copy/paste, replicate and integrate.

Build Your Pipeline To A Data Engineering Career

You’ve reached the limit of the public preview. The full version of this post includes the implementation details: The code, the edge cases, and the "why" behind the architecture.

When you join PipelineToDE, you get:

  • The DA → DE Pathway Course: A structured roadmap to bridge the gap between analysis and engineering.
  • Weekly Senior Deep Dives: Fresh, tactical insights on Python, Cloud (GCP/AWS), and modern orchestration delivered every week.
  • Production-Ready Blueprints: Access to 80+ protected stories and code repos from my time in the trenches as a Senior DE
  • The DE Job Board (Coming Soon): Exclusive access to a curated board of high-agency Data Engineering roles.