Automate Your BigQuery Schema Definitions With 5 Lines of Python
Tired of manually writing my BigQuery schemas, I wrote a function that makes schema definition less time-consuming.
Optimizing BigQuery Schema Definitions
One thing I’ve learned while programming is that I’d rather do more work now to do less work later.
That’s why, if I find myself performing a redundant operation, I begin exploring the possibility of writing a function or implementing another method to avoid doing mundane, repetitive work.
Schema definition, while important, is one of the most mundane, repetitive tasks I do as a data engineer. While this task can be completed in the BigQuery UI, since I primarily use the Python client, I define my schemas manually to ensure that GCP doesn’t parse anything wrong.
In doing so, I often end up with some pretty ugly, lengthy code that, even with a config file, can become unruly.
If you’re dealing with nested fields, it can become downright unpleasant, like the below snippet:
from google.cloud import bigquery
schema = [bigquery.SchemaField("abstract", "STRING", mode="NULLABLE"),
bigquery.SchemaField("web_url", "STRING", mode="NULLABLE"),
bigquery.SchemaField("snippet", "STRING", mode="NULLABLE"),
bigquery.SchemaField("lead_paragraph", "STRING", mode="NULLABLE"),
bigquery.SchemaField("print_section", "STRING", mode="NULLABLE"),
bigquery.SchemaField("print_page", "INTEGER", mode="NULLABLE"),
bigquery.SchemaField("source", "STRING", mode="NULLABLE"),
bigquery.SchemaField("multimedia", "RECORD", mode="REPEATED", fields=[
bigquery.SchemaField("rank", "STRING", mode="NULLABLE"),
bigquery.SchemaField("caption", "STRING", mode="NULLABLE"),
bigquery.SchemaField("credit", "STRING", mode="NULLABLE"),
bigquery.SchemaField("type", "STRING", mode="NULLABLE"),
bigquery.SchemaField("url", "STRING", mode="NULLABLE"),
bigquery.SchemaField("height", "INTEGER", mode="NULLABLE"),
bigquery.SchemaField("width", "INTEGER", mode="NULLABLE"),
bigquery.SchemaField("legacy", "RECORD", mode="REPEATED", fields=[
bigquery.SchemaField("xlarge", "STRING", mode="NULLABLE"),
bigquery.SchemaField("xlargewidth", "INTEGER", mode="NULLABLE"),
bigquery.SchemaField("xlargeheight", "INTEGER", mode="NULLABLE"),
bigquery.SchemaField("crop_name", "STRING", mode="NULLABLE")
])
]),
bigquery.SchemaField("headline", "RECORD", mode="REPEATED",
fields=[
bigquery.SchemaField("main", "STRING", mode="NULLABLE"),
bigquery.SchemaField("content_kicker", "STRING", mode="NULLABLE"),
bigquery.SchemaField("print_headline", "STRING", mode="NULLABLE"),
bigquery.SchemaField("name", "STRING", mode="NULLABLE"),
bigquery.SchemaField("seo", "STRING", mode="NULLABLE"),
bigquery.SchemaField("sub", "STRING", mode="NULLABLE"),
])]I realized, recently, that writing ‘bigquery.SchemaField’ out every time is unnecessary. Remembering the right parentheses and comma combination was annoying and prone to syntax errors.
So I took some time and made my life (and, hopefully, now yours) easier.
Below is my process. Feel free to copy/paste, replicate and integrate.
Build Your Pipeline To A Data Engineering Career
You’ve reached the limit of the public preview. The full version of this post includes the implementation details: The code, the edge cases, and the "why" behind the architecture.
When you join PipelineToDE, you get:
- The DA → DE Pathway Course: A structured roadmap to bridge the gap between analysis and engineering.
- Weekly Senior Deep Dives: Fresh, tactical insights on Python, Cloud (GCP/AWS), and modern orchestration delivered every week.
- Production-Ready Blueprints: Access to 80+ protected stories and code repos from my time in the trenches as a Senior DE
- The DE Job Board (Coming Soon): Exclusive access to a curated board of high-agency Data Engineering roles.