Parse 12 Months Of Credit Card Statements In 3 Minutes

How to use Python to read multi-page PDFs and transform unstructured data, then use SQL to format the final result in BigQuery.

Gold American Express card on desk.
Photo by Ryan Born on Unsplash

One of the first obvious insights a data source yields is its structure. With rare exceptions, you can determine the type and scope of the data you’re dealing with as soon as you can return it in raw form. When building API-based connections, as nearly all data engineers will, successfully returning raw output such as JSON is the light bulb moment that signals forward progress.

And while JSON structures can quickly grow complex, introducing nested records and messy data, in my opinion the most difficult data to deal with comes in one of the oldest formats: text. Text data poses unique challenges and forces data engineers to channel their most niche SQL skills for operations like string matching and multi-step regular expressions (regex).

One of the most challenging, nuanced and highest-impact text data sources you’ll encounter? Your credit card statement.
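To make the "multi-step regex" idea concrete, here is a minimal Python sketch of pulling fields out of one line of statement text. The line layout (date, description, amount), the sample lines, and the names LINE_PATTERN and parse_line are all assumptions made for illustration; real statements vary by issuer and will need their own patterns.

```python
import re
from decimal import Decimal

# Assumed (hypothetical) line layout: MM/DD, a free-text description, then an amount.
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{2}/\d{2})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?\$?[\d,]+\.\d{2})$"
)


def parse_line(line):
    """Return a dict for a transaction-like line, or None if it does not match."""
    # Step 1: match the overall shape of the line.
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    # Step 2: strip currency symbols and thousands separators from the amount.
    amount = Decimal(re.sub(r"[$,]", "", match["amount"]))
    return {
        "date": match["date"],
        "description": match["description"],
        "amount": amount,
    }


# Made-up sample lines, including a page footer that should be skipped.
sample_lines = [
    "03/14  COFFEE SHOP SEATTLE WA   $4.75",
    "03/15  ONLINE RETAILER          $1,204.10",
    "Page 2 of 6",
]

transactions = [t for t in map(parse_line, sample_lines) if t is not None]
for transaction in transactions:
    print(transaction)
```

Lines that do not fit the assumed shape, such as page footers, return None and are dropped, which is usually the first filter needed before any heavier cleanup.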


In a previous project, I set out to scrape and analyze my own shopping transactions (unfortunately, due to time constraints, I’ve only completed the “scrape” portion of this work). The end goal: building a small dashboard that tells me which items I buy most and how much I spend per order and per month. That project reflects my recent focus on extracting and centralizing my financial reporting.

Context: I waste several minutes a day toggling between apps, trying to review balances and derive any kind of insight into my spending habits. Which raises the question: if an innovative and personalized data solution is my goal, why am I concentrating on the ingestion and transformation of old-school files?

Unfortunately, financial institutions generally do not offer API services to non-commercial parties, i.e. nerds like me. And, double unfortunately, the aggregation service I previously used, Mint, has been discontinued by its parent company, Intuit. So, motivated to find and aggregate arguably my most important consumer data, I turned to the printed page.
