If You Must Use A Spreadsheet As A Database, At Least Use This Kind
Spreadsheets are a breeding ground for data inaccuracy — unless you can solve one core problem.
One of the things about being a data engineer (or, perhaps, just a data nerd) is that you can’t help but comment when someone mentions their org’s data infrastructure.
Especially if it’s bad.
And even more so if it’s nonexistent.
My most recent critique came at the expense of a friend of my wife’s and mine who works in a school. While talking to another teacher friend, they mentioned that they wished they had more time for planning and, you know, teaching, instead of so much tedious, manual work.
When I commented that grading should probably still involve a degree of human intervention, they shocked me by clarifying that they weren’t referring to grading.
They were talking about recording attendance. Which they had to do.
By hand.
And they had to do it in the worst possible (yet most used) data storage solution: A spreadsheet.
While vulnerable to a range of development flaws, spreadsheets pose two primary issues:
- Difficult to scale
- Inaccurate
With inaccuracy, I’m not just talking about autocorrect in cells. Research suggests that mistakes in spreadsheets are all but inevitable, given the manual nature of data entry.
Scale also becomes an issue, with regard to both storage and utility.
Spreadsheets don’t allow for efficient storage practices like partitioning, clustering or setting partition-based expiration dates.
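For readers who haven’t met these terms, here is a minimal sketch of what date partitioning and partition-based expiration mean, written in plain Python rather than against any real warehouse. The function names, the row fields and the 30-day window are all assumptions made up for illustration; an actual warehouse does this at the storage layer.

```python
from collections import defaultdict
from datetime import date, timedelta


def partition_by_day(rows):
    """Group rows into partitions keyed by their 'day' field (hypothetical schema)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["day"]].append(row)
    return dict(partitions)


def expire_partitions(partitions, today, max_age_days):
    """Drop whole partitions whose day is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return {day: rows for day, rows in partitions.items() if day >= cutoff}


# Made-up attendance rows, purely for illustration.
rows = [
    {"day": date(2024, 1, 1), "student": "A", "present": True},
    {"day": date(2024, 1, 1), "student": "B", "present": False},
    {"day": date(2024, 3, 1), "student": "A", "present": True},
]

partitions = partition_by_day(rows)
kept = expire_partitions(partitions, today=date(2024, 3, 2), max_age_days=30)
```

The point is that old data is dropped a partition at a time instead of row by row, which a flat sheet of cells has no way to express.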
Cloud-based data sources are the easiest to scale (because infra is effectively outsourced), but it is still up to the data engineer (or DBA) to ensure accuracy through methods that prevent issues like duplication.
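As a rough illustration of what "preventing duplication" can look like, the sketch below deduplicates attendance rows on a composite key and keeps the most recent entry for each key. The `student_id`, `day` and `status` fields are hypothetical, and keeping the last entry is just one possible policy, not necessarily the right one for every dataset.

```python
def deduplicate(rows, key_fields=("student_id", "day")):
    """Keep the last row seen for each key, in first-seen key order."""
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        latest[key] = row  # a later row with the same key overwrites the earlier one
    return list(latest.values())


# Made-up rows: student 1 was entered twice for the same day.
rows = [
    {"student_id": 1, "day": "2024-03-01", "status": "present"},
    {"student_id": 2, "day": "2024-03-01", "status": "absent"},
    {"student_id": 1, "day": "2024-03-01", "status": "late"},
]

clean = deduplicate(rows)
```

In a database the same guarantee usually comes from a uniqueness constraint or a merge step in the pipeline; a spreadsheet leaves it to whoever is typing.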
But despite all of these arguments against spreadsheets as a database, I do make one exception — one I’ve used in production to support flagship products.
The solution is to use synchronous, connected spreadsheets.