Several months ago we had a contract to provide ETL services for a company wanting to use BigQuery as a data warehouse for several disparate data sources. There are a lot of tools that make this really easy to do. In this case we used StitchData and Apache Airflow.
It was a fairly straightforward project from an ETL perspective. And within a couple of weeks we had the pipeline deployed and massive amounts of data dropping into BigQuery. The company was really excited to be able to run valuable queries for not a lot of money.
One thing we did when deploying the solution was made sure that the large transaction tables were partitioned by day and required a where clause on the partition to return results. This is just a good practice to avoid a situation where a careless engineer runs a query without a where clause resulting in a big processing bill.
So I was surprised a few weeks later when one of the senior data engineers told me they liked BigQuery, but didn’t know if it was worth the price. I know what they were using for their data warehouse before and bq should have been a fraction of the price. We asked them to re-enable our access so we could see what was going on and spotted the problem immediately.
Partition filters don’t matter if you don’t use them properly
Looking at the query history I saw some unusually expensive queries by a junior analyst. While he was using the partition filter (he had to since we prevented it from working without one), every time he ran a query he was just adding “where date >= 2000-01-01” – in other words, pulling down all of the records from all time (since they only had data going back to 2016).
That was the first problem, but not even the worst one. This was a wide table which made it really performant for most of their analytical queries, but he was doing “Select *” meaning he was pulling in every column of the table even though he was only interested in two columns.
If I ran the world…
I wish BigQuery would just add a couple of features to prevent runaway query costs. In this case it had cost the company less than $100, and they’re big so it probably didn’t matter from a cost perspective, but it’s painful for me to watch 20TB of data being unnecessarily processed. Here’s what I wish they would do:
- Give account owners the option to disable “Select *” in queries.
- When a query is run against a partitioned table and less than 5% of the partitions are used in the resulting query, draw the user’s attention to that fact in hopes that they will use parameter filters more responsibly in the future.
So that’s it, and it’s pretty simple. Select only the columns you need and use partition filters properly. If you want to give non-technical users access to your bigdata consider doing it via a tool like Looker, Data Studio, PowerBI, Tableau, etc. where you can curate useful queries for your users. This has the advantage of ensuring that everybody has the same definitions for KPI’s.