Last month, I spoke with a Head of Data who was paying over $60k/mo for BigQuery.
We investigated the situation together, and what we found inspired the rest of this post.
Continuing the theme of helping you reduce warehouse cost (see my earlier post: The cost of product analytics data in your data warehouse), this post has two goals: to build a realistic cost estimator so you can anticipate future warehouse spend, and to share a few techniques to keep that cost from getting out of hand.
Before we start evaluating how cost grows exponentially, we need to understand how you get charged for warehouse usage. Working with our clients, we've found that total cost across warehouses roughly averages out for equivalent size and usage, so I'm using BigQuery as a proxy for this post because it has one of the simpler pricing models.
BigQuery query cost: $5 per TB of data scanned
This cost is super misleading. A terabyte seems really large, and $5 isn’t that much. When I first saw this, I thought “I won’t have much data for years, and even when I do it’ll still be super cheap”. I was…wrong.
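To get a feel for how quickly $5/TB adds up, here's a quick back-of-the-envelope sketch in Python. The 50 GB table and hourly refresh schedule are made-up numbers, purely for illustration:

```python
# Illustrative only: one 50 GB table that gets fully scanned every hour
# (e.g. by an hourly-scheduled materialized view).
table_tb = 50 / 1024          # 50 GB expressed in TB
scans_per_month = 24 * 30     # hourly scans for a month
cost_per_tb = 5               # BigQuery on-demand price: $5 per TB scanned

monthly_cost = table_tb * scans_per_month * cost_per_tb
print(f"${monthly_cost:,.0f}/mo")  # roughly $176/mo for a single table
```

One moderately sized table on an hourly schedule is already real money, and a warehouse has dozens or hundreds of them being refreshed like this.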
The cost of a data warehouse doesn't increase at the same rate as your data volume; it increases exponentially as your data volume grows. This is a common misconception because we don't intuitively understand how we put load on a warehouse. Below, I've tried to come up with a good mental model for it.
To start, I created a rubric for different company stages and sizes, breaking it down by data sources, tracked users, and data questions:
Now that we have it broken down by company size, I also needed to make a few assumptions on # of tables, usage, and data size:
Given these assumptions, we can compute the actual load on the data warehouse:
# of Materialized Views = 10 * # of data sources + (10% * new questions/mo * 12 mo/year)
Average Table Size = 100 rows per tracked user * # of tracked users * 1.2 bytes per row
Average # of Tables per Query = 3 for small, 5 for medium and large (as your company grows, the questions you ask will become more complex, therefore requiring more complex joins)
Both the number of tables per query and the average size of each table increase over time. This makes intuitive sense: as you keep collecting data from your customers, the data will always be growing, and as your company asks more questions, you will always need to build new materialized views to answer them.
Now that we have the number of tables per query and the average size of each table, we account for the fact that data is processed every hour, and multiply by the cost to scan that amount of data:
Total Query Load per Month = # of Materialized Views * Average Table Size * Average # of tables per Query * 24 hours/day * 30 days/mo
TOTAL MONTHLY COST = $5/TB * Total Query Load per Month
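Putting the formulas above together, here's a minimal sketch of the estimator in Python. The constants (10 views per data source, 100 rows per tracked user, 1.2 bytes per row, $5/TB) come straight from the assumptions above; the example inputs at the bottom are hypothetical placeholders you'd swap for your own rubric values:

```python
def estimate_monthly_cost(
    data_sources: int,
    tracked_users: int,
    new_questions_per_month: int,
    tables_per_query: int = 5,    # 3 for small companies, 5 for medium and large
    bytes_per_row: float = 1.2,   # per the assumption above
    rows_per_user: int = 100,
    cost_per_tb: float = 5.0,     # BigQuery on-demand: $5 per TB scanned
) -> float:
    """Rough monthly warehouse cost, following the formulas in this post."""
    # Materialized Views = 10 * data sources + (10% * new questions/mo * 12 mo/year)
    materialized_views = 10 * data_sources + 0.10 * new_questions_per_month * 12

    # Average Table Size = 100 rows per tracked user * tracked users * bytes per row
    avg_table_bytes = rows_per_user * tracked_users * bytes_per_row

    # Each view refresh scans `tables_per_query` tables, every hour, all month
    monthly_bytes_scanned = (
        materialized_views * avg_table_bytes * tables_per_query * 24 * 30
    )
    monthly_tb_scanned = monthly_bytes_scanned / 1024**4  # convert bytes to TB

    return monthly_tb_scanned * cost_per_tb


# Example with hypothetical inputs: 15 data sources, 200k tracked users,
# 50 new data questions per month.
print(f"${estimate_monthly_cost(15, 200_000, 50):,.2f}/mo")
```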
It is very easy to just create another materialized view, but not everything deserves one. Every time a materialized view is updated on its schedule, it requires a scan (costing you money). You can save a lot of money by answering ad-hoc questions with views that are not scheduled. We found that many ad-hoc questions start out as super critical, but become less valuable once the stakeholder gets an answer. So instead of materializing that data, make it a view and pay only when people actually use the data. All those dashboards no one looks at will no longer cost so much.
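To make the trade-off concrete, here's a small sketch with made-up numbers: the same ~20 GB of underlying data, either materialized on an hourly schedule or left as a plain view that only scans when someone actually queries it (assumed here to be about three times a day):

```python
# Illustrative only: hourly-refreshed materialized view vs. on-demand view.
scan_tb = 20 / 1024      # ~20 GB scanned per refresh or query
cost_per_tb = 5

materialized_hourly = scan_tb * 24 * 30 * cost_per_tb  # refreshes whether or not anyone looks
view_on_demand = scan_tb * 3 * 30 * cost_per_tb        # ~3 ad-hoc queries per day

print(f"materialized: ${materialized_hourly:,.0f}/mo  vs  view: ${view_on_demand:,.0f}/mo")
```

The materialized version costs roughly eight times more here, and the gap only widens for dashboards nobody opens.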
Most BI tools or query tools have a caching layer. This is like a materialized view, but it's saved per query. Keep in mind the first time it loads it will be really slow, but subsequent runs will be really fast.
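BigQuery also has its own result cache, which works on the same principle. As a rough sketch using the Python client (the table name is hypothetical), a repeated query that hits the cache bills zero bytes:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials are already configured

# use_query_cache is on by default; shown here explicitly
job_config = bigquery.QueryJobConfig(use_query_cache=True)
job = client.query(
    "SELECT COUNT(*) FROM `my_project.analytics.orders`",  # hypothetical table
    job_config=job_config,
)
job.result()

# A cache hit bills 0 bytes; only the first (cold) run pays for the scan
print(job.cache_hit, job.total_bytes_billed)
```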
When dealing with larger data, try to break up your materialized views into a couple of queries that are each incrementally updated. This means you process only the diff per update (an hour of data, for example) instead of the whole history of data.
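Here's a sketch of what that can look like in BigQuery, again via the Python client; the project, dataset, table, and column names are all hypothetical. Instead of rebuilding the summary table from its full history, each scheduled run appends only the last hour of raw data:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials are configured

# Append only the last hour of raw events instead of rescanning the full history.
incremental_update = """
INSERT INTO `my_project.analytics.orders_summary` (order_id, customer_id, total, created_at)
SELECT order_id, customer_id, total, created_at
FROM `my_project.raw.orders`
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
client.query(incremental_update).result()  # scans ~1 hour of data per run
```

In practice you'd also guard against duplicates (for example with a MERGE), but the cost dynamic is the same: each update scans an hour of data rather than the entire history.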
This is a bit counterintuitive. We often think about building tables on top of other tables so each query is simpler, and yes, dependencies can sometimes make things easier to debug. However, we've seen queries that depend on 20 nested parents (ones that aren't used by any other child queries). This means you are paying 20x the cost for that one table. You can save a lot of money by letting the query get complex. BE VERY CAREFUL with this: it's most helpful for larger datasets, and remember that you will need to maintain it.
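A rough sketch of the arithmetic, with made-up numbers: twenty hourly-refreshed parent tables, each scanning about 10 GB per refresh, versus one flattened (and admittedly uglier) query that scans the same data once:

```python
# Illustrative only: 20 chained parent tables vs. one flattened query.
scan_tb = 10 / 1024            # ~10 GB scanned per refresh
refreshes_per_month = 24 * 30  # hourly schedule
cost_per_tb = 5

nested = 20 * scan_tb * refreshes_per_month * cost_per_tb    # every parent refreshes hourly
flattened = 1 * scan_tb * refreshes_per_month * cost_per_tb  # one scan per refresh

print(f"nested: ${nested:,.0f}/mo  vs  flattened: ${flattened:,.0f}/mo")
```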
The reality is that both your data volume and the number of data questions will keep increasing exponentially as your company grows. There's no way to truly get around an increase in warehouse cost, but using the techniques above will definitely mitigate the expense.
And I'd be remiss if I didn't at least mention how our clients are using the Activity Schema approach with Narrator. It's an alternative approach to data modeling that we've seen drastically reduce warehouse cost across all of our clients, since it's a single, incrementally updated table for all analytics instead of the many tables and nested dependencies that come with a star schema.