Aggregate Awareness
Aggregate Awareness is now in closed beta. Request Beta Access.
You can also read the announcement blog post.
Introduction
Aggregate Awareness is a query performance optimization feature in Holistics. It inspects the queries sent from Holistics BI and intelligently rewrites them to use smaller, pre-aggregated tables instead of the raw tables.
This results in fewer rows scanned, improving both performance and cost while maintaining query accuracy.
The aggregated tables can be built either by Aggregate Awareness itself or by external transformation tools (dbt, Airflow, Dagster).
Glossary
Before we dive deeper into Aggregate Awareness, let's align on some basic terms:
- Pre-aggregate is an aggregation performed and prepared ahead of query time, producing a condensed version of the source data with just enough information to answer frequent queries. This reduces the size of the data and improves query performance.
- Pre-aggregated table is the output of the pre-aggregate process.
- Aggregate Awareness is the capability of automatically identifying eligible pre-aggregates and substituting them into queries, making the queries use smaller pre-aggregated tables while still producing accurate results.
- Granularity describes the level of detail of a set of data.
- Coarse-grained (or low) granularity means the data carries less detail, giving a high-level overview frequently used for quick analysis.
- Fine-grained (or high) granularity, in contrast, means the data carries more detail, frequently used for in-depth analysis or at the execution level.
- For example:
  - `month` has coarser-grained (lower) granularity than `day`
  - (`country`, `city`) has finer-grained (higher) granularity than (`country`)
How It Works
You should know the concepts of Data Model and Dataset.
For Holistics Aggregate Awareness to know which pre-aggregate can be used in a query, there are two jobs to be done:
- Define the Pre-Aggregate in the dataset to map between the raw tables/models and the pre-aggregated table.
- Create the pre-aggregated table by using either
- The external tools (like dbt, Airflow, etc.) or
- The built-in persistence option in Holistics
When your query targets the original tables, Holistics will automatically pick the right pre-aggregated table that has enough information to answer the data question.
Therefore, instead of processing millions of records, it only needs to scan a fraction of them, leading to faster results.
Quick Example
Background
Assume we're working with the following `raw_transactions` table and a model on top of it, residing in an `ecommerce` dataset:
// ---------- Modeling Layer ----------
Dataset ecommerce {
//... other dataset settings
models: [transactions]
}
Model transactions {
dimension id
measure count_transactions { // a custom measure definition
label: 'Count Transactions'
type: 'number'
definition: @aql count(transactions.id)
}
//... other fields' definitions
table_name: 'raw_transactions'
}
// ---------- Database ----------
Table raw_transactions {
id integer [PK]
user_id integer [ref: > users.id]
product_id integer [ref: > products.id]
merchant_id integer [ref: > merchants.id]
status varchar
city varchar
country varchar
order_value double
created_at timestamp
}
Given that, we know the common queries with this table are:
- Count transactions by day, week, month
- Count transactions by status, country, city
This table is relatively big (billions of records). So, we know these analytical queries would take time.
Create Pre-aggregated Table
A good approach is to build a pre-aggregated table on the frequently used dimensions and measures to reduce the number of records:
-- Create pre-aggregated table agg_transactions
SELECT
created_at::date as created_at_day,
status,
country,
city,
COUNT(1) as count_transactions
FROM raw_transactions
GROUP BY 1, 2, 3, 4
// ---------- Database - Schema: `persisted` ----------
Table agg_transactions {
created_at_day date
status varchar
country varchar
city varchar
count_transactions integer
}
Map Pre-aggregated Table with Model
After that, we use a pre-aggregate to define which model dimensions & measures are equivalent to the respective pre-aggregated table fields:
Dataset ecommerce {
//... other dataset settings
models: [transactions]
// ----- BEGIN NEW CODE - Add pre-aggregate
pre_aggregates: {
agg_transactions: PreAggregate {
// This dimension is in the pre-aggregated table `agg_transactions`
dimension created_at_day {
// This reference is in the `transactions` model
for: ref('transactions', 'created_at'),
time_granularity: "day"
}
dimension status {
for: ref('transactions', 'status')
}
dimension country {
for: ref('transactions', 'country')
}
dimension city {
for: ref('transactions', 'city')
}
measure count_transactions {
for: ref('transactions', 'id')
aggregation_type: 'count'
}
persistence: FullPersistence {
schema: 'persisted'
}
},
}
// ----- END NEW CODE
}
The Results
When querying and exploring data, analysts and explorers only work with a single `transactions` model. They shouldn't need any knowledge of the underlying aggregation optimizations.
With Aggregate Awareness, Holistics automatically chooses the right underlying table to query based on the dimensions of the query. Conceptually, the Query Engine checks whether any defined pre-aggregate is eligible for the query and, if so, substitutes it into the query.
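To make the substitution concrete, here is an illustrative sketch of what the engine effectively does in SQL terms; the actual SQL Holistics generates may differ.

```sql
-- What the explorer's question compiles to against the raw table
-- (billions of rows scanned):
SELECT
  created_at::date AS created_at_day,
  COUNT(1)         AS count_transactions
FROM raw_transactions
GROUP BY 1;

-- What Aggregate Awareness effectively runs instead: hit the much
-- smaller pre-aggregated table and re-aggregate the partial counts
-- over the remaining dimensions (status, country, city):
SELECT
  created_at_day,
  SUM(count_transactions) AS count_transactions
FROM persisted.agg_transactions
GROUP BY 1;
```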
Setting up
To set up Aggregate Awareness in Holistics, follow these two steps:
- Defining pre-aggregate in the dataset
- Assigning the pre-aggregated tables by either:
- Creating aggregated tables using Holistics Persistence, OR
- Bringing in external pre-aggregated tables created in advance using data tools like dbt, Airflow, etc.
Define the pre-aggregate
In the dataset file, add the `pre_aggregates` config to map model dimensions and measures to the respective pre-aggregated table fields.
Dataset ecommerce {
//... other dataset settings
models: [transactions]
// Add pre-aggregate
pre_aggregates: {
agg_transactions: PreAggregate {
// This dimension is in the pre-aggregated table `agg_transactions`
dimension created_at_day {
// This reference is in the `transactions` model
for: ref('transactions', 'created_at'),
time_granularity: "day"
}
dimension status {
for: ref('transactions', 'status')
}
dimension country {
for: ref('transactions', 'country')
}
dimension city {
for: ref('transactions', 'city')
}
measure count_transactions {
for: ref('transactions', 'id')
aggregation_type: 'count'
}
persistence: FullPersistence {
schema: 'persisted'
}
},
}
}
Persisting the aggregated tables
The above setup assumes there are no physical tables created inside the data warehouse. In order for queries to use the aggregated table, an actual pre-aggregated table must be created.
Holistics supports the creation of this table through Holistics Persistence:
- You specify a writeable schema in the data warehouse.
- Holistics handles persisting and refreshing of tables behind the scenes.
pre_aggregates: {
transactions_agg: PreAggregate {
//... other pre-aggregate config
// PERSISTENCE CONFIG
persistence: FullPersistence {
schema: 'persisted'
}
}
}
Make sure that:
- The schema (e.g. `persisted`) exists in the data warehouse.
- The Holistics connection has proper permissions to create tables and write data to this schema.
Creating table manually
To manually trigger the table creation (persistence process), go to the Dataset's Pre-aggregate section in the List tab, click the Run button, and wait for it to finish.
Setting up scheduled refreshes
When the original tables change, the pre-aggregate can become outdated. Instead of manually triggering the persistence through the UI, you can use a schedule trigger to define the refresh cadence.
In Holistics, you can create a `schedules.aml` file at the root of your AML project:
const schedules = [
// Schedule for specific PreAggregates in a Dataset
PreAggregateSchedule {
cron: '15 20 * * *'
object: ecommerce_dataset
pre_aggregates: ['agg_transactions'] // persist the PreAggregate 'agg_transactions' only
}
// Schedule for all PreAggregates in a Dataset
PreAggregateSchedule {
cron: '15 20 * * *'
object: ecommerce_dataset
}
]
You can find the full syntax reference here.
Trigger via API (coming soon)
Alternatively, you can also trigger the creation and refresh of pre-aggregated tables via API (coming soon).
Working with externally-built tables
If you have a pipeline that builds the aggregated tables outside of Holistics (with tools like Airflow, dbt, etc.), you only need to map the table name to the pre-aggregate definition.
Use `ExternalPersistence` to let Holistics know you're using an external table.
Dataset ecommerce {
// ... other settings
pre_aggregates: {
agg_transactions: PreAggregate {
//... other settings
dimension created_at_day
measure count_transactions
// Map external pre-aggregated table with this option
persistence: ExternalPersistence {
table_name: 'persisted.agg_transactions'
}
},
}
}
Capabilities
In this section, we will explore the capabilities of Holistics Aggregate Awareness in various scenarios.
Dimension Awareness
- You can drill down into more detail or roll up to a higher level using any combination of dimensions within the pre-aggregate.
- For example, Holistics will automatically use the pre-aggregated table `agg_transactions` instead of the `raw_transactions` table to answer all of the queries below:
  - Count transactions by day and status
  - Count transactions by day, status, and country
  - Count transactions by day, status, country, and city
  - etc.
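As an illustration (not the literal SQL Holistics generates), "Count transactions by day and status" only needs to re-aggregate the partial counts over the remaining dimensions:

```sql
-- The day/status/country/city pre-aggregate can answer a coarser
-- day/status question by summing the partial counts:
SELECT
  created_at_day,
  status,
  SUM(count_transactions) AS count_transactions
FROM persisted.agg_transactions
GROUP BY 1, 2;
```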
Join Awareness (Smart Join Substitution)
Aggregate Awareness is also smart enough to perform aggregations with relationship joins if the Pre-Aggregate
- contains the foreign key of a Many-to-One relationship
- and is on the "Many" side of that relationship
For example, let's say in our Pre-Aggregate, we add the dimension `transactions.user_id`, which has a Many-to-One relationship to the `users` model.
Dataset ecommerce {
//... other settings
models: [
transactions,
users
]
pre_aggregates: {
agg_transactions: PreAggregate {
//... other settings
dimension created_at_day
measure count_transactions
// ADD THIS
dimension user_id {
for: ref('transactions', 'user_id')
}
persistence: FullPersistence {
schema: 'persisted'
}
},
}
}
//------------------------------------------------------------
// Existing setup
Model transactions {
//... other settings
table_name: 'raw_transactions'
}
Model users {
//... other settings
dimension gender
dimension age_group
table_name: 'users'
}
Relationship: transactions.user_id > users.id;
The following queries will automatically use the pre-aggregated table and perform a join with `users`:
- Count transactions by gender
- Count transactions by country, age_group
even when your pre-aggregate does not contain the gender, country, or age_group dimensions!
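Conceptually, this works because the preserved `user_id` foreign key lets the engine join the pre-aggregate to `users` before re-aggregating. An illustrative sketch (the generated SQL may differ):

```sql
-- "Count transactions by gender": join the small pre-aggregated table
-- to `users` via the preserved foreign key, then re-aggregate.
SELECT
  u.gender,
  SUM(a.count_transactions) AS count_transactions
FROM persisted.agg_transactions AS a
JOIN users AS u ON u.id = a.user_id
GROUP BY 1;
```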
Time Granularity Awareness
Holistics can identify the pre-aggregated table at the right time granularity level based on the option you configure in the pre-aggregate (minute, hour, day, week, month, quarter, year).
For example, in the quick example above, you can use `agg_transactions` at the day level to get the total transactions count at the week or month level, etc.
These queries would use the pre-aggregated table `agg_transactions` at the day level and take additional time to aggregate from day to week or month (although this is still faster than using the raw table).
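Rolling the day-level pre-aggregate up to weeks looks conceptually like the sketch below (`DATE_TRUNC` syntax varies by warehouse; illustrative only):

```sql
-- Weekly counts derived at query time from the day-level pre-aggregate:
SELECT
  DATE_TRUNC('week', created_at_day) AS created_at_week,
  SUM(count_transactions)            AS count_transactions
FROM persisted.agg_transactions
GROUP BY 1;
```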
However, if you query data at the week level frequently and want even faster queries, consider creating another pre-aggregated table at a coarser granularity (e.g., `agg_transactions_week`).
Map it to the `transactions` model so that the engine will be aware of it when possible:
Dataset ecommerce {
//... other settings
models: [
transactions
]
pre_aggregates: {
// BEGIN NEW CODE
agg_transactions_week: PreAggregate {
dimension created_at_week {
for: ref('transactions', 'created_at'),
time_granularity: "week"
}
//... other pre-aggregate settings
},
// END NEW CODE
agg_transactions: PreAggregate {
//... pre-aggregate settings
},
}
}
Now, when you query the total weekly transactions count, Holistics will use the pre-aggregated table `agg_transactions_week`, leading to faster results than using the `agg_transactions` table.
In the current version, if there are multiple pre-aggregates eligible for a query, the (eligible) pre-aggregate that is defined first in the dataset will be chosen to substitute into the query.
We recommend that you put the coarser-grained pre-aggregates before the finer-grained pre-aggregates in the dataset configuration. (e.g. “month” first, then “week”, then “day”)
Measure Types in Pre-aggregates
Learn more about Additive and Non Additive Measures
Depending on the measure type used in the pre-aggregate, and whether or not you re-aggregate the measure in the current query, the pre-aggregated tables will be hit or missed.
Additive measures
Aggregate Awareness can re-aggregate these types of additive measures:
- sum
- count
- min
- max
For example, a pre-aggregate of "Sum Revenue by Month" can also be used for the query "Sum Revenue by Year".
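This works because additive partial results can be combined again at any coarser grain. A sketch with a hypothetical month-level revenue pre-aggregate (table and column names are illustrative, not from the example above):

```sql
-- Hypothetical pre-aggregate: sum of revenue per month.
-- A yearly total is just the sum of the monthly partial sums:
SELECT
  DATE_TRUNC('year', revenue_month) AS revenue_year,
  SUM(sum_revenue)                  AS sum_revenue
FROM persisted.agg_revenue_month
GROUP BY 1;
```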
Non-additive measures
- Pre-aggregated non-additive measures can only match queries that use the exact same dimensions/granularity. Here are some non-additive measures:
- count distinct
- median
- stddev
- variance
- avg
- For example, a pre-aggregate of "Count Distinct User by Month" cannot be used for the query "Count Distinct User by Year".
- In the case of `avg`, you can actually split (decompose) it into two additive measures: `sum` and `count`. If your PreAggregate has `sum` and `count` measures, Aggregate Awareness can automatically pick them up to calculate the `avg`.
// For `avg`
// ❗ Instead of pre-aggregating `avg` directly
measure avg_a {
for: ref('model', 'a')
aggregation_type: 'avg'
}
// ✅ Pre-aggregate on `sum` and `count` instead
measure sum_a {
for: ref('model', 'a')
aggregation_type: 'sum'
}
measure count_a {
for: ref('model', 'a')
aggregation_type: 'count'
}
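With `sum` and `count` pre-aggregated, an exact average can be recomputed at any coarser grain. Conceptually (table and column names are hypothetical):

```sql
-- avg(a) at the month level, derived from day-level additive parts:
SELECT
  DATE_TRUNC('month', created_at_day) AS month,
  SUM(sum_a) * 1.0 / SUM(count_a)     AS avg_a
FROM persisted.agg_model_day
GROUP BY 1;
```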
Holistics Persistence
This section provides more details about Holistics Persistence (`FullPersistence` and `IncrementalPersistence`).
- To learn about setting up Holistics Persistence, please refer to the section above.
- These details are not relevant to `ExternalPersistence` (ref).
Configuring Holistics Persistence
`PreAggregate` supports configuring `persistence` using `FullPersistence` and `IncrementalPersistence`.
In fact, these persistence configs are the same as those used for Query Model Persistence.
Thus, you can refer to these doc sections to configure the Holistics Persistence of your Pre-Aggregates:
- AML Persistence (AML Syntax Reference)
  - Note: `on_cascade` is not relevant (not applicable) to Pre-Aggregate Persistence.
- Types of persistence config
- Persistence Table Optimizations
Uniqueness of Persisted Tables
Holistics generates a key for each Pre-Aggregate based on its semantics. When running a Persistence, Holistics associates that key with the persisted table.
The semantics of a Pre-Aggregate includes (but is not limited to):
- Name
- Dimensions (including their SQL/AQL definitions)
- Measures (including their SQL/AQL definitions)
- Dataset's Relationships (that are relevant to the Dimensions and Measures)
- Dataset's Data Source
Therefore, if you make semantic changes to your Pre-Aggregate (for example, by changing a Dimension's definition in its Model), because the key is changed, Holistics will not use the old persisted tables of that Pre-Aggregate.
Additional notes:
- Pre-Aggregates defined in different git branches of your project can re-use the same persisted tables if (and only if) they have the same semantics.
Cleanup for Persisted Tables
Persisted Tables are automatically cleaned up from your Database after a period of time. That period of time is called the Time-To-Live (TTL).
The TTL is refreshed whenever:
- The persisted table's data is updated.
- The persisted table is used in a query.
The TTL is 7 days by default.
You can configure it in your Holistics Settings (at Pre-Aggregate Persistence TTL).
Tip: Leveraging AML Reusability
It is a common need to create different PreAggregates for different time granularities so that you can configure more efficient persistence pipelines.
To conveniently generate multiple PreAggregates for different time granularities, you can leverage AML Reusability.
Building multiple PreAggregates using AML Extend is a great example to get you started!
FAQs
Why is pre-aggregate defined inside a dataset? Can I just define it in the model?
PreAggregate is defined in the dataset because it can reference fields from multiple different models.
Thus, it needs to be defined in a dataset where it can use the dataset's relationships to join those models.
I expected the query to use the pre-aggregated table, but it ended up using the original one.
There are many reasons for this issue. Here are the most common reasons:
- The query is at a finer-grained granularity than the pre-aggregated tables that you defined in the dataset.
- For example, you query Count total transactions by hour, but you only have pre-aggregated tables at the day and week levels, which are coarser-grained (lower granularity) than hour. So the raw table will be used instead of the pre-aggregated table.
- You defined the pre-aggregated tables at the right level of granularity already, but you did not trigger the persistence. That is why your pre-aggregated tables are not there when you need them.
In any case, please check out the Executed Query, which includes the exact reason and other debugging details.
I expected the query to use pre-aggregate at this level of granularity because it is more optimal, but it turned out to use another one.
In the current version, if there are multiple pre-aggregates eligible for a query, the (eligible) pre-aggregate that is defined first in the dataset will be chosen to substitute into the query.
We recommend that you put the coarser-grained pre-aggregates before the finer-grained pre-aggregates in the dataset configuration. (e.g. “month” first, then “week”, then “day”)
How can I use my existing AQL measure in PreAggregate measure?
Currently, Holistics does not allow putting an AQL measure directly inside a `PreAggregate.measure` definition. Instead, you can pre-aggregate by "decomposing" that AQL measure.
For example, the `transactions` model has an AQL measure named `count_transactions`:
Model transactions {
measure count_transactions {
type: 'number'
definition: @aql count(transactions.id);;
}
}
If we try to use `transactions.count_transactions` in a PreAggregate, we will see an error:
Dataset ecommerce {
//... other settings
models: [transactions]
pre_aggregates: {
agg_transactions: PreAggregate {
// ...other settings
measure count_transactions {
for: ref('transactions', 'count_transactions') // ❌ DO NOT use this
aggregation_type: 'count'
}
persistence: FullPersistence {
schema: 'persisted'
}
},
}
}
Invalid PreAggregate: `count_transactions` uses "count" aggregation
but `transactions.count_transactions` is not a dimension.
Instead, we can decompose the measure into `count` + `ref('transactions', 'id')`:
Dataset ecommerce {
//... other settings
models: [transactions]
pre_aggregates: {
agg_transactions: PreAggregate {
// ...other settings
measure count_transactions {
for: ref('transactions', 'id') // ✅ DO use this
aggregation_type: 'count'
}
persistence: FullPersistence {
schema: 'persisted'
}
},
}
}
Decomposing measures and putting the base measures inside `PreAggregate` definitions will make your `PreAggregate` more robust.
For example, a `sum` PreAggregate can be used to answer queries for:
- The Sum itself
- Running Sum
- Sum with Period Comparison
- etc.
What if my AQL measure is composed of multiple AQL aggregations?
Answer: As long as you can decompose your AQL measure into the supported aggregation types, you can pre-aggregate it.
For example, if you have an AQL measure with the formula `@aql sum(model.a) * sum(model.b)`, you can decompose it into 2 measures in your Pre-Aggregate:
measure sum_a {
for: ref('model', 'a')
aggregation_type: 'sum'
}
measure sum_b {
for: ref('model', 'b')
aggregation_type: 'sum'
}
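At query time, the composed measure can then be recomputed from the additive parts, since each `sum` re-aggregates correctly on its own. Conceptually (table and dimension names are hypothetical):

```sql
-- sum(a) * sum(b) at the query's grain, rebuilt from the
-- pre-aggregated partial sums:
SELECT
  some_dimension,
  SUM(sum_a) * SUM(sum_b) AS measure_value
FROM persisted.agg_model
GROUP BY 1;
```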
What if my measure is a SQL measure?
We would strongly recommend converting your SQL measure into an AQL measure (i.e. using `@aql` syntax instead of `@sql` syntax):
// in Model definition
measure sum_x {
label: 'Sum of X'
type: 'number'
definition: @aql sum(model.x);;
}
// in PreAggregate definition
measure sum_x {
for: ref('model', 'x')
aggregation_type: 'sum'
}
(This is the recommended setup that we described earlier.)
Otherwise, you can still use `aggregation_type: 'custom'`, but Metric awareness features won't apply and Holistics will always treat your measure as a non-additive measure.
// in Model definition
measure sum_x {
label: 'Sum of X'
type: 'number'
definition: @sql SUM({{ x }});;
}
// in PreAggregate definition
measure sum_x {
for: ref('model', 'sum_x')
aggregation_type: 'custom'
}
How do I handle timezones when querying with pre-aggregated tables?
Currently, PreAggregates always use the organization's timezone.
Therefore, Aggregate Awareness only works with queries/explorations/dashboards that are also using the organization's timezone.
If the query/dashboard timezone is different from the organization's timezone, the engine won't be able to apply any PreAggregate.
In later releases of Aggregate Awareness, we will allow configuring the timezone of PreAggregates.
Other references
- Tutorial with more examples Using Aggregate Awareness.
- Discuss more on Community Post.
Aggregate Awareness is now in closed beta. Request beta access and try it out.