
Aggregate Awareness

Closed Beta

Aggregate Awareness is now in closed beta. Request Beta Access.

You can also read the announcement blog post

Introduction

Aggregate Awareness is a query performance optimization feature in Holistics. It inspects the queries sent from Holistics BI and intelligently rewrites them to use smaller, pre-aggregated tables instead of the raw tables.

This results in fewer rows scanned, improving both performance and cost while maintaining query accuracy.

The aggregated tables can be built either by Aggregate Awareness itself or by external transformation tools (dbt, Airflow, Dagster).

Glossary

Before we dive deeper into Aggregate Awareness, let's align on some basic terms:

  • Pre-aggregate is the aggregation performed and prepared prior to query time, producing a condensed version of the source data with just enough information for frequently used queries. This reduces the size of the data and improves query performance.
  • Pre-aggregated table is the output of the pre-aggregate process.
  • Aggregate Awareness is the capability of automatically identifying eligible pre-aggregates and substituting them into queries, making the queries use smaller pre-aggregated tables while still producing accurate results.
  • Granularity is a term that describes the level of detail or complexity of a set of data.
    • Coarse-grained (or low) granularity means the data set has less detail, providing a high-level overview that is frequently used for quick analysis.
    • Fine-grained (or high) granularity, in contrast, means the data set has more detail, and is frequently used for in-depth analysis or at the execution level.
    • For example:
      • month has coarser-grained (lower) granularity than day
      • (country, city) has finer-grained (higher) granularity than (country)

How It Works

KNOWLEDGE CHECKPOINT

You should know the concepts of Data Model and Dataset.

For Holistics Aggregate Awareness to know which pre-aggregate can be used in a query, two things need to be done:

  • Define the Pre-Aggregate in the dataset to map between the raw tables/models and the pre-aggregated table.
  • Create the pre-aggregated table using either:
    • External tools (like dbt, Airflow, etc.), or
    • The built-in persistence option in Holistics
How it works - Pre-aggregate

When your query would normally use the original tables, Holistics automatically picks a pre-aggregated table that has enough information to answer the data question.

Therefore, instead of processing millions (or billions) of records, the query only needs to scan a small fraction of them, leading to faster results.

How it works - Query with pre-aggregate

Quick Example

Background

Assume we're working with the following raw_transactions table and a model on top of it, residing in an ecommerce dataset:

Quick example - Raw table
// ---------- Modeling Layer ----------

Dataset ecommerce {
  //... other dataset settings
  models: [transactions]
}

Model transactions {
  dimension id
  measure count_transactions { // a custom measure definition
    label: 'Count Transactions'
    type: 'number'
    definition: @aql count(transactions.id);;
  }
  //... other fields' definitions

  table_name: 'raw_transactions'
}

// ---------- Database ----------

Table raw_transactions {
  id integer [PK]
  user_id integer [ref: > users.id]
  product_id integer [ref: > products.id]
  merchant_id integer [ref: > merchants.id]
  status varchar
  city varchar
  country varchar
  order_value double
  created_at timestamp
}

Given that, we know the common queries on this table are:

  • Count transactions by day, week, month
  • Count transactions by status, country, city

This table is relatively big (billions of records). So, we know these analytical queries would take time.
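
For illustration, one of these common queries against the raw table would look roughly like this (a sketch, assuming a PostgreSQL-style warehouse, matching the created_at::date cast used below) and would have to scan every record:

-- Count transactions by day, run directly against the big raw table (sketch)
SELECT
  created_at::date AS created_at_day,
  COUNT(1)         AS count_transactions
FROM raw_transactions
GROUP BY 1;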

Create Pre-aggregated Table

A good approach is to build a pre-aggregated table on the frequently used dimensions and measures to reduce the number of records:

Quick example - Create pre-aggregated table
// Create pre-aggregated table agg_transactions

SELECT
  created_at::date as created_at_day,
  status,
  country,
  city,
  COUNT(1) as count_transactions
FROM raw_transactions
GROUP BY 1, 2, 3, 4

// ---------- Database - Schema: `persisted` ----------

Table agg_transactions {
  created_at_day date
  status varchar
  country varchar
  city varchar
  count_transactions integer
}

Map Pre-aggregated Table with Model

After that, we use a pre-aggregate definition to map which model dimensions and measures are equivalent to the respective pre-aggregated table fields:

Quick example - Pre-aggregate config
Dataset ecommerce {
  //... other dataset settings
  models: [transactions]

  // ----- BEGIN NEW CODE - Add pre-aggregate
  pre_aggregates: {
    agg_transactions: PreAggregate {
      // This dimension is in the pre-aggregated table `agg_transactions`
      dimension created_at_day {
        // This reference is in the `transactions` model
        for: ref('transactions', 'created_at'),
        time_granularity: "day"
      }
      dimension status {
        for: ref('transactions', 'status')
      }
      dimension country {
        for: ref('transactions', 'country')
      }
      dimension city {
        for: ref('transactions', 'city')
      }
      measure count_transactions {
        for: ref('transactions', 'id')
        aggregation_type: 'count'
      }
      persistence: FullPersistence {
        schema: 'persisted'
      }
    },
  }
  // ----- END NEW CODE
}

The Results

When querying and exploring data, analysts and explorers only work with a single transactions model; they don't need any knowledge of the underlying aggregation optimizations.

With Aggregate Awareness, Holistics can automatically choose the right underlying table to query based on the dimensions of the query. The mechanism in Holistics’s Query Engine might be conceptually like the following:

Quick example - Query with pre-aggregate
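
For instance, for a question like "Count transactions by status", the engine could conceptually rewrite the generated SQL along these lines (a simplified sketch, not the literal SQL Holistics produces):

-- Without Aggregate Awareness: scan the large raw table
SELECT status, COUNT(1) AS count_transactions
FROM raw_transactions
GROUP BY 1;

-- With Aggregate Awareness: substitute the much smaller pre-aggregated table
-- and re-aggregate its partial counts
SELECT status, SUM(count_transactions) AS count_transactions
FROM persisted.agg_transactions
GROUP BY 1;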

Setting up

To set up Aggregate Awareness in Holistics, you can follow these two steps:

  • Define the pre-aggregate in the dataset
  • Provide the pre-aggregated tables by either:
    • Creating them with Holistics Persistence, OR
    • Bringing in external pre-aggregated tables built in advance with data tools like dbt, Airflow, etc.

Define the pre-aggregate

In the dataset file, add the pre_aggregates config to map model dimensions and measures to the respective pre-aggregated table fields.

Dataset ecommerce {
  //... other dataset settings

  models: [transactions]

  // Add pre-aggregate
  pre_aggregates: {
    agg_transactions: PreAggregate {
      // This dimension is in the pre-aggregated table `agg_transactions`
      dimension created_at_day {
        // This reference is in the `transactions` model
        for: ref('transactions', 'created_at'),
        time_granularity: "day"
      }
      dimension status {
        for: ref('transactions', 'status')
      }
      dimension country {
        for: ref('transactions', 'country')
      }
      dimension city {
        for: ref('transactions', 'city')
      }
      measure count_transactions {
        for: ref('transactions', 'id')
        aggregation_type: 'count'
      }
      persistence: FullPersistence {
        schema: 'persisted'
      }
    },
  }
}

Persisting the aggregated tables

The above setup assumes there are no physical tables created inside the data warehouse. In order for queries to use the aggregated table, an actual pre-aggregated table must be created.

Holistics supports the creation of this table through Holistics Persistence:

  • You specify a writeable schema in the data warehouse.
  • Holistics handles persisting and refreshing of tables behind the scenes.

pre_aggregates: {
  agg_transactions: PreAggregate {
    //... other pre-aggregate config

    // PERSISTENCE CONFIG
    persistence: FullPersistence {
      schema: 'persisted'
    }
  }
}
NOTES

Make sure that:

Creating table manually

To manually trigger the table creation (persistence process), go to the Dataset's Pre-aggregate section in the List tab, click the Run button, and wait for it to finish.

Setting up scheduled refreshes

When the original tables change, the pre-aggregate can become outdated. Instead of manually triggering the persistence from the UI, you can set up a schedule to define the pre-aggregate refresh cadence.

In Holistics, you can create a schedules.aml file at the root of your AML project:

const schedules = [
  // Schedule for specific PreAggregates in a Dataset
  PreAggregateSchedule {
    cron: '15 20 * * *'
    object: ecommerce_dataset
    pre_aggregates: ['agg_transactions'] // persist the PreAggregate 'agg_transactions' only
  }

  // Schedule for all PreAggregates in a Dataset
  PreAggregateSchedule {
    cron: '15 20 * * *'
    object: ecommerce_dataset
  }
]

You can find the full syntax reference here.

Trigger via API (coming soon)

Alternatively, you will also be able to trigger the creation and refresh of pre-aggregated tables via an API (coming soon).

Working with externally-built tables

If you have a pipeline that builds the aggregated tables outside of Holistics (with tools like Airflow, dbt, etc.), you only need to map the table name to the pre-aggregate definition.

Use ExternalPersistence to let Holistics know you’re using an external table.

Dataset ecommerce {
  // ... other settings
  pre_aggregates: {
    agg_transactions: PreAggregate {
      //... other settings
      dimension created_at_day
      measure count_transactions

      // Map external pre-aggregated table with this option
      persistence: ExternalPersistence {
        table_name: 'persisted.agg_transactions'
      }
    },
  }
}

Capabilities

In this section, we will explore the capabilities of Holistics Aggregate Awareness in various scenarios.

Dimension Awareness

  • Basically, you can drill down into more detail or roll up to a higher-level category for an overview, using one or many dimensions within the pre-aggregate.
  • For example, Holistics will automatically use the pre-aggregated table agg_transactions instead of the raw_transactions table to answer all of the queries below:
    • Count transactions by day and status
    • Count transactions by day, status, and country
    • Count transactions by day, status, country, and city
    • etc.
Use cases - Dimension awareness
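
Conceptually, all of these queries re-aggregate the same pre-aggregated table over different subsets of its dimensions (a simplified sketch, not the literal generated SQL):

-- Count transactions by day and status (sketch)
SELECT created_at_day, status, SUM(count_transactions) AS count_transactions
FROM persisted.agg_transactions
GROUP BY 1, 2;

-- Count transactions by day, status, and country (sketch)
SELECT created_at_day, status, country, SUM(count_transactions) AS count_transactions
FROM persisted.agg_transactions
GROUP BY 1, 2, 3;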

Join Awareness (Smart Join Substitution)

Aggregate Awareness is also smart enough to perform aggregations with relationship joins if the Pre-Aggregate

  • contains the foreign key of a Many-to-One relationship
  • and is on the "Many" side of that relationship

For example, let's say in our Pre-Aggregate, we add the dimension transactions.user_id which has a Many-To-One relationship to the users model.

Use cases - Join awareness - Add a foreign key to the pre-aggregated table
Dataset ecommerce {
  //... other settings
  models: [
    transactions,
    users
  ]

  pre_aggregates: {
    agg_transactions: PreAggregate {
      //... other settings
      dimension created_at_day
      measure count_transactions

      // ADD THIS
      dimension user_id {
        for: ref('transactions', 'user_id')
      }

      persistence: FullPersistence {
        schema: 'persisted'
      }
    },
  }
}

//------------------------------------------------------------
// Existing setup

Model transactions {
  //... other settings
  table_name: 'raw_transactions'
}

Model users {
  //... other settings
  dimension gender
  dimension age_group
  table_name: 'users'
}

Relationship: transactions.user_id > users.id;

The following queries will automatically use the pre-aggregated table and perform a join with users:

  • Count transactions by gender
  • Count transactions by country, age_group

even when your pre-aggregate does not contain the gender or age_group dimensions!

Use cases - Join awareness - Query with pre-aggregate and join with foreign key to extend the dimension awareness capability
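
Conceptually, the engine can serve these queries by joining the pre-aggregated table to the users table through the user_id column, roughly like this (a simplified sketch, not the literal SQL Holistics generates):

-- Count transactions by gender, answered from the pre-aggregate (sketch)
SELECT u.gender, SUM(t.count_transactions) AS count_transactions
FROM persisted.agg_transactions AS t
JOIN users AS u ON t.user_id = u.id
GROUP BY 1;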

Time Granularity Awareness

Holistics can identify the pre-aggregated table at the right time granularity level based on the option that you configure in the pre-aggregate (minute, hour, day, week, month, quarter, year).

For example, in the quick example above, you can make use of agg_transactions at the day level to get the total transaction count at the week or month level, etc.

Use cases - Time awareness - Re-aggregate from day to week

These queries would use the pre-aggregated table agg_transactions at the day level and take additional time to re-aggregate from day to week or month (although this is still faster than using the raw table).
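
Conceptually, the re-aggregation from day to week looks like this (a simplified sketch; the exact date-truncation function depends on your database):

-- Count transactions by week, re-aggregated from the day-level pre-aggregate (sketch)
SELECT
  DATE_TRUNC('week', created_at_day) AS created_at_week,
  SUM(count_transactions)            AS count_transactions
FROM persisted.agg_transactions
GROUP BY 1;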

However, if you frequently use data at the week level and want to optimize query time further, you can create another pre-aggregated table at a coarser granularity (e.g., agg_transactions_week).

Use cases - Time awareness - Create pre-aggregated table at week
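
Whether it is built by Holistics Persistence or by an external tool, the week-level table would conceptually be produced by a query like this (a sketch; DATE_TRUNC syntax varies by database):

-- Create pre-aggregated table agg_transactions_week (sketch)
SELECT
  DATE_TRUNC('week', created_at)::date AS created_at_week,
  status,
  country,
  city,
  COUNT(1) AS count_transactions
FROM raw_transactions
GROUP BY 1, 2, 3, 4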

Map it to the transactions model so that the engine can use it when possible:

Use cases - Time awareness - Update the pre-aggregate
Dataset ecommerce {
  //... other settings
  models: [
    transactions
  ]

  pre_aggregates: {
    // BEGIN NEW CODE
    agg_transactions_week: PreAggregate {
      dimension created_at_week {
        for: ref('transactions', 'created_at'),
        time_granularity: "week"
      }
      //... other pre-aggregate settings
    },
    // END NEW CODE

    agg_transactions: PreAggregate {
      //... pre-aggregate settings
    },
  }
}

From now on, when you query the total weekly transaction count, it will use the pre-aggregated table agg_transactions_week, leading to faster results than using the agg_transactions table.

Use cases - Time awareness- Query with pre-aggregate at week
NOTICE

In the current version, if there are multiple pre-aggregates eligible for a query, the (eligible) pre-aggregate that is defined first in the dataset will be chosen to substitute into the query.

We recommend that you put the coarser-grained pre-aggregates before the finer-grained pre-aggregates in the dataset configuration. (e.g. “month” first, then “week”, then “day”)

Measure Types in Pre-aggregates

KNOWLEDGE CHECKPOINT

Depending on the measure type used in the pre-aggregate and whether or not the measure needs to be re-aggregated in the current query, the pre-aggregated tables will be hit or missed.

Additive measures

  • Aggregate Awareness can re-aggregate these types of additive measures:

    • sum
    • count
    • min
    • max
  • For example, a pre-aggregate of "Sum Revenue by Month" can also be used for the query "Sum Revenue by Year".
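
This works because additive partial aggregates can simply be combined again: yearly revenue is just the sum of the monthly sums. A simplified sketch, assuming a hypothetical agg_revenue_month table with a sum_revenue column:

-- Re-aggregate an additive measure from month to year (sketch)
SELECT
  DATE_TRUNC('year', revenue_month) AS revenue_year,
  SUM(sum_revenue)                  AS sum_revenue
FROM persisted.agg_revenue_month
GROUP BY 1;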

Non-additive measures

  • Pre-aggregated non-additive measures can only match queries that use the exact same dimensions/granularity. Here are some non-additive measures:
    • count distinct
    • median
    • stddev
    • variance
    • avg
  • For example, a pre-aggregate of "Count Distinct User by Month" cannot be used for the query "Count Distinct User by Year".
  • In the case of avg, you can actually split (decompose) it into two additive measures: sum and count. If your PreAggregate has sum and count measures, Aggregate Awareness can automatically pick up those measures to calculate the avg.
// For `avg`

// ❗ Instead of pre-aggregating `avg` directly
measure avg_a {
  for: ref('model', 'a')
  aggregation_type: 'avg'
}

// ✅ Pre-aggregate on `sum` and `count`
measure sum_a {
  for: ref('model', 'a')
  aggregation_type: 'sum'
}
measure count_a {
  for: ref('model', 'a')
  aggregation_type: 'count'
}
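
With sum_a and count_a in the pre-aggregate, the average can then be reconstructed at query time, conceptually like this (a simplified sketch; agg_model is a hypothetical persisted table name):

-- avg(a) reconstructed from the decomposed additive measures (sketch)
SELECT SUM(sum_a) / NULLIF(SUM(count_a), 0) AS avg_a
FROM persisted.agg_model;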

Holistics Persistence

INFO

This section provides more details about Holistics Persistence (FullPersistence and IncrementalPersistence).

  • To learn about setting up Holistics Persistence, please refer to the section above.
  • These details are not relevant to ExternalPersistence (ref).

Configuring Holistics Persistence

PreAggregate supports configuring persistence using FullPersistence and IncrementalPersistence.

In fact, these persistence configs are the same as those used for Query Model Persistence.
Thus, you can refer to the Query Model Persistence doc sections to configure the Holistics Persistence of your Pre-Aggregates.

Uniqueness of Persisted Tables

Holistics generates a key for each Pre-Aggregate based on its semantics. When running a persistence, Holistics associates that key with the persisted table.

The semantics of a Pre-Aggregate includes (but is not limited to):

  • Name
  • Dimensions (including their SQL/AQL definitions)
  • Measures (including their SQL/AQL definitions)
  • Dataset's Relationships (that are relevant to the Dimensions and Measures)
  • Dataset's Data Source

Therefore, if you make semantic changes to your Pre-Aggregate (for example, by changing a Dimension's definition in its Model), the key changes and Holistics will no longer use the old persisted tables of that Pre-Aggregate.

Additional notes:

  • Pre-Aggregates defined in different git branches of your project can re-use the same persisted tables if (and only if) they have the same semantics.

Cleanup for Persisted Tables

Persisted Tables are automatically cleaned up from your Database after a period of time. That period of time is called the Time-To-Live (TTL).

The TTL is refreshed whenever:

  • The persisted table's data is updated.
  • The persisted table is used in a query.

The TTL is 7 days by default.
You can configure it in your Holistics Settings (at Pre-Aggregate Persistence TTL):


Tip: Leveraging AML Reusability

It is a common need to create different PreAggregates for different time granularities so that you can configure more efficient persistence pipelines.

To conveniently generate multiple PreAggregates for different time granularities, you can leverage AML Reusability.

Build multiple PreAggregates using AML Extend is a great example to get you started!

FAQs

Why is pre-aggregate defined inside a dataset? Can I just define it in the model?

PreAggregate is defined in the dataset because it can reference fields from multiple different models.
Thus, it needs to be defined in a dataset where it can use the dataset's relationships to join those models.

I expected the query to use the pre-aggregated table, but it ended up using the original one.

There are many possible reasons for this. Here are the most common:

  • The query is at a finer-grained granularity than the pre-aggregated tables that you defined in the dataset.
    • For example, you query Count total transactions by hour, but you only have pre-aggregated tables at the day and week levels, which are coarser-grained (lower granularity) than hour. So the raw table will be used instead of the pre-aggregated table.
  • You defined the pre-aggregated tables at the right level of granularity, but you did not trigger the persistence, so the pre-aggregated tables do not exist when the query needs them.

In any case, please check the Executed Query, which includes the exact reason and other debugging details.

I expected the query to use pre-aggregate at this level of granularity because it is more optimal, but it turned out to use another one.

In the current version, if there are multiple pre-aggregates eligible for a query, the (eligible) pre-aggregate that is defined first in the dataset will be chosen to substitute into the query.

We recommend that you put the coarser-grained pre-aggregates before the finer-grained pre-aggregates in the dataset configuration. (e.g. “month” first, then “week”, then “day”)

How can I use my existing AQL measure in PreAggregate measure?

Currently, Holistics does not allow putting an AQL measure directly inside a PreAggregate.measure definition. Instead, you can pre-aggregate by "decomposing" that AQL measure.

For example, the transactions model has an AQL measure named count_transactions:

Model transactions {
  measure count_transactions {
    type: 'number'
    definition: @aql count(transactions.id);;
  }
}

If we try to use transactions.count_transactions in a PreAggregate, we will see an error:

Dataset ecommerce {
  //... other settings
  models: [transactions]
  pre_aggregates: {
    agg_transactions: PreAggregate {
      // ...other settings
      measure count_transactions {
        for: ref('transactions', 'count_transactions') // ❌ DO NOT use this
        aggregation_type: 'count'
      }
      persistence: FullPersistence {
        schema: 'persisted'
      }
    },
  }
}
FAQ - How can I use my existing AQL measure in pre-aggregate measure?
Invalid PreAggregate: `count_transactions` uses "count" aggregation
but `transactions.count_transactions` is not a dimension.

Instead, we can decompose the measure into count + ref('transactions', 'id'):

Dataset ecommerce {
  //... other settings
  models: [transactions]
  pre_aggregates: {
    agg_transactions: PreAggregate {
      // ...other settings
      measure count_transactions {
        for: ref('transactions', 'id') // ✅ DO use this
        aggregation_type: 'count'
      }
      persistence: FullPersistence {
        schema: 'persisted'
      }
    },
  }
}

Decomposing Measures and putting the base measures inside PreAggregate definitions will make your PreAggregate more robust.
For example, a sum PreAggregate can be used to answer queries for:

  • The Sum itself
  • Running Sum
  • Sum with Period Comparison
  • etc.

What if my AQL measure is composed of multiple AQL aggregations?

Answer: As long as you can decompose your AQL measure into the supported aggregation types, you can pre-aggregate it.

For example, if you have an AQL measure with the formula @aql sum(model.a) * sum(model.b), you can decompose it and create two measures in your Pre-Aggregate:

measure sum_a {
  for: ref('model', 'a')
  aggregation_type: 'sum'
}
measure sum_b {
  for: ref('model', 'b')
  aggregation_type: 'sum'
}
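
At query time, the original formula can then be recomputed from the two pre-aggregated sums, conceptually like this (a simplified sketch; agg_model is a hypothetical persisted table name):

-- sum(model.a) * sum(model.b), recomputed from the decomposed sums (sketch)
SELECT SUM(sum_a) * SUM(sum_b) AS composite_measure
FROM persisted.agg_model;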

What if my measure is a SQL measure?

We strongly recommend converting your SQL measure into an AQL measure (i.e. using @aql syntax instead of @sql syntax):

// in Model definition
measure sum_x {
  label: 'Sum of X'
  type: 'number'
  definition: @aql sum(model.x);;
}

// in PreAggregate definition
measure sum_x {
  for: ref('model', 'x')
  aggregation_type: 'sum'
}

(This is the recommended setup that we described earlier.)

Otherwise, you can still use aggregation_type: 'custom', but Metric awareness features won't apply and Holistics will always treat your measure as a non-additive measure.

// in Model definition
measure sum_x {
  label: 'Sum of X'
  type: 'number'
  definition: @sql SUM({{ x }});;
}

// in PreAggregate definition
measure sum_x {
  for: ref('model', 'sum_x')
  aggregation_type: 'custom'
}

How should I handle timezones when querying with pre-aggregated tables?

Currently, PreAggregates always use the organization's timezone.

Therefore, Aggregate Awareness only works with queries/explorations/dashboards that are also using the organization's timezone.
If the query/dashboard timezone is different from the organization's timezone, the engine won't be able to apply any PreAggregate.

In later releases of Aggregate Awareness, we will allow configuring the timezone of PreAggregates.

Other references

Closed Beta

Aggregate Awareness is now in closed beta. Request beta access and try it out.

