
Aggregate Awareness

Coming soon!

Aggregate Awareness is under development and will be coming soon!

While waiting for the beta launch of this feature, you can read these docs first.

Introduction

With Aggregate Awareness, Holistics helps you utilize summarized, clean data (transformed by tools like dbt, Airflow, BI tools, etc.) to dynamically answer data questions with just enough information. This improves performance and reduces query costs by avoiding the need to always query the raw tables.

Aggregate Awareness - Introduction

Glossary

Before we dive deeper into Aggregate Awareness, let's build a foundation and align on some basic terms:

  • Pre-aggregate is an aggregation performed and prepared before query time, producing a condensed version of the source data with just enough of the frequently needed information. This reduces the size of the data and improves query performance.
  • Pre-aggregated table is the output of the pre-aggregate process.
  • Aggregate Awareness is the capability of automatically identifying eligible pre-aggregates and substituting them into queries, making the queries use smaller pre-aggregated tables while still producing accurate results.
  • Granularity is a term that describes the level of detail or complexity of a set of data.
    • Coarse-grained data has less detail, giving a high-level overview, and is frequently used for quick analysis.
    • Fine-grained data, in contrast, has more detail and is frequently used for in-depth analysis or at the execution level.
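To make these granularity terms concrete, here is a minimal Python sketch (with hypothetical data, not Holistics APIs) that collapses a fine-grained event list into a coarse-grained count table:

```python
from collections import Counter

# Hypothetical fine-grained "raw" transactions: one row per event.
raw_transactions = [
    {"id": 1, "status": "paid",     "country": "US"},
    {"id": 2, "status": "paid",     "country": "US"},
    {"id": 3, "status": "refunded", "country": "VN"},
    {"id": 4, "status": "paid",     "country": "VN"},
]

# Pre-aggregate: collapse to one row per (status, country) with a count.
pre_aggregated = Counter((t["status"], t["country"]) for t in raw_transactions)

print(len(raw_transactions))           # 4 fine-grained rows
print(len(pre_aggregated))             # 3 coarse-grained rows
print(pre_aggregated[("paid", "US")])  # 2
```

The coarse-grained version answers "count by status and country" with fewer rows; on real tables the reduction is typically orders of magnitude larger.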

How it works

KNOWLEDGE CHECKPOINT

You should know the concepts of Data model and Dataset.

For Holistics Aggregate Awareness to know which pre-aggregate can be used in a query, there are two jobs to be done:

  • Define the Pre-Aggregate in the dataset to map between the raw tables/models and the pre-aggregated table.
  • Create the pre-aggregated table by using either:
    • External tools (like dbt, Airflow, etc.), or
    • The built-in persistence option in Holistics
How it works - Pre-aggregate

When your query would normally use the original tables, Holistics automatically picks the right pre-aggregated table that has enough information to answer the data question.

Therefore, instead of processing millions of records, the query only needs to scan far fewer, leading to faster results.

How it works - Query with pre-aggregate

Quick example

Common queries with large raw table

Assume we're working with the following raw_transactions table and a model on top of it, residing in an ecommerce dataset:

Quick example - Raw table
// ---------- Modeling Layer ----------

Dataset ecommerce {
  //... other dataset settings
  models: [transactions]
}

Model transactions {
  dimension id
  measure count_transactions { // a custom measure definition
    label: 'Count Transactions'
    type: 'number'
    definition: @aql count(transactions.id)
  }
  //... other fields' definitions

  table_name: 'raw_transactions'
}

// ---------- Database ----------

Table raw_transactions {
  id integer [PK]
  user_id integer [ref: > users.id]
  product_id integer [ref: > products.id]
  merchant_id integer [ref: > merchants.id]
  status varchar
  city varchar
  country varchar
  order_value double
  created_at timestamp
}

Given that, we know the common queries with this table are:

  • Count transactions by day, week, month
  • Count transactions by status, country, city

This table is relatively big (billions of records). So, we know these analytical queries would take time.

A good approach is to build a pre-aggregated table on the frequently used dimensions and measures to reduce the number of records:

Quick example - Create pre-aggregated table
// Create pre-aggregated table agg_transactions

SELECT
  created_at::date as created_at_day,
  status,
  country,
  city,
  COUNT(1) as count_transactions
FROM raw_transactions
GROUP BY 1, 2, 3, 4

// ---------- Database - Schema: `persisted` ----------

Table agg_transactions {
  created_at_day date
  status varchar
  country varchar
  city varchar
  count_transactions integer
}

Map the pre-aggregated table to its model to prepare Aggregate Awareness

After that, we use a pre-aggregate definition to specify which model dimensions and measures are equivalent to the respective pre-aggregated table fields:

Quick example - Pre-aggregate config
Dataset ecommerce {
  //... other dataset settings
  models: [transactions]

  // ----- BEGIN NEW CODE - Add pre-aggregate
  pre_aggregates: {
    agg_transactions: PreAggregate {
      // This dimension is in the pre-aggregated table `agg_transactions`
      dimension created_at_day {
        // This reference is in the `transactions` model
        for: ref('transactions', 'created_at'),
        time_granularity: "day"
      }
      dimension status {
        for: ref('transactions', 'status')
      }
      dimension country {
        for: ref('transactions', 'country')
      }
      dimension city {
        for: ref('transactions', 'city')
      }
      measure count_transactions {
        for: ref('transactions', 'id')
        aggregation_type: 'count'
      }
      persistence: FullPersistence {
        schema: 'persisted'
      }
    },
  }
  // ----- END NEW CODE
}

Automatically optimize the query by using the pre-aggregated table with just enough information

When querying and exploring data, analysts and explorers only work with a single transactions model. They shouldn't need to know about the underlying aggregation optimizations.

With Aggregate Awareness, Holistics can automatically choose the right underlying table to query based on the dimensions of the query. The mechanism in Holistics's Query Engine might conceptually look like the following:

Quick example - Query with pre-aggregate
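As a conceptual illustration only (not Holistics' actual engine code), the substitution idea can be sketched in Python. The table and field names come from the example above, and the matching logic is deliberately simplified; the real engine also handles joins, time granularity, and more:

```python
# Catalog of available pre-aggregates (hypothetical structure).
PRE_AGGREGATES = [
    {
        "table": "persisted.agg_transactions",
        "dimensions": {"created_at_day", "status", "country", "city"},
        "measures": {"count_transactions"},
    },
]

def choose_table(query_dims, query_measures, default="raw_transactions"):
    """Pick the first pre-aggregate whose dimensions and measures
    cover everything the query asks for; otherwise use the raw table."""
    for pre_agg in PRE_AGGREGATES:
        if query_dims <= pre_agg["dimensions"] and query_measures <= pre_agg["measures"]:
            return pre_agg["table"]  # eligible: substitute the smaller table
    return default  # no eligible pre-aggregate: fall back to the raw table

print(choose_table({"status", "country"}, {"count_transactions"}))
# -> persisted.agg_transactions
print(choose_table({"merchant_id"}, {"count_transactions"}))
# -> raw_transactions (merchant_id is not in the pre-aggregate)
```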

How to set up

To set up Aggregate Awareness in Holistics, follow these two steps:

  • Defining the pre-aggregate in the dataset
  • Providing the pre-aggregated tables by either:
    • Creating aggregated tables using Holistics Persistence, OR
    • Bringing in external pre-aggregated tables created in advance using data tools like dbt, Airflow, etc.

Define the pre-aggregate

In the dataset file, add the pre_aggregates config to map model dimensions and measures to the respective pre-aggregated table fields.

Dataset ecommerce {
  //... other dataset settings

  models: [transactions]

  // Add pre-aggregate
  pre_aggregates: {
    agg_transactions: PreAggregate {
      // This dimension is in the pre-aggregated table `agg_transactions`
      dimension created_at_day {
        // This reference is in the `transactions` model
        for: ref('transactions', 'created_at'),
        time_granularity: "day"
      }
      dimension status {
        for: ref('transactions', 'status')
      }
      dimension country {
        for: ref('transactions', 'country')
      }
      dimension city {
        for: ref('transactions', 'city')
      }
      measure count_transactions {
        // Reference the underlying dimension, not the AQL measure (see FAQs)
        for: ref('transactions', 'id')
        aggregation_type: 'count'
      }
      persistence: FullPersistence {
        schema: 'persisted'
      }
    },
  }
}

Creating aggregated tables using Holistics Persistence

The above setup assumes there are no physical tables created inside the data warehouse. In order for queries to use the aggregated table, an actual pre-aggregated table must be created.

Holistics supports the creation of this table through Holistics Persistence:

  • You specify a writeable schema in the data warehouse.
  • Holistics handles persisting and refreshing of tables behind the scenes.
pre_aggregates: {
  agg_transactions: PreAggregate {
    //... other pre-aggregate config

    // PERSISTENCE CONFIG
    persistence: FullPersistence {
      schema: 'persisted'
    }
  }
}
NOTES

Make sure that the schema you specify (e.g. persisted) exists in your data warehouse and that Holistics has permission to write to it.

Trigger creating table manually

To manually trigger the table creation (persistence process), go to the Pre-aggregates section in the Dataset's List tab, click the Run button, and wait for it to finish.

Setting up scheduled refreshes

When the original tables change, the pre-aggregates can become outdated. Instead of manually triggering the persistence via the UI, you can use a schedule trigger to define the pre-aggregate refresh cadence.

In Holistics, you can create a pre-aggregate-schedule.aml file to run the scheduling:

// IMPORTANT: Put this file in the root folder so it runs properly

const schedules = [
  Schedule {
    dataset: "ecommerce",
    // Leave the array empty to schedule all pre-aggregates in the ecommerce dataset
    pre_aggs: ['agg_transactions'],
    // Configure your cron expression, e.g., run at 20:15 every day
    cron: '15 20 * * *',
  },
]

Trigger via API (coming soon)

Alternatively, you will also be able to create and refresh pre-aggregated tables via the API (coming soon).

Working with pre-aggregated tables built externally

If you have a pipeline that builds the aggregated tables outside of Holistics (with tools like Airflow, dbt, etc.), you only need to map the table name to the pre-aggregate definition.

Use ExternalPersistence to let Holistics know you’re using an external table.

Dataset ecommerce {
  // ... other settings
  pre_aggregates: {
    agg_transactions: PreAggregate {
      //... other settings
      dimension created_at_day
      measure count_transactions

      // Map external pre-aggregated table with this option
      persistence: ExternalPersistence {
        table_name: 'persisted.agg_transactions'
      }
    },
  }
}

Use cases

In this section, we will explore the capabilities of Holistics Aggregate Awareness in various scenarios.

Dimension awareness

  • You can drill down into more detail or roll up to a higher-level category using one or many dimensions within the pre-aggregate.
  • For example, Holistics will automatically use the pre-aggregated table agg_transactions instead of the raw_transactions table to answer all of the queries below:
    • Count transactions by day and status
    • Count transactions by day, status, and country
    • Count transactions by day, status, country, and city
    • etc.
Use cases - Dimension awareness
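The roll-up behind this is easy to verify with a small Python sketch (hypothetical data): grouping the pre-aggregate by a subset of its dimensions gives the same counts as grouping the raw rows directly.

```python
from collections import Counter

# Hypothetical raw rows: (day, status, country, city)
raw = [
    ("2024-01-01", "paid",     "US", "NYC"),
    ("2024-01-01", "paid",     "US", "SF"),
    ("2024-01-01", "refunded", "VN", "Hanoi"),
    ("2024-01-02", "paid",     "VN", "Hanoi"),
]

# Pre-aggregate at (day, status, country, city)
agg = Counter(raw)

# "Count transactions by day and status": roll the pre-aggregate up
by_day_status = Counter()
for (day, status, country, city), n in agg.items():
    by_day_status[(day, status)] += n

# Same answer as counting from the raw rows directly
assert by_day_status == Counter((d, s) for d, s, _, _ in raw)
print(by_day_status[("2024-01-01", "paid")])  # 2
```

Because counts are additive, any query that uses a subset of the pre-aggregate's dimensions can be answered from the smaller table.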

Join-awareness based on foreign key

Aggregate Awareness is also smart enough to perform aggregations with relationship joins if the pre-aggregated table contains the foreign key.

For example, assume we pull user_id into the aggregated table, with a relationship to the users model.

Use cases - Join awareness - Add a foreign key to the pre-aggregated table
Dataset ecommerce {
  //... other settings
  models: [
    transactions,
    users // ADD THIS
  ]

  pre_aggregates: {
    agg_transactions: PreAggregate {
      //... other settings
      dimension created_at_day
      measure count_transactions

      // ADD THIS
      dimension user_id {
        for: ref('transactions', 'user_id')
      }

      persistence: FullPersistence {
        schema: 'persisted'
      }
    },
  }
}

//------------------------------------------------------------

Model transactions {
  //... other settings
  table_name: 'raw_transactions'
}

// ADD THIS
Model users {
  //... other settings
  dimension gender
  dimension age_group
  table_name: 'users'
}

Relationship: transactions.user_id > users.id;

The following queries will automatically use the pre-aggregated table and perform a join with users:

  • Count transactions by gender
  • Count transactions by country, age_group
Use cases - Join awareness - Query with pre-aggregate and join with foreign key to extend the dimension awareness capability
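A small Python/SQLite sketch (hypothetical data and schema, not the actual generated SQL) illustrates why keeping user_id in the pre-aggregated table is enough to answer gender-level questions via a join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, gender TEXT);
    INSERT INTO users VALUES (1, 'F'), (2, 'M');

    -- Pre-aggregated table: one row per (created_at_day, user_id)
    CREATE TABLE agg_transactions (
        created_at_day TEXT, user_id INTEGER, count_transactions INTEGER);
    INSERT INTO agg_transactions VALUES
        ('2024-01-01', 1, 3), ('2024-01-01', 2, 1), ('2024-01-02', 1, 2);
""")

# "Count transactions by gender": join the pre-aggregate to users
# and re-aggregate the additive count.
rows = con.execute("""
    SELECT u.gender, SUM(a.count_transactions)
    FROM agg_transactions a JOIN users u ON u.id = a.user_id
    GROUP BY u.gender ORDER BY u.gender
""").fetchall()
print(rows)  # [('F', 5), ('M', 1)]
```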

Time granularity awareness

Holistics can identify the pre-aggregated table at the right time granularity level based on the option that you configure in the pre-aggregate (minute, hour, day, week, month, quarter, year).

For example, in the quick example above, you can make use of the agg_transactions at the day level to get the total transactions count at the week or month level, etc.

Use cases - Time awareness - Re-aggregate from day to week

These queries will use the pre-aggregated table agg_transactions at the day level and take additional time to aggregate from day to week or month (although this is still faster than using the raw table).
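The day-to-week roll-up works because counts are additive. A small Python sketch (hypothetical data; ISO week numbering used for illustration, which may differ from your warehouse's week definition) shows the idea:

```python
import datetime
from collections import Counter

# Hypothetical day-level pre-aggregate: {day: count_transactions}
agg_day = {
    datetime.date(2024, 1, 1): 10,  # Monday, ISO week 1
    datetime.date(2024, 1, 2): 20,  # Tuesday, ISO week 1
    datetime.date(2024, 1, 8): 5,   # Monday, ISO week 2
}

# Roll the day-level counts up to (ISO year, ISO week) buckets.
agg_week = Counter()
for day, n in agg_day.items():
    agg_week[day.isocalendar()[:2]] += n

print(agg_week[(2024, 1)])  # 30
print(agg_week[(2024, 2)])  # 5
```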

However, if you frequently use data at the week level and want to optimize the query time further, try creating another pre-aggregated table at a coarser-grained granularity (e.g., agg_transactions_week).

Use cases - Time awareness - Create pre-aggregated table at week

Map it to the transactions model so that the engine can use it when possible:

Use cases - Time awareness - Update the pre-aggregate
Dataset ecommerce {
  //... other settings
  models: [
    transactions
  ]

  pre_aggregates: {
    // BEGIN NEW CODE
    agg_transactions_week: PreAggregate {
      dimension created_at_week {
        for: ref('transactions', 'created_at'),
        time_granularity: "week"
      }
      //... other pre-aggregate settings
    },
    // END NEW CODE

    agg_transactions: PreAggregate {
      //... pre-aggregate settings
    },
  }
}

Now, when you query the total weekly transactions count, Holistics will use the pre-aggregated table agg_transactions_week, leading to faster results than using the agg_transactions table.

Use cases - Time awareness- Query with pre-aggregate at week
NOTICE

In the current version, if there are multiple pre-aggregates eligible for a query, the (eligible) pre-aggregate that is defined first in the dataset will be chosen to substitute into the query.

We recommend that you put the coarser-grained pre-aggregates before the finer-grained pre-aggregates in the dataset configuration. (e.g. “month” first, then “week”, then “day”)

Measure types in pre-aggregate

KNOWLEDGE CHECKPOINT

Depending on the measure type used in the pre-aggregate, and whether or not you re-aggregate the measure in the current query, the pre-aggregated tables will be hit or missed.

Additive measures

  • Aggregate Awareness can re-aggregate these types of additive measures:

    • sum
    • count
    • min
    • max
  • It means that you can sum several sums, count several counts, etc., using pre-aggregated tables.
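A few lines of Python illustrate why these measures are safe to re-aggregate from partial results:

```python
# Additive measures can be re-aggregated from partial (pre-aggregated) results.
part1 = [1, 2, 3]
part2 = [10, 20]

assert sum([sum(part1), sum(part2)]) == sum(part1 + part2)  # sum of sums
assert len(part1) + len(part2) == len(part1 + part2)        # count of counts
assert min(min(part1), min(part2)) == min(part1 + part2)    # min of mins
assert max(max(part1), max(part2)) == max(part1 + part2)    # max of maxes
print("additive measures re-aggregate correctly")
```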

Non-additive measures

  • Pre-aggregated non-additive measures can only match queries that use the exact same dimensions/granularity. Here are some non-additive measures:
    • count distinct
    • median
    • stddev
    • variance
    • avg
  • In the case of avg, you can actually split it into two additive measures: sum and count. If your PreAggregate has sum and count measures, Aggregate Awareness can automatically pick up those measures to calculate the avg.
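A short Python sketch (hypothetical data) shows both why avg itself is non-additive and how the sum/count decomposition recovers the correct average:

```python
# avg is non-additive: an average of per-group averages is generally wrong.
# But avg = sum / count, and both sum and count ARE additive.
groups = {"US": [10.0, 20.0, 30.0], "VN": [5.0]}

# Wrong: averaging the per-group averages
avg_of_avgs = sum(sum(v) / len(v) for v in groups.values()) / len(groups)

# Right: re-aggregate the additive parts, then divide
total_sum = sum(sum(v) for v in groups.values())
total_count = sum(len(v) for v in groups.values())
true_avg = total_sum / total_count

print(avg_of_avgs)  # 12.5 (wrong)
print(true_avg)     # 16.25 (matches the average over all raw values)
```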

FAQs

Why is pre-aggregate defined inside a dataset? Can I just define it in the model?

PreAggregate is defined in the dataset because it can reference fields from multiple different models.
Thus, it needs to be defined in a dataset where it can use the dataset's relationships to join those models.

I expected the query to use the pre-aggregated table, but it ended up using the original one.

There are many reasons for this issue. Here are the most common reasons:

  • The query is finer-grained than the pre-aggregated tables that you defined in the dataset.
    • For example, you query Count total transactions by hour, but you only have pre-aggregated tables at the day and week levels, which are coarser-grained than hour. So the raw table will be used instead of the pre-aggregated table.
  • You defined the pre-aggregated tables at the right level of granularity, but you did not trigger the persistence, so your pre-aggregated tables are not there when you need them.

In any case, please check the Executed Query, which includes the exact reason and other debugging details.

I expected the query to use pre-aggregate at this level of granularity because it is more optimal, but it turned out to use another one.

In the current version, if there are multiple pre-aggregates eligible for a query, the (eligible) pre-aggregate that is defined first in the dataset will be chosen to substitute into the query.

We recommend that you put the coarser-grained pre-aggregates before the finer-grained pre-aggregates in the dataset configuration. (e.g. “month” first, then “week”, then “day”)

How can I use my existing AQL Measure in PreAggregate Measure?

Currently, Holistics does not support pre-aggregating an AQL measure. Instead, you can pre-aggregate the underlying dimension.

For example, the transactions model has an AQL measure named count_transactions:

Model transactions {
  measure count_transactions {
    type: 'number'
    definition: @aql count(transactions.id);;
  }
}

If we try to use transactions.count_transactions in a PreAggregate, we will see an error:

Dataset ecommerce {
  //... other settings
  models: [transactions]
  pre_aggregates: {
    agg_transactions: PreAggregate {
      // ...other settings
      measure count_transactions {
        for: ref('transactions', 'count_transactions') // ❌ DO NOT use this
        aggregation_type: 'count'
      }
      persistence: FullPersistence {
        schema: 'persisted'
      }
    },
  }
}
FAQ - How can I use my existing AQL measure in pre-aggregate measure?

Instead, we can pre-aggregate using the underlying dimension, which is transactions.id in this case:

Dataset ecommerce {
  //... other settings
  models: [transactions]
  pre_aggregates: {
    agg_transactions: PreAggregate {
      // ...other settings
      measure count_transactions {
        for: ref('transactions', 'id') // ✅ SHOULD use this
        aggregation_type: 'count'
      }
      persistence: FullPersistence {
        schema: 'persisted'
      }
    },
  }
}

What if my AQL Measure is composed of multiple AQL aggregations?

  • As mentioned above, Holistics currently only supports pre-aggregating the basic aggregation types listed in the Measure types section, so a measure composed of multiple AQL aggregations cannot be pre-aggregated directly.

How could I handle timezone when querying with pre-aggregated tables?

Currently, PreAggregates always use the organization's timezone.

Therefore, Aggregate Awareness only works with queries/explorations/dashboards that are also using the organization's timezone.
If the query/dashboard timezone is different from the organization's timezone, the engine won't be able to apply any PreAggregate.

In later releases of Aggregate Awareness, we will allow configuring the timezone of PreAggregates.

