Skip to main content

Using Aggregate Awareness

Important

Please refer to Aggregate Awareness to see the full documentations.

Introduction

To improve query performance and reduce query costs when reporting from large datasets or tables, data teams typically create materialized/persisted views (physical tables) of different aggregations.

This tutorial will show an example on how to create and utilize such persisted aggregations using Holistics' Aggregate Awareness.

Scenario

Let's say we have an ecommerce Dataset with these Relationships:

Use case: Sum of Item Quantity sold over Year

Without Aggregate Awareness

To find out the Sum of item quantity sold over Year, we can make this exploration:

  • Table Fields:
    • orders.created_at (transform: Year)
    • order_items.quantity (aggregation: Sum)

If we run the above Executed Query with EXPLAIN ANALYZE, we get this query execution analysis:

Some highlights: The Database has to

  • Aggregate on 54,783 rows
  • Use 793kB of memory
  • Take 62.797ms in total

Using Aggregate Awareness

1. Define the Pre-Aggregate

For our use case, we can define a Pre-Aggregate with:

  • 1 Dimension:
    • demo_orders.created_at (time_granularity: year)
  • 1 Measure:
    • demo_order_items.quantity (aggregation: sum)
  • Persistence:
    • FullPersistence. This is the most basic/straightforward persistence mode to use.
pre_aggregates: {
aggregated_quantity: PreAggregate {
dimension pa_created_at {
for: ref('demo_orders', 'created_at')
time_granularity: 'year'
}
measure pa_sum_quantity {
for: ref('demo_order_items', 'quantity')
aggregation_type: 'sum'
}
persistence: FullPersistence {
schema: 'persisted'
}
}
}

2. Persist the Pre-Aggregate

Before actually persisting the Pre-Aggregate, we need to create the database schema to store the going-to-be-persisted tables.
In many databases, we can create the schema using a simple SQL:

CREATE SCHEMA persisted;

The schema name that we choose is persisted. Make sure it matches the schema that you define in the persistence of your Pre-Aggregate.

Then, we need to trigger the persistence.
There are 3 ways to trigger a Pre-Aggregate Persistence in Holistics:

  • API
  • Schedules
  • UI

Let's use the UI because it is most convenient for this tutorial:

  1. Go to the List view of the Dataset
  2. Click the Run button on our Pre-Aggregate (aggregated_quantity)
  3. Confirm
  4. Wait for the persistence job to finish

3. Test the exploration

Now when we run the same exploration again, we will see that Holistics automatically uses the aggregated table!

If we run the new Executed Query with EXPLAIN ANALYZE, we get this query execution analysis:

We immediately got a huge performance boost: the Database only has to

  • Aggregate on 5 rows (10,000 times less data)
  • Use 40kB of memory (20 times less memory)
  • Take 0.276ms in total (> 200 times faster)

Use case: Sum of Item Quantity sold by Category over Year

Without Pre-Aggregate

Because of the new dimension demo_categories.name, Holistics cannot re-use our existing Pre-Aggregate aggregated_quantity, because aggregated_quantity has coarser granularity than our exploration.

If we run the above Executed Query with EXPLAIN ANALYZE, we get this query execution analysis:

Some highlights: The Database has to

  • Aggregate on 54,783 rows
  • Take 102.314ms in total

Using Aggregate Awareness

1. Define the Pre-Aggregate

In this case, we can update our existing Pre-Aggregate to support more dimensions.
If we look again at the Relationships, we would notice that Categories is on the one-side of the relationship with Products.
Therefore, we only need to add demo_products.id into our Pre-Aggregate and Holistics will take care of the rest!

pre_aggregates: {
aggregated_quantity: PreAggregate {
dimension pa_created_at {
for: ref('demo_orders', 'created_at')
time_granularity: 'year'
}
// BEGIN new codes
dimension pa_product_id {
for: ref('demo_products', 'id')
}
// END new codes
measure pa_sum_quantity {
for: ref('demo_order_items', 'quantity')
aggregation_type: 'sum'
}
persistence: FullPersistence {
schema: 'persisted'
}
}
}

2. Persist the Pre-Aggregate

Re-persist the Pre-Aggregate using the same steps as the first use case:

  1. Go to the List view of the Dataset
  2. Click the Run button on our Pre-Aggregate (aggregated_quantity)
  3. Confirm
  4. Wait for the persistence job to finish

3. Test the exploration

We see that Holistics is able to use our Pre-Aggregate!

If we run the above Executed Query with EXPLAIN ANALYZE, we get this query execution analysis:

The Database only has to

  • Aggregate on 15,289 rows (nearly 4 times less data)
  • Take 17.010ms in total (> 200 times faster)

4. Bonus!

As shown above, Holistics Aggregate Awareness is relationship-aware and can leverage the Dataset Relationships to make the most out of your Pre-Aggregates.

Using the exact same Pre-Aggregate aggregated_quantity that we defined earlier, we can also efficiently perform many other explorations such as:





without having to add dimensions like demo_categories.name, demo_products.name, or demo_merchants.name into the Pre-Aggregate!

Use case: Persist Pre-Aggregate using dbt

SQL Dialect Note

This example uses Postgresql SQL Dialect in persistence.table_name and in the dbt SQL.

1. Define Holistics Pre-Aggregate

To let Holistics know that you will be handling the persistence yourself, use ExternalPersistence:

pre_aggregates: {
aggregated_quantity: PreAggregate {
dimension pa_created_at {
for: ref('demo_orders', 'created_at')
time_granularity: 'year'
}
dimension pa_product_id {
for: ref('demo_products', 'id')
}
measure pa_sum_quantity {
for: ref('demo_order_items', 'quantity')
aggregation_type: 'sum'
}
// BEGIN new codes
persistence: ExternalPersistence {
table_name: '"persisted"."aggregated_quantity"'
}
// END new codes
}
}

2. Persists the Pre-Aggregate using dbt

{{ config(materialized='table', schema='persisted', alias='aggregated_quantity') }}
SELECT
DATE_TRUNC ( 'year', "demo_orders"."created_at" ) AS "pa_created_at",
"demo_products"."id" AS "pa_product_id",
SUM("demo_order_items"."quantity") AS "pa_sum_quantity"
FROM
"demo"."order_items" "demo_order_items"
LEFT JOIN "demo"."orders" "demo_orders" ON "demo_order_items"."order_id" = "demo_orders"."id"
LEFT JOIN "demo"."products" "demo_products" ON "demo_order_items"."product_id" = "demo_products"."id"
GROUP BY
1,
2

Use case: Sum of Item Quantity sold over Month

Earlier, we pre-aggregated on the time_granularity of year, so Holistics cannot use that pre-aggregate for aggregations on month granulariy.

1. Define the Pre-Aggregate

Now, to pre-aggregate for month, here are some options:

  1. Modify the time_granularity of aggregated_quantity to month
    • This makes the Pre-Aggregate less fast for year aggregations. However, it should still be very fast.
  2. Create a new Pre-Aggregate with time_granularity: 'month'
    • This makes it fast to do aggregations on both month and year. However, it will cost more storage to store the pre-aggregated data and more maintenance effort.
pre_aggregates: {
aggregated_quantity: PreAggregate {
dimension pa_created_at {
for: ref('demo_orders', 'created_at')
// BEGIN new codes
time_granularity: 'month'
// END new codes
}
measure pa_sum_quantity {
for: ref('demo_order_items', 'quantity')
aggregation_type: 'sum'
}
persistence: FullPersistence {
schema: 'persisted'
}
}
}

2. Persist the Pre-Aggregate

Re-persist the Pre-Aggregate using the same steps as the first use case:

  1. Go to the List view of the Dataset
  2. Click the Run button on our Pre-Aggregate (aggregated_quantity)
  3. Confirm
  4. Wait for the persistence job to finish

3. Test the "Create" option

Let's say we go with the option to "Create" a new Pre-Aggregate, this will be the result:

Holistics is also time-granularity-aware. Thus, it can use aggregated_quantity_by_month (that we just created) for aggregation on quarter granularity as well!

If we try aggregating on year again, Holistics can use the initial Pre-Aggregate aggregated_quantity:


Let us know what you think about this document :)