Using Aggregate Awareness

Important

Please refer to Aggregate Awareness to see the full documentations.

Introduction

To improve query performance and reduce query costs when reporting from large datasets or tables, data teams typically create materialized/persisted views (physical tables) of different aggregations.

This tutorial will show an example on how to create and utilize such persisted aggregations using Holistics' Aggregate Awareness.

Scenario

Let's say we have an ecommerce Dataset with these Relationships:

Relationship Diagram
Dataset codes

Dataset ecommerce {
  __engine__: 'aql'
  label: 'Ecommerce'
  description: ''
  data_source_name: 'hlite_demo'
  models: [
    demo_order_items,
    demo_orders,
    demo_products,
    demo_merchants,
    demo_countries,
    demo_cities,
    demo_categories,
    demo_users
  ]
  relationships: [
    relationship(demo_order_items.product_id > demo_products.id, true)
    ,
    relationship(demo_order_items.order_id > demo_orders.id, true)
    ,
    relationship(demo_orders.user_id > demo_users.id, true)
    ,
    relationship(demo_users.city_id > demo_cities.id, true)
    ,
    relationship(demo_cities.country_code > demo_countries.code, true)
    ,
    relationship(demo_products.merchant_id > demo_merchants.id, true)
    ,
    relationship(demo_products.category_id > demo_categories.id, true)
  ]
  owner: '[email protected]'
}

Use case: Sum of Item Quantity sold over Year

Without Aggregate Awareness

To find out the Sum of item quantity sold over Year, we can make this exploration:

Table Fields:
- orders.created_at (transform: Year)
- order_items.quantity (aggregation: Sum)

If we run the above Executed Query with EXPLAIN ANALYZE, we get this query execution analysis:

Analysis
Query

EXPLAIN ANALYZE
SELECT
  TO_CHAR((CAST ( (DATE_TRUNC ( 'year', (CAST ( "demo_orders"."created_at" AS timestamptz )) AT TIME ZONE 'Europe/London' )) AT TIME ZONE 'Europe/London' AS timestamptz )) AT TIME ZONE 'Europe/London', 'YYYY-MM-DD HH24:MI:SS.US') AS "dy_do_ca_01c493",
  SUM("demo_order_items"."quantity") AS "s_doi_q_df1043"
FROM
  "demo"."order_items" "demo_order_items"
  LEFT JOIN "demo"."orders" "demo_orders" ON "demo_order_items"."order_id" = "demo_orders"."id"
GROUP BY
  1

Some highlights: The Database has to

Aggregate on 54,783 rows
Use 793kB of memory
Take 62.797ms in total

Using Aggregate Awareness

1. Define the Pre-Aggregate

For our use case, we can define a Pre-Aggregate with:

1 Dimension:
- demo_orders.created_at (time_granularity: year)
1 Measure:
- demo_order_items.quantity (aggregation: sum)
Persistence:
- FullPersistence. This is the most basic/straightforward persistence mode to use.

AML codes
Review

pre_aggregates: {
  aggregated_quantity: PreAggregate {
    dimension pa_created_at {
      for: ref('demo_orders', 'created_at')
      time_granularity: 'year'
    }
    measure pa_sum_quantity {
      for: ref('demo_order_items', 'quantity')
      aggregation_type: 'sum'
    }
    persistence: FullPersistence {
      schema: 'persisted'
    }
  }
}

2. Persist the Pre-Aggregate

Before actually persisting the Pre-Aggregate, we need to create the database schema to store the going-to-be-persisted tables.
In many databases, we can create the schema using a simple SQL:

CREATE SCHEMA persisted;

The schema name that we choose is persisted. Make sure it matches the schema that you define in the persistence of your Pre-Aggregate.

Then, we need to trigger the persistence.
There are 3 ways to trigger a Pre-Aggregate Persistence in Holistics:

API
Schedules
UI

Let's use the UI because it is most convenient for this tutorial:

Go to the List view of the Dataset
Click the Run button on our Pre-Aggregate (aggregated_quantity)
Confirm
Wait for the persistence job to finish

3. Test the exploration

Now when we run the same exploration again, we will see that Holistics automatically uses the aggregated table!

If we run the new Executed Query with EXPLAIN ANALYZE, we get this query execution analysis:

Analysis
Query

EXPLAIN ANALYZE
SELECT
  TO_CHAR((CAST ( "aggregated_quantity"."pa_created_at" AS timestamptz )) AT TIME ZONE 'Europe/London', 'YYYY-MM-DD HH24:MI:SS.US') AS "dy_do_ca_01c493",
  MAX("aggregated_quantity"."pa_sum_quantity") AS "s_doi_q_df1043"
FROM
  "persisted"."HPA_8d3841d4cbe47c86:f0c246a01792a43a_T1716200122" "aggregated_quantity"
GROUP BY
  1

We immediately got a huge performance boost: the Database only has to

Aggregate on 5 rows (10,000 times less data)
Use 40kB of memory (20 times less memory)
Take 0.276ms in total (> 200 times faster)

Use case: Sum of Item Quantity sold by Category over Year

Without Pre-Aggregate

Because of the new dimension demo_categories.name, Holistics cannot re-use our existing Pre-Aggregate aggregated_quantity, because aggregated_quantity has coarser granularity than our exploration.

If we run the above Executed Query with EXPLAIN ANALYZE, we get this query execution analysis:

Analysis
Query

EXPLAIN ANALYZE
SELECT
  TO_CHAR((CAST ( (DATE_TRUNC ( 'year', (CAST ( "demo_orders"."created_at" AS timestamptz )) AT TIME ZONE 'Europe/London' )) AT TIME ZONE 'Europe/London' AS timestamptz )) AT TIME ZONE 'Europe/London', 'YYYY-MM-DD HH24:MI:SS.US') AS "dy_do_ca_01c493",
  "demo_categories"."name" AS "dc_n_8a4e8a",
  SUM("demo_order_items"."quantity") AS "s_doi_q_df1043"
FROM
  "demo"."order_items" "demo_order_items"
  LEFT JOIN "demo"."orders" "demo_orders" ON "demo_order_items"."order_id" = "demo_orders"."id"
  LEFT JOIN "demo"."products" "demo_products" ON "demo_order_items"."product_id" = "demo_products"."id"
  LEFT JOIN "demo"."categories" "demo_categories" ON "demo_products"."category_id" = "demo_categories"."id"
GROUP BY
  1,
  2

Some highlights: The Database has to

Aggregate on 54,783 rows
Take 102.314ms in total

Using Aggregate Awareness

1. Define the Pre-Aggregate

In this case, we can update our existing Pre-Aggregate to support more dimensions.
If we look again at the Relationships, we would notice that Categories is on the one-side of the relationship with Products.
Therefore, we only need to add demo_products.id into our Pre-Aggregate and Holistics will take care of the rest!

pre_aggregates: {
  aggregated_quantity: PreAggregate {
    dimension pa_created_at {
      for: ref('demo_orders', 'created_at')
      time_granularity: 'year'
    }
    // BEGIN new codes
    dimension pa_product_id {
      for: ref('demo_products', 'id')
    }
    // END new codes
    measure pa_sum_quantity {
      for: ref('demo_order_items', 'quantity')
      aggregation_type: 'sum'
    }
    persistence: FullPersistence {
      schema: 'persisted'
    }
  }
}

2. Persist the Pre-Aggregate

Re-persist the Pre-Aggregate using the same steps as the first use case:

Go to the List view of the Dataset
Click the Run button on our Pre-Aggregate (aggregated_quantity)
Confirm
Wait for the persistence job to finish

3. Test the exploration

We see that Holistics is able to use our Pre-Aggregate!

If we run the above Executed Query with EXPLAIN ANALYZE, we get this query execution analysis:

Analysis
Query

EXPLAIN ANALYZE
SELECT
  TO_CHAR((CAST ( "aggregated_quantity"."pa_created_at" AS timestamptz )) AT TIME ZONE 'Europe/London', 'YYYY-MM-DD HH24:MI:SS.US') AS "dy_do_ca_01c493",
  "demo_categories"."name" AS "dc_n_8a4e8a",
  SUM("aggregated_quantity"."pa_sum_quantity") AS "s_doi_q_df1043"
FROM
  "persisted"."HPA_8d3841d4cbe47c86:2f77e1908c1b7df2_T1716202922" "aggregated_quantity"
  LEFT JOIN "demo"."products" "demo_products" ON "aggregated_quantity"."pa_product_id" = "demo_products"."id"
  LEFT JOIN "demo"."categories" "demo_categories" ON "demo_products"."category_id" = "demo_categories"."id"
GROUP BY
  1,
  2

The Database only has to

Aggregate on 15,289 rows (nearly 4 times less data)
Take 17.010ms in total (> 200 times faster)

4. Bonus!

As shown above, Holistics Aggregate Awareness is relationship-aware and can leverage the Dataset Relationships to make the most out of your Pre-Aggregates.

Using the exact same Pre-Aggregate aggregated_quantity that we defined earlier, we can also efficiently perform many other explorations such as:

without having to add dimensions like demo_categories.name, demo_products.name, or demo_merchants.name into the Pre-Aggregate!

Use case: Persist Pre-Aggregate using dbt

SQL Dialect Note

This example uses Postgresql SQL Dialect in persistence.table_name and in the dbt SQL.

1. Define Holistics Pre-Aggregate

To let Holistics know that you will be handling the persistence yourself, use ExternalPersistence:

pre_aggregates: {
  aggregated_quantity: PreAggregate {
    dimension pa_created_at {
      for: ref('demo_orders', 'created_at')
      time_granularity: 'year'
    }
    dimension pa_product_id {
      for: ref('demo_products', 'id')
    }
    measure pa_sum_quantity {
      for: ref('demo_order_items', 'quantity')
      aggregation_type: 'sum'
    }
    // BEGIN new codes
    persistence: ExternalPersistence {
      table_name: '"persisted"."aggregated_quantity"'
    }
    // END new codes
  }
}

2. Persists the Pre-Aggregate using dbt

{{ config(materialized='table', schema='persisted', alias='aggregated_quantity') }}
SELECT
  DATE_TRUNC ( 'year', "demo_orders"."created_at" ) AS "pa_created_at",
  "demo_products"."id" AS "pa_product_id",
  SUM("demo_order_items"."quantity") AS "pa_sum_quantity"
FROM
  "demo"."order_items" "demo_order_items"
  LEFT JOIN "demo"."orders" "demo_orders" ON "demo_order_items"."order_id" = "demo_orders"."id"
  LEFT JOIN "demo"."products" "demo_products" ON "demo_order_items"."product_id" = "demo_products"."id"
GROUP BY
1,
2

Use case: Sum of Item Quantity sold over Month

Earlier, we pre-aggregated on the time_granularity of year, so Holistics cannot use that pre-aggregate for aggregations on month granulariy.

1. Define the Pre-Aggregate

Now, to pre-aggregate for month, here are some options:

Modify the time_granularity of aggregated_quantity to month
- This makes the Pre-Aggregate less fast for year aggregations. However, it should still be very fast.
Create a new Pre-Aggregate with time_granularity: 'month'
- This makes it fast to do aggregations on both month and year. However, it will cost more storage to store the pre-aggregated data and more maintenance effort.

Modify
Create

pre_aggregates: {
  aggregated_quantity: PreAggregate {
    dimension pa_created_at {
      for: ref('demo_orders', 'created_at')
      // BEGIN new codes
      time_granularity: 'month'
      // END new codes
    }
    measure pa_sum_quantity {
      for: ref('demo_order_items', 'quantity')
      aggregation_type: 'sum'
    }
    persistence: FullPersistence {
      schema: 'persisted'
    }
  }
}

pre_aggregates: {
  aggregated_quantity: PreAggregate {
    dimension pa_created_at {
      for: ref('demo_orders', 'created_at')
      time_granularity: 'year'
    }
    measure pa_sum_quantity {
      for: ref('demo_order_items', 'quantity')
      aggregation_type: 'sum'
    }
    persistence: FullPersistence {
      schema: 'persisted'
    }
  }
  // BEGIN new codes
  aggregated_quantity_by_month: PreAggregate {
    dimension pa_created_at {
      for: ref('demo_orders', 'created_at')
      time_granularity: 'month'
    }
    measure pa_sum_quantity {
      for: ref('demo_order_items', 'quantity')
      aggregation_type: 'sum'
    }
    persistence: FullPersistence {
      schema: 'persisted'
    }
  }
  // END new codes
}

2. Persist the Pre-Aggregate

Re-persist the Pre-Aggregate using the same steps as the first use case:

Go to the List view of the Dataset
Click the Run button on our Pre-Aggregate (aggregated_quantity)
Confirm
Wait for the persistence job to finish

3. Test the "Create" option

Let's say we go with the option to "Create" a new Pre-Aggregate, this will be the result:

Holistics is also time-granularity-aware. Thus, it can use aggregated_quantity_by_month (that we just created) for aggregation on quarter granularity as well!

If we try aggregating on year again, Holistics can use the initial Pre-Aggregate aggregated_quantity:

Introduction​

Scenario​

Use case: Sum of Item Quantity sold over Year​

Without Aggregate Awareness​

Using Aggregate Awareness​

1. Define the Pre-Aggregate​

2. Persist the Pre-Aggregate​

3. Test the exploration​

Use case: Sum of Item Quantity sold by Category over Year​

Without Pre-Aggregate​

Using Aggregate Awareness​

1. Define the Pre-Aggregate​

2. Persist the Pre-Aggregate​

3. Test the exploration​

4. Bonus!​

Use case: Persist Pre-Aggregate using dbt​

1. Define Holistics Pre-Aggregate​

2. Persists the Pre-Aggregate using dbt​

Use case: Sum of Item Quantity sold over Month​

1. Define the Pre-Aggregate​

2. Persist the Pre-Aggregate​

3. Test the "Create" option​

Introduction

Scenario

Use case: Sum of Item Quantity sold over Year

Without Aggregate Awareness

Using Aggregate Awareness

1. Define the Pre-Aggregate

2. Persist the Pre-Aggregate

3. Test the exploration

Use case: Sum of Item Quantity sold by Category over Year

Without Pre-Aggregate

Using Aggregate Awareness

1. Define the Pre-Aggregate

2. Persist the Pre-Aggregate

3. Test the exploration

4. Bonus!

Use case: Persist Pre-Aggregate using dbt

1. Define Holistics Pre-Aggregate

2. Persists the Pre-Aggregate using dbt

Use case: Sum of Item Quantity sold over Month

1. Define the Pre-Aggregate

2. Persist the Pre-Aggregate

3. Test the "Create" option