Using Aggregate Awareness
Please refer to Aggregate Awareness to see the full documentations.
Introduction
To improve query performance and reduce query costs when reporting from large datasets or tables, data teams typically create materialized/persisted views (physical tables) of different aggregations.
This tutorial will show an example on how to create and utilize such persisted aggregations using Holistics' Aggregate Awareness.
Scenario
Let's say we have an ecommerce
Dataset with these Relationships:
- Relationship Diagram
- Dataset codes
Dataset ecommerce {
__engine__: 'aql'
label: 'Ecommerce'
description: ''
data_source_name: 'hlite_demo'
models: [
demo_order_items,
demo_orders,
demo_products,
demo_merchants,
demo_countries,
demo_cities,
demo_categories,
demo_users
]
relationships: [
relationship(demo_order_items.product_id > demo_products.id, true)
,
relationship(demo_order_items.order_id > demo_orders.id, true)
,
relationship(demo_orders.user_id > demo_users.id, true)
,
relationship(demo_users.city_id > demo_cities.id, true)
,
relationship(demo_cities.country_code > demo_countries.code, true)
,
relationship(demo_products.merchant_id > demo_merchants.id, true)
,
relationship(demo_products.category_id > demo_categories.id, true)
]
owner: '[email protected]'
}
Use case: Sum of Item Quantity sold over Year
Without Aggregate Awareness
To find out the Sum of item quantity sold over Year, we can make this exploration:
- Table Fields:
orders.created_at
(transform:Year
)order_items.quantity
(aggregation:Sum
)
If we run the above Executed Query with EXPLAIN ANALYZE
, we get this query execution analysis:
- Analysis
- Query
EXPLAIN ANALYZE
SELECT
TO_CHAR((CAST ( (DATE_TRUNC ( 'year', (CAST ( "demo_orders"."created_at" AS timestamptz )) AT TIME ZONE 'Europe/London' )) AT TIME ZONE 'Europe/London' AS timestamptz )) AT TIME ZONE 'Europe/London', 'YYYY-MM-DD HH24:MI:SS.US') AS "dy_do_ca_01c493",
SUM("demo_order_items"."quantity") AS "s_doi_q_df1043"
FROM
"demo"."order_items" "demo_order_items"
LEFT JOIN "demo"."orders" "demo_orders" ON "demo_order_items"."order_id" = "demo_orders"."id"
GROUP BY
1
Some highlights: The Database has to
- Aggregate on 54,783 rows
- Use 793kB of memory
- Take 62.797ms in total
Using Aggregate Awareness
1. Define the Pre-Aggregate
For our use case, we can define a Pre-Aggregate with:
- 1 Dimension:
demo_orders.created_at
(time_granularity:year
)
- 1 Measure:
demo_order_items.quantity
(aggregation:sum
)
- Persistence:
- FullPersistence. This is the most basic/straightforward persistence mode to use.
- AML codes
- Review
pre_aggregates: {
aggregated_quantity: PreAggregate {
dimension pa_created_at {
for: ref('demo_orders', 'created_at')
time_granularity: 'year'
}
measure pa_sum_quantity {
for: ref('demo_order_items', 'quantity')
aggregation_type: 'sum'
}
persistence: FullPersistence {
schema: 'persisted'
}
}
}
2. Persist the Pre-Aggregate
Before actually persisting the Pre-Aggregate, we need to create the database schema to store the going-to-be-persisted tables.
In many databases, we can create the schema using a simple SQL:
CREATE SCHEMA persisted;
The schema name that we choose is persisted
. Make sure it matches the schema
that you define in the persistence
of your Pre-Aggregate.
Then, we need to trigger the persistence.
There are 3 ways to trigger a Pre-Aggregate Persistence in Holistics:
- API
- Schedules
- UI
Let's use the UI because it is most convenient for this tutorial:
- Go to the List view of the Dataset
- Click the Run button on our Pre-Aggregate (
aggregated_quantity
) - Confirm
- Wait for the persistence job to finish
3. Test the exploration
Now when we run the same exploration again, we will see that Holistics automatically uses the aggregated table!
If we run the new Executed Query with EXPLAIN ANALYZE
, we get this query execution analysis:
- Analysis
- Query
EXPLAIN ANALYZE
SELECT
TO_CHAR((CAST ( "aggregated_quantity"."pa_created_at" AS timestamptz )) AT TIME ZONE 'Europe/London', 'YYYY-MM-DD HH24:MI:SS.US') AS "dy_do_ca_01c493",
MAX("aggregated_quantity"."pa_sum_quantity") AS "s_doi_q_df1043"
FROM
"persisted"."HPA_8d3841d4cbe47c86:f0c246a01792a43a_T1716200122" "aggregated_quantity"
GROUP BY
1
We immediately got a huge performance boost: the Database only has to
- Aggregate on 5 rows (10,000 times less data)
- Use 40kB of memory (20 times less memory)
- Take 0.276ms in total (> 200 times faster)
Use case: Sum of Item Quantity sold by Category over Year
Without Pre-Aggregate
Because of the new dimension demo_categories.name
, Holistics cannot re-use our existing Pre-Aggregate aggregated_quantity
,
because aggregated_quantity
has coarser granularity than our exploration.
If we run the above Executed Query with EXPLAIN ANALYZE
, we get this query execution analysis:
- Analysis
- Query
EXPLAIN ANALYZE
SELECT
TO_CHAR((CAST ( (DATE_TRUNC ( 'year', (CAST ( "demo_orders"."created_at" AS timestamptz )) AT TIME ZONE 'Europe/London' )) AT TIME ZONE 'Europe/London' AS timestamptz )) AT TIME ZONE 'Europe/London', 'YYYY-MM-DD HH24:MI:SS.US') AS "dy_do_ca_01c493",
"demo_categories"."name" AS "dc_n_8a4e8a",
SUM("demo_order_items"."quantity") AS "s_doi_q_df1043"
FROM
"demo"."order_items" "demo_order_items"
LEFT JOIN "demo"."orders" "demo_orders" ON "demo_order_items"."order_id" = "demo_orders"."id"
LEFT JOIN "demo"."products" "demo_products" ON "demo_order_items"."product_id" = "demo_products"."id"
LEFT JOIN "demo"."categories" "demo_categories" ON "demo_products"."category_id" = "demo_categories"."id"
GROUP BY
1,
2
Some highlights: The Database has to
- Aggregate on 54,783 rows
- Take 102.314ms in total
Using Aggregate Awareness
1. Define the Pre-Aggregate
In this case, we can update our existing Pre-Aggregate to support more dimensions.
If we look again at the Relationships, we would notice that Categories
is on the one-side of the relationship with Products
.
Therefore, we only need to add demo_products.id
into our Pre-Aggregate and Holistics will take care of the rest!
pre_aggregates: {
aggregated_quantity: PreAggregate {
dimension pa_created_at {
for: ref('demo_orders', 'created_at')
time_granularity: 'year'
}
// BEGIN new codes
dimension pa_product_id {
for: ref('demo_products', 'id')
}
// END new codes
measure pa_sum_quantity {
for: ref('demo_order_items', 'quantity')
aggregation_type: 'sum'
}
persistence: FullPersistence {
schema: 'persisted'
}
}
}
2. Persist the Pre-Aggregate
Re-persist the Pre-Aggregate using the same steps as the first use case:
- Go to the List view of the Dataset
- Click the Run button on our Pre-Aggregate (
aggregated_quantity
) - Confirm
- Wait for the persistence job to finish
3. Test the exploration
We see that Holistics is able to use our Pre-Aggregate!
If we run the above Executed Query with EXPLAIN ANALYZE
, we get this query execution analysis:
- Analysis
- Query
EXPLAIN ANALYZE
SELECT
TO_CHAR((CAST ( "aggregated_quantity"."pa_created_at" AS timestamptz )) AT TIME ZONE 'Europe/London', 'YYYY-MM-DD HH24:MI:SS.US') AS "dy_do_ca_01c493",
"demo_categories"."name" AS "dc_n_8a4e8a",
SUM("aggregated_quantity"."pa_sum_quantity") AS "s_doi_q_df1043"
FROM
"persisted"."HPA_8d3841d4cbe47c86:2f77e1908c1b7df2_T1716202922" "aggregated_quantity"
LEFT JOIN "demo"."products" "demo_products" ON "aggregated_quantity"."pa_product_id" = "demo_products"."id"
LEFT JOIN "demo"."categories" "demo_categories" ON "demo_products"."category_id" = "demo_categories"."id"
GROUP BY
1,
2
The Database only has to
- Aggregate on 15,289 rows (nearly 4 times less data)
- Take 17.010ms in total (> 200 times faster)
4. Bonus!
As shown above, Holistics Aggregate Awareness is relationship-aware and can leverage the Dataset Relationships to make the most out of your Pre-Aggregates.
Using the exact same Pre-Aggregate aggregated_quantity
that we defined earlier, we can also efficiently perform many other explorations such as:
without having to add dimensions like demo_categories.name
, demo_products.name
, or demo_merchants.name
into the Pre-Aggregate!
Use case: Persist Pre-Aggregate using dbt
This example uses Postgresql SQL Dialect in persistence.table_name
and in the dbt SQL.
1. Define Holistics Pre-Aggregate
To let Holistics know that you will be handling the persistence yourself, use ExternalPersistence
:
pre_aggregates: {
aggregated_quantity: PreAggregate {
dimension pa_created_at {
for: ref('demo_orders', 'created_at')
time_granularity: 'year'
}
dimension pa_product_id {
for: ref('demo_products', 'id')
}
measure pa_sum_quantity {
for: ref('demo_order_items', 'quantity')
aggregation_type: 'sum'
}
// BEGIN new codes
persistence: ExternalPersistence {
table_name: '"persisted"."aggregated_quantity"'
}
// END new codes
}
}
2. Persists the Pre-Aggregate using dbt
{{ config(materialized='table', schema='persisted', alias='aggregated_quantity') }}
SELECT
DATE_TRUNC ( 'year', "demo_orders"."created_at" ) AS "pa_created_at",
"demo_products"."id" AS "pa_product_id",
SUM("demo_order_items"."quantity") AS "pa_sum_quantity"
FROM
"demo"."order_items" "demo_order_items"
LEFT JOIN "demo"."orders" "demo_orders" ON "demo_order_items"."order_id" = "demo_orders"."id"
LEFT JOIN "demo"."products" "demo_products" ON "demo_order_items"."product_id" = "demo_products"."id"
GROUP BY
1,
2
Use case: Sum of Item Quantity sold over Month
Earlier, we pre-aggregated on the time_granularity
of year
, so Holistics cannot use that pre-aggregate for aggregations on month
granulariy.
1. Define the Pre-Aggregate
Now, to pre-aggregate for month
, here are some options:
- Modify the
time_granularity
ofaggregated_quantity
tomonth
- This makes the Pre-Aggregate less fast for
year
aggregations. However, it should still be very fast.
- This makes the Pre-Aggregate less fast for
- Create a new Pre-Aggregate with
time_granularity: 'month'
- This makes it fast to do aggregations on both
month
andyear
. However, it will cost more storage to store the pre-aggregated data and more maintenance effort.
- This makes it fast to do aggregations on both
- Modify
- Create
pre_aggregates: {
aggregated_quantity: PreAggregate {
dimension pa_created_at {
for: ref('demo_orders', 'created_at')
// BEGIN new codes
time_granularity: 'month'
// END new codes
}
measure pa_sum_quantity {
for: ref('demo_order_items', 'quantity')
aggregation_type: 'sum'
}
persistence: FullPersistence {
schema: 'persisted'
}
}
}
pre_aggregates: {
aggregated_quantity: PreAggregate {
dimension pa_created_at {
for: ref('demo_orders', 'created_at')
time_granularity: 'year'
}
measure pa_sum_quantity {
for: ref('demo_order_items', 'quantity')
aggregation_type: 'sum'
}
persistence: FullPersistence {
schema: 'persisted'
}
}
// BEGIN new codes
aggregated_quantity_by_month: PreAggregate {
dimension pa_created_at {
for: ref('demo_orders', 'created_at')
time_granularity: 'month'
}
measure pa_sum_quantity {
for: ref('demo_order_items', 'quantity')
aggregation_type: 'sum'
}
persistence: FullPersistence {
schema: 'persisted'
}
}
// END new codes
}
2. Persist the Pre-Aggregate
Re-persist the Pre-Aggregate using the same steps as the first use case:
- Go to the List view of the Dataset
- Click the Run button on our Pre-Aggregate (
aggregated_quantity
) - Confirm
- Wait for the persistence job to finish
3. Test the "Create" option
Let's say we go with the option to "Create" a new Pre-Aggregate, this will be the result:
Holistics is also time-granularity-aware. Thus, it can use aggregated_quantity_by_month
(that we just created) for aggregation on quarter
granularity as well!
If we try aggregating on year
again, Holistics can use the initial Pre-Aggregate aggregated_quantity
: