Build multiple pre-aggregates using AML Extend
Introduction
When setting up Aggregate Awareness, you often need separate PreAggregates for different time granularities so that each one can be persisted on its own, more efficient schedule. For example:
- A PreAggregate with time granularity 'month' only needs to be persisted once a month.
- A PreAggregate with time granularity 'week' needs to be persisted once a week.
To conveniently generate multiple PreAggregates for different time granularities, you can leverage AML Reusability!
Without reusability
Here's how you would define monthly, weekly, and daily PreAggregates without reusability.
Dataset movie_rating_analysis {
  ...
  pre_aggregates: {
    aggr_movie_ratings_monthly: PreAggregate {
      dimension timestamp {
        for: r(public_ratings.timestamp)
        time_granularity: 'month'
      }
      measure highest_rating {
        for: r(public_ratings.rating)
        aggregation_type: 'max'
      }
      measure lowest_rating {
        for: r(public_ratings.rating)
        aggregation_type: 'min'
      }
      measure sum_rating {
        for: r(public_ratings.rating)
        aggregation_type: 'sum'
      }
      persistence: IncrementalPersistence {
        schema: 'persisted'
        incremental_column: 'timestamp'
      }
    },
    aggr_movie_ratings_weekly: PreAggregate {
      dimension timestamp {
        for: r(public_ratings.timestamp)
        time_granularity: 'week'
      }
      measure highest_rating {
        for: r(public_ratings.rating)
        aggregation_type: 'max'
      }
      measure lowest_rating {
        for: r(public_ratings.rating)
        aggregation_type: 'min'
      }
      measure sum_rating {
        for: r(public_ratings.rating)
        aggregation_type: 'sum'
      }
      persistence: IncrementalPersistence {
        schema: 'persisted'
        incremental_column: 'timestamp'
      }
    },
    aggr_movie_ratings_daily: PreAggregate {
      dimension timestamp {
        for: r(public_ratings.timestamp)
        time_granularity: 'day'
      }
      measure highest_rating {
        for: r(public_ratings.rating)
        aggregation_type: 'max'
      }
      measure lowest_rating {
        for: r(public_ratings.rating)
        aggregation_type: 'min'
      }
      measure sum_rating {
        for: r(public_ratings.rating)
        aggregation_type: 'sum'
      }
      persistence: IncrementalPersistence {
        schema: 'persisted'
        incremental_column: 'timestamp'
      }
    }
  }
}
As shown in this example, we have to repeat many things: persistence, highest_rating, lowest_rating, and sum_rating.
- If we want to add more measures in the future, we have to add each one 3 times.
- If we want to create more pre-aggregates at another granularity, for example 'year', we again have to repeat almost everything.
Refactoring using AML Extend
Now let's refactor these PreAggregates for better reusability using AML Extend.
We can update the above code in 2 steps:
Step 1: Pick one PreAggregate (e.g. aggr_movie_ratings_daily) and turn it into a variable.
- The variable must be declared outside of your Dataset declaration.
- You can also declare this variable in a separate file!
PreAggregate aggr_movie_ratings_daily {
  dimension timestamp {
    for: r(public_ratings.timestamp)
    time_granularity: 'day'
  }
  measure highest_rating {
    for: r(public_ratings.rating)
    aggregation_type: 'max'
  }
  measure lowest_rating {
    for: r(public_ratings.rating)
    aggregation_type: 'min'
  }
  measure sum_rating {
    for: r(public_ratings.rating)
    aggregation_type: 'sum'
  }
  persistence: IncrementalPersistence {
    schema: 'persisted'
    incremental_column: 'timestamp'
  }
}
Dataset movie_rating_analysis {
  ...
}
Step 2: Create other PreAggregates by extending the variable we just created.
Dataset movie_rating_analysis {
  ...
  pre_aggregates: {
    aggr_movie_ratings_monthly: aggr_movie_ratings_daily.extend({
      dimension timestamp {
        for: r(public_ratings.timestamp)
        time_granularity: 'month'
      }
    }),
    aggr_movie_ratings_weekly: aggr_movie_ratings_daily.extend({
      dimension timestamp {
        for: r(public_ratings.timestamp)
        time_granularity: 'week'
      }
    }),
    aggr_movie_ratings_daily: aggr_movie_ratings_daily
  }
}
Just like that, we went from 66 lines of code to 35, and the definitions are now easier to maintain and to read.
AML Extend has made this so convenient!
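Returning to the yearly case mentioned earlier, here is a minimal sketch of how one more entry might be added with the same pattern. The name aggr_movie_ratings_yearly is hypothetical, and the sketch assumes 'year' is an accepted time_granularity value in your setup; everything else mirrors the Step 2 code.

```
Dataset movie_rating_analysis {
  ...
  pre_aggregates: {
    ...
    // Hypothetical yearly entry: only the time dimension is overridden.
    // The measures and persistence are reused from aggr_movie_ratings_daily.
    aggr_movie_ratings_yearly: aggr_movie_ratings_daily.extend({
      dimension timestamp {
        for: r(public_ratings.timestamp)
        time_granularity: 'year'
      }
    })
  }
}
```

With this layout, a new measure should only need to be added once, in the aggr_movie_ratings_daily variable, because the extended PreAggregates reuse its definition.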