Skip to main content

Persistence Query Model

Knowledge Checkpoint

A grasp of these concepts will help you understand this documentation better:

info

Only the Production version of your Persistence(s) will be executed. Therefore, please remember to DEPLOY TO PRODUCTION when you create (or modify) a Persistence for Holistics to execute it properly.

What is Persistence?

Model Persistence is the process of writing your data from your query model to your data warehouse to either make it available for querying, or to improve query performance.

Creating persistent table

In order to add the persistence to your query model, just simply add persistence: param to your query model

Model orders {
type: 'query'
label: "Orders"
data_source_name: 'demodata'
owner: '[email protected]'

query: @sql SELECT * FROM ecommerce.orders ;;

persistence: // ... your setting here

dimension id {
label: 'Id'
type: 'number'
}

// dimension ...
}

Full Persistence

Full Persistence is the process of rebuilding the whole persisted table each time it is triggered

Full Persistence

In order to execute Full Persistence, just simply add a FullPersistence config after persistence param

The FullPersistence config includes

  • schema: schema name that the persisted table will be written into
  • view_name (optional): allow Holistics to create a view table (in addition to the persisted_table) with a consistent name in your database’s schema, so that you can reliably query it via SQL from outside of Holistics
Model orders {
type: 'query'
label: "Orders"
data_source_name: 'demodata'
owner: '[email protected]'

query: @sql SELECT * FROM ecommerce.orders ;;

// Fully regenerate the persisted table
// everytime the persistence was triggered
persistence: FullPersistence {
schema: 'scratch' // persisted table will be written into this schema
view_name: 'pqm_orders' //Optional
}

dimension id {
label: 'Id'
type: 'number'
}

// dimension ...
}

Incremental Persistence

Incremental Persistence is the persisted table that is built by appending only new records to the table instead of rebuilding the table in its entirety

Incremental Persistence

In order to execute Incremental Persistence, just simply add an IncrementalPersistence config after persistence param

IncrementalPersistence config includes:

  • schema and view_name (same as FullPersistence)
  • incremental_column: Holistics will use that column to determine the data range to be queried from the source
  • primary_key (optional): only use this option when your past records change. Holistics will refer to this column to delete changed records in the current persisted_table.
Model cancelled_orders {
type: 'query'
label: "Orders"
data_source_name: 'demodata'
owner: '[email protected]'

// Dependent models
models: [orders]
query: @sql
SELECT {{ #orders.* }} FROM {{ #orders }}
WHERE {{ #orders.status }} = 'cancelled'
;;

// Incrementally regenerate the persisted table
// everytime the persistence was triggered
persistence: IncrementalPersistence {
// Incremental only settings
incremental_column: 'updated_at'
primary_key: 'id'

// other options are the same as FullPersitence
}

dimension id {
label: 'Id'
type: 'number'
}

// dimension ...
}

How incremental persistence works

For now, users will need to specify the incremental column in the Persistence config in order for the Incremental Persistence to run as expected.

After specifying the incremental column, when the persistence is triggered, the query will be built like below

SELECT * FROM ( model query here ) WHERE [[ incremental_column > {{ max_value }} ]]

Note: Currently, the persistence condition cannot be customized. In the future, we might allow users to customize their own incremental persistence conditions.

Set Persistence trigger

This option is used to trigger your persistence based on a certain condition or schedule. In Holistics, Persistence Trigger is defined in a separate file.

Schedule trigger

Your persistence will be executed at the exact time that you have set. Schedule trigger will be defined in the schedules.aml file like below.

const schedules = [
// Schedule orders model to run every 10 minutes
Schedule { models: [orders], cron: '0,10,20,30,40,50 * * * *' }
// Schedule cancelled_orders model to run once an hour
Schedule { models: [cancelled_orders], cron: '0 * * * *' }
]

Holistics will scan the file schedule.aml and rely on the model name plus its corresponding schedule to build the related persisted table at the time set.

The content of the file includes

  • import: all the query models you want to set schedule trigger
  • schedule config:
    • The config includes model's name and the cron job.
    • You can set multiple schedules for one query model.
    • Refer to cron example here for reference

Just a note:

  • The persistence cannot be run at the exact time you set because of several reasons like
    • Cascading persistence
    • Run out of default slot for persistence queue
  • The smallest frequency we support is every 5 mins since we only scan cron every 5 mins. If you schedule a persistence to run every 1 min it will simply run at the smallest frequency (5 mins)

SQL value trigger (not supported yet)

Persistence can be executed based on the value returned from a SQL statement that you defined. If the result of the SQL statement is different from the previous value, the persisted table is rebuilt. Otherwise, the existing persisted table is maintained in the database

Flow-based (Cascading) Persistence

There are cases where your persisted model refers to other persisted models which are its upstream dependent models.

Persistence 4.0 - docs.png

Let’s say you have the above set of SQL models that have references with each other. You now enable Persistence in all 4 models. In order to run Persistence in Model D, you can either choose to rebuild or reuse the Persistence in Model B, C, or A.

persistence: FullPersistence {
// persisted table will be written into this schema
schema: 'scratch'
view_name: 'pqm_orders' // Optional

// What to do when a downstream persistence cascade to the current one
// Available options:
// 'rebuild' (default)
// 'reuse' Reuse the last persisted table if available
on_cascade: 'reuse' | 'rebuild'
}

Rebuild cascade

Note: rebuild cascade will be run by default if on_cascade is not set

By setting the on_cascade key to ‘rebuild’, the persisted table of the current model will be rebuilt when its downstream dependent models are triggered or rebuilt.

Reuse cascade

By setting the on_cascade key to ‘reuse’, the persisted table of the current model will be reused in the downstream persistence models. Do note that if the Persistence is triggered by a schedule (in schedule.aml file), it will be rebuilt and disregarded on_cascade setting.

For the most part, you should only use 'reuse' in these two scenarios:

  • The model’s data is static
  • You want to control the exact timing of the persistence process. E.g: enable reuse and add a schedule to refresh at 6AM everyday ⇒ it will only be rebuilt once everyday at that time, regardless of other cascading.

Order of persistence execution

Let’s say, for example, we set on_cascade of Model D to ‘rebuid’, B to ‘reuse’, C to ‘rebuild’, A to ‘rebuild’

cascade

B uses the on_cascade: 'reuse' strategy. A, C, and D use on_cascade: 'rebuild'

Thus:

  • Schedule { models: [D] } A will still be rebuilt even if B uses the 'reuse' strategy since C uses the 'rebuild' strategy.
  • Schedule { models: [B] } will rebuild B, since the direct trigger won't respect the on_cascade: 'reuse' strategy (because it's not a cascade trigger).
  • [Schedule { models: [A] }, Schedule { models: [B] }] here since A and B belong to different pipelines, A will always be rebuilt regardless of B.

Persistence Table Optimizations

Model Persistence persists the data as physical tables in your Data Warehouse.
Thus, you can apply table optimization features provided by your Data Warehouse to improve the performance of the persisted tables.

For example, a Data Warehouse may provide these features:

Such features can be applied using Custom Persistence DDL.

Custom Persistence DDL

Closed Beta

Custom Persistence DDL is currently in Closed Beta.

Usage

By declaring the custom_ddl (Custom Data Definition Language) for a Persistence, you can customize the way the persistence table is created in your Data Warehouse.

For example, given this query model:

Model orders_fact {
type: 'query'
query: @sql
select
{{#o.id}} as order_id
, {{#o.user_id}}
, {{#o.status }}
, {{#o.created_at}} as order_created_at
, {{#o.delivery_attempts}}
, {{#o.discount}}
, {{#oi.quantity}}
, {{#p.id}} as product_id
, {{#p.price}}
, {{#p.merchant_id}}
from {{ #ecommerce_order_items AS oi }}
left join {{ #ecommerce_orders AS o }} on {{#oi.order_id}} = {{#o.id}}
left join {{ #ecommerce_products AS p }} on {{#oi.product_id}} = {{#p.id}}
;;
// other Model configs...
}

We can declare a custom_ddl like this to add indexes to the persisted table:

Model orders_fact {
type: 'query'
query: @sql
/* see sql above */
;;

persistence: FullPersistence {
schema: 'persisted'

// BEGIN custom_ddl declaration
custom_ddl: @sql
/* Run parsed_query and persist it into persisted_table */
CREATE TABLE {{ persisted_table }}
AS {{ parsed_query }};

/* Add indexes to the persisted_table */
CREATE INDEX {{ index_name }} ON {{ persisted_table }} (order_created_at);
CREATE INDEX {{ index_name }} ON {{ persisted_table }} ((order_created_at::date));
CREATE INDEX {{ index_name }} ON {{ persisted_table }} (user_id);
CREATE INDEX {{ index_name }} ON {{ persisted_table }} (status);
CREATE INDEX {{ index_name }} ON {{ persisted_table }} (product_id);
CREATE INDEX {{ index_name }} ON {{ persisted_table }} (merchant_id);
CREATE INDEX {{ index_name }} ON {{ persisted_table }} (order_id, product_id);
;;
// END custom_ddl declaration
}

Let's break down the custom_ddl AML parameter:

  • It is a SQL DDL string and hence should be declared using @sql syntax.

    • The exact syntax/dialect corresponds to the Data Warehouse of the Model. For example, if your Model's source is Postgresql, you need to use Postgresql DDL dialect.
  • It provides these template variables:

    VariableDescription
    parsed_queryThe final query/SQL generated from the Model.
    persisted_tableName of the physical table where the Model will be persisted into.
    Your DDL must use this table name so that Holistics can query it properly.
    index_nameA uniquely generated index name. It will be unique for every instance of usage.
    This helps you conveniently create many indexes without having to worry about index name collision.

For the above example custom_ddl, Holistics will run this query when executing the Model Persistence:

/* Run parsed_query and persist it into persisted_table */
CREATE TABLE "persisted"."order_master__H518f24_T1923344"
AS select
"o"."id" as order_id
, "o"."user_id"
, "o"."status"
, "o"."created_at" as order_created_at
, "o"."delivery_attempts"
, "o"."discount"
, "oi"."quantity"
, "oi"."product_id"
, "p"."price"
, "p"."merchant_id"
, "oi"."order_id" as test_broken
from (
SELECT
"ecommerce_order_items"."quantity" AS "quantity",
"ecommerce_order_items"."product_id" AS "product_id",
"ecommerce_order_items"."order_id" AS "order_id"
FROM
"ecommerce"."order_items" "ecommerce_order_items"
) AS "oi"
left join (
SELECT
"ecommerce_orders"."id" AS "id",
"ecommerce_orders"."user_id" AS "user_id",
"ecommerce_orders"."status" AS "status",
"ecommerce_orders"."created_at" AS "created_at",
"ecommerce_orders"."delivery_attempts" AS "delivery_attempts",
"ecommerce_orders"."discount" AS "discount"
FROM
"ecommerce"."orders" "ecommerce_orders"
) AS "o" on "oi"."order_id" = "o"."id"
left join (
SELECT
"ecommerce_products"."price" AS "price",
"ecommerce_products"."merchant_id" AS "merchant_id",
"ecommerce_products"."id" AS "id"
FROM
"ecommerce"."products" "ecommerce_products"
) AS "p" on "oi"."product_id" = "p"."id";

/* Add indexes to the persisted_table */
CREATE INDEX idx_7bfbea69fc9f7ad857259e007c5ebba7 ON "persisted"."order_master__H518f24_T1923344" (order_created_at);
CREATE INDEX idx_65bca19291cce8af52e3c56326a6d673 ON "persisted"."order_master__H518f24_T1923344" ((order_created_at::date));
CREATE INDEX idx_c1cb8848fe960f11867dca397eeec9d5 ON "persisted"."order_master__H518f24_T1923344" (user_id);
CREATE INDEX idx_2043e1826d0198f8eb310900e11a20ae ON "persisted"."order_master__H518f24_T1923344" (status);
CREATE INDEX idx_684ba1e7d8dbaa8ec7c1b4279e22be64 ON "persisted"."order_master__H518f24_T1923344" (product_id);
CREATE INDEX idx_7a74a60d164ccc9a2ed726c63c90d2ec ON "persisted"."order_master__H518f24_T1923344" (merchant_id);
CREATE INDEX idx_cfe6ad4d54928ec97888505ccc0bf6db ON "persisted"."order_master__H518f24_T1923344" (order_id, product_id);

Using templates

In Holistics AML editor, after typing custom_ddl: (with the colon), use Ctrl + Space to list and use the built-in custom_ddl templates.

Supported Data Warehouses

Currently, Custom Persistence DDL supports these Data Warehouses:

  • BigQuery
  • ClickHouse
  • MySql
  • Postgresql
  • Redshift
  • Snowflake
  • SqlServer

Incremental Persistence Caveat

custom_ddl is only applied when creating the persistence table. Thus, for Incremental Persistences, it will only be executed once in the first run of the Incremental Persistence.

As a result, updating the custom_ddl of an (already-persisted) Incremental Persistence won't have any effect.

To effectively change how the table of an Incremental Persistence is created, the existing persisted table must be dropped first. In future releases of Holistics, we will try to make it more convenient to reset the table creation of Incremental Persistences.

When will the persisted table be invalidated?

There are two cases when the Persisted table is invalidated and removed:

  • These persisted tables will be retained for 24 hours (from after the persistence is deleted or the related schedule is removed on production).
  • When the current model or upstream dependencies changed. This includes a change to query, incremental column, primary_key, model name, etc.

Let us know what you think about this document :)