Transform Model

Definition

In Holistics, a Transform Model (or SQL Model) is a data model created from a SQL SELECT statements that perform data transformations on other tables/models.

Think of transform model as a view (or materialized view) in your database.

You can create Transform Model for many different purposes:

  • To do data cleaning, standardizing operations from the raw data (usually stored in based models)
  • To pre-aggregate large amount of data for performance improvement purpose
  • Building reusable models to be used in multiple places, avoid repeating the same SQL query/logic.

At the moment, you can create Transform Models using SQL. In the future, you'll be able to do this with Python or other languages.

Creating Transform Model

Simply go to Data Modeling page, click Create -> Add Data Model from SQL

Querying other models

In your SQL query, it is recommended that you should reference directly to other data models, instead of querying the physical table. For example:

select
{{ #o.user_id }}
, {{ #oi.orders_count }} as total_orders_count
, {{ #oi.aov }} as average_order_value
, min( {{ #o.order_created_date }} ) as first_order_date
from {{ #orders_master oi }}
left join {{ #ecommerce_orders o }} on {{ #o.id }} = {{ #oi.order_id }}
group by 1

Doing this will:

  • Decouple the logical and physical aspect of the database. You woulnd't be able to worry about whether there exists an orders_master table in the underlying database.
  • Let Holistics build a proper dependency graph between the data models, helping with data flow.

Holistics will compile the query into valid SQL syntax of your database. For more details on the syntax, please visit Data Modeling Syntax.

Persisting model's results into SQL table

When defining your Transform Model, you have the option to write (persist) the data back to a table in your SQL database. That means at scheduled interval, the models' SQL query will be executed, and write the results into a database table.

Persisted model

If you select this option, the Transform Model will be backed by a physical table in your data warehouse. The table will be updated with new data following a schedule of your choice. Think of this similar to "materialized view" concept in standard SQL databases.

When you refer to a persisted model in another transformation, you will be querying from the table instead of rerunning the whole transformation sequence. The final query will look like this:

select
field_1
, field_2
, calculation_1
from persisted_model_a
  • Pros: Reduce the amount of actual data scanned when you explore data from this model -> lighter load on your system, and shorter query time.
  • Cons: Data can be staled or persistence schedule between models can be mismatch, which produces wrong results. (Flow-based persistence will be available in the future)

Advise: persist your transform model when

  • It is an upstream model that runs slowly (due to complex query, or large amount data is scanned)
  • It is a downstream model that your end users will explore frequently. This way they can have a faster exploration experience.

Non-persisted model

If you choose to not persist your Transform Model, the model will resemble a "view" in your database. When you refer to a non-persistence model, the model's full SQL will be inserted in the final SQL and you will rerun the whole transformation sequence.

When querying a "non-persisted transform model", the compiled SQL will usually contain a CTE (SQL WITH statement). Something like:

with model_a as (
select
field_1
, field_2
, some_calculation as calculation_1
from source_table
)
select
field_1
, field_2
, calculation_1
from model_a
  • Pros: Have visibility of the whole transformation sequence
  • Cons: Possible higher load to your database, and slower query.

Advise: keep your model not persisted when

  • The transformation makes little changes to the data (mostly renaming, concatenating...)
  • You need to ensure a series of data transformation use and produce absolutely up-to-date data.

Model Dependencies

Holistics uses the modeling syntax to know the dependencies between SQL models and determine the sequence of transformation.

For example, when you run the following query, Holistics knows that model order_master, ecommerce_orders and ecommerce_users must be executed first.

with base as (
select
{{ #o.user_id }}
, {{ #oi.orders_count }} as total_orders_count
, {{ #oi.delivered_orders_count }} as delivered_orders_count
, {{ #oi.gmv }} as gmv
, {{ #oi.nmv }} as nmv
, min( {{ #o.order_created_date }} ) as first_order_date
, max( {{ #o.order_created_date }}) as last_order_date
, current_date - max( {{ #o.order_created_date }}) as days_from_last_order
from {{ #orders_master oi }}
left join {{ #ecommerce_orders o }} on {{ #o.id }} = {{ #oi.order_id }}
group by 1
)
select
{{ #u.id }} as user_id
, total_orders_count
, delivered_orders_count
, gmv
, nmv
, first_order_date
, last_order_date
, days_from_last_order
from {{ #ecommerce_users u }}
left join base on {{ #u.id }} = base.user_id

If you save the query to a model called user_facts_aggr, Holistics can generate a dependency map for that model.

Changes you made in the parent model will be carried over subsequent models. If in model orders_master we changed the definition of gmv field and got different values, this change will reflect to user_facts_aggr.

Troubleshooting

Effects when changing model structure

When editing a Transform Model, if you change the structure of the result set (add fields, remove fields...), there will be side effects:

  • If your Transform Model has persistence, the persistence will be reset
  • Removing/renaming a field will:
    • Invalidate any custom dimensions or measures that are referring to it in their formulas
    • Invalidate any relationships that are pointing to that field
    • Break any Transform Model or report that refers to that field
  • Merely changing a field's data type will not reset the persistence setting. However, it can cause unforeseen issues in custom dimensions, measures and models using that field.

Broken downstream models

Transform Model can be broken when the structure of an upstream model is changed. For example:

  • When the model's query refers to fields/measures that are no longer available, or invalid in the upstream model.
  • When the upstream model is invalid.