Skip to main content

Relationships in Dataset

Introduction

When users use fields from different data models, based on the relationships metadata, Holistics will figure out the correct JOINs statements to apply to the SQL query.

For example, you have a dataset with the following models: orders, users, merchants, cities, countries

To calculate "Total orders by customers' countries", you will use the two fields countries.name and orders.total_orders from the countries and orders models:

The generated query will be:

SELECT
T3."name" AS "name",
count(1) AS "total_orders"
FROM
"ecommerce"."orders" T0
LEFT JOIN "ecommerce"."users" T1 ON T0."user_id" = T1."id"
LEFT JOIN "ecommerce"."cities" T2 ON T1."city_id" = T2."id"
LEFT JOIN "ecommerce"."countries" T3 ON T2."country_code" = T3."code"
GROUP BY 1
ORDER BY 2 DESC

Ambiguity in join paths

Sometimes, there are more than one possible paths to combine fields between two models. For example, you have some models with relationships as below

To answer the question "Total order values by customers' countries", you drag in Country Name from countries and Orders Count from orders. As you can see, there are two ways to go from countries to orders:

  • A: countries → cities → users → orders → order items
  • B: countries → cities → merchants → products → order items

Each JOIN path will produce a different result with different meanings (total values by users' countries vs. by merchant's origin countries.)

When you add new relationships in the dataset, Holistics will automatically detect path ambiguity and alert you:

Simple solution

Two simple ways to handle this situation is either disable one of the relationships to break the circle. In this case, the B path is disabled:

Or duplicate Cities and Countries models into dedicated models for Users and Merchants:

In the next section we will go more into details on how to handle path ambiguity

Using with_relationships()

The two solutions above have their own problems:

  • Disabling some relationships will, in effect, disable any calculations that use those relationships. For example, it is now impossible to calculate "Total order values by merchant countries".
  • Duplicating models will make the dataset bloated.

Another way to handle this situation is to use Holistics's with_relationships() in your metrics. This function will force metrics to use certain relationships, even if they are inactive.

tip
  • Use with_relationships() to force metrics to use certain relationships
  • Note that the relationships used in with_relationships() must be pre-defined in the dataset.

For example, to calculate "Total orders by merchant countries" without activating the B path, we can have a new metric that use a relationship mapping Countries to Merchants:

// ecommerce.dataset.aml
Dataset ecommerce {
...

relationships: [
relationship(orders.user_id > users.id, true),
relationship(order_items.order_id > orders.id, true),
relationship(order_items.product_id > products.id, true),

relationship(users.city_id > cities.id, true),
relationship(cities.country_code > countries.code, true),
relationship(products.merchant_id > merchants.id, true),

relationship(merchants.city_id > cities.id, false) // The relationship is currently disabled
]

metric order_values_by_merchant_countries {
label: 'Total Order Values by Merchant Countries'
type: 'number'
definition: @aql ecommerce_orders
| count(ecommerce_orders.id)
| with_relationships(merchants.city_id > cities.id);;
;;
}
}

For a more detailed discussion on ambiguous path, please check out the Handle Path Ambiguity in dataset document.

Other notes

How JOINs are constructed

The JOINs are constructed following some simple rules:

  • If the relationship between A and B is n - 1, then valid JOIN path is from n to 1 (A LEFT JOIN B or B RIGHT JOIN A)
  • If the relationship is 1 - 1, the path is valid both ways (A LEFT JOIN B or B LEFT JOIN A are OK) so we apply FULL JOIN in this case.

'SELECT DISTINCT' in the SQL

In some cases, you will see we apply SELECT DISTINCT in the underlying query for two reasons.

First, this is part of our mechanism to avoid fan-out issue. For more information, please refer to our document about fan-out issue

Second, we see no value to display all records (even duplicated records) so we have applied SELECT DISTINCT to prevent showing redundant data. Our aim is to provide dataset's explorers an overview of their data.


Let us know what you think about this document :)