Relationships in Dataset
Introduction
When users use fields from different data models, based on the relationships metadata, Holistics will figure out the correct JOINs statements to apply to the SQL query.
For example, you have a dataset with the following models: orders
, users
, merchants
, cities
, countries
To calculate "Total orders by customers' countries", you will use the two fields countries.name
and orders.total_orders
from the countries
and orders
models:
The generated query will be:
SELECT
T3."name" AS "name",
count(1) AS "total_orders"
FROM
"ecommerce"."orders" T0
LEFT JOIN "ecommerce"."users" T1 ON T0."user_id" = T1."id"
LEFT JOIN "ecommerce"."cities" T2 ON T1."city_id" = T2."id"
LEFT JOIN "ecommerce"."countries" T3 ON T2."country_code" = T3."code"
GROUP BY 1
ORDER BY 2 DESC
Ambiguity in join paths
Sometimes, there are more than one possible paths to combine fields between two models. For example, you have some models with relationships as below
To answer the question "Total order values by customers' countries", you drag in Country Name from countries
and Orders Count from orders
. As you can see, there are two ways to go from countries
to orders
:
- A: countries → cities → users → orders → order items
- B: countries → cities → merchants → products → order items
Each JOIN path will produce a different result with different meanings (total values by users' countries vs. by merchant's origin countries.)
When you add new relationships in the dataset, Holistics will automatically detect path ambiguity and alert you:
Simple solution
Two simple ways to handle this situation is either disable one of the relationships to break the circle. In this case, the B path is disabled:
Or duplicate Cities and Countries models into dedicated models for Users and Merchants:
In the next section we will go more into details on how to handle path ambiguity
Using with_relationships()
The two solutions above have their own problems:
- Disabling some relationships will, in effect, disable any calculations that use those relationships. For example, it is now impossible to calculate "Total order values by merchant countries".
- Duplicating models will make the dataset bloated.
Another way to handle this situation is to use Holistics's with_relationships()
in your metrics. This function will force metrics to use certain relationships, even if they are inactive.
- Use
with_relationships()
to force metrics to use certain relationships - Note that the relationships used in
with_relationships()
must be pre-defined in the dataset.
For example, to calculate "Total orders by merchant countries" without activating the B path, we can have a new metric that use a relationship mapping Countries to Merchants:
// ecommerce.dataset.aml
Dataset ecommerce {
...
relationships: [
relationship(orders.user_id > users.id, true),
relationship(order_items.order_id > orders.id, true),
relationship(order_items.product_id > products.id, true),
relationship(users.city_id > cities.id, true),
relationship(cities.country_code > countries.code, true),
relationship(products.merchant_id > merchants.id, true),
relationship(merchants.city_id > cities.id, false) // The relationship is currently disabled
]
metric order_values_by_merchant_countries {
label: 'Total Order Values by Merchant Countries'
type: 'number'
definition: @aql ecommerce_orders
| count(ecommerce_orders.id)
| with_relationships(merchants.city_id > cities.id);;
;;
}
}
For a more detailed discussion on ambiguous path, please check out the Handle Path Ambiguity in dataset document.
Other notes
How JOINs are constructed
The JOINs are constructed following some simple rules:
- If the relationship between A and B is n - 1, then valid JOIN path is from n to 1 (
A LEFT JOIN B
orB RIGHT JOIN A
) - If the relationship is 1 - 1, the path is valid both ways (
A LEFT JOIN B
orB LEFT JOIN A
are OK) so we apply FULL JOIN in this case.
'SELECT DISTINCT' in the SQL
In some cases, you will see we apply SELECT DISTINCT in the underlying query for two reasons.
First, this is part of our mechanism to avoid fan-out issue. For more information, please refer to our document about fan-out issue
Second, we see no value to display all records (even duplicated records) so we have applied SELECT DISTINCT to prevent showing redundant data. Our aim is to provide dataset's explorers an overview of their data.