Relationships in Dataset
Introduction
When users use fields from different data models, based on the relationships metadata, Holistics will figure out the correct JOINs statements to apply to the SQL query.
For example, you have a dataset with the following models: orders, users, merchants, cities, countries
To calculate "Total orders by customers' countries", you will use the two fields countries.name and orders.total_orders from the countries and orders models:
The generated query will be:
SELECT
T3."name" AS "name",
count(1) AS "total_orders"
FROM
"ecommerce"."orders" T0
LEFT JOIN "ecommerce"."users" T1 ON T0."user_id" = T1."id"
LEFT JOIN "ecommerce"."cities" T2 ON T1."city_id" = T2."id"
LEFT JOIN "ecommerce"."countries" T3 ON T2."country_code" = T3."code"
GROUP BY 1
ORDER BY 2 DESC
Path Ambiguity
When working with complex datasets, you may encounter situations where multiple join paths exist between models. For example, connecting countries to order_items could go through either customers (via users and orders) or merchants (via merchants and products), each producing different business results.
Holistics automatically resolves these ambiguous paths using an intelligent ranking algorithm that considers relationship patterns, explicit specifications, and path complexity. The system selects the most analytically appropriate path at query time, allowing you to keep all relationships active without manual intervention.
For detailed information about how path ambiguity works, including the automatic resolution algorithm, common scenarios (like role-playing dimensions and galaxy schemas), and manual control options, see Path Ambiguity in Dataset.
Other notes
How JOINs are constructed
The JOINs are constructed following some simple rules:
- If the relationship between A and B is n - 1, then valid JOIN path is from n to 1 (
A LEFT JOIN BorB RIGHT JOIN A) - If the relationship is 1 - 1, the path is valid both ways (
A LEFT JOIN BorB LEFT JOIN Aare OK) so we apply FULL JOIN in this case.
'SELECT DISTINCT' in the SQL
In some cases, you will see we apply SELECT DISTINCT in the underlying query for two reasons.
First, this is part of our mechanism to avoid fan-out issue. For more information, please refer to our document about fan-out issue
Second, we see no value to display all records (even duplicated records) so we have applied SELECT DISTINCT to prevent showing redundant data. Our aim is to provide dataset's explorers an overview of their data.