Skip to main content

Relationships in Dataset

Introduction

When users use fields from different data models, based on the relationships metadata, Holistics will figure out the correct JOINs statements to apply to the SQL query.

For example, you have a dataset with the following models: orders, users, merchants, cities, countries

To calculate "Total orders by customers' countries", you will use the two fields countries.name and orders.total_orders from the countries and orders models:

The generated query will be:

SELECT
T3."name" AS "name",
count(1) AS "total_orders"
FROM
"ecommerce"."orders" T0
LEFT JOIN "ecommerce"."users" T1 ON T0."user_id" = T1."id"
LEFT JOIN "ecommerce"."cities" T2 ON T1."city_id" = T2."id"
LEFT JOIN "ecommerce"."countries" T3 ON T2."country_code" = T3."code"
GROUP BY 1
ORDER BY 2 DESC

Path Ambiguity

When working with complex datasets, you may encounter situations where multiple join paths exist between models. For example, connecting countries to order_items could go through either customers (via users and orders) or merchants (via merchants and products), each producing different business results.

Holistics automatically resolves these ambiguous paths using an intelligent ranking algorithm that considers relationship patterns, explicit specifications, and path complexity. The system selects the most analytically appropriate path at query time, allowing you to keep all relationships active without manual intervention.

For detailed information about how path ambiguity works, including the automatic resolution algorithm, common scenarios (like role-playing dimensions and galaxy schemas), and manual control options, see Path Ambiguity in Dataset.

Other notes

How JOINs are constructed

The JOINs are constructed following some simple rules:

  • If the relationship between A and B is n - 1, then valid JOIN path is from n to 1 (A LEFT JOIN B or B RIGHT JOIN A)
  • If the relationship is 1 - 1, the path is valid both ways (A LEFT JOIN B or B LEFT JOIN A are OK) so we apply FULL JOIN in this case.

'SELECT DISTINCT' in the SQL

In some cases, you will see we apply SELECT DISTINCT in the underlying query for two reasons.

First, this is part of our mechanism to avoid fan-out issue. For more information, please refer to our document about fan-out issue

Second, we see no value to display all records (even duplicated records) so we have applied SELECT DISTINCT to prevent showing redundant data. Our aim is to provide dataset's explorers an overview of their data.


Let us know what you think about this document :)