Relationships in Dataset
Introduction
When users use fields from different data models, based on the relationships metadata, Holistics will figure out the correct JOINs statements to apply to the SQL query.
For example, you have a dataset with the following models: orders, users, merchants, cities, countries
To calculate "Total orders by customers' countries", you will use the two fields countries.name and orders.total_orders from the countries and orders models:
The generated query will be:
SELECT
T3."name" AS "name",
count(1) AS "total_orders"
FROM
"ecommerce"."orders" T0
LEFT JOIN "ecommerce"."users" T1 ON T0."user_id" = T1."id"
LEFT JOIN "ecommerce"."cities" T2 ON T1."city_id" = T2."id"
LEFT JOIN "ecommerce"."countries" T3 ON T2."country_code" = T3."code"
GROUP BY 1
ORDER BY 2 DESC
Ambiguity in join paths
Sometimes, there are more than one possible paths to combine fields between two models. For example, you have some models with relationships as below:
Dataset ecommerce {
...
models: [orders, users, order_items, countries, merchants, products, cities]
relationships: [
relationship(orders.user_id > users.id, true),
relationship(order_items.order_id > orders.id, true),
relationship(cities.country_code > countries.code, true),
relationship(users.city_id > cities.id, true),
relationship(order_items.product_id > products.id, true),
relationship(products.merchant_id > merchants.id, true),
relationship(merchants.city_id > cities.id, true)
]
}
To answer the question "Total order quantity by countries", you drag in Country Name from countries and Total Order Quantity from order_items.
explore {
dimensions {
country_name: countries.name
}
measures {
total_order_quantity: sum(order_items.quantity)
}
}
As you can see, there are two ways to go from countries to order_items:
- Path A: countries → cities → users → orders → order_items (by customer country)
- Path B: countries → cities → merchants → products → order_items (by merchant country)
Each path will produce a different result with different business meanings (total values by users' countries vs. by merchants' countries).
How Holistics Resolves Ambiguity Automatically
When multiple paths exist, Holistics intelligently selects the most analytically correct path using a ranking algorithm. You can create datasets with all relationships active, and Holistics will automatically choose the right path at query time.
The algorithm works in 4 steps:
Step 1: Path Tier Classification
All possible paths are categorized into 4 tiers based on common analytics patterns:
- Tier 1 (Best): Only one-to-many relationships (e.g., dimension → fact, like
countries → cities → users → orders) - Tier 2: Only many-to-one relationships (e.g., fact → dimension, like
orders → users → cities → countries) - Tier 3: Special sequential patterns (e.g., many-to-many using junction tables, like
users → orders → order_items → products) - Tier 4: Mixed patterns not matching the above
Step 2: Assign Weights
Each relationship in a path has a default weight of 0. Weights increase when you explicitly specify relationships:
- Default relationships (defined in dataset): Weight = 0
with_relationships()specified: Higher weight than default- Nested
with_relationships(): Inner (nested) relationships get even higher weight
This means explicitly specified paths using with_relationships() take priority over default paths.
Step 3: Rank Paths
Paths are compared in this order:
- Tier first (Tier 1 > Tier 2 > Tier 3 > Tier 4)
- Weight second (higher scores preferred)
- Path length third (shorter paths preferred as they're simpler and more performant)
Step 4: Automatic Selection with Transparency
- Clear winner: Holistics automatically uses the best-ranked path
- Tie situation: If multiple paths have identical rankings, Holistics will:
- Show a warning in the query preview
- Indicate which path was chosen
- Suggest resolving ambiguity manually
Examples
Example 1: Clear Winner Based on Tier Ranking
Query: "Total products by cities"
Path 1: cities → merchants → products (3 hops)
- Business meaning: Total products produced/sold by merchants in each city
- Pattern: cities → merchants → products
- Tier: Only one-to-many relationships → Tier 1 (Best)
- Length: 3 hops
Path 2: cities → users → orders → order_items → products (5 hops)
- Business meaning: Total products bought by users living in each city
- Pattern: cities → users → orders → order_items (junction) → products
- Tier: Many-to-many using junction table (order_items) → Tier 3
- Length: 5 hops
Result: Holistics automatically selects Path 1 because:
- Tier 1 ranks higher than Tier 3 (tier is the primary ranking criteria)
- Path 1 has a simpler dimension → fact pattern
- Path 2 requires a junction table, making it more complex
Both paths are analytically correct but answer different business questions. If you need Path 2 instead, override with with_relationships():
Dataset ecommerce {
...
metric products_bought_by_city_residents {
label: 'Products Bought by City Residents'
type: 'number'
definition: @aql
products
| count(products.id)
| with_relationships(
relationship(products.merchant_id > merchants.id, false, 'two_ways')
)
;;
}
}
Example 2: Tie Situation with Same Tier, Weight, and Path Length
Query: "Total order items by countries"
Path A: countries → cities → users → orders → order_items (5 hops)
- Business meaning: Order items from customers in each country
- Tier: Only one-to-many relationships → Tier 1
- Length: 5 hops
Path B: countries → cities → merchants → products → order_items (5 hops)
- Business meaning: Order items from merchants located in each country
- Tier: Only one-to-many relationships → Tier 1
- Length: 5 hops
Result: Both paths have:
- Same tier (Tier 1 - both are valid dimension-to-fact patterns)
- Same length (5 hops)
- Same weights (both use default relationships)
In this tie situation, Holistics will automatically pick a path, and may show a warning suggesting you explicitly define the path for clarity.
If you need the merchant path instead, override with with_relationships():
Dataset ecommerce {
...
metric orders_by_merchant_countries {
label: 'Total Orders by Merchant Countries'
type: 'number'
definition: @aql
order_items
| count(order_items.id)
| with_relationships(
relationship(order_items.product_id > products.id, false, 'two_ways')
)
;;
}
}
Manual Control Options
While automatic resolution works well for most cases, you have full control to override the behavior:
Option 1: Disable Relationships and Use with_relationships()
Use with_relationships() to explicitly specify which path a metric should use. This is ideal when different metrics need different paths.
Dataset ecommerce {
...
relationships: [
relationship(orders.user_id > users.id, true),
relationship(order_items.order_id > orders.id, true),
relationship(order_items.product_id > products.id, true),
relationship(users.city_id > cities.id, true),
relationship(cities.country_code > countries.code, true),
relationship(products.merchant_id > merchants.id, true),
relationship(merchants.city_id > cities.id, false) // The relationship is currently disabled
]
metric order_values_by_merchant_countries {
label: 'Total Order Values by Merchant Countries'
type: 'number'
definition: @aql orders
| count(orders.id)
| with_relationships(merchants.city_id > cities.id);;
;;
}
}
When to use:
- You want to completely prevent a certain path from being used across the entire dataset
- You're working with legacy datasets and want to maintain existing behavior
Trade-offs:
- ❌ Disables the path for ALL queries in the dataset
- ❌ Requires
with_relationships()to re-enable for specific metrics
Option 2: Duplicate Models
Create separate dimension models for different contexts (e.g., user_cities and merchant_cities):
When to use:
- Clear semantic separation is important for end users
- You want to avoid any ambiguity warnings
Trade-offs:
- ❌ Makes the dataset larger and more complex to maintain
- ❌ Duplicates dimension data
Best Practices
-
Let Holistics handle it: For most cases, the automatic resolution works correctly. Trust the algorithm!
-
Use
with_relationships()for exceptions: When you need a specific path that differs from the automatic selection, explicitly declare it. -
Monitor warnings: If you see ambiguity warnings in query previews, review whether the automatic selection makes sense for your use case.
-
Document complex paths: Add comments in your AML code explaining why specific paths are chosen for metrics.
-
Test your metrics: Always verify that metrics return expected results, especially in datasets with multiple possible paths.
Other notes
How JOINs are constructed
The JOINs are constructed following some simple rules:
- If the relationship between A and B is n - 1, then valid JOIN path is from n to 1 (
A LEFT JOIN BorB RIGHT JOIN A) - If the relationship is 1 - 1, the path is valid both ways (
A LEFT JOIN BorB LEFT JOIN Aare OK) so we apply FULL JOIN in this case.
'SELECT DISTINCT' in the SQL
In some cases, you will see we apply SELECT DISTINCT in the underlying query for two reasons.
First, this is part of our mechanism to avoid fan-out issue. For more information, please refer to our document about fan-out issue
Second, we see no value to display all records (even duplicated records) so we have applied SELECT DISTINCT to prevent showing redundant data. Our aim is to provide dataset's explorers an overview of their data.