Skip to main content

How Holistics handles joins

Introduction

For business users that use drag-and-drop report interface day by day, it may not be that important to understand what is happening behind the scenes as long as "it works".

However, for analysts, it is important to understand fully Holistics' underlying mechanism for joining to ensure the results are calculated correctly.

How we generate SQL based on different join types

Holistics uses LEFT JOIN (also known as LEFT OUTER JOIN) for both many-to-one and one-to-one relationships. For many-to-one, the model on the "many" side goes on the left of the join.

Consider an example where we want to count total revenue generated by different genders of the users.

Using Data Exploration, the result may look like this:

The final SQL will look like this:

SELECT
U.gender,
SUM(O.revenue) as total_orders
FROM orders O
LEFT JOIN users U ON O.user_id = U.id
GROUP BY 1
Why LEFT JOIN from the "many" side to the "one" side?

This preserves rows from the "many" table even when there's no matching record on the "one" side (referential integrity). In the example above, orders whose user_id doesn't exist in users are still counted.

Potential fan-out issues with one-to-many relationship

Fan-out happens when you LEFT JOIN two data models with one-to-many relationship, but the table with one column is on the left and the table with many column on the right.

For example, you are working on an ecommerce dataset that contains two data models Order Items and Orders.

Let's say when exploring this set of Data, if you use any field from order_items, for example, order_items.order_id then delivery_attempts will be duplicated and normally sum on the number of delivery will be wrong.

Since our current mechanism will find a join path (LEFT JOIN by default) from the data model that has aggregations (measures) to the data model that does not, the fan-out issue occurs.

In the aforementioned case, you use measure field SUM(orders."delivery_attempts") of orders data model and non-measure field order_items.order_id of order_items data model. Since the relationship between orders and order_items is one-to-many, the underlying query will be generated as below

SELECT
T1."order_id" AS "order_id",
SUM(T0."delivery_attempts") AS "delivery_attempts"
FROM
"ecommerce"."orders" T0
LEFT JOIN "ecommerce"."order_items" T1 ON T0."id" = T1."order_id"
WHERE
T0."id" < 2200
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

What have we done to solve the issue

Note:

Holistics would auto-resolve this fan-out issue for the Aggregations and Business Calculation in our Dataset exploration UI or measure defined using AML (only in 4.0). However, we do not support solving the fan-out problems when using measures created in the 3.0 Data Model layer.

In order to solve this issue, Holistics has modified the underlying query to get the distinct value when the fan-out issue happens.

Specifically, when the measure/aggregation is on the dimension model instead of the fact model, we will calculate the measure first inside the dimension model before executing the join.

For example, let take the example above, when we sum the delivery_attempts, this query will run first inside order model to calculate total delivery_attempts for each order_id. Let's call this Set A

SELECT
T0."id" AS "order_id",
SUM(T0."delivery_attempts") AS "total_delivery_attempts"
FROM
"ecommerce"."orders" T0
WHERE
T0."id" < 2200
GROUP BY 1
ORDER BY 2 DESC

After that, we will execute the join from order_items to orders as normal and select Distinct order_id from order_items model. Let's call this Set B

SELECT
distinct T0."order_id" AS "order_id"
FROM
"ecommerce"."order_items" T0
LEFT JOIN "ecommerce"."orders" T1 ON T0."order_id" = T1."id"
WHERE
T0."id" < 2200

Finally, we will execute Inner Join Set A and Set B, select order_id from Set B and total_delivery_attempts from Set A. By doing this, the order_id will be unique and fan-out issue no longer exists.

The final query will be:

With total_delivery_attemps as (
SELECT
T0."order_id" AS "order_id",
SUM(T0."delivery_attempts") AS "total_delivery_attempts"
FROM
"ecommerce"."orders" T0
WHERE
T0."id" < 2200
GROUP BY 1
ORDER BY 2 DESC
)
, distinct_order_id as (
SELECT
distinct T0."order_id" AS "order_id"
FROM
"ecommerce"."order_items" T0
LEFT JOIN
"ecommerce"."orders" T1 ON T0."order_id" = T1."id"
WHERE
T0."id" < 2200
)

SELECT
T1."order_id"
T0."total_delivery_attempts"
FROM
total_delivery_attemps T0
INNER JOIN
distinct_order_id T1 on T0."order_id" = T1."order_id"

How to avoid fan-out issue in your report

In Holistics, the fan-out warning should only appear when you're using a custom measure, because we only auto-handle fan-out for measures with an explicit aggregation type. If your measure is created inside modeling layer using SQL, there will be high possibility that we cannot guess our aggregation function thus causing the fan-out issues.

There are some solutions to avoid the error in this situation:

  • Use Aggregation or Business calculation on the Dataset Exploration
  • Using AML syntax (only in 4.0)

You could find more detail here

Handling referential integrity violations

Take the (Orders, Users) example above. Consider scenario where there are orders records of user ID = 5, but in the users model, no corresponding record of user ID 5 found. This is a violation of the referential integrity rule between orders and users (many-to-one relationship).

If we use INNER JOIN for the above query, the result set would eliminate orders rows with unfounded users. This is dangerous and will underreport the sales results.

Therefore, using INNER JOIN does not resolve referential integrity violation correctly.

-- IMPORTANT: This is not what Holistics does
-- Using INNER JOIN
SELECT
U.gender,
SUM(O.revenue) as total_orders
FROM orders O
INNER JOIN users U ON O.user_id = U.id
GROUP BY 1

Generally speaking, Holistics will opt to use OUTER JOIN when dealing with referential integrity issues. In some situation, this will add query performance overhead, but it ensures that all records will be accounted for and not fall into referential integrity violation traps.


Open Markdown
Let us know what you think about this document :)