
Best practices when working with datasets

Knowledge Checkpoint

We recommend having a clear understanding of these concepts before reading this documentation:

A dataset is the product of the data builder, and it is the interface through which end users analyze data. Therefore, you should take extra care when building it.

What makes a good dataset

Understanding the basics of building a dataset is easy: “just join your data models together by setting up the relationships and paths!”

But there are three things you should be optimizing for when building your dataset:

  1. Correctness: Do reports built on top of this dataset return correct data? This is the very lowest bar for design, as Holistics makes it fairly easy to get correct results as long as your join conditions are correct.
  2. Performance: While there may be multiple ways to get Holistics to write a query, there will typically be one way that has the optimal performance. Designing your dataset with performance in mind can be the difference between 15-second and 15-minute dashboard load times.
  3. Usability: When a user who is not an expert in your data explores your dataset, it should be immediately clear how to find the answers they’re looking for.

There are no hard-and-fast rules for developing datasets effectively in Holistics, but here are some recommendations:

Best Practices

Organize datasets in a logical manner

  • Here are some typical ways of organizing datasets:
    • Master datasets: grouped by entity type (customers, inventory, products). The goal is completeness.
    • Department datasets: grouped by team. The goal is relevance. Ask yourself these questions before building a dataset:
      • “If I'm from this department, how can I find my data quickly?”
      • “Is there a dataset folder that my department will access frequently?”
  • Don't worry about duplicating datasets (there's no harm in exposing the same data model fields to different teams). Focus on accessibility and purpose.
  • Make it clear when datasets are not meant to be explored (for example, group them into a separate folder).

Make multiple small datasets instead of a single big one

  • Design each dataset to answer a specific set of questions. Anticipate your end users’ questions, and build your dataset around them. Only include the models relevant to those questions.
  • It is advisable to join 4–5 models per dataset; 6–7 is the absolute upper limit. If you find yourself needing more, you probably need to model the underlying data more effectively.
  • This is because:
    • Too many models will confuse end users.
    • Too many models are more likely to generate complicated, non-optimal joins.

Avoid complex join paths

  • Build your dataset as a “star schema”: one central table (typically a fact table) that contains the necessary measures, surrounded by dimension tables.

  • Avoid “snowflake-schema-like” datasets. In a snowflake schema, additional information about an entity/dimension is split into separate tables, which means more models to include in the dataset and more joins to get complete information about a given entity.

  • Ideally, set up your models so that the dimensions and measures you need are only one JOIN away; a join chain that is too long will not perform well.
  • One good way to reduce the joins generated at dataset runtime is to prejoin/denormalize your data as much as possible, as sketched below.
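
For instance, a snowflake chain such as products → categories → departments can be collapsed into one wide dimension table in your transformation layer. Here is a minimal SQL sketch; the products, categories, and departments tables are hypothetical examples, not part of this document:

```sql
-- Collapse a snowflake chain (products -> categories -> departments)
-- into a single wide dimension table, so every product attribute is
-- one JOIN away from the fact table.
CREATE TABLE dim_products AS
SELECT
  p.id   AS product_id,
  p.name AS product_name,
  c.name AS category_name,   -- denormalized from categories
  d.name AS department_name  -- denormalized from departments
FROM products p
LEFT JOIN categories  c ON p.category_id   = c.id
LEFT JOIN departments d ON c.department_id = d.id;
```

With dim_products in place, the dataset only needs the fact table plus this one dimension model, instead of three separate models and a two-step join chain.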

Other Notes

  • You can create any number of datasets, because Holistics datasets have these two benefits:
    • No need to worry about report sprawl: all definitions are centralized within Holistics.
    • All Holistics datasets are virtual, so they don't take up physical storage space.
  • Common Challenges:
    • Data exploration offers flexibility for business users, but it carries the risk of dragging in incompatible dimensions and metrics, or adding non-additive metrics.
    • Beware of fanout issues, where a one-to-many join duplicates rows and inflates aggregates (see the sketch below).
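
To illustrate the fanout problem: when a one-to-many join duplicates rows on the “one” side, additive measures get counted multiple times. A minimal SQL sketch, assuming hypothetical orders and order_items tables:

```sql
-- FANOUT: each order row is repeated once per matching order_item,
-- so this sum counts an order's total multiple times.
SELECT SUM(o.total_amount)
FROM orders o
JOIN order_items i ON i.order_id = o.id;

-- SAFER: aggregate the "many" side first, then join one-to-one,
-- so each order contributes its total exactly once.
SELECT SUM(o.total_amount) AS total_revenue,
       SUM(i.item_count)   AS total_items
FROM orders o
LEFT JOIN (
  SELECT order_id, COUNT(*) AS item_count
  FROM order_items
  GROUP BY order_id
) i ON i.order_id = o.id;
```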
