Set up DataHub integration

Early access

This feature is currently in development and not yet available. Contact Holistics to sign up for early access.

Prerequisites

Before setting up the integration, make sure you have:

Holistics CLI installed and available in your PATH. See the CLI documentation for installation instructions.
DataHub instance running and accessible. This can be a local instance or DataHub Cloud.
Access to your Holistics AML project - either as a local directory or a git repository.
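If you want to check the first prerequisite from a script rather than by hand, a minimal sketch follows. It only looks for an executable named holistics on your PATH; the helper name missing_tools is hypothetical and is not part of the Holistics CLI or the connector.

```python
import shutil


def missing_tools(tools=("holistics",)):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

Pass additional executable names in the tuple if your setup depends on other command-line tools.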

Installation

Install the DataHub Holistics connector using pip:

pip install datahub-holistics

This package registers itself as a DataHub ingestion source plugin, so you can use type: holistics in your ingestion recipes.

The connector expects a recent Holistics CLI that exposes the canonical lineage graph via:

holistics aml lineage .

Configuration

The connector is configured through a YAML recipe file, following DataHub's standard ingestion format.

Basic setup with local directory

If your AML project is on your local machine, use the base_folder option:

source:
  type: holistics
  config:
    base_folder: /path/to/your/holistics-aml-project

    connection_to_platform_map:
      bigquery_prod:
        platform: bigquery
        env: PROD

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080

Git-based setup

For production use, you'll typically want the connector to clone your AML project from git. This ensures you're always ingesting from the latest committed state:

source:
  type: holistics
  config:
    git_info:
      repo: https://github.com/your-company/holistics-project
      branch: main
      deploy_key_file: /path/to/deploy_key  # Optional, for private repos

    connection_to_platform_map:
      bigquery_prod:
        platform: bigquery
        env: PROD

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080

Connection mapping

Connection mapping is essential for establishing lineage from your Holistics models to the underlying database tables. Without it, the connector won't know which DataHub platform corresponds to each Holistics data source.

Each entry maps a Holistics data_source_name (as defined in your AML files) to a DataHub platform:

connection_to_platform_map:
  # Simple mapping - just specify the platform
  bigquery_prod:
    platform: bigquery
    env: PROD

  # Detailed mapping - useful for platforms that need more context
  postgres_analytics:
    platform: postgres
    platform_instance: analytics-db
    database: analytics
    schema: public
    env: PROD

  # Snowflake example
  snowflake_warehouse:
    platform: snowflake
    platform_instance: my-snowflake
    env: PROD

The connector uses this mapping to construct proper DataHub URNs for source tables, enabling end-to-end lineage from dashboards down to database tables.
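To make the role of the mapping concrete, here is a rough sketch of how a dataset URN is assembled under DataHub's general URN convention. The helper dataset_urn and the table names are hypothetical, and how the connector itself combines database, schema, and table into the dataset name is an assumption, so treat this as an illustration rather than the connector's actual logic.

```python
def dataset_urn(platform, table, env="PROD", platform_instance=None):
    """Build a DataHub-style dataset URN.

    When a platform instance is given, it is prefixed to the dataset name,
    which is how DataHub URNs generally distinguish multiple instances of
    the same platform.
    """
    name = f"{platform_instance}.{table}" if platform_instance else table
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical tables reached through the connections mapped above.
print(dataset_urn("bigquery", "my-project.sales.orders"))
print(dataset_urn("postgres", "analytics.public.orders",
                  platform_instance="analytics-db"))
```

If the platform, platform_instance, or env in your mapping differs from what the warehouse's own DataHub ingestion used, the URNs will not match and the lineage edges will point at entities that do not exist.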

Feature flags

You can control what metadata gets extracted:

source:
  type: holistics
  config:
    base_folder: /path/to/project

    # Feature flags (all default to true)
    extract_owners: true        # Extract owner information from AML
    extract_lineage: true       # Build lineage relationships
    extract_descriptions: true  # Include descriptions from AML
    include_hidden_fields: false # Include fields marked as hidden

    # Platform identification
    platform_instance: production-holistics
    env: PROD

    connection_to_platform_map:
      # ... your mappings

Filtering

Use regex patterns to control which entities get ingested:

source:
  type: holistics
  config:
    base_folder: /path/to/project

    # Only ingest specific entities
    model_pattern:
      allow:
        - ".*"
      deny:
        - "tmp_.*"      # Skip temporary models
        - "test_.*"     # Skip test models

    dataset_pattern:
      allow:
        - ".*"

    dashboard_pattern:
      allow:
        - ".*"

    connection_to_platform_map:
      # ... your mappings
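The sketch below illustrates how allow and deny lists of this kind are usually evaluated in DataHub sources: deny patterns win, patterns are matched from the start of the name, and matching is case-insensitive by default. It assumes the Holistics connector follows that general behavior; the function is_allowed is a hypothetical stand-in, not the connector's code.

```python
import re


def is_allowed(name, allow=(".*",), deny=()):
    """Return True if name passes the allow/deny filter.

    A name is rejected if it matches any deny pattern; otherwise it is
    accepted if it matches at least one allow pattern. re.match anchors
    each pattern at the start of the name.
    """
    if any(re.match(pattern, name, re.IGNORECASE) for pattern in deny):
        return False
    return any(re.match(pattern, name, re.IGNORECASE) for pattern in allow)


deny = ("tmp_.*", "test_.*")
for model in ("orders", "tmp_orders", "orders_tmp_backup"):
    print(model, is_allowed(model, deny=deny))
```

Because matching starts at the beginning of the name, a pattern such as tmp_.* excludes tmp_orders but not orders_tmp_backup. Use a pattern like .*tmp_.* if you need to match anywhere in the name.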

Stateful ingestion

Enable stateful ingestion to automatically detect and remove stale entities when they're deleted from your AML project:

source:
  type: holistics
  config:
    base_folder: /path/to/project

    stateful_ingestion:
      enabled: true

    connection_to_platform_map:
      # ... your mappings
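Conceptually, stale-entity detection compares the entities seen in the previous run with those seen in the current one. The sketch below shows only that idea; the URNs are made up, and DataHub's real implementation persists checkpoints between runs and soft-deletes the stale entities rather than just listing them.

```python
def stale_urns(previous_run, current_run):
    """Return URNs that were emitted last run but are absent this run."""
    return sorted(set(previous_run) - set(current_run))


# Hypothetical URNs from two consecutive ingestion runs.
previous = [
    "urn:li:dashboard:(holistics,sales_overview)",
    "urn:li:dashboard:(holistics,old_marketing_report)",
]
current = [
    "urn:li:dashboard:(holistics,sales_overview)",
]

for urn in stale_urns(previous, current):
    print("would be marked as removed:", urn)
```

In DataHub recipes generally, stateful ingestion also relies on a stable top-level pipeline_name so that runs can be matched to each other; check whether this applies to your setup.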

Running the ingestion

Save your recipe to a file (e.g., holistics_recipe.yaml) and run:

datahub ingest -c holistics_recipe.yaml

The connector will output progress information showing how many models, datasets, dashboards, and charts were processed.

Internally, the connector calls the Holistics CLI and reconstructs DataHub entities from AML-native graph nodes and edges.

Verification

After the ingestion completes:

Check DataHub UI - Navigate to your DataHub instance and search for "holistics". You should see your models, datasets, and dashboards.
Verify lineage - Open a dashboard and check the Lineage tab. You should see connections to charts, which connect to models, which connect to source tables.
Check schema - Open a model and look at the Schema tab. Dimensions and measures should appear as fields with appropriate tags.

The canonical AML graph may contain additional concepts such as filter blocks or other non-viz dashboard blocks. These are preserved in the CLI output but are not currently emitted as DataHub chart entities.

Troubleshooting

CLI not found: Ensure the Holistics CLI is installed and in your PATH. Test by running holistics --version.

Git clone fails: For private repositories, make sure your deploy key has read access and the path in deploy_key_file is correct.

No lineage to source tables: Verify your connection_to_platform_map entries match the data_source_name values in your AML models.

Entities missing: Check the ingestion report for filtered or errored entities. Adjust your *_pattern settings if needed.
