Data Build Tool :Hierarchical models in dbt

Prem Vishnoi(cloudvala)
6 min readFeb 18, 2024

--

Hierarchical models in dbt (Data Build Tool) refer to the organization and structuring of dbt models in a way that reflects dependencies and relationships between different datasets within a data warehouse. This approach is essential for managing complex data transformations, ensuring data quality, and facilitating efficient data analysis. Here’s a deeper dive into what hierarchical models are in the context of dbt and how they can be implemented and utilized:

What Are Hierarchical Models?

Hierarchical models in dbt are essentially a series of dbt models (SQL files) that are built on top of each other in a layered architecture. Each model in the hierarchy transforms data at a certain level of abstraction or complexity, starting from raw data sources and moving up to more aggregated, business-focused insights.

Layers in a Hierarchical Model Structure

  1. Source Models (Staging Models): The base of the hierarchy, where raw data from source tables is initially loaded and lightly transformed (e.g., renaming columns, changing data types) to create a clean, consistent staging layer.
  2. Intermediate Models: Built on top of staging models, these perform more complex transformations, such as joining tables, filtering data, and computing new metrics. They serve to structure data in a way that’s more aligned with business logic but may still be quite granular.
  3. Dimension and Fact Models: Drawing from the Data Warehouse dimensional modeling technique, dimension models store descriptive attributes (dimensions), and fact models store quantitative metrics (facts). These models are built from intermediate models and are designed for fast, efficient querying by analytics and BI tools.
  4. Aggregate Models: At the top of the hierarchy, these models provide summarized data, such as daily or monthly aggregates, making them ideal for reports and dashboards. They are built on top of dimension and fact models and are highly optimized for specific analytical purposes.

Implementing Hierarchical Models in dbt

  • Defining Dependencies: dbt automatically understands dependencies based on the ref() function used in SQL models. When you reference another model within a dbt model, dbt ensures that the referenced model is built before the current one.
  • Organizing Models: It’s a good practice to organize your dbt models into directories that reflect their layer in the hierarchy (e.g., staging/, intermediate/, marts/ for dimension and fact models, and aggregates/). This organization helps in understanding the data flow and dependencies at a glance.
  • Testing and Documentation: Hierarchical models benefit greatly from dbt’s built-in testing and documentation capabilities. Tests ensure data integrity and quality at each layer of the hierarchy, while documentation helps maintain a clear understanding of the data transformation process.

Benefits of Hierarchical Models

  • Modularity: Changes in one part of the model hierarchy (e.g., a staging model) can be made independently of others, reducing the risk of unintended consequences in downstream models.
  • Reusability: Higher-level models can reuse components from lower levels, avoiding duplication of logic and effort.
  • Scalability: A hierarchical structure can easily accommodate new data sources and business requirements by adding or modifying specific layers without overhauling the entire data model.
  • Clarity: The hierarchical organization of models makes it easier for team members to understand the data transformation process, from raw data to actionable insights.

In summary, hierarchical models in dbt provide a robust framework for managing complex data transformations in a scalable, maintainable, and understandable way. By leveraging dbt’s features like ref(), tests, and documentation within a structured model hierarchy, teams can ensure their data pipelines remain efficient and reliable as their data ecosystem grows.

Hierarchical models in dbt refer to the relationships between your various data models. These relationships define the order in which different models are executed, ensuring that downstream models have the data they need from upstream models before running.

Here’s a breakdown of hierarchical models in dbt:

Concept:

  • Models depend on other models for their data.
  • This creates a dependency chain or directed acyclic graph (DAG).
  • dbt builds models in the correct order based on this dependency chain.

Benefits:

  • Ensures models are built with necessary data available.
  • Avoids errors from missing data.
  • Makes your data pipeline more maintainable and understandable.

Implementation:

  • Use the source() or ref() function to specify dependencies in your model files.
  • source() takes the raw data source reference.
  • ref() references other dbt models within your project.
  • dbt automatically parses these references and builds models in the correct order.

Example:

Consider a model total_sales that calculates total sales for each product category. It might depend on two other models:

  • product_sales: Calculates individual product sales.
  • product_categories: Provides product category information.

Here’s how you might define the relationships:

SQL

# product_sales.sql
source: "raw_sales_data"
# product_categories.sql
source: "product_data"
# total_sales.sql
source: "{{ ref('product_sales') }}"
source: "{{ ref('product_categories') }}"
# ... rest of your total_sales logic

Additional Notes:

  • You can have multiple dependencies per model.
  • dbt also offers advanced features like materialization strategies to control how models are built and persisted.
  • There are different ways to manage complex data pipelines with multiple levels of hierarchies and cross-schema references.

No hierarchy model creation

Consider you have the following 4 models defined:

  • sales_raw
  • all_weekend_sales
  • total_sales_by_rep
  • new_products_purchased_this_week

If these models do not have a hierarchy or lineage defined in the source, in what order will dbt execute them?

Place the models in the order they will be created, from first to last.

In dbt (Data Build Tool), the execution order of models is determined by their dependencies on each other, as specified by the ref() function within each model's SQL code. Without explicit hierarchies or dependencies (lineage) defined in the source code through ref() calls, dbt does not have a predetermined order based on dependencies to follow. However, if these models are completely independent of each other (i.e., no model uses ref() to refer to another model in this list), dbt will execute them in parallel or in an arbitrary order, as it does not have to respect any particular sequence.

However, in a typical scenario where dependencies are implied by their names or functionalities, we might guess their logical dependencies based on standard data processing practices:

  1. sales_raw: This sounds like the raw data layer, possibly the initial ingestion of sales data into the warehouse. It would logically be the first step in a data pipeline.
  2. new_products_purchased_this_week: This model might depend on sales_raw to filter out sales related to new products only. However, without explicit dependencies, this is just a guess. In practice, it could be independent if it doesn't reference sales_raw directly.
  3. all_weekend_sales: Similarly, this model would logically filter or aggregate sales_raw data for sales that happened over the weekend. Again, without explicit dependencies, its position is speculative.
  4. total_sales_by_rep: This model suggests an aggregation of sales by sales representative, which would likely depend on the raw sales data and possibly aggregate across all sales, including those identified in new_products_purchased_this_week and all_weekend_sales. If it does not explicitly use ref() to depend on these models, its position is also not determined by dependencies.

Without seeing the actual SQL and ref() usage inside these dbt models, we can't definitively say what the execution order will be based on dependencies. In practice, dbt will execute models in parallel where possible, and sequentially where there are dependencies. If no dependencies are defined, the order of execution is not guaranteed and can be considered arbitrary from an external perspective.

To enforce a specific execution order, you would need to explicitly set dependencies between these models using the ref() function in their SQL code. Without such dependencies, the models are treated as independent by dbt during execution.

--

--

Prem Vishnoi(cloudvala)
Prem Vishnoi(cloudvala)

Written by Prem Vishnoi(cloudvala)

Head of Data and ML experienced in designing, implementing, and managing large-scale data infrastructure. Skilled in ETL, data modeling, and cloud computing

No responses yet