Data Engineer: Introduction to testing in dbt
1. Introduction to testing in dbt
Welcome back! In this chapter, we’re going to discuss testing in dbt and what that entails. Let’s get started!
2. What is a test?
The first question you may be wondering is what is a test? In dbt, a test is an assertion or a validation of various dbt objects. This can include models, which we’ve covered thus far. It can also apply to other dbt objects such as sources, seeds, and snapshots. Primarily, tests are used to verify our data is as expected. This can include tests for null values, verifying the values are in range, or the relationships between data. We can also create custom tests for validating specific logic.
3. Test types
There are three kinds of tests in dbt. The first is built-in, which are 4 pre-defined tests available for use. We’ll cover these further in a moment. The other two types are singular and generic. We’ll cover those in later videos.
4. Built-in tests
As mentioned, dbt has 4 built-in tests. The are as follows: Unique, which verifies all values in a column are unique. not_null, which verifies all values in a column are not null. accepted_values verifies all values are within a specific list. These values are listed in a values: option. relationships, which also takes a to: and field: option. It verifies connection of an object to a specific table or column.
5. Where to apply tests?
Model tests are defined in a YAML file within the models directory — other tests, like seed tests are defined in their respective directories. We will name this file model_properties.yml. This file can technically be named anything .yml, such as schema.yml. The naming depends on your preferences or requirements. The actual tests are defined under the tests subheading, under the column name option within the YAML. An example is the easiest way to see the content — here we’re defining tests on two columns of the taxi_rides_raw model. The tpep_pickup_datetime has a not_null test defined. The payment_type column has a not_null test and an accepted values test applied. In this case, we’re verifying that the values are between 1 and 6. You can apply as many tests as desired for each model within your project.
6. Running tests
To actually execute the tests in dbt, we go back to the venerable dbt command, and use the test subcommand, or simply dbt test. This defaults to running tests for our entire project. If we’d like to run it against a specific model, we can use the — select modelname option, such as dbt test — select customers to just run tests for a customer model. The output of the command will indicate whether the tests pass or fail, and which ones. It should be noted that to actually find where a test fails, we need to do a few more steps.
7. Finding failures
To find the specific issues in our data, we need to first look at the compiled SQL code. This normally resides in the target/compiled/projectname/models/model_properties.yml directory. In our case, it will be the target/compiled/nyc_yellow_taxi/models/model_properties.yml/ directory. Look for the appropriate .sql file that matches the failed test or tests. Copy those contents into your database client and check how many rows exist with the issue. You can then remove the data manually from source or modify your model scripts to handle the issue. It’s typically easier to do the second to handle future issues. Once complete, you’ll want to rerun dbt run and dbt test to verify the issue is fixed.
8. Let’s practice!
We’ve covered a lot about testing in this video — let’s practice what we’ve learned in the exercises ahead.
version: 2
models:
- name: taxi_rides_raw
columns:
- name: fare_amount
tests:
- not_null
- name: payment_type
tests:
- not_null
- accepted_values:
values: [1, 2, 3, 4, 5, 6]