dbt Complete Tutorial
What is dbt?
dbt (data build tool) is an open-source command-line tool designed for building data transformations in modern data warehouses. It helps data analysts and engineers transform raw data into meaningful information by applying rules, calculations, and manipulations to create clean, reliable datasets.
Prerequisites to install
· PyCharm / Visual Studio Code
· A database (Oracle, MySQL, PostgreSQL)
Prerequisites to learn
· SQL
How to install dbt on Windows
1. Install Python and PyCharm/Visual Studio Code on Windows, along with any database that dbt supports. Here I am using Visual Studio Code and MySQL.
Steps to install:
step 1. Create a folder named "dbt" on any drive where you want to install it (e.g., C, D, E).
step 2. Create a folder named ".dbt" inside the C:/Users/username folder.
step 3. Use an Integrated Development Environment (IDE) such as PyCharm or Visual Studio Code.
step 4. Install the dbt Power User extension in Visual Studio Code. To do this, open Visual Studio Code --> Extensions --> search for "dbt Power User" --> select it, and click Install.
step 5. After installation, open the dbt folder in Visual Studio Code, then open the terminal.
step 6. Create a virtual environment by running the following command:
python -m venv dbt_venv
step 7. Activate the virtual environment using the corresponding command.
.\dbt_venv\Scripts\Activate.ps1
step 8. Install the dbt adapter for your database, e.g. MySQL:
python -m pip install dbt-mysql
step 9. Run the dbt init command in the terminal to initialize the project and set up the database. Follow the on-screen instructions, where you need to select a project name and database type. The project folder, database connection setup, and profiles.yml file will be created. The profiles.yml file is located in the .dbt folder created in step 2.
step 10. Open the profiles.yml file and update the database connection details (database name, host, port number, username, password) to establish a connection to the database, then save it.
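For reference, a minimal MySQL profile might look like the sketch below. The profile name, credentials, and schema are placeholders, and the exact field names can vary with the adapter version, so check the dbt-mysql adapter documentation:

```yaml
# ~/.dbt/profiles.yml -- all values are illustrative placeholders
my_dbt_project:        # must match the profile name in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: mysql
      server: localhost
      port: 3306
      schema: analytics      # for MySQL, the schema is the database name
      username: dbt_user
      password: dbt_password
```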
step 11. Go to the terminal and run the dbt debug command to check that your database connection details are correct; it should connect to the database.
step 12. If the connection is successful, you are ready to create a model for performing Extract, Transform, and Load (ETL) operations.
How to install packages:
Step 1: Create a packages.yml file in the dbt project folder (the folder name should match the project name from step 9).
Step 2: Go to https://hub.getdbt.com/dbt-labs, search for dbt_utils, copy the code, paste it into the packages.yml file, and save it.
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
Step 3: Run the dbt deps command on the command line. On success, you will see the packages in the dbt_packages folder.
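Once installed, the package's macros can be called from any model. As an illustrative sketch (the model and column names are assumptions, not part of this project), a dbt_utils macro can build a surrogate key:

```sql
-- models/emp_key_example.sql (illustrative)
-- generate_surrogate_key hashes the listed columns into a single key column
select
    {{ dbt_utils.generate_surrogate_key(['empno', 'ename']) }} as emp_key,
    empno,
    ename
from {{ ref('emp_fact') }}
```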
dbt seeds:
dbt seed processes CSV files into the warehouse: it creates a table with DDL (the table name matches the processed file name) and loads the data into that table. All the files to be processed should be placed in the dbt seeds folder.
· dbt seed -s filename
· dbt seed --select filename -- both commands process only the selected CSV file (-s is shorthand for --select; use the file name without the .csv extension).
· dbt seed -- processes all the CSV files under the seeds folder into the warehouse.
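As a sketch, assume a seed file seeds/employees.csv with illustrative contents (the file and its columns are not from this project):

```csv
empno,ename,sal,loc
7369,SMITH,800,INDIA
7499,ALLEN,1600,USA
```

Running dbt seed -s employees would create an employees table with these rows, which models can then reference as {{ ref('employees') }}.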
dbt model:
Models are primarily written as a SELECT statement and saved as a .sql file. A model is a single file containing a final SELECT statement; a project can have multiple models, and models can even reference each other. Across numerous projects, the level of effort required for transforming complex data sets can drop drastically compared to older methods.
Create a .sql file in the dbt models folder (e.g., emp_fact.sql) and write code like the example below. This is just a reference.
{{ config(
materialized="view",
schema="staging"
) }}
WITH source AS (
    SELECT
        customer_id,
        first_name,
        last_name,
        house_number,
        street,
        city,
        email_address
    FROM source_table
)
SELECT * FROM source
· dbt run --select emp_fact -- runs the specific model
· dbt run -- runs all the models under the models folder
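Models reference each other through the ref() macro, which lets dbt resolve the dependency order. A downstream model building on emp_fact might look like this sketch (the model name and aggregation are illustrative assumptions):

```sql
-- models/city_summary.sql (illustrative)
-- ref('emp_fact') resolves to the emp_fact model's relation
-- and records the dependency in dbt's DAG
select
    city,
    count(*) as customer_count
from {{ ref('emp_fact') }}
group by city
```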
dbt test:
We can perform the tests below using dbt test.
· Uniqueness / duplicates – the column has no duplicate (repeated) values
· Not null – there are no null values in the column
· Referential integrity – test that the column values have an existing relationship with a parent table
· Freshness – data meets a freshness SLA based on a predefined timestamp
· Accepted (default) values – check that the column contains only predefined values
The tests to run can be selected using the --select flag, as shown below.
# run tests for one_specific_model
dbt test --select "one_specific_model"
# run tests for all models in package
dbt test --select "some_package.*"
# run only tests defined singularly
dbt test --select "test_type:singular"
# run only tests defined generically
dbt test --select "test_type:generic"
# run singular tests limited to one_specific_model
dbt test --select "one_specific_model,test_type:singular"
# run generic tests limited to one_specific_model
dbt test --select "one_specific_model,test_type:generic"
As a first step, you need to create a .yml file and add code in the format below, updating the model name and columns. Additionally, you can use a few macros to test for not-null values, not-accepted values, etc. You can also write your own macro and apply it.
Example:
An employee table has empno, ename, sal, loc.
- empno is the primary key and should not contain duplicates or null values
- the sal column should not contain 0
- the loc column accepts only default values like (INDIA, USA)
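These three requirements could be expressed in a schema .yml like the sketch below; the model name emp and the use of dbt_utils.not_accepted_values for the sal rule are assumptions for illustration:

```yaml
version: 2
models:
  - name: emp            # illustrative model name
    columns:
      - name: empno
        tests:
          - unique
          - not_null
      - name: sal
        tests:
          - dbt_utils.not_accepted_values:   # fails if any sal value is 0
              values: [0]
      - name: loc
        tests:
          - accepted_values:
              values: ['INDIA', 'USA']
```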
Data tests produce a collection of failed records. Test blocks are used to define generic data tests, which were previously known as schema tests.
There are two methods for defining data tests in dbt:
1. Singular data test
2. Generic data test
Singular data test:
A singular data test is a straightforward SQL query that identifies failing rows. The query can be saved in a .sql file within the tests directory or the models directory.
SQL file name: emp_sal_zero.sql
{{ config(materialized='table') }}
with emp_test as (
    select totalsal from analytics.emp_fact
    where totalsal <= 0
)
select * from emp_test
To run the SQL file, use the following commands:
If the .sql file is in the models directory:
dbt run --select emp_sal_zero
If the .sql file is in the tests directory:
dbt test
When run as a model, the command will create a table called emp_sal_zero and store the failing rows in that table.
Generic data test:
A generic data test can be defined in two ways.
In the models directory, create a .yml file and specify unique, not_null, and accepted_values tests for each column. You can define any number of columns in the file.
Example below:
version: 2
models:
  - name: emp_fact
    database: mysql
    schema: analytics
    description: "Employee Table"
    columns:
      - name: empno
        description: "The primary key for this table"
        tests:
          - unique
          - not_null
      - name: sal_status
        description: "Checking Default values"
        tests:
          - accepted_values:
              values: ['lo','me','hg']
You may also use a predefined generic test from a package, like a dbt_utils macro, in the .yml file along with the above steps.
      - name: sal_status
        description: "Checking Default values"
        tests:
          - dbt_utils.not_accepted_values:
              values: ['lo','me','hg']
Some data tests are generic and can be reused multiple times. These tests are defined in a test block, which contains a parametrized query and accepts arguments. An example of a generic data test may look like this:
.sql file name: assert_negative_value.sql
{% test assert_negative_value(model, column_name) %}
select {{ column_name }}
from {{ model }}
where {{ column_name }} < 0
{% endtest %}
Add the assert_negative_value test to the .yml file as below:
      - name: sal
        description: "To test for negative values in the sal column"
        tests:
          - assert_negative_value
· Run the dbt test command in the terminal; it will run the entire test suite.