Data Build Tool : End to End Project | Installation | Seeds | Model | Test

Data Build Tool : End to End Project | Installation | Seeds | Model | Test

                     DBT Complete Tutorial
What is dbt ?
 
A data build tool (dbt) is an open-source command-line tool .
designed for building data transformations in modern data warehouses.
It helps data analysts and engineers to transform raw data into meaningful information by applying rules, calculations, and manipulations to create clean and reliable datasets.
 
Pre requisite to install
·        Pycharm / visual studio
·        Database(oracle, mysql, postgresql)
 
Pre requisite to learn
·        SQL
 
How to install dbt in windows
 
1.     Install Python and pycharm/visual studio in windows and any database which is supporting dbt.
 
Here I am using visual studio code and MySql.
 
Steps to Installation:
 
step 1. Create a folder named "dbt" on any drive where you desire to install it (e.g., C, D, E).
step 2. create a folder named ".dbt" Inside the C:/user/username folder,
step 3. Use an Integrated Development Environment (IDE) such as PyCharm or Visual Studio.
step 4. Install the dbt power user extension on Visual Studio. To do this,  open Visual Studio --> extensions -->search for "dbt power user," --> select it, and click install.
step 5. After installation, go to visual studio and open dbt folder and the go to open the terminal and run the following command to create virtual env
step 6. Create a virtual environment by executing the appropriate command.
            python -m venv dbt_venv
step 7. Activate the virtual environment using the corresponding command.
            .\dbt_venv\Scripts\Activate.ps1
step 8. Install any desired database software, like MySQL.
            python -m pip install dbt-mysql
 
step 9.  Run dbt init command on terminal to iInitialize the project to set up the database.
 
Follow the onscreen instruction, where you need to select project name and database name.
 
The project folder, database connection setup, and profile.yml file will be created.
The profile.yml file will be located in the .dbt folder created in step 2.
 
step 10. Open the profile.yml file and update database connection details establish a connection to the database like database name, host, port number, username, password. And save it.
 
step 11: go to terminal, run dbt debug command , to check your database connection details are correct, it should connect database.
 
step 12. If the connection is successful, you are ready to create a model for performing Extract, Transform, and Load (ETL) operations
 
How to install packages :
 
Step1: create package.yml file in dbt project folder(name should be same as step 9)
Step2: go to https://hub.getdbt.com/dbt-labs -- search dbt_utils -- copy code and paste it in package.yml file and save it.
 
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
 
step 3: run dbt deps command on command line, on successful you could see packages in dbt_package folder.
 
Dbt seeds:
Dbt seed process csv file into warehouse and create table with ddl, table name should be same as processed file name and load the data into table.
All the processed files should be placed into dbt seeds folder.
           
·        Dbt seed –s filename.csv          
·        Dbt seed –select filename  -- both command will process selected csv file only.
·        Dbt seed – it will process all the csv files under seed folder to warehouse
 
Dbt model:
 
Models are primarily written as a select statement and saved as a .sql.
 
A model is a single file containing a final select statement, and a project can have multiple models, and models can even reference each other. Add to that, numerous projects and the level of effort required for transforming complex data sets can improve drastically compared to older methods.
 
create .sql file in dbt model folder (ex emp_fact.sql) and write a code like below . this is just reference.
 
{{ config(
   materialized="view",
   schema="staging"
) }}
 
WITH source AS (
   SELECT
       customer_id,
       first_name,
       last_name,
       house_number,
       street,
       city,
       email_address
   FROM source_table
)
 
 
 
·        Dbt run –select emp_fact.sql – it will run specific model
·        Dbt run – run à it will run all the model under model folder
 
 
Dbt test :
 
We can perform below tests using dbt test.
 
·        Duplicates – Column don’t have any duplicates (repeated) values
·        Uniqueness - Column has no repeating values
·        Not Null – There are no null values in a column
·        Referential Integrity - Test that the column values having an existing relationship with a parent table
·        Freshness: data should be freshness SLA based on a predefined timestamp
·        Accepted (Default) values – Check that column contain only pre-defined values
 
The tests to run can be selected using the --select flag discussed
# run tests for one_specific_model
dbt test --select "one_specific_model"
 
 # run tests for all models in package
dbt test --select "some_package.*"
 
 # run only tests defined singularly
dbt test --select "test_type:singular"
 
 # run only tests defined generically
dbt test --select "test_type:generic"
 
 # run singular tests limited to one_specific_model
dbt test --select "one_specific_model,test_type:singular"
 
# run generic tests limited to one_specific_model
dbt test --select "one_specific_model,test_type:generic"
 
 
 
first step you need to create .yml file and add code with below format and update model name, columns, additionally you can use few macros to test not null string, not accepted value etc..
 
you can write your own macro and apply.
 
Example:
An employee table having empno, ename, sal, loc
-         empno is primary key and should not contain duplicates and null values
-         sal column not contain 0
-         loc column is accepted only default values like (INDIA,USA )
Data tests produce a collection of failed records. Test blocks are used to define generic data tests, which were previously known as schema tests.
There are two methods for defining data tests in dbt: singular data test and generic data test..
1.     Singular data test
2.     Generic data test
 Singular data test:
A singular data test involves a straightforward SQL query that identifies failing rows. This query can be saved in a .sql file within the test directory or model.
Sql file name emp_sal_zero.sal
 {{ config(materialized='table') }}
with emp_test as(
    select totalsal from analytics.emp_fact
    where totalsal <=0
)  select * from emp_test
To run an SQL file, use the following command:
If the .sql file is in the model directory:
dbt run –select emp_sal_zero
 If the .sql file is in the test directory:
dbt test
 When running the above command, it will create a table called emp_sal_zero and store the results in the table..
 Generic Data test:
 A generic data test defined two ways to test .
In the model directory, create a .yml file and specify unique, not null, and accepted values for each column. You can define any number of columns in the file.
 Example below
version: 2
 
models:
  - name: emp_fact
    database: mysql
    schema: analytics
    description: "Employee Table"
    columns:
      - name: empno
        description: "The primary key for this table"
        tests:
          - unique
          - not_null
      - name: sal_status
        description: "Checking Default values"
        tests:
          - accepted_values:
              values: ['lo','me','hg']
You may also use a predefined generic model like a macro in the .yml file along with the above steps. 
      - name: sal_status
        description: "Checking Default values"
        tests:
          - dbt_utils.not_accepted_values:
              values: ['lo','me','hg']
Some data tests are generic and can be reused multiple times. These tests are defined in a test block, which includes a parametrized query and accepts arguments. An example of a generic data test may look like:
 .sql file name assert_negative_value.sql
 {% test assert_negative_value(model, column_name) %}
     select {{ column_name }}
    from {{ model }}
    where {{ column_name }} is null
 {% endtest %}
 Add   assert_negative_value.sql file name into .yml as below
      - name: columnname
        description: "To Test negative value in sal column"
        tests:
          - assert_negative_value
 
 
·        Run Dbt test command in terminal , it will run entire test file
 

Post a Comment

0 Comments