Data Build Tool: dbt seed command tutorial | CSV file to warehouse

Data Build Tool: dbt seed command tutorial | CSV file to warehouse




What is dbt seed and what is its purpose?

 

1.     dbt is a command line tool utilized by data analysts to efficiently transform data into a data warehouse.

2.     The dbt seed command processes a small set of data directly from a csv/txt file into the data warehouse.

3.     All processed files should be placed in the seeds folder.

4.     This command loads existing csv files located in the seed-paths directory into the warehouse.

5.     Seeds are ideal for static data that changes infrequently. Utilizing seeds for static data helps maintain efficiency in the dbt workflow.

6.     However, frequent changes to seed data can introduce complexity and potentially slow down data transformation processes.


Selecting or excluding certain data for analysis can be done using dbt seed.

 For selecting data, customer data can be saved into a csv/txt file, placed in the seed folder, and loaded into the warehouse using the dbt seed command.

 Similarly, for excluding data, a list of customers/products to be excluded can be saved into a file, placed in the seed folder, and loaded into the warehouse for analysis.

 Loading Small set Datasets

 Small reference datasets that do not change often, such as exchange rates or calendar feeds, can be loaded into the warehouse as seeds for further analysis instead of storing them as tables in the source database.

 Data validation test

 Data validation tests can be conducted by exporting data from an existing system source, running dbt seed model to load the data into the warehouse, and creating actual results to test against existing data.

 How to create dbt seed?

 To create a dbt seed, start by creating a CSV/TXT file to use as a seed. For example, create a sample CSV file containing department data and save it as dept.csv. kept this file in the seeds directory of your dbt project.

 Remember, any file you want to use as a dbt seed must be placed in the seeds folder.

 To Load file data into your warehouse, run the following command:

 dbt seed --select dept.csv

 You can then use these seeds in your models by referencing them in your dbt models using the `ref` function and making joins with existing tables. For example, you could write a query like this:

 sql

select *

from {{ ref('dept') }} d

join existing_table t on d.id = t.id

 

 How handle Column Types – leading 0

 To resolve the issue of numeric values starting with zero being treated as integers and dropping the leading zero when processing CSV files into the data warehouse, a property file for seed needs to be created. This property file will be similar to the ones used for models to load correct data types.

 The updated version of the property file includes a seed named "deptno" with the following configuration:

create properties.yml file in seeds folder and add below code. change table name, column name, data type as per your project.

version: 2
seeds:
  - name: dept
    config:
      #schema: analytics
      column_types:
        code: nvarchar(5)
-----------

Add below details to dbt_project.yml file


name : dept
seeds:
    # This configure seeds/dept.csv
    dept:
      +column_types:
        deptno: varchar(5)




These changes in the seed structure require a full-refresh to apply them to the data warehouse.

 

Additionally, seeds can have specific properties such as tests, customized schema names, modified source paths, and documentation. These properties can be set and adjusted for seeds just like for models. For more detailed information, the official documentation on properties and configurations should be consulted.

Telugu Video -- Telugu

English Video -- English


Post a Comment

0 Comments