What is dbt seed and
what is its purpose?
1. dbt is a command line tool utilized by data analysts
to efficiently transform data into a data warehouse.
2. The dbt seed command processes a small set of data
directly from a csv/txt file into the data warehouse.
3. All processed files should be placed in the seeds
folder.
4. This command loads existing csv files located in the
seed-paths directory into the warehouse.
5. Seeds are ideal for static data that changes
infrequently. Utilizing seeds for static data helps maintain efficiency in the
dbt workflow.
6. However, frequent changes to seed data can introduce
complexity and potentially slow down data transformation processes.
Selecting or excluding
certain data for analysis can be done using dbt seed.
For selecting data, customer data can be saved into a csv/txt file,
placed in the seed folder, and loaded into the warehouse using the dbt seed
command.
Similarly, for excluding
data, a list of customers/products to
be excluded can be saved into a file, placed in the seed folder, and loaded
into the warehouse for analysis.
Loading Small set Datasets
Small reference datasets
that do not change often, such as exchange rates or calendar feeds, can be
loaded into the warehouse as seeds for further analysis instead of storing them
as tables in the source database.
Data validation test
Data validation tests can
be conducted by exporting data from an existing system source, running dbt seed
model to load the data into the warehouse, and creating actual results to test
against existing data.
How to create dbt seed?
To create a dbt seed,
start by creating a CSV/TXT file to use as a seed. For example, create a sample
CSV file containing department data and save it as dept.csv. kept this file in
the seeds directory
of your dbt project.
Remember, any file you
want to use as a dbt seed must be placed in the seeds folder.
To Load file data into your warehouse, run the following command:
dbt seed --select dept.csv
You can then use these
seeds in your models by referencing them in your dbt models using the `ref`
function and making joins with existing tables. For example, you could write a
query like this:
sql
select *
from {{ ref('dept') }} d
join existing_table t on
d.id = t.id
How handle Column Types
– leading 0
To resolve the issue of
numeric values starting with zero being treated as integers and dropping the
leading zero when processing CSV files into the data warehouse, a property file
for seed needs to be created. This property file will be similar to the ones
used for models to load correct data types.
The updated version of the
property file includes a seed named "deptno" with the following
configuration:
create properties.yml file in seeds folder and add below code. change table name, column name, data type as per your project.
version: 2
seeds:
- name: dept
config:
#schema: analytics
column_types:
code: nvarchar(5)
-----------
Add below details to dbt_project.yml file
name : dept
seeds:
# This configure seeds/dept.csv
dept:
+column_types:
deptno: varchar(5)
These changes in the seed
structure require a full-refresh to apply them to the data warehouse.
Additionally, seeds can
have specific properties such as tests, customized schema names, modified
source paths, and documentation. These properties can be set and adjusted for
seeds just like for models. For more detailed information, the official
documentation on properties and configurations should be consulted.
Telugu Video -- Telugu
English Video -- English
0 Comments
Thanks for your message.