Data Build Tool- Seeds complete details

Data Build Tool- Seeds complete details

DBT Seeds

 

1.      The dbt seed command will load CSV file located in the seeds path directory of dbt project

2.      seeds are best suited to static data which changed frequently

3.      Seeds can referenced in downstream models by using ref function.

4.      We cannot put same csv file within dbt folder


1. Processing CSV Files Using dbt Seeds

In dbt, CSV files can be efficiently processed using the seed functionality. To do this, all your CSV files must be placed within the designated seeds directory, typically located at seeds/ in your dbt project folder. Once in the correct folder, you can run dbt seed to load these CSV files into your data warehouse as tables. This method is straightforward and ideal for static reference data or small datasets that do not change frequently.

 2. Adding a New Column and Reprocessing the Same CSV File

If you need to modify an existing CSV file by adding a new column and then reprocess it, you'll encounter an issue. dbt does not support altering seed files in-place with new columns without additional configuration. Attempting to re-run dbt seed on a CSV with an added column will produce an error similar to:

Note: here we have processed emp file which contain deptno no

1054 (42S22): Unknown column 'deptno' in 'field list'

This happens because dbt treats seed tables as static and expects the schema to match exactly. To successfully reprocess with new columns, you must update the seed file, then re-run dbt seed after ensuring that your schema definitions are aligned. Alternatively, consider using incremental models or external table management for more flexible transformation workflows.

 3. Removing an Existing Column and Reprocessing

When you remove a column from your CSV file and reprocess it, dbt will import the data, but the removed column will be present with null values in the resulting table. Essentially, the table schema remains unchanged unless explicitly altered, so the removed column’s data will be null for all records. To fully remove the column, you should update your schema definition accordingly or use schema management tools to alter the table structure post-seed.

  4. Adjusting Column Lengths During CSV Processing

 dbt's seed functionality allows flexibility in column size definitions. You can increase or decrease the length of string columns (e.g., VARCHAR) during processing without encountering errors, provided the data fits within the specified length. This enables you to adapt your schema as data requirements evolve without reconfiguring or reloading the entire dataset.

  5. Processing CSV Files with Integer Columns Containing Leading Zeros (Without Additional Configuration)

When processing CSV files that have integer columns with leading zeros (e.g., "00123"), dbt will automatically strip the leading zeros and store the value as an integer (e.g., 123). This behavior occurs because, by default, integer columns do not preserve leading zeros, which are considered formatting rather than data.

 6. Preserving Leading Zzeros in Seed Data

If you want to maintain leading zeros in your data — for example, storing "00123" as-is — you need to specify the column type as VARCHAR rather than INTEGER. This can be achieved through configuration files:

-Using dbt_project.yml:

dbt_project.yml

seeds:

  your_project_name:

    seed_test:

      +column_types:

        sal: varchar(10)

 

- Using properties.yml in the seed folder:

 

Create a properties.yml file in your seed directory with the following content:

version: 2

 

seeds:

  - name: seed_test

    config:

      column_types:

        sal: varchar(10)

 

 

This configuration instructs dbt to treat the sal column as a string, preserving any leading zeros during processing.

 

 7. Processing CSV Files Located Outside the Seed Folder

 

If your CSV files are stored outside the default seed directory, you can still process them by placing the files anywhere within your dbt project folder structure. Then, specify the path in your dbt_project.yml under the seed-paths setting. For example:

 

seed-paths: ["seeds","models"]

 

After configuring the seed paths, run dbt seed to process all CSV files present in those directories.

 8. Processing Multiple Files into Different Schemas

 

To handle multiple seed files and load them into different schemas, you need to configure your dbt_project.yml accordingly. For example:

 

dbt_project.yml

name: your_project_name

version: 1.0

config-version: 2

 

seeds:

  your_project_name:

    seed_test:

      +column_types:

        sal: varchar(10)

    market:

      +schema: market

      market_test:

        +column_types:

          msal: varchar(10)

 

 

This setup:

 

- Enables seed processing.

- Loads seed_test into the default schema.

- Loads market seed into a specific schema named market.

- Configures market_test seed with custom column types.

 

Ensure that the seed files are named and organized appropriately within your seed directories to match these configurations.

 

**In summary**, leveraging dbt's seed functionality allows for straightforward CSV processing, but managing schema changes, data types, and multiple files requires careful configuration. By adjusting your dbt_project.yml and schema definitions, you can tailor the seed processing to meet your data governance and transformation needs effectively.

 

SNO

Scenario

Steps

Status

Comment

1

Process CSV File

Normal CSV File

Pass

 

Remove Existing Column

Failed

NULL values loaded in removed column

Add New Column

Failed

Column "new" of relation "file" doesnot exist

Change Column Data type INT to CHAR

Pass

Invalid input syntax for type integer

Change Column Data type CHAR to INT

Pass

data will load

Increase/Descrease Column data length

Pass

data will load

2

Process CSV File other than Seed folder

 

 

 

Pleace csv file into any dbt folder

Pass

configure folder into dbt_project.yml{seed} then it will load

3

How to process leading zero values

 

 

 

configure column data type char in dbt_project.yml file

Need to check

 

configure column data type char in properties.yml file

Pass

 

4

Process same two csv files from different folders

 

Unable to process

since these resources have the same name, dbt will be unable to find the correct resource

5

Process two different csv files into different schema ,tables

Ex- seed/marketing/market.csv

 

 

configure dbt_project.yml

Pass

 

configure properties.yml

Pass

need to check

6

Test seed file

 

 

 

add file details in schema.yml in seed folder

Pass

 


 

Post a Comment

0 Comments