Data Build Tool- Seeds complete details

DBT Seeds

1. The dbt seed command will load CSV file located in the seeds path directory of dbt project

2. seeds are best suited to static data which changed frequently

3. Seeds can referenced in downstream models by using ref function.

4. We cannot put same csv file within dbt folder

1. Processing CSV Files Using dbt Seeds

In dbt, CSV files can be efficiently processed using the seed functionality. To do this, all your CSV files must be placed within the designated seeds directory, typically located at seeds/ in your dbt project folder. Once in the correct folder, you can run dbt seed to load these CSV files into your data warehouse as tables. This method is straightforward and ideal for static reference data or small datasets that do not change frequently.

2. Adding a New Column and Reprocessing the Same CSV File

If you need to modify an existing CSV file by adding a new column and then reprocess it, you'll encounter an issue. dbt does not support altering seed files in-place with new columns without additional configuration. Attempting to re-run dbt seed on a CSV with an added column will produce an error similar to:

Note: here we have processed emp file which contain deptno no

1054 (42S22): Unknown column 'deptno' in 'field list'

This happens because dbt treats seed tables as static and expects the schema to match exactly. To successfully reprocess with new columns, you must update the seed file, then re-run dbt seed after ensuring that your schema definitions are aligned. Alternatively, consider using incremental models or external table management for more flexible transformation workflows.

3. Removing an Existing Column and Reprocessing

When you remove a column from your CSV file and reprocess it, dbt will import the data, but the removed column will be present with null values in the resulting table. Essentially, the table schema remains unchanged unless explicitly altered, so the removed column’s data will be null for all records. To fully remove the column, you should update your schema definition accordingly or use schema management tools to alter the table structure post-seed.

4. Adjusting Column Lengths During CSV Processing

dbt's seed functionality allows flexibility in column size definitions. You can increase or decrease the length of string columns (e.g., VARCHAR) during processing without encountering errors, provided the data fits within the specified length. This enables you to adapt your schema as data requirements evolve without reconfiguring or reloading the entire dataset.

5. Processing CSV Files with Integer Columns Containing Leading Zeros (Without Additional Configuration)

When processing CSV files that have integer columns with leading zeros (e.g., "00123"), dbt will automatically strip the leading zeros and store the value as an integer (e.g., 123). This behavior occurs because, by default, integer columns do not preserve leading zeros, which are considered formatting rather than data.

6. Preserving Leading Zzeros in Seed Data

If you want to maintain leading zeros in your data — for example, storing "00123" as-is — you need to specify the column type as VARCHAR rather than INTEGER. This can be achieved through configuration files:

-Using dbt_project.yml:

dbt_project.yml

seeds:

your_project_name:

seed_test:

+column_types:

sal: varchar(10)

- Using properties.yml in the seed folder:

Create a properties.yml file in your seed directory with the following content:

version: 2

seeds:

- name: seed_test

config:

column_types:

sal: varchar(10)

This configuration instructs dbt to treat the sal column as a string, preserving any leading zeros during processing.

7. Processing CSV Files Located Outside the Seed Folder

If your CSV files are stored outside the default seed directory, you can still process them by placing the files anywhere within your dbt project folder structure. Then, specify the path in your dbt_project.yml under the seed-paths setting. For example:

seed-paths: ["seeds","models"]

After configuring the seed paths, run dbt seed to process all CSV files present in those directories.

8. Processing Multiple Files into Different Schemas

To handle multiple seed files and load them into different schemas, you need to configure your dbt_project.yml accordingly. For example:

dbt_project.yml

name: your_project_name

version: 1.0

config-version: 2

seeds:

your_project_name:

seed_test:

+column_types:

sal: varchar(10)

market:

+schema: market

market_test:

+column_types:

msal: varchar(10)

This setup:

- Enables seed processing.

- Loads seed_test into the default schema.

- Loads market seed into a specific schema named market.

- Configures market_test seed with custom column types.

Ensure that the seed files are named and organized appropriately within your seed directories to match these configurations.

**In summary**, leveraging dbt's seed functionality allows for straightforward CSV processing, but managing schema changes, data types, and multiple files requires careful configuration. By adjusting your dbt_project.yml and schema definitions, you can tailor the seed processing to meet your data governance and transformation needs effectively.

SNO	Scenario	Steps	Status	Comment
1	Process CSV File	Normal CSV File	Pass
		Remove Existing Column	Failed	NULL values loaded in removed column
		Add New Column	Failed	Column "new" of relation "file" doesnot exist
		Change Column Data type INT to CHAR	Pass	Invalid input syntax for type integer
		Change Column Data type CHAR to INT	Pass	data will load
		Increase/Descrease Column data length	Pass	data will load
2	Process CSV File other than Seed folder
2	Process CSV File other than Seed folder	Pleace csv file into any dbt folder	Pass	configure folder into dbt_project.yml{seed} then it will load
3	How to process leading zero values
		configure column data type char in dbt_project.yml file	Need to check
		configure column data type char in properties.yml file	Pass
4	Process same two csv files from different folders		Unable to process	since these resources have the same name, dbt will be unable to find the correct resource
5	Process two different csv files into different schema ,tables	Ex- seed/marketing/market.csv
		configure dbt_project.yml	Pass
		configure properties.yml	Pass	need to check
6	Test seed file
6	Test seed file	add file details in schema.yml in seed folder	Pass

Data Build Tool- Seeds complete details

Posted by Sreenu Y

Post a Comment

0 Comments

Popular Posts

Data Build Tool- Seeds complete details

Compare file and table data and dynamically excludes the specified columns in Python

Data Build Tool : End to End Project | Installation | Seeds | Model | Test

Posts

YouTube

Categories

Social Media

Footer Menu Widget

Contact form