Compare file and table data and dynamically excludes the specified columns in Python

Welcome to the data analysis series. Today we will discuss scenarios that involve comparing file-to-file, file-to-table, and table-to-table data. Specifically, we will explore how to compare file and table data while dynamically excluding specified columns such as inserted_date, business_date, dw_load_date, or updated_date.

 

What is the use of system default date columns in the target system?

 

In most data warehousing projects, when loading source data into a target system, we assign system default dates (insert_date, business_date, or dw_load_date) to track the history and kind of data loaded into the target system. This scenario is typically encountered in Slowly Changing Dimension (SCD) projects.

 

Here are the challenges and solutions in different comparison scenarios:

 

1. File to File:

 What is a flat file?

 A flat file is a file that contains data in a plain text format, typically organized as a sequence of records or rows, with each record containing a fixed number of fields or columns. It is called a "flat" file because it does not have any embedded hierarchical structure like a database or XML file.

 What are the challenges in flat file testing?

·        Flat files can often contain large volumes of data, which may create difficulties in processing and validating the entire file efficiently.

·        Flat files can have different versions or variants, making it necessary to handle compatibility issues while validating and processing them.

 The conventional method for validating file data is as follows:

 Typically, if the file contains a small amount of data that can easily be imported into Excel, we import the data from the text file into an Excel file and use either VLOOKUP, or the FC command for CSV/TXT files. However, this process is time-consuming and repetitive: whenever an issue is found, the data must be revalidated after the fix.

 In cases where the file contains a large amount of data that cannot be imported into Excel, we need to consider alternative approaches, such as Unix utilities or third-party tools.

 Occasionally, the file may contain unnecessary columns, such as date columns or default value columns. Prior to performing any validation, it is advisable to remove these columns.

 For example, in the screenshot above, the source file contains columns deptno, dname, and loc, while the target file contains the same columns along with inserted_date. Comparing data between the two files becomes challenging because it is difficult to exclude default system columns like inserted_date.
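Such system default columns can be stripped before the comparison. Here is a minimal pandas sketch; the dept rows are hypothetical sample data inlined as CSV text (a real run would read the actual files):

```python
import io
import pandas as pd

# Inline sample data standing in for the source and target files;
# the target carries an extra system default column, inserted_date.
source_csv = io.StringIO(
    "deptno,dname,loc\n10,ACCOUNTING,NEW YORK\n20,RESEARCH,DALLAS\n")
target_csv = io.StringIO(
    "deptno,dname,loc,inserted_date\n"
    "10,ACCOUNTING,NEW YORK,2023-01-01\n20,RESEARCH,DALLAS,2023-01-01\n")

source = pd.read_csv(source_csv)
target = pd.read_csv(target_csv)

# Drop the system default columns from the target if they are present
excluded = ['inserted_date', 'updated_date', 'dw_load_date']
target = target.drop(columns=[c for c in excluded if c in target.columns])

# After dropping, both sides hold identical data
print(source.equals(target))  # True
```

With the date column removed, the two frames compare cleanly; in practice the `pd.read_csv` calls would point at the real file paths.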

 2. File to Table:

 How do we validate file-to-table or table-to-file data?

 Assume that in your project the source data arrives as a file and must be loaded into a target database. Comparing data between two different systems can be challenging. However, if both the source and target data are in the same format, such as both files or both tables, the comparison becomes much easier: first export the required data from the table into a file, and then compare the two files.

 However, if the table contains a large amount of data that cannot easily be exported to a file, you would need to resort to third-party tools for the comparison.

 For example, in the screenshot above, the source file contains columns deptno, dname, and loc, and the target table has the same columns along with inserted_date. Excluding the default system columns during data comparison becomes difficult: one must manually specify the required columns in the select query and then export the data for comparison.
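The export-then-compare idea can be sketched as below; an in-memory SQLite table stands in for the real target database, and the table name `dept` and its rows are illustrative:

```python
import io
import sqlite3
import pandas as pd

# In-memory SQLite table standing in for the real target database table
conn = sqlite3.connect(':memory:')
conn.execute("create table dept (deptno int, dname text, loc text)")
conn.executemany("insert into dept values (?, ?, ?)",
                 [(10, 'ACCOUNTING', 'NEW YORK'), (20, 'RESEARCH', 'DALLAS')])

# Export only the required columns from the table, as the select query would
table_df = pd.read_sql_query("select deptno, dname, loc from dept", conn)
exported = io.StringIO()  # stands in for the exported file on disk
table_df.to_csv(exported, index=False)

# Both sides are now plain CSV and can be compared like two files
exported.seek(0)
round_trip = pd.read_csv(exported)
source_df = pd.read_csv(io.StringIO(
    "deptno,dname,loc\n10,ACCOUNTING,NEW YORK\n20,RESEARCH,DALLAS\n"))
print(round_trip.equals(source_df))  # True
```

Once the table side is exported with only the required columns, any file comparison technique applies.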

 3. Table to Table:

If the source table has columns like deptno, dname, and loc, and the target table has the same columns along with inserted_date, excluding default system columns during data comparison becomes problematic. Again, one must manually specify the required columns in the select query and export the data for comparison, making it a tedious task.
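For the table-to-table case, the same drop-then-compare pattern applies. This sketch uses in-memory SQLite tables as stand-ins for the real source and target tables (names and rows are illustrative):

```python
import sqlite3
import pandas as pd

# In-memory tables standing in for the real source and target tables
conn = sqlite3.connect(':memory:')
conn.execute("create table src_dept (deptno int, dname text, loc text)")
conn.execute("create table tgt_dept (deptno int, dname text, loc text, inserted_date text)")
conn.executemany("insert into src_dept values (?, ?, ?)",
                 [(10, 'ACCOUNTING', 'NEW YORK'), (20, 'RESEARCH', 'DALLAS')])
conn.executemany("insert into tgt_dept values (?, ?, ?, ?)",
                 [(10, 'ACCOUNTING', 'NEW YORK', '2023-01-01'),
                  (20, 'RESEARCH', 'DALLAS', '2023-01-01')])

src = pd.read_sql_query("select * from src_dept", conn)
tgt = pd.read_sql_query("select * from tgt_dept", conn)

# Drop the system default columns from the target before comparing
excluded = ['inserted_date', 'updated_date', 'dw_load_date']
tgt = tgt.drop(columns=[c for c in excluded if c in tgt.columns])

print(src.equals(tgt))  # True
```

Dropping the columns in pandas avoids hand-writing a column list in the select query.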

 To simplify these scenarios, we have a Python script that addresses the problem: you only need to provide the excluded columns and the source and target input details, and the script will compare the data and output the results for you.

 The code below works for comparing two files; you can modify it as per your requirements.

---------------------

import csv

def compare_files(file1, file2):
    # Read file1 data
    with open(file1, 'r') as f1:
        reader1 = csv.DictReader(f1)
        data1 = [row for row in reader1]

    # Read file2 data, excluding the system default columns
    excluded_columns = ['insert_date', 'updated_date', 'dw_load_date']
    with open(file2, 'r') as f2:
        reader2 = csv.DictReader(f2)
        data2 = [{k: v for k, v in row.items() if k not in excluded_columns}
                 for row in reader2]

    # Compare the common columns row by row
    common_columns = ['deptno', 'dname', 'loc']
    for row1, row2 in zip(data1, data2):
        for column in common_columns:
            if row1[column] != row2[column]:
                print(f"Difference found in {column}: {row1[column]} (file1) vs {row2[column]} (file2)")
                break

    print("Comparison completed.")

# Example usage
compare_files('data/dept.csv', 'data/dept2.csv')

 

In this example, the compare_files function reads both files with csv.DictReader, drops the system default columns (insert_date, updated_date, dw_load_date) from the second file, and then compares the common columns (deptno, dname, loc) row by row, printing any difference it finds.
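As an alternative to a row-by-row loop, pandas can report cell-level differences directly via DataFrame.compare (pandas 1.1+). A minimal sketch with inline sample data, where the second file's insert_date column and the CHICAGO mismatch are illustrative:

```python
import io
import pandas as pd

file_csv = io.StringIO(
    "deptno,dname,loc\n10,ACCOUNTING,NEW YORK\n20,RESEARCH,DALLAS\n")
table_csv = io.StringIO(
    "deptno,dname,loc,insert_date\n"
    "10,ACCOUNTING,NEW YORK,2023-01-01\n20,RESEARCH,CHICAGO,2023-01-01\n")

df1 = pd.read_csv(file_csv)
df2 = pd.read_csv(table_csv).drop(columns=['insert_date'])

# compare() returns only the differing cells; an empty frame means a match
diffs = df1.compare(df2)
print(diffs)  # row 1, column loc: DALLAS (self) vs CHICAGO (other)
```

compare() requires both frames to have identical row and column labels, which is exactly what dropping the system default columns achieves.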

=====================================

The code below works for file-to-table comparison after excluding the specified columns in the target.

 

import pandas as pd
import cx_Oracle as ora  # Oracle driver used by SQLAlchemy under the hood
import sqlalchemy as sa

def compare_data(file, tbl_name, excluded_columns):
    # Read the file into a DataFrame (skip the header row) and convert to a list
    df = pd.read_csv(file, delimiter=',', header=None, skiprows=1)
    file_data = [list(row) for row in df.values]

    # Connect to Oracle and read the table into a DataFrame
    connstr = 'oracle://scott:tiger@localhost:1521'
    conn = sa.create_engine(connstr)
    df = pd.read_sql_query("select * from %s" % tbl_name, con=conn)

    # Remove the excluded columns from the table data and convert to a list
    df1 = df.drop(columns=excluded_columns).reset_index(drop=True)
    table_data = df1.values.tolist()

    # Compare the remaining data and return the result
    if file_data == table_data:
        return "Both file and table have the same data"
    else:
        return "File and table do not have the same data"

# Specify columns to exclude
excluded_columns = ['curr_date']

# Compare file and table data
file = r'G:\Python_Automation\data\dept.csv'
tbl_name = 'dept'
differences = compare_data(file, tbl_name, excluded_columns)

# Print the result
print(differences)

================================

Occasionally, we may need to validate files that have the same structure. If the target does not actually have the default system date columns, the code above fails with an error such as "['curr_date'] not found in axis". To prevent this, the following script checks whether each excluded column exists: if it does, the column is dropped and a new dataframe is built; if not, the table data is used as read.

 

 

import pandas as pd
import cx_Oracle as ora  # Oracle driver used by SQLAlchemy under the hood
import sqlalchemy as sa

def compare_data(file, tbl_name, excluded_columns):
    # Read the file into a DataFrame (skip the header row) and convert to a list
    df = pd.read_csv(file, delimiter=',', header=None, skiprows=1)
    file_data = [list(row) for row in df.values]

    # Connect to Oracle and read the table into a DataFrame
    connstr = 'oracle://scott:tiger@localhost:1521'
    conn = sa.create_engine(connstr)
    df = pd.read_sql_query("select * from %s" % tbl_name, con=conn)

    # Remove the excluded columns only if they actually exist in the table
    cols_to_drop = [c for c in excluded_columns if c in df.columns]
    if cols_to_drop:
        df1 = df.drop(columns=cols_to_drop).reset_index(drop=True)
        table_data = df1.values.tolist()
    else:
        table_data = df.values.tolist()

    # Compare the remaining data and return the result
    if file_data == table_data:
        return "Both file and table have the same data"
    else:
        return "File and table do not have the same data"

# Specify columns to exclude
excluded_columns = ['curr_date']

# Compare file and table data
file = r'G:\Python_Automation\data\dept.csv'
tbl_name = 'dept'
differences = compare_data(file, tbl_name, excluded_columns)

# Print the result
print(differences)

 

-------------------------

If you are looking for different kinds of real-time scenarios, please follow my blog and YouTube channel given below.

YouTube: 64techskills

Blog: 64techskills


 
