How to Exclude Unnecessary Columns and Compare Data in Python | Data Analysis Tutorial

How to Exclude Unnecessary Columns and Compare Data in Python | Data Analysis Tutorial


 

In this article, we will discuss the process of excluding unnecessary columns when comparing data files to tables or tables to tables using the pandas library. This is a commonly encountered scenario in data-related projects where data validation is necessary.


In addition to excluding unnecessary columns, pandas can also be used to filter and compare specific rows of data. By leveraging the filtering capabilities of pandas, you can effectively validate and compare data between different sources or tables.

 

Why target system has default date columns?

 These columns are crucial for tracking the loading and type of data on specific dates. As a result, these columns are mandatory for the project.


However, when comparing the source data to the target data, it is essential to remove these date columns; otherwise, an error will be thrown. Here is a Python code snippet that can help you remove unnecessary columns from the target and perform the comparison.

============= Code===============

import pandas as pd

import cx_Oracle as ora

import csv

import sqlalchemy as sa

 

 

def compare_data(file, tbl_name, excluded_columns):

    # Read file and make df

    file_df = pd.read_csv(file, delimiter=',', header=None, skiprows=1)

    file_data = [list(row) for row in file_df.values]

    print(file_data)

 

    # connect oracle and make df

    connstr = 'oracle://scott:tiger@localhost:1521'

    conn = sa.create_engine(connstr)

 

    target_df = pd.read_sql_query("select * from %s" % tbl_name, con=conn)

 

    # Checking if excluded columns exist in DataFrame columns

    # Remove excluded columns from target and convert list

    if any(col in target_df.columns for col in excluded_columns):

       

        # Removing excluded columns if found in DataFrame

        target_df1 = target_df.drop(columns=[col for col in excluded_columns if col in target_df.columns]).reset_index(drop=True)

       

        #df1 = df.drop(columns=excluded_columns).reset_index(drop=True)

        table_data = target_df1.values.tolist()

        print(type(table_data))

    else :

            # Creating DataFrame if excluded columns not found

            table_data = target_df.values.tolist()

            print(type(table_data))

   

    print(table_data)

 

    # Compare the remaining columns and return the differences

    if file_data == table_data:

        return "Both File and Table having same data"

    else:

        return "Both File and Table are not having same data"

    print("Comparison completed.")

 

 

# Specify columns to exclude

excluded_columns=['curr_date','insert_date','source_name']

 

# Compare file and table data

file = 'G:\Python_Automation\data\dept.csv'

tbl_name = 'dept3'

 

differences = compare_data(file, tbl_name, excluded_columns)

 

# Print the differences

 

print(differences)

print("Success")



If you like it my content , please like, share, subscribe my channel

YouTube video

Post a Comment

0 Comments