In this article, we will discuss the process of
excluding unnecessary columns when comparing data files to tables or tables to
tables using the pandas library. This is a commonly encountered scenario in
data-related projects where data validation is necessary.
In addition to excluding unnecessary columns, pandas can also be used to filter
and compare specific rows of data. By leveraging the filtering capabilities of
pandas, you can effectively validate and compare data between different sources
or tables.
Why target system has default
date columns?
These columns are crucial for tracking the loading
and type of data on specific dates. As a result, these columns are mandatory
for the project.
However, when comparing the source data to the target data, it is essential to
remove these date columns; otherwise, an error will be thrown. Here is a Python
code snippet that can help you remove unnecessary columns from the target and
perform the comparison.
============= Code===============
import pandas as pd
import cx_Oracle as ora
import csv
import sqlalchemy as sa
def compare_data(file, tbl_name, excluded_columns):
# Read file and make df
file_df =
pd.read_csv(file, delimiter=',', header=None, skiprows=1)
file_data =
[list(row) for row in file_df.values]
print(file_data)
#
connect oracle and make df
connstr =
'oracle://scott:tiger@localhost:1521'
conn =
sa.create_engine(connstr)
target_df =
pd.read_sql_query("select * from %s" % tbl_name, con=conn)
#
Checking if excluded columns exist in DataFrame columns
#
Remove excluded columns from target and convert list
if any(col in
target_df.columns for col in excluded_columns):
#
Removing excluded columns if found in DataFrame
target_df1
= target_df.drop(columns=[col for col in excluded_columns if col in
target_df.columns]).reset_index(drop=True)
#df1 =
df.drop(columns=excluded_columns).reset_index(drop=True)
table_data
= target_df1.values.tolist()
print(type(table_data))
else :
#
Creating DataFrame if excluded columns not found
table_data
= target_df.values.tolist()
print(type(table_data))
print(table_data)
#
Compare the remaining columns and return the differences
if file_data ==
table_data:
return
"Both File and Table having same data"
else:
return
"Both File and Table are not having same data"
print("Comparison completed.")
# Specify columns to exclude
excluded_columns=['curr_date','insert_date','source_name']
# Compare file and table data
file = 'G:\Python_Automation\data\dept.csv'
tbl_name = 'dept3'
differences = compare_data(file, tbl_name,
excluded_columns)
# Print the differences
print(differences)
print("Success")

0 Comments
Thanks for your message.