Wlcome to the data analysis series. Today, we will discuss a scenario that involves comparing file-file, file-table, and table-table data. Specifically, we will explore how to compare file and table data while dynamically excluding certain specified columns such as inserted_date, business_date, dw_load_date, or updated_date.
What is use of System default
date column in target system?
In most data warehousing projects, when loading source
data into a target system, we often assign system default dates (insert_date,
business_date, or dw_load_date) to track the history of data loaded and the
type of data loaded into the target system. This scenario is typically
encountered in Slowly Changing Dimension (SCD) type projects.
Here are the challenges and solutions in different
comparison scenarios:
1. File to File:
·
Flat files can often contain large volumes of data,
which may create difficulties in processing and validating the entire file
efficiently.
·
Flat files can have different versions or variants,
making it necessary to handle compatibility issues while validating and
processing them.
If the source table has columns like deptno, dname, and loc,
and the target table has the same columns along with inserted_date, excluding
default system columns during data comparison becomes problematic. Again, one
must manually specify the required columns in the select query and export the
data for comparison, making it a tedious task.
you can modify code accordingly as per your requirement.
---------------------
import csv
def compare_files(file1, file2):
# Read
file1 data
with open(file1, 'r') as f1:
reader1 = csv.DictReader(f1)
data1 = [row for row in reader1]
print(type(data1))
# Read
file2 data, excluding certain columns
with open(file2, 'r') as f2:
reader2 = csv.DictReader(f2)
excluded_columns = ['insert_date', 'updated_date',
'dw_load_date']
data2 = [{k: v for k, v in row.items()
if k not in excluded_columns}
for row in reader2]
print(type(data2))
common_columns = ['deptno', 'dname',
'loc']
for row1,
row2 in zip(data1,
data2):
for column
in common_columns:
if
row1[column] != row2[column]:
print(f"Difference found in {column}: {row1[column]} (file1) vs {row2[column]} (file2)")
break
print("Comparison completed.")
# Example usage
compare_files('data/dept.csv',
'data/dept2.csv')
In this example, the compare_data function
takes the file data, table data, and a list of excluded columns as inputs. It
removes the excluded columns from both the file and table data using the drop
method. Then, it converts the table data to the same data type as the file data
using the astype method. Finally, it compares the remaining columns of the file
and table data using the compare method provided by the pandas library and
returns the differences.
=====================================
Below code is working for file
to table comparing after excluding specified columns in target.
import pandas as pd
import cx_Oracle as ora
import csv
import sqlalchemy as sa
def compare_data(file, tbl_name, excluded_columns):
# Read file and make df
df = pd.read_csv(file, delimiter=',',
header=None, skiprows=1)
file_data = [list(row) for row in
df.values]
print(file_data)
# connect oracle and make df
connstr =
'oracle://scott:tiger@localhost:1521'
conn = sa.create_engine(connstr)
df = pd.read_sql_query("select * from
%s" % tbl_name, con=conn)
# Remove excluded columns from both
file and table data and convert list
if excluded_columns in df.columns:
df1 =
df.drop(columns=excluded_columns).reset_index(drop=True)
table_data =
df1.values.tolist()
print(type(table_data))
else :
#df1 =
df.drop(columns=excluded_columns).reset_index(drop=True)
table_data =
df.values.tolist()
print(type(table_data))
print(table_data)
# Compare the remaining columns and
return the differences
if file_data == table_data:
return "Both File and
Table having same data"
else:
return "Both File and
Table are not having same data"
print("Comparison completed.")
# Specify columns to exclude
excluded_columns = ['curr_date']
# Compare file and table data
file = 'G:\Python_Automation\data\dept.csv'
tbl_name = 'dept'
excluded_columns='curr_date'
differences = compare_data(file, tbl_name,
excluded_columns)
# Print the differences
print(differences)
================================
Occasionally, we may need to validate files that have the
same structure. If we don't have default system date columns, we may encounter
an error stating "['curr_date'] not found in axis" when comparing
files using the above code. To prevent such issues, the following script can be
used. It checks if the column exists, and if it does, it removes it and creates
a new dataframe. If the column is not found, it reads the file and creates the
dataframe.
import pandas as pd
import cx_Oracle as ora
import csv
import sqlalchemy as sa
def compare_data(file, tbl_name, excluded_columns):
# Read file and make df
df = pd.read_csv(file, delimiter=',',
header=None, skiprows=1)
file_data = [list(row) for row in
df.values]
print(file_data)
# connect oracle and make df
connstr =
'oracle://scott:tiger@localhost:1521'
conn = sa.create_engine(connstr)
df = pd.read_sql_query("select * from
%s" % tbl_name, con=conn)
# Remove excluded columns from both
file and table data and convert list
if excluded_columns in df.columns:
df1 = df.drop(columns=excluded_columns).reset_index(drop=True)
table_data =
df1.values.tolist()
print(type(table_data))
else :
#df1 =
df.drop(columns=excluded_columns).reset_index(drop=True)
table_data =
df.values.tolist()
print(type(table_data))
print(table_data)
# Compare the remaining columns and
return the differences
if file_data == table_data:
return "Both File and
Table having same data"
else:
return "Both File and
Table are not having same data"
print("Comparison completed.")
# Specify columns to exclude
excluded_columns = ['curr_date']
# Compare file and table data
file = 'G:\Python_Automation\data\dept.csv'
tbl_name = 'dept'
excluded_columns='curr_date'
differences = compare_data(file, tbl_name,
excluded_columns)
# Print the differences
print(differences)
-------------------------
if you are looking different kind of realtime scenarios
pls follow my blog and youtube channel given below
Youtube link 64techskills
blog 64techskills
1 Comments
One of the tutorial for data validation. thanks for providing. keep continue.
ReplyDeleteThanks for your message.