Python for data comparing
Comparing data between a
source and target is a crucial step in many data-related projects. It involves
checking if the data in the target system matches the expected data from the
source system. This comparison is critical as it helps identify any
discrepancies or errors that may have occurred during the data migration or
integration process.
There are various
challenges when comparing data manually. Firstly, manual comparison can be
time-consuming, especially when dealing with large datasets. It becomes
difficult to review every single record manually, increasing the chances of
missing discrepancies. Secondly, manual comparison is prone to human errors, as
individuals can easily overlook or misinterpret data differences. Additionally,
manual comparison does not scale well when dealing with complex data structures
or when frequent data comparisons are required.
Despite these challenges,
data comparison is essential in data-related projects for multiple reasons. It
ensures data integrity by verifying that the data in the target system matches
the original source accurately. This validation helps to prevent issues such as
data loss, corruption, or inaccuracies, which could adversely affect business
operations or decision-making processes. Data comparison also helps in
identifying and rectifying any data transformation or integration issues, thus
improving data quality and reliability.
To overcome the challenges
of manual data comparison, several tools are available. These tools automate
the process of comparing data between source and target systems, enhancing
efficiency and accuracy. Some common tools for data comparison include:
1. Data comparison
software: These tools compare large
datasets by analyzing various attributes or fields, highlighting any
discrepancies or differences between the source and target systems.
2. Database comparison
tools: These tools compare database
schemas, tables, or records, ensuring the data in the target database matches
the expected structure and content.
3. Change data capture
(CDC) tools: These tools capture and
compare incremental changes in data, tracking modifications or updates made to
the source system and validating their propagation correctly.
4. Data integration
platforms: Many integration platforms
offer built-in data comparison capabilities, enabling efficient comparison
between different data sources and targets.
Using such tools
streamlines the data comparison process, reduces manual effort, and improves
accuracy, making them indispensable for data-related projects.
Here we will see how to
compare source to target system data using python script.
import sys
import os
import pandas as pd
import csv
import cx_Oracle as ora
# read csv file and make
df
def read_csv(file):
try:
df = pd.read_csv(file,
delimiter=',',header=None)
data_list = [list(row) for row in
df.values]
#sort
data_list.sort()
return data_list
except FileNotFoundError:
print("File not found")
except Exception as err:
print("An Exception
error",err)
def
read_oracle(user,password,host,tbl_name):
conn_string = "%s/%s@%s" % (user,
password,host)
try:
conn= ora.connect(conn_string)
cursor=conn.cursor()
cursor.execute("select * from
%s" % (tbl_name))
#sqlquer= 'select * from %s',tbl_name
row = cursor.fetchall()
#df=pd.DataFrame(row,columns=['deptno','dname','loc'])
ora_df_list=[]
for i in row:
ora_df_list.append(list(i))
return ora_df_list
if conn:
cursor.close()
conn.close()
except ora.Error as error:
print(error)
#
file='G:\ETL_Automation\data\dept.csv'
# read_csv(file)
def
csv_db_compare(file,user,password,host,table):
try:
filedf=read_csv(file)
oradf=read_oracle(user,password,host,table)
if len(filedf)==len(oradf):
if filedf== oradf:
print("Count matching and
data matching")
#print("Both are
identical")
else:
print("Count matching but
data not matching")
#print("Both are not same")
else:
print("Count not
matching")
print("File count is
",len(filedf))
print("table count is
",len(oradf))
except Exception as err:
print("undefine error",err)
user='scott'
passowrd='tiger'
host='localhost'
dbname='oracle'
tbl_name='dept'
#
read_oracle(user,passowrd,host,tbl_name)
file='G:\ETL_Automation\data\dept.csv'
# read_csv(file)
csv_db_compare(file,user,passowrd,host,tbl_name)
Youtube Video link in
Telugu
https://youtu.be/njGxidaytD8
for English
https://youtu.be/iazy-I6apv8
0 Comments
Thanks for your message.