Comparing CSV File data to table Using Pandas

Comparing CSV File data to table Using Pandas

Python for data comparing


Comparing data between a source and target is a crucial step in many data-related projects. It involves checking if the data in the target system matches the expected data from the source system. This comparison is critical as it helps identify any discrepancies or errors that may have occurred during the data migration or integration process.

 

There are various challenges when comparing data manually. Firstly, manual comparison can be time-consuming, especially when dealing with large datasets. It becomes difficult to review every single record manually, increasing the chances of missing discrepancies. Secondly, manual comparison is prone to human errors, as individuals can easily overlook or misinterpret data differences. Additionally, manual comparison does not scale well when dealing with complex data structures or when frequent data comparisons are required.

 

Despite these challenges, data comparison is essential in data-related projects for multiple reasons. It ensures data integrity by verifying that the data in the target system matches the original source accurately. This validation helps to prevent issues such as data loss, corruption, or inaccuracies, which could adversely affect business operations or decision-making processes. Data comparison also helps in identifying and rectifying any data transformation or integration issues, thus improving data quality and reliability.

 

To overcome the challenges of manual data comparison, several tools are available. These tools automate the process of comparing data between source and target systems, enhancing efficiency and accuracy. Some common tools for data comparison include:

 

1. Data comparison software: These tools compare large datasets by analyzing various attributes or fields, highlighting any discrepancies or differences between the source and target systems.

2. Database comparison tools: These tools compare database schemas, tables, or records, ensuring the data in the target database matches the expected structure and content.

3. Change data capture (CDC) tools: These tools capture and compare incremental changes in data, tracking modifications or updates made to the source system and validating their propagation correctly.

4. Data integration platforms: Many integration platforms offer built-in data comparison capabilities, enabling efficient comparison between different data sources and targets.

 

Using such tools streamlines the data comparison process, reduces manual effort, and improves accuracy, making them indispensable for data-related projects.

Here we will see how to compare source to target system data using python script.

 

 

import sys

import os

import pandas as pd

import csv

import cx_Oracle as ora

 

# read csv file and make df

def read_csv(file):

 

    try:

 

        df = pd.read_csv(file, delimiter=',',header=None)

        data_list = [list(row) for row in df.values]

        #sort

        data_list.sort()

 

        return data_list

    except FileNotFoundError:

        print("File not found")

    except Exception as err:

        print("An Exception error",err)

 

def read_oracle(user,password,host,tbl_name):

    conn_string = "%s/%s@%s" % (user, password,host)

   

    try:

        conn= ora.connect(conn_string)

        cursor=conn.cursor()

        cursor.execute("select * from %s" % (tbl_name))

        #sqlquer= 'select * from %s',tbl_name

        row = cursor.fetchall()

        #df=pd.DataFrame(row,columns=['deptno','dname','loc'])

 

        ora_df_list=[]

        for i in row:

            ora_df_list.append(list(i))

        return ora_df_list

        if conn:

            cursor.close()

            conn.close()

 

    except ora.Error as error:

        print(error)

# file='G:\ETL_Automation\data\dept.csv'

# read_csv(file)

 

 

 

 

def csv_db_compare(file,user,password,host,table):

 

 

    try:

        filedf=read_csv(file)

        oradf=read_oracle(user,password,host,table)

        if len(filedf)==len(oradf):

               

            if filedf== oradf:

                print("Count matching and data matching")

                #print("Both are identical")

               

            else:

                print("Count matching but data not matching")

                #print("Both are not same")

 

        else:

            print("Count not matching")

            print("File count is ",len(filedf))

            print("table count is ",len(oradf))

    except Exception as err:

        print("undefine error",err)

 

user='scott'

passowrd='tiger'

host='localhost'

dbname='oracle'

tbl_name='dept'

# read_oracle(user,passowrd,host,tbl_name)

file='G:\ETL_Automation\data\dept.csv'

# read_csv(file)

csv_db_compare(file,user,passowrd,host,tbl_name)

 

Youtube Video link in Telugu

https://youtu.be/njGxidaytD8

 

for English

https://youtu.be/iazy-I6apv8


Post a Comment

0 Comments