How to Exclude Unnecessary Columns and Compare Data in Python

In this article, we will discuss the process of excluding unnecessary columns when comparing data files to tables or tables to tables using the pandas library. This is a commonly encountered scenario in data-related projects where data validation is necessary.

In addition to excluding unnecessary columns, pandas can also be used to filter and compare specific rows of data. By leveraging the filtering capabilities of pandas, you can effectively validate and compare data between different sources or tables.

Why target system has default date columns?

These columns are crucial for tracking the loading and type of data on specific dates. As a result, these columns are mandatory for the project.

However, when comparing the source data to the target data, it is essential to remove these date columns; otherwise, an error will be thrown. Here is a Python code snippet that can help you remove unnecessary columns from the target and perform the comparison.

============= Code===============

import pandas as pd

import cx_Oracle as ora

import csv

import sqlalchemy as sa

def compare_data(file, tbl_name, excluded_columns):

# Read file and make df

file_df = pd.read_csv(file, delimiter=',', header=None, skiprows=1)

file_data = [list(row) for row in file_df.values]

print(file_data)

# connect oracle and make df

connstr = 'oracle://scott:tiger@localhost:1521'

conn = sa.create_engine(connstr)

target_df = pd.read_sql_query("select * from %s" % tbl_name, con=conn)

# Checking if excluded columns exist in DataFrame columns

# Remove excluded columns from target and convert list

if any(col in target_df.columns for col in excluded_columns):

# Removing excluded columns if found in DataFrame

target_df1 = target_df.drop(columns=[col for col in excluded_columns if col in target_df.columns]).reset_index(drop=True)

#df1 = df.drop(columns=excluded_columns).reset_index(drop=True)

table_data = target_df1.values.tolist()

print(type(table_data))

else :

# Creating DataFrame if excluded columns not found

table_data = target_df.values.tolist()

print(type(table_data))

print(table_data)

# Compare the remaining columns and return the differences

if file_data == table_data:

return "Both File and Table having same data"

else:

return "Both File and Table are not having same data"

print("Comparison completed.")

# Specify columns to exclude

excluded_columns=['curr_date','insert_date','source_name']

# Compare file and table data

file = 'G:\Python_Automation\data\dept.csv'

tbl_name = 'dept3'

differences = compare_data(file, tbl_name, excluded_columns)

# Print the differences

print(differences)

print("Success")

If you like it my content , please like, share, subscribe my channel

How to Exclude Unnecessary Columns and Compare Data in Python | Data Analysis Tutorial

YouTube video

Posted by Sreenu Y

Post a Comment

0 Comments

YouTube

Popular Posts

How to Exclude Unnecessary Columns and Compare Data in Python | Data Analysis Tutorial

Data Build Tool: dbt Testing complete Guide

Write SQL Query to fill the missing months with previous months salary

Categories

Posts

Social Media

Footer Menu Widget

Contact form