Find Duplicates and Null Values in a Table using Pandas

Find Duplicates and Null Values in a Table using Pandas

How to Efficiently Check for Duplicate and Null Values in Columns

Traditionally, checking for duplicate values in a column required manual effort, involving grouping the column and counting records with a count greater than 1. Similarly, identifying null values involved filtering the data using "is null" on a column basis. This process becomes time-consuming when dealing with multiple tables.

To overcome these challenges, Python pandas provides a more dynamic and efficient solution. Here are the simple steps to achieve this:

1. Import the pandas library.

2. Read the dataset into a pandas DataFrame.

3. Use the `duplicated()` function to check for duplicate values. This function returns a boolean Series indicating whether each row is a duplicate or not.

4. To count the number of duplicate values in each column, apply the `sum()` function to the boolean Series obtained in the previous step. This will give you a count of duplicates for each column.

5. Use the `isnull()` function to check for null values. This function returns a boolean Series indicating whether each value is null or not.

6. To count the number of null values in each column, apply the `sum()` function to the boolean Series obtained in the previous step. This will give you a count of nulls for each column.


By utilizing these pandas functions, you can efficiently check for duplicate and null values in columns without the need for manual effort.


-------------------Python Code-------------

import pandas as pd
import cx_Oracle
from sqlalchemy import create_engine


def read_oracle_check_duplicates_nulls(conn, table_name, column_name):
"""
This function reads an Oracle table using pandas, checks for duplicates and null values in a specified column,
and returns a dictionary containing the results.

Args:
conn: A valid cx_Oracle connection object.
table_name: The name of the Oracle table to read.
column_name: The name of the column to check for duplicates and nulls.

Returns:
A dictionary containing information about duplicates and null values, or None if an error occurs.
"""

try:
# Read the Oracle table into a Pandas DataFrame
sql = f"SELECT * FROM {table_name}"
df = pd.read_sql(sql, con=conn)

# Check for duplicates
duplicates_count = df[column_name].duplicated().sum()
duplicates_info = None
if duplicates_count > 0:
duplicates_info = df[df[column_name].duplicated()][column_name].tolist()

# Check for null values
null_count = df[column_name].isnull().sum()

# Return results as a dictionary
return {
"table_name": table_name,
"column_name": column_name,
"duplicates_count": duplicates_count,
"duplicates_info": duplicates_info,
"null_count": null_count,
}
except Exception as e:
print(f"Error reading Oracle table or checking duplicates/nulls: {e}")
return None


# Example usage
conn = create_engine('oracle://project:tiger@localhost:1521')
table_name = "product"
column_name = "pr_id"

result = read_oracle_check_duplicates_nulls(conn, table_name, column_name)

if result:
print(f"Table: {result['table_name']}")
print(f"Column: {result['column_name']}")
print(f"Duplicates: {result['duplicates_count']}")
print(f"Duplicates info: {result['duplicates_info']}")
print(f"Null values: {result['null_count']}")
else:
print("Error occurred, please check logs for details")

-----------------------------------

Test result



This function first establishes a connection to the Oracle database using #sqlalchemy. Then, it uses the pd.read_sql function to read the specified table into a Pandas DataFrame.

Next, it checks for duplicates in the specified column by using the duplicated method on the DataFrame. If duplicates are found, it retrieves a list of the duplicate values.

Finally, it checks for null values in the specified column using the isnull method and counts the number of nulls.

The results are then returned as a dictionary. If any errors occur during the process, the function prints an error message and returns None.

Remember to replace "username/password@dsn" with your actual Oracle database credentials and "MY_TABLE" and "MY_COLUMN" with the names of your table and column, respectively.

I hope this helps! Let me know if you have any other questions.


Post a Comment

0 Comments