Data Engineering Real-Time Scenario - For Practice

Data Engineering Project Scenario

 

In this post, I share a real-time project scenario designed for practice.

This end-to-end example shows how a real-world data engineering project operates, from file ingestion to loading the target table. I also include several scenarios to sharpen your coding and problem-solving skills.

The project covers reading data from files, handling file format errors, dynamically converting different date formats, and ensuring data quality through cleaning. You will also learn best practices for loading data into target tables, including efficient data clearing and closing procedures.

Scenarios this detailed and practical are hard to find elsewhere, including on YouTube. The hands-on approach aims to prepare you for real industry challenges.

  

Sample JSON file format

[

  {

    "author": "xyz",

    "file_name": "customer",

    "file type": "csv",

    "table": "customer"

  }

]
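A message like the sample above can be parsed and checked before any file is touched. The following is a minimal sketch, not the project's actual code: the helper names, the set of supported types, and the choice to treat "file type" and "file_type" as the same key are all assumptions made for illustration.

```python
import json

# Required keys follow the sample message; everything else here is hypothetical.
REQUIRED_KEYS = ("file_name", "file_type", "table")
SUPPORTED_TYPES = {"csv", "xls", "txt"}


def normalize_keys(record):
    """Lower-case keys and replace spaces with underscores ("file type" -> "file_type")."""
    return {key.strip().lower().replace(" ", "_"): value for key, value in record.items()}


def parse_message(raw):
    """Parse the JSON message and return a list of validated file descriptors."""
    records = json.loads(raw)
    if isinstance(records, dict):  # tolerate a single object instead of a list
        records = [records]
    validated = []
    for index, record in enumerate(records):
        record = normalize_keys(record)
        # A key counts as missing when it is absent, null, or blank.
        missing = [key for key in REQUIRED_KEYS if not str(record.get(key) or "").strip()]
        if missing:
            raise ValueError(f"record {index}: missing {', '.join(missing)}")
        file_type = str(record["file_type"]).strip().lower()
        if file_type not in SUPPORTED_TYPES:
            raise ValueError(f"record {index}: unsupported file type {file_type!r}")
        record["file_type"] = file_type
        validated.append(record)
    return validated


sample = '[{"author": "xyz", "file_name": "customer", "file type": "csv", "table": "customer"}]'
print(parse_message(sample))
```

Raising ValueError is one way to surface the "display an error message" requirement; in a real pipeline the caller would probably log the error and skip or dead-letter the message.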

 

File process scenarios

 

1.     The user uploads a CSV, Excel, or text file through the frontend application.

2.     The file is then stored in an S3 bucket.

3.     A Kafka message is generated in JSON format containing the relevant file information.

4.     Once the Kafka message is produced, the ingestion process begins: it reads the file details from the JSON message and loads the file's data into the target table.

5.     Read the JSON file to access its content.

6.     Then read the CSV file from the specified S3 bucket path (for practice, use a local folder).

7.     If the file type is CSV, XLS, or TXT, process the file accordingly. For CSV files, ingest the data into the related table.

8.     The same approach applies for XLS and TXT files.

9.     If the file, file type, or table information is missing, display an error message.

10.    If the file contains dates in formats other than YYYY-MM-DD, convert those date fields to YYYY-MM-DD while loading the data into the target table.
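The file-reading and date-conversion steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: the function names, the `date_columns` parameter, and the list of candidate formats are hypothetical, and Excel files are left out because they need a third-party reader.

```python
import csv
from datetime import datetime

# Candidate input formats, tried in order. The list is an assumption: an ambiguous
# value such as 03/04/2024 is read as DD/MM/YYYY because that format comes first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d")


def to_iso_date(value):
    """Return the value as YYYY-MM-DD, or raise ValueError if no format matches."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"unrecognised date: {value!r}")


def read_rows(path, file_type, date_columns=(), delimiter=","):
    """Read a CSV or TXT file into a list of dicts with normalized date columns."""
    if file_type not in ("csv", "txt"):
        # Excel is deliberately not handled in this sketch.
        raise ValueError(f"unsupported file type in this sketch: {file_type!r}")
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle, delimiter=delimiter))
    for row in rows:
        for column in date_columns:
            row[column] = to_iso_date(row[column])
    return rows


print(to_iso_date("31/12/2024"))  # 2024-12-31
```

Trying formats in a fixed order is simple but cannot resolve truly ambiguous dates, so in practice the expected format per source file is worth agreeing on up front. For a local practice run, `read_rows("customer.csv", "csv", date_columns=["joined"])` would return the rows ready for loading, where the file and column names are made up for the example.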

 

File validation

 

1.     Ensure all sheet names are in lowercase only.

2.     If a sheet name contains any special characters, numbers, or double spaces, throw an error.

3.     In an Excel workbook with multiple sheets, the sheets must appear in the expected order (e.g., 1, 2, 3).

4.     Verify that the columns are in the correct order.

5.     If any column contains data that doesn't match the defined data type, throw an error and do not load the data.
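The validation rules above could be expressed as small functions that each return a list of error messages. This is a rough sketch: the function names, the sheet-name pattern (lowercase letters separated by single spaces), and the idea of describing column types with converter functions are assumptions, not a prescribed design. The inputs are plain lists and dicts, so the checks work regardless of which library reads the workbook.

```python
import re

# Lowercase words separated by single spaces; digits, symbols, and double spaces fail.
SHEET_NAME_PATTERN = re.compile(r"[a-z]+(?: [a-z]+)*")


def validate_sheet_names(actual, expected):
    """Check the naming rules and the sheet order."""
    errors = []
    for name in actual:
        if not SHEET_NAME_PATTERN.fullmatch(name):
            errors.append(f"invalid sheet name: {name!r}")
    if list(actual) != list(expected):
        errors.append(f"sheet order {list(actual)} does not match {list(expected)}")
    return errors


def validate_columns(header, expected_columns):
    """Check that the columns are present in the expected order."""
    if list(header) != list(expected_columns):
        return [f"columns {list(header)} do not match {list(expected_columns)}"]
    return []


def validate_types(rows, schema):
    """Check every value against schema, a mapping of column name to converter (int, float, ...)."""
    errors = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for column, convert in schema.items():
            try:
                convert(row[column])
            except (KeyError, TypeError, ValueError):
                errors.append(f"row {line_no}, column {column!r}: bad value {row.get(column)!r}")
    return errors


rows = [{"id": "1", "amount": "9.5"}, {"id": "x", "amount": "3"}]
print(validate_types(rows, {"id": int, "amount": float}))
```

Collecting every error before deciding keeps the "do not load the data" rule easy to enforce: the load step runs only when all three lists come back empty.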

 


File link: DE Scenario Doc
