Drop Duplicate Rows - Practice Coding Problems

Drop Duplicate Rows - Problem

Databse Easy

You have a DataFrame called customers with the following structure:

Column Name	Type
customer_id	int
name	object
email	object

There are some duplicate rows in the DataFrame based on the email column.

Write a solution to remove these duplicate rows and keep only the first occurrence.

Return the cleaned DataFrame in the same format.

Input & Output

Example 1 — Basic Duplicate Removal

$ Input: customers = [{"customer_id": 1, "name": "Ella", "email": "[email protected]"}, {"customer_id": 2, "name": "David", "email": "[email protected]"}, {"customer_id": 3, "name": "Zachary", "email": "[email protected]"}, {"customer_id": 4, "name": "Alice", "email": "[email protected]"}]

› Output: [{"customer_id": 1, "name": "Ella", "email": "[email protected]"}, {"customer_id": 2, "name": "David", "email": "[email protected]"}, {"customer_id": 3, "name": "Zachary", "email": "[email protected]"}]

💡 Note: Row with customer_id=4 (Alice) is removed because [email protected] already exists in row with customer_id=1 (Ella). We keep the first occurrence.

Example 2 — Multiple Duplicates

$ Input: customers = [{"customer_id": 1, "name": "John", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}, {"customer_id": 3, "name": "Johnny", "email": "[email protected]"}, {"customer_id": 4, "name": "Robert", "email": "[email protected]"}]

› Output: [{"customer_id": 1, "name": "John", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}]

💡 Note: Both [email protected] and [email protected] have duplicates. We keep only the first occurrence of each email: John (ID=1) and Bob (ID=2).

Example 3 — No Duplicates

$ Input: customers = [{"customer_id": 1, "name": "Alice", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}]

› Output: [{"customer_id": 1, "name": "Alice", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}]

💡 Note: All emails are unique, so no rows are removed. The DataFrame remains unchanged.

Constraints

1 ≤ customers.length ≤ 10⁴
customer_id, name, and email are non-empty
All customer_id values are unique

Visualization

Tap to expand

Understanding the Visualization

Input DataFrame

Customer data with potential duplicate emails

Process

Identify and remove duplicate emails, keeping first occurrence

Output

Clean DataFrame with unique emails only

Key Takeaway

🎯 Key Insight: Use hash set to track seen emails for efficient O(1) duplicate detection

Asked in

N Netflix 25 S Spotify 20 U Uber 15

The key insight is to track seen emails using a hash set for O(1) lookup time. The optimal approach uses pandas' drop_duplicates() function or manual hash set filtering. Time: O(n), Space: O(n).

Common Approaches

Approach	Time	Space	Notes
✓ Pandas drop_duplicates()	O(n)	O(n)	Use pandas built-in method to efficiently remove duplicates
Manual Duplicate Detection	O(n²)	O(n)	Iterate through each row and manually check for duplicate emails

Pandas drop_duplicates() — Algorithm Steps

Step 1: Call drop_duplicates() on DataFrame
Step 2: Specify subset=['email'] to check only email column
Step 3: Use keep='first' parameter
Step 4: Reset index to clean up result

Visualization

Tap to expand

Step-by-Step Walkthrough

Hash Set

Use hash set for O(1) email lookup

Single Pass

Process each row once

Filter

Keep only first occurrence of each email

Code -

solution.c — C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cjson/cJSON.h>

typedef struct {
    int customer_id;
    char name[100];
    char email[100];
} Customer;

int solution(Customer* customers, int n, Customer* result) {
    char seenEmails[1000][100];
    int seenCount = 0;
    int resultCount = 0;
    
    for (int i = 0; i < n; i++) {
        int found = 0;
        for (int j = 0; j < seenCount; j++) {
            if (strcmp(customers[i].email, seenEmails[j]) == 0) {
                found = 1;
                break;
            }
        }
        
        if (!found) {
            strcpy(seenEmails[seenCount], customers[i].email);
            seenCount++;
            result[resultCount] = customers[i];
            resultCount++;
        }
    }
    
    return resultCount;
}

int main() {
    char input[10000];
    fgets(input, sizeof(input), stdin);
    input[strcspn(input, "\n")] = 0;
    
    cJSON *json = cJSON_Parse(input);
    int n = cJSON_GetArraySize(json);
    
    Customer customers[1000];
    Customer result[1000];
    
    for (int i = 0; i < n; i++) {
        cJSON *item = cJSON_GetArrayItem(json, i);
        customers[i].customer_id = cJSON_GetObjectItem(item, "customer_id")->valueint;
        strcpy(customers[i].name, cJSON_GetObjectItem(item, "name")->valuestring);
        strcpy(customers[i].email, cJSON_GetObjectItem(item, "email")->valuestring);
    }
    
    int resultCount = solution(customers, n, result);
    
    cJSON *output = cJSON_CreateArray();
    for (int i = 0; i < resultCount; i++) {
        cJSON *obj = cJSON_CreateObject();
        cJSON_AddNumberToObject(obj, "customer_id", result[i].customer_id);
        cJSON_AddStringToObject(obj, "name", result[i].name);
        cJSON_AddStringToObject(obj, "email", result[i].email);
        cJSON_AddItemToArray(output, obj);
    }
    
    char *jsonString = cJSON_Print(output);
    printf("%s\n", jsonString);
    
    cJSON_Delete(json);
    cJSON_Delete(output);
    free(jsonString);
    
    return 0;
}

Time & Space Complexity

Time Complexity

⏱️

O(n)

Pandas uses optimized algorithms for duplicate detection

✓ Linear Growth

Space Complexity

O(n)

Creates new DataFrame with unique rows

⚡ Linearithmic Space

22.3K Views

High Frequency

~10 min Avg. Time

890 Likes

Ln 1, Col 1

Smart Actions

💡 Explanation

AI Ready

💡 Suggestion Tab to accept Esc to dismiss

// Output will appear here after running code

Code Editor Closed

Click the red button to reopen

Input & Output

Constraints

Visualization

Related Problems

Common Approaches

Pandas drop_duplicates() — Algorithm Steps

Visualization

Code -

Time & Space Complexity

Select Compiler