Drop Duplicate Rows - Problem
You have a DataFrame called customers with the following structure:
| Column Name | Type |
|---|---|
| customer_id | int |
| name | object |
| email | object |
There are some duplicate rows in the DataFrame based on the email column.
Write a solution to remove these duplicate rows and keep only the first occurrence.
Return the cleaned DataFrame in the same format.
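One possible approach, sketched here with pandas, is to call `DataFrame.drop_duplicates` restricted to the email column. The function name `drop_duplicate_emails` and the sample addresses are assumptions made for illustration; they do not come from the problem statement.

```python
import pandas as pd


def drop_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    # subset="email" compares rows on the email column only;
    # keep="first" retains the earliest row for each email.
    return customers.drop_duplicates(subset="email", keep="first")


# Hypothetical sample data: the addresses below are placeholders.
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "name": ["Ella", "David", "Zachary", "Alice"],
        "email": [
            "a@example.com",
            "b@example.com",
            "c@example.com",
            "a@example.com",
        ],
    }
)

cleaned = drop_duplicate_emails(customers)
print(cleaned)
```

This returns a new DataFrame and leaves `customers` untouched. The surviving rows keep their original index labels, so call `reset_index(drop=True)` on the result if a fresh 0-based index is needed.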
Input & Output
Example 1 — Basic Duplicate Removal
Input:
customers = [{"customer_id": 1, "name": "Ella", "email": "[email protected]"}, {"customer_id": 2, "name": "David", "email": "[email protected]"}, {"customer_id": 3, "name": "Zachary", "email": "[email protected]"}, {"customer_id": 4, "name": "Alice", "email": "[email protected]"}]
Output:
[{"customer_id": 1, "name": "Ella", "email": "[email protected]"}, {"customer_id": 2, "name": "David", "email": "[email protected]"}, {"customer_id": 3, "name": "Zachary", "email": "[email protected]"}]
💡 Note:
The row with customer_id=4 (Alice) is removed because [email protected] already appears in the row with customer_id=1 (Ella). We keep the first occurrence.
Example 2 — Multiple Duplicates
Input:
customers = [{"customer_id": 1, "name": "John", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}, {"customer_id": 3, "name": "Johnny", "email": "[email protected]"}, {"customer_id": 4, "name": "Robert", "email": "[email protected]"}]
Output:
[{"customer_id": 1, "name": "John", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}]
💡 Note:
Both [email protected] and [email protected] have duplicates. We keep only the first occurrence of each email: John (ID=1) and Bob (ID=2).
Example 3 — No Duplicates
Input:
customers = [{"customer_id": 1, "name": "Alice", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}]
Output:
[{"customer_id": 1, "name": "Alice", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}]
💡 Note:
All emails are unique, so no rows are removed. The DataFrame remains unchanged.
Constraints
- 1 ≤ customers.length ≤ 10^4
- customer_id, name, and email are non-empty
- All customer_id values are unique
Understanding the Visualization
1. Input DataFrame: Customer data with potential duplicate emails
2. Process: Identify and remove duplicate emails, keeping the first occurrence
3. Output: Clean DataFrame with unique emails only
Key Takeaway
🎯 Key Insight: Use a hash set to track the emails already seen. Each membership check takes O(1) time on average, so a single pass over n rows runs in O(n).
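The hash-set idea can be sketched in plain Python over a list of row dictionaries, matching the input format shown in the examples. The function name `drop_duplicate_emails_manual` and the sample addresses are hypothetical, chosen only to illustrate the technique.

```python
def drop_duplicate_emails_manual(rows):
    seen = set()   # emails encountered so far
    kept = []      # rows retained, in original order
    for row in rows:
        email = row["email"]
        if email not in seen:   # average O(1) membership check
            seen.add(email)
            kept.append(row)
    return kept


# Hypothetical sample data: the addresses below are placeholders.
rows = [
    {"customer_id": 1, "name": "John", "email": "x@example.com"},
    {"customer_id": 2, "name": "Bob", "email": "y@example.com"},
    {"customer_id": 3, "name": "Johnny", "email": "x@example.com"},
    {"customer_id": 4, "name": "Robert", "email": "y@example.com"},
]

result = drop_duplicate_emails_manual(rows)
print([r["customer_id"] for r in result])
```

Because rows are visited in order and an email is recorded the first time it appears, the first occurrence is always the one that survives. The set holds at most one entry per distinct email, so the extra memory is O(n) in the worst case.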