Drop Duplicate Rows - Problem

You have a DataFrame called customers with the following structure:

Column NameType
customer_idint
nameobject
emailobject

There are some duplicate rows in the DataFrame based on the email column.

Write a solution to remove these duplicate rows and keep only the first occurrence.

Return the cleaned DataFrame in the same format.

Input & Output

Example 1 — Basic Duplicate Removal
$ Input: customers = [{"customer_id": 1, "name": "Ella", "email": "[email protected]"}, {"customer_id": 2, "name": "David", "email": "[email protected]"}, {"customer_id": 3, "name": "Zachary", "email": "[email protected]"}, {"customer_id": 4, "name": "Alice", "email": "[email protected]"}]
Output: [{"customer_id": 1, "name": "Ella", "email": "[email protected]"}, {"customer_id": 2, "name": "David", "email": "[email protected]"}, {"customer_id": 3, "name": "Zachary", "email": "[email protected]"}]
💡 Note: Row with customer_id=4 (Alice) is removed because [email protected] already exists in row with customer_id=1 (Ella). We keep the first occurrence.
Example 2 — Multiple Duplicates
$ Input: customers = [{"customer_id": 1, "name": "John", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}, {"customer_id": 3, "name": "Johnny", "email": "[email protected]"}, {"customer_id": 4, "name": "Robert", "email": "[email protected]"}]
Output: [{"customer_id": 1, "name": "John", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}]
💡 Note: Both [email protected] and [email protected] have duplicates. We keep only the first occurrence of each email: John (ID=1) and Bob (ID=2).
Example 3 — No Duplicates
$ Input: customers = [{"customer_id": 1, "name": "Alice", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}]
Output: [{"customer_id": 1, "name": "Alice", "email": "[email protected]"}, {"customer_id": 2, "name": "Bob", "email": "[email protected]"}]
💡 Note: All emails are unique, so no rows are removed. The DataFrame remains unchanged.

Constraints

  • 1 ≤ customers.length ≤ 104
  • customer_id, name, and email are non-empty
  • All customer_id values are unique

Visualization

Tap to expand
Drop Duplicate Rows: Email-Based FilteringInput DataFrame1, Ella, [email protected]2, David, [email protected]3, Zachary, [email protected]4, Alice, [email protected]↑ Duplicate emailProcessingTrack seen emails:[email protected][email protected][email protected][email protected] (skip)Output DataFrame1, Ella, [email protected]2, David, [email protected]3, Zachary, [email protected]↑ Alice row removedFirst occurrence of each email is keptTime: O(n) | Space: O(n)Result: Clean DataFrame with unique emails
Understanding the Visualization
1
Input DataFrame
Customer data with potential duplicate emails
2
Process
Identify and remove duplicate emails, keeping first occurrence
3
Output
Clean DataFrame with unique emails only
Key Takeaway
🎯 Key Insight: Use hash set to track seen emails for efficient O(1) duplicate detection
Asked in
Netflix 25 Spotify 20 Uber 15
22.3K Views
High Frequency
~10 min Avg. Time
890 Likes
Ln 1, Col 1
Smart Actions
💡 Explanation
AI Ready
💡 Suggestion Tab to accept Esc to dismiss
// Output will appear here after running code
Code Editor Closed
Click the red button to reopen