UTF-8 Validation - Problem
Given an integer array data representing the data, return whether it is a valid UTF-8 encoding.
A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:
- For a 1-byte character, the first bit is 0, followed by its Unicode code.
- For an n-byte character, the first n bits are all 1s, the (n + 1)th bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.
UTF-8 Encoding Rules:
| Number of Bytes | UTF-8 Octet Sequence (binary) |
|---|---|
| 1 | 0xxxxxxx |
| 2 | 110xxxxx 10xxxxxx |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data.
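As a minimal sketch of the rules above, a single byte can be classified by its leading bits. The function name `classify_byte` and its return convention (0 for a continuation byte, -1 for a byte that matches no row of the table) are assumptions made for illustration; they are not part of the problem statement.

```python
def classify_byte(value: int) -> int:
    """Classify one input integer by its leading bits.

    Returns the total character length (1-4) if the byte is a start byte,
    0 if it is a continuation byte (10xxxxxx), and -1 if it matches
    none of the patterns in the table (11111xxx).
    """
    byte = value & 0xFF  # only the least significant 8 bits carry data
    if byte >> 7 == 0b0:
        return 1  # 0xxxxxxx
    if byte >> 6 == 0b10:
        return 0  # 10xxxxxx, continuation byte
    if byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if byte >> 3 == 0b11110:
        return 4  # 11110xxx
    return -1  # five or more leading 1s: not allowed by the rules


print(classify_byte(197))  # 2
print(classify_byte(130))  # 0
print(classify_byte(255))  # -1
```

Masking with `0xFF` first means an integer larger than 255 is treated the same as its low byte, which matches the note about the least significant 8 bits.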
Input & Output
Example 1 — Valid UTF-8 Sequence
Input: data = [197,130,1]
Output: true
💡 Note: 197 (11000101) starts a 2-byte character, 130 (10000010) is a valid continuation byte, and 1 (00000001) is a valid 1-byte character.
Example 2 — Invalid Continuation
Input: data = [235,140,4]
Output: false
💡 Note: 235 (11101011) starts a 3-byte character and 140 (10001100) is a valid first continuation byte, but 4 (00000100) does not begin with 10, so the second continuation byte is missing.
Example 3 — Single Bytes Only
Input: data = [1,2,3,4]
Output: true
💡 Note: Every byte matches the pattern 0xxxxxxx, so each one is a valid 1-byte UTF-8 character.
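The binary forms quoted in the example notes can be reproduced with a few lines of Python. This is only a convenience sketch; the helper name `to_bits` is hypothetical.

```python
def to_bits(value: int) -> str:
    """Return the low 8 bits of value as a zero-padded binary string."""
    return format(value & 0xFF, "08b")


for example in ([197, 130, 1], [235, 140, 4], [1, 2, 3, 4]):
    print(" ".join(to_bits(b) for b in example))

# 11000101 10000010 00000001
# 11101011 10001100 00000100
# 00000001 00000010 00000011 00000100
```

Reading the leading bits of each group against the table of octet sequences shows which bytes are start bytes and which are continuation bytes.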
Constraints
- 1 ≤ data.length ≤ 2 × 10^4
- 0 ≤ data[i] ≤ 255
Visualization
Understanding the Visualization
1. Input Data: Array of integers representing bytes
2. UTF-8 Patterns: Identify start bytes and continuation bytes
3. Validation: Check if sequence forms valid UTF-8 characters
Key Takeaway
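🎯 Key Insight: UTF-8 validation requires state tracking: after each start byte, the validator must remember how many continuation bytes are still expected and confirm that exactly that many follow before the next character begins.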
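One way to realize this state tracking is a single pass with a counter of continuation bytes still owed. The sketch below is one possible approach, not a reference solution: the names `valid_utf8` and `remaining` are illustrative, and it checks only the rules stated in this problem (it does not reject overlong encodings or other cases that stricter UTF-8 validators reject).

```python
from typing import List


def valid_utf8(data: List[int]) -> bool:
    """Check data against the 1- to 4-byte patterns described in the problem."""
    remaining = 0  # continuation bytes still expected for the current character
    for value in data:
        byte = value & 0xFF  # only the least significant 8 bits are used
        if remaining > 0:
            # Inside a multi-byte character: this byte must be 10xxxxxx.
            if byte >> 6 != 0b10:
                return False
            remaining -= 1
        elif byte >> 7 == 0b0:
            continue  # 0xxxxxxx: complete 1-byte character
        elif byte >> 5 == 0b110:
            remaining = 1  # 110xxxxx: one continuation byte follows
        elif byte >> 4 == 0b1110:
            remaining = 2  # 1110xxxx: two continuation bytes follow
        elif byte >> 3 == 0b11110:
            remaining = 3  # 11110xxx: three continuation bytes follow
        else:
            # Stray continuation byte or a byte with five or more leading 1s.
            return False
    # The input must not end in the middle of a character.
    return remaining == 0


print(valid_utf8([197, 130, 1]))  # True
print(valid_utf8([235, 140, 4]))  # False
print(valid_utf8([1, 2, 3, 4]))   # True
```

The counter is the only state carried between bytes, so the pass runs in time linear in the length of `data` and uses constant extra space.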