UnicodeDecodeError: 'utf-8' codec can't decode byte (Fix)
If you see a Python error like this:
UnicodeDecodeError: 'utf-8' codec can't decode byte ...
Python is trying to read data as UTF-8 text, but the bytes do not match valid UTF-8.
This usually happens when:
- you open a text file with the wrong encoding
- a CSV file came from Excel or another Windows program
- you decode bytes from a file, download, or API using the wrong encoding
- the file is actually binary, not text
In most cases, the fix is to open the file with the correct encoding instead of relying on the default.
Quick fix #
with open('data.txt', 'r', encoding='latin-1') as file:
    text = file.read()
print(text)
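Replace 'latin-1' with the file's actual encoding. latin-1 never raises this error because every byte maps to a character, but it shows the wrong characters if the file really uses a different encoding. If you do not know the encoding, first check where the file came from.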
What this error means #
This error means:
- Python is trying to turn bytes into text using UTF-8
- at least one byte is not valid in UTF-8
- the file or data was probably saved with a different encoding
- common encodings include utf-8, latin-1, cp1252, and utf-16
A simple way to think about it:
- bytes are raw data
- encoding tells Python how to interpret those bytes as text
If Python uses the wrong encoding, it cannot read the text correctly.
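To make the bytes-versus-encoding idea concrete, here is a small sketch that decodes the same bytes three ways. The sample word is only an illustration; any text with a non-ASCII character behaves similarly.

```python
# The same bytes give different results under different encodings.
raw = "café".encode("cp1252")  # the é is stored as the single byte 0xE9
print(raw)                     # bytes are shown with \x escapes

as_cp1252 = raw.decode("cp1252")   # the encoding the bytes were written with
as_latin1 = raw.decode("latin-1")  # 0xE9 is also é in latin-1

try:
    raw.decode("utf-8")            # 0xE9 alone is not a complete UTF-8 sequence
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print(exc)
```

Here cp1252 and latin-1 happen to agree, which is why guessing between them can appear to work until a byte shows up that they interpret differently.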
When this error usually happens #
You will often see this error when:
- reading a text file with open() without the right encoding
- loading CSV or text data created on another system
- decoding raw bytes from a network response or binary source
- reading Windows-created files that use cp1252 instead of UTF-8
If you are new to file reading, see how to read a file in Python and Python file handling basics.
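The same error appears when you decode bytes yourself rather than through open(). The sketch below uses a hypothetical helper, decode_payload, and a hand-made payload standing in for bytes from a download or API; neither comes from a real library.

```python
def decode_payload(raw: bytes, charset: str = "utf-8") -> str:
    """Hypothetical helper: decode bytes with the charset the source declares."""
    return raw.decode(charset)

# Stand-in for bytes received from a download or API response.
payload = "Price: 10 €".encode("cp1252")  # the euro sign is byte 0x80 in cp1252

text = decode_payload(payload, "cp1252")  # works: matches how the bytes were made

try:
    decode_payload(payload)               # default utf-8 does not match
    default_failed = False
except UnicodeDecodeError:
    default_failed = True
```

If the source states its encoding (for example in a response header or an export setting), passing that value is safer than relying on a default.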
Example that causes the error #
Here is a typical example.
with open('data.txt', 'r', encoding='utf-8') as file:
    text = file.read()
print(text)
This code works only if data.txt is really saved as UTF-8.
If the file actually uses another encoding such as cp1252, Python may raise:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 10: invalid start byte
What happened:
- the file was opened in text mode
- Python tried to decode it as UTF-8
- one or more bytes were invalid for UTF-8
- reading failed
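If you want to reproduce the failure safely, the following sketch writes a throwaway file containing byte 0x92 and inspects the exception. The file contents are invented for the demonstration; the attributes used (encoding, reason, object, start) are standard on UnicodeDecodeError.

```python
import os
import tempfile

# Write a small file containing byte 0x92 (a curly apostrophe in cp1252).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"Hello, it\x92s me")
    path = tmp.name

error = None
try:
    with open(path, "r", encoding="utf-8") as file:
        file.read()
except UnicodeDecodeError as exc:
    error = exc
    print(exc.encoding)                # codec that failed
    print(exc.reason)                  # why the byte was rejected
    print(hex(exc.object[exc.start]))  # the offending byte
finally:
    os.remove(path)                    # clean up the temporary file
```

The offending byte is often the best clue: values between 0x80 and 0x9F, such as 0x92, frequently point to cp1252.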
Main ways to fix it #
Open the file with the correct encoding #
If you know the file encoding, pass it to open().
with open('data.txt', 'r', encoding='cp1252') as file:
    text = file.read()
print(text)
You can also try latin-1 if that matches the source:
with open('data.txt', 'r', encoding='latin-1') as file:
    text = file.read()
print(text)
For a beginner-friendly explanation of open(), see Python open() function explained.
If the file has a UTF-8 BOM, try utf-8-sig #
Some files are UTF-8 but include a special marker at the start called a BOM (byte order mark), stored as the bytes EF BB BF.
with open('data.txt', 'r', encoding='utf-8-sig') as file:
    text = file.read()
print(text)
This is often useful for text files saved by certain editors or export tools.
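As a rough illustration of what utf-8-sig changes, the sketch below decodes the same BOM-prefixed bytes with both codecs. The sample header text is made up. Note that plain utf-8 normally does not raise on a UTF-8 BOM; it leaves an invisible '\ufeff' character at the start of the text, which can still break things such as the first CSV column name.

```python
import codecs

# Bytes as an editor might save them: a UTF-8 BOM followed by the text.
raw = codecs.BOM_UTF8 + "name,age".encode("utf-8")

plain = raw.decode("utf-8")         # keeps the BOM as the character '\ufeff'
stripped = raw.decode("utf-8-sig")  # drops the BOM if it is present

# ascii() shows the hidden character as an escape.
print(ascii(plain))
print(ascii(stripped))

# utf-8-sig also reads BOM-less UTF-8 without complaint.
no_bom = "name,age".encode("utf-8").decode("utf-8-sig")
```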
Use errors='ignore' or errors='replace' only if needed #
If you must keep reading even when some characters are bad, you can tell Python how to handle decode errors.
Ignore bad characters:
with open('data.txt', 'r', encoding='utf-8', errors='ignore') as file:
    text = file.read()
print(text)
Replace bad characters:
with open('data.txt', 'r', encoding='utf-8', errors='replace') as file:
    text = file.read()
print(text)
Use this carefully:
- ignore silently removes characters
- replace changes them to replacement symbols
- both can hide the real problem
If the text matters, finding the correct encoding is safer.
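To see what each option actually does to the text, this sketch decodes one invented byte string three ways. The errors argument works the same for bytes.decode() as it does for open().

```python
# cp1252 bytes with two curly apostrophes (byte 0x92).
raw = b"It\x92s 5 o\x92clock"

ignored = raw.decode("utf-8", errors="ignore")    # bad bytes vanish
replaced = raw.decode("utf-8", errors="replace")  # bad bytes become U+FFFD
correct = raw.decode("cp1252")                    # the right encoding keeps them

print(ascii(ignored))
print(ascii(replaced))
print(ascii(correct))
```

Only the last result preserves the apostrophes, which is the practical reason to look for the real encoding first.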
Open binary files in binary mode #
If the file is not really text, do not open it in text mode.
Wrong:
with open('image.jpg', 'r', encoding='utf-8') as file:
    data = file.read()
Correct:
with open('image.jpg', 'rb') as file:
    data = file.read()
print(type(data))
Output:
<class 'bytes'>
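If you are unsure whether a file is binary, one possible check is to compare its first bytes with well-known file signatures. The helper name describe_header and the short signature table below are assumptions made for this sketch; the list is far from complete, and a None result does not prove the file is text.

```python
# A few well-known file signatures ("magic bytes").
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive (also used by .xlsx and .docx)",
}

def describe_header(head: bytes):
    """Return a label if the bytes start with a known signature, else None."""
    for signature, label in SIGNATURES.items():
        if head.startswith(signature):
            return label
    return None

# Typical use: pass the first bytes read in binary mode, for example
#   with open('image.jpg', 'rb') as file:
#       print(describe_header(file.read(8)))
print(describe_header(b"\xff\xd8\xff\xe0\x00\x10JF"))
```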
How to find the right encoding #
The best way is to check where the file came from.
Useful clues:
- files from modern tools and APIs are often UTF-8
- text files from Windows programs often use cp1252
- some exported files use utf-8-sig
- older data may use latin-1
Try these steps:
- Check how the file was created or exported.
- Look at the program that produced it.
- Test common encodings used by that source.
- If it is a CSV file, check the export settings.
CSV files are a common cause of this issue. If that is your case, see how to read a CSV file in Python.
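For the CSV case, a minimal sketch with the standard csv module might look like this. The file contents are invented and cp1252 is assumed to be the export encoding; substitute whatever your export settings report.

```python
import csv
import os
import tempfile

# Create a stand-in for a CSV exported by a Windows program (cp1252 bytes).
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("name,city\r\nZoë,Málaga\r\n".encode("cp1252"))
    path = tmp.name

try:
    # newline='' is what the csv module documentation recommends for open().
    with open(path, "r", encoding="cp1252", newline="") as file:
        rows = list(csv.reader(file))
finally:
    os.remove(path)  # clean up the temporary file

print(len(rows))     # header row plus one data row
```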
Debugging steps #
These quick tests can help you find the problem.
1. Print the raw bytes #
Open the file in binary mode and inspect the start of the file:
with open('data.txt', 'rb') as f:
    print(f.read(40))
This helps you confirm that:
- the file exists
- the file contains bytes
- the content may not be plain UTF-8 text
2. Try utf-8-sig #
with open('data.txt', 'r', encoding='utf-8-sig') as f:
    print(f.read())
3. Try cp1252 #
with open('data.txt', 'r', encoding='cp1252') as f:
    print(f.read())
4. Try latin-1 #
with open('data.txt', 'r', encoding='latin-1') as f:
    print(f.read())
Also make sure you are opening the correct file path. If Python cannot find the file at all, see FileNotFoundError: No such file or directory.
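The encoding checks in this section can be combined into one loop. This is only a sketch: decode_with_fallback is a hypothetical helper, and the candidate order is an assumption that suits files from Windows tools. A decode that succeeds is not proof that the encoding is right, and latin-1 always succeeds, so treat a latin-1 result as a last resort and inspect the text.

```python
def decode_with_fallback(raw: bytes,
                         encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Try each candidate in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings could decode the data")

# Typical use: read the file in binary mode first, for example
#   with open('data.txt', 'rb') as f:
#       text, used = decode_with_fallback(f.read())
text, used = decode_with_fallback(b"It\x92s here")
print(used)
```

Strict UTF-8 goes first because it rejects most non-UTF-8 data, which makes a successful UTF-8 decode a strong signal.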
What not to do #
Common mistakes:
- Do not use errors='ignore' as your first fix if the text matters.
- Do not guess random encodings without checking the data source.
- Do not open binary files like images, PDFs, or Excel files in text mode.
- Do not assume every CSV file is UTF-8.
Common causes #
Here are the most common reasons for this error:
- The file was saved in cp1252 or latin-1, not UTF-8.
- The file contains mixed or damaged text encoding.
- A binary file was opened as if it were a text file.
- Bytes from an API or download were decoded with the wrong encoding.
- A CSV file came from Excel or another tool using a non-UTF-8 encoding.
FAQ #
Why does Python try UTF-8 by default? #
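UTF-8 is the most common text encoding, and it is the default in many Python setups and tools. When you call open() without an encoding argument, Python uses the platform's preferred encoding, which is UTF-8 on most Linux and macOS systems but is often cp1252 on Windows.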
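To check what your own setup uses when no encoding is passed, a quick sketch with the standard locale and sys modules:

```python
import locale
import sys

# Encoding that open() falls back to when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)

# 1 when Python's UTF-8 mode is enabled, 0 otherwise.
utf8_mode = sys.flags.utf8_mode

print(default_encoding)
print(utf8_mode)
```

Because this value differs between machines, passing encoding explicitly keeps a script's behavior the same everywhere.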
Should I use errors='ignore' to fix this? #
Only if losing some characters is acceptable. It hides the problem and can remove important text.
What is the difference between latin-1 and cp1252? #
They are similar single-byte encodings that differ only in the byte range 0x80 to 0x9F. There, cp1252 defines extra printable characters such as curly quotes and the euro sign, while latin-1 has control codes.
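A short sketch of that difference, using byte 0x92 from the error message shown earlier and byte 0x81, which Python's cp1252 codec leaves undefined:

```python
byte = b"\x92"

as_cp1252 = byte.decode("cp1252")   # right single quotation mark, U+2019
as_latin1 = byte.decode("latin-1")  # an invisible control character, U+0092

print(ascii(as_cp1252), ascii(as_latin1))

# latin-1 accepts every byte, but cp1252 has a few undefined ones.
try:
    b"\x81".decode("cp1252")
    cp1252_rejected = False
except UnicodeDecodeError:
    cp1252_rejected = True
```

So if apostrophes or quotes disappear after reading with latin-1, cp1252 is a likely candidate.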
Why does this happen more with CSV files from Excel? #
Some CSV exports use encodings like cp1252 instead of UTF-8, especially on Windows.