UnicodeDecodeError: `'utf-8' codec can't decode byte` (Fix)

If you see a Python error like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte ...

Python is trying to read data as UTF-8 text, but the bytes do not match valid UTF-8.

This usually happens when:

- you open a text file with the wrong encoding
- a CSV file came from Excel or another Windows program
- you decode bytes from a file, download, or API using the wrong encoding
- the file is actually binary, not text

In most cases, the fix is to open the file with the correct encoding instead of relying on the default.

Quick fix #

with open('data.txt', 'r', encoding='latin-1') as file:
    text = file.read()

print(text)

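Replace 'latin-1' with the encoding the file was actually saved in. latin-1 is shown here because it maps every possible byte to a character and so never raises, but for the same reason it can silently produce wrong characters when the file uses a different encoding. If you do not know the encoding, first check where the file came from.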

What this error means #

This error means:

- Python is trying to turn bytes into text using UTF-8
- at least one byte is not valid in UTF-8
- the file or data was probably saved with a different encoding

Common encodings include:

- utf-8
- latin-1
- cp1252
- utf-16

A simple way to think about it:

- bytes are raw data
- encoding tells Python how to interpret those bytes as text

If Python uses the wrong encoding, it cannot read the text correctly.
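To make the bytes-versus-text idea concrete, here is a small sketch on in-memory data. The sample word and variable names are illustrative only and do not come from any real file.

```python
# The same text encoded two different ways produces different bytes.
text = "café"
utf8_bytes = text.encode("utf-8")      # é becomes two bytes: c3 a9
latin1_bytes = text.encode("latin-1")  # é becomes one byte:  e9

print(utf8_bytes)
print(latin1_bytes)

# Decoding latin-1 bytes as UTF-8 fails: 0xe9 announces a multi-byte
# sequence, but the continuation bytes it promises never arrive.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(exc)

# Decoding with the encoding that was used to write the bytes works.
round_trip = latin1_bytes.decode("latin-1")
```

The bytes never change; only the rule used to interpret them does.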

When this error usually happens #

You will often see this error when:

- reading a text file with open() without the right encoding
- loading CSV or text data created on another system
- decoding raw bytes from a network response or binary source
- reading Windows-created files that use cp1252 instead of UTF-8

If you are new to file reading, see how to read a file in Python and Python file handling basics.

Example that causes the error #

Here is a typical example.

with open('data.txt', 'r', encoding='utf-8') as file:
    text = file.read()

print(text)

This code works only if data.txt is really saved as UTF-8.

If the file actually uses another encoding such as cp1252, Python may raise:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 10: invalid start byte

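Read the message piece by piece: the codec in use was utf-8, the offending byte is 0x92, and it sits at position 10 in the data being decoded. "invalid start byte" means that byte cannot begin a UTF-8 character. In cp1252, 0x92 is the curly apostrophe (’), so this particular byte is a strong hint that the file came from a Windows program.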

What happened:

- the file was opened in text mode
- Python tried to decode it as UTF-8
- one or more bytes were invalid for UTF-8
- reading failed
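The details in the error message are also available as attributes on the exception, which can help when you want to look at the bad bytes directly. The sketch below uses a made-up byte string in place of a real file.

```python
# Bytes as a cp1252 program might write them (illustrative sample).
raw = b"The file\x92s contents"

try:
    raw.decode("utf-8")
    details = None
except UnicodeDecodeError as exc:
    # Standard attributes of UnicodeDecodeError.
    details = {
        "encoding": exc.encoding,                     # codec that failed
        "position": exc.start,                        # offset of the first bad byte
        "bad_bytes": exc.object[exc.start:exc.end],   # the bytes it rejected
        "reason": exc.reason,                         # short explanation
    }

print(details)

# Decoding just the rejected byte with a candidate encoding shows
# whether that encoding gives a sensible character.
candidate_char = details["bad_bytes"].decode("cp1252")
print(repr(candidate_char))
```

If the candidate character fits the surrounding text (here, an apostrophe in "file’s"), that encoding is a plausible match.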

Main ways to fix it #

Open the file with the correct encoding #

If you know the file encoding, pass it to open().

with open('data.txt', 'r', encoding='cp1252') as file:
    text = file.read()

print(text)

You can also try latin-1 if that matches the source:

with open('data.txt', 'r', encoding='latin-1') as file:
    text = file.read()

print(text)

For a beginner-friendly explanation of open(), see Python open() function explained.

If the file has a UTF-8 BOM, try `utf-8-sig` #

Some files are UTF-8 but include a special marker at the start called a BOM.

with open('data.txt', 'r', encoding='utf-8-sig') as file:
    text = file.read()

print(text)

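This is often useful for text files saved by certain editors or export tools. Note that a BOM on its own does not raise UnicodeDecodeError: plain utf-8 decodes it as an invisible U+FEFF character at the start of the text, and utf-8-sig simply removes it. If the error persists with utf-8-sig, the file is not UTF-8.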
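To see what the BOM does in practice, here is a small sketch on in-memory bytes. The sample content is invented for illustration.

```python
import codecs

# A UTF-8 file with a BOM starts with the three bytes EF BB BF.
raw = codecs.BOM_UTF8 + "name,city".encode("utf-8")

with_bom = raw.decode("utf-8")         # BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # BOM is stripped

print(repr(with_bom))
print(repr(without_bom))

# utf-8-sig is also safe on data that has no BOM at all.
no_bom_input = "name,city".encode("utf-8").decode("utf-8-sig")
```

A leftover U+FEFF typically shows up as a strange first column name or a failed string comparison rather than as an exception.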

Use `errors='ignore'` or `errors='replace'` only if needed #

If you must keep reading even when some characters are bad, you can tell Python how to handle decode errors.

Ignore bad characters:

with open('data.txt', 'r', encoding='utf-8', errors='ignore') as file:
    text = file.read()

print(text)

Replace bad characters:

with open('data.txt', 'r', encoding='utf-8', errors='replace') as file:
    text = file.read()

print(text)

Use this carefully:

ignore silently removes characters
replace changes them to replacement symbols
both can hide the real problem

If the text matters, finding the correct encoding is safer.
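The difference between the two error handlers, and what the correct encoding would have produced, can be compared side by side. This sketch assumes the sample bytes were written as cp1252.

```python
# "it’s fine" as written by a cp1252 program (assumed for this example).
raw = b"it\x92s fine"

ignored = raw.decode("utf-8", errors="ignore")    # bad byte dropped
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
correct = raw.decode("cp1252")                    # apostrophe preserved

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

Only the third result keeps the original text. The first two run without error but lose the apostrophe.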

Open binary files in binary mode #

If the file is not really text, do not open it in text mode.

Wrong:

with open('image.jpg', 'r', encoding='utf-8') as file:
    data = file.read()

Correct:

with open('image.jpg', 'rb') as file:
    data = file.read()

print(type(data))

Output:

<class 'bytes'>
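If you are not sure whether a file is text at all, the first few bytes often give it away. The sketch below checks a handful of well-known file signatures. The table and the function name `binary_kind` are illustrative and far from complete.

```python
# A few well-known file signatures ("magic numbers"). Partial, illustrative table.
SIGNATURES = (
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP container (also .xlsx and .docx)"),
)


def binary_kind(head):
    """Return a label if the leading bytes match a known signature, else None."""
    for signature, label in SIGNATURES:
        if head.startswith(signature):
            return label
    return None


print(binary_kind(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(binary_kind(b"name,city\n"))
```

In real use you would pass in the result of reading the first bytes of the file in 'rb' mode. A match means the file should stay in binary mode or be handed to a library that understands that format.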

How to find the right encoding #

The best way is to check where the file came from.

Useful clues:

- files from modern tools and APIs are often UTF-8
- text files from Windows programs often use cp1252
- some exported files use utf-8-sig
- older data may use latin-1

Try these steps:

- Check how the file was created or exported.
- Look at the program that produced it.
- Test common encodings used by that source.
- If it is a CSV file, check the export settings.
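Testing a short list of likely encodings can be sketched as a small helper. The function name `read_with_candidates` and the default candidate list are assumptions for illustration, and the candidates should come from what you know about the source. latin-1 is deliberately left out because it accepts every byte and would hide a wrong guess.

```python
import tempfile
from pathlib import Path


def read_with_candidates(path, candidates=("utf-8-sig", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return Path(path).read_text(encoding=encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of {candidates} could decode {path}")


# Demo on a temporary file written as cp1252 (stand-in for a real export).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "export.txt"
    sample.write_bytes("Price: 5€".encode("cp1252"))
    text, used = read_with_candidates(sample)

print(used, repr(text))
```

Strict encodings go first so that a failure is informative. A clean decode is still only evidence, so check that the resulting text looks right.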

CSV files are a common cause of this issue. If that is your case, see how to read a CSV file in Python.

Debugging steps #

These quick tests can help you find the problem.

1. Print the raw bytes #

Open the file in binary mode and inspect the start of the file:

with open('data.txt', 'rb') as f:
    print(f.read(40))

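This helps you check:

- whether the file starts with a BOM (b'\xef\xbb\xbf' for UTF-8, b'\xff\xfe' or b'\xfe\xff' for UTF-16)
- whether it is binary rather than text (for example b'%PDF' for a PDF)
- which non-ASCII bytes appear, such as \x92 or \xe9, which suggest cp1252 or latin-1 rather than UTF-8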
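Checking the first bytes for a BOM can also be done in code. This is a minimal sketch using the BOM constants from the standard `codecs` module. The function name `encoding_from_bom` is hypothetical, and a missing BOM tells you nothing, since most UTF-8, cp1252, and latin-1 files have none.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def encoding_from_bom(head):
    """Return an encoding name suggested by a leading BOM, or None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


print(encoding_from_bom("hi".encode("utf-16")))      # encoded with a BOM
print(encoding_from_bom(codecs.BOM_UTF8 + b"hi"))
print(encoding_from_bom(b"plain ascii"))
```

The returned names are ones that consume the BOM when decoding, so they can be passed straight to open() as the encoding.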

2. Try `utf-8-sig` #

with open('data.txt', 'r', encoding='utf-8-sig') as f:
    print(f.read())

3. Try `cp1252` #

with open('data.txt', 'r', encoding='cp1252') as f:
    print(f.read())

4. Try `latin-1` #

with open('data.txt', 'r', encoding='latin-1') as f:
    print(f.read())

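Step 4 always "works" because latin-1 accepts every byte, so judge it by whether the text looks right, not by the absence of an error.

Also make sure you are opening the correct file path. If Python cannot find the file at all, see FileNotFoundError: No such file or directory.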

What not to do #

Common mistakes:

- Do not use errors='ignore' as your first fix if the text matters.
- Do not guess random encodings without checking the data source.
- Do not open binary files like images, PDFs, or Excel files in text mode.
- Do not assume every CSV file is UTF-8.

Common causes #

Here are the most common reasons for this error:

- The file was saved in cp1252 or latin-1, not UTF-8.
- The file contains mixed or damaged text encoding.
- A binary file was opened as if it were a text file.
- Bytes from an API or download were decoded with the wrong encoding.
- A CSV file came from Excel or another tool using a non-UTF-8 encoding.
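For bytes that come from an HTTP API or download, the server often states the encoding in the Content-Type header. The sketch below reads the charset from such a header using the standard library's `email.message.Message`. The helper names and the sample header values are made up for illustration, and falling back to UTF-8 is an assumption, not a rule.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    if not header_value:
        return default
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)


def decode_body(raw, content_type):
    """Decode response bytes with the charset the header declares."""
    return raw.decode(charset_from_content_type(content_type))


print(charset_from_content_type("text/csv; charset=windows-1252"))
print(charset_from_content_type("application/json"))
print(decode_body(b"caf\xe9", "text/plain; charset=ISO-8859-1"))
```

A declared charset can still be wrong, and an unknown name raises LookupError, so real code may want to handle both cases.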

FAQ #

Why does Python try UTF-8 by default? #

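When you call open() without encoding=, Python uses the locale's preferred encoding. That is UTF-8 on most Linux and macOS systems, but often cp1252 on Windows unless UTF-8 mode is enabled (python -X utf8 or PYTHONUTF8=1). Many libraries and tools also default to UTF-8 because it is the most widely used text encoding.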
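You can check what your own interpreter would use. This short sketch relies only on the standard `locale` and `sys` modules, and the printed values depend on your system, so no output is shown.

```python
import locale
import sys

# Encoding used for text files when open() is called without encoding=.
default_encoding = locale.getpreferredencoding(False)

# True when Python's UTF-8 mode is active (python -X utf8 or PYTHONUTF8=1).
utf8_mode = bool(sys.flags.utf8_mode)

print("default text encoding:", default_encoding)
print("UTF-8 mode enabled:", utf8_mode)
```

Passing encoding= explicitly makes the code behave the same on every machine, regardless of these settings.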

Should I use `errors='ignore'` to fix this? #

Only if losing some characters is acceptable. It hides the problem and can remove important text.

What is the difference between `latin-1` and `cp1252`? #

They are similar single-byte encodings, but cp1252 supports some extra printable characters where latin-1 has control codes.
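The difference is easy to see by decoding the same byte both ways. The sketch also lists the byte values that Python's cp1252 codec rejects, which is why cp1252 can raise where latin-1 never does.

```python
byte = b"\x92"

as_cp1252 = byte.decode("cp1252")    # right single quotation mark, U+2019
as_latin1 = byte.decode("latin-1")   # invisible control character, U+0092

print(repr(as_cp1252))
print(repr(as_latin1))

# latin-1 accepts all 256 byte values; cp1252 leaves a few undefined.
undefined_in_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)

print([hex(v) for v in undefined_in_cp1252])
```

If cp1252 raises on one of those bytes, the file is probably not cp1252 text either.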

Why does this happen more with CSV files from Excel? #

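Some CSV exports use encodings like cp1252 instead of UTF-8, especially on Windows. Excel's plain "CSV (Comma delimited)" format typically writes the Windows ANSI code page, which is cp1252 on most Western-language systems. Its separate "CSV UTF-8" format writes UTF-8 with a BOM, which utf-8-sig reads cleanly.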
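Reading such an export with the standard `csv` module might look like the sketch below. The helper name `read_csv_rows`, the default of cp1252, and the sample data are assumptions for illustration. Use whatever encoding your export settings actually show.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_rows(path, encoding="cp1252"):
    """Read all rows of a CSV file that was saved in the given encoding."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f))


# Demo on a temporary file written the way a cp1252 export would be.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "export.csv"
    sample.write_bytes("name,price\r\nZoë,5€\r\n".encode("cp1252"))
    rows = read_csv_rows(sample)

print(rows)
```

For a "CSV UTF-8" export, passing encoding="utf-8-sig" instead keeps the BOM out of the first column name.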
