UnicodeDecodeError: 'utf-8' codec can't decode byte (Fix)

If you see a Python error like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte ...

Python is trying to read data as UTF-8 text, but the bytes do not match valid UTF-8.

This usually happens when:

  • you open a text file with the wrong encoding
  • a CSV file came from Excel or another Windows program
  • you decode bytes from a file, download, or API using the wrong encoding
  • the file is actually binary, not text

In most cases, the fix is to open the file with the correct encoding instead of relying on the default.

Quick fix

with open('data.txt', 'r', encoding='latin-1') as file:
    text = file.read()

print(text)

Use the correct file encoding instead of assuming UTF-8. Note that latin-1 maps every possible byte to a character, so it never raises this error, but if it is the wrong encoding the text will be silently garbled. If you do not know the encoding, first check where the file came from.

What this error means

This error means:

  • Python is trying to turn bytes into text using UTF-8
  • at least one byte is not valid in UTF-8
  • the file or data was probably saved with a different encoding
  • common encodings include:
    • utf-8
    • latin-1
    • cp1252
    • utf-16

A simple way to think about it:

  • bytes are raw data
  • encoding tells Python how to interpret those bytes as text

If Python uses the wrong encoding, it cannot read the text correctly.
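The two-step view above can be demonstrated directly: the same raw bytes produce different results depending on the encoding used to interpret them. A minimal sketch (the byte values are made-up examples):

```python
# The same raw bytes, interpreted with different encodings.
raw = b'caf\xe9'  # 0xE9 is 'é' in latin-1 and cp1252, but invalid on its own in UTF-8

print(raw.decode('latin-1'))   # café
print(raw.decode('cp1252'))    # café

try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0xe9 ...
```

Decoding only fails when the encoding's rules reject a byte; the bytes themselves are never "broken" on their own.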

When this error usually happens

You will often see this error when:

  • reading a text file with open() without the right encoding
  • loading CSV or text data created on another system
  • decoding raw bytes from a network response or binary source
  • reading Windows-created files that use cp1252 instead of UTF-8

If you are new to file reading, see how to read a file in Python and Python file handling basics.

Example that causes the error

Here is a typical example.

with open('data.txt', 'r', encoding='utf-8') as file:
    text = file.read()

print(text)

This code works only if data.txt is really saved as UTF-8.

If the file actually uses another encoding such as cp1252, Python may raise:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 10: invalid start byte

What happened:

  • the file was opened in text mode
  • Python tried to decode it as UTF-8
  • one or more bytes were invalid for UTF-8
  • reading failed

Main ways to fix it

Open the file with the correct encoding

If you know the file encoding, pass it to open().

with open('data.txt', 'r', encoding='cp1252') as file:
    text = file.read()

print(text)

You can also try latin-1 if that matches the source:

with open('data.txt', 'r', encoding='latin-1') as file:
    text = file.read()

print(text)

For a beginner-friendly explanation of open(), see Python open() function explained.

If the file has a UTF-8 BOM, try utf-8-sig

Some files are UTF-8 but include a special marker at the start called a BOM.

with open('data.txt', 'r', encoding='utf-8-sig') as file:
    text = file.read()

print(text)

This is often useful for text files saved by certain editors or export tools.

Use errors='ignore' or errors='replace' only if needed

If you must keep reading even when some characters are bad, you can tell Python how to handle decode errors.

Ignore bad characters:

with open('data.txt', 'r', encoding='utf-8', errors='ignore') as file:
    text = file.read()

print(text)

Replace bad characters:

with open('data.txt', 'r', encoding='utf-8', errors='replace') as file:
    text = file.read()

print(text)

Use this carefully:

  • ignore silently removes characters
  • replace changes them to replacement symbols
  • both can hide the real problem

If the text matters, finding the correct encoding is safer.
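The two handlers are easiest to compare on the same bad bytes:

```python
# What each error handler does to an invalid byte.
raw = b'caf\xe9'  # not valid UTF-8

print(raw.decode('utf-8', errors='ignore'))   # 'caf'  - the bad byte is dropped
print(raw.decode('utf-8', errors='replace'))  # 'caf' + U+FFFD replacement character
```

In both cases the 'é' the author intended is gone, which is why finding the real encoding is the better fix.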

Open binary files in binary mode

If the file is not really text, do not open it in text mode.

Wrong:

with open('image.jpg', 'r', encoding='utf-8') as file:
    data = file.read()

Correct:

with open('image.jpg', 'rb') as file:
    data = file.read()

print(type(data))

Output:

<class 'bytes'>
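One quick way to tell binary from text is to look at the first few bytes: many binary formats begin with a fixed signature. A small sketch (the signatures shown are the standard ones for these formats; the helper name is ours):

```python
# Common file signatures ("magic bytes") that mark binary formats.
SIGNATURES = {
    b'\xff\xd8\xff': 'JPEG',
    b'\x89PNG':      'PNG',
    b'%PDF':         'PDF',
    b'PK\x03\x04':   'ZIP (also .xlsx and .docx)',
}

def looks_binary(first_bytes):
    """Return the format name if the bytes start with a known binary signature."""
    for magic, name in SIGNATURES.items():
        if first_bytes.startswith(magic):
            return name
    return None

print(looks_binary(b'\xff\xd8\xff\xe0'))  # JPEG
print(looks_binary(b'hello world'))       # None
```

If a "text file" turns out to start with one of these signatures, no encoding will fix it; open it in binary mode or with the right library instead.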

How to find the right encoding

The best way is to check where the file came from.

Useful clues:

  • files from modern tools and APIs are often UTF-8
  • text files from Windows programs often use cp1252
  • some exported files use utf-8-sig
  • older data may use latin-1

Try these steps:

  1. Check how the file was created or exported.
  2. Look at the program that produced it.
  3. Test common encodings used by that source.
  4. If it is a CSV file, check the export settings.

CSV files are a common cause of this issue. If that is your case, see how to read a CSV file in Python.

Debugging steps

These quick tests can help you find the problem.

1. Print the raw bytes

Open the file in binary mode and inspect the start of the file:

with open('data.txt', 'rb') as f:
    print(f.read(40))

This helps you confirm that:

  • the file exists
  • the file contains bytes
  • the content may not be plain UTF-8 text

2. Try utf-8-sig

with open('data.txt', 'r', encoding='utf-8-sig') as f:
    print(f.read())

3. Try cp1252

with open('data.txt', 'r', encoding='cp1252') as f:
    print(f.read())

4. Try latin-1

with open('data.txt', 'r', encoding='latin-1') as f:
    print(f.read())

Also make sure you are opening the correct file path. If Python cannot find the file at all, see FileNotFoundError: No such file or directory.

What not to do

Common mistakes:

  • Do not use errors='ignore' as your first fix if the text matters.
  • Do not guess random encodings without checking the data source.
  • Do not open binary files like images, PDFs, or Excel files in text mode.
  • Do not assume every CSV file is UTF-8.

Common causes

Here are the most common reasons for this error:

  • The file was saved in cp1252 or latin-1, not UTF-8.
  • The file contains mixed or damaged text encoding.
  • A binary file was opened as if it were a text file.
  • Bytes from an API or download were decoded with the wrong encoding.
  • A CSV file came from Excel or another tool using a non-UTF-8 encoding.

FAQ

Why does Python try UTF-8 by default?

Strictly, open() defaults to your system's preferred locale encoding (see locale.getpreferredencoding()), and on most modern systems that is UTF-8. UTF-8 is also the standard encoding for the web and most modern tools, so it is the natural first guess.

Should I use errors='ignore' to fix this?

Only if losing some characters is acceptable. It hides the problem and can remove important text.

What is the difference between latin-1 and cp1252?

They are similar single-byte encodings that agree on most byte values, but cp1252 assigns printable characters (curly quotes, dashes, the euro sign) to the 0x80-0x9F range where latin-1 has invisible control codes. Because latin-1 defines every byte, it never raises a decode error, even when it is the wrong choice.
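The difference shows up directly with byte 0x92, the Windows curly apostrophe:

```python
# Byte 0x92: printable in cp1252, an invisible control code in latin-1.
print(b'\x92'.decode('cp1252'))         # right single quotation mark (U+2019)
print(repr(b'\x92'.decode('latin-1')))  # '\x92' - a C1 control character
```

If decoded text from a Windows file shows odd blanks or control characters where apostrophes should be, you probably used latin-1 where cp1252 was needed.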

Why does this happen more with CSV files from Excel?

Some CSV exports use encodings like cp1252 instead of UTF-8, especially on Windows.
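For CSV files, the fix is the same encoding argument, just passed to the open() call you hand to the csv module. A self-contained sketch that writes a cp1252 file the way a Windows export might and reads it back ('export.csv' is just an example name):

```python
import csv

# Write a small CSV in cp1252, including a curly apostrophe,
# then read it back with the matching encoding.
with open('export.csv', 'w', encoding='cp1252', newline='') as f:
    csv.writer(f).writerow(['name', 'it\u2019s'])

with open('export.csv', 'r', encoding='cp1252', newline='') as f:
    rows = list(csv.reader(f))

print(rows)
```

Reading the same file with encoding='utf-8' would raise the UnicodeDecodeError this article is about.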

See also