Python Data Cleaning Script Example

This beginner-friendly example shows one practical Python data cleaning script.

You will start with messy data stored as a list of dictionaries, then clean it by:

  • removing extra spaces
  • standardizing text
  • handling missing values
  • converting strings to numbers
  • building a new clean result

This is a useful pattern when you want to clean data before saving it, analyzing it, or using it in another part of your program.

Quick example

Use this when you want a simple example of cleaning list-of-dictionary data before saving or using it.

raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "", "city": " london "},
    {"name": "  Cara", "age": "31", "city": "PARIS"}
]

cleaned_data = []

for row in raw_data:
    name = row["name"].strip()
    age_text = row["age"].strip()
    city = row["city"].strip().title()

    if not name:
        continue

    age = int(age_text) if age_text else None

    cleaned_data.append({
        "name": name,
        "age": age,
        "city": city
    })

print(cleaned_data)

What this example does

This script shows a simple data cleaning workflow:

  • Uses a small list of dictionaries as sample messy data
  • Cleans text values with strip()
  • Standardizes city names with title()
  • Converts age from string to int when possible
  • Keeps missing age as None
  • Builds a new cleaned list instead of changing the original

The messy data we start with

Here is the raw data:

raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "", "city": " london "},
    {"name": "  Cara", "age": "31", "city": "PARIS"}
]

print(raw_data)

This data has several common problems:

  • Extra spaces around values
  • Mixed uppercase and lowercase text
  • Missing values stored as empty strings
  • Numbers stored as text instead of real numbers

Real files often contain this kind of messy data, especially when the data comes from user input or CSV files.

Step 1: Remove extra whitespace

Use strip() on string values to remove spaces at the beginning and end.

row = {"name": " Alice ", "age": "25", "city": " london "}

name = row["name"].strip()
age_text = row["age"].strip()
city_text = row["city"].strip()

print(name)
print(age_text)
print(city_text)

Expected output:

Alice
25
london

Leading and trailing spaces can cause matching problems. For example, "Alice" and " Alice " look similar, but Python treats them as different strings.

If you want a focused guide, see how to remove whitespace from a string in Python.

Step 2: Standardize text values

In this example, we use title() for city names.

city = "PARIS".strip().title()
print(city)

Output:

Paris

Standard formatting makes data easier to compare.

For example:

  • "paris"
  • "PARIS"
  • " Paris "

can all become:

  • "Paris"

This simple script uses title() because it is easy to understand. In real projects, the best format depends on the data and the rules you want to enforce.

Step 3: Handle missing values

Before converting the age, check whether the string is empty.

age_text = ""

age = int(age_text) if age_text else None
print(age)

Output:

None

Using None makes it clear that the value is missing.

That is usually better than keeping an empty string for numeric data, because:

  • empty strings are still strings
  • numbers should be stored as numbers
  • missing values should be handled intentionally

Step 4: Convert data types

The age value starts as text:

age_text = "25"
print(type(age_text))

Output:

<class 'str'>

You can convert it to an integer with int():

age = int(age_text)
print(age)
print(type(age))

Output:

25
<class 'int'>

This matters because numbers should be stored as numbers when you want to:

  • compare values
  • sort correctly
  • do math

If the text is not numeric, int() will fail with a ValueError. If you need help with that, see ValueError: invalid literal for int() with base 10 and how to convert string to int in Python.

Step 5: Create the cleaned result

Now put the steps together into one script.

raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "", "city": " london "},
    {"name": "  Cara", "age": "31", "city": "PARIS"}
]

cleaned_data = []

for row in raw_data:
    name = row["name"].strip()
    age_text = row["age"].strip()
    city = row["city"].strip().title()

    if not name:
        continue

    age = int(age_text) if age_text else None

    cleaned_data.append({
        "name": name,
        "age": age,
        "city": city
    })

print(cleaned_data)

What this script does:

  • Loops through each row
  • Cleans each field
  • Skips rows where name is empty
  • Converts age when possible
  • Appends a new cleaned dictionary to cleaned_data

Using a new list is safer for beginners because you keep the original data unchanged.

Expected output

Running the full script prints:

[{'name': 'Alice', 'age': 25, 'city': 'New York'}, {'name': 'Bob', 'age': None, 'city': 'London'}, {'name': 'Cara', 'age': 31, 'city': 'Paris'}]

Notice what changed:

  • Extra spaces were removed
  • City names were standardized
  • Age values were converted from strings to integers
  • Missing age became None

Useful improvements for real projects

This example is intentionally simple. In real projects, you may also want to:

  • Skip rows with missing required fields
  • Use try-except when converting numbers
  • Read messy data from CSV files
  • Write cleaned data back to a file
  • Move cleaning steps into a function

For example, if your age values might contain invalid text, you can make the conversion safer:

raw_data = [
    {"name": "Alice", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "unknown", "city": "london"}
]

cleaned_data = []

for row in raw_data:
    name = row["name"].strip()
    age_text = row["age"].strip()
    city = row["city"].strip().title()

    try:
        age = int(age_text) if age_text else None
    except ValueError:
        age = None

    cleaned_data.append({
        "name": name,
        "age": age,
        "city": city
    })

print(cleaned_data)

If your data comes from a file, the next practical step is to read a CSV file in Python, clean each row, and then write a CSV file in Python.

Common mistakes

These are common causes of messy data problems:

  • Extra spaces in user input or file data
  • Numbers stored as strings
  • Empty strings used instead of missing values
  • Mixed uppercase and lowercase text
  • Assuming every row has valid data

If your script is not working, these quick debug prints can help:

print(raw_data)
print(row)
print(type(row['age']))
print(repr(row['name']))
print(cleaned_data)

Why these help:

  • print(raw_data) shows the full original data
  • print(row) shows the current row being processed
  • print(type(row['age'])) confirms whether age is a string or number
  • print(repr(row['name'])) makes hidden spaces visible
  • print(cleaned_data) shows the result after cleaning

FAQ

Why use a new cleaned list instead of changing the old one?

It is safer for beginners. You keep the original data and can compare before and after.

Why is None used for missing values?

None clearly means no value. It is better than keeping an empty string for numeric data.

What if int() fails during cleaning?

Use a check first or wrap the conversion in try-except if the input may contain invalid numbers.

Can I use this script with CSV data?

Yes. The same cleaning steps work after reading rows from a CSV file into dictionaries.

See also