Python Data Cleaning Script Example

This beginner-friendly example shows one practical Python data cleaning script.

You will start with messy data stored as a list of dictionaries, then clean it by:

  • removing extra spaces
  • standardizing text
  • handling missing values
  • converting strings to numbers
  • building a new clean result

This is a useful pattern when you want to clean data before saving it, analyzing it, or using it in another part of your program.

Quick example #

Use this when you want a simple example of cleaning list-of-dictionary data before saving or using it.

raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "", "city": " london "},
    {"name": "  Cara", "age": "31", "city": "PARIS"}
]

cleaned_data = []

for row in raw_data:
    name = row["name"].strip()
    age_text = row["age"].strip()
    city = row["city"].strip().title()

    if not name:
        continue

    age = int(age_text) if age_text else None

    cleaned_data.append({
        "name": name,
        "age": age,
        "city": city
    })

print(cleaned_data)

What this example does #

This script shows a simple data cleaning workflow:

  • Uses a small list of dictionaries as sample messy data
  • Cleans text values with strip()
  • Standardizes city names with title()
  • Converts age from string to int when possible
  • Keeps missing age as None
  • Builds a new cleaned list instead of changing the original

The messy data we start with #

Here is the raw data:

raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "", "city": " london "},
    {"name": "  Cara", "age": "31", "city": "PARIS"}
]

print(raw_data)

This data has several common problems:

  • Extra spaces around values
  • Mixed uppercase and lowercase text
  • Missing values stored as empty strings
  • Numbers stored as text instead of real numbers

Real files often contain this kind of messy data, especially when the data comes from user input or CSV files.

Step 1: Remove extra whitespace #

Use strip() on string values to remove spaces at the beginning and end.

row = {"name": " Alice ", "age": "25", "city": " london "}

name = row["name"].strip()
age_text = row["age"].strip()
city_text = row["city"].strip()

print(name)
print(age_text)
print(city_text)

Expected output:

Alice
25
london

Leading and trailing spaces can cause matching problems. For example, "Alice" and " Alice " look similar, but Python treats them as different strings.

If you want a focused guide, see how to remove whitespace from a string in Python.

Step 2: Standardize text values #

In this example, we use title() for city names.

city = "PARIS".strip().title()
print(city)

Output:

Paris

Standard formatting makes data easier to compare.

For example:

  • "paris"
  • "PARIS"
  • " Paris "

can all become:

  • "Paris"

This simple script uses title() because it is easy to understand. In real projects, the best format depends on the data and the rules you want to enforce.

Step 3: Handle missing values #

Before converting the age, check whether the string is empty.

age_text = ""

age = int(age_text) if age_text else None
print(age)

Output:

None

Using None makes it clear that the value is missing.

That is usually better than keeping an empty string for numeric data, because:

  • empty strings are still strings
  • numbers should be stored as numbers
  • missing values should be handled intentionally

Step 4: Convert data types #

The age value starts as text:

age_text = "25"
print(type(age_text))

Output:

<class 'str'>

You can convert it to an integer with int():

age = int(age_text)
print(age)
print(type(age))

Output:

25
<class 'int'>

This matters because numbers should be stored as numbers when you want to:

  • compare values
  • sort correctly
  • do math

If the text is not numeric, int() will fail with a ValueError. If you need help with that, see ValueError: invalid literal for int() with base 10 and how to convert string to int in Python.

Step 5: Create the cleaned result #

Now put the steps together into one script.

raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "", "city": " london "},
    {"name": "  Cara", "age": "31", "city": "PARIS"}
]

cleaned_data = []

for row in raw_data:
    name = row["name"].strip()
    age_text = row["age"].strip()
    city = row["city"].strip().title()

    if not name:
        continue

    age = int(age_text) if age_text else None

    cleaned_data.append({
        "name": name,
        "age": age,
        "city": city
    })

print(cleaned_data)

What this script does:

  • Loops through each row
  • Cleans each field
  • Skips rows where name is empty
  • Converts age when possible
  • Appends a new cleaned dictionary to cleaned_data

Using a new list is safer for beginners because you keep the original data unchanged.

Expected output #

Running the full script prints:

[{'name': 'Alice', 'age': 25, 'city': 'New York'}, {'name': 'Bob', 'age': None, 'city': 'London'}, {'name': 'Cara', 'age': 31, 'city': 'Paris'}]

Notice what changed:

  • Extra spaces were removed
  • City names were standardized
  • Age values were converted from strings to integers
  • Missing age became None

Useful improvements for real projects #

This example is intentionally simple. In real projects, you may also want to:

  • Skip rows with missing required fields
  • Use try-except when converting numbers
  • Read messy data from CSV files
  • Write cleaned data back to a file
  • Move cleaning steps into a function

For example, if your age values might contain invalid text, you can make the conversion safer:

raw_data = [
    {"name": "Alice", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "unknown", "city": "london"}
]

cleaned_data = []

for row in raw_data:
    name = row["name"].strip()
    age_text = row["age"].strip()
    city = row["city"].strip().title()

    try:
        age = int(age_text) if age_text else None
    except ValueError:
        age = None

    cleaned_data.append({
        "name": name,
        "age": age,
        "city": city
    })

print(cleaned_data)

If your data comes from a file, the next practical step is to read a CSV file in Python, clean each row, and then write a CSV file in Python.

Common mistakes #

These are common causes of messy data problems:

  • Extra spaces in user input or file data
  • Numbers stored as strings
  • Empty strings used instead of missing values
  • Mixed uppercase and lowercase text
  • Assuming every row has valid data

If your script is not working, these quick debug prints can help:

print(raw_data)
print(row)
print(type(row['age']))
print(repr(row['name']))
print(cleaned_data)

Why these help:

  • print(raw_data) shows the full original data
  • print(row) shows the current row being processed
  • print(type(row['age'])) confirms whether age is a string or number
  • print(repr(row['name'])) makes hidden spaces visible
  • print(cleaned_data) shows the result after cleaning

FAQ #

Why use a new cleaned list instead of changing the old one? #

It is safer for beginners. You keep the original data and can compare before and after.

Why is None used for missing values? #

None clearly means no value. It is better than keeping an empty string for numeric data.

What if int() fails during cleaning? #

Use a check first or wrap the conversion in try-except if the input may contain invalid numbers.

Can I use this script with CSV data? #

Yes. The same cleaning steps work after reading rows from a CSV file into dictionaries.

See also #

Press Esc to close