Python Data Cleaning Script Example

This beginner-friendly example shows one practical Python data cleaning script.

You will start with messy data stored as a list of dictionaries, then clean it by:

removing extra spaces
standardizing text
handling missing values
converting strings to numbers
building a new clean result

This is a useful pattern when you want to clean data before saving it, analyzing it, or using it in another part of your program.

Quick example #

Use this when you want a simple example of cleaning list-of-dictionary data before saving or using it.

raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "", "city": " london "},
    {"name": "  Cara", "age": "31", "city": "PARIS"}
]

cleaned_data = []

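for row in raw_data:
    # Trim whitespace from every field and title-case the city
    name = row["name"].strip()
    age_text = row["age"].strip()
    city = row["city"].strip().title()

    # Skip rows that have no name
    if not name:
        continue

    # Convert age to int; an empty string becomes None
    age = int(age_text) if age_text else None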

    cleaned_data.append({
        "name": name,
        "age": age,
        "city": city
    })

print(cleaned_data)

What this example does #

This script shows a simple data cleaning workflow:

Uses a small list of dictionaries as sample messy data
Cleans text values with strip()
Standardizes city names with title()
Converts age from a string to an int when the value is not empty
Keeps missing age as None
Builds a new cleaned list instead of changing the original

The messy data we start with #

Here is the raw data:

raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "", "city": " london "},
    {"name": "  Cara", "age": "31", "city": "PARIS"}
]

print(raw_data)

This data has several common problems:

Extra spaces around values
Mixed uppercase and lowercase text
Missing values stored as empty strings
Numbers stored as text instead of real numbers

Real files often contain this kind of messy data, especially when the data comes from user input or CSV files.

Step 1: Remove extra whitespace #

Use strip() on string values to remove whitespace (spaces, tabs, and newlines) at the beginning and end.

row = {"name": " Alice ", "age": "25", "city": " london "}

name = row["name"].strip()
age_text = row["age"].strip()
city_text = row["city"].strip()

print(name)
print(age_text)
print(city_text)

Expected output:

Alice
25
london

Leading and trailing spaces can cause matching problems. For example, "Alice" and " Alice " look similar, but Python treats them as different strings.
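As a small illustration of that matching problem, the sketch below checks a name against a list of known names. The known_names list and raw_name value are made up for this example.

```python
# Hypothetical lookup data, not part of the main script
known_names = ["Alice", "Bob"]
raw_name = " Alice "

# The surrounding spaces are real characters, so the strings differ
print(raw_name == "Alice")              # False
print(raw_name in known_names)          # False

# After strip(), the lookup works as expected
print(raw_name.strip() in known_names)  # True
```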

If you want a focused guide, see how to remove whitespace from a string in Python.

Step 2: Standardize text values #

In this example, we use title() for city names.

city = "PARIS".strip().title()
print(city)

Output:

Paris

Standard formatting makes data easier to compare.

For example:

"paris"
"PARIS"
" Paris "

can all become:

"Paris"

This simple script uses title() because it is easy to understand. In real projects, the best format depends on the data and the rules you want to enforce.
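One case where title() may not match your rules is text with apostrophes, because it capitalizes the letter after any non-letter character. The sketch below shows that behavior and two possible alternatives: string.capwords() for display text and casefold() when you only need to compare values. The sample strings are made up, and which option fits depends on your data.

```python
import string

# title() capitalizes the letter after the apostrophe
print("st. john's".title())            # St. John'S

# capwords() splits on whitespace only, so the apostrophe is left alone
print(string.capwords("st. john's"))   # St. John's

# For comparison only, a lowercase key is often enough
cities = ["paris", "PARIS", " Paris "]
keys = {city.strip().casefold() for city in cities}
print(keys)                            # {'paris'}
```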

Step 3: Handle missing values #

Before converting the age, check whether the string is empty, because int("") raises a ValueError.

age_text = ""

age = int(age_text) if age_text else None
print(age)

Output:

None

Using None makes it clear that the value is missing.

That is usually better than keeping an empty string for numeric data, because:

empty strings are still strings
numbers should be stored as numbers
missing values should be handled intentionally
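To see why None helps later, here is a minimal sketch that computes an average age while skipping missing values. The small cleaned_data list is sample data in the same shape as the script's output.

```python
# Sample cleaned rows; Bob's age is missing
cleaned_data = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": None},
    {"name": "Cara", "age": 31},
]

# Keep only the ages that are real numbers
known_ages = [row["age"] for row in cleaned_data if row["age"] is not None]
missing_count = len(cleaned_data) - len(known_ages)

# Avoid dividing by zero if every age is missing
average_age = sum(known_ages) / len(known_ages) if known_ages else None

print(average_age)    # 28.0
print(missing_count)  # 1
```

The check uses "is not None" rather than a plain truth test, so a real age of 0 would still be counted.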

Step 4: Convert data types #

The age value starts as text:

age_text = "25"
print(type(age_text))

Output:

<class 'str'>

You can convert it to an integer with int():

age = int(age_text)
print(age)
print(type(age))

Output:

25
<class 'int'>

This matters because numbers should be stored as numbers when you want to:

compare values
sort correctly
do math
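The short sketch below shows how each of these goes wrong when numbers are left as text. The age_texts list is sample data for illustration.

```python
age_texts = ["9", "25", "100"]
ages = [int(text) for text in age_texts]

# Comparing: strings are compared character by character
print("9" > "25")         # True
print(9 > 25)             # False

# Sorting: text sorts alphabetically, numbers sort by value
print(sorted(age_texts))  # ['100', '25', '9']
print(sorted(ages))       # [9, 25, 100]

# Math: + joins strings but adds numbers
print("25" + "31")        # 2531
print(25 + 31)            # 56
```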

If the text is not numeric, int() will fail with a ValueError. If you need help with that, see ValueError: invalid literal for int() with base 10 and how to convert string to int in Python.

Step 5: Create the cleaned result #

Now put the steps together into one script.

raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "", "city": " london "},
    {"name": "  Cara", "age": "31", "city": "PARIS"}
]

cleaned_data = []

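for row in raw_data:
    # Trim whitespace from every field and title-case the city
    name = row["name"].strip()
    age_text = row["age"].strip()
    city = row["city"].strip().title()

    # Skip rows that have no name
    if not name:
        continue

    # Convert age to int; an empty string becomes None
    age = int(age_text) if age_text else None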

    cleaned_data.append({
        "name": name,
        "age": age,
        "city": city
    })

print(cleaned_data)

What this script does:

Loops through each row
Cleans each field
Skips rows where name is empty
Converts age to an int when it is not empty
Appends a new cleaned dictionary to cleaned_data

Using a new list is safer for beginners because you keep the original data unchanged.
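A quick way to confirm this is to print one row from each list after cleaning. The sketch below uses a single sample row to keep it short.

```python
raw_data = [{"name": " Alice ", "age": "25", "city": "new york"}]

cleaned_data = []
for row in raw_data:
    # Build a brand-new dictionary instead of assigning back into row
    cleaned_data.append({
        "name": row["name"].strip(),
        "age": int(row["age"].strip()),
        "city": row["city"].strip().title(),
    })

# The original row still has its spaces and its string age
print(raw_data[0])      # {'name': ' Alice ', 'age': '25', 'city': 'new york'}
print(cleaned_data[0])  # {'name': 'Alice', 'age': 25, 'city': 'New York'}
```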

Expected output #

Running the full script prints:

[{'name': 'Alice', 'age': 25, 'city': 'New York'}, {'name': 'Bob', 'age': None, 'city': 'London'}, {'name': 'Cara', 'age': 31, 'city': 'Paris'}]

Notice what changed:

Extra spaces were removed
City names were standardized
Age values were converted from strings to integers
Missing age became None

Useful improvements for real projects #

This example is intentionally simple. In real projects, you may also want to:

Skip rows that are missing other required fields, not just name
Use try-except when converting numbers
Read messy data from CSV files
Write cleaned data back to a file
Move cleaning steps into a function

For example, if your age values might contain invalid text, you can make the conversion safer:

raw_data = [
    {"name": "Alice", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "unknown", "city": "london"}
]

cleaned_data = []

for row in raw_data:
    name = row["name"].strip()
    age_text = row["age"].strip()
    city = row["city"].strip().title()

    try:
        age = int(age_text) if age_text else None
    except ValueError:
        # Text such as "unknown" is not a number, so treat it as missing
        age = None

    cleaned_data.append({
        "name": name,
        "age": age,
        "city": city
    })

print(cleaned_data)
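Moving the cleaning steps into a function, as suggested in the list above, could look like the sketch below. The function names clean_row and clean_rows are hypothetical, and the third sample row is added only to show a row being skipped.

```python
def clean_row(row):
    """Return a cleaned copy of one row, or None if the name is empty."""
    name = row["name"].strip()
    if not name:
        return None

    age_text = row["age"].strip()
    try:
        age = int(age_text) if age_text else None
    except ValueError:
        # Invalid text is treated as a missing age
        age = None

    return {"name": name, "age": age, "city": row["city"].strip().title()}


def clean_rows(rows):
    """Clean every row and drop the ones that clean_row rejects."""
    cleaned = []
    for row in rows:
        result = clean_row(row)
        if result is not None:
            cleaned.append(result)
    return cleaned


raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob", "age": "unknown", "city": " london "},
    {"name": "   ", "age": "40", "city": "rome"},
]

print(clean_rows(raw_data))
# [{'name': 'Alice', 'age': 25, 'city': 'New York'}, {'name': 'Bob', 'age': None, 'city': 'London'}]
```

Keeping the per-row logic in one function makes it easier to test a single row on its own.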

If your data comes from a file, the next practical step is to read a CSV file in Python, clean each row, and then write a CSV file in Python.
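Here is a minimal sketch of that read, clean, and write flow using the standard csv module. To keep it self-contained, io.StringIO stands in for real files, and the csv_text content is made-up sample data.

```python
import csv
import io

# Sample CSV content; in a real script this would come from a file
csv_text = "name,age,city\n Alice ,25,new york\nBob,, london \n"

# DictReader yields one dictionary per row, keyed by the header line
reader = csv.DictReader(io.StringIO(csv_text))

cleaned_rows = []
for row in reader:
    name = row["name"].strip()
    if not name:
        continue
    age_text = row["age"].strip()
    cleaned_rows.append({
        "name": name,
        "age": int(age_text) if age_text else None,
        "city": row["city"].strip().title(),
    })

# DictWriter writes None as an empty field
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["name", "age", "city"], lineterminator="\n")
writer.writeheader()
writer.writerows(cleaned_rows)

print(output.getvalue())
```

With real files, you would replace the StringIO objects with open(path, newline="", encoding="utf-8") for reading and open(path, "w", newline="", encoding="utf-8") for writing.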

Common mistakes #

These are common causes of messy data problems:

Extra spaces in user input or file data
Numbers stored as strings
Empty strings used instead of missing values
Mixed uppercase and lowercase text
Assuming every row has valid data

If your script is not working, these quick debug prints can help. The ones that use row must go inside the for loop:

print(raw_data)
print(row)
print(type(row['age']))
print(repr(row['name']))
print(cleaned_data)

Why these help:

print(raw_data) shows the full original data
print(row) shows the current row being processed
print(type(row['age'])) confirms whether age is a string or number
print(repr(row['name'])) makes hidden spaces visible
print(cleaned_data) shows the result after cleaning
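As a sketch of how these prints might look inside the loop, the example below prints the row number with the repr() of each raw value. The sample data, including the tab after Bob, is made up to show a hidden character.

```python
raw_data = [
    {"name": " Alice ", "age": "25", "city": "new york"},
    {"name": "Bob\t", "age": "", "city": " london "},
]

for index, row in enumerate(raw_data):
    # repr() shows quotes, spaces, and escape sequences such as \t
    print(index, repr(row["name"]), repr(row["age"]), type(row["age"]).__name__)

# 0 ' Alice ' '25' str
# 1 'Bob\t' '' str
```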

FAQ #

Why use a new cleaned list instead of changing the old one? #

It is safer for beginners. You keep the original data and can compare before and after.

Why is None used for missing values? #

None clearly means no value. It is better than keeping an empty string for numeric data.

What if int() fails during cleaning? #

Use a check first or wrap the conversion in try-except if the input may contain invalid numbers.
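The "check first" option could be sketched as below. The helper name parse_age is hypothetical, and it assumes ages are plain whole numbers with no sign, decimal point, or separators.

```python
def parse_age(text):
    """Return the age as an int, or None if the text is not a whole number."""
    text = text.strip()
    # isdecimal() is False for empty strings, letters, signs, and decimal points
    if text.isdecimal():
        return int(text)
    return None


print(parse_age("25"))       # 25
print(parse_age(" 31 "))     # 31
print(parse_age(""))         # None
print(parse_age("unknown"))  # None
print(parse_age("-5"))       # None
```

A check like this is stricter than int(), which would accept "-5". If negative numbers or other formats are valid in your data, try-except is usually the simpler choice.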

Can I use this script with CSV data? #

Yes. The same cleaning steps work after reading rows from a CSV file into dictionaries.
