Python Data Cleaning Script Example
This beginner-friendly example shows one practical Python data cleaning script.
You will start with messy data stored as a list of dictionaries, then clean it by:
- removing extra spaces
- standardizing text
- handling missing values
- converting strings to numbers
- building a new clean result
This is a useful pattern when you want to clean data before saving it, analyzing it, or using it in another part of your program.
Quick example
Use this when you want a simple example of cleaning list-of-dictionary data before saving or using it.
raw_data = [
{"name": " Alice ", "age": "25", "city": "new york"},
{"name": "Bob", "age": "", "city": " london "},
{"name": " Cara", "age": "31", "city": "PARIS"}
]
cleaned_data = []
for row in raw_data:
name = row["name"].strip()
age_text = row["age"].strip()
city = row["city"].strip().title()
if not name:
continue
age = int(age_text) if age_text else None
cleaned_data.append({
"name": name,
"age": age,
"city": city
})
print(cleaned_data)
What this example does
This script shows a simple data cleaning workflow:
- Uses a small list of dictionaries as sample messy data
- Cleans text values with
strip() - Standardizes city names with
title() - Converts age from string to
intwhen possible - Keeps missing age as
None - Builds a new cleaned list instead of changing the original
The messy data we start with
Here is the raw data:
raw_data = [
{"name": " Alice ", "age": "25", "city": "new york"},
{"name": "Bob", "age": "", "city": " london "},
{"name": " Cara", "age": "31", "city": "PARIS"}
]
print(raw_data)
This data has several common problems:
- Extra spaces around values
- Mixed uppercase and lowercase text
- Missing values stored as empty strings
- Numbers stored as text instead of real numbers
Real files often contain this kind of messy data, especially when the data comes from user input or CSV files.
Step 1: Remove extra whitespace
Use strip() on string values to remove spaces at the beginning and end.
row = {"name": " Alice ", "age": "25", "city": " london "}
name = row["name"].strip()
age_text = row["age"].strip()
city_text = row["city"].strip()
print(name)
print(age_text)
print(city_text)
Expected output:
Alice
25
london
Leading and trailing spaces can cause matching problems. For example, "Alice" and " Alice " look similar, but Python treats them as different strings.
If you want a focused guide, see how to remove whitespace from a string in Python.
Step 2: Standardize text values
In this example, we use title() for city names.
city = "PARIS".strip().title()
print(city)
Output:
Paris
Standard formatting makes data easier to compare.
For example:
"paris""PARIS"" Paris "
can all become:
"Paris"
This simple script uses title() because it is easy to understand. In real projects, the best format depends on the data and the rules you want to enforce.
Step 3: Handle missing values
Before converting the age, check whether the string is empty.
age_text = ""
age = int(age_text) if age_text else None
print(age)
Output:
None
Using None makes it clear that the value is missing.
That is usually better than keeping an empty string for numeric data, because:
- empty strings are still strings
- numbers should be stored as numbers
- missing values should be handled intentionally
Step 4: Convert data types
The age value starts as text:
age_text = "25"
print(type(age_text))
Output:
<class 'str'>
You can convert it to an integer with int():
age = int(age_text)
print(age)
print(type(age))
Output:
25
<class 'int'>
This matters because numbers should be stored as numbers when you want to:
- compare values
- sort correctly
- do math
If the text is not numeric, int() will fail with a ValueError. If you need help with that, see ValueError: invalid literal for int() with base 10 and how to convert string to int in Python.
Step 5: Create the cleaned result
Now put the steps together into one script.
raw_data = [
{"name": " Alice ", "age": "25", "city": "new york"},
{"name": "Bob", "age": "", "city": " london "},
{"name": " Cara", "age": "31", "city": "PARIS"}
]
cleaned_data = []
for row in raw_data:
name = row["name"].strip()
age_text = row["age"].strip()
city = row["city"].strip().title()
if not name:
continue
age = int(age_text) if age_text else None
cleaned_data.append({
"name": name,
"age": age,
"city": city
})
print(cleaned_data)
What this script does:
- Loops through each row
- Cleans each field
- Skips rows where
nameis empty - Converts age when possible
- Appends a new cleaned dictionary to
cleaned_data
Using a new list is safer for beginners because you keep the original data unchanged.
Expected output
Running the full script prints:
[{'name': 'Alice', 'age': 25, 'city': 'New York'}, {'name': 'Bob', 'age': None, 'city': 'London'}, {'name': 'Cara', 'age': 31, 'city': 'Paris'}]
Notice what changed:
- Extra spaces were removed
- City names were standardized
- Age values were converted from strings to integers
- Missing age became
None
Useful improvements for real projects
This example is intentionally simple. In real projects, you may also want to:
- Skip rows with missing required fields
- Use
try-exceptwhen converting numbers - Read messy data from CSV files
- Write cleaned data back to a file
- Move cleaning steps into a function
For example, if your age values might contain invalid text, you can make the conversion safer:
raw_data = [
{"name": "Alice", "age": "25", "city": "new york"},
{"name": "Bob", "age": "unknown", "city": "london"}
]
cleaned_data = []
for row in raw_data:
name = row["name"].strip()
age_text = row["age"].strip()
city = row["city"].strip().title()
try:
age = int(age_text) if age_text else None
except ValueError:
age = None
cleaned_data.append({
"name": name,
"age": age,
"city": city
})
print(cleaned_data)
If your data comes from a file, the next practical step is to read a CSV file in Python, clean each row, and then write a CSV file in Python.
Common mistakes
These are common causes of messy data problems:
- Extra spaces in user input or file data
- Numbers stored as strings
- Empty strings used instead of missing values
- Mixed uppercase and lowercase text
- Assuming every row has valid data
If your script is not working, these quick debug prints can help:
print(raw_data)
print(row)
print(type(row['age']))
print(repr(row['name']))
print(cleaned_data)
Why these help:
print(raw_data)shows the full original dataprint(row)shows the current row being processedprint(type(row['age']))confirms whetherageis a string or numberprint(repr(row['name']))makes hidden spaces visibleprint(cleaned_data)shows the result after cleaning
FAQ
Why use a new cleaned list instead of changing the old one?
It is safer for beginners. You keep the original data and can compare before and after.
Why is None used for missing values?
None clearly means no value. It is better than keeping an empty string for numeric data.
What if int() fails during cleaning?
Use a check first or wrap the conversion in try-except if the input may contain invalid numbers.
Can I use this script with CSV data?
Yes. The same cleaning steps work after reading rows from a CSV file into dictionaries.