Python Data Cleaning Script Example
This beginner-friendly example shows one practical Python data cleaning script.
You will start with messy data stored as a list of dictionaries, then clean it by:
- removing extra spaces
- standardizing text
- handling missing values
- converting strings to numbers
- building a new clean result
This is a useful pattern when you want to clean data before saving it, analyzing it, or using it in another part of your program.
Quick example #
Use this when you want a simple example of cleaning list-of-dictionary data before saving or using it.
raw_data = [
{"name": " Alice ", "age": "25", "city": "new york"},
{"name": "Bob", "age": "", "city": " london "},
{"name": " Cara", "age": "31", "city": "PARIS"}
]
cleaned_data = []
for row in raw_data:
name = row["name"].strip()
age_text = row["age"].strip()
city = row["city"].strip().title()
if not name:
continue
age = int(age_text) if age_text else None
cleaned_data.append({
"name": name,
"age": age,
"city": city
})
print(cleaned_data)
What this example does #
This script shows a simple data cleaning workflow:
- Uses a small list of dictionaries as sample messy data
- Cleans text values with
strip() - Standardizes city names with
title() - Converts age from string to
intwhen possible - Keeps missing age as
None - Builds a new cleaned list instead of changing the original
The messy data we start with #
Here is the raw data:
raw_data = [
{"name": " Alice ", "age": "25", "city": "new york"},
{"name": "Bob", "age": "", "city": " london "},
{"name": " Cara", "age": "31", "city": "PARIS"}
]
print(raw_data)
This data has several common problems:
- Extra spaces around values
- Mixed uppercase and lowercase text
- Missing values stored as empty strings
- Numbers stored as text instead of real numbers
Real files often contain this kind of messy data, especially when the data comes from user input or CSV files.
Step 1: Remove extra whitespace #
Use strip() on string values to remove spaces at the beginning and end.
row = {"name": " Alice ", "age": "25", "city": " london "}
name = row["name"].strip()
age_text = row["age"].strip()
city_text = row["city"].strip()
print(name)
print(age_text)
print(city_text)
Expected output:
Alice
25
london
Leading and trailing spaces can cause matching problems. For example, "Alice" and " Alice " look similar, but Python treats them as different strings.
If you want a focused guide, see how to remove whitespace from a string in Python.
Step 2: Standardize text values #
In this example, we use title() for city names.
city = "PARIS".strip().title()
print(city)
Output:
Paris
Standard formatting makes data easier to compare.
For example:
"paris""PARIS"" Paris "
can all become:
"Paris"
This simple script uses title() because it is easy to understand. In real projects, the best format depends on the data and the rules you want to enforce.
Step 3: Handle missing values #
Before converting the age, check whether the string is empty.
age_text = ""
age = int(age_text) if age_text else None
print(age)
Output:
None
Using None makes it clear that the value is missing.
That is usually better than keeping an empty string for numeric data, because:
- empty strings are still strings
- numbers should be stored as numbers
- missing values should be handled intentionally
Step 4: Convert data types #
The age value starts as text:
age_text = "25"
print(type(age_text))
Output:
<class 'str'>
You can convert it to an integer with int():
age = int(age_text)
print(age)
print(type(age))
Output:
25
<class 'int'>
This matters because numbers should be stored as numbers when you want to:
- compare values
- sort correctly
- do math
If the text is not numeric, int() will fail with a ValueError. If you need help with that, see ValueError: invalid literal for int() with base 10 and how to convert string to int in Python.
Step 5: Create the cleaned result #
Now put the steps together into one script.
raw_data = [
{"name": " Alice ", "age": "25", "city": "new york"},
{"name": "Bob", "age": "", "city": " london "},
{"name": " Cara", "age": "31", "city": "PARIS"}
]
cleaned_data = []
for row in raw_data:
name = row["name"].strip()
age_text = row["age"].strip()
city = row["city"].strip().title()
if not name:
continue
age = int(age_text) if age_text else None
cleaned_data.append({
"name": name,
"age": age,
"city": city
})
print(cleaned_data)
What this script does:
- Loops through each row
- Cleans each field
- Skips rows where
nameis empty - Converts age when possible
- Appends a new cleaned dictionary to
cleaned_data
Using a new list is safer for beginners because you keep the original data unchanged.
Expected output #
Running the full script prints:
[{'name': 'Alice', 'age': 25, 'city': 'New York'}, {'name': 'Bob', 'age': None, 'city': 'London'}, {'name': 'Cara', 'age': 31, 'city': 'Paris'}]
Notice what changed:
- Extra spaces were removed
- City names were standardized
- Age values were converted from strings to integers
- Missing age became
None
Useful improvements for real projects #
This example is intentionally simple. In real projects, you may also want to:
- Skip rows with missing required fields
- Use
try-exceptwhen converting numbers - Read messy data from CSV files
- Write cleaned data back to a file
- Move cleaning steps into a function
For example, if your age values might contain invalid text, you can make the conversion safer:
raw_data = [
{"name": "Alice", "age": "25", "city": "new york"},
{"name": "Bob", "age": "unknown", "city": "london"}
]
cleaned_data = []
for row in raw_data:
name = row["name"].strip()
age_text = row["age"].strip()
city = row["city"].strip().title()
try:
age = int(age_text) if age_text else None
except ValueError:
age = None
cleaned_data.append({
"name": name,
"age": age,
"city": city
})
print(cleaned_data)
If your data comes from a file, the next practical step is to read a CSV file in Python, clean each row, and then write a CSV file in Python.
Common mistakes #
These are common causes of messy data problems:
- Extra spaces in user input or file data
- Numbers stored as strings
- Empty strings used instead of missing values
- Mixed uppercase and lowercase text
- Assuming every row has valid data
If your script is not working, these quick debug prints can help:
print(raw_data)
print(row)
print(type(row['age']))
print(repr(row['name']))
print(cleaned_data)
Why these help:
print(raw_data)shows the full original dataprint(row)shows the current row being processedprint(type(row['age']))confirms whetherageis a string or numberprint(repr(row['name']))makes hidden spaces visibleprint(cleaned_data)shows the result after cleaning
FAQ #
Why use a new cleaned list instead of changing the old one? #
It is safer for beginners. You keep the original data and can compare before and after.
Why is None used for missing values? #
None clearly means no value. It is better than keeping an empty string for numeric data.
What if int() fails during cleaning? #
Use a check first or wrap the conversion in try-except if the input may contain invalid numbers.
Can I use this script with CSV data? #
Yes. The same cleaning steps work after reading rows from a CSV file into dictionaries.