Create a Bulletproof Data Cleaning Checklist
Systematically find and fix messy data — missing values, duplicates, outliers, and format issues
Copy & Paste this prompt
I have a dataset that needs cleaning before analysis.
Dataset info:
- Source: [WHERE THE DATA CAME FROM — CRM, survey, API, manual entry, CSV export]
- Size: [ROWS] rows × [COLUMNS] columns
- Key columns: [LIST THE IMPORTANT ONES]
- Known issues: [ANY PROBLEMS YOU ALREADY KNOW ABOUT]
- Tool I am using: [EXCEL / PYTHON / R / SQL / GOOGLE SHEETS]
Create a complete data cleaning plan:
1. INITIAL ASSESSMENT
- What to check first (shape, types, head/tail)
- Exact commands/formulas for my tool
- Expected vs actual column types
2. MISSING DATA AUDIT
- How to find missing/null/blank values per column
- For each column: recommended handling strategy
(drop / fill with mean / fill with mode / flag / interpolate)
- How to decide which strategy fits
3. DUPLICATE DETECTION
- How to find exact duplicates
- How to find near-duplicates (fuzzy matching)
- Decision framework: keep first / keep latest / merge
4. OUTLIER ANALYSIS
- Statistical methods to detect outliers (IQR, Z-score)
- For each numeric column: what counts as suspicious
- When to remove vs. investigate vs. keep outliers
5. FORMAT STANDARDIZATION
- Date formats → one standard
- Text → consistent casing, trim whitespace
- Categories → consolidate similar values
- Numbers → handle currency symbols, commas, units
6. VALIDATION RULES
- Cross-column consistency checks
- Range checks (age > 0, percentage ≤ 100)
- Referential integrity (do all IDs exist?)
7. DOCUMENTATION
- Cleaning log template: what was changed, why, how many rows affected
- Before/after summary statistics
Give me the actual code/formulas, not just concepts.
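As a taste of the kind of code the prompt asks for in steps 5 and 6, here is a minimal pandas sketch. All column names and values are placeholders for illustration, not from any real dataset:

```python
import pandas as pd

# Toy frame with the usual format problems: mixed-case text with stray
# whitespace, currency strings, and an impossible age value.
df = pd.DataFrame({
    "signup_date": ["05/01/2024", "06/02/2024"],
    "name": ["  Alice ", "BOB"],
    "price": ["$1,200.50", "300"],
    "age": [34, -2],
})

# 5. FORMAT STANDARDIZATION
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%d/%m/%Y")
df["name"] = df["name"].str.strip().str.title()
df["price"] = pd.to_numeric(df["price"].str.replace(r"[$,]", "", regex=True))

# 6. VALIDATION RULES — flag violations rather than silently deleting rows
bad_age = df["age"] <= 0
```

The range check produces a boolean mask, so violating rows can be logged and reviewed before anything is dropped.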
Works with
ChatGPT · Claude · Gemini
💡 Pro Tips
- Always make a backup before cleaning — work on a copy, never the original
- Clean columns in dependency order — fix IDs before fixing foreign key references
- Document every cleaning step — future you will forget why you removed those 47 rows
✨ Example Output
Dataset: Customer orders export (CSV from Shopify)
Tool: Python (pandas)
INITIAL ASSESSMENT:
df = pd.read_csv("orders.csv")
print(df.shape) → (5712, 23)
print(df.dtypes) → Found 3 columns typed as object that should be numeric
print(df.isnull().sum()) → 6 columns have missing values
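The dtype fix the assessment points to can be sketched in pandas like this. The two columns here are hypothetical stand-ins for the three object-typed columns in the real export:

```python
import pandas as pd

# Toy frame standing in for the Shopify export; column names are illustrative.
df = pd.DataFrame({
    "order_total": ["19.99", "250.00", "12.50"],  # numeric data typed as object
    "quantity": ["1", "3", "2"],                   # numeric data typed as object
})

# Coerce object columns that should be numeric. errors="coerce" turns
# unparseable values into NaN instead of raising, so bad entries surface
# in the missing-data audit rather than crashing the pipeline.
for col in ["order_total", "quantity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
```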
MISSING DATA AUDIT:
| Column | Missing | % | Strategy |
|---|---|---|---|
| email | 23 | 0.4% | Flag — likely guest checkout |
| shipping_address | 5 | 0.09% | Drop — cannot ship without |
| phone | 1,247 | 22% | Keep as-is — optional field |
| discount_code | 3,891 | 68% | Fill with "none" — absence is meaningful |
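The three strategies in the table (flag, drop, fill) each map to a one-liner in pandas. A minimal sketch with placeholder data:

```python
import pandas as pd

# Hypothetical slice of the orders table; values are illustrative.
df = pd.DataFrame({
    "email": ["a@x.com", None, "c@x.com"],
    "shipping_address": ["1 Main St", "2 Oak Ave", None],
    "discount_code": [None, "SAVE10", None],
})

# Flag: record that email is missing (likely guest checkout) without losing the row.
df["email_missing"] = df["email"].isna()

# Drop: rows with no shipping address cannot be fulfilled at all.
df = df.dropna(subset=["shipping_address"])

# Fill: here absence itself is meaningful, so encode it explicitly.
df["discount_code"] = df["discount_code"].fillna("none")
```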
DUPLICATE DETECTION:
Found 12 exact duplicates → removed (kept first occurrence)
Found 34 potential near-duplicates (same email + date + total) → flagged for manual review
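A sketch of both checks in pandas — exact duplicates are dropped, near-duplicates on a composite key are only flagged for review (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "a@x.com"],
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-01"],
    "total": [10.0, 10.0, 20.0, 10.0],
    "order_id": [1, 1, 2, 3],
})

# Exact duplicates: every column identical → safe to keep the first occurrence.
df = df.drop_duplicates(keep="first")

# Near-duplicates: same email + date + total but different order_id →
# flag both rows (keep=False) for manual review instead of deleting.
key = ["email", "order_date", "total"]
df["possible_dup"] = df.duplicated(subset=key, keep=False)
```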
OUTLIERS:
order_total: IQR method flagged 89 orders over ,340 → investigated, 3 were data entry errors (extra zero), rest legitimate bulk orders
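The IQR method used above is a few lines in pandas; this toy series has one planted outlier:

```python
import pandas as pd

# Illustrative order totals with one suspicious value (extra-zero typo).
totals = pd.Series([25, 30, 28, 32, 27, 29, 31, 26, 5000])

q1, q3 = totals.quantile(0.25), totals.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag, don't delete: every flagged order gets investigated first,
# since legitimate bulk orders also land outside the fences.
outliers = totals[(totals < lower) | (totals > upper)]
```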