
Create a Bulletproof Data Cleaning Checklist

Systematically find and fix messy data — missing values, duplicates, outliers, and format issues

Copy & Paste this prompt
I have a dataset that needs cleaning before analysis.

Dataset info:
- Source: [WHERE THE DATA CAME FROM — CRM, survey, API, manual entry, CSV export]
- Size: [ROWS] rows × [COLUMNS] columns
- Key columns: [LIST THE IMPORTANT ONES]
- Known issues: [ANY PROBLEMS YOU ALREADY KNOW ABOUT]
- Tool I am using: [EXCEL / PYTHON / R / SQL / GOOGLE SHEETS]

Create a complete data cleaning plan:

1. INITIAL ASSESSMENT
   - What to check first (shape, types, head/tail)
   - Exact commands/formulas for my tool
   - Expected vs actual column types

2. MISSING DATA AUDIT
   - How to find missing/null/blank values per column
   - For each column: recommended handling strategy
     (drop / fill with mean / fill with mode / flag / interpolate)
   - How to decide which strategy fits

3. DUPLICATE DETECTION
   - How to find exact duplicates
   - How to find near-duplicates (fuzzy matching)
   - Decision framework: keep first / keep latest / merge

4. OUTLIER ANALYSIS
   - Statistical methods to detect outliers (IQR, Z-score)
   - For each numeric column: what counts as suspicious
   - When to remove vs. investigate vs. keep outliers

5. FORMAT STANDARDIZATION
   - Date formats → one standard
   - Text → consistent casing, trim whitespace
   - Categories → consolidate similar values
   - Numbers → handle currency symbols, commas, units

6. VALIDATION RULES
   - Cross-column consistency checks
   - Range checks (age > 0, percentage ≤ 100)
   - Referential integrity (do all IDs exist?)

7. DOCUMENTATION
   - Cleaning log template: what was changed, why, how many rows affected
   - Before/after summary statistics

Give me the actual code/formulas, not just concepts.

Works with

ChatGPT · Claude · Gemini

💡 Pro Tips

  • Always make a backup before cleaning — work on a copy, never the original
  • Clean columns in dependency order — fix IDs before fixing foreign key references
  • Document every cleaning step — future you will forget why you removed those 47 rows
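The third tip can be as lightweight as an append-only log kept alongside the cleaning script; a minimal sketch (the structure and helper name are illustrative):

```python
import pandas as pd

# Append-only cleaning log: one entry per operation, written as you go.
log = []

def record(step, reason, rows_affected):
    """Append one cleaning action to the log."""
    log.append({"step": step, "reason": reason, "rows_affected": rows_affected})

# Example entries, mirroring the kind of actions described in this checklist.
record("drop_duplicates", "exact duplicate rows", 12)
record("dropna(shipping_address)", "cannot ship without an address", 5)

# Exported at the end, this table is the before/after audit trail.
log_df = pd.DataFrame(log)
print(log_df)
```

Saving `log_df` to CSV next to the cleaned dataset answers "why were those rows removed?" months later.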

✨ Example Output

Dataset: Customer orders export (CSV from Shopify)
Tool: Python (pandas)

INITIAL ASSESSMENT:
df = pd.read_csv("orders.csv")
print(df.shape) → (5712, 23)
print(df.dtypes) → Found 3 columns typed as object that should be numeric
print(df.isnull().sum()) → 6 columns have missing values
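Those object-typed numeric columns are usually currency strings; a minimal coercion sketch (the column name and values are illustrative, not from the actual export):

```python
import pandas as pd

# Toy frame standing in for the export; "order_total" arrives as text
# because of currency symbols and thousands separators.
df = pd.DataFrame({"order_total": ["$1,200.50", "$89.99", "bad"]})

# Strip the currency formatting, then coerce; unparseable values become NaN
# instead of raising, so they surface in the missing-data audit.
cleaned = df["order_total"].str.replace(r"[$,]", "", regex=True)
df["order_total"] = pd.to_numeric(cleaned, errors="coerce")

print(df["order_total"].dtype)   # float64
print(df["order_total"].isna().sum())  # 1 value failed to parse
```

`errors="coerce"` is deliberate: it turns silent garbage into visible NaNs rather than crashing mid-pipeline.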

MISSING DATA AUDIT:
| Column | Missing | % | Strategy |
|---|---|---|---|
| email | 23 | 0.4% | Flag — likely guest checkout |
| shipping_address | 5 | 0.09% | Drop — cannot ship without |
| phone | 1,247 | 22% | Keep as-is — optional field |
| discount_code | 3,891 | 68% | Fill with "none" — absence is meaningful |
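The audit table above can be produced and acted on in a few lines; a sketch using the same column names on toy data:

```python
import pandas as pd

# Toy frame with the columns from the audit table above.
df = pd.DataFrame({
    "email": ["a@x.com", None, "c@x.com"],
    "phone": [None, "555-0100", None],
    "discount_code": [None, "SAVE10", None],
    "shipping_address": ["1 Main St", "2 Oak Ave", None],
})

# Per-column missing counts and percentages: the raw numbers behind the table.
audit = pd.DataFrame({
    "missing": df.isnull().sum(),
    "pct": (df.isnull().mean() * 100).round(1),
})
print(audit)

# Apply the per-column strategies:
df["email_missing"] = df["email"].isna()                  # flag, don't fill
df["discount_code"] = df["discount_code"].fillna("none")  # absence is meaningful
df = df.dropna(subset=["shipping_address"])               # cannot ship without one
```

Note that each column gets its own strategy; a blanket `dropna()` on this frame would delete every row.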

DUPLICATE DETECTION:
Found 12 exact duplicates → removed (kept first occurrence)
Found 34 potential near-duplicates (same email + date + total) → flagged for manual review
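Both checks come down to `duplicated()` with different column subsets; a sketch on toy data (the near-duplicate key of email + date + total follows the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "email": ["a@x.com", "a@x.com", "a@x.com", "b@x.com"],
    "date": ["2024-01-05", "2024-01-05", "2024-01-05", "2024-01-06"],
    "total": [50.0, 50.0, 50.0, 75.0],
})

# Exact duplicates: every column identical — keep the first occurrence.
exact = df.duplicated().sum()
df = df.drop_duplicates(keep="first")

# Near-duplicates: same email + date + total but a different order_id —
# flag both members of each group for manual review, don't auto-delete.
key = ["email", "date", "total"]
df["near_dup_flag"] = df.duplicated(subset=key, keep=False)
```

True fuzzy matching (typos in emails, reordered names) needs a library such as `rapidfuzz`, but a subset-key pass like this catches most near-duplicates cheaply first.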

OUTLIERS:
order_total: IQR method flagged 89 orders over ,340 → investigated, 3 were data entry errors (extra zero), rest legitimate bulk orders
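The IQR fence used above can be sketched in a few lines (the numbers are illustrative, not the actual order data):

```python
import pandas as pd

totals = pd.Series([40, 45, 50, 55, 60, 48, 52, 5000])  # one extreme order

# IQR fence: anything beyond Q3 + 1.5*IQR is flagged for investigation,
# not automatically deleted — it may be a legitimate bulk order.
q1, q3 = totals.quantile(0.25), totals.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
flagged = totals[totals > upper]
print(flagged)  # only the 5000 order is beyond the fence
```

The fence marks candidates for review; as in the example above, a human still decides whether each flagged row is a data entry error or a real order.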