The Data Cleansing.xlsx file has data on 1500 customers of a particular company. A portion of these data appears in Figure 4.39, where many of the rows have been hidden. How much of this data set is usable? How much needs to be cleansed?
Figure 4.39


Details
Data Set with Bad Data
Objective

To find and fix errors in this company’s data set.
Solution

We purposely constructed this data set to have a number of problems, all of which you might encounter in real data sets. We begin with the Social Security Number (SSN). Presumably, all 1500 customers are distinct people, so all 1500 SSNs should be different. How can you tell if they are? One simple way is as follows.

Sort on the SSN column.
Once the SSNs are sorted, enter the formula =If(B3=B2,1,0) in cell J3 and copy this formula down column J. This formula checks whether two adjacent SSNs are equal.
Enter the formula =SUM(J3:J501) in cell J2 to see if there are any duplicate SSNs (see Figure 4.40). As you can see, there are two pairs of duplicate SSNs.
Figure 4.40


Details
Checking for Duplicate SSNs