Data Cleaning

Data cleaning is a crucial part of data analysis, particularly when you collect your own quantitative data. After you collect the data, you must enter it into a computer program such as SAS, SPSS, or Excel. Whether the data is entered by hand or read by a computer scanner, errors are inevitable, no matter how carefully the work is done. These errors include incorrect coding, incorrect reading of written codes, incorrect sensing of blackened marks, missing data, and so on.

Data cleaning is the process of detecting and correcting these coding errors.

Two types of data cleaning need to be performed on data sets: possible-code cleaning and contingency cleaning. Both are crucial to the data analysis process because, if coding errors are ignored, you will almost always produce misleading research findings.

Possible-Code Cleaning

Any given variable will have a specified set of answer choices and codes to match each answer choice. For example, the variable gender will have three answer choices and codes for each: 1 for male, 2 for female, and 0 for no answer. If you have a respondent coded as 6 for this variable, it is clear that an error has been made since that is not a possible answer code. Possible-code cleaning is the process of checking to see that only the codes assigned to the answer choices for each question (possible codes) appear in the data file.

Some computer programs and statistical software packages available for data entry check for these types of errors as the data is being entered. Here, the user defines the possible codes for each question before the data is entered. Then, if a number outside of the pre-defined possibilities is entered, an error message appears. For example, if the user tried to enter a 6 for gender, the computer might beep and refuse the code.

Other computer programs are designed to test for illegitimate codes in completed data files. That is, if the codes were not checked during data entry as just described, there are ways to check the files for coding errors after data entry is complete.
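
The entry-time check described above can be sketched in a few lines of Python. This is only an illustration, not how any particular statistics package works: the `CODEBOOK` dictionary and the `enter_value` function are hypothetical names, and the gender codes are the ones used in the example earlier.

```python
# Hypothetical codebook: each variable maps to its set of possible codes.
CODEBOOK = {
    "gender": {0, 1, 2},  # 0 = no answer, 1 = male, 2 = female
}


def enter_value(record, variable, code):
    """Store `code` in `record` only if it is a possible code for `variable`."""
    allowed = CODEBOOK[variable]
    if code not in allowed:
        # Refuse the entry, much like the beep described in the text.
        raise ValueError(
            f"{code!r} is not a possible code for {variable!r}; "
            f"expected one of {sorted(allowed)}"
        )
    record[variable] = code
    return record


record = {}
enter_value(record, "gender", 2)  # accepted

try:
    enter_value(record, "gender", 6)  # rejected: 6 is not in the codebook
except ValueError as err:
    rejected_message = str(err)
    print(rejected_message)
```

Because the bad code is refused before it is stored, the record keeps only the last valid value.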

If you are not using a computer program that checks for coding errors during the data entry process, you can locate some errors simply by examining the distribution of responses to each item in the data set. For example, you could generate a frequency table for the variable gender, where the mis-entered 6 would show up as a code that should not exist. You could then search for that entry in the data file and correct it.
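
As a rough sketch of this after-the-fact check, the Python below builds a frequency table for gender and then looks up the offending cases. The sample rows, the `id` field, and the variable names are invented for illustration.

```python
from collections import Counter

# Possible codes for gender, as in the example above.
POSSIBLE_GENDER_CODES = {0, 1, 2}

# Made-up data file: one dictionary per respondent.
rows = [
    {"id": 101, "gender": 1},
    {"id": 102, "gender": 2},
    {"id": 103, "gender": 6},  # mis-entered code
    {"id": 104, "gender": 0},
    {"id": 105, "gender": 2},
]

# Frequency table: how many respondents have each code.
frequencies = Counter(row["gender"] for row in rows)
for code in sorted(frequencies):
    flag = "" if code in POSSIBLE_GENDER_CODES else "  <-- not a possible code"
    print(f"{code}: {frequencies[code]}{flag}")

# Locate the cases that need to be checked against the original questionnaire.
bad_ids = [row["id"] for row in rows if row["gender"] not in POSSIBLE_GENDER_CODES]
print("Cases to correct:", bad_ids)
```

The table only shows that an impossible code exists. Finding the correct value still means going back to the original questionnaire for each flagged case.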

Contingency Cleaning

The second type of data cleaning is called contingency cleaning and is a little more complicated than possible-code cleaning. The logical structure of the data may place certain limits on the responses of certain respondents or on certain variables. Contingency cleaning is the process of checking that only those cases that should have data on a particular variable do in fact have such data. For example, let’s say that you have a questionnaire in which you ask respondents how many times they have been pregnant. All female respondents should have a response coded in the data. Males, however, should either be left blank or should have a special code for failing to answer.

If any males in the data are coded as having 3 pregnancies, for example, you know there is an error and it needs to be corrected.
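
The pregnancy example can be expressed as a small rule check. The sketch below assumes the gender codes used earlier (1 for male, 2 for female) and treats a blank answer as `None`. The sample rows and the `contingency_errors` function are hypothetical.

```python
MALE, FEMALE = 1, 2

# Made-up data file; None stands for a blank (no answer coded).
rows = [
    {"id": 201, "gender": FEMALE, "pregnancies": 2},
    {"id": 202, "gender": MALE, "pregnancies": None},
    {"id": 203, "gender": MALE, "pregnancies": 3},       # contingency error
    {"id": 204, "gender": FEMALE, "pregnancies": None},  # expected answer is missing
]


def contingency_errors(rows):
    """Return (id, problem) pairs for rows that break the pregnancy rule."""
    errors = []
    for row in rows:
        has_value = row["pregnancies"] is not None
        if row["gender"] == MALE and has_value:
            errors.append((row["id"], "male respondent has a pregnancy count"))
        elif row["gender"] == FEMALE and not has_value:
            errors.append((row["id"], "female respondent is missing a pregnancy count"))
    return errors


problems = contingency_errors(rows)
for case_id, problem in problems:
    print(case_id, problem)
```

Unlike possible-code cleaning, this check looks at two variables at once. A value of 3 is a perfectly valid pregnancy count on its own and is an error only in combination with the male code.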

References

Babbie, E. (2001). The Practice of Social Research (9th ed.). Belmont, CA: Wadsworth Thomson.

Crossman, Ashley. "Data Cleaning." ThoughtCo, Mar. 2, 2017, thoughtco.com/data-cleaning-3026541.