43 Data preprocessing with Excel
In this chapter, we will show you how to apply these preprocessing steps in Excel for one participant. That is, for this participant, we will:
- Remove extreme RTs.
- Remove incorrect trials.
- Remove outlier RTs.
Then, we will calculate measures of central tendency for this participant. Please note that not all researchers will apply all of these steps. However, you should be aware of the possible processing steps and know how to implement them using Excel. In addition, we will also calculate accuracies.
The processing steps, explained in detail below, are the following:
- Step 1: Convert reaction times to milliseconds.
- Step 2: Calculate overall accuracy.
- Step 3: Remove trials with extreme RTs.
- Step 4: Calculate condition-specific accuracies.
- Step 5: Calculate condition-specific mean RTs (before outlier removal).
- Step 6: Calculate SDs and thresholds for outlier removal.
- Step 7: Calculate condition-specific mean RTs (after outlier removal).
Please note that there are other ways to achieve the same aims in Excel. We have opted for an approach that is relatively verbose and keeps the formulas mostly simple.
A partially worked example
Please download the partially worked example below. This is an output file from a single participant who previously completed a letter flanker task (note that this experiment also included a flanker distance manipulation which we will ignore for our analysis).
Please note that the following modifications were made to the original output file:
- The file was converted to an Excel
.xlsx
file. - A few columns not relevant for the current analysis were removed.
- Rows corresponding to the practice trials were removed.
- Columns that were part of the original output file are highlighted in green.
- Columns added to the output file are highlighted in yellow.
- Analysis results are in cells highlighted in blue.
The following short video demonstrates how to adjust column widths, remove practice trials, and remove unnecessary columns (note that the video has no sound).
The most relevant columns for our purposes are:
congruency
: whether the trial was congruent (e.g., “HHHHH”) or incongruent (e.g., “SSHSS”)response.corr
: the accuracy of the response (0 = incorrect, 1 = correct)response.rt
: the response time (in seconds)
In the example file, we have already done all the calculations for the congruent trials. We will walk you through the formulas common to all trials and those for the congruent trials. In the second part of the lab, your task will be to complete the calculations for the incongruent trials. Note that the partially worked example already includes all the columns you need—you will only need to add the formulas. In addition, note that you don’t need to reinvent the wheel—simply copy and paste the formulas from the congruent trials and adapt them for the incongruent trials. How to copy and paste Excel formulas was described in Section 40.4.
Step 1: Convert reaction times to milliseconds
Aim: Convert RTs to ms.
Why do we do this?: This is done for convenience only. Most of us find it easier to use integers as opposed to decimals. Also, the numbers are shorter when using ms (e.g., 527 ms vs. 0.527 s).
In the partially worked example, we created a new column RTms
and added the following formula to the first cell:
=H2*1000
That is, we simply multiplied the RT in seconds by 1,000 to obtain milliseconds. This formula was then copied downwards by dragging the fill handle (alternatively, you can also double-click on the fill handle).
Step 2: Calculate the overall accuracy
Aim: Calculate overall accuracy, taking into account all trials.
Why do we do this?: Sanity check of the data. If the overall accuracy is close to chance, you should probably exclude this participant.
This is how to calculate accuracy:
\[Overall\ accuracy = \frac{Number\ of\ correct\ trials}{Total\ number\ of\ trials}\]
We used COUNTIF
and COUNT
to calculate the overall accuracy in cell U5
in the partially worked example. As the accuracy data are in cells G2
to G73
, this is what the formula should look like:
=ROUND((COUNTIF(G2:G73, 1)/COUNT(G2:G73))*100,1)
In words:
- Count all the correct trials in cells
G2
toG73
(i.e., the ones where the value is 1). - Count all trials in cells
G2
toG73
. - Multiply by 100 to get a percentage (optional).
- Round the result to 1 decimal place (optional).
Without ROUND
:
=(COUNTIF(G2:G73, 1)/COUNT(G2:G73))*100
Without percentage calculation:
=COUNTIF(G2:G73, 1)/COUNT(G2:G73)
Step 3: Remove trials with extreme RTs
Aim: Remove trials that are either extremely fast (anticipatory responses) or extremely slow (attentional lapses or distractions).
Why do we do this?: See Section 42.2.
In the partially worked example, we removed trials with extreme RTs from our column with the accuracy information. To this aim, we created a column called accNoExtremes
. We added the following formula to the first cell and copied it downwards:
=IF(AND(I2>=150, I2<=3000), G2, "")
In words:
- If the RT is slower than 150 ms (
I2>=150
) and - …if the RT is faster than 3 seconds (
I2<=3000
) - …then copy the accuracy information to the new column.
Second, we removed trials with extreme RTs from our column with the RT information. To this aim, we created a column called rtNoExtremes
. We added the following formula to the first cell and copied it downwards:
=IF(AND(I2>=150, I2<=3000), I2, "")
In words:
- If the RT is slower 150 ms (
I2>=150
) and - …if the RT is faster than 3 seconds (
I2<=3000
) - …then copy the RT information to the new column.
Finally, in cell U8
we calculated the percentage of extreme RTs:
=ROUND((COUNTIF(J2:J73,"")/ROWS(J2:J73))*100,1)
In words:
- Count the empty cells in the range J2 to J73 (
COUNTIF(J2:J73,"")
). - Divide this number by the number of rows in the range J2 to J73 (
ROWS(J2:J73)
). - Optional: Multiply by 100 and round to one decimal place.
Formula without percentage calculation and rounding:
=COUNTIF(J2:J73,"")/ROWS(J2:J73)
Step 4: Calculate condition-specific accuracies
Aim: Calculate the accuracies for all experimental conditions (ignoring extreme RTs).
Why do we do this?: We would like to know if our experimental conditions had an influence on accuracy.
In the partially worked example, we created two new columns just for congruent and incongruent trial accuracies called conAccNoExtremes
and inconAccNoExtremes
, respectively.
Formula for column conAccNoExtremes
: =IF(D2="con", J2, "")
Using this new column, we calculated the accuracy in congruent trials in cell U12
with the following formula:
=ROUND((COUNTIF(L2:L73, 1)/COUNT(L2:L73))*100, 1)
In words:
- Count the cells in the range
L2
toL73
that are equal to 1 ((COUNTIF(L2:L73, 1)
). - Count the cells in the range
L2
toL73
that are any number (i.e., 0 or 1 in our case) ((COUNT(L2:L73)
). - Divide the first count by the second count (
/
). - Optional: Multiply by 100 and round to one decimal place.
Step 5: Calculate condition-specific mean RTs (before outlier removal)
Aim: Calculate condition-specific mean RTs after removing incorrect trials.
Why do we do this?: Calculating mean RTs before removing outliers is a necessary step if one is using an SD-based approach for outlier rejection. We also remove incorrect trials as there is good evidence that these are typically faster than correct trials.
In the partially worked example, we created two new columns just for congruent and incongruent RTs called conRTNoExtremesCorr
and inconRTNoExtremesCorr
, respectively.
Formula for column conRTNoExtremesCorr
: =IF(AND(D2="con", J2=1), K2, "")
In words:
- If the trial is congruent and
- …if the trial is correct
- …then copy the RT to the new cell.
We then copied the formula downwards.
Using this new column, we then calculated the mean RT in congruent trials in cell U17
with the following formula:
=ROUND(AVERAGE(N2:N73), 0)
In words:
- Average trials in range N2 to N73.
- Round to 0 decimal places.
Step 6: Calculate SDs and thresholds for outlier removal
Aim: Calculate condition-specific mean RTs after removing outliers based on standard deviations (SDs).
Why do we do this?: Calculating SDs and SD-based outlier rejection thresholds is a necessary step if one is using an SD-based approach for outlier rejection.
Frequently, RTs located 2 or more SDs away from the mean are considered outliers (you might also find 2.5 SDs or 3 SDs in the literature). To illustrate this approach: Imagine a participant’s mean RT is 1,000 ms and their SD is 200 ms. If you apply the 2 SDs rule, you would remove all RTs below 600 ms and above 1,400 ms from the analysis.
As a first step, we need to calculate the SDs of our two trial types. In the partially worked example, we used the following formula in cell U21
for congruent trials:
=ROUND(STDEV.S(N2:N73), 0)
We then determined the lower and upper thresholds for outlier rejection for both conditions. The basic idea is:
- Lower threshold:
Mean - 2 SDs
- Upper threshold:
Mean + 2 SDs
It is important to use the correct mean and SD for this. That is, for example, thresholds for congruent trials need to use the congruent mean and the congruent SD. We used the following formulas in cells U25
and U26
, respectively:
- Lower congruent threshold:
=U17-2*U21
- Upper congruent threshold:
=U17+2*U21
Step 7: Calculate condition-specific mean RTs (after outlier removal)
Aim: Calculate condition-specific mean RTs after removing outliers based on standard deviations (SDs).
Why do we do this?: The result of this step is what we will input into SPSS to find out if our experimental conditions had an influence on RTs.
In our partially worked example, we’ve added two more columns for this, conRTFinal
and inconRTFinal
.
We then added the following formula for congruent trials to cell P2
and copied it downwards:
=IF(AND(N2>$U$25, N2<$U$26), N2, "")
In words:
- If the RT is above the lower threshold (
N2>$U$25
) and- Note that we use an absolute cell reference here to make sure the reference to cell
U25
does not change!
- Note that we use an absolute cell reference here to make sure the reference to cell
- …if the RT is below the upper threshold (
N2<$U$26
) - …then copy the RT to the new cell.
Based on the information in conRTFinal
, we can then calculate the final mean RT for congruent trials in cell U32
:
=ROUND(AVERAGE(P2:P73),0)
Note that this mean fulfils the following criteria:
- It is not influenced by extreme RTs.
- It is not influenced by outlier RTs.
- It is not influenced by RTs in incorrect trials.
We also counted how many congruent trials were removed as outliers in cell U35
:
=COUNTIF(N2:N73, "<"&U25) + COUNTIF(N2:N73, ">"&U26)
In words:
- Count all RTs below the threshold
=COUNTIF(N2:N73, "<"&U25)
.<
means less than.&
means combine (“concatenate”) what comes before and after the&
.U25
is the lower threshold.
- Count all RTs above the threshold.
- Add both counts (
+
).
43.1 Calculate medians
In addition, we are going to calculate the medians. The median is insensitive to outliers. Therefore, we don’t need to reject outliers (and we therefore used column N
in the partially worked example). We calculated the median in cell U39
:
The formula for congruent trials: =ROUND(MEDIAN(N2:N73), 0)
In words:
- Calculate the median for RTs in the range
N2:N73
. - Round to 0 decimal places.
43.2 Comparison of means with and without outlier removal and medians
Let’s compare the different values:
Condition | Mean (in ms) before outlier removal | Mean (in ms) after outlier removal | Median (in ms) |
---|---|---|---|
Congruent RT | 484 | 475 | 456 |
Please note that mean before outlier removal \(>\) mean after outlier removal \(>\) median. This is a typical finding. RT distributions tend to be positively skewed (i.e., skewed towards larger values). The median is insensitive to these values. The mean after outlier removal will be influenced by some of the slower RTs (the ones that were not removed as outliers), and thus it will typically be greater than the median. The mean without outlier removal is influenced by all data points, so it will typically have the largest value.
43.3 Evaluation
After completing these steps we have the mean RTs and accuracies for one participant. In terms of the data file for SPSS, we have completed one row. Clearly this is not the most efficient way to do this. It is slow and writing formulas in Excel tends to be non-intuitive thanks to the awkward Excel syntax. We could have used R, but at present the School has asked us not to use R (which is a bit ironic given that I have written the HHG using R Markdown). Alternatively, we could have used a Python tool I have written, but that requires the installation of additional software.
On the plus side, as mentioned previously, knowing how to use Excel formulas is a really useful skill! Please note though that you will not need to remember these formulas for any of the upcoming assessments. There will also be no time limit for the upcoming assessments involving Excel. Therefore, you will have sufficient time to look these formulas up.
43.4 Preprocessing Shiny app
Here is an interactive Shiny app visualising the effects of preprocessing choices on the mean and SD.