# Chapter 42 Data preprocessing with Excel

In this chapter, we will show you how to apply these preprocessing steps in Excel for one participant. That is, for this participant, we will:

- Remove extreme RTs
- Remove incorrect trials
- Remove outlier RTs

Then, we will calculate measures of central tendency for this participant. Please note that not all researchers will apply all of these steps. However, you should be aware of the possible processing steps and know how to implement them using Excel. In addition, we will also calculate accuracies.

The processing steps, explained in detail below, are the following:

- Step 1: Convert reaction times to milliseconds
- Step 2: Calculating overall accuracy
- Step 3: Removing trials with extreme RTs
- Step 4: Calculating condition-specific accuracies
- Step 5: Calculating condition-specific mean RTs (before outlier removal)
- Step 6: Calculating SDs and thresholds for outlier removal
- Step 7: Calculating condition-specific mean RTs (after outlier removal)

Please note that there are other ways to achieve the same aims in Excel. We have opted for an approach that is relatively verbose and keeps the formulas mostly simple.

Please download the partially worked example. This is an output file from a single participant who previously completed a letter flanker task (note that this experiment also included a flanker distance manipulation which we will ignore for our analysis).

Please note the following modifications to the original output file:

- The file was converted to an Excel
`.xlsx`

file. - A few columns not relevant for our analysis were removed.
- Information from the practice trials was removed.
- Columns that were part of the original output file were highlighted in green.
- Columns added to the output file were highlighted in yellow.
- Analysis results are in cells highlighted in blue.

The following short video demonstrates how to adjust column widths, remove practice trials, and remove unnecessary columns (note that the video has no sound).

The most relevant columns for our purposes are:

`congruency`

: whether the trial was congruent (e.g., “HHHHH”) or incongruent (e.g., “SSHSS”)`response.corr`

: the accuracy of the response (0 = incorrect, 1 = correct)`response.rt`

: the response time (in seconds)

In the example file, we have already done all the calculations for the congruent trials.

Your task will be to do the calculations for the incongruent trials while you read this chapter. Note that the partially worked example already includes all the columns you need. You will only need to add the formulas.

**Hint**

Note that you don’t need to reinvent the wheel. Copy and paste the formulas from the congruent trials and adapt them for the incongruent trials. How to copy and paste Excel formulas was described in Section 39.3.

## Step 1: Converting reaction times to milliseconds

**Aim**: Convert RTs to ms

**Why do we do this?**: This is done for convenience only. Most of us find it easier to use integers as opposed to decimals. Also, the numbers are shorter when using ms (e.g., 527 ms vs. 0.527 s).

We create a new column `RTms`

and add the following formula to the first cell:

`=H2*1000`

That is, we simply multiply the RT in seconds by 1,000 to obtain milliseconds. This formula is then copied downwards by dragging the fill handle or by double-clicking on the fill handle.

## Step 2: Calculating the overall accuracy

**Aim**: Calculate overall accuracy, taking into account all trials

**Why do we do this?**: Sanity check of the data. If the overall accuracy is close to chance, you should probably exclude this participant.

What is chance performance? For a task with two response alternatives, chance performance would be 50%. That is, if you were to press keys randomly, you would still be correct 50% of the time. For a task with four response alternatives, chance performance would be 25%.

This is how to calculate accuracy:

\[Overall\ accuracy = \frac{Number\ of\ correct\ trials}{Total\ number\ of\ trials}\]

We are going to use `COUNTIF`

and `COUNT`

to calculate the accuracy in Excel. Assuming our accuracy data are in cells `G2`

to `G73`

, this is what the formula should look like:

`=ROUND((COUNTIF(G2:G73, 1)/COUNT(G2:G73))*100,1)`

In words:

- Count all the correct trials in cells
`G2`

to`G73`

(i.e., the ones where the value is 1) - Count all trials in cells
`G2`

to`G73`

- Multiply by 100 to get a percentage (optional)
- Round the result to 1 decimal place (optional)

Without `ROUND`

:

`=(COUNTIF(G2:G73, 1)/COUNT(G2:G73))*100`

Without percentage calculation:

`=COUNTIF(G2:G73, 1)/COUNT(G2:G73)`

## Step 3: Removing trials with extreme RTs

**Aim**: Remove trials that are either extremely fast (anticipatory responses) or extremely slow (attentional lapses or distractions).

**Why do we do this?**: See Section 41.2.

First, let’s remove trials with extreme RTs from our column with the accuracy information. To this aim, we create a column called `accNoExtremes`

. We add the following formula to the first cell and copy it downwards:

`=IF(AND(I2>=150, I2<=3000), G2, "")`

In words:

- If the RT is slower than 150 ms (
`I2>=150`

)*and* - If the RT is faster than 3 seconds (
`I2<=3000`

) - Then copy the accuracy information to the new column

Second, let’s remove trials with extreme RTs from our column with the RT information. To this aim, we create a column called `rtNoExtremes`

. We add the following formula to the first cell and copy it downwards:

`=IF(AND(I2>=150, I2<=3000), I2, "")`

In words:

- If the RT is slower 150 ms (
`I2>=150`

)*and* - If the RT is faster than 3 seconds (
`I2<=3000`

) - Then copy the RT information to the new column

Finally, let’s also calculate the percentage of extreme RTs. In a new cell, add the following formula:

`=ROUND((COUNTIF(J2:J73,"")/ROWS(J2:J73))*100,1)`

In words:

- Count the empty cells in the range J2 to J73 (
`COUNTIF(J2:J73,"")`

) - Divide this number by the number of rows in the range J2 to J73 (
`ROWS(J2:J73)`

) - Optional: multiply by 100 and round to one decimal place

Formula without percentage calculation and rounding:

`=COUNTIF(J2:J73,"")/ROWS(J2:J73)`

## Step 4: Calculating condition-specific accuracies

**Aim**: Calculate the accuracies for all experimental conditions (ignoring extreme RTs)

**Why do we do this?**: We would like to know if our experimental conditions had an influence on accuracy.

We are initially going to create two new columns just for congruent and incongruent trial accuracy, respectively.

Formula for column `conAccNoExtremes`

: `=IF(D2="con", J2, "")`

Using this new column, we are going to calculate the accuracy in congruent trials in a new cell with the following formula:

`=ROUND((COUNTIF(L2:L73, 1)/COUNT(L2:L73))*100, 1)`

In words:

- Count the cells in the range
`L2`

to`L73`

that are equal to 1 (`(COUNTIF(L2:L73, 1)`

) - Count the cells in the range
`L2`

to`L73`

that are any number (i.e., 0 or 1 in our case) (`(COUNT(L2:L73)`

) - Divide the first count by the second count (
`/`

) - Optional: multiply by 100 and round to one decimal place

## Step 5: Calculating condition-specific mean RTs (before outlier removal)

**Aim**: Calculate condition-specific mean RTs after removing incorrect trials

**Why do we do this?**: Ultimately, we would like to find out if our experimental conditions had an influence on RTs. Calculating mean RTs before removing outliers is a necessary step for removing outliers using SDs in a later step. We also remove incorrect trials as there is good evidence that these are typically faster than correct trials.

We are going to create two new columns just for congruent and incongruent RTs, respectively:

Formula for column `conRTNoExtremesCorr`

: `=IF(AND(D2="con", J2=1), K2, "")`

In words:

- If the trial is congruent
*and* - If the trial is correct
- Then copy the RT to the new cell

We then copy the formula downwards.

Using this new column, we are going to calculate the mean RT in congruent trials in a new cell with the following formula:

`=ROUND(AVERAGE(N2:N73), 0)`

In words:

- Average trials in range N2 to N73
- Round to 0 decimal places

## Step 6: Calculating SDs and thresholds for outlier removal

**Aim**: Calculate condition-specific mean RTs after removing outliers based on standard deviations (SDs)

**Why do we do this?**: Ultimately, we would like to find out if our experimental conditions had an influence on RTs. Calculating SDs is a necessary step for SD-based outlier removal in a later step.

Please note: Outlier removal based on SDs is not ideal (because the outliers themselves will affect the SD), but it is easy to implement in Excel, and still frequently applied in published research. Therefore, we will use it, but highlight that the median absolute deviation often is a better choice.^{35}

Frequently, RTs located 2 or more SDs away from the mean are considered outliers (you might also find 2.5 SDs or 3 SDs in the literature). To illustrate this approach: Imagine a participant’s mean RT is 1,000 ms and their SD is 200 ms. If you apply the 2 SDs rule, you would remove all RTs below 600 ms and above 1,400 ms from the analysis.

As a first step, we need to calculate the SDs of our two trial types. We use the following formula for congruent trials:

`=ROUND(STDEV.S(N2:N73), 0)`

We then determine the lower and upper thresholds for outlier rejection for both conditions. The basic idea is:

- Lower threshold:
`Mean - 2 SDs`

- Upper threshold:
`Mean + 2 SDs`

It is important to use the correct mean and SD for this. That is, for example, thresholds for congruent trials need to use the congruent mean and the congruent SD. We use the following formulas:

- Lower congruent threshold:
`=U17-2*U21`

- Upper congruent threshold:
`=U17+2*U21`

## Step 7: Calculating condition-specific mean RTs (after outlier removal)

**Aim**: Calculate condition-specific mean RTs after removing outliers based on standard deviations (SDs)

**Why do we do this?**: To find out if our experimental conditions had an influence on RTs

Initially, we’re going to add two more columns, `conRTFinal`

and `inconRTFinal`

.

Then add the following formula for congruent trials and copy it downwards:

`=IF(AND(N2>$U$25, N2<$U$26), N2, "")`

In words:

- If the RT is above the lower threshold (
`N2>$U$25`

)*and*- Note that we use an absolute cell reference here to make sure the reference to cell
`U25`

does not change! (See Section 38.8)

- Note that we use an absolute cell reference here to make sure the reference to cell
- If the RT is below the upper threshold (
`N2<$U$26`

) - Then copy the RT to the new cell

Based on the information in `conRTFinal`

, we can then calculate the final mean RT for congruent trials:

`=ROUND(AVERAGE(P2:P73),0)`

Note that this mean fulfills the following criteria:

- It is not influenced by extreme RTs
- It is not influenced by outlier RTs
- It is not influenced by RTs in incorrect trials

Let’s also count how many congruent trials were removed as outliers:

`=COUNTIF(N2:N73, "<"&U25) + COUNTIF(N2:N73, ">"&U26)`

In words:

- Count all RTs below the threshold
`=COUNTIF(N2:N73, "<"&U25)`

`<`

means less than`&`

means combine (“concatenate”) what comes before and after the`&`

`U25`

is the lower threshold

- Count all RTs above the threshold
- Add both counts (
`+`

)

## 42.1 Calculating medians

In addition, we are going to calculate the medians. The median is insensitive to outliers. Therefore, we don’t need to reject outliers and we will use column `N`

:

The formula for congruent trials: `=ROUND(MEDIAN(N2:N73), 0)`

In words:

- Calculate the median for RTs in the range
`N2:N73`

- Round to 0 decimal places

## 42.2 Comparison of means with and without outlier removal and medians

Let’s compare the different values:

Condition | Mean (in ms) before outlier removal | Mean (in ms) after outlier removal | Median (in ms) |
---|---|---|---|

Congruent RT | 484 | 475 | 456 |

Please note that mean before outlier removal \(>\) mean after outlier removal \(>\) median. This is a typical finding. RT distributions tend to be positively skewed (i.e., skewed towards larger values). The median is insensitive to these values. The mean after outlier removal will be influenced by some of the slower RTs (the ones that were not removed as outliers), and thus it will typically be greater than the median. The mean without outlier removal is influenced by all values, thus, if present, it will be more heavily influenced by outliers.

**Results for incongruent trials**

These are the results for incongruent trials:

- Accuracy: 88.2%
- Mean before outlier removal: 572 ms
- SD: 189 ms
- Mean after outlier removal: 531 ms
- Median: 504 ms

## 42.3 Evaluation

After completing these steps we have the mean RTs and accuracies for one participant. In terms of the data file for SPSS, we have completed one row. Clearly this is not the most efficient way to do this. It’s slow and writing formulas in Excel tends to be non-intuitive thanks to the awkward Excel syntax. We could have used R, but at present the School has asked us not to use R (which is a bit ironic given that I have written these pages using R Markdown). Alternatively, we could have used a Python tool I have written, but that requires the installation of additional software. On the plus side, as mentioned previously, knowing how to use Excel formulas *is* a really useful skill! Please note though that you will not need to remember these formulas for any of the upcoming assessments. There will also be no time limit for the upcoming assessments involving Excel. Therefore, you will have sufficient time to look these formulas up. You will be required to adapt these formulas though.

## 42.4 Preprocessing Shiny app

Here is a Shiny app visualising the effects of preprocessing choices on the mean and SD.

### References

*Frontiers in Psychology*,

*12*, e675558. https://doi.org/10.3389/fpsyg.2021.675558

That said, there is also evidence from simulated data that under certain conditions, outlier rejection based on 2 SDs is actually the best approach (Berger & Kiefer, 2021).↩︎