Worksheets for Sensory Difference Testing

John-Paul Hosom
alchemyoverlord@yahoo.com
March 16, 2021
Version 1.0.2

[Page controls: show text and worksheets, worksheets only, or final worksheet only; random number seed]

1. Overview

The purpose of this web page is to explain and illustrate mathematical terms and processes for two types of sensory difference tests, "triangle" and "3-alternative forced choice" (3-AFC). (A "sensory difference test" is also called a "perceptual difference test" or a "sensory discrimination test".) The topics covered here are not new, but I've tried to organize things into a beginning-to-end tutorial focused only on sensory testing.

To help explain things, this tutorial is set up as a series of interactive worksheets, with graphs that illustrate the effect of different parameter values. You can enter different values for some parameters and see in the graphs how the statistics change. You can also run simulations, which generate test data according to a model and analyze the results. If you use a positive simulation delay, each test case will be generated, plotted, and evaluated in turn, with the delay controlling how long each result is displayed. If you use a negative simulation delay, the per-sample graphics will be omitted, and the simulation will show only the summary results. The simulations depend on pseudo-random numbers; if you want to see the effect of changing the starting point for these pseudo-random numbers, change the value of "random number seed" (above) to any value you like between 0 and 10000.

Disclaimer: I am not a statistician. While I am very familiar with a number of concepts and methods from statistics, having taught a small subset of statistical concepts for 10 years at the graduate level, the area of statistical significance testing sometimes seems like a game of Fizzbin: just when you think you understand the rules, there's another exception or special case to consider. There has also been a vocal debate in many research communities about the best methods for significance testing and evaluating results. This tutorial certainly won't settle any of that debate, and I'm not at all interested in debating points on which reasonable people may disagree. This page is simply my attempt to explain what I have learned about evaluating sensory difference tests in a (hopefully) logical and intuitive manner.

Although this tutorial discusses probabilities, the math used here is limited to addition, subtraction, multiplication, division, squaring, and the square root. A variable x when squared is notated as x^2, and the square root of x is notated as x^0.5. I use a number of variables to refer to things, and I sometimes use long variable names. I think that long variable names, such as "RL(EE)", are easier to remember as meaning "the region of likely experimental error" than a shorter variable name such as "r". Longer names do not imply greater complexity; they are intended to be helpful. All variable names are, I hope, adequately explained below.

2. Objective Stimuli

In sensory difference tests, we have two stimuli that will be evaluated. The stimuli might be two tones of different frequencies, two shades of color, or two samples of the same food with different levels of added sugar. The question we have is fairly straightforward: "Is there a perceptual difference between the two stimuli, or are they perceived as being the same?" The answer to this question is not straightforward, so we'll begin at the beginning by talking about the two (objective) stimuli.

Focusing our example on the case of two foods with different levels of sugar, we'll call one of these samples "A" and the other one "B". The A sample might have more sugar than B, or vice versa; we might know that A has more sugar than B, or we might not know. We will assume that A and B are consistently produced, and therefore that, relative to B, A always has the same amount, more, or less of some physical characteristic of interest.

If we know, for example, that B has 200 mg more sugar than A, then this is a "known difference" case. In this case, we can plot samples of A and B on a graph and label the X axis as "milligrams of sugar". We can also define this known difference (200 mg) as one "objective unit", and by this definition A and B are one unit apart.

In another case, which we can call the "unknown difference" case, let's say that we have the same food produced by two manufacturers, and we suspect that one is sweeter than the other, but we don't know how much sugar or other sweetener is used in each. In this second case, we can label the X axis as "sweetener concentration", but we don't know what units to use. The actual concentration of ingredients isn't important, however, just the perceived difference (or lack thereof). Without knowing what units to use, we can still say that if A and B are physically different, they differ by one unit. We will never know what this unit is physically measuring, but it doesn't really matter. We also don't know in advance if A is one unit greater than B or vice versa, but we do know that this difference will be consistent: A will always be one unit greater than B, or B will always be one unit greater than A.

For simplicity, we will always plot A on the left and B on the right, implying that B has more of the physical characteristic(s) than A. If this is not true, then the locations of A and B on the plots can be reversed; the conclusion of a perceptual difference (or lack thereof) will be the same. Because of our definitions, we plot A and B as being one unit apart, whether this is a case of a "known difference" or an "unknown difference".

One sample each of A and B are plotted in Figure 1. This is a simple graph, with each line at a height of 1 (for one sample) and an X-axis difference between A and B of one objective unit. The sample of A is plotted at an arbitrary X-axis value of 0; subtraction can be used to map all X-axis values so that A is always located at 0 regardless of the original scale (e.g. 200-mg units of sugar). The sample of B is therefore plotted at an X-axis value of 1 unit.


Figure 1: A plot of two stimuli on the objective-unit scale.

3. Internal Sensory Measurement

Now we'll switch from the objective (physical) stimuli to our perception of these stimuli. What units or metrics should we use to measure this perception (such as sweetness)? There will be some mapping from the physical stimulus (e.g. concentration of sugar) to the neural response to this stimulus. Humans have evolved to be very sensitive to some stimuli and rather insensitive to others. (In the case of bitterness, there is a very wide range of sensitivity levels between people, with some people being very sensitive (so-called "super tasters") and others not so sensitive.) Since we're using a generic term "objective units" to describe the physical stimuli, we can use a similarly generic term, "perceptual units," to describe the sensory reaction to the stimuli. It doesn't matter if these perceptual units are in neural impulses per second or some other scale, as long as there is a direct (but possibly nonlinear) mapping between objective units and perceptual units. For simplicity, we'll say that this direct mapping translates objective units into perceptual units on the same scale, and so A and B are still one unit apart.

When a person is given a sample (or stimulus) for evaluation, there are a number of factors that affect their perception of the sample. This variation in perception may be caused by general random variation in neural responses, memory and adaptation effects, an unintended priming effect, or (for between-subjects studies) differences in people's sensitivity. Because of these factors, when we give subjects multiple samples of A, the perceived sweetness won't always be exactly at 0.0 perceptual units; sometimes it will be higher, and sometimes lower. On average, though, A will have 0.0 perceptual units and B will have 1.0 perceptual units, and we say that the mean of A is 0.0 and the mean of B is 1.0.

This is a good place to introduce probabilities, since probabilities allow us to talk about uncertain information such as perceived sweetness. There are different ways to define probability, but the definition I'd like to use is very practical: a probability is the relative frequency with which we expect an event to occur. Probabilities can be determined in one of two ways: by observation or by a model. Let's consider the example of a coin toss. I can flip a coin 100 times and collect data on how often I get heads and how often I get tails. Let's say that I get 48 heads and 52 tails. Based on these observations, I can estimate that the probability of heads (my expectation for getting heads in the future) is 48/100 or 0.48. Or, I can use a model that incorporates what I know about coin tosses and never use any data at all. In this case, knowing that there are two possibilities and no reason to favor one over the other, the probability of heads is 1/2 or 0.50. If I'm guessing the month you were born, a simple formula says that I have a probability of 1/12 (or 0.0833) of getting the correct answer. (A more complicated formula might account for the different number of days in each month.) Either way, we can refer to the probability of some event x as p(x). In this example, p(heads) is either 0.48 or 0.50, depending on which method we use.

For the type of model we'll talk about in this tutorial (namely a Thurstonian model), we'll also assume that the perceived sweetness varies according to a normal distribution (also called a bell curve), and so the probability of getting a certain level of perceived sweetness can be described using the standard deviation of the perceived values, called sigma (represented with the Greek letter σ or the letter s). The standard deviation is a measure of the amount of variation in a set of values. For simplicity, we'll also assume that the standard deviations for the perception of A and B are the same. (A normal distribution is commonly found in nature. If an event in nature (or perception) is the result of many independent processes, the central limit theorem says that the resulting event will have a normal distribution. In the case of bitterness, it would be better to model the perception as (at least) bimodal, or the summation of two (or more) normal distributions, reflecting both the variance within an individual and the different perceptual-unit means for "super-tasters" and less sensitive subjects. We can make things quite complicated, but for this explanation we'll try to keep things relatively simple.)

If we know the standard deviation of perceived sweetness, we can plot the relative probability of a certain level of perceived sweetness when a subject is given sample A, and the relative probability of perceived sweetness given sample B. In mathematical notation, these probability functions are called p(s | A) and p(s | B), where p(x | Y) is the probability of some value x "given" some condition(s) Y and s is the perceived stimulus level. (The "|" or "given" notation in probabilities can be thought of as first constraining the world of possibilities so that Y is considered to be true. This conditional probability function tells us the probability of observing x "under the condition" that Y is true. For example, p(sunshine | summer) is a large value in most places in the world, whether it is currently summer or not.) The relative probability of sweetness is described using the normal distribution with mean value μ (using the Greek letter "mu") and standard deviation σ. Therefore, we can say p(s | A) = N(μA, σA^2) and p(s | B) = N(μB, σB^2) where N() is the normal distribution described by a mean μ and standard deviation σ. We've already defined μA as 0.0 and μB as 1.0, and σA has been defined as equal to σB, so the only thing we need to decide on is the value of σ. (I use the term "relative probability" to avoid a longer discussion about probabilities and probability density functions.)

While we believe this model of perception to be true, we can't measure it directly. We can only measure cognitively higher-level responses to stimuli, and humans aren't very good at providing magnitude estimates of their perceptions. We're much better at communicating relative perceptions. This model is therefore called a latent (or "hidden") model, because it can't be observed directly.

An important concept in this model is that of d' (d prime), or the sensitivity index. Intuitively, d' is a measurement of how perceptually different two stimuli are. If two stimuli are perceptually indistinguishable, the d' value is 0. If two stimuli are easily distinguished, the value of d' may be 4 or larger. Mathematically, d' is defined as the difference between the mean perceptual units of the two stimuli divided by the standard deviation of the perception of these stimuli. We have conveniently defined the difference between the means as always 1.0, and so in our case d' is 1.0/σ. (Cohen's d effect size is the same as d', in case you're familiar with effect size.)

This leads to the concept of a "just noticeable difference", or JND. The JND is the threshold at which a stimulus or change is detected half the time. This forms an often-used threshold for determining whether two stimuli are the same or different. In a two-alternative forced-choice (2-AFC) task, if the two stimuli are identical then performance will be 50% (reflecting random guessing). The JND is at 75% correct in a 2-AFC task, halfway in between random guessing (50%) and perfect identification (100%). A 75% correct response rate on a 2-AFC task corresponds with a d' value of 0.954, using a formula for this mapping. It is common to round d' up to 1.0, corresponding to 76% correct on the 2-AFC task.
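
If you want to check these numbers yourself, here is a minimal sketch in Python (not the code used by this page), assuming the standard signal-detection mapping for 2-AFC, p2AFC = Φ(d'/√2), which I believe is the formula referred to above:

    from math import sqrt
    from scipy.stats import norm

    # assumed mapping between d' and 2-AFC proportion correct: p2AFC = Phi(d' / sqrt(2))
    def p2afc_from_dprime(d_prime):
        return norm.cdf(d_prime / sqrt(2.0))

    def dprime_from_p2afc(p2afc):
        return sqrt(2.0) * norm.ppf(p2afc)

    print(dprime_from_p2afc(0.75))     # ~0.954: the JND expressed as d'
    print(p2afc_from_dprime(1.0))      # ~0.76: 2-AFC proportion correct at d' = 1.0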

By defining "perceptual difference" as a "just noticeable difference" or greater, we can now re-state the original question, "is there a perceptual difference between the two stimuli", in a more quantitative way: "is the d' value greater than (or equal to) 1?" The rest of these worksheets are designed to help answer this question about the value of d'. If we can use test results to conclude that d' is probably greater than or equal to 1.0, that would be ideal. If we can use test results to conclude that d' is probably greater than 0, that would at least imply that the two stimuli are probably not perceptually identical.

Worksheet 1 (below) plots the relative probability of the perception of a stimulus (such as sweetness) when a person is given samples of A, and the relative probability of the perception of that stimulus given samples of B. This relative probability is taken from the normal distribution with standard deviation computed from the specified d'. The relative probability of each is maximum at the perceptual-unit mean of these samples (0.0 for A, 1.0 for B). The more overlap there is between these probability functions for A and B, the more difficult it is to distinguish one from the other. With a d' of 10.0, or a standard deviation of 0.1 perceptual units, there is effectively no overlap between A and B, and one can easily perceive the difference between them. With a d' of 0.10, or a standard deviation of 10.0, there is almost no perceptual difference between A and B; even though they are objectively different, they are perceived as being (practically) the same.

In the entry next to "d' =" on the left of Worksheet 1, you can enter different values of d' and see (a) how much overlap there is between the perception of the two stimuli, (b) what the standard deviation (σ) is, (c) what the probability of discrimination, pd (also called the proportion of distinguishers), is, with a value of 0% for perceptually-identical stimuli and 100% for easily-distinguished stimuli, and (d) what the probability of correctly identifying A from B would be in a 2-AFC test, called p2AFC. (pd is defined as (p2AFC − pg) / (1 − pg), where pg is the probability of a correct response from guessing, or 0.50 in this case.)
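
As a small illustration of these quantities, the sketch below (reusing p2afc_from_dprime() from the previous sketch; the function names are mine, not the worksheet's internals) computes σ, p2AFC, and pd for a given d':

    def proportion_of_distinguishers(p2afc, p_guess=0.5):
        # pd = (p2AFC - pg) / (1 - pg), with pg = 0.5 for a 2-AFC test
        return (p2afc - p_guess) / (1.0 - p_guess)

    d_prime = 1.0
    sigma = 1.0 / d_prime                  # standard deviation in perceptual units
    p2afc = p2afc_from_dprime(d_prime)     # from the previous sketch
    pd = proportion_of_distinguishers(p2afc)
    print(sigma, p2afc, pd)                # e.g. 1.0, ~0.76, ~0.52 for d' = 1.0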

In Worksheet 1 there is also an option to simulate a 2-AFC test. You can specify the number of tests in this simulation and the simulation delay. The delay specifies how long the result of one test is displayed on the worksheet. When you start the simulation, the specified number of tests will begin. In each test, two points will be chosen, one from stimulus A and one from B. The points, or values of A and B in perceptual units, will be chosen with a probability specified by the displayed normal distribution. Most of the time, the selected point will be fairly close to the mean of that stimulus, but sometimes a point will be far from its mean. The point for A will be displayed with a blue square, and the point for B will be displayed with a green circle. The simulation's test subject is then asked to choose which stimulus is greater in perceptual units, and this (virtual) subject responds according to the perceived magnitude of the stimuli. If the perception of B is greater than that of A and therefore the subject responds correctly, then the shapes are filled in with that stimulus' color. If the perception of B is less than that of A and the subject responds incorrectly, then the shapes are filled in with red. You can pause the simulation by clicking the "PAUSE" button, and resume it using the "CONT." ("continue") button. A tally is kept of what percent of the time a response is correct or incorrect.
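
The following sketch mimics this kind of 2-AFC simulation (without the graphics); the use of numpy and the variable names are my own choices, not the page's actual implementation:

    import numpy as np

    rng = np.random.default_rng(1)              # plays the role of the "random number seed"

    def simulate_2afc(d_prime, n_tests=1000):
        sigma = 1.0 / d_prime                   # sigma = 1 / d' (Section 3)
        a = rng.normal(0.0, sigma, n_tests)     # perceived values when given stimulus A
        b = rng.normal(1.0, sigma, n_tests)     # perceived values when given stimulus B
        correct = b > a                         # the subject picks the larger perceived value
        return 100.0 * correct.mean()           # percent correct over all simulated tests

    print(simulate_2afc(d_prime=1.0))           # approaches the model-correct value (~76%) as n_tests grows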

The probability of correctly identifying A from B, p2AFC, will also be called the "model correct" or "theoretical correct" value. For the 2-AFC task, the formula can be found in a paper by R. H. B. Christensen, Equation 6. As the number of simulation tests increases (e.g. from 10 to 100 to 1000), the percent correct from the simulation becomes closer to the model-correct value.

[Interactive worksheet: inputs for d', the number of simulation tests, and the simulation delay; outputs for σ (in perceptual units), pd, the model-correct value, and the simulation-correct value of a 2-AFC test.]

Worksheet 1: The d' Parameter and a Two-Alternative Forced Choice (2-AFC) Test


4. Sensory Difference Testing

Having established a (hidden) model for the perception of physical stimuli and introduced the concept of d', we'll now discuss sensory difference testing. In such testing, we ask test subjects questions about samples of the stimuli. The test is designed in such a way that we hope to (a) estimate d' and/or (b) determine if the stimuli are (probably) perceptually different. The number of test subjects in one trial (or test) is called N.

This tutorial describes the case where each test sample is presented to a different test subject, which is called a "between-subjects" design. It is also possible to present each test sample to the same subject, which is called a "within-subjects" design. A within-subjects design can have lower perceptual variation (σ in Section 3) and therefore a larger value of d', but the results will only be valid for that one person. In the case of a test of bitterness, for example, one-third of the population might be "super tasters" and have a very large d', one-third might be "average tasters" and have a d' right around 1.0, and one-third might be "non-tasters" and have a d' close to zero. A test of this combined population might then have a d' below 1.0. Knowing (or estimating) this variation in the population may assist in coming to the correct conclusion for a test.

The focus here is on two types of tests, the triangle test and the three-alternative forced-choice (3-AFC) test. The triangle test is commonly used when you have an unknown difference between the stimuli, as described in Section 2. The 3-AFC test is commonly used when you have a known difference and the test subjects can rate the stimuli along that known dimension (e.g. sweetness). There are a number of other types of sensory difference tests not discussed here, including duo-trio, paired comparison, ABX, two-out-of-five, and same-different.

Worksheet 2 (below) shows the same probabilities of perceptual units from the two stimuli A and B (as in Worksheet 1), but it allows the selection of either a triangle test or a 3-AFC sensory difference test.

In a 3-AFC test, each subject is given two samples of A and one sample of B, and is asked to pick the one with the largest value of an attribute (e.g. the highest level of sweetness). Because B is known to have the largest value of this attribute, the subject is effectively asked to pick stimulus B from the three choices. This comparison is performed by N subjects, yielding N correct or incorrect responses per test. Note that in order to perform this test, the experimenter must know that B has more of the attribute than A, making it a "known differences" test.

In a triangle test, three samples are selected for each subject from the two possibilities of A or B. Half of the time there should be two samples of A and one of B, and the other half of the time these conditions should be reversed. Each subject is asked to select the one stimulus that is most different from the other two. In our perceptual model, that means comparing the perceptual distance between each pair of samples and choosing the sample with the largest distance from the other two. This test can be performed whether or not the differences between A and B are known to the experimenter.

Note that in both types of tests, a subject can guess and still be correct one out of three times. Therefore, with a large enough number of subjects, if d' equals zero then the performance on either of these tests will be 33%. (It is important to control the tests so that the subjects can't use other information to guide their decision. For example, if you're testing two types of cookies for perceived sweetness, you don't want one cookie to have sugar crystals on top and the other to be plain.)

Worksheet 2 shows, on the right-hand side, the percent correct that would be obtained from the specified test given the d' from Worksheet 1 and a large enough (or infinite) number of subjects. Again, this is called the "model correct" value. The formulas for the model-correct value from the 3-AFC and triangle test can be found in the paper by Christensen in Equations 5 and 7, respectively. It can be seen that for any non-zero value of d' the 3-AFC test has a larger model-correct value than the triangle test, until both tests reach 100% accuracy. This is because of the perceptual overlap between A and B; the one sample that is different is more likely to be an extreme value than it is to be the sample with the largest distance to the other two. (H. S. Lee and M. O'Mahony have a more detailed and better explanation of the differences between these tests.)
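
To make the two decision rules concrete, here is a rough Monte Carlo sketch (my own simplified version, not the worksheet's code) that simulates one AAB presentation per subject and estimates the percent correct for each test type; with enough simulated subjects, the estimates should approach the model-correct values described above:

    import numpy as np

    rng = np.random.default_rng(1)

    def percent_correct(test_type, d_prime, n_subjects=100_000):
        sigma = 1.0 / d_prime                   # sigma = 1 / d' (Section 3)
        n_correct = 0
        for _ in range(n_subjects):
            # two samples of A (mean 0.0) and one of B (mean 1.0), in perceptual units;
            # for the triangle test, the ABB arrangement gives the same result by symmetry
            perceived = rng.normal([0.0, 0.0, 1.0], sigma)
            if test_type == "3-AFC":
                # pick the sample with the largest perceived value
                choice = int(np.argmax(perceived))
            else:                               # triangle
                # pick the sample with the largest summed distance to the other two
                distances = np.abs(perceived[:, None] - perceived[None, :]).sum(axis=1)
                choice = int(np.argmax(distances))
            n_correct += (choice == 2)          # index 2 is the B ("odd") sample
        return 100.0 * n_correct / n_subjects

    for test_type in ("triangle", "3-AFC"):
        print(test_type, percent_correct(test_type, d_prime=1.0))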

In Worksheet 2, just below the test type, you can enter the number of subjects for a simulation, as well as the simulation delay. As with Worksheet 1, when you start the simulation one test is performed for each subject. The three test samples are selected from the two stimuli according to the specified probability distributions. For each subject's test, the three samples are plotted with a blue square for A and a green circle for B. The three stimuli are then classified according to the criteria of the test (largest distance or maximum value). If the classification is correct, the shapes are filled in with that stimulus' color. If the classification is incorrect, the shapes are filled in with red. A tally is kept of what percent of the time a response is correct or incorrect. The result of this simulated perceptual test is reported as the "simulation correct" number at the bottom right of Worksheet 2. As the number of subjects increases, the simulation-correct value approaches that of the model-correct value.

This may be stating the obvious, but note that as the number of subjects increases, the variation of the perceptual stimuli in Worksheet 2 doesn't change. The perceptual variation is the same, and d' is the same, regardless of the number of subjects in the test.

[Interactive worksheet: inputs for d', the test type (triangle or 3-AFC), the number of subjects (N), and the simulation delay; outputs for σ (in perceptual units), the model-correct value, and the simulation-correct value.]

Worksheet 2: Simulation of One Trial of a Sensory Difference Test with N Subjects


5. Multiple Trials

We are generally concerned with a single trial of a sensory difference test that has N subjects. We (a) prepare by determining the type of test, number of subjects, etc., (b) conduct the individual tests on each subject, and (c) analyze and interpret the result from this single trial. Once that's done, we have a result and move on to other things. In order to better understand the concepts behind significance testing and how to interpret the result of a trial, it can be helpful to think about running lots of trials. Because we're doing computer simulation, we can easily simulate and visualize many trials, running the same test over and over on a different set of N subjects each time.

The sensory test of one subject yields a binary outcome: a correct or incorrect response. Each test of N subjects yields some number of correct responses, C. The result of one trial can be seen as either this value C or the proportion of correct responses, ptrial (where ptrial = C/N). A series of T trials results in T values of C or ptrial. We'll call the average of all ptrial values from the T trials pavg.

If we divide the model-correct value from Worksheet 2 (above) by 100, we convert that from a percent to the probability of getting a correct response. We'll call this probability from the model p. It can be seen in Worksheet 2 that as the number of subjects N increases, the simulation results become closer to the model-correct value. Regardless of the number of subjects per trial, if we have a large enough number of trials, we expect that the average proportion of correct responses (pavg) will approach the probability of a correct response p. In other words, we can reach the model's probability estimate p by either running a single experiment with a very large number of subjects, or a large number of trials with a smaller number of subjects each. (The trials are independent of each other, so we can think of one large trial TL with NL subjects as equivalent to two smaller trials, T1 (with N1 subjects) and T2 (with N2 subjects), where NL = N1 + N2.) The value of pavg over a very large number of trials should be close to p, no matter what the value of N is.

The values of ptrial will be clustered around an average value (pavg) with some amount of variation. The more subjects there are in a trial, the closer each value of ptrial will be to p, and so the less variation there will be in these values. The fewer subjects per trial, the more that perceptual variation (Section 3) will affect the results, and the larger the variation in results will be. This difference in the variation in results with the number of subjects in a trial is an important concept in the following worksheets: the larger the value of N in our experiment, the more confidence we can have in the result. To put it another way, as N increases, there is a lower probability of a different trial having a different result.

A trial that has a probability of p correct responses from N subjects can be described using a binomial distribution. A binomial distribution is a function, Binomial(x; N, p), which returns the probability of seeing exactly x correct values from the N subjects, with each test having a probability of success p. If we iterate x from 0 to N, we get a set of probability values that sum to 1 (because a trial always generates a result). If the binomial distribution says that we expect 6 out of 10 subjects to get the correct result with probability 0.24 when p equals 0.634 (i.e. Binomial(6; 10, 0.634) = 0.24), then out of 100 trials we expect that on average there will be 24 trials with exactly 6 out of 10 correct responses. The "expected" or most likely value from a binomial distribution is with x equal to N × p. (Although the values of x used by the binomial function are integers, this expected value doesn't have to be an integer, just as the average population of a household can be 2.5 people instead of an integer 2 or 3.)
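
If you want to reproduce numbers like these, scipy's binomial functions can be used directly; a minimal check of the example above:

    from scipy.stats import binom

    N, p = 10, 0.634                  # 10 subjects with p = 0.634, as in the example above
    print(binom.pmf(6, N, p))         # ~0.24: probability of exactly 6 correct out of 10
    print(N * p)                      # 6.34: the expected number of correct responses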

I'd like to get into a little math for those who are interested; if you're not interested in the math, you can skip this paragraph. The variance of a binomial distribution is N × p × (1 − p), and so the variance increases proportionally with the number of subjects, N. The standard deviation, σ, is the square root of the variance. Therefore, the standard deviation of probabilities from a binomial distribution also increases with the number of subjects, but not as quickly as the variance. If one looks not at the number of correct values from a trial (x) but at the proportion or probability of correct results (i.e. the number of correct values divided by the total number of subjects, pbin = x / N), then the most likely value is simply p (instead of N × p). The standard deviation of pbin can be written as σp = (N × p × (1 − p))^0.5 / N, which is equivalent to (p × (1 − p))^0.5 / N^0.5. Therefore, as N increases, the standard deviation of the probability of a correct result decreases. (For example, if N is 4 and the standard deviation of the probability is 0.3, then if N increases to 16 the standard deviation decreases to 0.15.) This conforms with our previous expectation that as N increases, there is a lower probability of another trial having a different result, but the formula describes this expectation in a precise way.
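
A small sketch of the σp formula, showing the 1/N^0.5 behavior described above:

    import math

    def sigma_p(N, p):
        # standard deviation of the proportion correct: (p * (1 - p))**0.5 / N**0.5
        return math.sqrt(p * (1.0 - p) / N)

    # quadrupling the number of subjects halves the standard deviation of the proportion
    print(sigma_p(4, 0.634), sigma_p(16, 0.634))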

We can plot a binomial distribution and the simulation of many trials using bar graphs. (See Worksheet 3, below, with N from Worksheet 2 and p based on the "model correct" value from Worksheet 2.) The X axis is the number of subjects in one trial with a correct result (x for the binomial distribution or C for the simulation). Also on the X axis we can show the probability of a correct result (pbin for the binomial distribution) or the proportion of correct results (ptrial for the simulation). For the binomial distribution, the Y axis is the probability of a trial having a particular X-axis value. For the simulation, the Y axis is the relative frequency of values of C or ptrial. In other words, the Y axis is the probability or relative frequency of exactly x or C subjects having a correct result.

We can simulate many trials and see how the values of C cluster around a central value (pavg). The result of each trial is plotted in Worksheet 3 in blue with cross-hatched lines. As the number of trials increases, pavg approaches p from the binomial distribution. Because the binomial distribution provides a good description of this type of trial, then with a large number of trials the relative frequency of C values from each trial should have a distribution close to that of the probabilities from a binomial distribution.

The "binomial" numbers on the right side of Worksheet 3 show some values predicted by a binomial distribution with parameters N and p. The first value ("expect") is the number of subjects we expect to provide a correct response, on average. This value is N × p. The second value ("p") is the probability of a correct response, which is the percent correct predicted by the model in Worksheet 2 divided by 100. The third value ("σp") is the standard deviation of values of pbin.

The "simulation" numbers on the right side of Worksheet 3 show the values obtained from the simulation of T trials. The first value ("avg. corr.") is the number of subjects who get a correct result, on average. The second value ("pavg") is the average proportion of correct responses (i.e. the average number of subjects with a correct response divided by the number of subjects per trial). The third value ("s") is the standard deviation of the values of ptrial. As the number of trials increases, the values from this simulation become closer to the values predicted by the binomial distribution.

[Interactive worksheet: d', the test type, and N are carried over from Worksheet 2; inputs for the number of simulation trials (T) and the simulation delay. Binomial outputs: expected number correct, p, and σp. Simulation outputs: average number correct, pavg, and s.]

Worksheet 3: Bar graph of multiple trials for the specified d', type of test, and number of subjects.


6. Significance Testing: H0

Referring back to Section 2 for a moment, recall that our goal is to determine whether the two stimuli are perceptually different or not. The easiest way to show that they are different, mathematically, is to show that they are not the same. Why is it easier? Because a hypothesis that the stimuli are perceptually the same (which we hope to disprove) requires no knowledge about all the ways in which they might be different. We'll call this hypothesis H0, or the null hypothesis. For H0, we only need to know the number of subjects in order to create a binomial distribution; the probability of a correct response is the same as random guessing, or 0.333. We don't need to figure out (or estimate) d'; under this hypothesis, d' is zero.

A hypothesis like H0 is tested in terms of probabilities. We perform one trial of an experiment and measure some result, C. We can then find the probability of observing this result using the binomial distribution under the assumption that this hypothesis is true, i.e. Binomial(C; N, p). For H0, this becomes Binomial(C; N, 0.333). (We can express this probability in mathematical notation as p(X=C | H0), or the probability of a result X being equal to the observed value C, given H0). We want to know the probability of observing at least this result, in other words a value of C or larger (i.e. p(XC | H0)), and so we compute probabilities from the binomial distribution for all values from C to N and sum them up. This is called the "p value" of the experiment. If this p value, or the probability of getting a result of at least C when H0 is true, is less than some threshold, then we conclude that this result is sufficiently improbable given the assumption that H0 is true, and therefore we reject H0. If we reject H0 (in which d' equals 0), then we accept the hypothesis that d' is not zero, and we therefore conclude that there is a perceptual difference. A test that rejects H0 is said to have demonstrated significance. This type of hypothesis testing is referred to as "null hypothesis significance testing", or NHST.
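
As a concrete sketch of this computation (using scipy's binomial survival function to do the summation; the example numbers are the 60-subject case discussed later in this section, and the function name is mine):

    from scipy.stats import binom

    def p_value_H0(C, N, p_guess=1.0/3.0):
        # p(X >= C | H0): the sum of Binomial(x; N, 1/3) for x = C .. N
        return binom.sf(C - 1, N, p_guess)

    N, C, alpha_H0 = 60, 27, 0.05
    p_val = p_value_H0(C, N)
    print(p_val, "reject H0" if p_val <= alpha_H0 else "no conclusion about H0")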

One complication with NHST is that if we don't reject H0, then we can't say anything at all about the two stimuli. Not rejecting H0 doesn't lead us to accept H0; it simply leaves us with no conclusion (d' may or may not be zero). The higher the bar for disproving H0, the more likely it is that we won't be able to conclude anything at all about an experimental result.

The way that we've defined H0 above (that the two stimuli are the same, or that d' equals zero) is kind of a classic, tried-and-true null hypothesis that is very commonly used. However, we don't need to define H0 as the case in which the stimuli are the same. In Section 9 we will consider a different definition of H0. For now, though, we'll stick with the classics.

The threshold for rejecting H0 is called alpha: the probability of rejecting the null hypothesis when it is in fact true. I'll use the subscript H0 to be clear that this is the threshold for rejecting H0, i.e. alphaH0. A commonly used value for alphaH0 is 0.05, meaning that H0 will be (incorrectly) rejected 5% of the time when H0 is true. (The value of 0.05 is common but other values can be used, depending on the purpose of the test.) Note that (by definition) when alphaH0 is 0.05, in 1 out of every 20 experiments in which H0 is true, we will incorrectly conclude that there is a perceptual difference between the stimuli.

By the way, I try to avoid phrases such as "the effect is significant" because the result of a significance test depends on the combination of the hypothesis, the data, and how the test was conducted. Significance testing does not provide an objective truth or even a claim about the hypothesis, but a claim about the outcome of a testing process. Changing the process can change whether the result is significant or not, even with the same hypothesis. If a test demonstrates significance, that doesn't prove anything about the hypothesis; in fact, with the standard alphaH0 at 0.05, H0 will be incorrectly rejected on average 5% of the time. With a larger number of subjects it becomes easier to reject the null hypothesis, to the point that there may be no practical difference between the perception of the two stimuli, but we may still reject H0 (e.g. when d' is 0.05). Therefore, I prefer to say that a test "demonstrates significance". What significance implies about the perception of the two stimuli needs to be evaluated within the context of the test.

Worksheet 4 (below) shows the binomial distribution for H0 with d'=0. No matter how many subjects there are, the probability of X subjects having a correct result is maximum when X is one-third of N, or when the probability of any one subject having a correct result is 0.333. This reflects the fact that the most likely result of randomly guessing one of three options is to be right one-third of the time. (Although it is still quite possible to guess the correct answer more or less of the time, it is not as likely.) As the number of subjects increases, the relative variation in this distribution decreases. Therefore, with a larger number of subjects, it is more likely that the result will be close to one-third of N.

You can specify the value of alphaH0 in this worksheet. The value of alphaH0 determines the threshold for the number of subjects that must have a correct result in order to reject H0. If the number of subjects with a correct result is less than this threshold, the probability of this result (given H0) is shown in light red. If the number of subjects with a correct result is greater than or equal to this threshold, the probability is shown in dark red. The sum of all of the dark-red probabilities is, by definition, less than or equal to alphaH0. If you get a result anywhere in the dark-red region, H0 can be rejected. The light-red probabilities are therefore labeled "H0 inconclusive", and the dark-red probabilities are labeled "H0 reject". An orange vertical dashed line shows the boundary between inconclusive H0 and rejecting H0. Just above this line the symbol αH0 indicates that this is the alphaH0 threshold for rejection.

Note that the region in dark red is only for the higher number of correct responses in the binomial probability distribution. Rejecting H0 only when the result is in the high end of the probability distribution is called a one-tailed test. In some cases, it's important to have the region of rejection at both ends of the probability distribution, which is called a two-tailed test. Whether one should use a one-tailed or two-tailed test depends on what the null hypothesis is and what it means to get a test result at either end of the probability distribution.

For sensory difference testing, we interpret a result at the high end of the probability distribution to mean that d' is unlikely to be zero. But what does a result at the low end of the probability distribution mean? If our two stimuli A and B are assumed to be equal and we conduct a triangle or 3-AFC test with 60 subjects, we expect 20 subjects to guess the correct result, on average. With alphaH0 at 0.05, if 27 subjects get the correct result, we conclude that probably the stimuli aren't really equal after all, and we reject H0. On the flip side, let's say that only 13 subjects get the correct result, which should happen about 2% of the time if our model is a good description of our experiment. If I got such a result, I would suspect that there might be something wrong in the experimental setup, and I would spend time verifying that there were no mistakes in how the experiment was conducted. The smaller the number of correct responses, the more time I would spend trying to find an error. If I had zero correct responses out of 60, I would spend a very long time seeing if anything had gone wrong in the way the experiment was conducted. If I found an error, I would fix it and re-run the experiment; if I found no error, I would accept the result and not reject H0. In this case, the purpose of analyzing a result at the lower end of the distribution is not to reject H0, but to verify that the test was properly conducted. I would like to call this region at the low end of the binomial probability distribution the "region of likely experimental error", and notate it as RL(EE). This region is shown on the top left of Worksheet 4 with a horizontal orange line and arrows pointing to the beginning and end of this region.

On the right-hand side of Worksheet 4, under "H0:", there are five values. The first ("expect") is the expected number of subjects who will get a correct result (on average), the second (pH0) is the probability of any one subject getting a correct result (which is always 0.333), and the third value (σH0) is the standard deviation of this binomial probability distribution. The next number ("threshH0") is the threshold for rejecting H0, expressed as the number of subjects. If a trial has C subjects with a correct result, and if C is greater than or equal to threshH0, then H0 is rejected. The final number ("pRejH0") is the probability of rejecting H0 using this threshold. In theory (or with a large enough number of subjects), pRejH0 is equal to alphaH0, but with a small number of subjects, this number may be noticeably less than alphaH0. If the threshold were lowered by a single subject, though, this probability would become greater than alphaH0.
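
The threshold threshH0 and the value pRejH0 can be computed with a short loop; a sketch (again with my own function names, not the worksheet's):

    from scipy.stats import binom

    def reject_threshold(N, alpha, p_guess=1.0/3.0):
        # smallest number of correct responses C with p(X >= C | H0) <= alpha
        for C in range(N + 1):
            if binom.sf(C - 1, N, p_guess) <= alpha:
                return C
        return N + 1                                     # alpha cannot be reached, even with all N correct

    N, alpha_H0 = 60, 0.05
    thresh_H0 = reject_threshold(N, alpha_H0)
    p_rej_H0 = binom.sf(thresh_H0 - 1, N, 1.0/3.0)       # by construction, at most alpha_H0
    print(thresh_H0, p_rej_H0)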

When you start the simulation with the specified number of trials, each trial will be conducted (as in Worksheet 2) with the specified number of subjects and d' equal to zero. The result of each trial is plotted in Worksheet 4 in blue with striped lines. On the right-hand side of Worksheet 4, under "H0 simulation:", there are four numbers. The first ("avg. corr.") is the average number of correct responses in this simulation, the second ("pavgH0") is the average proportion of correct responses (i.e. the average number of correct responses divided by N) given H0, and the third number ("sH0") is the standard deviation of the results from these simulated trials. The fourth number, on the bottom, ("reject") shows how often the null hypothesis was rejected even when it was true. With a large enough number of trials, "avg. corr." should be close to "expect", pavgH0 should be close to pH0, sH0 should be close to σH0, and "reject", when divided by 100, should be close to pRejH0.

[Interactive worksheet: d' = 0 for H0, with the test type and N carried over; inputs for alphaH0, the number of trials (T), and the simulation delay. H0 outputs: expected number correct, pH0, σH0, threshH0, and pRejH0. H0 simulation outputs: average number correct, pavgH0, sH0, and the percent of trials in which H0 was rejected.]

Worksheet 4: Bar graph for probability of H0 (stimuli A and B perceptually identical)


7. Significance Testing: H0 and H1

Using the null hypothesis, H0, we can easily evaluate if a test result C is so large that it's very improbable that the two stimuli are perceptually the same. If we consider an alternative hypothesis, called H1, we can obtain an estimate of how much risk we take by doing the test as planned. We know (from Section 6) that if H0 cannot be rejected based on some result, then we can come to no conclusion about the hypothesis (we can neither accept nor reject H0). Having H1 lets us potentially avoid spending a huge effort only to shrug and say "we don't know". With H1, we can design the test so that it is more likely to demonstrate a significant result.

H1 is a different hypothesis about the value of d'. What should this value be? This is a bit of a catch-22. Ideally, it is the best estimate we already have for d', based on pilot data or a literature search. This ideal isn't always practical, though. There are many times when we have no pilot data. Using values of d' from similar studies found in the literature is one possibility, but the uncertainty in our estimate will be very large even if we can find several such studies. And, of course, sometimes we're performing an entirely new study and there is nowhere to obtain a good estimate of d'.

Without pilot data or an estimate from the literature, a common option in some fields is to guess that d' is 0.5, which is considered a "medium" effect size. Jacob Cohen classified an effect size of 0.5 as medium because such an effect is "visible to the naked eye of a careful observer", but he noted that his suggested categorizations should be flexible and he "warned about ... them becoming de facto standards for research". Instead, his "ballpark categories provide a general guide that should also be informed by context". For sensory difference testing, a d' of 0.5 means a probability of discrimination (pd, defined in Section 3) of only 28%, which seems like a very small perceptual difference between two stimuli. Instead, I will use a d' of 1.0, or approximately the just-noticeable difference (JND) with a probability of discrimination of 50% (Section 3). (Of course, if you think that the JND is too strict or too lenient, feel free to choose a different value for d'.)

If you have some other estimate of (or preference for) d', please go ahead and use it. For the rest of this tutorial, though, I'll define H1 as the hypothesis that d' is 1.0, approximately the JND. This hypothesis is interesting because it's right at the threshold of where we consider the two stimuli to be consistently perceived as the same or different.

Once we have H1, we can estimate not only the probability of rejecting H0 when it is true (alphaH0), we can also estimate the probability of an inconclusive result when H1 is true, called beta. I will refer to it as betaH1 to emphasize that this probability is conditioned on H1 being true. It is common to target a betaH1 value of 0.20, meaning that if H1 is true we will have an inconclusive result (on average) 20% of the time. The power of a test is defined as 1 − betaH1, and so a test with high power has a low probability of being inconclusive.

BetaH1 is the area under the binomial distribution of H1 for all numbers of correct responses up to (but not including) the threshold for alphaH0. Once we have the binomial distribution for H1, beta can be computed by iterating over all numbers of correct responses from 0 up to (but not including) the value of threshH0 defined in Section 6 and Worksheet 4. (Visually, this limit is indicated with the dashed orange line marked αH0 in Worksheet 5.) All of the probabilities from the binomial distribution in this range are summed up to a total probability, which is betaH1.
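
In code, this summation is just the binomial cumulative distribution function; a minimal sketch, where pH1 is the model-correct probability for the chosen test under H1 (e.g. from Worksheet 2):

    from scipy.stats import binom

    def beta_H1(N, thresh_H0, p_H1):
        # sum of Binomial(x; N, pH1) for x = 0 .. threshH0 - 1
        # (the probability of an inconclusive result when H1 is true)
        return binom.cdf(thresh_H0 - 1, N, p_H1)

    # power = 1.0 - beta_H1(N, thresh_H0, p_H1)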

Once we've decided on the type of test, number of subjects, H0, H1, and alphaH0, we can compute the value of betaH1. If we're not happy with this value of betaH1 (because it will lead to too many tests being inconclusive), then we need to change some of the parameters (e.g. N) or assumptions (e.g. H1) of the test. If we make the assumptions more lenient in order to increase the estimated power of the test, we run more of a risk of failing to demonstrate a significant result when one does exist. Therefore, we usually increase N until we reach the desired value of betaH1.
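
A rough sketch of this search for N, reusing reject_threshold() and beta_H1() from the earlier sketches (pH1 again comes from the model-correct value for the chosen test under H1):

    def smallest_N(p_H1, alpha=0.05, beta_target=0.20, p_guess=1.0/3.0, max_N=2000):
        # increase N until the probability of an inconclusive result under H1
        # drops to the desired betaH1
        for N in range(3, max_N + 1):
            thresh_H0 = reject_threshold(N, alpha, p_guess)
            if beta_H1(N, thresh_H0, p_H1) <= beta_target:
                return N
        return None                              # not reachable within max_N

Because this search uses the exact binomial distribution, the resulting N can differ slightly from values computed with a normal approximation.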

Having H1 and knowing betaH1 doesn't change the threshold for rejecting H0, and nothing we do allows us to accept H0 or H1. In that sense, H1 is not necessary for significance testing. However, it can be very useful to have in advance some idea of the likelihood of being able to demonstrate significance (if a difference does exist).

Worksheet 5 shows two binomial distributions, one for H0 (in light red) and one for H1 (in light green). The hypothesis H1 in this worksheet is that d' is the value specified in Worksheet 1 (not necessarily 1.0). Where the two distributions overlap, the color is a dark (or olive) green. This worksheet is largely the same as Worksheet 4, but with the addition of H1. The standard deviation shown in this worksheet is for H1 instead of H0.

In the top-left corner of this worksheet, the values of d' (for H1), the type of test, number of subjects, alphaH0, betaH1, and the power are specified. Below this, you can enter the number of trials and delay for a simulation. In this case, the simulation will generate and score data for H1. Each trial is plotted in black with cross-hatched lines.

On the right-hand side of Worksheet 5, under "H0:", there are three values. The first ("pH0") is the probability of any one subject getting a correct result (always 0.333), the second ("threshH0") is the threshold for rejecting H0 (expressed as the number of subjects), and the third ("pRejH0") is the probability of rejecting H0 using this threshold.

On the right-hand side of Worksheet 5, under "H1:", there are four values. The first ("expect") is the expected number of subjects who will get a correct result (on average) when H1 is true. The second ("pH1") is the probability of any one subject getting a correct result. The third ("σH1") is the standard deviation of the probability distribution of H1. The fourth ("betaH1") is the computed value of beta, or the probability of an inconclusive result. (This value of beta is the same as on the left-hand side of the worksheet.)

On the right-hand side of Worksheet 5, under "H1 simulation:", there are four values. The first ("avg. corr.") is the average number of correct responses in this simulation of H1, the second ("pavgH1") is the average proportion of correct responses (i.e. the average number of correct responses divided by N) given H1, and the third number ("sH1") is the standard deviation of the results from these simulated trials. The fourth number, on the bottom ("incn."), shows the percent of trials that yielded an inconclusive result (i.e. not rejecting H0). With a large enough number of trials, "avg. corr." should be close to "expect", pavgH1 should be close to pH1, sH1 should be close to σH1, and "incn.", when divided by 100, should be close to betaH1.

[Interactive worksheet: d' (for H1), the test type, N, alphaH0, betaH1, and power are shown at the top left; inputs for the number of trials (T) and the simulation delay. H0 outputs: pH0, threshH0, and pRejH0. H1 outputs: expected number correct, pH1, σH1, and betaH1. H1 simulation outputs: average number correct, pavgH1, sH1, and the percent of inconclusive trials.]

Worksheet 5: Bar graphs for probabilities of H0 (A and B equal) and H1 (B is d' from A)


8. Interpreting a Result That Doesn't Demonstrate Significance

By design, statistical significance is a reasonably high bar to meet. In research areas such as clinical trials, where lives may be at stake, it can be prudent to focus only on those results that demonstrate significance. In sensory difference testing, however, or when running a pilot study, we may want to get as much information as we can from our test, regardless of whether or not the results demonstrate significance. For example, it might not be possible to get the 209 subjects needed to conduct a triangle test with d' = 1.0, alphaH0 = 0.05, and betaH1 = 0.20. Sometimes even getting 10 subjects, or one subject to perform 10 tests, can be a logistical challenge. Without lives at stake, the answer to our question "are the two stimuli perceptually different" may not require demonstrating statistical significance. What happens when we conduct a test and the results don't demonstrate significance? Rather than give up entirely, we can use the likelihood ratio to interpret our result. The likelihood ratio tells us how likely one hypothesis is over the other, given the result. There is no threshold for acceptance, but the higher the ratio, the more compelling the result is.

The likelihood of a hypothesis given a result is defined as the probability of the result given the hypothesis. In mathematical notation, L(H | C) = p(C | H). The likelihood function, L, "measures the goodness of fit of a statistical model to a sample of data". When talking about probabilities, the hypothesis is fixed (or given) and we evaluate the possible outcomes. With likelihoods, the outcome is fixed (e.g. 13 correct responses out of 20) and we evaluate the possible parameter(s) of the model, which in our case is the value of d'.

The likelihood ratio of two hypotheses estimates how much more likely one hypothesis is compared with the other. A likelihood ratio of 1.0 indicates that both hypotheses are equally likely. The likelihood ratio of H1 to H0 is the likelihood of H1 divided by the likelihood of H0, or L(H1 | C) / L(H0 | C). (For those of you who are into Bayesian statistics, the likelihood ratio is the same as the Bayes factor when constraining the Bayes factor hypotheses to single outcomes.) If L(H1 | C) is 0.132 and L(H0 | C) is 0.025, then we can say that H1 is 5.3 times more likely than H0 given the test result C.
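
A minimal sketch of this calculation, where pH0 and pH1 are the per-subject probabilities of a correct response under each hypothesis (the example outcome, 13 correct out of 20, is the one mentioned above):

    from scipy.stats import binom

    def likelihoods(C, N, p_H0, p_H1):
        # L(H | C) = p(C | H) = Binomial(C; N, pH)
        L_H0 = binom.pmf(C, N, p_H0)
        L_H1 = binom.pmf(C, N, p_H1)
        ratio = L_H1 / L_H0                     # > 1 favors H1, < 1 favors H0
        return L_H0, L_H1, ratio

    # e.g. 13 correct out of 20 in a 3-AFC test, with pH1 taken from the model at d' = 1
    print(likelihoods(C=13, N=20, p_H0=1.0/3.0, p_H1=0.634))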

We are using d'=0 for H0 and d'=1 for H1. Is the true value of d' in our test going to be exactly 0.0 or exactly 1.0? Probably not. But the likelihood ratio of these two hypotheses tells us the relative strength of a just-noticeable difference to no perceptual difference, given the result. These are two points of interest that help us judge whether two stimuli are likely to be (just) perceptually different or not. If the likelihood ratio favors a d' of 0, then it is more likely that there is no difference; if it favors a d' of 1, then it is more likely that there is a perceptual difference. It would be interesting to consider a range of possible d' values for H1, such as d' ≥ 1. However, doing this (using the Bayes factor) requires knowing or assuming prior probabilities that we don't really know anything about. To keep things simple, we can focus on two specific values of d'.

In Section 6, I named the region where there was a likely experimental error RL(EE). H1 specifies that d' equals some value D (usually 1 in this tutorial). In the region where the likelihood of H0 is greater than that of H1, d'=0 is more likely than d'=D; I'd therefore like to call this region RL(d'=0), i.e. the region where d'=0 is more likely. For each number of correct responses, C, in this region, we can determine the likelihood ratio of H0 to H1, or how much more likely H0 is than H1. The region where the likelihood of H1 is greater than that of H0 can be called RL(d'=D), and for each value of C in this region we can determine how much more likely H1 is than H0.

Worksheet 6 shows the same two binomial distributions from Worksheet 5, one for H0 (in red) and one for H1 (in light green). In addition to showing RL(EE) with a horizontal orange line, this worksheet shows the region where d' is more likely to be zero, RL(d'=0), and the region where d' is more likely to be the value specified for H1, RL(d'=D). Vertical dashed orange lines are placed at the boundaries of these regions for greater visual clarity.

In the top left corner of this worksheet, the values of d' (for H1), the type of test, number of subjects, alphaH0, betaH1, and the power are specified. Below this, you can enter the number of correct responses in a test, C. (Sorry, there are no more simulations in this tutorial.) A vertical gray bar is plotted at the specified value of C.

On the right-hand side of Worksheet 6, under "H0:", there are the same three values as in Worksheet 5. In addition, there is "pvalH0", which shows the p value for an experiment with this value of C. If this p value is less than or equal to pRejH0, then H0 can be rejected. Under "H1:", there are three of the same values from Worksheet 5. (The value of σH1 has been omitted because it is less interesting.)

Finally, on the right-hand side of Worksheet 6, under "Likelihoods:", there are three values. The first ("L(H0)") is the likelihood of H0 given the result C. The second ("L(H1)") is the likelihood of H1 given the result C. The third value ("LR:") specifies which hypothesis is more likely and the likelihood ratio for this hypothesis against the other. (Because the more likely hypothesis is specified, this likelihood ratio will always be greater than or equal to 1.) If this likelihood ratio is greater than 200, the value "200+" is shown for visual clarity, because anything over 150 is generally considered "decisive". For example, a value such as "H1 = 18.51" says that the d' value from H1 is 18.51 times more likely than a d' value of 0. A value such as "H0 = 2.24" says that a d' value of 0 is 2.24 times more likely than the value of d' from H1.

[Interactive worksheet: d' (for H1), the test type, N, alphaH0, betaH1, and power are shown at the top left; input for the number of correct responses (C). H0 outputs: pH0, threshH0, pRejH0, and pvalH0. H1 outputs: expected number correct, pH1, and betaH1. Likelihood outputs: L(H0), L(H1), and the likelihood ratio (LR).]

Worksheet 6: Likelihood Ratio for the Number of Correct Responses


9. Significance Testing: Is d' Greater Than One?

The use of the likelihood ratio in Worksheet 6 illustrates one problem with these ratios at higher correct-response rates. If, for example, we have 20 subjects in a 3-AFC test when H1 is d'=1.0, and there are 19 correct responses, then the likelihood ratio is (correctly) very large in favor of H1. But with 19 correct responses, the likelihoods of both hypotheses are tiny, at 0.000000011 for H0 and 0.0013 for H1. These tiny likelihoods strongly suggest that neither H0 nor H1 is true. If the true value of d' isn't even close to either hypothesis, then the likelihood ratio of the two hypotheses is in some sense a "misdirection," forcing us to look over here (at H0 and H1) when the interesting information is over there (a much larger d'). Fortunately, there is an easy solution to this problem that we have already covered: use conventional null-hypothesis significance testing (NHST) to reject H1 (in addition to H0) at the specified alpha value. In other words, treat H1 as our null hypothesis and see if the evidence is strong enough to reject it. If we can reject H1 with d'=1, we can conclude that d' is (probably) greater than 1. If we can demonstrate significance and therefore claim that there is (probably) more than a just-noticeable difference between the two stimuli, we have come to a very satisfactory conclusion.
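A minimal sketch of this 19-out-of-20 example, using an approximate 3-AFC probability of 0.634 for d'=1 (the variable names and the 0.05 alpha are my own choices, not the worksheet's):

    # The 19-of-20 example: both likelihoods are tiny, and rejecting H1 tells us more.
    from scipy.stats import binom

    N, C = 20, 19
    p0, p1 = 1/3, 0.634          # chance and approximate d'=1 probabilities (3-AFC)

    L_H0 = binom.pmf(C, N, p0)   # about 0.000000011
    L_H1 = binom.pmf(C, N, p1)   # about 0.0013
    ratio = L_H1 / L_H0          # enormous, yet neither hypothesis fits the data well

    # Treat H1 as a null hypothesis: how surprising is a result of C or more if d' = 1?
    pvalH1 = binom.sf(C - 1, N, p1)
    reject_H1 = pvalH1 <= 0.05   # if True, conclude that d' is probably greater than 1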

The previous paragraph implies that we could have skipped Sections 7 and 8 entirely, and simply used d'=1 as our null hypothesis in Section 6. However, it is often difficult enough to demonstrate significance for the hypothesis that d' is greater than zero. Presumably it will be a rare event when we demonstrate that d' is probably greater than 1. By keeping H0 as d'=0, H1 as d'=1, and using likelihood ratios, we can extract much more information from our result, which might be a large or small number of correct responses.

Testing multiple hypotheses at the same time is possible, but it is important to understand the implications. A common pitfall with testing multiple hypotheses is that as you add more and more comparisons, it becomes more and more likely that at least one hypothesis will demonstrate significance. (This effect has been nicely illustrated by Randall Munroe at XKCD.) If the success of an experiment (or a series of experiments) depends on demonstrating significance in only one case, then testing multiple hypotheses often makes it (artificially) easier to succeed. In this case, however, testing both H0 and H1 doesn't change the number of times we demonstrate a significant result, relative to testing only H0. Rejecting H1 always implies rejecting H0, because any result extreme enough to reject H1 is also extreme enough to reject H0. If we come to no conclusion about H0, we also come to no conclusion about H1. If we reject H0 but do not reject H1, we come only to the conclusion that d' is probably greater than 0; we have no opinion about whether d' is 1 or not. We therefore have three possible conclusions from the combination of the two tests: (1) we don't reject H0 (or H1), (2) we reject H0 and conclude that d' > 0, or (3) we reject H1 (and H0) and conclude that d' > 1. The number of times that we demonstrate a significant result is no different than if we had tested only H0. The purpose of testing these two hypotheses is not to determine a single outcome of significance from two independent tests; the purpose is to better understand, before conducting a test, what this one test result will tell us about d'.

Why not carry this approach to its logical extreme, test all possible values of d', and then select the hypothesis that best fits the data? The first reason is that this confounds a hypothesis (which, by definition, is made before conducting a test) with a result. If we specify our hypothesis based on the outcome of a test, then it is a result, not a hypothesis. We can't do hypothesis testing without a hypothesis. The second (and related) reason is that if we want to estimate d' and not just test a hypothesis or two about it, then we should use a technique designed for that purpose, as discussed in Section 10.

The approach of using both NHST and likelihood ratios leads to several classification regions in the binomial distribution. When H1 is the hypothesis that d' is some value D (usually 1.0 in this tutorial), there are four non-overlapping regions: (1) the region where there is likely experimental error (RL(EE)) (and if there is no error, then the likelihood ratio favors d'=0), (2) the region where it's more likely that d' is 0 than D (RL(d'=0)), (3) the region where it's more likely that d' is D than 0 (RL(d'=D)), and (4) using NHST to reject H1, the region where we conclude that d' is probably greater than D (RL(d'>D)). Another region has a boundary within either RL(d'=0) or RL(d'=D), namely the region where we conclude that d' is probably greater than 0 (from Section 6). For the regions RL(d'=0) and RL(d'=D), the likelihood ratio can give us more detail about the relative strength of the two hypotheses. Note that if a result is in RL(d'=D), that doesn't mean that it's more likely that d'=D than d'>D; it just means that the evidence isn't strong enough to reject the hypothesis that d'=D.
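These four regions can be sketched roughly as follows. This is an illustration rather than the worksheet's code: the RL(EE) boundary below is a simple lower-tail placeholder standing in for the actual definition from Section 6, and alpha and the d'=1 probability are assumed values.

    # Sketch of the four-region classification for a 3-AFC result C out of N.
    from scipy.stats import binom

    def classify(C, N, p0=1/3, pD=0.634, alpha=0.05):
        if binom.cdf(C, N, p0) <= alpha:         # far below chance: likely experimental error
            return "RL(EE)"
        if binom.sf(C - 1, N, pD) <= alpha:      # H1 rejected: d' probably greater than D
            return "RL(d'>D)"
        if binom.pmf(C, N, p0) > binom.pmf(C, N, pD):
            return "RL(d'=0)"                    # d'=0 more likely than d'=D
        return "RL(d'=D)"                        # d'=D more likely than d'=0

    # e.g. classify(19, 20) returns "RL(d'>D)" for these assumed values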

Worksheet 7 is very similar to Worksheet 6, but the region where it's more likely that d'=D than 0 has been reduced, and there is a new region where it's likely that d'>D. The worksheet also displays different information about H1. In particular, the "expect" and betaH1 values are no longer displayed (for conciseness). The worksheet now shows (a) threshH1, the threshold for rejecting H1 (expressed as the number of subjects who must respond correctly), (b) pRejH1, the probability of incorrectly rejecting H1 using this threshold when H1 is true, and (c) pvalH1, the p value for an experiment with this value of C given H1. If this p value is less than or equal to pRejH1, then H1 can be rejected and we can conclude that d' is probably greater than the value specified by H1.
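A sketch of how threshH1 and pRejH1 might be computed under this binomial model (again with assumed example values; the worksheet's own calculation may differ in its details):

    # threshH1: smallest number of correct responses that lets us reject H1.
    from scipy.stats import binom

    N, alpha = 20, 0.05
    pD = 0.634                               # approximate 3-AFC probability for d' = 1

    threshH1 = next(c for c in range(N + 1)
                    if binom.sf(c - 1, N, pD) <= alpha)
    pRejH1 = binom.sf(threshH1 - 1, N, pD)   # probability of wrongly rejecting H1 when it is true

    C = 19                                   # observed result
    pvalH1 = binom.sf(C - 1, N, pD)          # p value given H1
    reject_H1 = pvalH1 <= pRejH1             # if True, conclude that d' > 1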

[Interactive worksheet (requires the original web page): parameters H1 d', test type, number of subjects, alpha, betaH1, and power; an input field for the number of correct responses; outputs for H0 (pH0, threshH0, pRejH0, pvalH0), H1 (pH1, threshH1, pRejH1, pvalH1), and Likelihoods (L(H0), L(H1), LR).]

Worksheet 7: Accounting for Larger Values of d'


10. Estimating d' and Confidence Intervals

We may want to estimate d' from our test result, C. Our current experiment might be a pilot study, and we want to estimate d' in order to use a well-motivated value of d' when estimating the power of a subsequent (and possibly larger) experiment (as discussed in Section 7). Maybe we've run a trial with 200 subjects and we got 160 correct responses, far greater than we'd get with a d' of 1.0, and we want some estimate of d', not just the conclusion that d' > 1. Even without a large number of correct responses, after we come to the conclusion that d' probably is greater than zero, or that it probably isn't, or that we can't be sure but it's more likely than not, it's logical to then ask for our best estimate of d', if only for curiosity's sake.

The first step is to consider the binomial distribution for the unknown but true value of d' in our experiment. This distribution has a known number of subjects but an unknown probability pd', the probability of a correct response when the perceptual difference is the true value of d'. We have result C from our test, and this test has, according to our model, a perceptual distance equal to the true value of d'. We don't know if our test result is at the peak of the binomial distribution associated with d', but the best (maximum-likelihood) estimate we have of p for this test is C/N, which we can call pest. Because this test has the true value of d', the best estimate of pd' is the same as pest, namely C/N. The larger the value of N, the better we expect this estimate to be, as discussed in Section 5.

Given N and the test type, every value of d' yields a unique value of p. Previously we've computed p from d', N, and the test type (Sections 3 and 4, using equations from Christensen). We can determine the value of d' that corresponds with pest by computing p for a large number of values of d' and finding the closest match of these p values to pest. The value of d' associated with this closest match to pest is the maximum-likelihood estimate of d'.

For example, let's say that a 3-AFC test yielded 4 correct responses out of 20. (This should happen about 10% of the time when d' is 0, and almost never when d' is 1.) This yields pest = 0.20. For the 3-AFC test, a d' of 0 yields a probability of 0.33, and this is the minimum probability we can get from this test. Therefore, the closest probability we can get to 0.20 is 0.33, and the best estimate we have of d' in this case is 0. Now let's say that instead of 4 correct responses, we get 17 correct responses. This yields pest = 0.85. The d' that yields a probability of 0.85 is 1.91, and so our best estimate of d' is 1.91.
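A sketch of this grid-search estimate, assuming the same 3-AFC psychometric function as in the earlier sketch; the grid's step size and upper limit are arbitrary choices of mine:

    # Maximum-likelihood estimate of d' by scanning candidate d' values.
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def p_correct_3afc(d_prime):
        f = lambda z: norm.pdf(z - d_prime) * norm.cdf(z) ** 2
        return quad(f, -np.inf, np.inf)[0]

    def estimate_d_prime(C, N, grid=np.arange(0.0, 5.0, 0.01)):
        p_est = C / N                                      # best estimate of p from the data
        p_grid = np.array([p_correct_3afc(d) for d in grid])
        return grid[np.argmin(np.abs(p_grid - p_est))]     # d' whose p is closest to p_est

    # estimate_d_prime(4, 20)  -> 0.0 (p_est = 0.20 is below chance, so the closest p is at d' = 0)
    # estimate_d_prime(17, 20) -> roughly 1.9, matching the example above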

How good is this estimate of d'? That depends on how good our estimate is of pest, which in turn depends on how close C is to the expected value of the binomial distribution associated with d'. The more atypical our result C is, compared with a large number of similar (but hypothetical) tests, the worse our estimate is. We will never know how good our estimate is, but we can use confidence intervals to get some sense of the likely range of d' values. Before getting to the confidence interval, we should define the "population mean". If we had measurements of C from every human on earth taking multiple tests, each with the true value of d', we could compute the mean value of C associated with d' over all humans; this mean is the population mean (taken over the entire population of interest, which in our case is presumably the population of all humans). The population mean is what we think the "true" mean is. The one measured value of C that we have from our trial is called the sample mean.

A confidence interval (CI) specifies a range of values for, in our case, the expected number of correct responses, C. In particular, a 95% confidence interval says that if one were to conduct the same experiment again and again, with the same number of subjects and otherwise identical circumstances, then 95% of the time the computed confidence interval would contain the population mean. Once we have a lower limit and an upper limit for the expected number of correct responses (based on the confidence interval), we can compute the lower and upper limits for the proportion of correct responses (by dividing by the number of subjects), and we can use the same technique as before to map from a probability of success to d'. We can therefore use the low and high estimates of C to determine the low and high values of d' associated with the confidence interval.

There are many ways to compute a confidence interval. My favorite is the "bootstrap confidence interval". It makes no assumptions about the underlying distribution of the data, it doesn't assume that the interval is symmetric around the point estimate, and it's very easy to implement on a computer. The worksheet below uses bootstrapping to compute the confidence interval.
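One way to sketch such a bootstrap is to resample the individual correct/incorrect responses with replacement and take percentiles of the resampled counts (the number of resamples and the seed below are arbitrary choices; the resulting low and high counts would then be mapped to d' with the estimate from the previous sketch):

    # Bootstrap (percentile) confidence interval for the number of correct responses.
    import numpy as np

    def bootstrap_ci(C, N, conf=0.95, n_boot=10000, seed=1):
        rng = np.random.default_rng(seed)
        responses = np.array([1] * C + [0] * (N - C))       # observed 0/1 responses
        counts = rng.choice(responses, size=(n_boot, N)).sum(axis=1)
        low = np.percentile(counts, 100 * (1 - conf) / 2)
        high = np.percentile(counts, 100 * (1 + conf) / 2)
        return low, high     # low and high estimates of C; map each to d' as above

    # e.g. low, high = bootstrap_ci(17, 20) gives the range of C behind the low and high d' values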

Worksheet 8, below, is very similar to Worksheet 7. In addition to specifying the number of correct responses, there is a field for specifying the confidence interval. The default value, 95%, should be suitable for most cases. On the right-hand side, under "d' Estimates:", there are three rows. The first row shows the (maximum-likelihood, or ML) estimate of d' based on the number of correct responses. The second row shows the lower estimate of expected correct responses based on the specified confidence interval, followed by the d' value associated with this number of correct responses. The third row shows the higher estimate of expected correct responses based on the confidence interval, along with the associated d'. The graph has a vertical gray bar showing the value of C, as in Worksheets 6 and 7, and also a lighter-gray area illustrating the estimated range of correct responses specified by the confidence interval.

[Interactive worksheet (requires the original web page): parameters H1 d', test type, number of subjects, alphaH0, betaH1, and power; input fields for the number of correct responses and the confidence interval (% CI); outputs for H0 (threshH0, pRejH0, pvalH0), H1 (threshH1, pRejH1, pvalH1), Likelihoods (L(H0), L(H1), LR), and d' Estimates (ML d', low, high).]

Worksheet 8: Estimating d'


11. Conclusion

This tutorial has presented a number of techniques for evaluating whether two stimuli are perceptually different or not. If you're a fan of Ronald Fisher, then go ahead and use only H0 and alpha in your analysis. (If you're really a fan, then be flexible with alpha.) If you prefer the hybrid Fisher and Neyman-Pearson approach described in Section 7, then use beta and the power of the test. If it makes you uncomfortable to consider rejecting H1, then don't attempt it. (If you do consider rejecting H1, then be clear about this option before starting the test.) If you don't object to likelihood ratios, maximum likelihood estimates, or confidence intervals, then use them. If you think that confidence intervals are the way to go and that NHST has too many flaws, then use only confidence intervals. The purpose of these different analyses isn't to prove anything; it is to convince your intended (and probably skeptical) audience that your conclusions are supported by the data. The different methods of analysis provide different tools for that purpose.

The combination of NHST, likelihood ratios, maximum-likelihood estimates, and confidence intervals allows us to construct a nuanced interpretation of a test result. NHST allows us to (sometimes) come to a clear conclusion about the result but it provides no assessment of the merit of a hypothesis. Likelihood ratios and maximum-likelihood estimates provide us with specific estimates about different possibilities, but they don't tell us how accurate those estimates are. The use of a confidence interval on the maximum-likelihood estimate of C and d' provides a range of plausible values, but this interval may be too large to be useful by itself. By analyzing a test result in multiple ways, we can get a more complete picture of what the result tells us. The one thing to keep in mind is that we should specify our analysis techniques, hypotheses, and the criteria for success before conducting the experiment.

I've avoided talking about prior probabilities, i.e. the probabilities that H0 and H1 are true before conducting the experiment. Such probabilities can give us an even better picture of our data if we have them. In some fields it may be possible to estimate these probabilities; for example, there may be only a 10% chance of a clinical trial being successful, and so p(H0) is 0.90 and p(H1) is 0.10. For sensory difference testing, I'm not sure that these probabilities can be estimated very well in the general case, and so I've left them, and the whole world of Bayesian estimation, out of this tutorial.

It has been commented that researchers should publish not just the results that show success, but all results. When simply testing H0, the story that "we did all this work and can conclude nothing" is not very interesting to the writer, the reviewer, or the reader. However, being able to say "we did all this work and can conclude that H0 is 5 times more likely than H1" is potentially more interesting and potentially more publishable. Or, saying that the experiment did not demonstrate significance with the available number of subjects but that the estimated value of d' is 1.46 may provide useful information to future researchers. One advantage of using likelihood ratios and maximum-likelihood estimates is that all results become, at some level, interesting and therefore potentially publishable.

At the beginning of this tutorial, we asked a simple question: "Is there a perceptual difference between the two stimuli?". Hopefully it is clear now that there is very rarely a definitive answer to this question. We can only answer it in terms of probabilities that need to be evaluated within the context of the testing procedure. With a small number of subjects, the amount of uncertainty in the results may be large. With a very large number of subjects, we have better confidence in our estimates and conclusions, but we need to be sure that the result is meaningful as well as significant. For example, we may be confident that d' is 0.25, but simply rejecting the hypothesis that d' is zero doesn't mean that a value of 0.25 is perceptually important. Finally, all of these analysis methods are only as good as the models they are built upon, and different assumptions in the models or details of the test procedures can lead to different results. Therefore, it's important to understand the test results within the context of both the testing details and the model assumptions.

12. Final Worksheet

Sometimes you may want to use this web page just to try different testing scenarios. Worksheet 9 allows you to specify all of the relevant test parameters (d', test type, number of subjects, alphaH0), the number of correct responses, and the desired confidence interval in a single worksheet.

[Interactive worksheet (requires the original web page): inputs for H1 d', test type (triangle or 3-AFC), number of subjects, alphaH0, betaH1, power, the number of correct responses, and the confidence interval (% CI); outputs for H0 (threshH0, pRejH0, pvalH0), H1 (threshH1, pRejH1, pvalH1), Likelihoods (L(H0), L(H1), LR), and d' Estimates (ML d', low, high).]

Worksheet 9: Final Worksheet




Version 1.0.0 : March 16, 2021. Initial version.
Version 1.0.1 : Nov. 25, 2021. Minor updates.
Version 1.0.2 : Jun. 27, 2023. Change 'perceptual difference test' to the more common 'sensory difference test'.

Go to AlchemyOverlord home page

Copyright © 2021 John-Paul Hosom, all rights reserved. While I hope that you find this page useful, I make no guarantees about the accuracy or suitability of the results.