Four Pilot Studies for Maximizing Hop Flavor with Late-Hop Additions

Abstract
The purpose of the experiments described here was to estimate at what point in the boil, and at what temperature, hops should be added in order to maximize hop flavor. The first two perceptual tests were conducted using beers with the same amount of hops added at different times before flameout (from 1 to 20 minutes). The third test was conducted with the same amount of hops added at 10 minutes before flameout and the kettle covered or not covered. The fourth test was conducted with hops added at 10 minutes before flameout to boiling wort or to wort held at 170°F (77°C). The bitterness of the beers within each perceptual test was kept constant by adjusting the amount of a 40- or 45-minute hop addition. These experiments were pilot studies due to the small number of test comparisons, the use of a single test subject, and the use of a single variety of hops. The results indicate that hop flavor may be most pronounced with a 1-minute steep time, that evaporation has a gradual effect on hop flavor (with 10 minutes probably corresponding to a just-noticeable difference), and that the difference between a 1-minute and 20-minute steep time with an uncovered kettle was the most easily perceived of the conditions tested. The 10-minute hop stand at 170°F (77°C) showed no perceptual difference from a 10-minute boil. The results suggest that a "best practice" for maximizing hop flavor may be to add the hops very close to flameout, but that other late-hopping techniques may produce results that are perceptually very similar.

1. Introduction
The purpose of the experiments described here was to estimate at what point in the boil, and at what temperature, hops should be added for maximum hop flavor. The term "hop flavor" can mean different things to different people. For example, George Fix says that it has been traditionally (and not quite correctly) believed that the hop resins (which are responsible for bitterness) contribute to hop flavor, while the hop oils (including flavor compounds) contribute to hop aroma [Fix and Fix, p. 33 (emphasis mine)]. In this case, because the resins are responsible for bitterness, the term "hop flavor" is associated with the taste of bitterness. Somewhat more recently, it has been recognized that hop oils contribute to hop "flavor and aroma" [Oliver, p. 539] and that "late-hopping [is] a well-accepted technique for adding hop flavor and aroma" [Oliver, p. 539], and so "hop flavor" can refer not to a bitter taste, but to a distinct non-bitter flavor. Mark Garetz uses the term "character" to define this non-bitter flavor [Garetz, p. 14]. In this post, I use the term "flavor" for the non-bitter hop flavor that comes from the hop oils, with typical descriptions such as "floral," "citrus," "spicy," "grapefruit," or "earthy." These oils are also responsible for hop aroma [Oliver, p. 539], and so the terms "flavor" and "aroma" are often used together to describe their sensory impact. I will use the term "flavor" with the understanding that flavor and aroma are intertwined.

It is usually said that hops should be added earlier in the boil for bitterness and later in the boil for flavor and/or aroma [e.g. Fix and Fix, p. 33; Garetz, pp. 10-11; Noonan, p. 160; Oliver, p. 539]. Therefore, the experiments in this blog post focus on late-hop additions ranging from 1 to 20 minutes before flameout and forced cooling. (The distinction between "early" and "late" hopping is at around 30 minutes before flameout [Oliver, p. 539].)

While the belief in late hopping for flavor is nearly universal, it is difficult to find in the literature a "best" time for maximizing flavor or a quantified relationship between hop steep time and flavor. Greg Noonan says that "flavoring hops are commonly added ten or fifteen minutes before the end of the boil for lager beer" [Noonan, p. 159]. Charlie Papazian is the only source I know of who provides a graph of the relationship between steep time and hop flavor, with a peak at 10 minutes before flameout (and a separate peak at 0 minutes for aroma) [Papazian, p. 68], but it's unclear what set of data was used to produce this graph. It is possible that chemical reactions between boiling wort and hop oils require some amount of time to produce the most hop flavor in finished beer. Because flavor and aroma are intertwined, and the oils responsible for hop aroma are lost with evaporating steam [e.g. Lewis and Young, p. 271], it's also possible that peak hop flavor comes from flameout additions. The use of hop stands, with hops steeped at below-boiling temperatures, is common in hop-forward ales and might also contribute to increased hop flavor.

Attempting to answer the question of when to add hops for maximum flavor presents two logistical challenges. The first challenge is that the bitterness of beer increases with hop steep time and temperature, and so simply adding the same amount of hops at different times or temperatures will change the bitterness level in addition to any flavor changes. This topic is discussed more in Section 2. The second challenge is how to measure hop flavor in order to know when it has been maximized. The perceptual-testing approach used here is discussed in more detail in Section 3.

I've created a separate web page as an interactive tutorial for the mathematics behind perceptual difference testing, including significance testing, the power of a test, likelihood ratios, estimating the effect size (d'), and confidence intervals. These different analysis methods can be used to obtain a detailed interpretation of the results, which can be especially useful when the number of samples per trial is small and/or the statistical power of the test is low.

The perceptual experiments described below used only a single test subject and a single hop variety (Amarillo). In addition, the number of test samples used in these experiments was too small to reliably detect minor perceptual differences. These experiments are therefore pilot studies; the results are tentative and may or may not be supported by future studies. Having tentative results is at least a first step toward having more conclusive results.

2. Controlling for Bitterness
In order to control the bitterness level of the beers in these experiments, I used up to two hop additions in each condition. One addition was the same weight of hops added at different times or temperatures before flameout. Another addition (if used) was always made at 40 or 45 minutes before flameout (40 minutes for the first two experiments; 45 minutes for the last two), and the weight of this other addition was varied in order to target the same IBU value across all conditions within a test. Because additions at 40 or 45 minutes are considered to be primarily for bittering and not for flavor, the goal was to change the flavor with the timing of the late-hop addition but to keep the total bitterness of each condition the same by adjusting the smaller, earlier addition.

To predict IBU values for each condition, I used the technique described in "Estimating Isomerized Alpha Acids and nonIAA from Multiple IBU Measurements". This technique is used, with Mark Malowicki's model of alpha-acid isomerization [Malowicki], to estimate two parameters for modeling IBUs: scaling_IAA and scaling_nonIAAhops. The scaling_IAA parameter indicates how much of the isomerized alpha acids (IAA) are lost during the boil and fermentation, and the scaling_nonIAAhops parameter indicates (a) what percent of the weight of the hops becomes auxiliary bittering compounds during the boil and (b) to what degree these compounds are lost during the boil and fermentation. I obtained initial estimates of these two parameters from a preliminary study. I used these values, along with wort volume, weight of the hops, AA rating, pH, and original gravity to predict IBUs. The preliminary study and all experiments described here used hops from the same one-pound (0.45 kg) bag, to keep the alpha-acid (AA) rating and alpha-acid decay factor [e.g. Garetz, pp. 103-118] as equal as possible across conditions.
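For readers who want a feel for the isomerization part of this calculation, the following is a minimal Python sketch of the first-order rate equations commonly quoted from Malowicki's work. The rate constants, the function names, and the example temperatures are assumptions for illustration and should be checked against [Malowicki]. The sketch gives only the fraction of dissolved alpha acids present as IAA at a given time; it does not include the two scaling parameters described above, the auxiliary bittering compounds, or any conversion to IBUs.

```python
from math import exp


def rate_constants(temp_kelvin: float) -> tuple[float, float]:
    """Return (k1, k2) in 1/min.

    k1: alpha acids -> isomerized alpha acids (IAA)
    k2: IAA -> degradation products
    The Arrhenius constants are as commonly quoted from Malowicki;
    treat them as assumptions and verify against the source.
    """
    k1 = 7.9e11 * exp(-11858.0 / temp_kelvin)
    k2 = 4.1e12 * exp(-12994.0 / temp_kelvin)
    return k1, k2


def iaa_fraction(minutes: float, temp_kelvin: float) -> float:
    """Fraction of the initial dissolved alpha acids present as IAA."""
    k1, k2 = rate_constants(temp_kelvin)
    return k1 / (k2 - k1) * (exp(-k1 * minutes) - exp(-k2 * minutes))


BOILING_K = 373.15    # 100 C
HOP_STAND_K = 350.15  # about 77 C (170 F)

for minutes in (1, 5, 10, 20):
    boil = iaa_fraction(minutes, BOILING_K)
    stand = iaa_fraction(minutes, HOP_STAND_K)
    print(f"{minutes:2d} min: boiling {boil:.3f}, 77 C {stand:.3f}")
```

Because the late addition contributes more IAA the longer it steeps, the earlier bittering addition has to shrink as the late-hop steep time grows in order to hold the total constant.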

For the late-hop addition, I targeted an initial alpha-acid concentration close to the estimated alpha-acid solubility limit of about 200 ppm. The IBU prediction technique estimates a certain IBU value from this amount of hops, wort, temperature, and steep time (ranging from 1 up to 20 minutes). I then adjusted the weight of another hops addition, always added at 40 or 45 minutes before flameout, so that the model predicted the same total IBU value across all conditions within an experiment. The goal was to have all of the conditions in a perceptual comparison within 5 measured IBUs of each other, as 5 IBUs has been reported to be the perceptual threshold [Daniels, p. 76]. Up to about 50 or 60 IBUs there is a strong linear relationship between IBUs and perceived bitterness [Hahn, p. 50], and so for beers in this range the IBU is a good (and linear) metric for perceived bitterness.
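As a rough check on the target concentration, the arithmetic can be sketched in a few lines of Python. This assumes the concentration is simply the mass of alpha acids divided by the wort volume, uses the Experiment #1 late-hop weight and AA rating given in Section 4.2, and uses a hypothetical wort volume of 11.5 liters at the time of the addition (an illustrative value, not a measurement). It ignores any alpha-acid loss during storage.

```python
def alpha_acid_ppm(hops_grams: float, aa_rating: float, wort_liters: float) -> float:
    """Initial alpha-acid concentration in ppm (mg of alpha acids per liter)."""
    return hops_grams * aa_rating * 1000.0 / wort_liters


late_addition_g = 24.1   # late-hop addition in Experiment #1
aa_rating = 0.088        # AA rating of 8.8%
assumed_volume_l = 11.5  # ASSUMPTION: illustrative wort volume at the addition

ppm = alpha_acid_ppm(late_addition_g, aa_rating, assumed_volume_l)
print(f"initial alpha-acid concentration: about {ppm:.0f} ppm")
```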

3. Flavor Testing Methodology

3.1 Overview
To measure hop flavor, I used the triangle test (also used at Brülosophy) in order to judge whether two conditions can be distinguished from each other [e.g. Angevaare; Society of Sensory Professionals]. In the triangle test, a test subject tastes three samples of beer where two of the samples are from the same condition and one is from a different condition. The subject is asked which one of the three beers is different. This test is repeated a number of times. If the number of correct answers is above a threshold, then the two conditions can be considered perceptually different. It is important to note that if the number of correct answers is below the threshold, nothing can be concluded from a standard significance test; standard significance testing cannot accept the hypothesis that there is no perceptual difference between two conditions. However, likelihood ratios can be used to estimate the relative strength of the evidence for whether two beers are perceptually the same or different. We can also estimate the effect size (d'), which indicates the amount of difference between the two conditions. A d' of 0 indicates identical conditions, a d' of 1.0 corresponds to a just-noticeable difference, and larger values of d' indicate greater perceptual differences.

In this test, the beer judged as different was also rated by the subject as having either "more hop flavor" or "less hop flavor" than the others. By comparing beers at a range of steep times, one can first determine which steep times can be distinguished from each other. Then, for those samples that are correctly identified as different, one can look at how often one steep time is judged more flavorful than the other.

3.2 Testing Details
These experiments used a single subject or taster (this author). This single-subject design has advantages and disadvantages. One significant disadvantage of using a single subject is that the results from these experiments may or may not generalize to the larger population. One significant advantage of using a single subject is that there is probably a lower threshold for detecting perceptual differences, compared with a larger group of subjects. (Even if the one subject has a high threshold compared with the average population, the variance in the responses will be less for one subject than for many subjects due to individual threshold differences. This variance in responses negatively affects the effect size (lowering the value of d'), making it more difficult to distinguish between conditions in a study with many subjects.)

In the first two perceptual studies, both experiments had four conditions for different hop steep times, labeled A, B, C, and D in Experiment #1 and E, F, G, and H in Experiment #2. This resulted in six comparisons between conditions (in Experiment #1, Condition A vs. B, A vs. C, A vs. D, B vs. C, B vs. D, and C vs. D). Each comparison was tested eight times, for a total of 48 tests per experiment. The third experiment had two conditions: (J) kettle covered or (K) uncovered during the 10-minute late-hop addition. The fourth experiment also had two conditions: hops added to (L) boiling or (M) 170°F (77°C) wort for 10 minutes. Each of the comparisons in the third and fourth experiments was tested 24 times. The final two perceptual studies were conducted simultaneously, for a total of 48 tests.

A computer program was written to arrange the tests in random order with random ordering of conditions within a test. Tests were conducted up to four times per day with at least an hour between tests (to reduce order effects), and so each experiment took about two weeks to test. A second person poured samples for two to four tests every morning according to an instruction sheet with the randomized order of conditions. Each test sample was 1.5 oz (44 ml), and so more than 74 oz (2.2 liters) of each condition were required for testing. While the beers were stored close to freezing to preserve flavor, each sample of beer came up to room temperature before tasting.
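The program used for these experiments is not reproduced here, but the idea can be sketched as follows. This is a minimal Python illustration, not the original code: the function name, the dictionary layout, and the choice to alternate which condition serves as the odd sample are all assumptions. It generates a fixed number of tests for every pair of conditions, shuffles the three cups within each test, and then shuffles the order of the tests.

```python
import random
from itertools import combinations


def make_schedule(conditions, reps, seed=None):
    """Build a randomized list of triangle tests.

    Each pair of conditions is tested `reps` times. Within a pair, the odd
    sample alternates between the two conditions (a design assumption).
    """
    rng = random.Random(seed)
    tests = []
    for a, b in combinations(conditions, 2):
        for i in range(reps):
            odd, other = (a, b) if i % 2 == 0 else (b, a)
            cups = [odd, other, other]
            rng.shuffle(cups)  # random position of the odd sample
            tests.append({
                "comparison": (a, b),
                "cups": cups,
                "odd_position": cups.index(odd) + 1,  # answer key, 1 to 3
            })
    rng.shuffle(tests)  # random order of tests across the experiment
    return tests


schedule = make_schedule(["A", "B", "C", "D"], reps=8, seed=2018)
for number, test in enumerate(schedule[:4], start=1):
    print(number, test["cups"])  # pouring instructions; the key stays hidden
```

Building the list pair by pair before shuffling guarantees the same number of tests for every comparison, whatever the random seed.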

The subject marked their responses (i.e. indicated the beer that was judged different, and if they thought this beer was more or less flavorful than the others) on a separate sheet. Testing was conducted in a quiet room with as much time as needed for making a decision. The subject did not know the correct answers until the end of the experiment.

3.3 Evaluating Results
With eight tests of a comparison and a significance level of 0.05, six tests need to be correctly identified in order to reach statistical significance and reject the null hypothesis of "no perceptual difference." At the same significance level, seven of the eight tests need to be correctly identified in order to reach statistical significance and reject the null hypothesis of a just-noticeable difference (JND). Unfortunately, with only eight results per trial, the power of a significance test comparing no perceptual difference against the JND is an abysmal 6%, meaning that 94% of the time that there really is a just-noticeable perceptual difference, a statistically significant result will not be obtained. (This is one reason why a test result that does not show significance should not be used to conclude that the conditions are perceptually equal. These experiments were conducted with the expectation that there would be more than a just-noticeable difference in at least one comparison.)

With 24 tests of a comparison and the same significance level, 13 tests need to be correctly identified to reach statistical significance and reject the null hypothesis of "no perceptual difference", and 16 tests need to be correctly identified in order to reject the null hypothesis of a just-noticeable difference. The power of a test comparing no perceptual difference against the JND is still a miserable 15%, meaning that 85% of the time that there really is a just-noticeable perceptual difference, a statistically significant result will not be obtained.
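The thresholds for rejecting "no perceptual difference" follow from the binomial distribution, because a subject who cannot tell the beers apart picks the odd sample by chance one time in three. The following Python sketch (function names are mine, for illustration) computes the one-sided p value for a given number of correct answers and the smallest number of correct answers needed for significance.

```python
from math import comb
from typing import Optional

GUESS = 1.0 / 3.0  # chance of a correct answer when guessing in a triangle test


def tail_prob(correct: int, n: int, p: float = GUESS) -> float:
    """One-sided p value: P(X >= correct) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k)
               for k in range(correct, n + 1))


def critical_value(n: int, alpha: float = 0.05, p: float = GUESS) -> Optional[int]:
    """Smallest number of correct answers with a p value at or below alpha."""
    for correct in range(n + 1):
        if tail_prob(correct, n, p) <= alpha:
            return correct
    return None  # n is too small to ever reach significance


for n in (8, 24):
    needed = critical_value(n)
    print(f"{n} tests: {needed} correct needed, p = {tail_prob(needed, n):.3f}")
```

The same `tail_prob` function gives the p value for any observed result, such as 5 correct out of 8.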

In order to obtain more information from the test results, the likelihood ratios and maximum-likelihood estimates of the effect size (d') with a 95% confidence interval were computed, in addition to significance testing. For those less familiar with these concepts, there is an interactive tutorial on the terminology and mathematics of perceptual testing.
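To make these quantities more concrete, here is a minimal Python sketch of how power, a likelihood ratio, and a maximum-likelihood d' might be computed. It assumes the standard Thurstonian model of the triangle test for converting d' into a probability of a correct response, evaluated by simple numerical integration; the function names, integration settings, and bisection search are my own choices for illustration and are not necessarily what the tutorial uses. Confidence intervals are not computed here.

```python
from math import comb, erf, exp, pi, sqrt


def norm_pdf(x: float) -> float:
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def pc_triangle(d_prime: float, upper: float = 10.0, steps: int = 2000) -> float:
    """P(correct) in a triangle test for a given d' (Simpson's rule, even steps)."""
    a, b = sqrt(3.0), d_prime * sqrt(2.0 / 3.0)

    def f(x: float) -> float:
        return (norm_cdf(b - a * x) + norm_cdf(-b - a * x)) * norm_pdf(x)

    h = upper / steps
    total = f(0.0) + f(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * f(i * h)
    return 2.0 * total * h / 3.0


def binom_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def tail_prob(correct: int, n: int, p: float) -> float:
    return sum(binom_pmf(k, n, p) for k in range(correct, n + 1))


def likelihood_ratio(correct: int, n: int, d1: float = 1.0, d0: float = 0.0) -> float:
    """Likelihood of the result under d' = d1 relative to d' = d0."""
    return binom_pmf(correct, n, pc_triangle(d1)) / binom_pmf(correct, n, pc_triangle(d0))


def dprime_ml(correct: int, n: int, hi: float = 10.0) -> float:
    """Maximum-likelihood d': invert pc_triangle by bisection (capped at hi)."""
    target = correct / n
    if target <= 1.0 / 3.0:
        return 0.0
    lo = 0.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if pc_triangle(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


pc_jnd = pc_triangle(1.0)
print(f"P(correct | d'=1): {pc_jnd:.3f}")
print(f"power with 8 tests (6 needed):   {tail_prob(6, 8, pc_jnd):.2f}")
print(f"power with 24 tests (13 needed): {tail_prob(13, 24, pc_jnd):.2f}")
print(f"likelihood ratio, 6 of 8 correct: {likelihood_ratio(6, 8):.2f}")
print(f"ML estimate of d', 6 of 8 correct: {dprime_ml(6, 8):.2f}")
```

Small differences in the second decimal place relative to published tables are possible, depending on how the d' = 1 response probability is rounded.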

4. Experiment #1: Varying Steep Times with an Uncovered Kettle

4.1 Experiment #1: Experimental Overview
In this experiment, a late-hop addition was made at 1, 5, 10, or 20 minutes. The kettle was uncovered during the final 20 minutes of the boil, allowing volatile hop oils to evaporate.

4.2 Experiment #1: Experimental Methods
All conditions used 2.55 lbs (1.16 kg) of Briess Pilsen Dried Malt Extract with 3.37 G (12.75 liters) of 120°F (49°C) water to create 3.50 G (13.25 liters) of room-temperature wort with specific gravity 1.031. The wort sat for about 90 minutes to let the pH stabilize, at which point the pH was adjusted with phosphoric acid to 5.30. The wort was boiled (uncovered) for 5 minutes to reduce the foam associated with the start of the boil. A 12-oz (0.35 liter) sample was taken for measuring specific gravity and a 40-minute timer was started. The first addition of Amarillo hops (AA rating 8.8%) was made with the weight listed in Table 1 (using a weighted coarse-mesh bag). The kettle was covered for the first 20 minutes of the boil to reduce evaporation, after which time the cover was removed to allow evaporation. At each target time, the second addition of 0.850 oz (24.1 g) of the same Amarillo hops (with the steep time listed in Table 1) was added in a weighted coarse-mesh bag. At flameout the wort was quickly cooled with an immersion chiller to 75°F (24°C) and the hops were removed. Sterilized, room-temperature water was added to bring the volume up to about 3.0 G (11.36 liters). The wort was stirred and then sat for about 15 minutes, covered, to settle the heavier trub. Then, 0.813 G (3.08 liters) of wort was transferred to a sanitized fermentation vessel. This wort was aerated for 1 minute by vigorous shaking, and 0.08 oz (2.20 g) of Safale US-05 yeast was added. A final sample was taken from the kettle for measuring specific gravity.

The wort fermented for one week, after which time 92 oz (2.72 liters) were decanted, leaving the trub behind. From that, a 4-oz (0.12 liter) sample was taken for IBU measurement by Oregon BrewLab. The remainder was stored at close to freezing with minimal exposure to oxygen until the results from Oregon BrewLab confirmed that the samples were all within 5 IBUs of each other. Except when bringing samples up to room temperature for tasting, the beers were kept at near freezing and with minimal exposure to oxygen.

The perceptual experiment was conducted as described in Section 3.2. Conducting up to four tests per day took 17 days. Due to the difficulty in detecting clear differences between samples, tasting of each sample was spaced out by about 30 seconds and small sips of water or a tiny amount of dry bread were taken between tastings to reset the palate.

| Condition | A | B | C | D |
| --- | --- | --- | --- | --- |
| weight of 1st addition | 0.379 oz / 10.75 g | 0.289 oz / 8.20 g | 0.185 oz / 5.25 g | 0 oz / 0 g |
| steep time of 2nd addition | 1 min. | 5 min. | 10 min. | 20 min. |
| pre-boil specific gravity (SG) | 1.031 | 1.031 | 1.031 | 1.031 |
| pre-boil volume (measured, room temp.) | 3.51 G / 13.30 liters | 3.49 G / 13.22 liters | 3.50 G / 13.25 liters | 3.50 G / 13.26 liters |
| SG at 1st addition | 1.033 | 1.033 | 1.034 | 1.033 |
| volume at 1st addition (estimated from SG) | 3.30 G / 12.50 liters | 3.27 G / 12.38 liters | 3.26 G / 12.34 liters | 3.29 G / 12.46 liters |
| post-boil SG (after volume correction) | 1.036 | 1.036 | 1.036 | 1.036 |
| post-boil volume (estimated from SG) | 3.03 G / 11.46 liters | 2.99 G / 11.32 liters | 3.02 G / 11.43 liters | 3.02 G / 11.44 liters |
| measured IBUs | 23.6 | 24.5 | 23.0 | 22.8 |

Table 1. Measured and estimated (where indicated) values for the four conditions with an uncovered kettle.

4.3 Experiment #1: Results and Analysis
The IBU levels from the four conditions were well within the perceptual threshold of 5 IBUs. The average was 23.5 IBUs, with a standard deviation of 0.66 IBUs. The maximum difference between two conditions was 1.7 IBUs. These results indicate that the beers were not perceptually different in terms of bitterness.
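These summary values can be reproduced from the measured IBUs in Table 1 with a short Python sketch. The reported standard deviation is consistent with the population form (dividing by the number of conditions), which is what `statistics.pstdev` computes; the function name and the dictionary keys are mine, for illustration.

```python
from statistics import mean, pstdev

PERCEPTUAL_THRESHOLD_IBU = 5.0  # threshold cited from [Daniels, p. 76]


def ibu_summary(ibus):
    """Mean, population standard deviation, and largest pairwise difference."""
    return {
        "mean": mean(ibus),
        "sd": pstdev(ibus),
        "max_diff": max(ibus) - min(ibus),
    }


table_1_ibus = [23.6, 24.5, 23.0, 22.8]  # Conditions A, B, C, D
summary = ibu_summary(table_1_ibus)
print(f"mean {summary['mean']:.1f}, SD {summary['sd']:.2f}, "
      f"max difference {summary['max_diff']:.1f}")
print("within threshold:", summary["max_diff"] < PERCEPTUAL_THRESHOLD_IBU)
```

The same function can be applied to the measured IBUs of any other set of conditions.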

The results of the perceptual test are shown in Table 2. The top-right corner of the table provides the number of correct responses, the p value associated with this response rate (with the value in bold font if significance was reached), the likelihood ratio for a just-noticeable difference relative to no perceptual difference, and the low, maximum-likelihood, and high estimates of d' (using a 95% confidence interval; a d' of 0 corresponds with no perceptual difference, and a d' of 1 corresponds with a just-noticeable difference). The bottom-left corner of the table shows the identity of the preferred sample for each correct response.

The expected amount of variability in the results is quite large, given only 8 samples per trial (standard deviation 1.4 samples). Two trends in the correct-response rate are visible, however: (1) Condition A is more likely to be distinguished from the other conditions, and (2) other comparisons indicate that no perceptual difference is approximately just as likely as a just-noticeable difference.

One unusual result is that the comparison of A vs. B demonstrates a significant difference, and A vs. D also demonstrates a significant difference, but A vs. C does not demonstrate significance. Jumping ahead a little bit in the story in order to explain these results, the experiment described in Section 6 (to test the impact of evaporation on a 10-minute steep time) has results which indicate that the true underlying trend is probably that A and B actually have the least perceptual difference, A and C probably have a just-noticeable but not significant difference, and A and D have the largest perceptual difference. In other words, evaporation and steep time probably affect perception, but the effect is more likely to be a gradual change over a period of about 10 or 20 minutes.

For the preferences, all of the correct responses involving Condition A were associated with a preference for Condition A. For comparison B vs. C, the preference was equally split. For B vs. D, the single correct response favored D. For C vs. D, four out of the five favored Condition C. Shorter steep times appear to be somewhat preferred over longer steep times, but the only universal preference was for the shortest steep time of 1 minute.

These results (and taking into account the results from Section 6) suggest that the shortest hop steep time has the most perceived hop flavor, and that evaporation probably affects hop flavor gradually over a 10- to 20-minute period. Based on these results, one should keep hops in the wort for the shortest time possible in order to maximize flavor.

| Comparison | A | B | C | D |
| --- | --- | --- | --- | --- |
| A | | 6 / 8 correct; p = 0.020; LR: d'=1/d'=0 = 2.98; d' (low, ML, high) = 0.68, 2.79, 4.68 | 2 / 8 correct; p = 0.805; LR: d'=1/d'=0 = 0.69; d' (low, ML, high) = 0, 0, 2.10 | 7 / 8 correct; p = 0.003; LR: d'=1/d'=0 = 4.28; d' (low, ML, high) = 2.10, 3.75, 4.68 |
| B | more flavor: AAAAAA | | 4 / 8 correct; p = 0.259; LR: d'=1/d'=0 = 1.44; d' (low, ML, high) = 0, 1.46, 3.75 | 1 / 8 correct; p = 0.961; LR: d'=1/d'=0 = 0.48; d' (low, ML, high) = 0, 0, 0.68 |
| C | more flavor: AA | more flavor: BB CC | | 5 / 8 correct; p = 0.088; LR: d'=1/d'=0 = 2.07; d' (low, ML, high) = 0, 2.10, 3.75 |
| D | more flavor: AAAAAAA | more flavor: D | more flavor: CCCC D | |

Table 2. Results from perceptual testing with an uncovered kettle. The top-right corner shows analysis of the number of correct responses. The bottom-left corner shows, for those samples correctly identified as different, which sample was considered to have more hop flavor.

5. Experiment #2: Varying Steep Times with a Covered Kettle

5.1 Experiment #2: Experimental Overview
The experiment with an uncovered kettle showed that hop flavor is probably maximized with the shortest possible steep time. There are two likely explanations for this: (1) the hop oils degrade when they're in boiling wort, and/or (2) the hop oils are removed from the wort through evaporation. If the first explanation is true, then one may be able to vary the temperature of the wort in order to minimize degradation and maximize flavor. If the second explanation is true, then one only needs to cover the kettle in order to prevent the loss of hop oils. The experiment described here tested the second explanation by covering the kettle during the boil. If there is no perceptual difference between any of the conditions, that would suggest that the oils are lost primarily through evaporation. If results are similar to the experiment with the uncovered kettle, that would suggest that oils are mostly degraded in boiling wort.

5.2 Experiment #2: Experimental Methods
This experiment was conducted using the same general methods as the first experiment. The first addition of Amarillo hops was made with the weight listed in Table 3 (using a weighted coarse-mesh bag). The kettle was covered during the entire 40-minute steep time, except for brief stirring and to add the second hop addition. At each target time, the second addition of 0.765 oz (21.7 g) of Amarillo hops (with the steep time listed in Table 3) was added in a weighted coarse-mesh bag.

The perceptual experiment was conducted as described in Section 3.2. Unfortunately, a bug in the randomization yielded between 7 and 12 samples per trial, instead of always 8 samples per trial. Conducting up to four tests per day took 16 days. Due to the difficulty in detecting clear differences between samples, tasting of each sample was spaced out by about 30 seconds and small sips of water or a tiny amount of dry bread were taken between tastings to reset the palate.

| Condition | E | F | G | H |
| --- | --- | --- | --- | --- |
| weight of 1st addition | 0.363 oz / 10.30 g | 0.274 oz / 7.77 g | 0.181 oz / 5.14 g | 0.096 oz / 2.71 g |
| steep time of 2nd addition | 1 min. | 5 min. | 10 min. | 15 min. |
| pre-boil specific gravity (SG) | 1.031 | 1.032 | 1.032 | 1.031 |
| pre-boil volume (measured, room temp.) | 3.48 G / 13.18 liters | 3.48 G / 13.18 liters | 3.48 G / 13.19 liters | 3.50 G / 13.25 liters |
| SG at 1st addition | 1.033 | 1.033 | 1.033 | 1.034 |
| volume at 1st addition (estimated from SG) | 3.31 G / 12.54 liters | 3.33 G / 12.62 liters | 3.18 G / 12.04 liters | 3.28 G / 12.42 liters |
| post-boil SG | 1.034 | 1.0345 | 1.036 | 1.035 |
| post-boil volume (estimated from SG) | 3.18 G / 12.03 liters | 3.18 G / 12.04 liters | 3.01 G / 11.41 liters | 3.14 G / 11.88 liters |
| measured IBUs | 20.2 | 21.4 | 21.2 | 18.7 |

Table 3. Measured and estimated (where indicated) values for the four conditions with a covered kettle.

5.3 Experiment #2: Results and Analysis
The IBU levels from the four conditions were well within the perceptual threshold of 5 IBUs. The average was 20.4 IBUs, with a standard deviation of 1.07 IBUs. The maximum difference between two conditions was 2.7 IBUs. These results indicate that the beers were not perceptually different in terms of bitterness.

The results of the perceptual test are shown in Table 4. The top-right corner of the table provides the number of correct responses, the p value associated with this response rate (none of the results reached significance), the likelihood ratio for a just-noticeable difference relative to no perceptual difference, and the low, maximum-likelihood, and high estimates of d'. The bottom-left corner of the table shows the identity of the preferred sample for each correct response.

In this experiment, condition E (the shortest steep time) does not demonstrate any significant differences against the other conditions. Overall, the likelihood ratios show no clear trend; for example, conditions with a greater difference in steep time are not more likely to have a just-noticeable difference than conditions with a small difference in steep time. Unlike the first experiment, all of the 95% confidence intervals include a d' of 0, or no perceptual difference.

For the preferences, there is also no clear preference for any one steep time. The number of correct responses is quite small in most comparisons, and the only comparison with more than four correct responses was evenly split in preference between the two conditions.

While it's not possible to demonstrate that two conditions are perceptually the same using standard significance testing, the set of results here suggests that all conditions in this experiment have at most a just-noticeable difference and quite likely no perceptual difference. In the previous experiment, Condition A had greater perceptual differences from other conditions and was universally preferred over other conditions; those patterns were not observed in this experiment. These results suggest that hop oils lost through evaporation are an important component of hop flavor.

| Comparison | E | F | G | H |
| --- | --- | --- | --- | --- |
| E | | 3 / 8 correct; p = 0.532; LR: d'=1/d'=0 = 1.00; d' (low, ML, high) = 0.0, 0.68, 2.79 | 2 / 7 correct; p = 0.737; LR: d'=1/d'=0 = 0.79; d' (low, ML, high) = 0, 0, 2.58 | 3 / 8 correct; p = 0.532; LR: d'=1/d'=0 = 1.00; d' (low, ML, high) = 0.0, 0.68, 2.79 |
| F | more flavor: EE F | | 1 / 9 correct; p = 0.974; LR: d'=1/d'=0 = 0.42; d' (low, ML, high) = 0, 0, 0 | 4 / 8 correct; p = 0.259; LR: d'=1/d'=0 = 1.44; d' (low, ML, high) = 0, 1.46, 3.75 |
| G | more flavor: GG | more flavor: G | | 7 / 12 correct; p = 0.066; LR: d'=1/d'=0 = 2.48; d' (low, ML, high) = 0, 1.89, 3.38 |
| H | more flavor: E HH | more flavor: FF HH | more flavor: GGG HHHH | |

Table 4. Results from perceptual testing with a covered kettle. The top-right corner shows analysis of the number of correct responses. The bottom-left corner shows, for those samples correctly identified as different, which sample was considered to have more hop flavor.

6. Experiment #3: Covered vs. Uncovered Kettle with 10-Minute Addition

6.1 Experiment #3: Experimental Overview
The first experiment demonstrated an unexpected result: a significant difference between 1 and 5 minutes (A vs. B comparison with 6 correct responses out of 8 tests), no significant difference between 1 and 10 minutes (A vs. C with 2 out of 8 correct), and a significant difference between 1 and 20 minutes (A vs. D with 7 out of 8 correct). It is mathematically more likely that the lack of perceptual difference in the A vs. C comparison is an incorrect conclusion, which implies that hop oils quickly evaporate with steam. However, the number of data points in this experiment was small and therefore the uncertainty is large. A third experiment was conducted to test this hypothesis with more data. This experiment had two conditions, J and K, both with a 10-minute late-hop addition. The primary difference between the two conditions was that in Condition J the kettle was covered during the final 10 minutes and in Condition K the kettle was uncovered (allowing steam to escape). If the tentative conclusion from the first experiment is correct and hop oils are quickly lost with evaporating steam, then there should be a perceptual and significant difference between Conditions J and K. (With an estimated d' of 2.79 in the A vs. B comparison and 3.75 in the A vs. D comparison, an estimate of d' for a 10-minute steep time is about 3. With 24 tests and a d' of 3.0, the power of the test is close to 1.0.)

6.2 Experiment #3: Experimental Methods
This experiment was conducted using the same general methods as the first and second experiments. Wort for each condition was created using 2.47 lbs (1.12 kg) of DME and 3.27 G (12.38 liters) of water, yielding 3.43 G (13.0 liters) of wort with specific gravity 1.031. The first addition of 0.176 oz (5.0 g) of Amarillo hops (AA rating 8.8%) was made at 45 minutes before flameout (in a weighted coarse-mesh bag). Both conditions had 0.811 oz (23.0 g) of Amarillo hops added in a weighted coarse-mesh bag at 10 minutes before flameout. Safale S-04 yeast was used for fermentation.

For Condition J, the kettle was uncovered for the first 10 minutes after the initial hop addition, and then covered for the remaining 35 minutes of the boil (with the brief exception of adding the 10-minute hop addition). For Condition K, the kettle was uncovered during the first 10 minutes, covered during the next 25 minutes, and uncovered during the final 10 minutes (after the second hop addition was made).

The perceptual experiment was conducted as described in Section 3.2. Conducting 24 tests with up to four tests per day, along with the 24 tests in the fourth experiment, took 17 days. With the expectation of less difficulty in detecting a clear difference between samples and a desire to balance memory effects with adaptation effects, tasting of each sample was spaced out by about 10 seconds and only small sips of water were taken between tastings to reset the palate.

6.3 Experiment #3: Results and Analysis
The measured IBUs were 24.7 for Condition J and 28.9 for Condition K. The difference between these IBU levels, 4.2, is within the perceptual threshold of 5 IBUs. These results indicate that the beers were not perceptually different in terms of bitterness.

The results of the perceptual test were that 11 out of the 24 tests were correctly identified, and of those correct responses, 3 times Condition J was preferred and 8 times Condition K was preferred. The p value associated with this response rate is 0.14 (not significant at a threshold of 0.05), and the likelihood ratio for a just-noticeable difference relative to no perceptual difference is 2.07. The low, maximum-likelihood, and high estimates of d' are 0.0, 1.24, and 2.32, respectively.
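As a rough check on the p value above, the following sketch assumes that each test has a one-in-three chance of being answered correctly by guessing (consistent with the chance rate of 8 correct out of 24 noted in Section 7.3) and that the p value is the one-sided exact binomial tail probability. The function name `guessing_p_value` is hypothetical, and the sketch is not necessarily the exact procedure described in Section 3.2.

```python
from math import comb

def guessing_p_value(correct, trials, chance=1.0 / 3.0):
    """Probability of getting at least `correct` right answers in `trials`
    independent tests if every answer were a pure guess (assumed chance 1/3)."""
    return sum(
        comb(trials, k) * chance**k * (1.0 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Experiment #3: 11 correct responses out of 24 tests.
print(round(guessing_p_value(11, 24), 2))  # 0.14
```

Under the same assumptions, `guessing_p_value(5, 24)` gives about 0.94, which corresponds to the response rate in the fourth experiment.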

These results were very much unexpected, in the low estimate of d', the lack of significance, and the general preference for the uncovered late-hop addition over the covered late-hop addition. These results imply that in the first experiment the A vs. B comparison (1 min. vs. 5 min.) yielded an incorrect result that supported a perceptual difference, and that the A vs. C comparison (1 min. vs. 10 min.) was actually correct in not demonstrating significance. Given the strength of the A vs. D comparison (1 min. vs. 20 min., with 7 out of 8 correct and consistent responses), it seems prudent to continue to assume that the result of that comparison was correct.

The preference for Condition K over Condition J might be due to (a) difficulty in distinguishing these two conditions (with a fairly low d'), (b) small differences in the perceptual testing methodology that may have had an unexpectedly large effect, (c) the use of a different strain of yeast, and/or (d) flavor changes over time due to the transformation of hop oils in the hot wort in addition to the loss of oils through evaporation. The explanation that hop oils are simply lost through evaporation may or may not be complete.

Considering the set of results of the first three experiments, it appears that hop flavor does decrease with longer steep times, but only relatively slowly. We can estimate the perceptual change over time (with an uncovered kettle) as a d' of roughly 1.0 after 10 minutes (a just-noticeable difference) and a d' of roughly 3.0 (with a maximum-likelihood estimate of 3.75) at 20 minutes. With the preference for the shortest steep time in the first experiment not consistent with the preference for the uncovered kettle in the third experiment, it is unclear if flavor changes occur only through evaporation, through additional mechanisms, or if testing differences or statistical variation in the third experiment caused a different result. The universal preference for the shortest steep time in the first experiment leads to the tentative conclusion that flavor is maximized with the shortest steep time.
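The d' estimates discussed here can be approximated from the proportion of correct responses. The sketch below is built on assumptions: it treats each test as a three-sample triangle test, uses the standard Thurstonian model for the probability of a correct triangle response, and takes the maximum-likelihood d' to be the value whose predicted proportion matches the observed one (or 0 if the observed proportion is at or below chance). The function names, the Simpson's-rule integration, and the search cap of d' = 10 are all illustrative choices, not the procedure from Section 3.2.

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def triangle_pc(d_prime, upper=8.0, steps=800):
    """Probability of a correct triangle-test response for a given d'
    (assumed Thurstonian model), integrated with Simpson's rule on [0, upper]."""
    a = math.sqrt(3.0)
    b = d_prime * math.sqrt(2.0 / 3.0)

    def f(z):
        return 2.0 * (norm_cdf(-a * z + b) + norm_cdf(-a * z - b)) * norm_pdf(z)

    h = upper / steps  # steps must be even for Simpson's rule
    total = f(0.0) + f(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * f(i * h)
    return total * h / 3.0

def ml_d_prime(correct, trials, cap=10.0):
    """d' whose predicted proportion correct equals the observed proportion,
    found by bisection; 0.0 when the observed proportion is at or below chance."""
    p_obs = correct / trials
    if p_obs <= 1.0 / 3.0:
        return 0.0
    lo, hi = 0.0, cap
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if triangle_pc(mid) < p_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Compare with the reported maximum-likelihood estimates
# (1.24 for 11 of 24; 2.79 for 6 of 8; 3.75 for 7 of 8).
for correct, trials in [(11, 24), (6, 8), (7, 8)]:
    print(correct, trials, round(ml_d_prime(correct, trials), 2))
```

If the assumed model matches the original analysis, the printed values should fall close to the estimates quoted in the text; any larger discrepancy would suggest that a different model was used.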

7. Experiment #4: Boiling vs. Sub-Boiling Hop Addition

7.1 Experiment #4: Experimental Overview
A comparison of the results from the first and second experiments indicates that covering or not covering the kettle can be responsible for a noticeable change (or lack of change) in hop flavor. The results of the third experiment suggest that the effect of covering the kettle is only a just-noticeable difference at a 10-minute steep time. Other than volatile hop oils evaporating with steam, another likely explanation for a change in hop flavor is a transformation of hop oils in contact with boiling wort. The fourth experiment tested the effect of wort temperature on hop flavor, comparing a 10-minute steep time at boiling (Condition L) with a 10-minute steep time at 170°F (77°C) (Condition M).

7.2 Experiment #4: Experimental Methods
This experiment was conducted using the same general methods as the previous three experiments. Dried malt extract was used to create 3.43 G (13.0 liters) of wort with pre-boil specific gravity 1.031. The first addition of Amarillo hops (AA rating 8.8%) was made at 45 minutes before flameout (in a weighted coarse-mesh bag). Condition L used 0.176 oz (5.0 g) of hops in the first addition and was identical with Condition J in Experiment #3. Condition M used 0.388 oz (11.0 g) of hops in the first addition. Both conditions had 0.811 oz (23.0 g) of Amarillo hops added in a weighted coarse-mesh bag at 10 minutes before flameout, and the kettle was covered during the final 10 minutes. In Condition L the wort was kept at boiling; in Condition M, the wort was cooled from boiling to 170°F (77°C) during the 11th minute before flameout using an immersion chiller, and the target temperature was maintained (to within a few degrees) during the final 10 minutes before flameout. Safale S-04 yeast was used for fermentation.

The perceptual experiment was conducted as described in Section 3.2. Conducting 24 tests with up to four tests per day, along with the 24 tests in the third experiment, took 17 days. As in the third experiment, tasting of each sample was spaced out by about 10 seconds and only small sips of water were taken between tastings to reset the palate.

7.3 Experiment #4: Results and Analysis
The measured IBUs were 25.1 for Condition L and 29.6 for Condition M. The difference between these IBU levels, 4.5, is below the perceptual threshold of 5 IBUs. These results indicate that the beers were not perceptually different in terms of bitterness.

The results of the perceptual test were that 5 out of the 24 tests were correctly identified, and of those correct responses, 3 times Condition L was preferred and 2 times Condition M was preferred. The p value associated with this response rate is 0.94 (not significant at a threshold of 0.05), and the likelihood ratio for no perceptual difference relative to a just-noticeable difference is 4.29. The low, maximum-likelihood, and high estimates of d' are 0.0, 0.0, and 0.8, respectively.

While it is not possible to conclude that two conditions are perceptually identical using significance testing with a null hypothesis of no difference, it would be difficult to get results that more clearly indicate no perceptual difference between the two conditions. Even random guessing would result in, on average, 8 of the 24 tests being correctly identified. The result of 5 correct responses is not so low that one should be concerned about experimental error, but low enough that the likelihood of there being no perceptual difference is more than four times greater than there being a just-noticeable difference. The maximum-likelihood estimate of d' is 0, indicating no perceptual difference. In short, there is no evidence that there is a perceptual difference between hops boiled for 10 minutes and hops kept at 170°F (77°C) for 10 minutes. I will abuse the mathematics a bit and conclude that a sub-boiling hop stand produces no noticeable difference in hop flavor, at least for a 10-minute steep time and these experimental conditions.
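The likelihood ratio quoted above can be sketched as the ratio of two binomial likelihoods: one with the probability of a correct response fixed at chance (no perceptual difference) and one with the probability implied by a just-noticeable difference. The sketch assumes a triangle test with a chance rate of 1/3, a just-noticeable difference of d' = 1.0, and the standard Thurstonian triangle model for converting d' to a probability of a correct response. These are assumptions chosen to illustrate the idea, and the function names are hypothetical.

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def triangle_pc(d_prime, upper=8.0, steps=800):
    """Probability of a correct triangle-test response for a given d'
    (assumed Thurstonian model), integrated with Simpson's rule on [0, upper]."""
    a = math.sqrt(3.0)
    b = d_prime * math.sqrt(2.0 / 3.0)

    def f(z):
        return 2.0 * (norm_cdf(-a * z + b) + norm_cdf(-a * z - b)) * norm_pdf(z)

    h = upper / steps  # steps must be even for Simpson's rule
    total = f(0.0) + f(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * f(i * h)
    return total * h / 3.0

def likelihood(correct, trials, pc):
    # The binomial coefficient is omitted because it cancels in the ratio.
    return pc**correct * (1.0 - pc) ** (trials - correct)

def lr_no_difference_vs_jnd(correct, trials, jnd_d_prime=1.0):
    """Likelihood of 'no perceptual difference' (pc = 1/3) divided by the
    likelihood of a just-noticeable difference (pc from the assumed d')."""
    pc_jnd = triangle_pc(jnd_d_prime)
    return likelihood(correct, trials, 1.0 / 3.0) / likelihood(correct, trials, pc_jnd)

# Experiment #4: 5 correct out of 24; compare with the reported ratio of 4.29.
print(round(lr_no_difference_vs_jnd(5, 24), 2))
```

A ratio above 1 favors no perceptual difference and a ratio below 1 favors a just-noticeable difference, so the reciprocal of `lr_no_difference_vs_jnd(11, 24)` can be compared with the ratio of 2.07 reported for the third experiment.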

8. Conclusions

8.1 Summary of Results
The results from these experiments indicate that hop flavor is lost primarily through evaporating steam while the hops are steeped in hot wort. After about 10 minutes of steeping there may be a just-noticeable difference in hop flavor; after about 20 minutes the difference may be more easily perceived. Flavor appears to be lost through the evaporation of hop oils, but it is possible that other factors also affect the flavor compounds over time.

The best-practice recommendation resulting from these experiments is to keep hops in boiling wort for as short a time as possible in order to preserve hop flavor, but a difference of 10 minutes or a decrease in wort temperature may not have a perceptible impact, especially with a covered kettle. This recommendation might be paraphrased as: minimize the time that the hops are in hot wort, but (in the words of Charlie Papazian) relax, don't worry, and maybe have a homebrew.

One potential concern with a covered kettle is the production of dimethyl sulfide (DMS), which then cannot be removed by evaporation. Most ales, however, "have DMS levels well below threshold" [Fix and Fix, p. 50]. Because the precursor S-methylmethionine (SMM) and DMS are reduced more at ale fermentation temperatures than at lager fermentation temperatures, "any hint of DMS in ales is likely from technical brewing errors, most notably contamination" [Fix, p. 75]. In lagers, the increase in DMS caused by a covered kettle can be counteracted with a longer (uncovered) boil time and/or faster wort cooling [Fix and Fix, pp. 50-51]. (The other option is to not worry about DMS and brew lager in the style of Rolling Rock [Bamforth, p. 18].)

8.2 Comments on Perceptual Testing
In general it was very difficult to tell the beers in these conditions apart, despite the nearly ideal testing conditions. This difficulty was compounded (or caused) by the first taste of a beer being the most perceptually distinctive and subsequent tastes of other samples having less sensory impact. There was therefore a balance between waiting long enough to reset the palate and not waiting so long that the specifics of the flavor were forgotten. Taking small sips of water or eating a tiny amount of dry bread to reset the palate in between tastings seemed to help, but in most cases the differences between conditions were very subtle (or nonexistent).

My general preference for the flavor obtained from a 1-minute steep time with Amarillo hops may or may not be shared by others. As a counterexample, my wife thinks that every IPA she has ever encountered tastes and smells disgusting. Another hop variety might yield different results. In short, your perceptions and preferences may be different from the results of these experiments.

9. Acknowledgment
I would like to sincerely thank Dana Garves at Oregon BrewLab for the IBU measurements in these experiments. Oregon BrewLab has been a pleasure to work with, and I can always rely on the accuracy of the measured values.

References

