Constraints on learning disjunctive, unidimensional auditory and phonetic categories

Heffner, Christopher C.; Idsardi, William J.; Newman, Rochelle S.

doi:10.3758/s13414-019-01683-x

Perceptual/Cognitive Constraints on the Structure of Speech Communication: In Honor of Randy Diehl
Published: 13 February 2019

Volume 81, pages 958–980, (2019)

Attention, Perception, & Psychophysics

Christopher C. Heffner (ORCID: orcid.org/0000-0002-5753-4543), William J. Idsardi & Rochelle S. Newman

Abstract

Phonetic categories must be learned, but the processes that allow that learning to unfold are still under debate. The current study investigates constraints on the structure of categories that can be learned and whether these constraints are speech-specific. Category structure constraints are a key difference between theories of category learning, which can roughly be divided into instance-based learning (i.e., exemplar only) and abstractionist learning (i.e., at least partly rule-based or prototype-based) theories. Abstractionist theories can relatively easily accommodate constraints on the structure of categories that can be learned, whereas instance-based theories cannot easily include such constraints. The current study included three groups to investigate these possible constraints as well as their speech specificity: English speakers learning German speech categories, German speakers learning German speech categories, and English speakers learning musical instrument categories, with each group including participants who learned different sets of categories. Both speech groups had greater difficulty learning disjunctive categories (ones that require an “or” statement) than nondisjunctive categories, which suggests that instance-based learning alone is insufficient to explain how participants learned the phonetic categories. This was true for both novices (English speakers) and experts (German speakers), which implies that expertise with the materials used cannot explain the patterns observed. However, the same was not true for the musical instrument categories, suggesting a degree of domain-specificity in these constraints that cannot be explained through recourse to expertise alone.

Learning a language requires learning phonetic categories. Speech sound tokens vary in their realization from speaker to speaker and from utterance to utterance, making it imperative for listeners to accommodate this variation when understanding speech (Lisker, 1985; McMurray & Jongman, 2011). Despite this variability, listeners readily group speech sounds together using labels that can extend to new instances. This process of categorization has important behavioral consequences. Theories of phonetic learning make different predictions about how these categories are acquired.

In the present set of experiments, two topics of interest are probed. First, we examine the extent to which there are constraints on the types of phonetic categories that are possible to learn. In doing so, we compare instance-based (also known as exemplar-only) theories of phonetic category learning (Hawkins, 2003; Johnson, 2007; Pierrehumbert, 2003) to abstractionist theories, which may take the form of decision-bound (Ashby & Townsend, 1986), prototype (Samuel, 1982), or multiple-system (Chandrasekaran, Koslov, & Maddox, 2014) theories of learning. Although both types of theory have an impressive array of evidence behind them, the focus in this article is on whether the learning process comes with any assumptions about the structure of categories, which the two sets of theories make different predictions about (Ashby & Waldron, 1999). Second, to the extent that such constraints exist, we investigate the domain-specificity of the constraints, comparing phonetic learning with nonspeech auditory learning.

We focus on the acquisition of disjunctive categories within a unidimensional stimulus set. Disjunctive categories require “or” statements to describe, as in “A temperature is uncomfortable if it is too hot or it is too cold.” They are a subset of discontinuous categories, which also include categories that span multiple parts of a category continuum without a different category between those parts of the continuum. These categories exist in a wide variety of real-world contexts. For example, in baseball, a strike is called when the batter hits the ball in foul territory, when the batter swings and fails to hit the ball, or when the batter fails to swing when the ball traverses the strike zone. This is a good example of a disjunctive category in multidimensional stimulus space, as it is challenging to imagine a single dimension along which these three types of actions could be considered continuous. Most speech learning tasks are assumed to include multiple dimensions; say, using patterns in F2 and F3 to characterize the acquisition of the /ɹ/–/l/ distinction by Japanese learners of English (Lotto, Sato, & Diehl, 2004).

For uncomfortable temperatures, on the other hand, the idea of describing these unidimensionally is more plausible. Temperature could be expressed in Celsius or Kelvin, with temperatures above a certain level or below a certain level being labeled under the single category of “uncomfortable.” Other examples come from music, where identical musical note labels (e.g., “E-flat,” “A”) are used to describe disjunctive categories spaced throughout the single dimension of pitch. An E-flat is an E-flat no matter which octave it occurs within. Similarly, notes can be perceived as off the musical beat if they occur too fast (if a performer is rushing) or too slow (if a performer is lagging behind), meaning that, across time, there is a span of times perceived as on the beat that are surrounded by notes that are off the beat.

Unidimensional, disjunctive categories are seemingly rare in phonetic space. In American English, the category /t/ can be realized as [t] (a voiceless alveolar stop, as in the word stop), [tʰ] (an aspirated alveolar stop, as in the word top), [ɾ] (an alveolar flap, as in the word potter), and even [ʔ] (a glottal stop, as in the word button). Although it is difficult to describe all of these realizations without using a disjunction, it is also likely that these sounds vary across multiple dimensions, not just one. In intonational phonology, pitch accents can be either high (usually annotated H*) or low (L*), but they flank a set of intermediate fundamental frequency points that are not perceived as pitch accents.

Under one set of theories—here referred to as instance-based models, although often referred to as exemplar or variationist models—listeners do not start with any assumptions about the nature of the categories being learned. Instance-based models see category learning as the result of the encoding of specific instance-to-category-label pairings. Category membership is determined only by the similarity between a new item and previously observed items. Probably the most widely used instance-based theory is the generalized context model (GCM) of Nosofsky (1986). According to the GCM, categorization is essentially a special class of item identification. Categorization requires calculating how closely a new item resembles previously identified ones, using the most similar items to that new item to make a hypothesis about its category label. Indeed, barring an inability to perceptually discriminate individual items, instance-based models can learn almost any category, even ones with very patchy distributions within the stimulus space (Ashby & Alfonso-Reese, 1995; McKinley & Nosofsky, 1995).

One particularly well-cited example of an instance-based theory within speech perception is that of Pierrehumbert (2003). Under Pierrehumbert’s (2003) model, speech sound categories are the collection of multiple memorized pairings of individual speech sound tokens (i.e., exemplars) to categories. New items that are fed into the system are simply compared with previously observed ones. The categories that the most similar previous items belong to are compared with one another, and the new item is paired with the category that has the most (and most similar) category connections. The /p/ category, then, is defined by the many specific instances of the /p/ sound that have been encountered on the part of a listener. The model of Pierrehumbert (2003) and its instance-based peers (Johnson, 2007) have explicitly been inspired by instance-based theories of visual category learning, especially the GCM (Nosofsky, 1986) and MINERVA (Hintzman, 1986; Homa, Cross, Cornell, Goldman, & Shwartz, 1973). Under instance-based speech perception theories, even very small phonetic details describing the differences between sounds can be critical, as the recollection of these fine phonetic details may distinguish between categories (Hawkins, 2003; Johnson & Seidl, 2008). This allows for straightforward accommodation of complex aspects of speech perception, especially sensitivity to speaker-specific variation in phonetic cues (Goldinger, 1998; Smith & Hawkins, 2012), as is shown through, for example, speaker-specific studies of perceptual adaptation (Dahan, Drucker, & Scarborough, 2008; Kraljic & Samuel, 2007; Norris, McQueen, & Cutler, 2003).

Abstractionist accounts, on the other hand, can more readily accommodate learner assumptions or prior beliefs about the structure of phonetic categories. Abstractionist accounts include decision-bound theories, prototype theories, and multiple-system theories, all of which include a layer of abstraction above and beyond the level of individual item-to-label mappings. The types of abstraction that are used within each model vary widely. Under decision-bound models, learners determine an abstract ideal boundary in perceptual space to delineate multiple categories. The boundaries need not necessarily be linear, although generally under decision-bound models the boundaries proposed are subject to processing constraints that discourage overly complex boundaries (Ashby & Gott, 1988; Ashby & Townsend, 1986; Maddox, Molis, & Diehl, 2002). For example, a stop in English might be classified as voiced if it has a voice onset time (VOT) of 35 ms or smaller, and voiceless if it falls above that boundary. Prototype models store categories as either a single prototype or set of multiple prototypes (Samuel, 1982). More recent formulations of prototype theories involve each speech category being formed from a mixture of Gaussian distributions (McMurray, Aslin, & Toscano, 2009; Toscano & McMurray, 2010) that abstract away from specific instances (although such mixture models inherently involve disjunctions, as categories are described as falling into one of many possible distributions).

One approach to abstractionist accounts of learning that has been gaining steam has been to propose the use of multiple systems in category learning. Motivated in part by multiple system accounts of memory (Squire, 2009), multiple-system accounts of category learning propose that listeners make use of both instance-based and rule-based systems. Under RULEX (RULes and EXceptions; Nosofsky & Palmeri, 1998; Nosofsky, Palmeri, & McKinley, 1994), learners first attempt to sort items into categories according to simple, linear rules, then attempt successively more complex rules until finally falling back on simple memorization of exceptions. Another dual-system model, COVIS (COmpetition between Verbal and Implicit Systems; Ashby, Alfonso-Reese, Turken, & Waldron, 1998), combines a familiar rule-based system with a second decision-bound system, albeit one that largely replicates instance-based learning. Dual-system models have been proposed at a variety of levels of analysis. The acquisition of morphosyntax (Ullman, 2004, 2016), lexical items (Davis & Gaskell, 2009; Lindsay & Gaskell, 2010), and phonetic categories (Chandrasekaran, Koslov, et al., 2014) have all been approached using multiple-system models that, by and large, rely on one rule-like system and one memorization-like system. This path has much in common with more recent approaches to phonetic category learning that incorporate neurobiological insights (Myers, 2014). It also provides a way to comfortably incorporate the impressive pool of evidence for the idea that listeners are acutely sensitive to fine phonetic detail in speech (Bybee, 2002; Hawkins, 2003; Hay, Nolan, & Drager, 2006; Pierrehumbert, 2002) with findings that seem to require a level of abstraction in phonetic processing (Pajak & Levy, 2014; Pycha, 2009, 2010).

Other approaches to finding a middle ground between instance-based and rule-based models do not rely on multiple systems. SUSTAIN, short for Supervised and Unsupervised STratified Adaptive Incremental Network (Love, Medin, & Gureckis, 2004), is one such example. Like single-system models, SUSTAIN does not explicitly represent two different pathways to learn categories, but the single system that is postulated forms “clusters” of stimuli that have similar category properties, resembling mixture models. When few clusters exist, the model’s behavior is said to take the form of a rule-based model. This model behavior might be seen in cases when categories are simple and easy to describe and, thus, when few stimulus clusters need to be posited. Characterizing a pitch as “low” or “high” is a good example of a category learning scenario that would require few clusters. If additional clusters are necessary, though, the model behaves more like an instance-based model, storing new clusters to accommodate the unusual exceptions in a way that resembles instance-based computation. The example of musical notes given earlier is a good example of a category learning scenario that might require many stimulus clusters; each instance of, say, B-flat would give rise to a cluster that would be assigned to the B-flat category.

Although instance-based and abstractionist theories can be described in mathematically interchangeable terms (Ashby & Maddox, 1993; Rosseel, 2002), we focus here on a key difference between these theories: the possibility of constraints on the structure of categories. The relevance of the structure of categories to the speed of learning has been a topic of interest from virtually the beginning of psychological studies of categorization. The classic study performed by Shepard, Hovland, and Jenkins (1961), for example, examined the acquisition of categories of simple geometric objects that varied in their size, shape, and shade. They found that more complex categories (i.e., categories that combined objects of disparate sizes, shapes, and shades) were harder to learn. Under abstractionist theories, these patterns are usually explained in terms of the complexity of the rules or the prototype structures that are necessary to describe the complex categories.

Under instance-based theories, the fact that complex categories are harder to learn results from interstimulus confusability or selective attention across dimensions (Nosofsky, Gluck, Palmeri, McKinley, & Glauthier, 1994). To explore why, consider the generalized context model (GCM) of Nosofsky (1986). The computational implementation of the GCM is fairly simple. The distance between a current item and previous ones is computed using a Gaussian distance function. This distance function is used to compute the weight that each item has toward categorizing the stimulus into any of the possible categories under consideration. The weighting is dependent on how confusable an item is with its neighbors. The new item is assigned to the category with the greatest summed weight. In this case, neither the exact labels chosen nor whether those labels are repeated across the stimulus space affects categorization, as the category labels themselves are only used as labels for items.
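The computation just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Nosofsky's (1986) implementation: the function names, the sensitivity parameter `c`, the use of continuum step numbers as positions, and the toy training set are all hypothetical choices made for the example.

```python
import math

def similarity(x, y, c=1.0):
    """Gaussian similarity that decays with the squared distance between stimuli.

    c is a hypothetical sensitivity parameter; larger values make
    neighboring stimuli less confusable.
    """
    return math.exp(-c * (x - y) ** 2)

def summed_weights(item, exemplars, c=1.0):
    """Sum the similarity of `item` to every stored exemplar, per category label."""
    weights = {}
    for position, label in exemplars:
        weights[label] = weights.get(label, 0.0) + similarity(item, position, c)
    return weights

def categorize(item, exemplars, c=1.0):
    """Assign `item` to the category with the greatest summed weight."""
    weights = summed_weights(item, exemplars, c)
    return max(weights, key=weights.get)

# Hypothetical training set: (continuum step, label) pairs on a 10-step continuum.
training = [(1, "red"), (2, "red"), (3, "red"),
            (8, "blue"), (9, "blue"), (10, "blue")]

print(categorize(4, training))  # the closest stored items carry the "red" label
print(categorize(7, training))  # the closest stored items carry the "blue" label
```

Nothing in this sketch inspects the labels beyond using them as dictionary keys, which is the sense in which the labels are "only used as labels": renaming them, or reusing one label in two separate regions, changes only which stored items pool their weights.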

However, when participants can easily discriminate individual tokens, instance-based theories allow for almost limitless flexibility in the end point of learning. As previous authors have testified, “all [tested] exemplar models predict that with enough training, subjects will respond almost optimally in any categorization task, no matter how complex” (Ashby & Alfonso-Reese, 1995, p. 227). For the GCM in particular, the model “basically predicts that given enough experience with training exemplars, participants’ response patterns should eventually approximate the underlying category distributions” (McKinley & Nosofsky, 1995, p. 145), with the trajectory of learning only hampered by interinstance discriminability. One way to test this idea is to provide multiple, well-separated pockets of instances. Instance-based theories have a hard time explaining differences that result from category structure when the stimuli fall along a single dimension with easily differentiable items. When instances are differentiable, the structure of the categories being learned should not affect the rate of learning; essentially any assignment of instances to categories should be relatively easy. Instance-based theories predict that an almost limitless number of categories could be learned. When instances are confusable, instance-based models predict that learners should find categories challenging to learn. However, rather than responding with specific patterns, participants should behave approximately according to chance, responding in line with a broad sample of items from the stimulus set.

Previous studies along similar lines have shown mixed patterns. Kingston (2003) examined the ability of English speakers to learn to contrast sets of German vowels that differed in vowel height, rounding, and tenseness. Although the English speakers were affected by exemplar-like effects when learning to categorize by tenseness (e.g., by showing better learning when more vowel contrasts were available in training), no such effects were found for vowel height and rounding, where an abstractionist theory of learning seems to be a better match for the results obtained. Intriguingly, when such patterns were tested in the phonotactic domain, examining learners’ abilities to pick up on regularities of phonological segment coincidence within words, learners did not find more complex categories monotonically more challenging to learn (Moreton, Pater, & Pertsova, 2017). This finding challenges some of the predictions of simple abstractionist theories within the speech learning literature. The results were replicated in the visual domain for participants learning to categorize varieties of cake. However, both studies used classes of segments that varied in a binary (either–or) fashion across multiple dimensions and, in the case of Moreton et al. (2017), relied on matching or mismatching sets of segments across a word rather than single segment categories. Examining multidimensional categories complicates the predictions of both abstractionist and instance-based theories. Further, categories in phonetic space (such as voicing categories) tend not to be discrete, instead relying on the categorization of instances in continuous space.

A second question of interest is whether the processes that underpin phonetic learning, and the constraints that might make some aspects of learning more challenging, are shared between speech and nonspeech domains. For most strictly instance-based accounts of phonetic category learning, the process of learning itself is no different from learning any other auditory object. The lack of constraints on the category structures that can be learned should be identical in speech and category learning elsewhere (Port, 2007, 2010). If constraints are uncovered, on the other hand, it is an open question whether these constraints are domain-specific. It could be the case that any constraints on phonetic learning are also found in the auditory modality more generally. Many erstwhile speech-specific properties have been found with other auditory objects (Diehl, Lotto, & Holt, 2004; Holt & Lotto, 2008), and many of the properties of phonetic categories can be explained with recourse to audition-general constraints alone (Diehl, 2000, 2008; Holt, Lotto, & Diehl, 2004). Yet speech must somehow be different from the rest; after all, speech is used as an input to broader language systems, such as syntax and semantics. The sound of jangling keys cannot become a part of a syntactic phrase (Poeppel, Idsardi, & van Wassenhove, 2008). For phonetic learning, the massive amount of experience with speech categories or even innate predispositions might lead learners to make different assumptions about the structure of categories within an unknown phonetic space. Alternatively, dealing with the likely very warped perceptual space in which phonetic categories are learned may require domain-specific dimensional processing. This makes experience a key component of the study of domain-specificity of constraints on phonetic learning.

Many studies examining the acquisition of disjunctive categories outside of the phonetics literature have focused on categories that are disjunctive across multiple dimensions. Although this does accurately reflect many kinds of disjunctive categories, this leaves open the question of what learners will do with categories that are disjunctive within a single dimension. Abstractionist and instance-based theories of category learning make predictions for unidimensional categories as well as multidimensional ones. Making predictions within unidimensional category spaces is simpler than doing so for multidimensional categories.

Four studies were used to test claims about the learnability of different category structures. Experiment 1 includes a test of the dimensionality and comparability of speech and nonspeech continua. Experiment 2 includes a set of three subcomponents, assessing either different populations or different stimuli. In Experiment 2a, we examined whether there are constraints on the acquisition of disjunctive speech sound categories, as assessed with English speakers learning categories of German speech sounds. In Experiment 2b, we tested whether these constraints would also apply to German speakers, experts with this dimensional space, who were learning to categorize sounds within the same set of items. And, finally, in Experiment 2c, we determined whether the constraints would also appear for a set of nonspeech sounds, using a synthetic musical instrument continuum. In each case, participants heard items from a 10-point continuum, with different points on the continuum associated with various colored squares (as category labels). Each participant completed one of six different learning conditions, with the categories being learned changing from condition to condition based on which sounds were paired with which squares.

In Experiment 2, six different intersubject conditions were used to probe the influence of category structure on learning (see Fig. 1). Two of them, the Normal and Shifted conditions, were easy to learn under every theory of category learning. Two of the conditions, meanwhile, were predicted to be much more challenging for participants to learn: the Odd One Out and Picket Fence conditions. Both conditions included a large number of disjunctions within the stimulus continuum. Although they both could theoretically be learned, given enough exposure, by a precise instance-based theory, interitem confusability would likely doom an instance-based model in practice. Including both hard and easy conditions allows for the calibration of the relative difficulty of conditions that should be intermediate in difficulty between the two sets.

The key conditions for distinguishing between abstractionist and instance-based learning accounts were the remaining two, the Neapolitan and Sandwich conditions. Both conditions involved two category boundaries along the continua, in the same locations, but the Sandwich condition included a disjunctive category, whereas the Neapolitan condition did not. Here, the theories make divergent predictions. Under instance-based theories of category learning, both conditions should be equivalently difficult: if the Neapolitan condition is challenging, the Sandwich condition should also be challenging. As previously mentioned, instance-based theories of category learning are very flexible, and the difficulty of categorization depends on interitem similarity. Because the Neapolitan and Sandwich conditions include equally confusable items and equally confusable boundaries, they should be equally easy to learn. No matter where a novel item fell within the speech sound continuum, the distances to adjacent and nonadjacent categories were the same across the conditions. Thus, it should have been equally difficult to identify individual items across the conditions, because the instances being sampled across the two conditions would be approximately identical; the only difference would be in the label of some of the tokens in that sample.
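The instance-based prediction can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: the boundary locations (between Steps 3 and 4 and between Steps 7 and 8), the label names, the sensitivity parameter `c`, and the use of a ratio choice rule in which the probability of a label is its share of the summed Gaussian similarity. The actual condition layouts are those shown in Fig. 1.

```python
import math

def p_correct(step, labels, c=1.0):
    """Probability of the correct label for `step` under a ratio choice rule.

    labels maps each continuum step (1-10) to its category label.
    Similarity is a Gaussian function of distance along the continuum.
    """
    total = 0.0
    correct = 0.0
    for other, label in labels.items():
        s = math.exp(-c * (step - other) ** 2)
        total += s
        if label == labels[step]:
            correct += s
    return correct / total

# Hypothetical layouts: both place boundaries between Steps 3-4 and Steps 7-8.
neapolitan = {s: ("A" if s <= 3 else "B" if s <= 7 else "C") for s in range(1, 11)}
sandwich = {s: ("red" if s <= 3 or s >= 8 else "blue") for s in range(1, 11)}

for step in range(1, 11):
    print(step,
          round(p_correct(step, neapolitan), 4),
          round(p_correct(step, sandwich), 4))
```

Under these assumptions the two columns are nearly indistinguishable, because the stored instances and their distances are the same in both layouts and only the labels on the outer steps differ. A reliable difficulty gap between the two conditions is therefore hard to obtain from similarity alone.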

The behavior of abstractionist models, meanwhile, depends on the treatment of the disjunctive red category. Many proponents of dual-system models, for example, have suggested that disjunctive categories may sometimes be learned using the rule-based learning system, rather than the instance-based one (Minda, Desroches, & Church, 2008; Zeithamova & Maddox, 2006), including in speech sound categories (Maddox et al., 2014). However, such ideas have generally been based on multidimensional stimuli, such as visual stimuli that depend on both shape and color, rather than on unidimensional stimuli more like the ones encountered in the present experiment. If both the Sandwich and Neapolitan conditions are processed using identical systems, they should both be equally easy to learn.

Other abstractionist approaches would suggest that the disjunctive, unidimensional category in the Sandwich condition should make it harder to learn than the Neapolitan condition, whose categories are all nondisjunctive. In the Rational Rules model of concept learning (Goodman, Tenenbaum, Feldman, & Griffiths, 2008), hypotheses take the form of rules. In learning scenarios that include nondisjunctive categories along a single dimension, these rules are formed from conjunctions or disjunctions of sets that describe parts of a dimension. Participants make responses in line with the small number of hypotheses that they are entertaining at any one point about the categories that they learn, with a small probability of responding incorrectly. Individual items also have the chance of being labeled as an outlier if they belong to a category unexpected by the rules currently under consideration. Simple rules are preferred to more complicated ones because the model places a strong prior on simple rules and, therefore, on simple categories. If listeners find the disjunctive Sandwich condition more difficult than the nondisjunctive Neapolitan condition, this would provide evidence for one of these types of abstractionist theories.
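To make the idea of a simplicity prior concrete, the following Python sketch counts how many "or" statements a category layout needs and penalizes each one. It is a deliberately simplified stand-in, not the grammar-based prior of Goodman et al. (2008): the function names, the exponential penalty, the strength `b`, and the layouts themselves are hypothetical.

```python
import math
from itertools import groupby

def intervals_per_category(labels):
    """Count the contiguous runs that each category occupies along the continuum."""
    counts = {}
    for label, _run in groupby(labels):
        counts[label] = counts.get(label, 0) + 1
    return counts

def disjunction_count(labels):
    """Number of "or" statements needed: one per extra run of any category."""
    return sum(n - 1 for n in intervals_per_category(labels).values())

def simplicity_prior(labels, b=1.0):
    """Unnormalized prior that shrinks with each required disjunction.

    b is a hypothetical penalty strength, not a parameter from the paper.
    """
    return math.exp(-b * disjunction_count(labels))

# Hypothetical layouts with boundaries in the same two places.
neapolitan = ["A"] * 3 + ["B"] * 4 + ["C"] * 3
sandwich = ["red"] * 3 + ["blue"] * 4 + ["red"] * 3

print(disjunction_count(neapolitan), simplicity_prior(neapolitan))
print(disjunction_count(sandwich), simplicity_prior(sandwich))
```

In this sketch the Neapolitan layout needs no disjunction while the Sandwich layout needs one for its "red" category, so any prior of this shape favors the Neapolitan layout even though both have boundaries in the same places.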

Comparing the subcomponents of Experiment 2 (speech stimuli in Experiments 2a and 2b vs. musical instrument stimuli in Experiment 2c, and English speakers in Experiments 2a and 2c vs. German speakers in Experiment 2b as participants) will allow for a comparison of the effects of expertise to the effects of the materials being used. Both groups of English speakers were equally unfamiliar with the stimuli being used, regardless of whether those stimuli were speech based or instrument based, when compared with the German speakers. Thus, any biases shared by the English speakers but not by the German speakers may reflect the influence of expertise (or the lack thereof) on category learning. Conversely, both the English and German speakers learning phonetic categories were learning sounds taken from the speech domain. This implies that any bias shared by the English and German speakers learning phonetic categories, but not shared by the English speakers learning instrument categories, may reflect the influence of speech-specificity.

Experiment 1: Stimulus properties

Method

Before discussing the acquisition of the auditory categories used in this project, the perceived properties of the stimuli that were being used had to be established. After all, any differences that would be found between the acquisition of phonetic and instrument categories could either be the result of differences in the processing of items inside and outside of language or simply due to differences in the discriminability or dimensionality of the stimuli. Two continua were created through weighted averaging: a speech continuum, used in Experiment 2a and Experiment 2b, and a musical instrument continuum, used in Experiment 2c. It was believed that both continua would be perceived in a unidimensional manner, showing stepwise increases in discriminability between stimuli of successively larger intervals along the continuum. To test this idea, and to determine the extent to which the two continua were well matched, participants performed a simple discrimination task to determine the distinctiveness of the stimuli.

Participants

Twenty-seven participants were recruited from Amazon’s Mechanical Turk crowdsourcing database. One participant was removed from analysis due to previous experience with German, leaving 26 native English speakers (seven female, 19 male). No participants were old enough that significant high-frequency hearing loss would be expected (M_age = 34.7 years, range: 25–47 years). Although participants were asked to use headphones, three participants reported using external or built-in speakers. Despite uncertainty about the precise qualities of the sound equipment that the participants used, previous studies using Mechanical Turk (Buxó-Lugo & Watson, 2016; Slote & Strand, 2016) have generally found Mechanical Turk to be an appropriate venue to run speech perception experiments.

Materials

To create the phonetic stimuli, materials from a previous study (Key, 2014) were used as a starting point. The [x] and [ç] end points of the palatal-to-velar continuum were excised from tokens produced by a native speaker of German, selected from a variety of recordings of [ç] and [x] in nonword frames. The now-isolated tokens, each 95-ms long, were cut at zero crossings, with the longer token cut in size to match the length of the shorter token, and the peak intensities of each file were scaled to an identical 0.9 Pa. The spectral content of these natural tokens was then linearly combined using Praat (Boersma & Weenink, 2001) to create a 10-step continuum, with intermediate points that entailed a linear combination of the acoustic noise that characterizes each fricative. Each intermediate step was therefore a weighted average of the energy found in each end point. The steps were numbered, with Step 1 defined as the most palatal item and Step 10 as the most velar item, with each intermediate number indicating the precise titration of the two end points.
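The weighted averaging behind such a continuum can be sketched as follows. This is an illustration only, not the Praat procedure the authors used: the function names are hypothetical, the evenly spaced weights are an assumption (the text does not state the exact weights for each step), and the end points are toy number lists standing in for spectral energy values.

```python
def blend_weights(n_steps=10):
    """Weight on the Step 1 end point for each step, evenly spaced from 1 to 0."""
    return [1 - i / (n_steps - 1) for i in range(n_steps)]

def make_continuum(end_a, end_b, n_steps=10):
    """Return n_steps signals, each a weighted average of the two end points.

    end_a and end_b are equal-length sequences of values; the first step
    equals end_a and the last step equals end_b.
    """
    if len(end_a) != len(end_b):
        raise ValueError("end points must have the same length")
    return [[w * a + (1 - w) * b for a, b in zip(end_a, end_b)]
            for w in blend_weights(n_steps)]

# Toy stand-ins for the energy of the palatal and velar end points.
palatal = [1.0, 0.0, 0.5]
velar = [0.0, 1.0, 0.5]

continuum = make_continuum(palatal, velar)
for step, values in enumerate(continuum, start=1):
    print(step, [round(v, 3) for v in values])
```

Because every value in a step is scaled by the same pair of weights, all of the acoustic differences between the end points change together from step to step, which is what makes the resulting continuum unidimensional by construction.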

To examine the acquisition of rich and acoustically complex nonlinguistic categories, we created a continuum of synthetic musical instrument sounds. This was done using the Wind Instruments Synthesis Toolbox or WIST (Rocamora, López, & Jure, 2009) and Praat (Boersma & Weenink, 2001). The WIST was used to create two 500-ms musical instrument notes, one synthesized from a trumpet template and one synthesized from a trombone template. Both notes were synthesized with identical fundamental frequencies and identical intensity properties; the only thing distinguishing the two notes was their timbre. The instrumental tokens were a great deal longer than the fricative stimuli due to the properties of the WIST. However, such differences would likely only impact the learning of each class of item insofar as the instrument items were differentiable to a different extent from the fricative items.

Next, the notes were spectrally rotated around a 4 kHz midpoint, a type of acoustic manipulation that redistributes information across frequencies in an acoustic signal. This spectral rotation was used to construct synthetic musical instruments that have much of the rich acoustic signature of brass instruments, but without a true connection to the instruments. The trumpet and trombone sounds were low-pass filtered to remove information above 8000 Hz, then spectrally rotated using two channels (split at 4000 Hz) to create two end points for the musical instrument continuum. That is, the intensity and spectral information found in the signal was mirrored around 4000 Hz, with, for example, points of relative prominence at 3500 Hz now being reflected in points of relative prominence at 4500 Hz. The long-term average spectrum of the original sound was then overlaid on the spectrally rotated sounds. This preserves the overall acoustic profile of the original brass sounds while putting a new spin on the relative prominence of different frequencies within the signal. This renders them analogous to the German fricative stimuli: acoustically complex and clearly instrumental, but unfamiliar. The end points of this continuum were labeled the “pettrum” and the “bonetrom,” respectively, and were peak scaled to ensure their intensities matched. Next, Praat was used to linearly combine the two end points to make a 10-step continuum. As with the speech stimuli, this was accomplished through use of spectral blending: each point along the continuum represented a linear combination of the two end-point signals. The pettrum end was arbitrarily labeled Step 1, while the bonetrom end was labeled Step 10.
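The mirroring at the heart of spectral rotation can be sketched in a few lines of Python. This is only an illustration of the frequency mapping, not the signal-processing chain used to build the stimuli. The function names are hypothetical, and the spectrum is assumed to be a list of magnitudes sampled at evenly spaced frequencies from 0 to 8000 Hz.

```python
def rotate_frequency(freq_hz, midpoint_hz=4000.0):
    """Mirror a frequency around the midpoint within the 0 to 2 * midpoint band."""
    band_top = 2 * midpoint_hz
    if not 0 <= freq_hz <= band_top:
        raise ValueError("frequency lies outside the rotated band")
    return band_top - freq_hz


def rotate_spectrum(magnitudes):
    """Mirror a magnitude spectrum sampled evenly from 0 Hz to the band top.

    With evenly spaced bins, reflecting around the midpoint amounts to
    reversing the order of the bins.
    """
    return list(reversed(magnitudes))
```

For example, `rotate_frequency(3500)` returns `4500.0`, matching the example in the text, and the 4000 Hz midpoint maps onto itself.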

In using unidimensional, flat continua, a level of validity was sacrificed. Flat distributions with perfectly covarying cues are not typical for acoustic categories, particularly ones with only 10 items. Studies of cue trading in phonetics, for example, have shown that many, if not all, phonetic contrasts are signaled with a wide variety of cues, all capable of combining in many different ways to yield a coherent percept (Repp, 1982). The rich trade-offs between these cues were not available in the present data set. In this case, weighted averaging means that whatever cues listeners use to perceive the differences between the end points are completely and inextricably correlated. These continua therefore provide an avenue to measure the perception and acquisition of simple auditory categories, akin to unidimensional voice onset time (VOT) continua used to examine the perception of word-initial voicing. Spectral slices of the midpoint of each stimulus are available in Fig. 2.

Procedure

Participants heard two blocks of trials: one with the speech stimuli, the other with the musical instrument stimuli. The order of the blocks was counterbalanced across participants. Within each trial, participants heard two paired stimuli from one of the continua, back-to-back, with a 500-ms interstimulus interval (ISI). With 10 possible stimuli as both the first and second item, there were 100 possible ordered pairs per continuum. Participants heard all 100 pairs exactly once and, after each pair, rated how similar the two items were on a scale from 1 to 9.
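The count of 100 trials per continuum follows from pairing each of the 10 steps with every step, including itself, in both orders. A minimal Python sketch of that enumeration is given below. The function name is hypothetical, and nothing here reflects the actual trial order, which is not specified.

```python
from itertools import product


def ordered_pairs(n_steps=10):
    """List every (first, second) pairing of continuum steps, numbered from 1.

    Identical pairs such as (1, 1) are included, and (3, 7) and (7, 3)
    count as different trials because presentation order differs.
    """
    steps = range(1, n_steps + 1)
    return list(product(steps, repeat=2))


pairs = ordered_pairs()
```

Here `len(pairs)` is 100, consistent with the number of pairs reported for each continuum.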

Analysis

The similarity judgments for each participant were converted into difference scores, ranging from 0 (not different) to 8 (most different). These difference scores were used to create a 10 × 10 symmetric data matrix for each participant, with each row and each column being a step within the continuum. These symmetric data matrices were analyzed using the IDIOSCAL (Individual Differences in Orientation Scaling) functionality of the “smacof” package within R (Mair, De Leeuw, Borg, & Groenen, 2016). IDIOSCAL is a generalization of Individual Differences Scaling, INDSCAL (Carroll & Chang, 1970), which has been used extensively in the category learning literature; for example, to determine naïve listeners’ parcellation of Mandarin tone categories (Chandrasekaran, Sampath, & Wong, 2010) or to examine the effects of training on categorical perception (Livingston, Andrews, & Harnad, 1998). In INDSCAL and IDIOSCAL, dimensionality analysis requires fitting solutions at multiple candidate dimensionalities, n. For each dimensionality, an n × 10 matrix is generated, showing the coordinates of each stimulus step in an n-dimensional space. Traditionally, the approach to determine the best number of effective dimensions is to calculate badness-of-fit measures for each n and to look for an “elbow,” a point at which additional possible dimensions do not lead to appreciable drops in badness ratings.
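The conversion from ratings to a symmetric difference matrix can be sketched in Python as follows. Two details are assumptions rather than reported facts: that a rating of 9 meant "most similar" (so that difference = 9 − rating), and that the two presentation orders of a pair were averaged to make the matrix symmetric. The function names are hypothetical, and the scaling itself (carried out with smacof in R) is not reproduced here.

```python
def difference_score(similarity, scale_max=9):
    """Convert a 1-9 similarity rating to a 0-8 difference score.

    Assumes scale_max means "most similar"; the source does not say
    which end of the rating scale was which.
    """
    return scale_max - similarity


def symmetric_difference_matrix(ratings, n_steps=10):
    """Build an n_steps x n_steps symmetric matrix of difference scores.

    `ratings` maps (first_step, second_step), numbered from 1, to a
    similarity rating. Averaging the two presentation orders is an
    assumption about how symmetry was obtained.
    """
    matrix = [[0.0] * n_steps for _ in range(n_steps)]
    for i in range(1, n_steps + 1):
        for j in range(i, n_steps + 1):
            d_ij = difference_score(ratings[(i, j)])
            d_ji = difference_score(ratings[(j, i)])
            mean_diff = (d_ij + d_ji) / 2
            matrix[i - 1][j - 1] = mean_diff
            matrix[j - 1][i - 1] = mean_diff
    return matrix
```

Each participant would contribute one such matrix per continuum, and the set of matrices would then be passed to the individual-differences scaling routine.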

Results and discussion

Participants by and large perceived both continua as unidimensional. Table 1 shows averaged difference scores for each pair of items.

Table 1 Average difference score for each pair of stimuli, averaged across participants and orders, for the fricative and instrument items

Figure 3 shows a scree plot with badness-of-fit values across different possible dimensionalities. Higher stress values indicate larger badness of fit. The lines do not show a clear “elbow”; badness of fit decreases gradually across the possible dimensionalities for both continua. Although the largest numeric difference across dimensionalities occurs between one and two dimensions (0.049 for the fricatives, 0.076 for the instruments), that difference is not particularly large nor much bigger than the next largest difference, between two and three dimensions (0.030 for the fricatives, 0.033 for the instruments). There does not appear to be a reason to reject the unidimensional interpretation of the continua.
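One informal way to gauge how sharp an elbow is would be to compare successive drops in stress. The Python sketch below does this using only the drops reported above, since the raw stress values are not given in the text. The ratio is an illustrative heuristic with a hypothetical name, not a criterion used in the analysis.

```python
# Drops in stress reported in the text, as
# (1 -> 2 dimensions, 2 -> 3 dimensions). Raw stress values are not given.
REPORTED_DROPS = {
    "fricatives": (0.049, 0.030),
    "instruments": (0.076, 0.033),
}


def drop_ratio(first_drop, second_drop):
    """Ratio of one drop in stress to the next; larger values mean a sharper elbow."""
    if second_drop <= 0:
        raise ValueError("second drop must be positive")
    return first_drop / second_drop


ratios = {name: drop_ratio(*drops) for name, drops in REPORTED_DROPS.items()}
```

A sharply defined elbow would show up as a first drop many times larger than the second. The `ratios` dictionary holds the corresponding values for each continuum.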

To the extent that the stimulus similarity ratings did not conform to a unidimensional distribution, participants generally found the end points of the continuum to be more similar to each other than would be expected given a uniform progression from most to least similar items. This is an interesting contrast to the anchor effects found in other domains, where, for example, studies of intensity discrimination have shown that intensities closer to the end points of a continuum are more easily discriminable than those in the middle (Braida et al., 1984). The relative similarity of the end points can be seen in Fig. 4, which shows the one-dimensional and two-dimensional IDIOSCAL solutions. Although the scree plots of Fig. 3 do not provide conclusive evidence that the unidimensional interpretation of this continuum should be rejected, it is important to note that Dimension 2 in Fig. 4b is scaled to about half the width of Dimension 1. Although the interpretation of absolute differences between the end points of single dimensions is not entirely one to one with estimates of variability explained by that dimension, this indicates that Dimension 2 may be doing important work in the two-dimensional solution obtained in this study.

The dimensions revealed in Fig. 4 are roughly comparable, with slightly lower interitem discriminability on the extreme ends of the continuum for the instrument continuum compared with the fricative continuum. In Fig. 4a, this corresponds to the one-dimensional solution showing bunched-up items on the bonetrom end of the continuum for the instruments to an extent not found in the fricatives. Some portion of these differences between the stimuli may be related to individual differences in the familiarity of the items to the listeners. Although we did not explicitly ask about musical training as a part of our demographic survey, differences in experiences with musical stimuli may have led to differences in the perceived properties of the stimuli.

In Fig. 4b, one possible interpretation is that the two dimensions roughly correspond to the position of the stimulus along the continuum (Dimension 1) and to whether the stimuli are extreme members of the continuum or fall somewhere in the middle (Dimension 2). Although IDIOSCAL is naïve to the true nature of the putative dimensions uncovered, Dimension 1 could be said to show how English-speaking participants sorted the items into categories, whereas Dimension 2 could show the level of certainty that the participants had in that label (with higher values indicating increasing certainty). Regardless of the interpretation of the dimensions, the two-dimensional IDIOSCAL solution shows the “extremely palatal” or “extremely pettrum” items (Steps 1–3) and the “extremely velar” or “extremely bonetrom” items (Steps 8–10) as less distant from each other than one would expect based on stimulus step alone. In general, the items are classified similarly across the sets of stimuli, with roughly equal distances from step to step across the two continua. This suggests that comparing the two continua is appropriate.