Parallel-Distributed-Processing Approach to English Word Reading

  1. PDP Approach in General
  2. PDP Approach Applied to Language Study
  3. Word Production Simulation

1. PDP Approach in General
Parallel Distributed Processing (PDP) is a relatively new approach to the study of psychological phenomena. Most traditional psychological theories postulate a serially ordered mechanism to account for aspects of human cognition. In contrast, the PDP approach assumes that people understand through the interplay of multiple sources of knowledge, and that the parts of the underlying mechanism therefore interact with one another simultaneously. The PDP approach draws heavily on what we know about the human neural system. Specifically, it proposes a network of interconnected information-processing units as a mechanistic account of human cognition. Each unit carries some aspect of the information being processed and may stand for either a concrete object (such as a feature, letter, or word) or a more abstract element. Units in the network influence other units and are at the same time influenced by them. Information processing takes place through the interaction among these units.

Figure 1 illustrates the basic configuration of a PDP network system. A typical PDP model is made up of a large number of processing units. Units are connected to one another to form a pattern of connectivity. Each connection between two units carries a weight that specifies how the output of the first unit feeds into the second unit as input. A connection between two units is excitatory if the weight is positive and inhibitory if the weight is negative; the absolute value of the weight determines the strength of the connection. At each point in time, each unit receives input from a number of other units. The overall input from all these sources is determined by a propagation rule. This net input is then combined with the unit's current activation state to produce a new state of activation according to a threshold function. Finally, the connection weights between units are modified with experience, so the system evolves by changing its pattern of connectivity.

Figure 1. The basic components of a PDP network
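To make these mechanics concrete, the sketch below shows how a single unit might compute its new activation state. The weighted-sum propagation rule and the logistic threshold function are common choices in PDP models, and the particular numbers are made up for illustration; this is a minimal sketch, not the computation any specific model performs.

```python
import math

def net_input(activations, weights):
    """Propagation rule: the weighted sum of the sending units' outputs.
    Positive weights are excitatory, negative weights inhibitory."""
    return sum(a * w for a, w in zip(activations, weights))

def new_activation(net):
    """Logistic threshold function mapping net input to a new
    activation state between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-net))

# A unit receiving input from three other units:
senders = [0.9, 0.2, 0.7]    # current activations of the sending units
weights = [0.5, -1.2, 0.8]   # weights on the connections into the unit
print(new_activation(net_input(senders, weights)))  # about 0.68
```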


The PDP approach differs from conventional modeling of human cognition in two important respects. First, in PDP models knowledge is stored in the pattern of connectivity (the connection strengths) between units, so that any given piece of knowledge is represented by a pattern of activity distributed over many processing units. This distributed representation contrasts with the one-unit-one-concept representational scheme of traditional psychological theories. Second, in the PDP approach each processing unit acts on and is simultaneously acted on by the other units. Computation takes the form of cooperative and competitive interactions among a large number of processing units, so information processing happens in parallel; there are no distinct processing stages as there are in many conventional models.

The PDP approach has considerable appeal to cognitive psychologists. The mechanisms it proposes to account for various aspects of human cognition, such as perception, reading, learning, and memory, are computationally sufficient and, to a certain degree, psychologically accurate. The current section demonstrates how the PDP approach can be applied to the study of human language production.


2. PDP Approach Applied to Language Study

Reading aloud from print is a heavily researched topic. The pronunciations of most English words adhere to standard spelling-sound correspondences (e.g., DIVE, MINT), whereas others deviate considerably from them (e.g., GIVE, PINT). Nonetheless, skilled readers are able to read both types of words quickly and correctly. What mechanism, then, might be involved when a reader faces these two different types of words?

One general class of models accounts for this phenomenon with what is called a "dual-route architecture." The fundamental property of these models is that skilled readers have at their disposal two different procedures for converting print to speech. If the reader already knows the word, its pronunciation is retrieved by looking it up in an internal lexicon, where an entry containing both the printed form and the corresponding pronunciation of the word has been stored. If, however, the reader encounters a letter string never seen before, such as a novel word or a pronounceable nonword, a nonlexical route is taken. This route requires the reader to apply a system of rules specifying the relationship between letters and sounds in English. In summary, the central tenet of dual-route models is that different mechanisms, each responding to a different type of input and operating according to fundamentally different principles, underlie the process known collectively as reading.
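As a rough illustration of the dual-route idea, here is a minimal sketch. The two-entry lexicon and the one-letter-per-phoneme rule table are hypothetical stand-ins; actual dual-route models use a full lexicon and a far richer rule system.

```python
# Toy illustration of the dual-route idea. The lexicon and rule table
# below are hypothetical stand-ins, not any model's actual contents.
LEXICON = {"PINT": "/paInt/", "GIVE": "/gIv/"}   # known words: lexical route

def gpc_rules(letters):
    """Nonlexical route: assemble a pronunciation letter by letter from
    grapheme-phoneme correspondence rules (vastly simplified here)."""
    rules = {"P": "p", "I": "I", "N": "n", "T": "t", "B": "b"}
    return "/" + "".join(rules.get(ch, "?") for ch in letters) + "/"

def read_aloud(letters):
    if letters in LEXICON:        # familiar word: look it up
        return LEXICON[letters]
    return gpc_rules(letters)     # novel string: apply the rules

print(read_aloud("PINT"))   # exception word, read via the lexicon
print(read_aloud("BINT"))   # nonword, read via the rules
```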

Although dual-route models have gained popularity among psychologists, many researchers have adopted the alternative PDP approach to studying the same phenomenon. In contrast to dual-route models, this approach assumes that there is a single, uniform procedure for computing a phonological representation from an orthographic representation, and that this mechanism applies to a variety of input, including exception words and nonwords as well as regular words. Within such a research paradigm, the microstructure of the cognitive processing takes the form of a PDP network. Within the system, the units at the input level receive information about the orthography of the word, and the units at the output level generate its phonology. The input and output units are connected through intermediate, hidden units. The system learns by adjusting the weights on the connections between units in a way that is sensitive to the contingency between the statistical properties of the environment and the behavior of the network. As a result, there is no sharp distinction between different types of input. Rather, all words (with their respective orthographic and phonological forms) co-exist within a single system whose representations and processing reflect the relative degree of consistency in the mappings for different words.

A number of PDP systems have been developed to simulate the process of reading aloud from print (Seidenberg & McClelland, 1989; Plaut, McClelland, Seidenberg & Patterson, 1996), each with varying degrees of success in accounting for empirical data. To this day there is still controversy among psychologists as to whether the PDP approach is a viable alternative to dual-route models. However, it has also become evident that this approach possesses two important features that dual-route models currently lack: the model is computational, and it learns. By providing a rich set of general computational principles, and by specifying how such processes might be learned, the PDP approach offers a useful way of thinking about human performance in this particular domain.

Plaut et al. (1996) reported a study in which they developed a PDP network to simulate English word reading in skilled readers. Their work was based on a careful and thorough linguistic analysis of a large set of monosyllabic words that has been extensively used in empirical reading experiments (Glushko, 1979). Specifically, these researchers condensed the spelling-sound regularities among 2998 monosyllabic words and implemented these regularities directly in the PDP network. A monosyllabic English word, by definition, contains only a single vowel and may contain both an initial and a final consonant cluster. Because of the structure of the articulatory system, there are strong phonotactic constraints within the initial and final consonant clusters: in both cases a given phoneme can occur only once, and the order of phonemes is considerably constrained. For example, if the phonemes /p/, /h/ and /r/ all appear in the initial consonant cluster, they can appear only in the order /phr/. As a result, if, at both the input and output levels, three groups of units are designated for the initial consonant cluster, the vowel, and the final consonant cluster respectively, only a small amount of replication is needed to provide a unique representation of virtually every monosyllabic word in the training corpus.
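The slot-based scheme can be sketched as follows. The grapheme inventories below are a small hypothetical fragment, not Plaut et al.'s actual unit sets, but they show how fixing the order of the onset, vowel, and coda groups yields a unique pattern for each monosyllable.

```python
# Illustrative fragment of the onset/vowel/coda slot scheme. These
# grapheme lists are a hypothetical subset, not the actual 105 units.
ONSETS = ["B", "D", "G", "M", "P", "PH", "R", "S", "T"]
VOWELS = ["A", "E", "I", "O", "U", "EE"]
CODAS  = ["D", "F", "NT", "T", "VE"]

def encode(onset, vowel, coda):
    """Return a binary input vector with one slot group per cluster.
    Phonotactic ordering is built into the fixed ordering of the slots,
    so each monosyllable gets a unique pattern."""
    vec = [0] * (len(ONSETS) + len(VOWELS) + len(CODAS))
    vec[ONSETS.index(onset)] = 1
    vec[len(ONSETS) + VOWELS.index(vowel)] = 1
    vec[len(ONSETS) + len(VOWELS) + CODAS.index(coda)] = 1
    return vec

print(encode("M", "I", "NT"))   # the word MINT
```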

The PDP network that Plaut et al. developed consists of three layers of processing units. The input layer contains 105 orthographic units, each representing a grapheme. The output layer contains 61 phonological units, one for each phoneme. Phonotactic constraints are expressed by grouping the phoneme units into mutually exclusive sets and ordering these sets from left to right, in accordance with the left-to-right ordering constraints imposed within consonant clusters. Between the two layers there is an intermediate layer of 100 hidden units. To best accommodate the human subject data, the model was implemented with various network specifications. In the first simulation, a simple feedforward structure was adopted, so that the network only maps from orthography to phonology. In a second simulation, in addition to the feedforward structure, each phoneme unit was also connected to every other phoneme unit as well as back to the hidden units. Plaut et al. reported that the simulations generated results very similar to those obtained in empirical studies. First, the network, once trained, was able to read nonwords in much the same way as human subjects do. Second, the simulations showed the effects of frequency and consistency that are often found in reading experiments. Third, by selectively damaging the network, the model was able to generate results that resemble the reading performance of patients with various forms of dyslexia.


3. Word Production Simulation

This simulation is based on Plaut et al.'s (1996) PDP network structure. It consists of three layers of processing units. The input layer is made up of 105 units, each representing a distinct orthographic unit (grapheme). The output layer has 61 units, each corresponding to a specific phonological representation (phoneme). A third, hidden layer of 100 processing units mediates between the input and output layers: all the input units feed into each hidden unit, and the hidden units in turn feed into each of the output units. This is therefore a simple feedforward network, and it maps only from orthography to phonology. The network architecture is shown in Figure 2. The specifications of the input (grapheme) and output (phoneme) layers can be viewed by clicking the "Network Structure" button in the simulation.

Figure 2. The network architecture of the current simulation
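A minimal sketch of this feedforward architecture appears below. The layer sizes come from the simulation itself; the logistic activation function, the small random initial weights, and the omission of bias terms are simplifying assumptions, not the simulation's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the simulation: 105 grapheme units, 100 hidden units,
# 61 phoneme units. The small random initial weights are an assumption.
N_IN, N_HID, N_OUT = 105, 100, 61
W_ih = rng.normal(0.0, 0.1, size=(N_IN, N_HID))   # input -> hidden
W_ho = rng.normal(0.0, 0.1, size=(N_HID, N_OUT))  # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(orthography):
    """Propagate a 105-element grapheme vector through the network,
    returning 61 phoneme activations (orthography -> phonology only)."""
    hidden = sigmoid(orthography @ W_ih)
    return sigmoid(hidden @ W_ho)

phonology = forward(rng.integers(0, 2, size=N_IN).astype(float))
print(phonology.shape)  # (61,)
```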

The simulation is run in two stages. The first is a training stage, during which the network is exposed to a training set to learn the correspondences between words and their pronunciations. This process is repeated until the network is able to read the words on its own. Three parameters are associated with network training: Learning Rate, Momentum, and Number of Training Epochs. The first two parameters determine how the connection weights are modified during training, and the number of epochs needed to train the network depends on their combination. In the simulation contained here, each parameter comes with a default that was empirically tested and found to work well. You are encouraged to try other combinations of parameters to see how they affect training. Because of the network configuration, training demands a tremendous amount of computation and may take quite some time to complete. If you wish to bypass this stage, you can take a shortcut: clicking the "Shortcut" button in the simulation automatically loads the weights from a previous simulation into the network.
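The role of the first two parameters can be sketched as a standard gradient-descent weight update with momentum. Whether the simulation uses exactly this rule is an assumption, and the default values below are placeholders rather than the simulation's actual defaults.

```python
import numpy as np

LEARNING_RATE = 0.05  # scales the size of each weight change (placeholder value)
MOMENTUM = 0.9        # fraction of the previous change carried over (placeholder)

def update_weights(weights, gradient, prev_delta):
    """delta_w = -learning_rate * gradient + momentum * previous delta_w.
    A larger learning rate takes bigger steps; momentum smooths the
    trajectory of the weights and can speed up convergence."""
    delta = -LEARNING_RATE * gradient + MOMENTUM * prev_delta
    return weights + delta, delta

# One update step for a toy 2 x 2 weight matrix:
w = np.zeros((2, 2))
grad = np.array([[0.1, -0.2], [0.0, 0.3]])
w, delta = update_weights(w, grad, np.zeros_like(w))
```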

The essence of training the network is to expose it to a large corpus of words and their pronunciations so that the connections among the processing units eventually capture the statistical properties of the orthography-phonology correspondence in English. The network learns by optimizing its weight pattern so as to best reflect the statistical information conveyed in the training set. Throughout the training session, the connection weights between different layers of units can be viewed by clicking the "Weight Graph (from input to hidden)" or "Weight Graph (from hidden to output)" button.

The second stage of the simulation is a testing stage, during which the network reads lists of words (or nonwords) and its output is compared with the correct pronunciations. At issue is whether the network, once trained, can read a large corpus of words and pronounceable nonwords as well as skilled readers can. Seven testing sets are used in the simulation. The first is the training set itself; the network is considered fully trained only when it can correctly pronounce all the words in the original training set. The next four testing sets are adapted from Taraban and McClelland's (1987) experiment and consist of high-frequency consistent words, low-frequency consistent words, high-frequency exception words, and low-frequency exception words, respectively. They are included to test whether the network can replicate the effects of frequency and consistency on naming latency. For this purpose, the average cross entropy, a measure of the difference between the network's generated pronunciation and a word's correct pronunciation, is used as an analogue of naming latency. You should explore whether there is an interaction between frequency and consistency in the network's output.

The simulation also includes two lists of nonwords, adapted from an experiment by Glushko (1979), as testing sets. One list consists of consistent nonwords derived from regular words (e.g., BEED from BEEF), and the other consists of inconsistent nonwords derived from exception words (e.g., BINT from PINT). Glushko reported that human subjects in his experiment read the consistent nonwords with an accuracy of 93.8%, but the inconsistent nonwords with an accuracy of only 78.3%. In a previous simulation, we found that the network correctly pronounced 90% of the consistent nonwords but only 78% of the inconsistent nonwords. You should compare the network's output on these two nonword lists. In addition to these seven test sets, the simulation also allows for testing single words (or nonwords): you can provide input to the trained network by typing in an individual letter string, and the pronunciation generated by the network will be displayed below the input entry.
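A common form of cross entropy for units with binary targets is sketched below; whether the simulation computes it in exactly this form is an assumption. It assumes the 61-element output vector from the earlier architecture sketch.

```python
import numpy as np

def cross_entropy(output, target, eps=1e-7):
    """Cross entropy between the network's 61 phoneme activations and the
    binary target pronunciation. Larger values mean the output is farther
    from the correct pronunciation (the analogue of a slower naming time)."""
    output = np.clip(output, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(target * np.log(output)
                   + (1.0 - target) * np.log(1.0 - output))
```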

One advantage of PDP networks over serial-order models of psychological phenomena is that they allow for graceful degradation of performance. The current simulation demonstrates this by lesioning the network. When the network is lesioned (usually after being trained), a certain percentage of the hidden units are disconnected from the input and output units; a certain portion of information is thereby lost, and the network's performance deteriorates. In the current simulation, once the percentage for the lesion is chosen, the network randomly selects the hidden units to be removed. As a result, the network's performance after lesioning always has a random character: lesioning the network twice with the same percentage does not yield identical results. Generally speaking, however, the more the network is lesioned, the less accurate its performance.
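One way to implement such a lesion, assuming the weight matrices W_ih and W_ho from the earlier architecture sketch, is to zero out all connections into and out of a randomly chosen subset of hidden units:

```python
import numpy as np

def lesion(W_ih, W_ho, proportion, rng=None):
    """Disconnect a random subset of hidden units by zeroing all of their
    incoming and outgoing weights. Because the units are chosen at random,
    repeated lesions at the same percentage give different results."""
    rng = rng or np.random.default_rng()
    n_hidden = W_ih.shape[1]
    removed = rng.choice(n_hidden, size=int(round(proportion * n_hidden)),
                         replace=False)
    W_ih, W_ho = W_ih.copy(), W_ho.copy()
    W_ih[:, removed] = 0.0  # cut connections from the input units
    W_ho[removed, :] = 0.0  # cut connections to the output units
    return W_ih, W_ho
```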

The current simulation comes with two sets of training words. The first set, consisting of 2998 words, is adopted from Plaut et al.'s (1996) study. Under ideal training conditions, it takes about 300 epochs to fully train the network on this set. For demonstration purposes, a second, condensed set was constructed, containing only 200 words randomly chosen from the original set. With an optimal combination of training parameters, this set takes much less time to train. Accordingly, the two nonword testing sets accompanying the condensed training corpus consist of only part of the corresponding originals. Because the condensed set is a much reduced training corpus, you should expect the network to perform somewhat differently than when it is trained on the original set. However, our previous simulation results showed that even with this limited training set, the network demonstrated the effects of frequency and consistency on naming latency, as well as differential accuracy in pronouncing consistent and inconsistent nonwords.
