Some thoughts on statistical fallacies and language learning
I must say that the litany of statistical fallacies I enumerated in my review of Stanovich surprised me. I had imagined that a lot of the trouble people have with statistics emerged from the formal framework that has been built up around it: technical terms such as mean, median, mode; t-tests, ANOVA, that sort of thing. Much of the trouble, however, is intuitive. Pose people an informal question that involves none of the machinery of formal statistics, and they can still get it wrong.
Let's take a quick run through the fallacies that Stanovich discusses:
(1) "person-who" arguments
(2) discounting base rates
(3) failure to use sample size information
(4) the gambler's fallacy
(5) thinking coincidences are more "miraculous" than they are (related to (2) and (3))
(6) discounting incidences and seeing only coincidences
(7) trying to get it right every time, even when it's better to be wrong sometimes
(8) the "conspiracy theory" effect - seeing patterns where there are none
(9) the illusion of control
This intuitive mishandling of statistics is surprising to me, because I'm a firm believer in the theory of statistical language learning. By this I mean that a lot of what we learn (not necessarily all, but a large part) comes from observing how frequently certain words occur, how some words occur only with certain other words, and how often words occur in a given context. Nor do I believe that this statistical learning is reserved for the task of language learning: as I understand it, a lot of work on our visual systems has shown that our perception of the world depends greatly on statistical considerations.
[Aside: This doesn't mean that I think linguistic knowledge consists of a big bunch of statistics. I still think that linguistic knowledge is rule-based, it's just that we use statistical learning to infer these rules.]
The most famous demonstration of statistical language learning comes from Saffran et al.'s 1996 article "Statistical learning by 8-month-old infants" [Google cache; link unstable] in Science. Saffran et al. reasoned that
...[o]ver a corpus of speech there are measurable statistical regularities that distinguish recurring sound sequences that comprise words from the more accidental sound sequences that occur across word boundaries...Within a language, the transitional probability from one sound to the next will generally be highest when the two sounds follow one another within a word, whereas transitional probabilities spanning a word boundary will be relatively low...For example, given the sequence pretty#baby, the transitional probability from pre to ty is greater than the transitional probability from ty to ba...
Transitional probabilities, therefore, can be employed in the task of word segmentation. They then demonstrated this by exposing babies to a string of meaningless syllables without prosodic information, for example bidakupadotigolabubidaku..., where bidaku, padoti and golabu are the "words". After listening to this for about two minutes, the children were presented with "words" and "non-words", where the "non-words" contained the same syllables but not in the right order. The babies could distinguish between them, listening longer to the non-words. They could also distinguish between "words" and "part-words", in which the syllables were presented in the correct order but bridged the "word boundaries", for example kupado.
As the babies were given no information other than the string of syllables (no pauses between words, no stress, etc.), the inescapable conclusion is that they derived their knowledge of word boundaries from the transitional probabilities alone. While this task was simpler than that which babies have to navigate in the real world - the stimuli were much more concentrated and did not have to be remembered over a long period of time - it is clear that some form of statistical learning is going on. In fact, infants are really, really good at statistical learning. It seems paradoxical that we're bad at reasoning statistically.
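To make the mechanism concrete, here is a minimal sketch of transitional-probability segmentation on a toy stream like the one described above. The stream construction, the 0.9 boundary threshold, and the two-letter syllables are my own simplifying assumptions for illustration, not details of the actual Saffran et al. stimuli.

```python
import random
from collections import Counter, defaultdict

# Hypothetical reconstruction of the stimulus: three "words" made of
# two-letter syllables, concatenated in random order with no pauses.
WORDS = ["bidaku", "padoti", "golabu"]

random.seed(0)
stream = "".join(random.choice(WORDS) for _ in range(300))

# Every syllable in this toy language is exactly two letters long.
sylls = [stream[i:i + 2] for i in range(0, len(stream), 2)]

# Count how often each syllable is followed by each other syllable.
pair_counts = defaultdict(Counter)
for a, b in zip(sylls, sylls[1:]):
    pair_counts[a][b] += 1

def tp(a, b):
    """Transitional probability P(b | a) estimated from the stream."""
    total = sum(pair_counts[a].values())
    return pair_counts[a][b] / total if total else 0.0

# Within-word transitions are deterministic (TP = 1.0), while
# transitions across word boundaries hover around 1/3.
print(tp("bi", "da"))  # within "bidaku": 1.0
print(tp("ku", "pa"))  # across a boundary: roughly 1/3

# Segment: posit a word boundary wherever the TP drops below threshold.
segments, current = [], sylls[0]
for a, b in zip(sylls, sylls[1:]):
    if tp(a, b) < 0.9:
        segments.append(current)
        current = b
    else:
        current += b

print(sorted(set(segments)))  # the three "words" fall out
```

The point of the sketch is just that nothing here knows what a "word" is: the boundaries fall out of the frequency statistics alone, which is exactly the inference the infants appear to be making.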
But what if our bad statistical habits are actually good for learning things like language? Take (8), for example: we induce patterns when really there are none. In the case of language, there are plenty of patterns to induce, so that's OK. And as for disregarding sample size information, perhaps that's how we get out of the trap of poverty of the stimulus. Perhaps we can and do jump to conclusions based on inadequate knowledge, but for whatever reason - the design of language itself, the built-in redundancy, or perhaps an efficient error-correcting mechanism - stumble on the right theory eventually.
So perhaps our computational models for language learning have to take into account the fact that people are using statistical information, but not in the same way that a professionally trained statistician would. We jump to conclusions, make leaps of faith - and somehow, magically, it all comes right in the end.
[Of course, it's a possibility that we grow worse and worse at statistical reasoning as we grow older, and this is why it's hard to learn language after the critical age, whenever that is. Another possibility is that it's only when we try to manipulate statistics consciously that we stumble - our subconscious is the one that's great at learning patterns, and that's why when we try to learn language actively, we don't really succeed as well as when we were little babies and picking it up instinctively.]
Statistical aspects of language [pdf slides]