Wednesday, April 20, 2005

All Consuming is back

After being down with server problems for a while, Erik Benson has rebuilt All Consuming, which lets you tag the books, DVDs, etc. that you're reading or watching, make lists of the books you're reading, and record your thoughts about books you've finished. Check out the list of books I'm currently reading in the right-hand column - you'll need to scroll down a little. You can join in the fun too, by registering on All Consuming and putting a little snippet of JavaScript on your website.

Old joke, new (realistic) ending

Something Awful ran a contest to come up with realistic punchlines to old jokes [via BoingBoing]. They're kind of funny, in a violates-some-Gricean-maxim way. Here are a couple:

A duck walks into a bar...

Animal control is promptly called, and the duck is taken to a nearby park and released.

What's the difference between Michael Jackson and a shopping bag?

One is a famous singer-songwriter facing charges of child molestation, and the other's a shopping bag.

Many jokes are funny, I think (I didn't pay much attention to the pragmatics part of Ling 101), because they violate some Gricean maxim. For example, the maxim of manner says, in part, to avoid ambiguity. But ambiguity is precisely the thing that makes puns funny. The maxim of relation says to be relevant, but non sequiturs deliberately lead you one way and then make a totally irrelevant statement. But in jokes, this is totally acceptable, because we understand that Gricean maxims are often violated in jokes. They're not (necessarily) true, they're not (necessarily) brief, they're not (necessarily) relevant and they're not (necessarily) unambiguous. Perhaps there's a separate Gricean maxim that says when you're trying to tell a joke, ignore all the other maxims.

So when we come across "realistic" jokes where the beginning sounds like a joke and the "punchline" actually IS relevant, IS brief, IS unambiguous and IS true, it's kind of funny. Well, to me, anyway. I'm overanalysing, aren't I? Stop reading this post and go read the "jokes". Have fun.

Monday, April 18, 2005

Egypt Exploration Society

Language Hat posted a link to an intriguing article on Oxford scientists employing infra-red technology to find long-lost Greek and Roman plays, histories and poetry on a hoard of ancient papyrus thrown onto a garbage heap, the Oxyrhynchus Papyri. Oxyrhynchus ("city of the sharp-nosed fish") was the "titular archdiocese of Heptanomos in Egypt", and its inhabitants were among the earliest to embrace Christianity, so there may be some early Christian documents in there too. The collection already yielded some precious documents in 1906: the Catholic Encyclopedia cites an article dating to that year titled "Les plus anciens monuments du christianisme ecrit sur papyrus".

The Telegraph article says the hoard is owned by the Egypt Exploration Society. That name rang a bell with me, and a quick look at my collection told me why: one of my favourite books, Nefertiti Lived Here, was written by Mary Chubb, who was employed by the EES in the 1930s. She volunteered to go on a season's dig at Tell El-Amarna sponsored by the society, led by the brilliant John Pendlebury, and one of the results was this lovely book. It conveys a sense of the romance of archaeology, but doesn't hesitate to point out the hardships and the disappointments as well. Reading the second-last chapter, about an ancient Egyptian folk dance, still sends a thrill down my spine - you'll have to read it to see why. Amazingly, fieldwork is still on-going at Amarna, still sponsored by the EES.

But these new techniques go to show that it's really not all that easy to destroy information, don't they?* For hundreds of years those papyri have been unreadable. To all intents and purposes, they held no further information. And then people come up with a subtle, sophisticated way of teasing out information from what's left. Absolutely amazing.

*I suppose paper shredders and card shufflers and hard disk rewriters already know this, though.

Monday, April 04, 2005

Unforgettable and unrecallable passwords

Came across an interesting paper today: "Passwords you'll never forget, but can't recall" by Daphna Weinshall and Scott Kirkpatrick of the Hebrew University of Jerusalem. The basic aim is to leverage certain human memory phenomena to create "passwords" that can be recognised, but not described to a third party. For example: the subject memorises about 100-200 pictures chosen from a database of about 20,000. The database is organised into groups of 2-9 based on a common theme; for example, all 9 of the photos in a certain group might contain a windmill. The authentication process is as follows: present a few photos, say 5, exactly one of which is from the set that the subject memorised, and have the subject pick the right one. This is repeated several times to make it unlikely that an intruder passes by guessing.

The rationale is that "a picture is worth a thousand words", and many pictures cannot be described in sufficient detail without actually having them in front of one in order to pick them out of the group. Also, since the subject was given so many photographs, they would be unable to describe all of them anyway. On the other hand, we are pretty good at recognising photographs once we actually see them, so when actually going through the authentication process, we will be able to remember. Some allowance for forgetting is built in - the subject doesn't have to get every test right.
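The scheme as described can be sketched in a few lines of code. This is only a toy illustration, not the paper's exact protocol: the picture database is just a range of integer IDs, and the numbers of rounds and allowed misses are made-up values standing in for whatever parameters the authors actually chose.

```python
import random

random.seed(1)
DATABASE = range(20000)                           # stand-ins for picture IDs
memorised = set(random.sample(DATABASE, 150))     # the subject's "password"

def challenge():
    """One round: 5 pictures, exactly one from the memorised set."""
    target = random.choice(list(memorised))
    decoys = random.sample([p for p in DATABASE if p not in memorised], 4)
    options = decoys + [target]
    random.shuffle(options)
    return options, target

def authenticate(pick_fn, rounds=10, allowed_misses=2):
    """Pass if the user picks the memorised picture in enough rounds.

    allowed_misses builds in the paper's allowance for forgetting."""
    hits = sum(pick_fn(opts) == tgt
               for opts, tgt in (challenge() for _ in range(rounds)))
    return hits >= rounds - allowed_misses

# A genuine subject recognises the memorised picture every time.
subject = lambda opts: next(p for p in opts if p in memorised)
# An intruder can only guess; passing needs 8 of 10 hits at p = 1/5,
# a probability of well under one in ten thousand.
intruder = lambda opts: random.choice(opts)

print(authenticate(subject))   # True
print(authenticate(intruder))  # almost certainly False
```

The key point the sketch makes concrete: the subject's "password" is a recognition ability, not a describable string, so there is nothing they could write down or be forced to reveal in full.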

Anyway, this paper reminds me of a passage from Between Silk and Cyanide [NLB] by Leo Marks, which is, by the way, one of my favourite books. I think it must be the non-fiction book I've re-read the most, and given away as presents to the most people. It's clever, it's funny, and it's about cryptography, which I was really interested in for a long time. I haven't the time to review it properly here, but anyway the relevant passage is the following (page 508 of the hardback):

My dear Colonel,
'PANDARUS has done extremely well from the signals point of view. Before he left he was briefed by signals to give MENELAUS an identity check. This was in such a form that PANDARUS himself, if caught later by the enemy, would be unable to remember it. The position now is that MENELAUS is using the check.
'This is the first time in SOE history that an agent recruited in the field has been given an identity check without anything passing in writing!
The same system of identity check will, in due course, be used by the Zone Commanders when they use their own codes.

Yours sincerely,

Nick reminded me as head of Signals that he was my zone commander, and asked if I'd kindly tell him the secret of Pandarus's ability to forget the security checks which he had to pass on.
Astonished by its simplicity, he stared at the ceiling and muttered, 'Jesus.' (Pandarus, who's blasphemed so frequently I was convinced he was devout, said he'd try the system out. He was the first agent to use it but unless I could find a way to vary it, he was likely to be the last.)*

* I have been advised that for security reasons I must forget how it worked! Has nothing changed in fifty years except Britain's prestige?

I puzzled over this passage for some time but have never been able to even imagine a security check that comes close to having the properties of unforgettability and unrecallability. I doubt that it's anything like the ones proposed in the paper, but the idea's still neat. It just goes to show that there's nothing new under the sun! (Since 1944, anyway.) Anyway, go read Between Silk and Cyanide. You won't regret it.

Sunday, April 03, 2005

Some thoughts on statistical fallacies and language learning

WARNING: long, rambling post ahead.

I must say that the litany of statistical fallacies I enumerated in my review of Stanovich surprised me. I had imagined that a lot of the trouble people had with statistics emerged from the formal framework that has been built up around it: technical terms such as mean, median, mode; t-tests, ANOVA, that sort of thing. A lot of the trouble that people have with it, however, is intuitive. Pose them an informal question, not incorporating any of the machinery of formal statistics, and they can still get it wrong.

Let's take a quick run through the fallacies that Stanovich discusses:
(1) "person-who" arguments
(2) discounting base rates
(3) failure to use sample size information
(4) the gambler's fallacy
(5) thinking coincidences are more "miraculous" than they are (related to (2) and (3))
(6) discounting incidences and seeing only coincidences
(7) trying to get it right every time, even when it's better to be wrong sometimes
(8) the "conspiracy theory" effect - seeing patterns where there are none
(9) the illusion of control

This intuitive mishandling of statistics is surprising to me, because I'm a firm believer in the theory of statistical language learning. By this I mean that a lot (not necessarily all, but a vast majority) of what we learn is from observations of how frequently certain words occur, how some words only occur with other words, how often some words occur in a certain context. Nor do I believe that this statistical learning is reserved for the task of language learning: as I understand it, a lot of work on our visual systems has shown that our perception of the world depends greatly on statistical considerations.

[Aside: This doesn't mean that I think linguistic knowledge consists of a big bunch of statistics. I still think that linguistic knowledge is rule-based, it's just that we use statistical learning to infer these rules.]

The most famous demonstration of statistical language learning comes from Saffran et al's 1996 article "Statistical learning by 8-month-old infants" [Google cache; link unstable] in Science. Saffran et al reasoned that

...[o]ver a corpus of speech there are measurable statistical regularities that distinguish recurring sound sequences that comprise words from the more accidental sound sequences that occur at word boundaries...Within a language, the transitional probability from one sound to the next will generally be highest when the two sounds follow one another within a word, whereas transitional probabilities spanning a word boundary will be relatively low...For example, given the sequence pretty#baby, the transitional probability from pre to ty is greater than the transitional probability from ty to ba...

Transitional probabilities, therefore, can be employed in the task of word segmentation. They then exposed babies to a string of meaningless syllables without prosodic information, for example bidakupadotigolabubidaku... Here, bidaku, padoti and golabu are the "words". After listening to this for about two minutes, the children were presented with "words" and "non-words", where the "non-words" contained the same syllables but not in the right order. The babies could distinguish between them, listening longer to the non-words. They could also distinguish between "words" and "part-words", in which the syllables were presented in the correct order but bridging the "word boundaries", for example kupado.

As the babies were given no information other than the string of syllables (no pauses between words, no stress, etc.), the inescapable conclusion is that they derived their knowledge of word boundaries from the transitional probabilities alone. While this task was simpler than that which babies have to navigate in the real world - the stimuli were much more concentrated and did not have to be remembered over a long period of time - it is clear that some form of statistical learning is going on. In fact, infants are really, really good at statistical learning. It seems paradoxical that we're bad at reasoning statistically.

But what if our bad statistical habits are actually good for learning things like language? Take (8), for example: we induce patterns when really there are none. In the case of language, there are plenty of patterns to induce, so that's OK. And as for disregarding sample size information, perhaps that's how we get out of the trap of poverty of the stimulus. Perhaps we can and do jump to conclusions based on inadequate knowledge, but for whatever reason - the design of language itself, the built-in redundancy, or perhaps an efficient error-correcting mechanism - stumble on the right theory eventually.

So perhaps our computational models for language learning have to take into account the fact that people are using statistical information, but not in the same way that a professionally trained statistician would. We jump to conclusions, make leaps of faith - and somehow, magically, it all comes right in the end.

[Of course, it's a possibility that we grow worse and worse at statistical reasoning as we grow older, and this is why it's hard to learn language after the critical age, whenever that is. Another possibility is that it's only when we try to manipulate statistics consciously that we stumble - our subconscious is the one that's great at learning patterns, and that's why when we try to learn language actively, we don't really succeed as well as when we were little babies and picking it up instinctively.]

Further reading:
Statistical aspects of language [pdf slides]