Monday, August 30, 2004

Things linguists (well, some of them) often forget

Seems to me that linguistics today has too many frameworks, too many ways of looking at things. When there are approaches to language as diverse as Government and Binding, Head-Driven Phrase Structure Grammar and the Minimalist Program, they can't all be right. I think linguists really have to go back to the biology and the neuroscience and ask: what really is "natural", and what really is "real"? Are trees natural? Is movement real?

Here are some things that I think linguists all know about but tend to overlook when they're constructing theories:

- Language isn't perfect. It's riddled with errors. People quite often don't finish sentences, for example. Do we nevertheless understand them because we can fill in the blanks from context, or because we don't need all the words to understand language in the first place?

- The distinction between competence and performance: often noted, but do any of the frameworks explain this distinction, or in any way provide for it? Computational grammars (I think; I haven't done nearly enough work with them to know for sure) find parsing harder than generation because they tend to over-generate. But most people struggle to find even one way of saying what they want, let alone several (wrong) ways. Is this merely a problem of word choice, or does syntax play a part?

- We begin parsing even before we come to the end of a sentence, whether we're listening or reading. The most dramatic evidence for this probably comes from garden-path sentences like The horse raced past the barn fell. It seems to me that most computational grammars wait for the whole sentence and only then parse it. Is any provision made for this undeniable fact of language? If so, what predisposes us to treating raced past the barn as the predicate of the sentence rather than a reduced relative clause? There must be some sort of minimality effect going on here. (See the toy sketch after this list.) I think that insights from computational neuroscience, particularly in the field of vision, may come in useful here.

- A lot of these frameworks are way too powerful. I was working in LFG this summer, and you could do just about anything with it just by adding another feature. Is such power really a good thing? Is there a cap on the number of features a language should have? A big problem is that we have no idea where the boundary lies between what's unattested and what's impossible. I know there's been some work done on this in phonotactics, but what about syntax? Is there any way to design an experiment to test the boundary in the realm of syntax? Good thing to think about.
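And here's the toy sketch promised above - entirely my own illustration (in Python), not any published parser. It just tracks which readings of "raced" remain open after each word; a parser that greedily commits to the frequent main-verb reading has to backtrack when "fell" arrives:

    # Readings of "raced": "mv" = main verb ("the horse raced past the barn"),
    # "rr" = reduced relative ("the horse [that was] raced past the barn").
    sentence = "the horse raced past the barn fell".split()

    open_readings = {"mv", "rr"}
    for i, word in enumerate(sentence, 1):
        if word == "fell":
            # A second finite verb forces "the horse raced past the barn"
            # to be a noun phrase, which only the reduced relative allows.
            open_readings &= {"rr"}
        print(" ".join(sentence[:i]).ljust(34), sorted(open_readings))

    greedy_choice = "mv"  # the statistically preferred reading of "raced"
    if greedy_choice not in open_readings:
        print("Garden path: the early commitment has to be undone.")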

Saturday, August 28, 2004

Still more about libraries

Then, if all your friends build their own library catalogues, you can search each other's catalogues to see whether they have books that you want! (Of course, they'd have to be pretty good friends - as you can tell, I'm very particular about who sees what books I have, since books are pretty personal things. You can tell a person by their friends, and by their books too.)

Thursday, August 26, 2004

Interesting links from the past week

How linguists prove that all odds are prime (David Mortensen's website) - really, really hilarious.

Cold fusion back from the dead? (IEEE Spectrum, via Slashdot) Scientists are re-evaluating the long-held and entrenched view that cold fusion is junk science, with more and more confirmations of the Fleischmann-Pons effect rolling in.

ET should write, not call (CNN) Basically, broadcasting radio messages all over the galaxy is expensive; hard-copy is better, especially for long messages.

Humans context-free, monkeys finite-state? Apparently not. (Language Log) - earlier this year, Fitch and Hauser claimed that an experiment they had run showed that monkeys could master grammars at the finite-state level but not at the phrase-structure level. Now there's a new paper by Perruchet and Rey that rebuts this claim.

WordCount - tracking the way we use language. An innovative display of the Zipfian relationship between frequency and rank in the English language.

How conservatives use language to dominate politics (UC Berkeley News) - an interview with linguist George Lakoff. Really, truly worth a read. Update: a follow-up interview, with more recent news.

Incidentally... (more about libraries)

I chanced upon the LibraryLookUp site while looking for tools (preferably free) with which to catalogue my own modest library. The idea I had was to type in, or possibly scan in (with some cheap barcode scanner), the ISBNs of the books I have. Presumably I could then use some sort of lookup service to get the full information for those books (including edition information). But these lookup services all seem to cost money, which seems slightly bizarre.

The alternative idea, which I will explore if I have the time, is to write some code to look up the book information via a site like Amazon. Then you could pull all the product information, and also (presumably) get access to reviews and even content, using Amazon's Look/Search Inside the Book system. Maybe this can be done using the Amazon API - something like the sketch below. (Some interesting applications that use the Amazon API service.)
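To make the idea concrete, here's a rough Python sketch of what such a lookup might look like. Caveat: I'm going from memory of Amazon's web-services documentation, so the endpoint, parameter names, and response structure below are all assumptions to be checked against the real docs (you also need to register for a developer access key):

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    def lookup_isbn(isbn, access_key):
        # Service name, operation, and parameters are my best guesses at
        # Amazon's ECommerce Service interface - check the current docs.
        params = urllib.parse.urlencode({
            "Service": "AWSECommerceService",
            "Operation": "ItemLookup",
            "IdType": "ISBN",
            "SearchIndex": "Books",
            "ItemId": isbn,
            "AWSAccessKeyId": access_key,
        })
        url = "http://webservices.amazon.com/onca/xml?" + params
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        # Element names are assumptions too; real responses may also use
        # XML namespaces, which would need handling.
        return tree.findtext(".//Title"), tree.findtext(".//Author")

    print(lookup_isbn("0123456789", "MY-ACCESS-KEY"))  # dummy ISBN and key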

It seems to me that a really interesting application that Amazon might want to enable is for people to easily build their own library catalogues within the Amazon environment. Then you could easily search within your own catalogue. The only issue that I can see is privacy - do you want Amazon to know all the books that you have? Then again, they already know about quite a few of the books I have: some I bought from them, and the others I told them about through the recommendation system, in order to get them to stop recommending me books I already own. And millions of people already tell Amazon what they read when they purchase and review books on the site.

And then again, how many people actually have collections big enough to be worth cataloguing formally? I do, but how many others? Probably not many - especially if it's true, as I once heard, that the value of a property decreases as its number of bookshelves increases. (I couldn't find anything about this on the internet; I heard it more than a year ago on one of those let-us-help-you-sell-your-house home improvement shows, so it may be untrue for all I know.) And then, I suppose, there may be people like Lord Emsworth, creation of the immortal P.G. Wodehouse:

Lord Emsworth - "Catalogue the library? What does it want cataloguing for?"
Lady Constance - "It has not been done since the year 1885."
Lord Emsworth - "Well, and look how splendidly we've got along without it."

(From Leave It To Psmith).

Bookmarklet for the National Library of Singapore (failed)

How sad. I recently happened upon the concept of the bookmarklet, and was especially intrigued by one particular application of it, called LibraryLookUp: a one-click hop from investigating a book on Amazon, Barnes and Noble, or any website that includes the book's ISBN in the URL, to the book's lookup page on your local library's website. So, if you chance upon a book on Amazon, say, that you find interesting, you can look it up in your local library to see if it's available - with just one click. Stupendous!
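The bookmarklet itself is a single line of JavaScript stored as a bookmark, but the logic it performs is simple enough to sketch in a few lines of Python: fish the ISBN out of the current page's URL, then build the catalogue's lookup URL from it. (The catalogue URL pattern below is a made-up placeholder; every library system has its own.)

    import re

    # 10-digit ISBNs (the standard in 2004): nine digits plus a final
    # digit or X check character.
    ISBN_RE = re.compile(r"\b(\d{9}[\dXx])\b")

    def library_url(page_url,
                    opac_pattern="http://catalogue.example.org/isbn/{isbn}"):
        match = ISBN_RE.search(page_url)
        if not match:
            raise ValueError("no ISBN found in this URL")
        return opac_pattern.format(isbn=match.group(1))

    # Amazon product URLs contain the ISBN as the ASIN for most books.
    print(library_url("http://www.amazon.com/exec/obidos/ASIN/0123456789/"))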

The service supports any library website on which a book's lookup page can be reached via a URL that contains the book's ISBN. So, naturally, I wanted to apply this to my local library system, the National Library of Singapore (catalogue), which uses the CARLweb system. Unfortunately, after poking around for hours (well, all right, minutes), I couldn't find any magic way of reaching a book's lookup page via a URL containing its ISBN - not even by looking the book up via ISBN. So it appears that CARLweb can't be supported by this application, which is a tragedy indeed. All I can do is look the book up on something like the National University of Singapore's library catalogue, which doesn't do me much good, since they require membership for entry (something that I have a MAJOR ISSUE with). Sigh.

Monday, August 09, 2004

How many people do you need to develop a language?

Here's an interesting question that came to me one night while I was lying awake: how many people do you need to develop a language? By "develop" I mean to really create it from scratch, the way the children who developed Nicaraguan Sign Language might have done, not just learn a fully-fledged language from their families.

The standard accounts of the development of NSL all start off by saying that before the Sandinista Revolution, deaf children in Nicaragua were isolated and never developed any system of communication sophisticated enough to be called a language. It was only when they came together in a deaf school that they began to sign with each other in what could presumably be called a pidgin, which quickly developed into a full-blown language.

So my question is: what is the threshold number of people needed to develop a language from scratch? (Call this x.) And what is the threshold for developing one out of an incomplete system of communication such as a pidgin?

A related topic that I was recently reading about in Malcolm Gladwell's The Tipping Point is that brain size (more precisely, the size of the neocortex, the distinctively mammalian part of the brain) is correlated with social group size: the bigger the neocortex, the bigger the average size of a mammal's social group. (Extrapolating from the data for other mammals, the optimal size for human social groups is about 150 - and there's some fascinating evidence to back this up.) Anyway, assuming (I'm not sure) that humans' social group size is larger than that of just about all other mammals (excepting dolphins, probably), what if it so happens that our typical group size is just over the threshold x?

Of course, this is pure speculation, and it doesn't entirely explain why great apes would still be unable to learn language. (This last statement, of course, is in itself a contentious issue.)

Pseudo-pun

Hmm, I just realised that I could call this blog "Corpus Linguistics", a pseudo-pun on my nom de plume. Tee hee hee.

Friday, August 06, 2004

Maps of the London tube

- The official Underground tube maps from Transport for London.
- History of the Tube map (TfL). Don't forget to check out the Flash presentation showing the 1933 map morphing into the present-day one.
- The geographically correct Tube map (very, very large). (Via London Underground, which is in itself a hilarious and highly recommended site.)
- Compare subway systems of the world (including the Tube), presented at the same scale.
- Tube Walklines, showing you when it's quicker to walk between stations than to take the Tube. Remember the famous example in Bill Bryson's Notes from a Small Island: Bank and Mansion House, which are 200 yards apart on foot but require a journey of two lines and six stations to get from one to the other.
- A couple of maps of the Tube, translated into (fake) German and (real) German (via Boing Boing)
- Map of the projected London Underground in 2016, with plenty of new lines.

And, in a new discovery, the following site is probably THE directory for maps of the London Underground: Mapper's Delight: the London Underground diagrams. Unfortunately they don't link to any track maps, which I'm looking for and which I KNOW exist.

Fine example of metathesis spotted in the wild

Yesterday I was listening to a talk on computational vision techniques. The speaker consistently pronounced "daguerreotypes" as "derogatypes" /də.ɹɔ.gə.taɪps/ - effectively transposing /g/ and /r/, and also changing the vowel pattern. It seems pretty clear to me that he was doing this by analogy to the word "derogatory", since the quality of the first two vowels of "derogatory" is exactly the same as that of the first two vowels in his pronunciation. To my mind this boosts Elizabeth Hume's theory that metathesis errors serve to make rarer sound sequences conform to more often-attested ones.

Which gets me to thinking - this would seem to mean that theories of phonology have, in some way, to take the corpus of attested sound sequences into account, which at present they don't seem to do. This is a topic that I'd really like to know more about.
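As a first stab at what "taking the corpus into account" could mean, here's a toy Python sketch (my own illustration, not Hume's model): score a word by the corpus frequency of its two-segment sequences. I'm cheating by using spelling instead of IPA, and the six-word "lexicon" is a stand-in for a real pronouncing dictionary, but the point survives: the metathesized form is built from more familiar sequences than the original.

    from collections import Counter
    from itertools import chain

    # Stand-in lexicon; a real test would use something like CMUdict.
    lexicon = ["derogatory", "direct", "parade", "garage",
               "interrogate", "category"]

    def bigrams(word):
        return [word[i:i + 2] for i in range(len(word) - 1)]

    counts = Counter(chain.from_iterable(bigrams(w) for w in lexicon))

    def familiarity(word):
        # Average corpus count of the word's bigrams: higher = more attested.
        bs = bigrams(word)
        return sum(counts[b] for b in bs) / len(bs)

    for w in ["daguerreotype", "derogatype"]:
        print(w, round(familiarity(w), 2))   # the error scores higher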

By the way, the IPA in this message was input using the new service CharWrite from Emeld, which seems really useful. All you have to do is type in a letter that vaguely resembles the IPA character you want - if you want <ə>, say, type a similar-looking letter - then right-click on the field to get a list of characters with similar shapes, and select the right one. Good if you're working on a computer (like this one) that doesn't have IPA fonts installed, and if you just want to type in a quick word and don't fancy hunting around in Microsoft Word or your LaTeX guide. (I discovered CharWrite through the Language Log, by the way.)

Links for 6 August

Fool's World Map - absolutely hilarious, and very apt (from BoingBoing).
38 Dishonest Rhetorical Tricks - see which ones you've been a victim of (from BoingBoing).
Booktastic - a new game for booklovers (from BookBrowse)
A March 1, 2003 article from the Japan Times about the possibility of a secret underground city beneath Tokyo.

Thursday, August 05, 2004

Some often-overlooked uses of linguistics - 2

There's a second skill that linguists get a lot of training in, as I found when I was writing my thesis: the use of very indirect evidence to build an argument. The truth of the matter is that the facts of language are complex, and their "deep" representation is still pretty much unknown. Take the following Hungarian sentence:

Tetsz-em a főnök-nek.
please-I the boss-DAT
lit. "I please to the boss."
i.e. "I please the boss." / "The boss likes me."

On the face of it, this sentence has the following grammatical relations:

1 P 3
I please the boss.

where P=predicate, 1=subject, 2=direct object, 3=indirect object.

But I argue in my thesis on Hungarian syntax that this in fact has a deeper layer of derivation, namely:

2 P 1
2 P 3
1 P 3
I please the boss.

Unfortunately, Hungarian has no passive (it uses impersonals instead), so there is no marker of 2-1 advancement, nor is there any marker of 1-3 demotion. So how can one argue for the extra layers of derivation, which are, after all, contrary to principles such as Occam's razor?

My argument stems from what seems on the surface to be a completely unrelated phenomenon: a case hierarchy for binding relations. What this means is that when you have a reflexive - which, by Principle A of Binding Theory, must refer back to an antecedent within the same binding domain - the reflexive must be lower than the antecedent in the following case hierarchy:

Nominative > Accusative, Dative > Instrumental etc.

So you can have "John pinched himself", not "Himself pinched John", for example. This is more significant for Hungarian than for English, since Hungarian has fairly free word order. Now, apply this case hierarchy to the surface form of our Hungarian sentence. Since "I" is in the nominative case and "the boss" is in the dative, "I" is higher in the case hierarchy than "the boss". Replacing "the boss" with "himself" and "I" with "John", we predict that you can say "John-NOM pleases (to) himself-DAT" but not "Himself-NOM pleases to John-DAT" or "John-DAT himself-NOM pleases":

Prediction:
*Jánosnak önmaga tetszik (* indicates the predicted ungrammaticality)
John-DAT himself-NOM pleases
lit. "to John, himself pleases"

János önmagának tetszik
John himself-DAT pleases
lit. "John to himself pleases"

However, about 2/3 of my Hungarian informants accepted the first sentence (up to variation in word order). So, conclusion: either the case hierarchy is wrong, OR the case hierarchy must be able to apply at a different layer of derivation. If the case hierarchy can apply at the initial stratum of the layered analysis, in which "John" (our "the boss") stands in the 1 relation (subject) and "himself" (our "I") stands in the 2 relation (direct object), then we get the correct predictions.
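To make the logic of the argument explicit, here's a toy Python rendering of the check (my own illustration, not anything from the thesis). Accusative and dative share a rank, as in the hierarchy above, and the reflexive must be strictly lower-ranked than its antecedent. Applied to the surface stratum the check fails; applied to the initial stratum of the layered analysis it succeeds, matching my informants' judgments:

    # Smaller number = higher in the hierarchy; ACC and DAT share a rank.
    RANK = {"NOM": 0, "ACC": 1, "DAT": 1, "INST": 2}

    def binding_ok(antecedent_case, reflexive_case):
        # The reflexive must be strictly lower than its antecedent.
        return RANK[reflexive_case] > RANK[antecedent_case]

    # Surface stratum of "Janosnak onmaga tetszik":
    # antecedent "Janos" is DAT, reflexive "onmaga" is NOM.
    print(binding_ok("DAT", "NOM"))  # False: wrongly predicted ungrammatical

    # Initial stratum of the layered analysis: the antecedent is an initial 1
    # (nominative-like), the reflexive an initial 2 (accusative-like).
    print(binding_ok("NOM", "ACC"))  # True: matches the informants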

I know that there are probably good counter-arguments for this, but I was pleased at being able to come up with it on my own, because it involved the creative use of some pretty indirect evidence (even if I do say so myself).

Anyway, there's my plug for linguistics as a way to equip yourself with some valuable life skills.

Some often-overlooked uses of linguistics - 1

Upon telling people that I study linguistics, I almost invariably get the response, "Cool! ...What is linguistics?" Linguistics is sufficiently unknown as an academic discipline that many linguistics departments actually put up a little introductory note on "what is linguistics". Aside from any practical or scientific applications of the study of language, though, there are some very useful personal skills to be gained from studying linguistics.

Firstly, anyone studying syntax and semantics will quickly encounter the concepts of ambiguity (how many ways can you read "Time flies like an arrow"?) and presupposition, and will come out of those classes much better equipped to spot them - an incredibly useful skill for keeping oneself informed. I'll talk a little bit more about presupposition, because I find it a lot more interesting.

Consider a scenario in which you and your companion see two sleeping dogs that look more or less identical and with which you have no prior acquaintance. Saying "the dogs are sleeping" will get you a nod in response. So will "the two dogs are sleeping". What about "the dog is sleeping"? You'll probably get a weird look and the question, "Which dog?" Your friend will assume that one of the dogs is for some reason more contextually relevant to you than the other, or perhaps that you're referring to a completely different dog altogether - but then you'd first have had to establish the context in which that different dog was relevant. Now suppose you said something like "the three dogs are sleeping". Your friend will probably think that you can't count. The point is that if you say "the X" in a statement that doesn't overtly negate the existence of X (i.e. "the X doesn't exist"), you are presupposing that there is one and only one contextually relevant X, before making a statement about that X. (You can extrapolate this to "the Xes" and "the N Xes" quite easily.)
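In the standard Russellian notation, "the X is P" can be unpacked as the formula below, where the first two conjuncts (the existence and the uniqueness of a contextually relevant X) are the part that gets presupposed:

    \exists x\, [\, X(x) \;\wedge\; \forall y\, (X(y) \rightarrow y = x) \;\wedge\; P(x)\, ]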

Presupposition is not something that only alert linguists can spot. All of us derive presuppositional inferences from statements containing "the X"; it's just that a lot of people do so subconsciously rather than consciously. It's only when the presupposition is exceptional in some way (very negative, for example) that we sit up and take notice, as in the famous question "Do you still beat your wife?" You can't answer "yes" or "no" to that without implicating yourself in a crime of abuse. You have to actively dispute the presupposition by saying, "I've never beaten my wife, and I don't beat her now!"

All definite determiners, including possessors, create this presuppositional inference. When George W. Bush refers to "Saddam's weapons of mass destruction" in a non-negative context, he is unconsciously creating the inference that those weapons of mass destruction exist. I suspect that if you look at his speeches, even the recent ones in which he seeks to distance himself from the flawed intelligence that led to the Iraq war, you'll find that he still says phrases like "the WMDs" and "Iraq's WMDs". (If I find the time to look up his speeches, and find evidence of this, I'll be sure to post it.)

Whew! That was a long spiel about presupposition. I'll write about the second skill in the next post.

Statement of purpose

I anticipate this blog being a collection of ideas, thoughts and notes, mostly about linguistics and cognitive science, but also on books, digital libraries and publishing, mathematics (especially cryptography and graph theory), research, public transportation, etc., etc. As you can tell, I have a fairly eclectic set of interests, and see myself as a bridge between typical left-hemisphere and right-hemisphere interests.

Ironically, I don't really subscribe to the strict left-brain/right-brain division myself. Most behaviours, especially something like language, are so incredibly complex that I think they must activate parts of both hemispheres. So you can't say that language is a left-brain phenomenon, because it harnesses the powers of the right brain too. But anyway, I'm only just embarking on my study of neuroscience, so I won't harp on any further about things I don't yet know much about.