Thursday, September 23, 2004

Cross-linguistic entropy

I stumbled across this paper today: Estimating and Comparing Entropy across Written Natural Languages using PPM Compression [PS], by Behr, Fossum, Mitzenmacher and Xiao. I've mentioned before that I've been unable to find papers on the entropy values of different languages, so this seems to fill the gap somewhat, even though it relates to text rather than speech, which is what I was discussing.

Basically, they took two kinds of text: (1) the King James Version of the Bible and its translations into Spanish, French, Chinese, Korean, Arabic, Japanese and Russian, and (2) the UN Treaty Collection and its translations into Spanish, French, Chinese, Arabic and Russian. They estimated the entropy of the various texts by compressing them and comparing their sizes in bytes after compression. In the first case, the translations all compressed to within about 15% of the (17th-century) English original. In the second, Russian was about 20% off, while the others were within the 15% range. Interestingly, although the different scripts made the uncompressed sizes of the various documents vary by quite a bit (e.g. the Chinese text was half the size of the English text in bytes), once compressed the ratio was 1:0.864 (English:Chinese).
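The paper uses PPM compression; as a back-of-the-envelope sketch of the same idea, you can stand in a general-purpose compressor like bz2 and read the compressed size as a (crude, upper-bound) entropy estimate in bits per character. The texts here are placeholders of my own, not the corpora from the paper:

```python
import bz2

def compressed_size(text: str) -> int:
    """Bytes after bz2-compressing the UTF-8 encoding of `text`."""
    return len(bz2.compress(text.encode("utf-8")))

def bits_per_char(text: str) -> float:
    """Crude entropy estimate: compressed size in bits per character."""
    return 8 * compressed_size(text) / len(text)

# A highly repetitive text should compress far better (i.e. show lower
# apparent entropy) than one with more varied content.
repetitive = "the cat sat on the mat. " * 400
varied = " ".join(f"sentence number {i} is rather different" for i in range(400))

print(f"repetitive: {bits_per_char(repetitive):.2f} bits/char")
print(f"varied:     {bits_per_char(varied):.2f} bits/char")
```

The paper's comparison is the analogue of this across parallel translations: compress each version of the same content and compare the compressed byte counts.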

They also took the English KJV and translated it into French, Spanish, German and Italian using the Systran machine translation tool (which powers Babelfish). In this case the resulting machine translations were larger than the original. This may have been due to faulty translation; also, the choice of text was somewhat injudicious, as Systran couldn't handle archaic words like "giveth" and "taketh" and just left them untranslated (which, I suppose, amounts to a faulty translation!).

The paper also has an interesting discussion of the relationship between expressibility and entropy of different languages.

We suggest that the compressed size of texts with the same information content should remain close to constant across languages, even when the uncompressed texts vary in size...[o]ur hypothesis is based on the following intuition. ...[T]he estimates of the entropy of English are based on a finite stochastic model of the language. The relevant attributes of these models can be applied to all natural languages. The first is the set of statements that can be expressed in this language...[o]ur conclusions rely on the assumption that S^{L} [= the set of statements that can be expressed in the language] is the same for all natural languages...Over this set, we have a probability distribution describing the likelihood that a statement is expressed, or output by the source...for large samples of statements, the probability distributions [p^{L}] for different languages are likely to be quite similar...[i]f our assumptions that p^{L} is roughly the same across all languages is true, we would expect compressed translations to have approximately the same size.

Now, this section made me think of the recent debate over Pirahã, which supposedly displays:

(i) the absence of creation myths and fiction; (ii) the simplest kinship system yet documented; (iii) the absence of numbers of any kind or a concept of counting; (iv) the absence of color terms; (v) the absence of embedding in the grammar; (vi) the absence of 'relative tenses' ... (viii) the absence of any individual or collective memory of more than two generations past; ... (ix) the absence of any terms for quantification...

If you're anything like me, you're probably wondering what on earth these people talk about. But anyway, supposing Pirahã really does have these gaps, then it seems to me that they would have far fewer sentences (without embedding, can they even have an infinite number of sentences?) in their set of expressible statements, S^{Pirahã}.

So here's my question. Would a test like the one Behr et al. conducted be able to detect a much smaller S^L? I'm guessing that a smaller S^L would mean a lower entropy - it'd be much easier to guess what's coming next when there are fewer possible things to express. Here's the bit I'm unsure about. If you're translating an English text into Pirahã, then presumably you've got the basic information across, in which case, regardless of the syntactic structure of the language, the compressed texts should fall within about 20% of each other. After all, English and Chinese have very different syntactic structures, and yet they fall well within 15% in the cases above. So you wouldn't be able to detect gaps anywhere in the language outside the set of sentences you'd worked on. So it wouldn't work.
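To make the "smaller S^L means lower entropy" hunch concrete (this is my own toy model, with a uniformity assumption the paper doesn't make): for a uniform distribution over the expressible statements, Shannon entropy is log2 of the set size, so shrinking the set shaves off bits directly.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a distribution over expressible statements."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform case: entropy = log2(|S|), so a language with fewer expressible
# statements has lower entropy per statement.
big_set = [1 / 1024] * 1024   # 1024 equally likely statements
small_set = [1 / 32] * 32     # 32 equally likely statements

print(entropy_bits(big_set))    # 10 bits per statement
print(entropy_bits(small_set))  # 5 bits per statement
```

Of course, a real p^L is nowhere near uniform, which is exactly why a good compressor can do so much better than log2 of the alphabet size.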

On the other hand, I guess translating from English to Pirahã would result in a lot of simplifications. You couldn't say "fifty men" but just "a large number of men", you couldn't say "a yellow bird", just "a bird" - or maybe there's some paraphrase available. In which case the compressed size of the document might be a lot smaller, since a lot of information is left out. But then again (in another bold twist of the plot!) no compression algorithm is going to recognise that, say, "a large number of men" carries less information than "fifty men". If a lot of paraphrases are necessary, then the size of the file might be larger.
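That last point is easy to demonstrate (the two phrases are just my toy examples, not corpus data): a general-purpose compressor measures surface redundancy, not semantic content, so the vaguer but longer phrase comes out *larger* when compressed.

```python
import zlib

def csize(s: str) -> int:
    """Bytes after zlib-compressing the UTF-8 encoding of `s`."""
    return len(zlib.compress(s.encode("utf-8")))

precise = "fifty men"              # more informative, shorter on the surface
vague = "a large number of men"    # less informative, longer on the surface

# The compressor only sees the characters, so the vague paraphrase
# costs more compressed bytes despite carrying less information.
print(csize(precise), csize(vague))
```

So paraphrase-heavy translation could easily push the compressed size up, not down, which is just what the post worries about.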

So I guess the whole issue is over what "translate" means. Is it still an accurate translation if a lot of the little details are left out?

OK, enough rambling. It's time for bed.

