Monday, February 21, 2005

More about ISBNs than you ever cared to know

It's rather sad that I should find this fascinating, but did you know you could identify where a book was published simply from its ISBN? The first few digits will tell you. 0 and 1 are for English-speaking countries (Australia, Canada, NZ, S. Africa, UK, Gibraltar, USA, Ireland - and, strangely, Namibia, Swaziland and Zimbabwe), 2 for French-speaking countries, etc., etc., then you get to 8 and the double digits - 80 for the Czech Republic and Slovakia, up to India at 93, then the 3-digit ones beginning with Argentina at 950 (don't know where 94 went), the 4-digit ones at the Dominican Republic with 9945, and then the 5-digit ones from Bahrain at 99901 down to Eritrea at 99948.

Of course, this is in proportion to the countries' abilities (or proclivities) to churn out books, since Eritrea can only publish 10,000 books under this scheme - 9994800000 through to 9994899996. The last digit is a check digit, so there are only four digits left for them to play around with. You can check out the full list of country identifiers here.
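If you want to see where that final check digit comes from, here is a minimal sketch of the standard ISBN-10 calculation (weights 10 down to 2 over the first nine digits, modulo 11). The function name isbn10_check_digit is just my own label, not anything official.

```python
def isbn10_check_digit(first_nine: str) -> str:
    """Return the ISBN-10 check character for a nine-digit body."""
    if len(first_nine) != 9 or not (first_nine.isascii() and first_nine.isdigit()):
        raise ValueError("expected exactly nine digits")
    # Weights run 10, 9, ..., 2 across the nine digits.
    total = sum(int(d) * w for d, w in zip(first_nine, range(10, 1, -1)))
    # The check value makes the full weighted sum divisible by 11.
    check = (11 - total % 11) % 11
    # A check value of 10 is written as the letter X.
    return "X" if check == 10 else str(check)


# First and last ISBNs available under the Eritrean prefix 99948.
first = "999480000" + isbn10_check_digit("999480000")
last = "999489999" + isbn10_check_digit("999489999")
print(first, last)  # 9994800000 9994899996
```

Everything between the five-digit prefix and the check digit is just a four-digit counter from 0000 to 9999, which is where the 10,000 figure comes from.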

The next few digits are similarly assigned to unique publishers. Again, bigger publishers get shorter codes, so they have more digits left over for the ISBNs they can assign.

You know, this looks a lot like a prefix-free Huffman encoding (if you're into information theory), where things that occur more frequently get a shorter identifier. The funny thing is, in information theory, that implies that these major publishers (and countries) put out stuff containing less information! This page breaks down the distribution of publishers for English-speaking countries (look down at the bottom of the gory details).
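To make the prefix-free idea concrete, here is a rough sketch that pulls the country identifier off the front of an ISBN. The GROUPS table holds only the handful of prefixes mentioned in this post, so it is nowhere near the full list, and the names GROUPS, is_prefix_free and split_group are made up for illustration.

```python
# Only the identifiers named in this post; the real list is much longer.
GROUPS = {
    "0": "English-speaking",
    "1": "English-speaking",
    "2": "French-speaking",
    "80": "Czech Republic and Slovakia",
    "93": "India",
    "950": "Argentina",
    "9945": "Dominican Republic",
    "99901": "Bahrain",
    "99948": "Eritrea",
}


def is_prefix_free(codes) -> bool:
    """True if no code is the start of a different code."""
    codes = list(codes)
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


def split_group(isbn10: str):
    """Split an ISBN-10 into (identifier, region, remaining digits)."""
    digits = isbn10.replace("-", "").replace(" ", "")
    # Because the table is prefix-free, at most one length can match,
    # so there is never any ambiguity about where the identifier ends.
    for length in range(1, 6):
        prefix = digits[:length]
        if prefix in GROUPS:
            return prefix, GROUPS[prefix], digits[length:]
    raise KeyError("identifier not in this (partial) table")


print(is_prefix_free(GROUPS))      # True
print(split_group("9994899996"))   # ('99948', 'Eritrea', '99996')
```

No separator is needed between the identifier and the rest of the number, which is exactly the property a Huffman code relies on.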

Further reading:

This is a really good textbook about information theory available on the web: Information Theory, Inference, and Learning Algorithms by David MacKay. I especially recommend the section about estimating the entropy of English given the fact that we can construct American-style crosswords. (British-style crosswords are harder to solve though!)

Another useful website: ISBN check
Wikipedia's entry on ISBNs

Oh, and, ISBNs are becoming 13 digits long - current ISBNs will have 978 prefixed to them and a recalculated check digit (which is what appears on most barcodes anyway - the 978 indicates that it's a "Bookland" EAN). This is, of course, because the 10-digit ISBNs limit how many books can be given a number. Looks like those Library Lookup regexes are going to need a tweak before too long.
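For the curious, the conversion can be sketched in a few lines: drop the old check digit, stick 978 on the front, and work out the new check digit the EAN-13 way (alternating weights of 1 and 3, modulo 10). The function name isbn10_to_isbn13 is my own, and the sketch does not validate the old check digit.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to its 978-prefixed 13-digit form."""
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10 or not (chars[:9].isascii() and chars[:9].isdigit()):
        raise ValueError("expected a 10-character ISBN")
    # Drop the old check digit and add the Bookland prefix.
    body = "978" + chars[:9]
    # EAN-13 weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(body))
    check = (10 - total % 10) % 10
    return body + str(check)


print(isbn10_to_isbn13("0306406152"))  # 9780306406157
```

Note that the new check digit is always a plain digit, so the occasional X at the end of a 10-digit ISBN disappears.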


