Wednesday, January 18, 2017

Reading Speed

I've long been dissatisfied with the table that I put in this post, which showed my proficiency in various languages. In it I gave my reading, listening and speaking abilities subjective grades on a 0-to-10 scale. It should be easy to measure the first two of these objectively. I recently started to do this with reading.

I chose newspaper editorials as material. Why? There's less variation among them in terms of difficulty than among novels. There are fewer personal and place names in them than in newspaper articles. They're easy to find.

Here's what I have so far:

For this table I read about 5,000 words each in English, German, Russian, Spanish and French. I read slightly fewer than 2,900 words in Italian and about 1,250 characters in Chinese. This is a work in progress. Eventually I want to base my table on 5,000 words in each of these European languages and on 10,000 characters in Chinese. And I want to add Portuguese and Ukrainian, which I can read to some extent.

For the record, the one English word I didn't know was "exons", encountered in an NYT editorial by Nicholas Kristof. I comfort myself with the near-certainty that he doesn't know what it means either.

What does "Adj. reading speed" mean in my above table? Well, German words, for example, tend to be longer than English ones. So a thousand words in German usually convey more information than a thousand words in English. I decided to quantify this, and then to adjust for it.

When people compare languages, they often use the Lord's Prayer because it's the text that's been translated into the most tongues. However, it's short and sometimes contains archaic vocabulary, which is unrepresentative of modern speech. I decided to go for one of the chief texts of the religion of liberalism instead, the Universal Declaration of Human Rights.

The web site of the High Commissioner of these rights has translations of their Declaration into 501 languages. The document is broken up into articles. I removed article headings ("Article 25", for example) before counting the words because I'm more interested in the length of real sentences. In the end I came up with the following numbers:

For Chinese I used the number of characters instead of the number of words. I used the coefficients in the last column to adjust the reading speed numbers above. My entire worksheet can be seen here.
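The adjustment described above can be sketched roughly as follows. This is a minimal sketch, not the actual worksheet: it assumes the coefficient is the English word count of the Declaration divided by the count in the other language, and that the adjusted speed is the raw speed multiplied by that coefficient. The function names and the example counts are hypothetical placeholders, not the figures from the tables.

```python
def raw_speed(units_read: int, minutes: float) -> float:
    """Raw reading speed in words (or characters, for Chinese) per minute."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return units_read / minutes


def length_coefficient(units_in_language: int, words_in_english: int) -> float:
    """English-equivalent words conveyed by one word (or character) of the language.

    Assumption: both counts come from the same text, e.g. the Declaration
    with its headings removed.
    """
    if units_in_language <= 0:
        raise ValueError("units_in_language must be positive")
    return words_in_english / units_in_language


def adjusted_speed(speed: float, coefficient: float) -> float:
    """Reading speed expressed in English-equivalent words per minute."""
    return speed * coefficient


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    coeff = length_coefficient(units_in_language=1600, words_in_english=1800)
    speed = raw_speed(units_read=5000, minutes=25)
    print(f"coefficient: {coeff:.3f}")
    print(f"raw: {speed:.0f} per minute, adjusted: {adjusted_speed(speed, coeff):.0f}")
```

With a coefficient above 1, a language whose words carry more information than English words gets its speed scaled up, which is the direction of the correction described for German.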

My plan for measuring listening comprehension is to find some audio-books of classic novels and then to calculate the percentage of the words that I understand correctly the first, the second, the third, etc. time that I hear a passage spoken.

The only way to objectively quantify speaking ability is to ask lots of native speakers to grade you. I can't really do that, so those grades will remain subjective in my new system.


  1. How many Chinese characters do you know?

    I've heard that 10,000 characters is way more than you need to know for Chinese. They say that 3,000 to 5,000 is enough to read at the level of an educated person. Characters less frequent than the top 3 to 5K rarely appear, and when they do, they can just be looked up.

    1. I have 4,346 Chinese characters and 2,638 compounds in my Anki deck. I've written about Anki here and here. I pretty much know the things that I have in Anki. My recall is between 93% and 94%, I think.

      When I read Chinese news or Wikipedia articles, I sometimes see characters that are new to me and enter them into my Anki deck. Not very often, though. To read Chinese fluently I'd need to know several times more compounds. Very often you cannot predict the meaning of a compound from its constituent characters. A lot of Chinese words are 2-character compounds and some are longer.

      Also, it takes me a long time to recognize many of the characters and compounds that I do know, hence the slow reading speed.

  2. 4,346 is a lot. That's impressive. That's enough for educated literacy, assuming that they're high frequency characters. Are those characters from one of the high frequency character lists?

    You're right about the compounds. It's hard to infer their meaning just by knowing the constituent characters. On the other hand, knowing the constituent characters makes learning and remembering the compounds easier.

    1. I got 4,096 characters and 1,516 compounds from this book:

      The reason that I can cite precise figures like that is that I've got everything in Anki and it puts out statistics. While reading things, mostly Chinese news and Wiki articles, I encountered 150 new characters and 1,122 new compounds, all of which I entered into my Anki deck.

      I bought that book several years before I learned about Anki's existence. It explains character shapes with etymologies, some of which are fake, I think. But they're good mnemonic devices. I memorized all of the characters from that paper book and later entered them into Anki.

      While reading and while listening to Pimsleur Mandarin lessons, I now use an app called Pleco. Kind of a smart dictionary.

      And before I got that character etymology book I read Chinese Readers by John DeFrancis.