Why Chinese Is So Damn Hard, Part 1
Lately I have been working on some software to speed up the learning of Chinese characters. When you learn Chinese (or Japanese, for that matter), your knowledge of the language roughly boils down to how many characters you know. The question is: how many characters do you need to know to get a basic grip on the language? And what about a more advanced level, like reading a Chinese newspaper?
I have often heard people say that when you know about 2000 characters you can read around 80% of a newspaper. This is pure connerie, as we say in French. You know why? Because usually, the most interesting word in a sentence is also the most complex. Thus, with 2000 characters under your belt you might very well be able to read four characters out of five, but you will still only understand about two or three sentences in the whole article, because you will always miss that one intriguing character that you have never seen anywhere.
To support this claim I have sampled a large number of Chinese sentences from the web. More precisely, I have sampled 23851 sentences from Twitter, all in Chinese. I have extracted the Chinese characters from these sentences and counted how many times each one occurs. There are 5577 distinct characters in all. From these counts, I have obtained the frequency rank of each character.
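The counting and ranking step could look roughly like the sketch below. It is not the code I actually ran: the function names are made up for illustration, and it assumes that "Chinese character" means anything in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which leaves out the rarer extension blocks.

```python
from collections import Counter


def is_chinese_char(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts
    # as a Chinese character (extension blocks are ignored).
    return "\u4e00" <= ch <= "\u9fff"


def rank_characters(sentences):
    """Return (rank, character, occurrences) tuples, most frequent first."""
    counts = Counter(
        ch for sentence in sentences for ch in sentence if is_chinese_char(ch)
    )
    # most_common() sorts by decreasing count; characters with equal
    # counts stay in the order they were first seen.
    return [
        (rank, ch, n)
        for rank, (ch, n) in enumerate(counts.most_common(), start=1)
    ]


# Toy sample, standing in for the real sentences.
sample = ["我是学生", "我不是老师", "你的书是我的"]
for rank, ch, n in rank_characters(sample)[:3]:
    print(rank, ch, n)
```

Characters that occur the same number of times get consecutive ranks in an arbitrary (first seen) order, which is why the ranks among the rarest characters carry little meaning.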
Here are the five most frequent and the five rarest characters I have found:
| Rank | Character | Number of occurrences |
|---|---|---|
| 1 | 的 | 30645 |
| 2 | 是 | 15229 |
| 3 | 我 | 12930 |
| 4 | 了 | 12774 |
| 5 | 不 | 12743 |
| ... | ... | ... |
| 5573 | 琲 | 1 |
| 5574 | 驮 | 1 |
| 5575 | 鑲 | 1 |
| 5576 | 瞰 | 1 |
| 5577 | 曦 | 1 |
With 1000 characters, you can read about 10% of what you find on the web. Pretty lame, huh?
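To make the difference between "characters you can read" and "sentences you can read" concrete, here is a minimal sketch that computes both for a given set of known characters. The function names are hypothetical, and "reading a sentence" is assumed here to mean knowing every Chinese character in it, which is only one possible way to define it.

```python
def is_chinese_char(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return "\u4e00" <= ch <= "\u9fff"


def char_coverage(sentences, known):
    """Fraction of Chinese character occurrences that are in `known`."""
    chars = [ch for s in sentences for ch in s if is_chinese_char(ch)]
    if not chars:
        return 0.0
    return sum(ch in known for ch in chars) / len(chars)


def sentence_coverage(sentences, known):
    """Fraction of sentences whose Chinese characters are all in `known`."""
    # Sentences without any Chinese character are left out of the count.
    chinese = [
        [ch for ch in s if is_chinese_char(ch)] for s in sentences
    ]
    chinese = [chars for chars in chinese if chars]
    if not chinese:
        return 0.0
    readable = sum(all(ch in known for ch in chars) for chars in chinese)
    return readable / len(chinese)


# Toy sample: the learner knows five very common characters.
sample = ["我是学生", "我不是老师", "你的书是我的"]
known = {"我", "是", "的", "不", "你"}
print(char_coverage(sample, known))      # share of characters recognized
print(sentence_coverage(sample, known))  # share of sentences fully readable
```

In this toy sample the learner recognizes 10 of the 15 characters, yet not a single sentence is fully readable, because each one contains at least one unknown character.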