Germanvocab.com

FAQ

Why did you make this web site?

I'm learning to read German and am working my way through the list on this German Language Stack Exchange page. I figured that vocabulary lists would be useful to me, and thought I might as well share them.

What format are the files in?
They're zipped up tab-separated text files. There are two versions of each file, in slightly different formats. If the accents don't display correctly when you open one of the files, try the other.
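If you'd rather read a file with a script than a spreadsheet, here is a minimal Python sketch. It rests on some assumptions: the member name words.txt is taken from the Excel instructions on this page, and the two encodings it tries (UTF-8 and Windows-1252) are a guess based on the 'Windows (ANSI)' import setting, not something the files are documented to use. The column layout isn't described here, so rows come back as plain lists.

```python
import csv
import io
import zipfile


def read_vocab(zip_path, member="words.txt", encodings=("utf-8-sig", "cp1252")):
    """Return the rows of a tab-separated text file stored inside a zip archive.

    The encodings are tried in order. UTF-8 goes first because Windows-1252
    will decode almost any byte sequence without complaint, which would
    silently garble the accents.
    """
    with zipfile.ZipFile(zip_path) as archive:
        raw = archive.read(member)
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
        reader = csv.reader(
            io.StringIO(text, newline=""),
            delimiter="\t",
            quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
        )
        return list(reader)
    raise ValueError("could not decode " + member)


# Example use (the archive name is hypothetical):
# rows = read_vocab("some-list.zip")
# print(rows[:5])
```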

What should I do with these files?
You can open them in Excel in Windows (just drag words.txt onto Excel). On Excel for Mac, go to File->Import, choose 'Text file', select 'words.txt' and then change the 'File origin' to 'Windows (ANSI)'.

They're in an ideal format for flashcards, or for use on Memrise. If you have a Mac then you can use Uprise to bulk-upload the audio.

Where are the 3,000 most common German words from?
They're based on a corpus of some 50 million German words from subtitles at http://opensubtitles.org

How were the vocabulary lists created?
A program (written in C# and using SQL Server, if you're interested) parses the text and splits it into individual words. For each word, it uses a morphological database to work out the base form (so, for example, esse, isst, aßt, esst etc. are all converted to the infinitive, essen) and to determine the base word's part of speech (noun, verb, etc.). It then counts how many times each base form is used in the text, translates it using the Google Translate API and gets the pronunciation from http://forvo.com. Finally, any obvious errors are corrected.
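To make the counting step concrete, here is a small Python sketch. It is not the actual program (which is written in C#). The LEMMAS dictionary is a hypothetical stand-in for the morphological database, and the translation and pronunciation lookups are left out because they depend on external services.

```python
import re
from collections import Counter

# Hypothetical stand-in for the morphological database:
# inflected form -> (base form, part of speech).
LEMMAS = {
    "esse": ("essen", "verb"),
    "isst": ("essen", "verb"),
    "aßt": ("essen", "verb"),
    "esst": ("essen", "verb"),
    "essen": ("essen", "verb"),
}

# A word is a run of letters, including the German umlauts and ß.
WORD = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def count_base_forms(text, lemmas=LEMMAS):
    """Count how often each (base form, part of speech) pair occurs in text."""
    counts = Counter()
    for token in WORD.findall(text):
        # Try the word as written, then lower-cased (e.g. at a sentence start).
        entry = lemmas.get(token) or lemmas.get(token.lower())
        if entry is None:
            continue  # not in the toy database, so it is skipped
        counts[entry] += 1
    return counts


counts = count_base_forms("Ich esse, du isst, ihr esst. Wir essen.")
print(counts.most_common(1))
```

Sorting the counter by frequency (as most_common does) is what would turn the counts into a "most common words" list.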

What are the limitations of these vocabulary lists?
The vocabulary lists were created almost entirely automatically, so they are only as good as the data and the algorithms:

  • It can get mixed up with words that are both nouns and verbs. Wissen can mean both to know and knowledge, but the algorithm might give its meaning as to knowledge or know.
  • It has no context. If it sees the word Herzog it will assume it is the preterite of herziehen rather than somebody's name.
  • It can't cope well with separable verbs. If it sees the verb aushöhlen in the separated form höhlen ... aus it will interpret this as two words.
  • It can't cope well with compound nouns. When it sees Seekarte (composed of See and Karte), it can't find this in the morphological data and doesn't know that it's a noun.
  • It stumbles over proper nouns.
  • It can't distinguish between words with different meanings depending on the gender (for example, der Flur and die Flur).
  • Hyphenated words throw it off: if a word has been split across two lines then it will interpret them as two words.
  • It only lists the most common translation, as returned by Google Translate.

Although they aren't perfect, they should be a good start for learning the vocabulary in the books they've been generated from.
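To illustrate the line-break limitation from the list above, the Python sketch below shows how a simple word splitter turns a word hyphenated across two lines into two tokens. It also shows one possible pre-processing step that rejoins such words. The rejoining step is only an illustration and not something the program behind these lists is known to do. It has its own flaw: it would also merge a genuinely hyphenated compound that happens to break at its hyphen.

```python
import re

# A word is a run of letters, including the German umlauts and ß.
WORD = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def tokens(text):
    """Split text into words, ignoring punctuation and whitespace."""
    return WORD.findall(text)


def join_line_breaks(text):
    """Rejoin words hyphenated across a line break, e.g. 'See-' + 'karte'."""
    # A word character, a hyphen, a newline (with optional surrounding
    # whitespace), then another word character: drop the hyphen and the break.
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


page = "Die See-\nkarte liegt dort."
print(tokens(page))                    # the split word appears as two tokens
print(tokens(join_line_breaks(page)))  # the word is whole again
```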

I have another question - can you answer it?
Try emailing me at neil@germanvocab.com.

Can you create a vocabulary list for X?
Maybe - email me at neil@germanvocab.com.

There's a mistake in one of the vocabulary lists - can you correct it?
Maybe - email me at neil@germanvocab.com.