External Corpora

Many people make lists of web-based corpus materials. Sonja Eisenbeiss has gone one better and compiled a list of useful lists (of web-based corpus materials), available here: http://experimentalfieldlinguistics.wordpress.com/experimental-materials/lexical_databases/

Corpus Resources at Essex. Details of which external corpora we have access to, and where they are stored: http://www.essex.ac.uk/linguistics/research/resgroups/clgroup/Resources/Corpora/

Below is a general sample of other external corpora that may be relevant to various linguistic questions. It was developed as a reference for Essex students:

ENGLISH Corpora (not specifically learner/L2 corpora; those are grouped separately, below)

  • International Corpus of English Brief Description: National and regional varieties of English with comparable genres & registers (1 million words each). Plain & tagged (CLAWS7 tagset) http://ice-corpora.net/ice/index.htm Cost? free with license to single users within non-profits
  • British National Corpus Brief Description: 100 million words of late 20th century spoken (10%) & written (90%) British English, tagged via CLAWS 7 tagset http://www.natcorp.ox.ac.uk/ Cost? single user GBP 75, institutional GBP 500
  • TIME Magazine Corpus of American English Brief Description: 100 million words, 1923-2006; all written, all from the magazine http://corpus.byu.edu/time/ Cost? free online search tool
  • The SLX Corpus of Classic Sociolinguistic Interviews Brief Description: a mini-corpus of 8 interviews with 9 speakers http://projects.ldc.upenn.edu/DASL/SLX/ Cost? free or $100
  • Santa Barbara Corpus of Spoken English. Brief Description: audio & transcripts of language in everyday American life. Forms part of ICE-American English http://www.linguistics.ucsb.edu/research/santa-barbara-corpus Cost? free, available online
  • Birmingham Blog Corpus. Brief Description: This corpus consists of 628,558,282 words extracted from blog texts. The corpus is split into sections according to how the texts were discovered and downloaded. http://wse1.webcorp.org.uk/home/blogs.html Cost? free, available online
  • Dundee Corpus of English Brief Description: Eyetracking corpus of newspaper reading Cost? Free to researchers
  • Switchboard Corpus of American English (tagged) Brief Description: Richly tagged corpus of telephone conversations in American English http://groups.inf.ed.ac.uk/switchboard/download.html Cost? $25
  • COCA (Corpus of Contemporary American English) Brief Description: 450-million-word corpus of English http://corpus.byu.edu/coca/ Cost? Free to researchers
  • Bank of English Brief Description: English corpora that provide the basis for Collins dictionaries http://www.collinslanguage.com/content-solutions/wordbanks Cost? GBP 695
  • LDC (Linguistic Data Consortium) Brief Description: corpora, lexica, dictionaries for a broad set of languages http://www.ldc.upenn.edu/About/ Cost? USD 1,000
  • Sounds of the City Brief Description: Glaswegian English diachronic corpus; sample audio available online, and researchers can gain access to the full corpus http://soundsofthecity.arts.gla.ac.uk/ Cost? Free

 LEARNER/L2 Corpora

  • FLLOC (French Learner Language Oral Corpora) Brief Description: L2 oral French from classroom beginners to advanced; range of tasks; audio files; CHAT/CLAN transcripts; tagged transcripts; XML files; circa 3.5M words http://www.flloc.soton.ac.uk Cost? free
  • SPLLOC (Spanish Learner Language Oral Corpora) Brief Description: L2 oral Spanish; range of tasks; range of levels; audio files; CHAT/CLAN transcripts; tagged transcripts; XML files http://www.splloc.soton.ac.uk Cost? free
  • LTTC English Learner Corpus (LTTC-ELC) Brief Description: Taiwanese learners of English; written exam; intermediate & high intermediate; CHAT transcripts; 2M words http://lttcelc.org/index.php Cost? free
  • ICLE (International Corpus of Learner English) Brief Description: written essays; higher intermediate; range of L1s; 3.7M words http://www.uclouvain.be/en-cecl-icle.html Cost? €225 (1 user); €290 (2-10 users); €420 (11-25 users)
  • CHILDES Brief Description: mostly L1 acquisition data; growing amount of L2 (inc. ESF corpus); some disordered data; transcripts; some tagged data; CHAT; huge database, wide range of L1s; sophisticated analysis software http://childes.psy.cmu.edu/ Cost? free
  • WRICLE (Written Corpus of Learner English) Brief Description: L2 English essays; Spanish L1; 750K words http://www.uam.es/proyectosinv/woslac/Wricle/ Cost? free
  • Sketch Engine Brief Description: range of L2 corpora e.g. British Academic Spoken/Written Corpus https://the.sketchengine.co.uk/open/ Cost? free
  • CEEAUS (Corpus of English Essays Written by Asian University Students) Brief Description: written essays; L1 Japanese; L1 Chinese; NS controls; intermediate and advanced http://language.sakura.ne.jp/s/ceeause.html Cost? free
  • ICCI (International Corpus of Crosslinguistic Interlanguage) Brief Description: written essays; range of L1s; beginners to lower intermediates http://tonolab.tufs.ac.jp/icci/index.jsp Cost? free
  • ISLE speech corpus Brief Description: 23 German-L1 and 23 Italian-L1 speakers of L2 English; each speaker records the same blocks of sentences http://catalog.elra.info/product_info.php?products_id=568 Cost? €50 members of ELRA; €100 non-members

 RUSSIAN Corpora 

ARABIC: corpora of Levantine, MSA & other regional dialects

  • Quranic Arabic Corpus Brief Description: an annotated linguistic resource which shows the Arabic grammar, syntax and morphology for each word in the Holy Quran. The corpus provides three levels of analysis: morphological annotation, a syntactic treebank and a semantic ontology http://corpus.quran.com/ Cost? free online search tool for lexicon, semantic ontology, etc.
  • Online Arabic Corpus Brief Description: unclear http://nmelrc.org/online-arabic-corpus Cost? free web access with registration
  • International Corpus of Arabic (ICA) Brief Description: MSA from different regions, under construction http://www.bibalex.org/unl/Frontend/Project.aspx?id=9 Cost?
  • Tunisian Arabic Corpus Brief Description: 700k words http://tunisiya.org/ Cost? free & downloadable
  • Pangloss Collection Brief Description: Connected, spontaneous speech, mostly in “rare” or endangered languages, recorded in their cultural context and transcribed in consultation with native speakers. At present, the archive contains 1230 records in 71 languages, with 325 documents annotated. http://lacito.vjf.cnrs.fr/archivage/index_en.htm Cost? free & downloadable
  • Open Source Arabic Corpora (OSAC) Brief Description: The corpora include:
      - BBC Arabic corpus: collected from bbcarabic.com, includes 4,763 text documents. Each text document belongs to 1 of 7 categories (Middle East News 2356, World News 1489, Business & Economy 296, Sports 219, International Press 49, Science & Technology 232, Art & Culture 122). The corpus contains 1,860,786 (1.8M) words and 106,733 distinct keywords after stopword removal.
      - CNN Arabic corpus: collected from cnnarabic.com, includes 5,070 text documents. Each text document belongs to 1 of 6 categories (Business 836, Entertainments 474, Middle East News 1462, Science & Technology 526, Sports 762, World News 1010). The corpus contains 2,241,348 (2.2M) words and 144,460 distinct keywords after stopword removal.
      - Open Source Arabic Corpus (OSAc, with a small c): collected from multiple sites, includes 22,429 text documents. Each text document belongs to 1 of 11 categories (Economics, History, Entertainments, Education & Family, Religious and Fatwas, Sports, Health, Astronomy, Law, Stories, Cooking Recipes). The corpus contains about 18,183,511 (18M) words and 449,600 distinct keywords after stopword removal.
    https://sites.google.com/site/motazsite/Home/osac Cost? free & downloadable
  • CJK Dictionary Institute Brief Description: links to various online Arabic lexical DBs including: Database of Arab Names (DAN); Arab Name Transcription Engine Demo (ANTE); The CJKI Arabic Learner’s Dictionary (CALD); Database of Arab Names in Arabic (DANA); Database of Arabic Business Names (DABNA); Expanded OFAC (XOFAC); Database of Foreign Names in Arabic (DAFNA); Dictionary of Arabic Place Name Variants (DAPNA); Dictionary of Arabic Proper Nouns; Arabic Broken Plurals; Arabic Lexical Database (ALD) http://www.kanji.org/cjk/arabic/arabsam.htm Cost? free
  • Aralex Brief Description: 40 million word MSA lexical DB http://faculty.uaeu.ac.ae/s_boudelaa/Boudelaa_Marslen-Wilson_aralex.pdf Cost? free

 TURKISH Corpora

  • TS Corpus of Turkish Data Brief Description: TS Corpus consists of 491 million tokens; it is tagged with both POS tags and morphological tags; it is free for academic research; it is based on CWB (http://cwb.sourceforge.net/index.php); it displays hit sets in both KWIC and Line View; it allows users to categorize queries http://tscorpus.com/ Cost? free for academics

 GREEK Corpora    

  • Hellenic National Corpus Brief Description: Greek, written text, POS-tagged and lemmatised, 46 million words (but updated constantly) http://hnc.ilsp.gr/en/default.asp Cost? 6-month subscription, 6-10 users: 529 euros, 11-30 users: 793 euros, more than 30 users: negotiable rate
  • Corpus of Greek Texts Brief Description: Greek, written and spoken text, 30 million words http://sek.edu.gr/index.php?en Cost? Free

 PARALLEL Corpora 

 FRENCH Corpora    

 List of other Corpora    (sorry, we couldn’t help ourselves! even more lists embedded within lists!)
