External Corpora

Many people make lists of web-based corpus materials. Sonja Eisenbeiss has gone one better and compiled a list of useful lists (of web-based corpus materials), available here: http://experimentalfieldlinguistics.wordpress.com/experimental-materials/lexical_databases/

Corpus Resources at Essex. Details of which external corpora we have access to, and where they are stored: http://www.essex.ac.uk/linguistics/research/resgroups/clgroup/Resources/Corpora/

Below is a sample of other external corpora that may be relevant to various linguistic questions. It is a general list, developed as a reference for Essex students:

ENGLISH corpora (not specifically learner/L2 corpora; those are grouped separately, below)

International Corpus of English Brief Description: National and regional varieties of English with comparable genres & registers (1 million words each). Plain & tagged (CLAWS7 tagset) http://ice-corpora.net/ice/index.htm Cost? free with license to single users within non-profits
British National Corpus Brief Description: 100 million words of late 20th-century spoken (10%) & written (90%) British English, tagged with the CLAWS7 tagset http://www.natcorp.ox.ac.uk/ Cost? single user GBP 75, institutional GBP 500
TIME Magazine Corpus of American English Brief Description: 100 million words 1923-2006. all written, all from the magazine http://corpus.byu.edu/time/ Cost? free online search tool
The SLX Corpus of Classic Sociolinguistic Interviews Brief Description: small corpus of 8 interviews with 9 speakers http://projects.ldc.upenn.edu/DASL/SLX/ Cost? free or $100
Santa Barbara Corpus of Spoken English. Brief Description: audio & transcripts of language in everyday American life. Forms part of ICE-American English http://www.linguistics.ucsb.edu/research/santa-barbara-corpus Cost? free, available online
Birmingham Blog Corpus. Brief Description: This corpus consists of 628,558,282 words extracted from blog texts. The corpus is split into sections according to how the texts were discovered and downloaded: http://wse1.webcorp.org.uk/home/blogs.html Cost? free, available online
Dundee Corpus of English Brief Description: Eyetracking corpus of newspaper reading Cost? Free to researchers
Switchboard Corpus of American English (tagged) Brief Description: Richly tagged corpus of telephone conversations in American English http://groups.inf.ed.ac.uk/switchboard/download.html Cost? $25
COCA (Corpus of Contemporary American English) Brief Description: 450 million word corpus of English http://corpus.byu.edu/coca/ Cost? Free to researchers
Bank of English Brief Description: English corpora that provide the basis for Collins dictionaries http://www.collinslanguage.com/content-solutions/wordbanks Cost? GBP 695
LDC (Linguistic Data Consortium) Brief Description: corpora, lexica, dictionaries for a broad set of languages http://www.ldc.upenn.edu/About/ Cost? USD 1,000
Sounds of the City Brief Description: Glaswegian English diachronic corpus. Sample audio available online; researchers can gain access to the full corpus http://soundsofthecity.arts.gla.ac.uk/ Cost? Free

LEARNER Corpora (L2 English and other languages)

FLLOC (French Learner Language Oral Corpora) Brief Description: L2 oral French from classroom beginners to advanced; range of tasks; audiofiles; CHAT/CLAN transcripts; tagged transcripts; xml files; circa 3.5M words http://www.flloc.soton.ac.uk Cost? free
SPLLOC (Spanish Learner Language Oral Corpora) Brief Description: L2 oral Spanish; range of tasks; range of levels; audiofiles; CHAT/CLAN transcripts; tagged transcripts; xml files http://www.splloc.soton.ac.uk Cost? free
LTTC English Learner Corpus (LTTC-ELC) Brief Description: Taiwanese learners of English; written exam; intermediate & high intermediate; CHAT transcripts; 2M words http://lttcelc.org/index.php Cost? free
ICLE (International Corpus of Learner English) Brief Description: written essays; higher intermediate; range of L1s; 3.7M words http://www.uclouvain.be/en-cecl-icle.html Cost? €225 (1 user); €290 (2-10 users); €420 (11-25 users)
CHILDES Brief Description: mostly L1 acquisition data; growing amount of L2 (including the ESF corpus); some disordered data; transcripts; some tagged data; CHAT; huge database, wide range of L1s; sophisticated analysis software http://childes.psy.cmu.edu/ Cost? free
WRICLE (Written Corpus of Learner English) Brief Description: L2 English essays; Spanish L1; 750K words http://www.uam.es/proyectosinv/woslac/Wricle/ Cost? free
Sketch Engine Brief Description: range of L2 corpora e.g. British Academic Spoken/Written Corpus https://the.sketchengine.co.uk/open/ Cost? free
CEEAUS (Corpus of English Essays Written by Asian University Students) Brief Description: written essays; L1 Japanese; L1 Chinese; NS controls; intermediate and advanced http://language.sakura.ne.jp/s/ceeause.html Cost? free
ICCI (International Corpus of Crosslinguistic Interlanguage) Brief Description: written essays; range of L1s; beginners to lower intermediate http://tonolab.tufs.ac.jp/icci/index.jsp Cost? free
ISLE speech corpus Brief Description: 23 German and 23 Italian L1 speakers; L2 English; each speaker records the same blocks of sentences http://catalog.elra.info/product_info.php?products_id=568 Cost? €50 members of ELRA; €100 non-members

RUSSIAN Corpora

A query to Russian corpora Brief Description: interface for interrogating a range of different Russian corpora http://corpus.leeds.ac.uk/ruscorpora.html Cost? free
Russian National Corpus Brief Description: modern written Russian; 300M words http://www.ruscorpora.ru/en/index.html Cost? free
Russian Corpora Brief Description: range of texts (literature; press etc.) http://www.athel.com/Russian_corpora.html Cost? free

ARABIC: corpora of Levantine, MSA & other regional dialects

Quranic Arabic Corpus Brief Description: an annotated linguistic resource which shows the Arabic grammar, syntax and morphology for each word in the Holy Quran. The corpus provides three levels of analysis: morphological annotation, a syntactic treebank and a semantic ontology http://corpus.quran.com/ Cost? free online search tool for lexicon, semantic ontology, etc.
Online Arabic Corpus Brief Description: unclear http://nmelrc.org/online-arabic-corpus Cost? free web access with registration
International Corpus of Arabic (ICA) Brief Description: MSA from different regions, under construction http://www.bibalex.org/unl/Frontend/Project.aspx?id=9 Cost?
Tunisian Arabic Corpus Brief Description: 700k words http://tunisiya.org/ Cost? free & downloadable
Pangloss Collection Brief Description: Connected, spontaneous speech, mostly in “rare” or endangered languages, recorded in their cultural context and transcribed in consultation with native speakers. At present, the archive contains 1230 records in 71 languages, with 325 documents annotated. http://lacito.vjf.cnrs.fr/archivage/index_en.htm Cost? free & downloadable
Open Source Arabic Corpora (OSAC) Brief Description: The corpora include: (1) BBC Arabic corpus: collected from bbcarabic.com, includes 4,763 text documents. Each text document belongs to 1 of 7 categories (Middle East News 2356, World News 1489, Business & Economy 296, Sports 219, International Press 49, Science & Technology 232, Art & Culture 122). The corpus contains 1,860,786 (1.8M) words and 106,733 distinct keywords after stopword removal. (2) CNN Arabic corpus: collected from cnnarabic.com, includes 5,070 text documents. Each text document belongs to 1 of 6 categories (Business 836, Entertainments 474, Middle East News 1462, Science & Technology 526, Sports 762, World News 1010). The corpus contains 2,241,348 (2.2M) words and 144,460 distinct keywords after stopword removal. (3) Open Source Arabic Corpus (OSAc) (small c): collected from multiple sites, includes 22,429 text documents. Each text document belongs to 1 of 11 categories (Economics, History, Entertainments, Education & Family, Religious and Fatwas, Sports, Health, Astronomy, Law, Stories, Cooking Recipes). The corpus contains about 18,183,511 (18M) words and 449,600 distinct keywords after stopword removal. https://sites.google.com/site/motazsite/Home/osac Cost? free & downloadable
CJK Dictionary Institute Brief Description: links to various online Arabic lexical DBs including: Database of Arab Names (DAN); Arab Name Transcription Engine Demo (ANTE); The CJKI Arabic Learner’s Dictionary (CALD); Database of Arab Names in Arabic (DANA); Database of Arabic Business Names (DABNA); Expanded OFAC (XOFAC); Database of Foreign Names in Arabic (DAFNA); Dictionary of Arabic Place Name Variants (DAPNA); Dictionary of Arabic Proper Nouns; Arabic Broken Plurals; Arabic Lexical Database (ALD) http://www.kanji.org/cjk/arabic/arabsam.htm Cost? free
Aralex Brief Description: 40 million word MSA lexical DB http://faculty.uaeu.ac.ae/s_boudelaa/Boudelaa_Marslen-Wilson_aralex.pdf Cost? free

TURKISH Corpora

TS Corpus of Turkish Data Brief Description: 491 million tokens; tagged with both POS tags and morphological tags; free for academic research; based on CWB (http://cwb.sourceforge.net/index.php); displays hits in both KWIC and line view; allows users to categorize queries http://tscorpus.com/ Cost? free for academics

GREEK Corpora

Hellenic National Corpus Brief Description: Greek, written text, POS-tagged and lemmatised, 46 million words (but updated constantly) http://hnc.ilsp.gr/en/default.asp Cost? 6-month subscription, 6-10 users: 529 euros, 11-30 users: 793 euros, more than 30 users: negotiable rate
Corpus of Greek Texts Brief Description: Greek, written and spoken text, 30 million words http://sek.edu.gr/index.php?en Cost? Free

PARALLEL Corpora

Parts of Europarl (European Parliament Proceedings Parallel Corpus) Brief Description: Spoken text; Parallel corpora: French-English, German-English, Greek-English, Italian-English, Spanish-English, Portuguese-English. http://www.statmt.org/europarl/ Cost? Free
DGT-TM (DGT Multilingual Translation Memory of the Acquis Communautaire) Brief Description: Parallel corpus of European legislation in 22 European languages; among them: English, French, German, Greek, Italian, Spanish http://langtech.jrc.it/DGT-TM.html Cost? Free
CRATER 2 parallel corpus Brief Description: French-English-Spanish parallel corpus. European Commission written and spoken text. POS-tagged and lemmatised. 1.5 million words in English and French sections, 1 million in Spanish http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html Cost? Free
CLUVI parallel corpus Brief Description: Parallel corpora of various genres in various combinations of these languages: English, Galician, French, Spanish http://sli.uvigo.es/CLUVI/index_en.html Cost? Free
PAROLE corpora Brief Description: This is a set of corpora and lexica in various European languages; they are not parallel but they followed the same design principles, so they are comparable; corpora were 20 million words per language; lexica were 20,000 entries per language. Written text. Only 250,000 words POS-tagged per language. http://catalog.elra.info/search.php?page=1&affichage=long&restrict=;exclude=products_all;config=htdig_elra_cat;method=and;format=normal;sort=score;matchesperpage=;words=PAROLE Cost? Different price per corpus/lexicon (see website)
Babel Chinese-English parallel corpus Brief Description: Written text. 20 million Chinese characters, 10 million English words. POS-tagged and tokenised Chinese, POS-tagged and lemmatised English. http://www.lancs.ac.uk/fass/projects/corpus/babel/babel.htm Cost? Free
Hong Kong parallel text Brief Description: 59 million English words and 49 million Chinese words. Subcorpora: Hansards, Laws, News. http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2004T08 Cost? $200

FRENCH Corpora

FRANTEXT Brief Description: literary texts; philosophy; scientific and technical texts http://www.frantext.fr/ Cost? €41 per person or €350 per institution
Corpus of spoken French Brief Description: 95 interviews in different regions of France http://www.llas.ac.uk/resources/mb/80 Cost? free
Corpus de français parlé au Québec Brief Description: spoken Québécois French from the year 2000 http://recherche.flsh.usherbrooke.ca/cfpq/index.php/site/index Cost? free
PFC (Phonologie du français contemporain) Brief Description: spoken French across the world http://www.projet-pfc.net/pfc-recherche.html Cost? free

Lists of other Corpora (sorry, we couldn’t help ourselves! Even more lists embedded within lists!)

Long list of corpora http://www.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/index2.html
Another long list of corpora http://www.uow.edu.au/~dlee/CBLLinks.htm

Essex Corpus Linguistics Collective

our corpora, training & wider resources
