Laboratorio de Lingüística Informática

CORAF

ACCESS THE CORPUS ONLINE (Firefox recommended)

CORAF (Corpus ORal de Aprendientes de Francés) is a learner corpus composed of 30 interviews with Spanish learners of French as a Foreign Language (FLE). The learners contributed to the project by taking part in informal conversations/interviews held at their own teaching institutions: all recordings were made at the University of Castilla-La Mancha and at three language schools in the Castilla-La Mancha region.

CORAF is a monolingual learner corpus of French as a Foreign Language, made up of 30 samples of spontaneous speech from 34 participants (30 learners and 4 interviewers). It contains 61,092 words in total, 33,915 of them produced by learners. The samples cover all CEFR levels (A1, A2, B1, B2, C1 and C2), and every recording is synchronized with its orthographic transcription.

In addition, the files include the transcription together with metadata on the learners' sociolinguistic background (such as sex, age, place of birth or teaching level) and on their knowledge of French (e.g. level and learning context).

The transcription follows the LLI-UAM guidelines and includes an error analysis of oral production, encoded by means of different tags (the errors were analyzed and categorized by the researcher). Finally, different speech phenomena are taken into account (tagged as @oral and {%oral}) in order to identify other significant features of learner language.

The corpus structure and other specific features are shown in the following tables:

| LEVEL (CEFR) | TOTAL LENGTH (h:mm:ss) | MEAN LENGTH | No. OF INTERVIEWS & SEX (M = man, W = woman) | WORDS (TOTAL) | WORDS (LEARNER) |
|---|---|---|---|---|---|
| A1 | 1:00:24 | 12' 05" | 5 (2M/3W) | 6989 | 2506 |
| A2 | 1:05:22 | 13' 04" | 5 (3M/2W) | 8503 | 4110 |
| B1 | 1:14:19 | 14' 52" | 5 (1M/4W) | 9699 | 4908 |
| B2 | 1:19:46 | 15' 57" | 5 (2M/3W) | 11279 | 6858 |
| C1 | 1:20:28 | 16' 06" | 5 (2M/3W) | 12365 | 7867 |
| C2 | 1:22:04 | 16' 25" | 5 (2M/3W) | 12257 | 7666 |
| TOTAL | 7:22:23 | 14' 45" | 30 (12M/18W) | 61092 | 33915 |
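
The mean lengths in the table above can be reproduced from the total length and the number of interviews at each level. The following Python sketch is only an illustration: the function names `parse_hms` and `mean_length` are ours and are not part of the corpus tools, and rounding to the nearest second is an assumption that happens to match the published values.

```python
def parse_hms(text):
    """Convert an 'h:mm:ss' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mean_length(total, interviews):
    """Mean interview length, formatted like the table (e.g. 12' 05")."""
    # Assumption: the mean is rounded to the nearest whole second.
    mean_seconds = round(parse_hms(total) / interviews)
    minutes, seconds = divmod(mean_seconds, 60)
    return f"{minutes}' {seconds:02d}\""


# Total length and number of interviews per level, copied from the table.
levels = {
    "A1": ("1:00:24", 5),
    "A2": ("1:05:22", 5),
    "B1": ("1:14:19", 5),
    "B2": ("1:19:46", 5),
    "C1": ("1:20:28", 5),
    "C2": ("1:22:04", 5),
    "TOTAL": ("7:22:23", 30),
}

for level, (total, interviews) in levels.items():
    print(level, mean_length(total, interviews))
```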

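| FILE | LEVEL (CEFR) | LENGTH (h:mm:ss) | TOTAL WORDS | WORDS (LEARNER) | WORDS (INTERVIEWER) | TURNS | No. OF SPEECH PHENOMENA |
|---|---|---|---|---|---|---|---|
| A1M01 | A1 | 0:10:44 | 1371 | 445 | 926 | 189 | 0 |
| A1M02 | A1 | 0:12:03 | 1639 | 288 | 1351 | 212 | 1 |
| A1W01 | A1 | 0:13:27 | 1196 | 429 | 767 | 162 | 0 |
| A1W02 | A1 | 0:12:54 | 1528 | 741 | 787 | 214 | 0 |
| A1W03 | A1 | 0:11:16 | 1255 | 603 | 652 | 162 | 3 |
| Subtotal | A1 | 1:00:24 | 6989 | 2506 | 4483 | 939 | 4 |
| A2M01 | A2 | 0:12:49 | 1648 | 605 | 1043 | 266 | 2 |
| A2M02 | A2 | 0:13:43 | 2189 | 1180 | 1009 | 257 | 37 |
| A2M03 | A2 | 0:10:18 | 1460 | 485 | 975 | 215 | 1 |
| A2W01 | A2 | 0:13:54 | 1444 | 846 | 598 | 212 | 1 |
| A2W02 | A2 | 0:14:38 | 1762 | 994 | 768 | 361 | 2 |
| Subtotal | A2 | 1:05:22 | 8503 | 4110 | 4393 | 1311 | 43 |
| B1M01 | B1 | 0:12:57 | 1688 | 749 | 939 | 175 | 6 |
| B1W01 | B1 | 0:15:23 | 2062 | 1011 | 1051 | 319 | 2 |
| B1W02 | B1 | 0:17:08 | 2173 | 1111 | 1062 | 363 | 12 |
| B1W03 | B1 | 0:14:54 | 2262 | 1026 | 1236 | 313 | 27 |
| B1W04 | B1 | 0:13:57 | 1514 | 1011 | 503 | 135 | 1 |
| Subtotal | B1 | 1:14:19 | 9699 | 4908 | 4791 | 1305 | 48 |
| B2M01 | B2 | 0:18:59 | 2695 | 1676 | 1019 | 263 | 46 |
| B2M02 | B2 | 0:15:07 | 1794 | 1081 | 713 | 215 | 1 |
| B2W01 | B2 | 0:12:57 | 1975 | 904 | 1071 | 160 | 14 |
| B2W02 | B2 | 0:16:56 | 2392 | 1780 | 612 | 126 | 24 |
| B2W03 | B2 | 0:15:47 | 2423 | 1417 | 1006 | 262 | 14 |
| Subtotal | B2 | 1:19:46 | 11279 | 6858 | 4421 | 1026 | 99 |
| C1M01 | C1 | 0:16:59 | 2416 | 1464 | 952 | 298 | 11 |
| C1M02 | C1 | 0:13:24 | 1845 | 1095 | 750 | 228 | 47 |
| C1W01 | C1 | 0:17:35 | 2841 | 1825 | 1016 | 325 | 23 |
| C1W02 | C1 | 0:14:11 | 2694 | 1875 | 819 | 179 | 76 |
| C1W03 | C1 | 0:18:19 | 2569 | 1608 | 961 | 309 | 16 |
| Subtotal | C1 | 1:20:28 | 12365 | 7867 | 4498 | 1339 | 173 |
| C2M01 | C2 | 0:14:16 | 1750 | 1162 | 588 | 189 | 5 |
| C2M02 | C2 | 0:15:54 | 2348 | 1518 | 830 | 330 | 23 |
| C2W01 | C2 | 0:20:00 | 2762 | 1756 | 1006 | 390 | 26 |
| C2W02 | C2 | 0:14:40 | 2564 | 1487 | 1077 | 287 | 24 |
| C2W03 | C2 | 0:17:14 | 2833 | 1743 | 1090 | 207 | 50 |
| Subtotal | C2 | 1:22:04 | 12257 | 7666 | 4591 | 1403 | 128 |
| TOTAL | | 7:22:23 | 61092 | 33915 | 27177 | 7323 | 495 |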
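
As an illustration of how the word counts relate to one another, the Python sketch below derives the interviewer words and the learners' share of all words for each level from the level totals reported in the first table. The names `level_words`, `interviewer_words` and `learner_share` are hypothetical and chosen for this example only.

```python
# (total words, learner words) per CEFR level, copied from the tables.
level_words = {
    "A1": (6989, 2506),
    "A2": (8503, 4110),
    "B1": (9699, 4908),
    "B2": (11279, 6858),
    "C1": (12365, 7867),
    "C2": (12257, 7666),
}

# Interviewer words are whatever the learner did not produce.
interviewer_words = {
    level: total - learner for level, (total, learner) in level_words.items()
}

# Fraction of all words at each level that were produced by the learner.
learner_share = {
    level: learner / total for level, (total, learner) in level_words.items()
}

corpus_total = sum(total for total, _ in level_words.values())
corpus_learner = sum(learner for _, learner in level_words.values())

for level in level_words:
    print(f"{level}: interviewer={interviewer_words[level]}, "
          f"learner share={learner_share[level]:.1%}")
print(f"Corpus: {corpus_total} words, "
      f"learner share={corpus_learner / corpus_total:.1%}")
```

Computed this way, the learners' share of the words grows from roughly a third at A1 to well over half at the C levels, in line with the longer learner turns one would expect at higher proficiency.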

SELECTED PUBLICATIONS