Laboratorio de Lingüística Informática

C-ORAL-ROM

Essential information

Text length

In the informal section:

In the formal section, the text length is defined according to the following rules:

The corpus design matrix has been approximated in each language collection as in the following table:

Section / Context / Domain      Requirements  Italian  French   Spanish  Portuguese
                                (words)       (words)  (words)  (words)  (words)
INFORMAL                        150000        155048   140341   169625   167059
  Family-private                124500        128696   113287   131734   134511
    Monologue                   42000         45213    42436    42718    45939
    Dialogue-Conversation       82500         83483    70851    89016    88572
  Public                        25500         26352    27054    37891    32548
    Monologue                   6000          6051     6521     6182     7693
    Dialogue-Conversation       19500         20301    20533    31709    24855
FORMAL                          150000        156544   129618   165846   152483
  Natural context               65000         68324    47862    72573    66151
  Media                         60000         61638    54176    62809    62018
  Telephone                     25000         26582    27580    30464    24314
TOTAL                           300000        311592   269959   335471   319542
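As a quick check of how closely each collection approximates the design matrix, the section-level figures from the table can be compared with the required word counts. The following is a minimal Python sketch. The dictionary layout and the function name `deviation_pct` are illustrative assumptions and are not part of the C-ORAL-ROM distribution.

```python
# Required and actual word counts per section, copied from the design matrix table.
REQUIRED = {"INFORMAL": 150000, "FORMAL": 150000, "TOTAL": 300000}
ACTUAL = {
    "Italian":    {"INFORMAL": 155048, "FORMAL": 156544, "TOTAL": 311592},
    "French":     {"INFORMAL": 140341, "FORMAL": 129618, "TOTAL": 269959},
    "Spanish":    {"INFORMAL": 169625, "FORMAL": 165846, "TOTAL": 335471},
    "Portuguese": {"INFORMAL": 167059, "FORMAL": 152483, "TOTAL": 319542},
}


def deviation_pct(actual: int, required: int) -> float:
    """Signed deviation of an actual word count from its target, in percent."""
    return 100.0 * (actual - required) / required


for language, counts in ACTUAL.items():
    # Consistency check: the two sections add up to the reported total.
    assert counts["INFORMAL"] + counts["FORMAL"] == counts["TOTAL"]
    report = ", ".join(
        f"{section} {deviation_pct(counts[section], REQUIRED[section]):+.1f}%"
        for section in REQUIRED
    )
    print(f"{language:<11}{report}")
```

A positive value means the collection exceeds the design target for that section, and a negative value means it falls short.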


Acoustic quality

C-ORAL-ROM favours the collection of corpora in natural environments, even though this necessarily lowers the acoustic quality of the resource. Moreover, within the frame of a new multilingual project, C-ORAL-ROM has drawn on the rich archives that the providers built up over years of research on spoken languages; the acoustic quality and the recording conditions of the resource therefore vary.

The following are the requirements for the acoustic format and for the recording apparatus:

The speech files of the acoustic database are rated on a quality scale that takes into account recording, volume, voice overlap and noise. The scale ranges from the highest level of clarity of the voice signal down to low levels of acoustic quality.

  1. Digital recordings with DAT or MiniDisc apparatus and unidirectional microphones, or analogue recordings of very high quality.

  2. Digital recordings with poorer microphone response, or analogue recordings with:

    • Good microphone response
    • Low background noise
    • Low percentage of overlapped utterances
    • F0 computing possible in most of the file


  3. Low quality analogue recordings with:

    • Poor microphone response
    • Background noise
    • Average percentage of overlapped utterances
    • F0 computing possible in many parts of the files

The quality is gauged spectrographically. Sessions in which F0 analysis is not significant are excluded from sampling. The acoustic quality of each recording and the most relevant data on the recording condition are always recorded in the metadata of each text.


Speech and label files

For each spontaneous speech recording session, the following files are delivered in the folders of the multimedia corpus:

  1. Speech files: uncompressed .WAV files (Windows PCM: 22,050 Hz; 16 bit).
  2. Transcripts in CHAT format, enriched with the annotation of terminal and non-terminal prosodic breaks and with the alignment information, in .TXT files.
  3. The text-to-speech alignment files: XML files in WinPitchCorpus format.
  4. DTD of the WinPitchCorpus alignment format.
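The stated audio format can be checked against the header of a delivered speech file. The following is a minimal sketch using Python's standard `wave` module. The function name `check_wav` and the returned dictionary are illustrative assumptions, and the number of channels is only reported, since the source does not state it.

```python
import wave

EXPECTED_RATE_HZ = 22050    # sampling rate stated for the speech files
EXPECTED_SAMPLE_WIDTH = 2   # 16 bit = 2 bytes per sample


def check_wav(path: str) -> dict:
    """Read a WAV header and report whether it matches the stated format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    return {
        "channels": channels,            # reported only, not checked
        "duration_s": frames / rate,     # length of the recording in seconds
        "conforms": rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_WIDTH,
    }
```

Calling `check_wav` on a speech file (for example a hypothetical local path such as `"session.wav"`) returns its duration and a `conforms` flag that is true only when both the sampling rate and the sample width match.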

Speech files and transcription files are in one-to-one correspondence. The following is the general table of the C-ORAL-ROM multimedia corpus:

Language     WAV files  GB    Duration (hh.mm.ss)  Utterances  Words in txt files
French       206        3.77  26.21.43             19546       256271
Italian      204        5.19  36.16.10             40402       311592
Portuguese   152        4.43  29.43.42             38855       317920
Spanish      210        4.56  31.06.00             35588       335471
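The figures in the general table can be combined into simple derived measures such as words per minute and words per utterance. The sketch below assumes that the Duration column is written as hours.minutes.seconds; the function name `duration_to_seconds` and the `CORPUS` dictionary are illustrative and not part of the corpus distribution.

```python
def duration_to_seconds(duration: str) -> int:
    """Convert a duration such as '26.21.43' (assumed hh.mm.ss) to seconds."""
    hours, minutes, seconds = (int(part) for part in duration.split("."))
    return hours * 3600 + minutes * 60 + seconds


# (duration, utterances, words in txt files) per language, copied from the table.
CORPUS = {
    "French":     ("26.21.43", 19546, 256271),
    "Italian":    ("36.16.10", 40402, 311592),
    "Portuguese": ("29.43.42", 38855, 317920),
    "Spanish":    ("31.06.00", 35588, 335471),
}

for language, (duration, utterances, words) in CORPUS.items():
    minutes = duration_to_seconds(duration) / 60
    print(
        f"{language:<11}"
        f"{words / minutes:7.1f} words/min"
        f"{words / utterances:7.1f} words/utterance"
    )
```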

For each session, the following files are also delivered:

  1. The transcription of each session in CHAT format in .TXT files (without the alignment information).
  2. The C-ORAL-ROM transcription of each session in .XML files.
  3. DTD of the C-ORAL-ROM .XML format.
  4. Metadata in CHAT format.
  5. Metadata in IMDI format.
  6. The C-ORAL-ROM transcription of each session with part-of-speech and lemma annotation for each form, in .TXT files.
  7. Tag set adopted in .TXT files.
  8. Frequency lists of lemmas and frequency lists of forms, in .TXT files.
  9. Measurements of the language values recorded in each text, in the Excel files "measurements_language.xls".
  10. Line diagrams presenting the trend observed with regard to the standard text variation parameters along the corpus structure's nodes, in the Excel file "Multilingual graphics.xls".

The Spanish corpus

C-ORAL-ROM
Informal
Type Sub-Type Code Words Time (s) Utterances Dialogic turns
familiar/private conversation efamcv01.xml 1656 555 335 232
familiar/private conversation efamcv02.xml 1554 434 257 139
familiar/private conversation efamcv03.xml 1588 495 395 268
familiar/private conversation efamcv04.xml 1572 475 227 124
familiar/private conversation efamcv05.xml 1528 409 257 145
familiar/private conversation efamcv06.xml 1496 381 224 135
familiar/private conversation efamcv07.xml 1593 370 263 185
familiar/private conversation efamcv08.xml 1636 482 308 197
familiar/private conversation efamcv09.xml 1542 497 298 182
familiar/private conversation efamcv10.xml 1568 384 242 117
familiar/private conversation efamcv11.xml 1553 458 237 143
familiar/private conversation efamcv12.xml 1555 385 212 130
familiar/private conversation efamcv13.xml 1549 501 248 162
familiar/private conversation efamcv14.xml 1568 389 277 182
familiar/private conversation efamcv15.xml 1537 424 223 156
familiar/private dialogue efamdl01.xml 1534 391 241 141
familiar/private dialogue efamdl02.xml 1548 350 241 147
familiar/private dialogue efamdl03.xml 1589 428 241 138
familiar/private dialogue efamdl04.xml 1509 478 196 116
familiar/private dialogue efamdl05.xml 1592 519 224 112
familiar/private dialogue efamdl06.xml 1563 500 335 214
familiar/private dialogue efamdl07.xml 1551 550 244 87
familiar/private dialogue efamdl08.xml 1531 394 230 93
familiar/private dialogue efamdl09.xml 1556 438 204 95
familiar/private dialogue efamdl10.xml 1570 397 206 99
familiar/private dialogue efamdl11.xml 1507 475 290 149
familiar/private dialogue efamdl12.xml 1539 519 344 153
familiar/private dialogue efamdl13.xml 1556 435 217 128
familiar/private dialogue efamdl14.xml 1537 564 287 162
familiar/private dialogue efamdl15.xml 1548 488 145 65
familiar/private dialogue efamdl16.xml 1432 455 178 84
familiar/private dialogue efamdl17.xml 1535 472 228 83
familiar/private dialogue efamdl18.xml 1570 376 168 85
familiar/private dialogue efamdl19.xml 1523 383 195 110
familiar/private dialogue efamdl20.xml 1538 481 299 164
familiar/private dialogue efamdl21.xml 1574 506 284 176
familiar/private dialogue efamdl22.xml 1574 344 225 158
familiar/private dialogue efamdl23.xml 806 262 133 74
familiar/private dialogue efamdl24.xml 1498 449 232 143
familiar/private dialogue efamdl25.xml 1585 449 252 139
familiar/private dialogue efamdl26.xml 1564 590 272 174
familiar/private dialogue efamdl27.xml 1565 604 269 157
familiar/private dialogue efamdl28.xml 1502 443 229 127
familiar/private dialogue efamdl29.xml 1553 435 173 94
familiar/private dialogue efamdl30.xml 1642 462 243 81
familiar/private dialogue efamdl31.xml 1536 342 166 96
familiar/private dialogue efamdl32.xml 1522 416 210 151
familiar/private dialogue efamdl33.xml 1621 431 294 226
familiar/private dialogue efamdl34.xml 1484 400 173 122
familiar/private dialogue efamdl35.xml 1531 453 315 169
familiar/private dialogue efamdl36.xml 936 287 96 41
familiar/private dialogue efamdl37.xml 1517 350 245 149
familiar/private dialogue efamdl38.xml 1525 596 224 139
familiar/private dialogue efamdl39.xml 1525 447 264 151
familiar/private dialogue efamdl40.xml 1556 358 256 163
familiar/private dialogue efamdl41.xml 1534 446 289 178
familiar/private dialogue efamdl42.xml 1562 501 242 116
familiar/private monologue efammn01.xml 4597 2021 490 1
familiar/private monologue efammn02.xml 4523 1336 315 35
familiar/private monologue efammn03.xml 4571 1418 440 1
familiar/private monologue efammn04.xml 4512 1383 231 1
familiar/private monologue efammn05.xml 3133 1352 401 13
familiar/private monologue efammn06.xml 3196 1490 288 1
familiar/private monologue efammn07.xml 4495 1528 276 1
familiar/private monologue efammn08.xml 4567 1453 350 1
familiar/private monologue efammn09.xml 3049 1332 51 2
familiar/private monologue efammn10.xml 4586 1630 293 1
public conversation epubcv01.xml 1670 700 406 221
public conversation epubcv02.xml 1544 451 309 198
public dialogue epubdl01.xml 1616 602 210 116
public dialogue epubdl02.xml 1529 466 151 91
public dialogue epubdl03.xml 1547 496 149 81
public dialogue epubdl04.xml 1499 431 192 140
public dialogue epubdl05.xml 1580 473 256 150
public dialogue epubdl06.xml 1530 493 249 143
public dialogue epubdl07.xml 1555 534 216 133
public dialogue epubdl08.xml 1471 470 186 94
public dialogue epubdl09.xml 1523 534 167 106
public dialogue epubdl10.xml 1553 399 342 234
public dialogue epubdl11.xml 1522 390 168 112
public dialogue epubdl12.xml 1559 524 245 193
public dialogue epubdl13.xml 1551 486 210 175
public dialogue epubdl14.xml 1477 450 278 150
public dialogue epubdl15.xml 1550 511 185 91
public dialogue epubdl16.xml 1538 873 321 159
public dialogue epubdl17.xml 1537 480 177 122
public dialogue epubdl18.xml 1584 362 188 131
public monologue epubmn01.xml 1522 831 179 1
public monologue epubmn02.xml 4489 1700 110 1
Formal-Natural context
Type Sub-Type Code Words Time (s) Utterances Dialogic turns
business dialogue enatbu01.xml 3005 954 353 243
business dialogue enatbu02.xml 3056 1040 185 91
business monologue enatbu03.xml 2973 1407 71 1
conference monologue enatco01.xml 2995 1291 156 1
conference monologue enatco02.xml 3014 1255 105 1
conference monologue enatco03.xml 3135 2135 114 1
conference monologue enatco04.xml 3131 1121 136 1
law conversation enatla01.xml 3160 1013 239 106
law monologue enatla02.xml 3043 1006 106 1
political debate conversation enatpd01.xml 2964 997 125 22
political debate conversation enatpd02.xml 3091 989 151 15
prof. explanation monologue enatpe01.xml 2996 1042 181 1
prof. explanation conversation enatpe02.xml 3095 815 234 102
prof. explanation conversation enatpe03.xml 2866 981 268 165
prof. explanation monologue enatpe04.xml 3106 1009 150 1
preaching monologue enatpr01.xml 985 419 69 3
preaching monologue enatpr02.xml 1579 553 62 1
preaching monologue enatpr03.xml 1706 994 92 1
preaching monologue enatpr04.xml 306 164 21 1
preaching monologue enatpr05.xml 600 349 78 1
preaching monologue enatpr06.xml 1648 939 127 1
public speech monologue enatps01.xml 2993 1145 128 2
public speech conversation enatps02.xml 3124 1035 100 13
teaching dialogue enatte01.xml 3124 1082 180 57
teaching conversation enatte02.xml 3061 812 310 132
teaching monologue enatte03.xml 3108 1275 163 11
teaching monologue enatte04.xml 3060 1409 239 11
Formal-Media
Genre Code Words Time (s) Utterances Dialogic turns
interviews emedin01.xml 1509 504 108 25
interviews emedin02.xml 1536 505 74 20
interviews emedin03.xml 1590 492 50 21
interviews emedin04.xml 1478 448 88 31
interviews emedin05.xml 1527 449 111 39
meteo emedmt01.xml 518 152 34 1
meteo emedmt02.xml 519 161 21 1
meteo emedmt03.xml 554 178 27 1
news emednw01.xml 1596 483 67 12
news emednw02.xml 1637 512 72 15
news emednw03.xml 1546 473 70 8
news emednw04.xml 1555 458 76 22
news emednw04_1.xml 831 244 34 9
news emednw04_2.xml 732 213 42 13
news emednw05.xml 1535 437 81 25
news emednw05_1.xml 1535 437 73 22
news emednw05_2.xml 1535 437 8 3
news emednw06.xml 1611 554 68 19
news emednw06_1.xml 1611 554 37 8
news emednw06_2.xml 1611 554 31 11
reportages emedrp01.xml 1491 503 132 61
reportages emedrp01_1.xml 1491 503 107 52
reportages emedrp01_2.xml 1491 503 27 9
reportages emedrp02.xml 1558 641 164 47
reportages emedrp02_1.xml 1558 641 30 8
reportages emedrp02_2.xml 1558 641 57 11
reportages emedrp02_3.xml 1558 641 18 14
reportages emedrp02_4.xml 1558 641 59 14
reportages emedrp03.xml 1520 626 80 14
reportages emedrp03_1.xml 1520 626 41 7
reportages emedrp03_2.xml 1520 626 41 7
reportages emedrp04.xml 1526 606 118 37
reportages emedrp04_1.xml 1526 606 55 13
reportages emedrp04_2.xml 1526 606 14 8
reportages emedrp04_3.xml 1526 606 22 9
reportages emedrp04_4.xml 1526 606 27 7
reportages emedrp05.xml 1512 573 79 21
reportages emedrp05_1.xml 1512 573 42 11
reportages emedrp05_2.xml 1512 573 37 10
reportages emedrp06.xml 1548 704 101 23
reportages emedrp06_1.xml 1548 704 42 8
reportages emedrp06_2.xml 1548 704 57 15
reportages emedrp07.xml 1557 576 91 21
scientific press emedsc01.xml 1516 527 104 53
scientific press emedsc02.xml 1492 529 124 71
scientific press emedsc03.xml 1578 549 176 102
scientific press emedsc04.xml 1522 572 103 26
sport emedsp01.xml 1540 803 93 1
sport emedsp02.xml 1537 375 202 111
sport emedsp03.xml 1557 403 133 70
sport emedsp04.xml 1587 484 155 79
sport emedsp05.xml 1581 638 105 34
sport emedsp06.xml 1528 466 154 76
talk show emedts01.xml 1516 399 133 99
talk show emedts02.xml 1499 458 101 65
talk show emedts03.xml 1523 524 75 26
talk show emedts04.xml 1549 515 69 8
talk show emedts05.xml 1484 535 236 158
talk show emedts06.xml 1489 588 95 43
talk show emedts07.xml 1534 475 153 73
talk show emedts08.xml 1555 572 183 95
talk show emedts09.xml 1599 471 87 45
talk show emedts10.xml 1567 541 183 99
talk show emedts11.xml 1527 638 140 93
Telephone
Genre Code Words Time (s) Utterances Dialogic turns
telephone etelef01.xml 1175 368 269 154
telephone etelef02.xml 1135 333 216 152
telephone etelef03.xml 719 231 178 119
telephone etelef04.xml 73 22 30 17
telephone etelef05.xml 89 26 24 22
telephone etelef06.xml 1539 394 255 180
telephone etelef07.xml 328 119 91 58
telephone etelef08.xml 5376 1721 941 567
telephone etelef09.xml 1759 509 441 272
telephone etelef10.xml 2048 577 407 261
telephone etelef11.xml 519 148 96 71
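The session tables above list one recording per row, with a label of one or more words followed by the file code and four numeric fields. A row in this layout can be parsed by splitting from the right, so that multi-word labels such as "prof. explanation monologue" stay intact. The following is a minimal Python sketch; the `Session` type and the `parse_row` function are illustrative names, not part of the corpus tools.

```python
from typing import NamedTuple


class Session(NamedTuple):
    label: str       # type and sub-type (or genre), possibly several words
    code: str        # file name, e.g. "efamcv01.xml"
    words: int
    time_s: int      # duration in seconds
    utterances: int
    turns: int       # dialogic turns


def parse_row(line: str) -> Session:
    """Split a table row from the right so the leading label is kept whole."""
    label, code, words, time_s, utterances, turns = line.rsplit(maxsplit=5)
    return Session(label, code, int(words), int(time_s), int(utterances), int(turns))


# Three rows copied from the tables above.
rows = [
    "familiar/private conversation efamcv01.xml 1656 555 335 232",
    "prof. explanation monologue enatpe01.xml 2996 1042 181 1",
    "telephone etelef04.xml 73 22 30 17",
]
sessions = [parse_row(row) for row in rows]
total_words = sum(session.words for session in sessions)
```

Once parsed, the rows can be summed or filtered by label or by code prefix, for example to total the words of one section.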