ctb
CTB 8.0 contains the following types of files:
Newswire: [0001-0325, 0400-0454, 0500-0540, 0600-0885, 0900-0931, 4000-4050] (suffix .nw.raw)
Magazine articles: [0590-0596, 1001-1151] (suffix .mz.raw)
Broadcast news: [2000-3145, 4051-4111]
Broadcast conversations: [4112-4197]
Weblogs: [4198-4411]
Discussion forums: [5000-5558]
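
The ID ranges above can be turned into a small lookup that returns the genre for a given file number. This is a minimal sketch: the genre keys and the function name `genre_of` are illustrative choices, not names from the corpus.

```python
from typing import Optional

# ID ranges copied from the list above (inclusive on both ends).
# The dictionary keys are illustrative labels, not official CTB names.
GENRE_RANGES = {
    "newswire": [(1, 325), (400, 454), (500, 540), (600, 885),
                 (900, 931), (4000, 4050)],
    "magazine": [(590, 596), (1001, 1151)],
    "broadcast_news": [(2000, 3145), (4051, 4111)],
    "broadcast_conversation": [(4112, 4197)],
    "weblog": [(4198, 4411)],
    "discussion_forum": [(5000, 5558)],
}


def genre_of(file_id: int) -> Optional[str]:
    """Return the genre label for a file number, or None if it is in no range."""
    for genre, ranges in GENRE_RANGES.items():
        if any(lo <= file_id <= hi for lo, hi in ranges):
            return genre
    return None
```

For instance, `genre_of(590)` falls in the magazine range even though it sits between two newswire ranges.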
Of these, the following files can be used as gold data:
The following is a list of files that are double-annotated and can be
regarded as gold standard files.
CTB-1 (69 files, 22,316 words)
chtb_001.fid - chtb_043.fid
chtb_144.fid - chtb_169.fid
CTB-3 (32 files, 12,027 words)
chtb_900.fid - chtb_931.fid
CTB-4 (7 files, 13,828 words)
chtb_1018.fid
chtb_1020.fid
chtb_1036.fid
chtb_1044.fid
chtb_1060.fid
chtb_1061.fid
chtb_1072.fid
CTB-5 (6 files, 15,052 words)
chtb_1118.fid
chtb_1119.fid
chtb_1132.fid
chtb_1141.fid
chtb_1142.fid
chtb_1148.fid
Total: 114 files, 63,223 words (12.46% of the corpus)
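
The list above can be expanded into explicit file names, which also makes it easy to check the stated total of 114 files. This is a minimal sketch: the function name `gold_file_names` is hypothetical, and the names follow the zero-padding shown in the list, which may differ from the padding used on disk in a given release.

```python
from typing import List


def gold_file_names() -> List[str]:
    """Expand the gold-standard list into file names as written in the list."""
    ids = list(range(1, 44)) + list(range(144, 170))   # CTB-1: 001-043, 144-169
    ids += list(range(900, 932))                        # CTB-3: 900-931
    ids += [1018, 1020, 1036, 1044, 1060, 1061, 1072]   # CTB-4
    ids += [1118, 1119, 1132, 1141, 1142, 1148]         # CTB-5
    # At least three digits, matching "chtb_001.fid" and "chtb_1018.fid".
    return ["chtb_{:03d}.fid".format(i) for i in ids]
```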
Content extraction for each file type
chtb_0001.nw.raw ~ chtb_0931.nw.raw
Example:
<S ID=1>
</S>
chtb_4000.nw.raw ~ chtb_4050.nw.raw
Example:
<seg id="4">
韩国国立兽医科学检疫院检测后确认,该养鸭场发现了高致病性禽流感病毒。
</seg>
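
A sketch of how the `<seg id="...">` markup shown above might be read with a regular expression. It assumes every segment has a quoted `id` and a matching `</seg>`, as in the example; the function name `extract_seg` is hypothetical.

```python
import re
from typing import List, Tuple

# Captures the quoted id and the text between the tags, without
# the surrounding whitespace or newlines.
SEG_RE = re.compile(r'<seg id="([^"]*)">\s*(.*?)\s*</seg>', re.DOTALL)


def extract_seg(text: str) -> List[Tuple[str, str]]:
    """Return (id, sentence) pairs for every <seg> element in the text."""
    return SEG_RE.findall(text)
```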
chtb_1001.mz.raw ~ chtb_1151.mz.raw
Example:
<S ID=18718>
文.谢淑芬图.薛继光
</S>
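
The `<S ID=n>` markup differs from `<seg>` in that the ID is an unquoted number. A minimal sketch, assuming each sentence is closed by `</S>` as in the example; `extract_s` is a hypothetical name, and empty sentences are dropped by default as a design choice of this sketch.

```python
import re
from typing import List, Tuple

# The ID attribute is unquoted and numeric in the examples.
S_RE = re.compile(r"<S ID=(\d+)>\s*(.*?)\s*</S>", re.DOTALL)


def extract_s(text: str, keep_empty: bool = False) -> List[Tuple[int, str]]:
    """Return (sentence_id, sentence) pairs from <S ID=n> ... </S> markup."""
    pairs = [(int(sid), body) for sid, body in S_RE.findall(text)]
    if keep_empty:
        return pairs
    return [(sid, body) for sid, body in pairs if body]
```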
chtb_2000.bn.raw ~ chtb_3145.bn.raw
Example:
<TEXT>
当年朝鲜战争中的韩国难民生还者拒绝接受美国总统克林顿星期四的声明,克林顿对当年美国军人打死韩国平民表示遗憾。
代表这些生还者的发言人表示:“克林顿的声明只是文过饰非。”
并誓言要将此案送交国际法庭。
克林顿总统在声明中对1950年7月老根米村附近发生的事件深表遗憾。
说那次的事件留下战争悲剧和战争创伤的痛苦记忆。
后来国防部长科恩表示:“美国将为当年伤亡的平民树立一座纪念碑,并且设立一个奖学金纪念战争死难者。”
死难者家属要求美国明确道歉,并且给予直接赔偿。
</TEXT>
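
In this format the sentences carry no IDs and sit one per line inside a `<TEXT>` block. A sketch of the extraction under that assumption (one sentence per non-blank line, as in the example); `extract_text_lines` is a hypothetical name.

```python
import re
from typing import List

TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def extract_text_lines(text: str) -> List[str]:
    """Return the non-blank lines found inside every <TEXT> block."""
    lines = []
    for block in TEXT_RE.findall(text):
        lines.extend(ln.strip() for ln in block.splitlines() if ln.strip())
    return lines
```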
chtb_4051.bn.raw ~ chtb_4111.bn.raw
Example:
<segment id="10" start="303.376" end="305.365385407">
这个草案已经是修改过一次的。
</segment>
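
Here each segment also carries `start` and `end` attributes (presumably time offsets in seconds, though the notes do not say). A sketch that keeps them alongside the text, assuming the attribute order shown in the example; `extract_segments` is a hypothetical name.

```python
import re
from typing import Dict, List, Union

# Assumes the attributes appear in the order id, start, end, as in the example.
SEGMENT_RE = re.compile(
    r'<segment id="([^"]*)" start="([^"]*)" end="([^"]*)">\s*(.*?)\s*</segment>',
    re.DOTALL,
)


def extract_segments(text: str) -> List[Dict[str, Union[str, float]]]:
    """Return one dict per <segment>, with start/end parsed as floats."""
    return [
        {"id": sid, "start": float(start), "end": float(end), "text": body}
        for sid, start, end, body in SEGMENT_RE.findall(text)
    ]
```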
chtb_4112.bc.raw ~ chtb_4197.bc.raw
Example:
EMPTY
父母外出打工后,孩子留在了农村的家中。
他们被称为留守儿童。
外出打工的父母也是很无奈的,实际上他做出这种选择是非常无奈的。
中国一点二亿农民常年在外地务工,产生了近两千万留守儿童。
我想跟他们说,我好想他们。
他们日复一日年复一年的在孤独中期盼着被爱,在残缺中守望着亲情。
对孩子的情感交流,我觉得特别重要。
在这个条件容许的时候,多回家看看孩子。
EMPTY
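
These files appear to be plain text, one sentence per line, with `EMPTY` lines mixed in. A sketch that treats `EMPTY` as a placeholder to skip; that reading is an assumption based on the example, and `extract_plain_lines` is a hypothetical name.

```python
from typing import List


def extract_plain_lines(text: str) -> List[str]:
    """Return non-blank lines, skipping lines that consist only of EMPTY."""
    result = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: a bare "EMPTY" line marks a turn with no content.
        if line and line != "EMPTY":
            result.append(line)
    return result
```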
chtb_4198.wb.raw ~ chtb_4411.wb.raw
Example:
<seg id="4">
在TVBS与辩论会同时播出的节目上,李敖说,他要告公视的状纸都写好了。
</seg>
chtb_5000.raw ~ chtb_5558.raw
Example:
<su id=p1su3> (the value after "su id=" can take various forms)
怎么有关方面就不明白呢?
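
Because the `su` IDs vary in form and the example shows no closing tag, a line-based reader may be safer than a paired-tag regex here. A minimal sketch under those assumptions: an `<su id=...>` line opens a unit, and the following non-tag lines belong to it. The function name `extract_su` is hypothetical.

```python
import re
from typing import List, Optional, Tuple

# The id is unquoted and free-form, so accept anything up to ">".
SU_OPEN = re.compile(r"<su id=([^>\s]+)>")


def extract_su(text: str) -> List[Tuple[str, str]]:
    """Return (su_id, text) pairs; text is the lines following each <su> line."""
    units: List[Tuple[str, str]] = []
    current_id: Optional[str] = None
    buf: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        match = SU_OPEN.fullmatch(line)
        if match:
            if current_id is not None:
                units.append((current_id, "".join(buf)))
            current_id, buf = match.group(1), []
        elif line and current_id is not None and not line.startswith("<"):
            # Ordinary text line; any other tag line is ignored.
            buf.append(line)
    if current_id is not None:
        units.append((current_id, "".join(buf)))
    return units
```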