Journal of International Technology and Information Management
Document Type
Article
Abstract
A speech corpus is a database of audio files containing spoken words or sentences together with their text transcriptions. In this work we present a data collection system for creating speech corpora from DVDs of movies and TV series. Generating a corpus from these DVDs is a significantly lower-cost solution compared to the conventional way of obtaining a speech corpus. It also takes less time to collect the data and process it into a corpus. To perform this operation, the Data Collection Toolkit is introduced. This toolkit is an application developed in C# on .NET Framework 3.5 using Visual Studio 2008. Throughout the presented work, the toolkit is used to show how it can simplify the process of creating a corpus.
Recommended Citation
Kepuska, Veton Z. and Rojanasthien, Pattarapong (2011) "Speech Corpus Generation from DVDs of Movies and TV Series," Journal of International Technology and Information Management: Vol. 20: Iss. 1, Article 4.
DOI: https://doi.org/10.58729/1941-6679.1100
Available at: https://scholarworks.lib.csusb.edu/jitim/vol20/iss1/4