A speech corpus is a database of audio files containing spoken words or sentences together with their text transcriptions. In this work we present a data collection system for creating speech corpora from DVDs of movies and TV series. Generating a corpus from these DVDs is a significantly lower-cost solution compared to the conventional way of obtaining a speech corpus. It also takes less time to collect the data and process it into a corpus. To perform this operation, the Data Collection Toolkit is introduced. This toolkit is an application developed in C# on the .NET Framework 3.5 using Visual Studio 2008. Throughout the presented work, the toolkit is used to show how it can simplify the process of creating a corpus.
Kepuska, Veton Z. and Rojanasthien, Pattarapong
"Speech Corpus Generation from DVDs of Movies and TV Series,"
Journal of International Technology and Information Management: Vol. 20, Article 4.
Available at: http://scholarworks.lib.csusb.edu/jitim/vol20/iss1/4