Journal of International Technology and Information Management

A speech corpus is a database of audio files containing spoken words or sentences together with their text transcriptions. In this work we present a data collection system for creating speech corpora from movie and TV series DVDs. Generating a corpus from these DVDs is a significantly lower-cost solution compared to the conventional way of obtaining a speech corpus. It also takes less time to collect the data and process it into a corpus. To perform this operation, the Data Collection Toolkit is introduced. This toolkit is an application developed in C# on .NET Framework 3.5 using Visual Studio 2008. Throughout the presented work, the toolkit is used to show how it can simplify the process of creating a corpus.