Journal of International Technology and Information Management

Speech Corpus Generation from DVDs of Movies and TV Series

Document Type

Article

Abstract

A speech corpus is a database of audio files containing spoken words or sentences together with their text transcriptions. In this work we present a data collection system for creating speech corpora from DVDs of movies and TV series. Generating a corpus from these DVDs is a significantly lower-cost solution compared to the conventional way of obtaining a speech corpus. It also takes less time to collect the data and process it into a corpus. To perform this operation, the Data Collection Toolkit is introduced. This toolkit is an application developed in C# on .NET Framework 3.5 using Visual Studio 2008. Throughout the presented work, the toolkit is used to show how it can simplify the process of creating a corpus.

Recommended Citation

Kepuska, Veton Z. and Rojanasthien, Pattarapong (2011) "Speech Corpus Generation from DVDs of Movies and TV Series," Journal of International Technology and Information Management: Vol. 20: Iss. 1, Article 4.
DOI: https://doi.org/10.58729/1941-6679.1100
Available at: https://scholarworks.lib.csusb.edu/jitim/vol20/iss1/4

Included in

Management Information Systems Commons
