Juraj Citorík. Hlasové ovládání pro efektivní editaci textu. Bachelor thesis. Charles University in Prague, Faculty of Mathematics and Physics.


Charles University in Prague
Faculty of Mathematics and Physics

BACHELOR THESIS

Juraj Citorík

Hlasové ovládání pro efektivní editaci textu

Department of Software Engineering

Supervisor of the bachelor thesis: RNDr. Jakub Lokoč, Ph.D.
Study programme: Computer Science
Curriculum: General Computer Science

Prague 2013

None of this would be possible without the support of my family. Thank you. I would also like to thank my thesis supervisor, RNDr. Jakub Lokoč, Ph.D., for helpful advice, patience and trust. I would like to extend my sincere gratitude to those who went above and beyond to help me, including Martina Černíková, Lukáš urovský, Zuzana Abelovská, Dominik Me hart, Martin Klepá, Tomáš Susedik, Michal Bilanský, Samuel Bartoš and Vendula Michlíková.

I declare that I carried out this bachelor thesis independently, and only with the cited sources, literature and other professional sources.

I understand that my work relates to the rights and obligations under the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that Charles University in Prague has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 paragraph 1 of the Copyright Act.

In ... date ... Signature

Thesis title: Hlasové ovládání pro efektivní editaci textu
Author: Juraj Citorík
Department: Department of Software Engineering
Supervisor of the bachelor thesis: RNDr. Jakub Lokoč, Ph.D., Department of Software Engineering

Abstract (translated from the Slovak): The aim of this thesis is to provide an introduction to digital sound processing and speech recognition. The text describes several selected speech descriptors and related algorithms. These are used in the implementation of a simple voice-controlled text editor and a .NET library. The descriptors are compared with regard to speed and accuracy when used in a command recognition system for a text editor, in both a speaker-dependent and a speaker-independent setting. The class library provides a simple way to implement speaker-dependent voice control over a restricted domain of commands in an arbitrary program. The text editor lets the user assign voice commands to built-in functions of the program, which allows even inexperienced users to use advanced functions without first having to learn, for example, keyboard shortcuts. Moreover, this approach is language-independent and is usable by people with speech impairments, which currently widespread solutions do not allow. The results of the experiments show that, given a recording of sufficient quality, the presented descriptors and algorithms are efficient enough for command recognition in a speaker-dependent system.

Keywords: voice control, digital sound processing

Title: Voice control for effective text editing
Author: Juraj Citorík
Department: Department of Software Engineering
Supervisor: RNDr. Jakub Lokoč, Ph.D., Department of Software Engineering

Abstract: The aim of this thesis is to provide a comprehensive introduction to digital sound processing and speech recognition. Selected speech recognition features as well as algorithms are introduced and utilized in a voice controlled text editor and a .NET class library. The performance of the features is evaluated in both speaker-dependent and speaker-independent recognition of commands related to text editing. The library provides a straightforward way of implementing speaker-dependent, domain-constrained voice recognition in an arbitrary application. It is used in a simple voice controlled text editor. The editor allows the user to assign voice commands to built-in actions. In this way, it is possible for inexperienced users to access and use advanced features of the program without having to learn complex workflows. Moreover, this approach is language-agnostic and can even be used by people with speech impairments, as opposed to the majority of presently used voice recognition systems.
The results of the experiments indicate that, given a recording of sufficient quality, the presented features and algorithms provide an effective means to implement a speaker-dependent speech recognition system, which can be used in a voice controlled text editor.

Keywords: voice recognition, digital sound processing

Contents

Introduction
1 Speech and Sound Processing Fundamentals
    Sound
        Properties of Sound
    Digital Signal Processing Fundamentals
        Sound Digitization Process
        Waveform and Frequency Spectrum
        The Discrete Fourier Transform
    Windowing
        Spectral Leakage
        Window Characteristics
        Windows
    Speech
        Production of Speech
        Speech Signal Characteristics
        Speech Perception
        A Simple Speech Production Model
    Speech Recognition Features
        Mel Frequency Cepstral Coefficients
        Linear Predictive Coding
        Perceptual Linear Prediction
    Measuring Level of Similarity Between Speech Signals
        Similarity of Feature Vectors
        Dynamic Time Warping
    Voice Activity Detection
2 Related Works
    Types of Speech Recognition Systems
        Speaker-dependent Systems
        Speaker-independent Systems
        Speaker-adapting Systems
    Modern Speech Recognition
        Acoustic Models
        Language Models
        Decoding
    Summary
3 Implementation
    Available Speech Recognition Feature Extraction Tools
        MARSYAS
        Yaafe
        openSMILE
    .NET Class Library Implementation
        Using openSMILE to extract speech features
        Object Model
    Using the Voice Control Library
        Prerequisites
        Basic usage
        Advanced Usage
4 A Voice Controlled Text Editor
    Motivation
    A List of Possible Commands
    A Simple Voice Controlled Text Editor
        Usage Guide
        Implementation
    Further Challenges
5 Experiments
    Goals
    Evaluated Features
    Methodology
        List of Commands Used
        Speaker-dependent Speech Recognition
        Speaker-independent Speech Recognition
    Results
        Speaker-dependent Recognition
        Speaker-independent Recognition
        Recognition Success Rates of Individual Commands
        Extraction Time
    Discussion
        Speaker-dependent speech recognition
        Speaker-independent speech recognition
        Recognition Success Rates of Individual Commands
        Extraction Time
    Summary
Conclusion
Bibliography

Introduction

Speech has always been the dominant means of human communication. This preference for spoken language communication, however, hasn't yet been reflected in the way humans interact with computers. Most computers utilize a graphical user interface, which depends on keyboard input and mouse clicks. Nowadays, consumers place a premium on simplicity of use and therefore it is essential to create human-computer interfaces that allow for a more natural interaction, a gentle learning curve and thus higher productivity. One of the means to achieve this is speech recognition.

In the following chapters we will explore the fundamentals of digital sound processing and speech recognition. This includes basic properties of sound and human speech, digitization of sound, speech recognition features and algorithms. These will be used as a foundation to build a voice controlled text editor and a .NET class library. Finally, the performance of the various speech recognition features will be assessed in both speaker-dependent and speaker-independent systems for voice control, with a focus on commands related to text editing.

Motivation

Novice users and expert users alike can benefit from a spoken language interface. It is frequently difficult for novice users to control a new program, where simple actions require knowledge of the program's interface conventions and involve manipulating several windows, checkboxes and sliders. This is in stark contrast to the simplicity of merely saying what the user wants to do. Moreover, professional users can use voice commands to avoid unnecessary obstacles in their workflow. For instance, a graphic designer might invoke a text formatting command using his or her voice while using the mouse to pinpoint the area to which the desired action should be applied. This results in a more fluid workflow and thus increased productivity.

Speech recognition features

Speech recognition features describe the speech signal in a way that allows us to find, for a spoken command, the matching command already present in the voice command database. In the following chapters, we will explore various speech recognition features, such as Mel Frequency Cepstral Coefficients, as well as speech processing algorithms and techniques that allow us to create a voice-controlled text editor. We will evaluate the performance of the features in Chapter 5. The criteria of interest to us are speed of extraction and accuracy.

Voice control for text editing

The set of actions used in a typical text editing task is limited, which allows us to provide the user with a list of available commands and prompt him or her to record the corresponding phrase. In this way, a local speaker-dependent voice command database is created. When the user later issues a voice command, this database is used to determine the action that corresponds to the spoken utterance. The fact that the user creates his or her own command database renders it usable for people with mild speech impairments and people who speak with a strong accent. It is also language-agnostic.

Speaker-independent speech recognition

We will also examine the performance of a speaker-independent voice recognition system, which will use a database of text editing commands recorded by several users. We will assess the accuracy of this database, as well as its performance when different underlying speech recognition features are used.

Chapters overview

Chapter 1 contains an introduction to digital sound processing and speech recognition. It also provides descriptions of three speech recognition features: Mel Frequency Cepstral Coefficients, Linear Predictive Coding and Perceptual Linear Prediction. Algorithms for tackling voice activity detection and command matching are presented as well.
Chapter 2 discusses related works and techniques used in state-of-the-art speech recognition.
Chapter 3 includes details about the implementation of the .NET voice control library.
Chapter 4 contains the specification of the voice controlled text editor. A set of commands to be used in the program is determined. This chapter also specifies problems that have to be tackled so as to implement a functional solution.
Chapter 5 specifies the methodology of feature performance testing and discusses the results of the tests.

1. Speech and Sound Processing Fundamentals

1.1 Sound

Sound is a wave of pressure caused by an oscillating object that propagates through a medium, such as air. The wave includes zones where air molecules are in a compressed configuration and zones where air molecules are less compressed. The former zones are called compressions and the latter zones are called rarefactions. The alternating configurations of compression and rarefaction can be depicted by a graph of a sine wave, as shown in Figure 1.1.

We often use the term signal or audio signal to refer to a sound wave. A signal is a function that conveys information about the behaviour or attributes of some phenomenon. Throughout the remainder of this text we will use the terms speech and speech signal interchangeably.

1.1.1 Properties of Sound

The sound wavelength $\lambda$ is the distance between two subsequent maximal compressions or minimal rarefactions, as depicted in Figure 1.1. The frequency $f$ of a sound is defined as:

$$f = \frac{c}{\lambda}$$

where $c$ is the speed of sound in the corresponding environment.

Signal amplitude is the amount by which the signal differs from zero. Signal magnitude is defined as the absolute value of amplitude. Signal power is defined as magnitude squared.

In practice, due to the vast range of sound pressure levels the human ear can detect, it is convenient to use a logarithmic decibel scale for sound intensity.
The sound pressure level (SPL) in decibels (dB) is a comparison of the sound's pressure $P$ to the reference pressure $P_0$, which is equal to 0 dB and corresponds to the threshold of human hearing (TOH):

$$SPL\,(\mathrm{dB}) = 20 \log_{10}\left(\frac{P}{P_0}\right)$$

Decibel levels of selected sounds are shown in Table 1.1.

Figure 1.1: The sine wave can be used to describe a sound wave. The peaks of the sine curve correspond to maximal compression, while the valleys indicate minimal rarefaction. Two important properties of a sound wave, amplitude and wavelength, are shown in the figure as well.

Table 1.1: Intensity and decibel levels of various sounds.[1]
Columns: Sound; dB level; times higher than TOH.
Rows: threshold of hearing, light whisper, quiet conversation, normal conversation, heavy truck traffic, pain threshold of ear, sonic boom, rocket engine.

1.2 Digital Signal Processing Fundamentals

1.2.1 Sound Digitization Process

To be able to work with sound signals on a computer, we have to digitize them. In this process, a continuous analog signal is turned into a discrete signal by a process called sampling (Figure 1.2). This process is carried out by an analog-to-digital converter (ADC), which samples the continuous signal at regular intervals and stores the amplitude corresponding to a particular moment in a sample. The sample rate is the number of times an analog signal is measured (sampled) per second. The conversion of the amplitude of each sample to a binary number is called quantization. The number of bits used for quantization is referred to as bit depth. Bit depth and sample rate (sampling frequency) are the two most important factors that determine the quality of a digital audio system[5].

A digital signal can only contain a limited range of frequency components. The limit is given by the sampling frequency, as stated in the Nyquist Sampling Theorem:

Theorem 1. For a signal sampled at a rate R, the highest frequency that the signal can contain is R/2.
Otherwise, distortion (aliasing) appears in the digitized signal.

1.2.2 Waveform and Frequency Spectrum

With the exception of pure sine waves, sounds are made up of many different frequency components vibrating at the same time. The particular characteristics of a sound are the result of the unique combination of frequencies it contains. Sounds contain energy in different frequency ranges, or frequency bands. Once we obtain the frequency components of a signal, we can represent the signal in the frequency domain with a spectrogram, as shown in Figure 1.3, which also displays the waveform of the signal, representing the amplitude of the signal in the time domain.

Figure 1.2: Digitization of a simple sound signal with various sample rates[7]

Figure 1.3: Speech signal waveform and its spectrogram. The light spots indicate high intensity of the frequency components at the corresponding time.

1.2.3 The Discrete Fourier Transform

To obtain the frequency spectrum of a signal, the Fast Fourier Transform (FFT) is used, which is an efficient implementation of the Discrete Fourier Transform (DFT). The DFT of a signal $x_N[n]$ is defined as

$$X_N[k] = \sum_{n=0}^{N-1} x_N[n]\, e^{-j 2\pi n k / N}, \qquad 0 \le k < N$$

And since

$$e^{-j 2\pi n k / N} = \cos(2\pi n k / N) - j \sin(2\pi n k / N)$$

we can intuitively see that the DFT is a decomposition of a signal into a linear combination of sines and cosines of various frequencies.

Figure 1.4 shows the FFT of the signal from Figure 1.3, which is sampled at 14 kHz. Note that it contains information about frequencies up to 7 kHz, which corresponds to the Nyquist frequency of a signal sampled at 14 kHz. The FFT of a simple signal is shown in Figure 1.5.

A straightforward implementation of the above formula yields a computational complexity of $O(n^2)$, but if the FFT algorithm is used to compute the transform, the complexity is $O(n \log n)$. A detailed description of the FFT algorithm can be found in [1].

The result of an N-point transform is a vector containing N frequency bins.
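The DFT definition above can be evaluated directly, which is a convenient way to check one's understanding of it. The following is a minimal sketch in Python (not the thesis's .NET implementation): the helper names `dft` and `sine_frame`, the 8 kHz sample rate, the 64-point frame and the 1 kHz test tone are all assumptions chosen for illustration. It uses the direct $O(n^2)$ sum rather than an FFT.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) evaluation of X[k] = sum_n x[n] * exp(-j*2*pi*n*k/N)."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def sine_frame(freq_hz, sample_rate_hz, n_samples):
    """Sample a unit-amplitude sine wave (hypothetical helper for this sketch)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


# Illustrative values, not taken from the thesis.
SAMPLE_RATE_HZ = 8000
N = 64
frame = sine_frame(1000, SAMPLE_RATE_HZ, N)
magnitudes = [abs(c) for c in dft(frame)]

# For a real-valued input the bins above N/2 mirror the lower ones, so only
# bins 0..N/2 (frequencies up to the Nyquist frequency) need to be inspected.
peak_bin = max(range(N // 2 + 1), key=lambda k: magnitudes[k])
peak_freq_hz = peak_bin * SAMPLE_RATE_HZ / N  # bin k starts at k * fs / N

print("peak bin:", peak_bin, "frequency (Hz):", peak_freq_hz)
```

The frame was chosen to hold a whole number of cycles of the tone, so the energy lands in a single bin; an FFT routine would return the same values, only faster.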
Each bin contains intensity information about a range of frequencies. The range is determined by N, the sampling rate of the signal and the index of the bin. For instance, the frequency band corresponding to a bin with a zero-based index $k$, in a transform of size $N$ of a signal sampled at $f_s$ Hz, is:

$$k \left(\frac{f_s}{N}\right) \ \text{to} \ (k + 1) \left(\frac{f_s}{N}\right) \ \text{Hz}$$

The Inverse Discrete Fourier Transform

To transform a signal from the frequency domain back to the time domain, the Inverse Discrete Fourier Transform (IDFT) is used:

$$x_N[n] = \frac{1}{N} \sum_{k=0}^{N-1} X_N[k]\, e^{j 2\pi n k / N}, \qquad 0 \le n < N$$

The Discrete Cosine Transform

Closely related to the Discrete Fourier Transform and frequently used in speech recognition is the Discrete Cosine Transform (DCT). The DCT of a real-valued signal $x[n]$ is defined as

$$C[k] = \sum_{n=0}^{N-1} x[n] \cos\left(\pi k \left(n + 1/2\right) / N\right), \qquad 0 \le k < N$$

Compared to the DFT, it is often the case that the coefficients of the DCT are concentrated at lower indices, which means we can approximate the signal with fewer coefficients.[1]

Figure 1.4: Fourier transform of the signal from Figure 1.3

Figure 1.5: Fourier transform of a 440 Hz sine wave

1.3 Windowing

In digital signal processing (DSP), a signal is rarely processed in one piece. The signal is usually divided into frames and the processing is applied to individual frames. However, a problem called spectral leakage may occur, which causes an FFT bin to contain energy from adjacent frequency bins. To counter this, we use windows.

1.3.1 Spectral Leakage

The FFT algorithm assumes that the input signal is periodic throughout all time and that the period length is the same as the length of the input. If the input contains a non-integral number of cycles, spectral leakage occurs. The problem is illustrated in Figure 1.6, which shows the spectrum of a frame taken from the signal whose Fourier transform (computed for the whole signal at once) is shown in Figure 1.5. It is clear that the information about frequency components is inaccurate.
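The effect described above can be reproduced numerically. The sketch below, in Python, compares a frame holding a whole number of sine cycles with a frame holding a non-integral number and measures how much spectral energy ends up outside the strongest bin. The helper names, the frame length of 64 samples and the cycle counts 8 and 8.5 are assumptions made for illustration; the figures in the thesis use a 440 Hz tone instead.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes of DFT bins 0..N/2, computed directly from the definition."""
    n = len(x)
    return [
        abs(sum(x[i] * cmath.exp(-2j * math.pi * i * k / n) for i in range(n)))
        for k in range(n // 2 + 1)
    ]


def sine_cycles(cycles, n_samples):
    """A frame holding `cycles` periods of a unit sine (hypothetical helper)."""
    return [math.sin(2 * math.pi * cycles * i / n_samples)
            for i in range(n_samples)]


def energy_outside_peak(mags):
    """Fraction of the spectral energy that lies outside the strongest bin."""
    total = sum(m * m for m in mags)
    return 1.0 - max(mags) ** 2 / total


N = 64
whole = dft_magnitudes(sine_cycles(8.0, N))    # integral number of cycles
partial = dft_magnitudes(sine_cycles(8.5, N))  # non-integral number of cycles

leak_whole = energy_outside_peak(whole)
leak_partial = energy_outside_peak(partial)
print("leaked energy fraction, whole cycles:  ", leak_whole)
print("leaked energy fraction, partial cycles:", leak_partial)
```

With a whole number of cycles essentially all energy sits in one bin, whereas the half-cycle offset spreads a large share of it over neighbouring bins, which is the inaccuracy the text refers to.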
One way to amend this is to divide the signal into frames that contain only an integral number of cycles. In many cases, however, this is either not possible or a constant frame size is required. Figure 1.6 shows the spectra obtained from a frame containing a non-integral number of cycles, using three types of windows. Spectral leakage is clearly present.

1.3.2 Window Characteristics

Figure 1.7 shows characteristics that describe a window and tell us how it will perform. The center of the main lobe occurs at each frequency bin of the signal. The width of the main lobe is given in bins and determines the frequency resolution of the window. However, as the main lobe narrows, its energy is distributed into its side lobes, which decreases the accuracy of the information about amplitude.

1.3.3 Windows

Applying a window means multiplying the input by the window sample by sample in the time domain, which is equivalent to convolving the spectrum of the input with the spectrum of the window in the frequency domain. Even when no window is used, the input is in fact multiplied by a rectangular window of uniform height, which is why using no window is sometimes referred to as using a Rectangular or Uniform window. The most frequently used windows in speech recognition are the Rectangular, Hann and Hamming windows.

Rectangular (Uniform) Window

An N-sample Rectangular Window is defined by:

$$w(n) = 1, \qquad 0 \le n \le N - 1$$

Figure 1.6: Spectra obtained from a part of a signal containing a 440 Hz sine wave, using no window (also called Rectangular or Uniform window), Hann window and Hamming window.
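The comparison shown in Figure 1.6 can be approximated with a short Python sketch. It is only an illustration under stated assumptions: the Hann and Hamming functions below use the standard textbook definitions with the constants 0.5 and 0.54/0.46, the frame holds 10.5 cycles of a sine in 128 samples rather than the 440 Hz tone from the figure, and `far_leakage` with its guard of 4 bins is a hypothetical measure introduced here, not one used in the thesis.

```python
import cmath
import math


def rectangular(n_samples):
    """Rectangular (uniform) window: every sample is weighted by 1."""
    return [1.0] * n_samples


def hann(n_samples):
    """Standard Hann window, 0.5 * (1 - cos(2*pi*n / (N - 1)))."""
    return [0.5 * (1 - math.cos(2 * math.pi * n / (n_samples - 1)))
            for n in range(n_samples)]


def hamming(n_samples):
    """Standard Hamming window, 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def windowed_spectrum(frame, window):
    """Multiply the frame by the window, then return DFT magnitudes 0..N/2."""
    x = [s * w for s, w in zip(frame, window)]
    n = len(x)
    return [
        abs(sum(x[i] * cmath.exp(-2j * math.pi * i * k / n) for i in range(n)))
        for k in range(n // 2 + 1)
    ]


def far_leakage(mags, guard=4):
    """Fraction of energy in bins more than `guard` bins away from the peak."""
    peak = max(range(len(mags)), key=lambda k: mags[k])
    total = sum(m * m for m in mags)
    far = sum(m * m for k, m in enumerate(mags) if abs(k - peak) > guard)
    return far / total


N = 128
# 10.5 cycles per frame: a non-integral number, so leakage is expected.
frame = [math.sin(2 * math.pi * 10.5 * i / N) for i in range(N)]

results = {}
for name, make_window in [("rectangular", rectangular),
                          ("hann", hann),
                          ("hamming", hamming)]:
    results[name] = far_leakage(windowed_spectrum(frame, make_window(N)))
    print(name, results[name])
```

The tapered windows trade a wider main lobe for much lower side lobes, so far less energy reaches bins distant from the true frequency than with the rectangular window.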
Figure 1.7: Properties of a window function

Figure 1.8: Rectangular window plot and magnitude response. (a) Rectangular window plot. (b) Rectangular window magnitude response.

Figure 1.9: Hamming window plot and magnitude response. (a) Hamming window plot. (b) Hamming window magnitude response.

Hann Window

An N-sample Hann Window is defined by:

$$w(n) = 0.5 \left(1 - \cos\left(\frac{2\pi n}{N - 1}\right)\right), \qquad 0 \le n \le N - 1$$

Hamming Window

An N-sample Hamming Window is defined by:

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N - 1}\right), \qquad 0 \le n \le N - 1$$

Figure 1.10: Hann window plot and magnitude response. (a) Hann window plot. (b) Hann window magnitude response.

Table 1.2: Properties of the Rectangular, Hann and Hamming window.
Columns: Window; -3 dB main lobe width (bins); -3 dB main lobe width (bins) (listed twice in the source); maximum side lobe level (dB).
Rows: Rectangular, Hann, Hamming.

The Rectangular window has great frequency resolution due to a narrow main lobe. On the other hand, the Hann and Hamming windows have better amplitude resolution thanks to their low maximum side lobe level. An overview of th