Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. [1]

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output. [2]

The quality of a speech synthesizer is judged by its similarity to the human voice, and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.

Overview of text processing

[Figure: Overview of a typical TTS system]

A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. [3] Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound.

History

Long before electronic signal processing was invented, there were those who tried to build machines to create human speech. Some early legends of the existence of "speaking heads" involved Gerbert of Aurillac (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294).

In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation, they are [aː], [eː], [iː], [oː] and [uː]). [4] This was followed by the bellows-operated "acoustic-mechanical speech machine" by Wolfgang von Kempelen of Vienna, Austria, described in a 1791 paper. [5] This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837, Charles Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia". Wheatstone's design was resurrected in 1923 by Paget. [6]

In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this device into the VODER, which he exhibited at the 1939 New York World's Fair.

The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories in the late 1940s and completed in 1950. There were several different versions of this hardware device but only one currently survives. The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. Using this device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception of phonetic segments (consonants and vowels).

Early electronic speech synthesizers sounded robotic and were often barely intelligible. However, the quality of synthesized speech has steadily improved, and output from contemporary speech synthesis systems is sometimes indistinguishable from actual human speech.

Electronic devices

The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. In 1961, physicist John Larry Kelly, Jr and colleague Louis Gerstman[7] used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey,[8] where the HAL 9000 computer sings the same song as it is being put to sleep by astronaut Dave Bowman. [9] Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers. [10]

Synthesizer technologies

The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to maximize both characteristics.

The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.

Concatenative synthesis

Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

Unit selection synthesis

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram. [11] An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
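The idea of "determining the best chain of candidate units" can be illustrated with a small search. The sketch below is not the weighted decision tree mentioned above; it swaps in a dynamic-programming search over a target cost and a join cost, which is one common way to frame the problem. The `Unit` fields, the cost functions, and their weights are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str       # phone label this recorded unit realizes
    pitch: float     # mean fundamental frequency in Hz
    duration: float  # length in seconds


def target_cost(unit, want_pitch, want_duration):
    # Assumed weighting: distance between the unit and the desired prosody.
    return abs(unit.pitch - want_pitch) / 100.0 + abs(unit.duration - want_duration) * 10.0


def join_cost(left, right):
    # Assumed: a pitch jump at the boundary stands in for an audible glitch.
    return abs(left.pitch - right.pitch) / 50.0


def select_units(targets, database):
    """targets: list of (phone, pitch, duration); database: list of Unit."""
    candidates = [[u for u in database if u.phone == phone] for phone, _, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("no recorded unit for some target phone")
    # best[i][j] = (cost of the cheapest chain ending in candidate j, back-pointer)
    best = [[(target_cost(u, targets[0][1], targets[0][2]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, targets[i][1], targets[i][2]), back))
        best.append(row)
    # Trace the cheapest chain back from the last target to the first.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(candidates[i][j])
        j = best[i][j][1]
    return chain[::-1]
```

Given a database with a low-pitched and a high-pitched recording of each phone, `select_units` prefers the chain whose units are close to the requested prosody and to each other, rather than the best unit for each position in isolation.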

Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. [12] Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database. [13]

Diphone synthesis

Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA[14] or MBROLA. [15] The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
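As a rough illustration of what "one example of each diphone" means in practice, the sketch below turns a phone sequence into the list of diphone names a synthesizer would look up. The `"_"` silence marker and the `a-b` naming scheme are assumptions for this example, not a convention taken from any particular system.

```python
def to_diphones(phones, silence="_"):
    """Return the diphone names covering a phone sequence, padded with silence."""
    padded = [silence] + list(phones) + [silence]
    # Each diphone spans the transition from one phone to the next.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def missing_diphones(phones, inventory):
    """List the diphones an utterance needs that the (hypothetical) inventory lacks."""
    return [d for d in to_diphones(phones) if d not in inventory]
```

An utterance of N phones needs N + 1 diphones under this scheme, so `to_diphones(["h", "e", "l", "o"])` yields five names. Prosody modification (for example with PSOLA) would be applied to the retrieved recordings afterwards and is not shown here.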

Domain-specific synthesis

Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. [16] The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.

Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English the <r> in words like <clear> /ˈkliːə/ is usually only pronounced when the following word has a vowel as its first letter (e.g. <clear out> is realized as /ˌkliːəɹˈɑʊt/). Likewise in French, many final consonants become no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.
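The "additional complexity" needed for context sensitivity can be sketched as a recording inventory that stores more than one variant per word and picks between them by looking at the next word. The file names, the `linking_r` variant label, and the vowel-letter test are all hypothetical; the test follows the simplified first-letter rule stated above rather than a real phonological analysis.

```python
VOWEL_LETTERS = set("aeiou")

# Hypothetical inventory: some words have a second prerecorded variant.
RECORDINGS = {
    "clear": {"default": "clear.wav", "linking_r": "clear_r.wav"},
    "out": {"default": "out.wav"},
    "skies": {"default": "skies.wav"},
}


def pick_recordings(words):
    """Choose one recording per word, using the linking-r variant before a vowel."""
    chosen = []
    for i, word in enumerate(words):
        variants = RECORDINGS[word]
        next_word = words[i + 1] if i + 1 < len(words) else ""
        use_r = "linking_r" in variants and next_word[:1].lower() in VOWEL_LETTERS
        chosen.append(variants["linking_r" if use_r else "default"])
    return chosen
```

With this inventory, "clear out" selects the variant with the pronounced <r>, while "clear skies" selects the default. The cost is that every affected word must be recorded in each variant.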

Formant synthesis

Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using an acoustic model. Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components.
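A minimal sketch of the acoustic-model idea, under simplifying assumptions: a voiced source (an impulse train at the fundamental frequency) is passed through a cascade of two-pole digital resonators, one per formant. The formant frequencies and bandwidths below are illustrative values for an open vowel, and the parameters are held constant instead of being varied over time as a real synthesizer would do.

```python
import math


def resonator(signal, freq, bandwidth, rate):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    r = math.exp(-math.pi * bandwidth / rate)
    c = -r * r
    b = 2 * r * math.cos(2 * math.pi * freq / rate)
    a = 1 - b - c  # normalizes the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0=120.0,
                     formants=((700, 130), (1200, 70), (2600, 160)),  # assumed (Hz, bandwidth)
                     rate=16000, seconds=0.3):
    n = round(rate * seconds)
    period = int(rate / f0)
    # Voicing source: one impulse per pitch period.
    wave = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bandwidth in formants:
        wave = resonator(wave, freq, bandwidth, rate)  # cascade of formant filters
    peak = max(abs(s) for s in wave) or 1.0
    return [s / peak for s in wave]  # normalized to the range [-1, 1]
```

Changing `f0` alters the perceived pitch, while changing the formant frequencies alters the vowel quality. No recorded speech is involved at any point.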

Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.

Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s Sega arcade machines. [17] Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces. [18]

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.

Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".

HMM-based synthesis

HMM-based synthesis is a synthesis method based on hidden Markov models. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from HMMs themselves based on the maximum likelihood criterion. [19]

Sinewave synthesis

Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles. [20]
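The replacement of formants by pure tones can be sketched as a sum of sinusoids whose frequencies follow the formant tracks frame by frame. The track format (one list of `(frequency, amplitude)` pairs per 10 ms frame) and the sample rate are assumptions for this example; in practice the tracks would come from an analysis of real speech.

```python
import math


def sinewave_speech(tracks, rate=8000, frame_seconds=0.01):
    """tracks: one list per frame of (frequency_hz, amplitude) pairs, one pair per formant."""
    frame_len = round(rate * frame_seconds)
    phases = [0.0] * len(tracks[0])  # running phase per tone keeps frame boundaries continuous
    out = []
    for frame in tracks:
        for _ in range(frame_len):
            sample = 0.0
            for k, (freq, amp) in enumerate(frame):
                phases[k] += 2 * math.pi * freq / rate
                sample += amp * math.sin(phases[k])
            out.append(sample)
    return out
```

Three tones following the first three formants are a typical choice for this kind of signal, but the function accepts any number of tones as long as every frame lists the same number.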

Challenges

Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".

Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence.
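A toy version of such a heuristic might look like the following. The cue-word list, the default reading, and the pronunciation strings are invented for illustration; a real system would derive its statistics from a corpus.

```python
# Hypothetical data: words that tend to precede a verb, and an assumed
# most-frequent reading to fall back on when no cue is present.
VERB_CUES = {"to", "will", "can", "must", "better"}
DEFAULT_SENSE = {"project": "noun"}
PRONUNCIATIONS = {"project": {"noun": "PROJ-ect", "verb": "pro-JECT"}}


def pronounce(words, i):
    """Guess the pronunciation of words[i] from the word just before it."""
    word = words[i].lower()
    if word not in PRONUNCIATIONS:
        return word
    previous = words[i - 1].lower() if i > 0 else ""
    sense = "verb" if previous in VERB_CUES else DEFAULT_SENSE[word]
    return PRONUNCIATIONS[word][sense]
```

On the example sentence "My latest project is to learn how to better project my voice", the first "project" follows "latest" and gets the noun reading, while the second follows "better" and gets the verb reading. The heuristic is easy to fool, which is the point made above about guessing.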

Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words, like "1325" becoming "one thousand three hundred twenty-five." However, numbers occur in many different contexts; when a year or part of an address, "1325" should likely be read as "thirteen twenty-five", or, when part of a social security number, as "one three two five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous.
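The three readings of "1325" can be reproduced with a short sketch. The `context` labels (`"cardinal"`, `"year"`, `"digits"`) are names invented here, and the caller is assumed to have already decided which context applies. The year branch handles only simple four-digit cases; years such as 2000 would need special handling.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def below_100(n):
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Spell out an integer from 0 to 9999."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(below_100(rest))
    return " ".join(parts)


def expand(digits, context="cardinal"):
    if context == "digits":
        return " ".join(ONES[int(d)] for d in digits)
    if context == "year" and len(digits) == 4:
        high, low = int(digits[:2]), int(digits[2:])
        if low == 0:
            tail = "hundred"
        elif low < 10:
            tail = "oh " + ONES[low]
        else:
            tail = below_100(low)
        return below_100(high) + " " + tail
    return cardinal(int(digits))
```

The hard part in a real system is not the spelling-out shown here but choosing the context from the surrounding text.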

Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs.
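One possible "educated guess" for the "St" case is to look at the token that follows. The rule below (read "St" as "Saint" when a capitalized word follows, otherwise as "Street") is an assumed heuristic for illustration and will misfire on inputs such as "Main St. Apartments".

```python
def expand_st(tokens):
    """Expand 'St' / 'St.' to 'Saint' or 'Street' based on the following token."""
    out = []
    for i, token in enumerate(tokens):
        if token.rstrip(".") == "St":
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Assumed heuristic: a capitalized name after "St" suggests "Saint".
            out.append("Saint" if following[:1].isupper() else "Street")
        else:
            out.append(token)
    return out
```

Applied to the tokens of "12 St John St.", this gives "12 Saint John Street". A system that expands every "St" the same way would produce one of the nonsensical readings described above.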

Text-to-phoneme challenges

Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the "sounding out", or synthetic phonics, approach to learning reading.

Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary. As dictionary size grows, so too do the memory space requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches.
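A combined approach can be sketched as a dictionary lookup with a rule-based fallback. The tiny lexicon, the letter-to-sound table, and the ARPABET-style phoneme symbols are all assumptions for this example; real rule sets are far larger and context-dependent.

```python
# Hypothetical exception dictionary: irregular words are listed explicitly.
LEXICON = {"of": "AH V", "the": "DH AH"}

# Naive letter-to-sound rules, tried longest match first.
DIGRAPHS = {"sh": "SH", "ch": "CH", "th": "TH", "ph": "F"}
LETTERS = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}


def apply_rules(word):
    """Sound out a word letter by letter, preferring two-letter rules."""
    phones, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            phones.append(DIGRAPHS[word[i:i + 2]])
            i += 2
        else:
            if word[i] in LETTERS:
                phones.append(LETTERS[word[i]])
            i += 1  # characters without a rule are skipped
    return " ".join(phones)


def to_phonemes(word):
    word = word.lower()
    # Dictionary first; fall back to rules for out-of-vocabulary words.
    return LEXICON.get(word) or apply_rules(word)
```

The rules alone would render "of" with an F sound, which is why the irregular word sits in the dictionary, while a regular word such as "fish" needs no entry at all.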

Some languages, like Spanish, have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful. Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words or words that are not in their dictionaries.

Evaluation challenges

It is very difficult to evaluate speech synthesis systems consistently because there is no agreed objective criterion and different organizations usually use different speech data. The quality of a speech synthesis system depends heavily on the quality of the recordings it is built from. Evaluating a speech synthesis system is therefore often almost the same as evaluating the recording skills behind it.

Recently, researchers have started evaluating speech synthesis systems using a common speech dataset. [21] This may help people compare the technologies themselves rather than the recordings.

Prosodics and emotional content

A recent study in the journal Speech Communication by Amy Drahota and colleagues at the University of Portsmouth, UK, reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling. [22] It was suggested that identification of the vocal features which signal emotional content may be used to help make synthesized speech sound more natural.

Dedicated hardware

Computer operating systems or outlets with speech synthesis

Apple

The first speech system integrated into an operating system was Apple Computer's MacInTalk in 1984. Since the 1980s, Macintosh computers have offered text-to-speech capabilities through the MacInTalk software. In the early 1990s Apple expanded its capabilities, offering system-wide text-to-speech support. With the introduction of faster PowerPC-based computers, it included higher-quality voice sampling. Apple also introduced speech recognition into its systems, which provided a fluid command set. More recently, Apple has added sample-based voices. Starting as a curiosity, the speech system of the Apple Macintosh has evolved into a cutting-edge, fully supported program, PlainTalk, for people with vision problems. VoiceOver was included in Mac OS X Tiger and, more recently, Mac OS X Leopard. The voice shipping with Mac OS X 10.5 ("Leopard") is called "Alex"; it takes realistic-sounding breaths between sentences and offers improved clarity at high read rates.

AmigaOS

The second operating system with advanced speech synthesis capabilities was AmigaOS, introduced in 1985. The voice synthesis was licensed by Commodore International from a third-party software house (Don't Ask Software, now Softvoice, Inc.) and featured a complete system of voice emulation, with both male and female voices and "stress" indicator markers, made possible by advanced features of the Amiga hardware audio chipset. [23] It was divided into a narrator device and a translator library. Amiga Speak Handler featured a text-to-speech translator. AmigaOS treated speech synthesis as a virtual hardware device, so the user could even redirect console output to it. Some Amiga programs, such as word processors, made extensive use of the speech system.

Microsoft Windows

Modern Windows systems use SAPI4- and SAPI5-based speech systems that include a speech recognition engine (SRE). SAPI 4.0 was available on Microsoft-based operating systems as a third-party add-on for systems like Windows 95 and Windows 98. Windows 2000 added a speech synthesis program called Narrator, directly available to users. All Windows-compatible programs could make use of speech synthesis features, available through menus once installed on the system. Microsoft Speech Server is a complete package for voice synthesis and recognition, for commercial applications such as call centers.

Internet

Currently, there are a number of applications, plugins and gadgets that can read messages directly from an e-mail client and web pages from a web browser. Some specialized software can narrate RSS feeds. On one hand, online RSS narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to podcasts. On the other hand, online RSS readers are available on almost any PC connected to the Internet. Users can download generated audio files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or commuting to work.

A growing field in Internet-based TTS technology is web-based assistive technology, e.g. Talklets. This web-based approach to what has traditionally been locally installed software gives many of those who need software for accessibility reasons the ability to access web content from public machines, or from machines belonging to others. While responsiveness is not as immediate as that of locally installed applications, the ability to access it from anywhere is the key benefit of this approach.

Others

Speech synthesis markup languages

A number of markup languages have been established for the rendition of text as speech in an XML-compliant format. The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004. Older speech synthesis markup languages include Java Speech Markup Language (JSML) and SABLE. Although each of these was proposed as a standard, none of them has been widely adopted.
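To give a sense of what such markup looks like, the following Python sketch builds a small SSML document with the standard library's XML module. The speak, break and emphasis elements and the namespace URI come from the SSML 1.0 recommendation; the function name build_ssml, its parameters, and the sample text are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentence: str, pause_ms: int, emphasized: str) -> str:
    """Return an SSML string: a sentence, a pause, then an emphasized phrase."""
    # Root element with the attributes SSML 1.0 expects on <speak>.
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    speak.text = sentence
    # <break> asks the synthesizer to insert a pause of the given length.
    ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    # <emphasis> asks for the enclosed text to be spoken with stress.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = emphasized
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Your train leaves at nine.", 500, "Do not be late."))
```

The resulting string is only the markup; it still has to be passed to a synthesizer that understands SSML in order to be spoken.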

Speech synthesis markup languages are distinguished from dialogue markup languages. VoiceXML, for example, includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech markup.

Applications

Accessibility

Speech synthesis has long been a vital assistive technology tool, and its application in this area is significant and widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The longest-standing application has been in screen readers for people with visual impairment, but text-to-speech systems are now commonly used by people with dyslexia and other reading difficulties, as well as by pre-literate children. They are also frequently employed to aid those with severe speech impairment, usually through a dedicated voice output communication aid.

News service

Sites such as Ananova have used speech synthesis to convert written news to audio content, which can be used for mobile applications.

Entertainment

Speech synthesis techniques are also used in entertainment productions such as games and anime. In 2007, Animo Limited announced the development of a software application package based on its speech synthesis software FineSpeech, explicitly geared towards customers in the entertainment industries and able to generate narration and lines of dialogue according to user specifications. [26]

Software such as Vocaloid can generate singing voices via lyrics and melody. This is also the aim of the Singing Computer project (which uses the GPL software LilyPond and Festival) to help blind people check their lyric input. [27]

References

  1. ^ Jonathan Allen, M. Sharon Hunnicutt, Dennis Klatt, From Text to Speech: The MITalk system. Cambridge University Press: 1987. ISBN 0521306418
  2. ^ Rubin, P., Baer, T., & Mermelstein, P. (1981). An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America, 70, 321-328.
  3. ^ P. H. Van Santen, Richard William Sproat, Joseph P. Olive, and Julia Hirschberg, Progress in Speech Synthesis. Springer: 1997. ISBN 0387947019
  4. ^ History and Development of Speech Synthesis, Helsinki University of Technology, Retrieved on November 4, 2006
  5. ^ Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine ("Mechanism of the human speech with description of its speaking machine," J. B. Degen, Wien).
  6. ^ Mattingly, Ignatius G. Speech synthesis for phonetic and phonological models. In Thomas A. Sebeok (Ed.), Current Trends in Linguistics, Volume 12, Mouton, The Hague, pp. 2451-2487, 1974.
  7. ^ http://query.nytimes.com/search/query?ppds=per&v1=GERSTMAN%2C%20LOUIS&sort=newest NY Times obituary for Louis Gerstman.
  8. ^ Arthur C. Clarke online Biography
  9. ^ Bell Labs: Where "HAL" First Spoke (Bell Labs Speech Synthesis website)
  10. ^ Anthropomorphic Talking Robot Waseda-Talker Series
  11. ^ Alan W. Black, Perfect synthesis for all of the people all of the time. IEEE TTS Workshop 2002. (http://www.cs.cmu.edu/~awb/papers/IEEE2002/allthetime/allthetime.html)
  12. ^ John Kominek and Alan W. Black. (2003). CMU ARCTIC databases for speech synthesis. CMU-LTI-03-177. Language Technologies Institute, School of Computer Science, Carnegie Mellon University.
  13. ^ Julia Zhang. Language Generation and Speech Synthesis in Dialogues for Language Learning, masters thesis, http://groups.csail.mit.edu/sls/publications/2004/zhang_thesis.pdf Section 5.6 on page 54.
  14. ^ PSOLA Synthesis
  15. ^ T. Dutoit, V. Pagel, N. Pierret, F. Bataille, O. van der Vrecken. The MBROLA Project: Towards a set of high quality speech synthesizers of use for non commercial purposes. ICSLP Proceedings, 1996.
  16. ^ L. F. Lamel, J. L. Gauvain, B. Prouts, C. Bouhier, R. Boesch. Generation and Synthesis of Broadcast Messages, Proceedings ESCA-NATO Workshop and Applications of Speech Technology, September 1993.
  17. ^ Examples include Astro Blaster, Space Fury, and Star Trek: Strategic Operations Simulator.
  18. ^ John Holmes and Wendy Holmes. Speech Synthesis and Recognition, 2nd Edition. CRC: 2001. ISBN 0748408568.
  19. ^ The HMM-based Speech Synthesis System, http://hts.sp.nitech.ac.jp/
  20. ^ Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. Speech perception without traditional speech cues. Science, 1981, 212, 947-950.
  21. ^ Blizzard Challenge http://festvox.org/blizzard
  22. ^ The Sound of Smiling
  23. ^ Miner, Jay et al (1991). Amiga Hardware Reference Manual: Third Edition. Addison-Wesley Publishing Company, Inc. ISBN 0-201-56776-8.
  24. ^ Smithsonian Speech Synthesis History Project (SSSHP) 1986-2002
  25. ^ gnuspeech
  26. ^ Speech Synthesis Software for Anime Announced
  27. ^ Free(b)soft - Singing Computer

Specific programs

See also

External links

© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org