Speech Synthesizers


1. Concatenative Synthesis
        This type is based on stringing together segments of recorded speech.  It generally gives the most natural-sounding synthesized speech.  There are three types of concatenative synthesis:

a) Unit Selection Synthesis
       This type uses a large database of recorded speech "utterances" (expressions or statements).  It gives the greatest naturalness, and the output from the best systems is often indistinguishable from real human voices.
b) Diphone Synthesis
      This type uses a minimal speech database containing the diphones (sound-to-sound transitions) that occur in a language.  The quality of the speech is generally not as good as that of unit selection synthesis, but it is more natural-sounding than the output of formant synthesis.  The use of this type of synthesizer in commercial applications is declining.
c) Domain-specific Synthesis
      This type concatenates pre-recorded words and phrases to create complete utterances.  It can only synthesize the combinations of words and phrases that it has been programmed with.  It is therefore limited to applications such as transit schedule announcements and is not used as a general-purpose aid for people with disabilities.
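
The domain-specific approach can be illustrated with a minimal Python sketch.  Everything in it is hypothetical: the phrase inventory, the name `synthesize`, and the short lists of numbers that stand in for recorded audio clips.  It only shows the idea that output is built by joining stored recordings, and that a phrase with no recording cannot be spoken.

```python
# Hypothetical inventory: each phrase maps to a stand-in for a recorded clip.
# A real system would store audio samples; short number lists are used here.
RECORDINGS = {
    "the next bus": [0.1, 0.2],
    "arrives in": [0.3],
    "five": [0.4, 0.5],
    "ten": [0.6],
    "minutes": [0.7, 0.8],
}


def synthesize(phrases):
    """Concatenate the stored clips for the given phrases, in order."""
    output = []
    for phrase in phrases:
        if phrase not in RECORDINGS:
            # Nothing was recorded for this phrase, so it cannot be produced.
            raise KeyError(f"no recording for phrase: {phrase!r}")
        output.extend(RECORDINGS[phrase])
    return output


announcement = synthesize(["the next bus", "arrives in", "five", "minutes"])
```

Asking this sketch for a phrase outside its inventory (for example "twenty") raises an error, which mirrors the limitation described above.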

2. Formant Synthesis
           This type of synthesis does not use any human speech samples; instead, the output speech is created using an acoustic model.  Parameters (such as voicing and noise levels) are varied over time to create a waveform of artificial speech.  As a result, many systems produce robotic-sounding speech.  However, maximum naturalness is not always the goal of speech synthesis.  Sometimes intelligibility is more important: visually impaired people, for example, need a system that remains understandable at high speeds.
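
As a rough illustration of creating a waveform from parameters alone, the Python sketch below passes a simple sound source through a chain of resonators, one per formant.  It is a sketch under stated assumptions, not a description of any particular synthesizer: the sample rate, the pitch parameter `f0_hz`, the formant frequencies and bandwidths, and all function names are chosen here for illustration.  The `voicing` and `noise_level` arguments correspond to the parameters mentioned above.

```python
import math
import random

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def resonator(signal, freq_hz, bandwidth_hz):
    """Second-order digital resonator that emphasizes one formant frequency."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    y1 = y2 = 0.0  # the two previous output samples
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def source(n_samples, f0_hz, voicing, noise_level, seed=0):
    """Mix a pulse train (voiced part) with random noise (unvoiced part)."""
    rng = random.Random(seed)
    period = int(SAMPLE_RATE / f0_hz)  # samples between pulses
    samples = []
    for n in range(n_samples):
        pulse = voicing if n % period == 0 else 0.0
        noise = noise_level * rng.uniform(-1.0, 1.0)
        samples.append(pulse + noise)
    return samples


def synthesize_vowel(duration_s, f0_hz, formants, voicing=1.0, noise_level=0.0):
    """Shape the source with one resonator per (frequency, bandwidth) pair."""
    signal = source(round(duration_s * SAMPLE_RATE), f0_hz, voicing, noise_level)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    return signal


# Rough, illustrative formant values for an "ah"-like vowel (assumed).
AH_FORMANTS = [(700, 80), (1200, 90), (2600, 120)]
wave = synthesize_vowel(0.1, 120, AH_FORMANTS)
```

No recorded speech appears anywhere in the sketch.  Changing the numbers (for example, raising `noise_level` and lowering `voicing`) changes the character of the sound, which is the sense in which the parameters are varied to create artificial speech.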

