web-sls 00.01 2.02.00

WEB-SLS

The European Student Journal of Language and Speech

 


A New Approach to the Evaluation of Vocal Effort by the PSOLA Method

by

A.Tassa# and J.S.Liénard*

# IFA/CNR - Via Fosso del Cavaliere, 00133 Roma - Italy

* LIMSI/CNRS - BP 133, 91403 ORSAY Cedex - France


In this article, a method for changing the prosody and amplitude spectrum of a given signal is presented. In particular, the study focuses on the problem of characterizing vocal effort. It is shown how to transform a vowel uttered, for instance, in a soft voice into the same vowel uttered by the same speaker at a normal or a high voice level. To achieve this goal, the well-known PSOLA technique has been used. The aim is partially achieved: the final sequences show great similarities with the targets. However, the main value of the method lies in the intermediate steps, which allow the role of spectral features in the description of vocal effort to be evaluated.



INTRODUCTION

The present study describes a signal processing tool that matches two speech sequences, applying to one of them the prosodic profile or the spectral evolution of the other. The aim is to transform one signal into the other step by step, in order to evaluate each factor perceptually. As the method relies on auditory judgments, the quality of the signal obtained at each step must be as high as possible. This is the main reason for the choice of the PSOLA method (Pitch Synchronous OverLap and Add). The approach is applied to the study of vocal effort, which is an important component of voice timbre. The speech material consists of a set of isolated French vowels uttered by several speakers (males and females) with varying degrees of vocal effort.

ABOUT VOCAL EFFORT

The main problem in speech analysis is that the signal structures conveying the linguistic meaning are intimately mixed with those conveying information on the speaker as well as on the communication situation. For instance, vocal effort is not just a matter of energy level. A whisper, though electronically amplified, is still perceived as a whisper, and lowering the volume of a shout is not sufficient to transform it into a soft voice. Variations of the vocal effort affect both prosody and spectral features.

Prosody is an aspect of the speech signal which deals with non-verbal information, i.e. information that cannot be coded in a written text. It is realized by variations of energy, fundamental frequency and syllable duration, as well as by pauses. Energy level and fundamental frequency are related to the behaviour of the source of the vocal signal. They are interdependent in a complex way: the volume of air expelled from the lungs increases when speaking loudly, resulting in an increase of the sound level. The increase of pressure on the glottal aperture usually tends to make the vocal cords vibrate more quickly and, consequently, to raise the pitch. A change in duration is a secondary effect often associated with changes in vocal effort: it has been observed that loud sequences almost always last longer than "normal" ones. This effect has been explained on a psychological basis: a speaker who needs to speak loudly wants to be understood in some critical situation (e.g. noisy places, great distances). Therefore, he or she lengthens the utterances, in an unconscious effort to be more intelligible [Fonagy and Fonagy, 1966].

Spectral features are closely related to the behaviour of the vocal tract. When speaking louder, we modify the shape of the vocal tract, which results in a slight shift of some formants towards higher frequencies. According to Traunmüller, "an upward displacement in tonotopic position of the group of the lower formants is perceptually interpreted as an increase in vocal effort" [Traunmüller, 1985]. The spectral structure of the signal (i.e. the envelope of the running amplitude spectrum) is also affected by a change of vocal effort. As a quasiperiodic signal, a vowel has a line spectrum: lines are spaced regularly according to the fundamental frequency. Its envelope, though, remains approximately constant (in spite of changes in fundamental frequency), showing some maxima corresponding to the formants. The identity of a vowel is related to the shape of the envelope and is mainly determined by the frequency and amplitude of the first three formants. However, these parameters are also altered by a change of vocal effort. Di Benedetto and Liénard, working on a speech corpus of 12 isolated French vowels uttered by 13 speakers (males and females) at 3 different voice levels (soft, normal and loud), showed a general increase of the fundamental frequency at a rate of about 5 Hz/dB ([Di Benedetto and Liénard, 1994], [Liénard and Di Benedetto, 1988]). The first formant frequency generally increased by approximately 3.5 Hz/dB, while upper formant positions did not exhibit significant variations. Similar results were found by Schulmann in a CVC context [Schulmann, 1985].

In order to study perceptually the role of the acoustic parameters in the coding of vocal effort information, we needed a tool for changing at will the prosodic and spectral parameters of our vowel signals. Several existing techniques allow such transformations (channel vocoder, LPC vocoder, homomorphic vocoder, phase vocoder, among others). Many studies have been published on the topic of "voice conversion", aiming at the transformation of a male voice into a female voice for synthesis purposes (see for instance [Childers et al, 1989]). The notion of "voice morphing" (analogous to video morphing) has been presented by [Slaney et al, 1996]. In most cases, the signal is decomposed into two parts, excitation (source wave) and resonance (spectral transfer function), and each part is processed separately by using a time alignment with Dynamic Time Warping. The algorithms are complex and do not guarantee a perfect quality of the transformed voice. Instead, as our goal was to study the effects of variations of vocal effort, we chose the PSOLA technique ([Moulines and Charpentier, 1990], [Valbret, Moulines and Tubach, 1992]), which was simple to implement and could provide the excellent quality needed to perform psychoacoustic tests. We will now describe the principle of the PSOLA method and its application to the study of vocal effort.



PSOLA method
Definition
PSOLA is a method used in voice synthesis to create speech material while retaining a good level of naturalness. The acronym stands for Pitch Synchronous OverLap and Add, and it refers to the fact that the speech material is created by concatenating ("overlapping and adding") elementary segments. The duration of those segments is proportional to the pitch period. The method can be used for changing the pitch and duration of an utterance. This transformation is accomplished simply by extracting such segments and recombining them in a way different from the original. The procedure consists of two steps: analysis and synthesis.

Analysis
In the analysis phase, short-time (ST) signals are extracted from the original vocal signal by means of a weighting window. Windows are centered at mark points which constitute the analysis time axis. The duration of the window ($W_m$) is proportional to the local analysis pitch period $d_m$.
In formulas:

$$x_m(t) = x(t)\, h_m(t - t_m), \qquad m = 0, \dots, M$$

$$W_m = \mu\, d_m = \mu\, (t_m - t_{m-1})$$

Where $x(t)$ is the original vocal signal, $h_m(t)$ is the weighting window, $x_m(t)$ is the analysis ST-signal, $t_m$ is the sequence of pitch mark points on the analysis time axis (see fig.1) and $M$ is the total number of pitch periods of the vocal signal. $\mu$ is the proportionality factor; typically $\mu = 2$ is used for a wide-band analysis, corresponding to an overlap of 50% between adjacent pitch periods.
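The analysis step described above can be sketched as follows. This is a minimal illustration, not the program used in this work: it assumes that pitch marks are given as integer sample indices, that the weighting window is a symmetric Hanning window, and that samples outside the signal are treated as zero. The names `hann`, `extract_st_signals` and `mu` (the proportionality factor between window width and local pitch period) are chosen for this example only.

```python
import math


def hann(length):
    """Symmetric Hanning window with `length` samples."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def extract_st_signals(x, marks, mu=2):
    """Return one windowed ST-signal per pitch mark t_m, for m >= 1.

    Each window is centred on t_m and spans mu * (t_m - t_{m-1}) samples,
    plus one sample so that the window has a centre sample.
    """
    segments = []
    for m in range(1, len(marks)):
        period = marks[m] - marks[m - 1]   # local pitch period d_m
        width = mu * period                # window duration W_m
        start = marks[m] - width // 2
        window = hann(width + 1)
        segment = []
        for i in range(width + 1):
            n = start + i
            sample = x[n] if 0 <= n < len(x) else 0.0  # zero outside the signal
            segment.append(sample * window[i])
        segments.append(segment)
    return segments


# Toy input: a constant signal with one pitch mark every 10 samples.
signal = [1.0] * 100
st_signals = extract_st_signals(signal, [10, 20, 30, 40])
```

With a factor of 2 each window covers two pitch periods, so neighbouring ST-signals share half of their samples.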
 
Synthesis
In the synthesis phase, synthetic ST-signals $x_q(t)$ are obtained from the analysis ST-signals by means of a transformation Y.

$$\text{Eq. 1} \qquad x_q(n) = Y(x_m(n)) \qquad \text{with } q = 0, \dots, Q \text{ and } m = 0, \dots, M$$

Where M and Q are the total numbers of pitch periods for the source and the target vocal signals, respectively. If no spectral modification is required, then Y(..) is the identity function.
Finally, the synthetic ST-signals are concatenated by means of a synthesis window $h_q(t)$, whose width is twice the local synthesis pitch period $d_q$. Synthetic ST-signals are centered on the mark points of the synthesis time axis (that is, on $t_q$, with q = 0, ..., Q). Overlap generally occurs, and overlapping samples are added. The resulting synthetic numeric signal is given by:

$$x_{synth}(n) = \frac{\sum_q x_q(n)\, h_q(t_q - n)}{\sum_q h_q^2(t_q - n)}$$

In this work, however, no synthesis window has been used, so that:

$$x_{synth}(n) = \sum_q x_q(n - \theta_q)$$

Where $\theta_q$ (q = 1, ..., Q) is a sequence of local delays for the synthetic sequence. In TABLE I, the relations among pitch modification factor, overlapping and duration of the signal (with respect to the original one) are shown for the case in which the total number of ST-signals is not changed (so that pitch and duration are modified by the same factor).

   Pitch factor        Overlapping      Duration
   = 1/μ               No overlapping   The signal is longer than the original
   between 1/μ and 1   < 50%            The signal is longer than the original
   = 1                 50%              The signal is as long as the original
   > 1                 > 50%            The signal is shorter than the original
TABLE I (μ is the proportionality factor between window duration and pitch period used in the analysis, typically 2)
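The synthesis without a synthesis window, together with two rows of TABLE I, can be illustrated with a toy example. This is only a sketch under simplifying assumptions: ST-signals are short triangular segments two pitch periods wide (the period is 2 samples), marks are integer sample indices, and the function name `overlap_add` is chosen for this example.

```python
def overlap_add(segments, synthesis_marks, length):
    """Sum ST-signals centred on their synthesis marks (no synthesis window)."""
    out = [0.0] * length
    for segment, mark in zip(segments, synthesis_marks):
        start = mark - len(segment) // 2   # local delay of this segment
        for i, value in enumerate(segment):
            n = start + i
            if 0 <= n < length:
                out[n] += value            # overlapping samples are added
    return out


# Three identical toy ST-signals, each two pitch periods wide.
segments = [[0.0, 0.5, 1.0, 0.5, 0.0]] * 3

# Pitch factor 1: marks stay 2 samples apart, windows overlap by 50%.
same_pitch = overlap_add(segments, [2, 4, 6], 9)

# Pitch factor 1/2: marks are 4 samples apart, windows only touch.
half_pitch = overlap_add(segments, [4, 8, 12], 17)
```

In the first case the overlapping halves sum to a flat signal of the original length. In the second case the same three segments are spread over twice the duration and meet at zero-valued samples, which is the "no overlapping" row of the table.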

Mapping
A delicate part of the algorithm is the mapping between the analysis and synthesis time axes. If duration and pitch period are both to be scaled by the same factor α, then a simple "stretching" of the signal is sufficient: the synthetic pitch period becomes $d_q = \alpha\, d_m$. The total duration is then α times the original one and, as the number of ST-signals has not changed, the pitch period is automatically scaled by the same factor α (the fundamental frequency is therefore scaled by 1/α). If duration and pitch must be altered by different factors, a different number of elementary segments is required. In this case the PSOLA method makes use of a very simple elimination-repetition technique. If only pitch is to be modified, the synthesis time axis keeps the same duration, but the local pitch period must be scaled, which changes the total number of ST-signals. If only duration is to be modified, ST-signals must be added (or suppressed) without altering the distance between adjacent pitch marks. If different scale factors are necessary, the duration is scaled first and the pitch is scaled afterwards.
Figure 1 shows a mapping in which both pitch and duration are modified (by different factors), and in which the total number of ST-signals passes from M to Q.
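The elimination-repetition idea can be sketched as follows. This is one possible reading of the mapping, not the program used in this work: it assumes constant scale factors, builds the synthesis time axis mark by mark, and assigns to each synthesis mark the nearest analysis mark on the time-warped axis. The function name `psola_mapping` and its parameters are illustrative.

```python
def psola_mapping(analysis_marks, duration_factor, pitch_factor):
    """Return (synthesis_marks, source_index).

    duration_factor scales the total duration; pitch_factor scales the
    fundamental frequency, so local periods are divided by it.
    source_index[q] is the analysis ST-signal placed at synthesis mark q.
    """
    t0 = analysis_marks[0]
    t_end = t0 + duration_factor * (analysis_marks[-1] - t0)
    synthesis_marks = [float(t0)]
    source_index = [0]
    while True:
        m = source_index[-1]
        # Local analysis period around the ST-signal selected last.
        if m + 1 < len(analysis_marks):
            period = analysis_marks[m + 1] - analysis_marks[m]
        else:
            period = analysis_marks[m] - analysis_marks[m - 1]
        t_next = synthesis_marks[-1] + period / pitch_factor
        if t_next > t_end:
            break
        # Point of the analysis axis that corresponds to t_next.
        t_analysis = t0 + (t_next - t0) / duration_factor
        nearest = min(range(len(analysis_marks)),
                      key=lambda k: abs(analysis_marks[k] - t_analysis))
        synthesis_marks.append(t_next)
        source_index.append(nearest)
    return synthesis_marks, source_index


marks = [0, 10, 20, 30, 40]
# Pitch doubled, duration unchanged: ST-signals are repeated.
raised_marks, raised_index = psola_mapping(marks, 1.0, 2.0)
# Pitch halved, duration unchanged: ST-signals are eliminated.
lowered_marks, lowered_index = psola_mapping(marks, 1.0, 0.5)
# Duration doubled, pitch unchanged: ST-signals are repeated.
longer_marks, longer_index = psola_mapping(marks, 2.0, 1.0)
```

In all three cases the number of ST-signals passes from M to a different Q, while the segments themselves are left untouched.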

Using a pitch synchronous method requires every vocal signal file to be accompanied by a pitch marks file, which indicates where and how to extract the ST-signals (for the pitch marks detection method used here, see Appendix B).
Pitch mark points must be placed at the maxima of the instantaneous energy of the speech signal, so as to best preserve the part of the ST-signal which is least influenced by the neighbouring periods. By doing so, the distortions due to the overlap operation are minimized.



SYNTHETIC SEQUENCES

In this work we consider that each sequence has a specific prosody and a specific short-time spectrum envelope. By PSOLA, the two components can be separated. Therefore, it is possible to create synthetic sequences having the prosody of one sequence and the spectral features of a different one. Perceptual tests should evaluate the resulting perceived vocal effort and point out the relative importance of prosody and spectrum. Each vowel (source) has been associated with the same vowel, uttered by the same speaker at a different voice level (target). The transformation between source and target is made in 3 different ways: by prosodic modifications, by spectral modifications, or by both. Thus, for each pair, it is possible to listen to 3 synthetic sequences.

Prosodic modifications
In this case the source sequence has to change its "rhythm": the synthetic sequence must have the spectrum of the source and the prosody of the target, which means that it must have the local waveform of the source with the pitch marks of the target sequence. This aim can be achieved by PSOLA.
In the analysis, every single ST-signal is extracted from the original sequence by means of a Hanning window two pitch periods wide.
A new mapping is required between the analysis time axis and the synthesis one, i.e. between the source and target pitch marks. The mapping is executed by a specific program which calculates, for every pitch mark of the source sequence and of the target, its position relative to the duration of the entire sequence, and establishes the correspondences between the pitch mark points of the source and those of the target. Another algorithm implements the equalization of the overall energy: the global energy is calculated for each ST-signal, and a normalizing factor (rq) is derived from it.
In the synthesis phase, the ST-signal xm(n) extracted from the source sequence is centered at the corresponding mark point (tq) of the target sequence, after being normalized by the rq factor for energy equalization.

$$r_q = \frac{\sum_n |X_q^{target}(n)|^2}{\sum_i |X_m^{source}(i)|^2} \qquad \text{for } i = 1, \dots, M \text{ and } n = 1, \dots, Q$$

Where $\sum_n |X_q^{target}(n)|^2$ is the energy of the q-th target ST-signal and $\sum_i |X_m^{source}(i)|^2$ is the energy of the m-th analysis ST-signal which corresponds to it, according to the mapping.
It should be noted that this factor represents the Y(...) transformation of Eq. 1, that is:

$$x_q(n) = Y(x_m(n)) = r_q\, x_m(n)$$
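The mapping by relative position and the energy equalization can be sketched as follows. This is an illustration under stated assumptions, not the programs used in this work: the correspondence is taken to be the nearest source mark in relative position, and all function names are chosen for this example. Note also that the factor defined above is a ratio of energies; the sketch applies its square root as the amplitude gain, which is what makes the two energies equal. That choice is an assumption of this example.

```python
import math


def relative_position_mapping(source_marks, target_marks):
    """For each target mark, index of the source mark closest in relative position."""
    def relative(marks):
        span = marks[-1] - marks[0]
        return [(t - marks[0]) / span for t in marks]

    source_pos = relative(source_marks)
    return [min(range(len(source_pos)), key=lambda m: abs(source_pos[m] - p))
            for p in relative(target_marks)]


def energy(segment):
    """Sum of squared samples of an ST-signal."""
    return sum(v * v for v in segment)


def normalizing_factor(source_segment, target_segment):
    """Energy ratio between a target ST-signal and its source counterpart."""
    return energy(target_segment) / energy(source_segment)


def equalize(source_segment, target_segment):
    """Scale the source ST-signal so that its energy equals the target's."""
    gain = math.sqrt(normalizing_factor(source_segment, target_segment))
    return [gain * v for v in source_segment]


# Five source marks mapped onto four target marks of a shorter sequence.
mapping = relative_position_mapping([0, 10, 20, 30, 40], [0, 8, 16, 24])

source_st = [1.0, 1.0]
target_st = [2.0, 2.0]
ratio = normalizing_factor(source_st, target_st)
scaled = equalize(source_st, target_st)
```

Here the source has one more pitch period than the target, so one source ST-signal is left out of the mapping.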

 

Vowel AA. Speaker MB. Transition N-P. 

In the figure it is possible to evaluate the effect of the prosodic modifications: the amplitude spectra of the SOURCE (N level), the TARGET  (P level), the prosodically modified signal (MBchanged) and the completely modified signal (MBmodified) are plotted. The MBchanged curve follows the SOURCE one, as it should (as no spectral modification has been made), except for a vertical shift, which is caused by the energy equalization. The MBmodified curve is very difficult to distinguish, as it fits almost completely the TARGET one, as expected.   
(For the mismatch in the first part, see Appendix A)   

Fig.2

 
By the above transformation, the original waveform is preserved, but prosody has changed.

Spectral modifications
These modifications are achieved in the same way as the prosodic modifications. In this case a one-to-one mapping is used, in order to preserve prosody. A specific program computes the average overall spectrum (see Appendix A) for source and target, and the ratio between these spectra represents the transfer function of the filter which transforms the source signal into the target one.

$$Z(\omega) = \frac{|X_{target}(\omega)|}{|X_{changed}(\omega)|}$$

It must be observed that this ratio is calculated after normalization, in order to discard the intrinsic gain factor caused by energy differences between source and target, which are of no interest in this case. In fact $X_{changed}(\omega)$ is the spectrum of the source sequence after normalization, i.e. $|X_{changed}(\omega)| = |X_{source}(\omega)|\, r$, where r is the average normalization factor.
Each ST-signal is then filtered by this transfer function and put back in its place (as the mapping is one-to-one).
In formulas, this filtering represents, once again, the Y(...) function of Eq. 1, i.e.:

$$x_q(n) = Y(x_m(n)) = z(n) * x_m(n)$$

Where the symbol * stands for the convolution operation and z(n) is the inverse Fourier transform of $Z(\omega)$.
It should be noted that the synthetic ST-signal is computed by multiplying the amplitude spectra, so the phase is not involved.
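The spectral filtering can be sketched with a plain discrete Fourier transform. This is a minimal illustration, not the program used in this work: it uses a naive DFT from the standard library instead of an FFT, assumes that all spectra are sampled on the same number of points as the ST-signal, and adds a small floor (`floor`, an assumption of this example) to avoid dividing by zero. The function names are illustrative.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def amplitude_ratio(target_amplitude, source_amplitude, floor=1e-12):
    """Transfer function: target amplitude divided by source amplitude, per bin."""
    return [t / max(s, floor) for t, s in zip(target_amplitude, source_amplitude)]


def apply_amplitude_filter(segment, z):
    """Multiply the spectrum of an ST-signal by z; the source phase is kept."""
    spectrum = dft(segment)
    return idft([value * gain for value, gain in zip(spectrum, z)])


# Toy ST-signals: an impulse as source, a two-sample pulse as target.
source = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
target = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
source_amp = [abs(v) for v in dft(source)]
target_amp = [abs(v) for v in dft(target)]
z = amplitude_ratio(target_amp, source_amp)
filtered = apply_amplitude_filter(source, z)
filtered_amp = [abs(v) for v in dft(filtered)]
```

After filtering, the amplitude spectrum of the source ST-signal matches that of the target, while its length and position are unchanged.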

Vowel AA. Speaker MB. Transition N-P.  

In the figure it is possible to evaluate the effect of the spectral modifications: the envelope amplitude spectra of the SOURCE, the TARGET and the spectrally modified signal are plotted. The profile of the synthetic sequence follows that of the target, except for a vertical shift due to the lack of energy normalization.

Fig. 3

 
In this way the overall amplitude spectrum has changed (it matches that of the target, except for an amplification factor) while prosody has remained unchanged.

Changing both prosody and spectrum
Both modifications (prosody and spectrum) can be obtained in two steps: first a prosodic modification is carried out, and then a spectral modification is added. Synthetic sequences obtained in this way sound fairly close to the target, even if in many cases significant differences are still perceived, suggesting that some factors remain to be considered. The most important might be the signal-to-noise ratio, since discrepancies appear when manipulating soft voices which present breathiness in their final part.

Final results
In TABLE II it is possible to listen to all the different synthetic sequences obtained for the vowel AA of the male speaker MB, for the transition from normal level to soft level. In TABLE III it is possible to listen to other results obtained for other speakers.
 

SOURCE - Speaker MB. Vowel AA. Normal level
Spectral modifications
Prosodic modifications
Both modifications
TARGET - Speaker MB. Vowel AA. Soft level
TABLE II

 

The first results show that prosodic modifications are the main factors influencing vocal effort. Spectral modifications are related principally to the "colour" and the "shades" of an utterance, and they seem to be much less significant than the prosodic ones. Sequences which have been modified only spectrally generally do not allow the perception of significant differences in vocal effort, though in some cases they may introduce important variations (for instance, the filtered sequence for MLAA.N-P).
 
 

Speaker  Vowel  SOURCE   Spectral modifications  Prosodic modifications  Completely modified  TARGET
AB       AA     Level N  N-L                     N-L                     N-L                  Level L
ML       II     Level L  L-P                     L-P                     L-P                  Level P
ML       II     Level P  P-L                     P-L                     P-L                  Level L
ML       AA     Level N  N-P                     N-P                     N-P                  Level P
TABLE III 

 
 

Discussion
Although this work focuses especially on vocal effort, the method presented is very general, and it can be profitably employed to analyze other features.
Indeed, the method establishes a correspondence, and a particular transformation, between two vocal sequences which are not too different from one another. Therefore, it would be possible to choose other pairs, corresponding to different vowels uttered by different speakers at different voice levels. However, in such a case, the many differences between the two sequences could make it impossible to evaluate the parameters playing a role in the transformation. These parameters are too numerous, and the interactions among the different factors involved in the determination of vowel, speaker and vocal effort are too complicated.
This is the reason why, in this work, only step-by-step transformations are carried out.
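Accordingly, only three kinds of pairs have been related:

  1. the same vowel uttered by the same speaker at different voice levels;
  2. different vowels uttered by the same speaker;
  3. the same vowel uttered at the same voice level by different speakers.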

In the first part of this article, only the first category has been examined; it accounts for the evaluation of vocal effort.
However, the two other associations also led to interesting results.

Changing vowel
Relating sequences corresponding to different vowels leads to an evaluation of the factors involved in the identity of vowels.
The literature reports that "the formant frequency and the fundamental frequency are crucial factors in identifying a vowel" [Di Benedetto and Liénard, 1992], and that "it is demonstrated that the tonotopic distance between the first formant position and the fundamental frequency carries most of the information on vowel openness" [Traunmüller, 1985].
In this test, only spectral modifications were perceived as significant (in informal perceptual tests). It should be stressed, however, that the prosodic modifications did not involve great changes in F0 (less than 40 Hz).
 

Speaker  Source vowel  Prosodic modification  Filtered vowel  Target
MB       AA            AA-II                  AA-II           II
MB       AA            AA-OU                  AA-OU           OU
AM       AA            AA-II                  AA-II           II
AM       AA            AA-OU                  AA-OU           OU
TABLE IV

 

Vowels obtained by prosodic modifications are not of natural sounding quality. Note that the identity of the vowel does not change much from token to token. Filtering introduces the most significant effect. Synthetic II and OU sound almost natural in the case of speaker MB, worse in the case of speaker AM, as the latter involved much higher pitch values and much higher differences among the pitch of the original vowels (thus involving too much distortion in the difference F1-F0).

Changing speaker
Relating sequences corresponding to the same vowel, uttered at the same voice level by different speakers, should lead to considerations about the importance of prosody and spectral features in the determination of speaker identity. This topic, however, is beyond the scope of the present work, and in this case the results were not satisfactory.
Relating sequences from voices whose pitch is not too different results in natural sounding utterances only if the change in intonation and prosody remains within the variations typical of everyday intra-speaker variability. When pitch differences are large (typically when mixing male and female voices), the sound quality is not natural at all, because of the intrinsic limits in the range of application of PSOLA.

The PSOLA method, in fact, simulates the natural concatenation between successive elementary segments by overlapping and adding them, but the quality of the synthetic sequences is not natural if the overlapping is too heavy (about 80% overlapping). Raising the pitch too much may thus cause distortions and alter the perception of the speaker's gender. Likewise, lowering the pitch below a given range leads to side effects (such as creak).


This statement is illustrated by manipulating the vowel AA uttered by the female speaker AB (sequence ABAAN, mean pitch: 210 Hz) and the same vowel uttered by the male speaker JP (sequence JPAAN, mean pitch: 85 Hz).
Giving JP the prosody of AB (sequence mix.JP-AB in tables V and VI) results in a distorted sound, in which the perception of the speaker's gender is no longer clear.
 

ABAAN mix.JP-AB mix.AB-JP JPAAN
TABLE V 

 

The opposite experiment (giving AB the prosody of JP, sequence mix.AB-JP in tables V and VI) produces holes all along the sequence (see figure in table VI). There is no alteration of the speaker's gender (as could have been expected from the first experiment), but the sound presents the characteristics of a female "creaky voice" (see [Klatt and Klatt, 1990]).
 

 

                  MIX.JP-AB                            MIX.AB-JP
TABLE VI 
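The two experiments above can be checked with a short calculation. This is a sketch under simplifying assumptions: each voice is reduced to its mean pitch (210 Hz for AB, 85 Hz for JP), the analysis window is taken to be two source pitch periods wide, and the name `overlap_fraction` is chosen for this example.

```python
def overlap_fraction(source_pitch_hz, target_pitch_hz, mu=2):
    """Fraction of each window shared with the next one.

    The window lasts mu source periods; windows are placed one target
    period apart. A negative value means a gap between windows.
    """
    window = mu / source_pitch_hz
    spacing = 1.0 / target_pitch_hz
    return 1.0 - spacing / window


jp_with_ab_prosody = overlap_fraction(85.0, 210.0)   # about 0.80
ab_with_jp_prosody = overlap_fraction(210.0, 85.0)   # about -0.24
```

Under these assumptions, giving JP the prosody of AB yields an overlap of roughly 80%, the value quoted above as too heavy, while giving AB the prosody of JP leaves a gap between successive windows, which is consistent with the holes observed in the sequence.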

 


CONCLUSION

In the present paper a method for matching vowel signals and transforming one into the other is described and demonstrated. The transformation can be decomposed into a prosodic transformation, which may affect the total duration, local duration, or pitch evolution of the source sequence, and a spectral transformation, which may affect the frequency scale and the amplitude of the spectrum. Using this tool we could produce, in some selected cases, a nearly perfect voice or vocal effort transformation. As the transformation was made in several steps, we could observe that prosodic modifications were perceptually more important than spectral modifications with respect to voice quality and vocal effort. In some cases we could not obtain a good quality of the resulting sequence. This may be attributed to some limitations of the PSOLA method, which is sensitive to the accuracy of the pitch marks and cannot compensate for too large a gap between the sequences to be processed. Thus the method looks adequate for matching and transforming sequences which are close enough to each other. It should be investigated further in order to process sequences produced by speakers of different vocal genders (male, female, child), or representing vowels distant from each other.


Appendix A: The short term spectrum

According to Moulines and Charpentier, "the spectral envelope of the synthetic signal is identical to the short term spectrum X(ω) of the original signal". In their work, they dealt with synthetic signals obtained by repeating a single prototype ST-signal by means of the PSOLA method. In the present case the ST-signals are no longer identical, but a visual inspection of the isolated periods (extracted by a Hanning window) revealed that the amplitude spectra are quite stable in the central part of the sequence (i.e. approximately between 20% and 60% of the entire length of the sequence). Therefore it has been possible to extract a mean amplitude spectrum, averaged over the central periods, and to consider it as the envelope amplitude spectrum of the overall signal.

Each spectrum, thus, is calculated by the following steps:

  1. Determining the central periods which represent the "stable" part of the sequence
  2. Extracting the fundamental ST-signals corresponding to these periods
  3. Forcing the window to FFT_SIZE points
  4. Calculating the amplitude spectrum for each ST-signal
  5. Averaging among all the spectra

The short term amplitude spectrum  computed in this way is considered to represent the amplitude spectrum of the entire sequence.
Step 3 needs more detail: each period is extracted by means of a Hanning window, and then a window of FFT_SIZE points is superimposed. If the ST-signal is shorter than the window, it is zero-padded. If, on the contrary, it is longer, it is simply cut symmetrically at the extremes of the window. FFT_SIZE equals 256 for female voices (corresponding to an average length of 16 msec, thus "covering" voices with pitch higher than 125 Hz) and 512 for male voices.
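The steps listed above, apart from the spectrum computation itself, can be sketched as follows. This is an illustration under stated assumptions: the "stable" part is taken as the periods between 20% and 60% of the sequence, zero padding is split between both ends (the text does not say where the zeros are placed), and a sampling rate of 16 kHz is inferred from the statement that 256 points correspond to 16 msec. All function names are chosen for this example.

```python
def central_period_indices(n_periods, start_percent=20, end_percent=60):
    """Indices of the periods in the assumed stable part of the sequence."""
    first = (n_periods * start_percent) // 100
    last = (n_periods * end_percent) // 100
    return list(range(first, last))


def fit_to_fft_size(segment, fft_size):
    """Zero-pad a short ST-signal, or crop a long one symmetrically."""
    excess = len(segment) - fft_size
    if excess > 0:
        start = excess // 2
        return list(segment[start:start + fft_size])
    pad = -excess
    left = pad // 2                     # assumption: padding on both sides
    return [0.0] * left + list(segment) + [0.0] * (pad - left)


def average_spectra(spectra):
    """Bin-by-bin mean of several amplitude spectra of equal length."""
    count = len(spectra)
    return [sum(column) / count for column in zip(*spectra)]


def lowest_covered_pitch_hz(fft_size, sample_rate_hz, mu=2):
    """Lowest pitch whose mu-period window still fits in fft_size points."""
    return mu * sample_rate_hz / fft_size


padded = fit_to_fft_size([1.0, 2.0, 3.0], 5)
cropped = fit_to_fft_size([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 4)
mean_spectrum = average_spectra([[1.0, 2.0], [3.0, 4.0]])
female_limit = lowest_covered_pitch_hz(256, 16000)
male_limit = lowest_covered_pitch_hz(512, 16000)
```

With a window two pitch periods wide, 256 points at the assumed 16 kHz cover pitches down to 125 Hz, which agrees with the figure given above, and 512 points cover pitches down to 62.5 Hz.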

This procedure is an extreme simplification which does not involve the fundamental frequency. Nevertheless, a peak still remains in the 0-500 Hz zone (it can be noticed in figures 2 and 3), due to the periodicity of the structure. This fact has been reported by Moulines and Charpentier: "the spectral envelope signal critically depends on the spectral resolution of the analysis window" [Moulines and Charpentier, 1990].



Appendix B: the pitch mark file
To apply a pitch synchronous method, information about the evolution of the pitch is needed. In the present work, a pitch marks detection method written in C by C. D'Alessandro (LIMSI-CNRS, http://www.limsi.fr) has been used. It consists of a cascade of different filters which produces a final file containing the pitch marks. In particular, the fundamental frequency is detected according to an algorithm ([Hermes, 1987]) based on the correlation with a comb function (a method proposed by Philip Martin in 1982 [Martin, 1982]).
Some modifications have been made, as the original programs had been written to fit phrases uttered by male speakers and recorded at a lower level.
The pitch mark files coming out of the processing chain are the analysis and synthesis time axes described in the paragraph about mapping.
 


Acknowledgements

This study was carried out entirely at LIMSI-CNRS (Paris, France) during a five-month internship.


REFERENCES

Calliope (1989), "La parole et son traitement automatique", Masson Paris, 1989.

F.Charpentier and E.Moulines (1988), "Text-to-speech algorithm based on FFT-synthesis ", CH2561-9/88/0000, IEEE, S14.4, 1988, pp.667-670.

D.G.Childers, K.Wu, D.M.Hicks and B.Yegnanarayana (1989), "Voice Conversion", Speech Communication, 8,  pp. 147-158.

M.G. Di Benedetto and J.S.Liénard (1992), "Extrinsic normalization of vowel formant based on cardinal vowels mapping", ICSLP, BANFF, October 1992.

M.G.Di Benedetto and J.S.Liénard (1994), "Influence of the vocal effort on vowels", 127th meeting of the Acoustical Society of America, Cambridge, Massachusetts, June 1994.

I.Fonagy and J.Fonagy(1966), "Sound pressure level and duration",  Phonetica 15, pp.14-21.

S.Handel (1989), "Listening", Massachusetts Institute of Technology 1989.

Dik.J.Hermes (1987), "Measurement of pitch by subharmonic summation", J.Acoustical Society of America, Vol.83, N°1, January 1988, pp.257-263.

 D.H.Klatt and L.Klatt (1990), "Analysis, synthesis and perception of voice quality variations among female and male talkers", Journal Acoustical  Society of America, Vol.87 , No 2, February 1990, pp. 820-856.

J.S. Liénard (1977), "Les processus de la communication parlée", Masson Paris, 1977.

J.S.Liénard and M.G.Di Benedetto (1992), "Evaluation  perceptive d'un corpus de voyelles françaises émises isolément par plusieurs locuteurs selon différent forces de voix", 19èmes Journées d' Etude sur la Parole - Bruxelles, May 1992, pp. 469-474.

J.S.Liénard (1997), "Perception et variabilité de la parole", Fondements et perspectives en traitement automatique de la parole, ed. Meloni, AUPELF, 1996.

J.S.Liénard and M.G.Di Benedetto (1999), "Effect of vocal effort on spectral properties of vowels", J.Acoustical Society of America., Vol. 106, No 1, July 1999. 

Ph.Martin (1981), "Extraction de la fréquence fondamentale par intercorrélation avec une fonction peigne", 12èmes Journées d'études sur la Parole - Montreal, May 1981, pp.221-232.

Ph.Martin (1982), "Comparison of pitch detection by cepstrum and spectral comb analysis", Proc. IEEE Int. Conf. Acoustical Speech Signal Process. ICASSP-82, pp.180-183

E.Moulines and F.Charpentier (1990), "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones", Elsevier Science Publishers B.V. (North Holland), Speech Communication, Vol.9, Nos. 5/6, December 1990, pp.453-467.

W.H.Press, B.P.Flannery (1990), "Numerical Recipes in C (The art of Scientific Computing)", Cambridge University Press, chapter 12, pp.398-412.

L.R.Rabiner and R.W.Schafer (1978), "Digital processing of speech signals", Prentice-Hall, Signal processing series, Inc., Alan Oppenheim editor 1987.

R.Schulmann (1985), "Articulatory targeting and perceptual constancy of loud speech", French-Swedish Seminar, ICP, Grenoble, 1985.

R.Schulmann (1988), "Articulatory dynamics of loud and normal speech",  J.Acoustical Society of America, Vol.85, No 1 , January 1988, pp.295-312.

M.Slaney, M.Covell and B.Lassiter (1996), "Automatic audio morphing", IEEE-ICASSP, pp. 1001-1004.

H. Traunmüller (1981), "Perceptual dimension of openness in vocal effort",  J.Acoustical Society of America, Vol.59, No 5 , May 1981, pp.1465-1474.

H.Traunmüller (1985), "The role of the fundamental and the higher formants in the perception of speaker size, vocal effort and vowel openness", French-Swedish Seminar, ICP, Grenoble, 1985.

H.Valbret, E.Moulines and J.P.Tubach (1992), "Voice transformation using PSOLA technique", Speech Communication, 11, n° 2/3, June 1992, pp. 175-187.