WEB-SLS: The European Student Journal of Language and Speech
A New Approach to the Evaluation of Vocal Effort by the PSOLA Method
by
A.Tassa# and J.S.Liénard*
# IFA/CNR - Via Fosso del Cavaliere, 00133 Roma - Italy
* LIMSI/CNRS - BP 133, 91403 ORSAY Cedex - France
In this article, a method for changing the prosody and amplitude spectrum of a given signal is presented. In particular, the study focuses on the problem of characterizing vocal effort. It is shown how to transform a vowel uttered, for instance, in a soft voice into the same vowel, uttered by the same speaker, but at a normal or loud voice level. To achieve this goal, the well-known PSOLA technique has been used. The aim is only partially achieved: the final sequences show great similarities with the targets without matching them completely. However, the importance of the method lies in the intermediate steps, which allow the role of spectral features in the description of vocal effort to be evaluated.
The present study describes a signal processing tool that matches two speech sequences, applying to one of them the prosodic profile or the spectral evolution of the other. The aim is to transform one signal into the other step by step, in order to evaluate each factor perceptually. As the method implies auditory judgments, the quality of the signal obtained at each step must be as high as possible. This is the main reason for the choice of the PSOLA method (Pitch Synchronous OverLap and Add). The approach is applied to the study of vocal effort, which is an important component of voice timbre. The speech material consists of a set of French isolated vowels uttered by several speakers (males and females) with varying degrees of vocal effort.
ABOUT VOCAL EFFORT
The main problem in speech analysis is that the signal structures conveying the linguistic meaning are intimately mixed with those conveying information on the speaker as well as on the communication situation. For instance, vocal effort is not just a matter of energy level. A whisper, though electronically amplified, is still perceived as a whisper, and lowering the volume of a shout is not sufficient to transform it into a soft voice. Variations of the vocal effort affect both prosody and spectral features.
Prosody is an aspect of the speech signal which deals with non-verbal information, i.e. information that cannot be coded in a written text. It is realized by variations of energy, fundamental frequency and syllable duration, as well as by pauses. Energy level and fundamental frequency are related to the behaviour of the source of the vocal signal. They are interdependent in a complex way: the volume of air expelled from the lungs increases when speaking loudly, resulting in an increase of the sound level. The increase of pressure on the glottal aperture usually tends to make the vocal cords vibrate more quickly and, consequently, to produce a pitch increase. A change in duration is a secondary effect often associated with changes in vocal effort. It has been observed that loud sequences almost always last longer than "normal" ones. This effect has been explained on a psychological basis: if the speaker needs to speak loudly, it is because he (or she) wants to be understood in some critical situation (e.g. noisy places, great distances). Therefore, he (or she) lengthens the utterances, as an unconscious effort to be more intelligible [Fonagy and Fonagy, 1966].
Spectral features are strictly related to the behaviour of the vocal tract. When speaking louder, we modify the shape of the vocal tract, which results in a slight shift of some formants towards higher frequencies. According to Traunmüller, "an upward displacement in tonotopic position of the group of the lower formants is perceptually interpreted as an increase in vocal effort" [Traunmüller, 1985]. The spectral structure of the signal (i.e. the envelope of the running amplitude spectrum) is also affected by a change of vocal effort. As a quasi-periodic signal, a vowel has a line spectrum: lines are spaced regularly according to the fundamental frequency. Its envelope, though, remains approximately constant (in spite of changes in fundamental frequency), showing some maxima corresponding to the formants. The identity of a vowel is related to the shape of the envelope and is mainly determined by the frequency and amplitude of the first three formants. However, these parameters are also altered by a change of vocal effort. Di Benedetto and Liénard, working on a speech corpus of 12 French isolated vowels, uttered by 13 speakers (males and females) at 3 different voice levels (soft, normal and loud), showed a general increase of the fundamental frequency at a rate of about 5 Hz/dB ([Di Benedetto and Liénard, 1994], [Liénard and Di Benedetto, 1988]). The first formant frequency generally increased by approximately 3.5 Hz/dB, while upper formant positions did not exhibit significant variations. Similar results were found by Schulmann in a CVC context [Schulmann, 1985].
In order to study perceptually the role of the acoustic parameters in the coding of vocal effort information, we needed a tool for changing at will the prosodic and spectral parameters of our vowel signals. Several existing techniques allow such transformations (channel vocoder, LPC vocoder, homomorphic vocoder, phase vocoder, among others). Many studies have been published on the topic of "voice conversion", aiming at the transformation of a male voice into a female voice for synthesis purposes (see for instance [Childers et al, 1989]). The notion of "voice morphing" (analogous to video morphing) has been presented by [Slaney et al, 1996]. In most cases, the signal is decomposed into two parts, excitation (source wave) and resonance (spectral transfer function), and each part is processed separately by using a time alignment with Dynamic Time Warping. The algorithms are complex and do not guarantee a perfect quality of the transformed voice. Instead, as our goal was to study the effects of variations of vocal effort, we chose the PSOLA technique ([Moulines and Charpentier, 1990], [Valbret, Moulines and Tubach, 1992]), which was simple to implement and could provide the excellent quality needed to perform psychoacoustic tests. We will now describe the principle of the PSOLA method and its application to the study of vocal effort.
PSOLA method
Definition
PSOLA is a method used in voice synthesis to create speech material while retaining a good level of naturalness. The acronym stands for Pitch Synchronous OverLap and Add, and it refers to the fact that the speech material is created by concatenating ("overlapping and adding") elementary segments. The duration of those segments is proportional to the pitch period. The method can be used for changing the pitch and duration of an utterance. This transformation can be accomplished simply by extracting such segments and reassembling them in a way different from the original. The procedure consists of two steps: an analysis step and a synthesis step.
Analysis
In the analysis phase, short-time (ST) signals are extracted from the original vocal signal by means of a weighting window. Windows are centered at mark points which constitute the analysis time axis. The duration of the window ($W_m$) is proportional to the local analysis pitch period $d_m$.
In formulas:
$$x_m(t) = x(t)\, h_m(t - t_m), \qquad m = 0, \ldots, M$$
$$W_m = \mu\, d_m = \mu\, (t_m - t_{m-1})$$
where $x(t)$ is the original vocal signal, $h_m(t)$ is the weighting window, $x_m(t)$ is the analysis ST-signal, $t_m$ is the sequence of pitch mark points on the analysis time axis (see fig. 1) and $M$ is the total number of pitch periods of the vocal signal. $\mu$ is the proportionality factor; typically $\mu = 2$ is used for a wide-band analysis, corresponding to an overlap of 50% between adjacent pitch periods.
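The analysis step can be illustrated with a minimal Python sketch. It is not the program used in this work: the function names, the integer sample indices used as pitch marks, the Hanning window and the choice of borrowing the following period for the first mark are all assumptions made for illustration (`mu` stands for the proportionality factor between window width and local pitch period).

```python
import math

def hann(length):
    """Symmetric Hanning window of the given length."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]

def extract_st_signals(x, marks, mu=2):
    """Return one windowed short-time (ST) signal per pitch mark.

    x     : list of samples
    marks : increasing sample indices of the pitch marks (at least two)
    mu    : ratio between window width and local pitch period
    """
    st_signals = []
    for m, t_m in enumerate(marks):
        # Local pitch period d_m = t_m - t_(m-1); the first mark borrows
        # the following period (an assumption of this sketch).
        d_m = t_m - marks[m - 1] if m > 0 else marks[1] - marks[0]
        half = int(mu * d_m) // 2
        window = hann(2 * half + 1)   # odd length, so the peak sits on t_m
        segment = []
        for k in range(2 * half + 1):
            n = t_m - half + k
            sample = x[n] if 0 <= n < len(x) else 0.0   # zero outside the signal
            segment.append(sample * window[k])
        st_signals.append(segment)
    return st_signals
```

With `mu = 2` and regularly spaced marks, two neighbouring Hanning-windowed ST-signals sum back to the original samples in the region they share, which is why a 50% overlap allows the signal to be rebuilt by simple addition.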
Synthesis
In the synthesis phase, synthetic ST-signals $x_q(t)$ are obtained from the analysis ST-signals by means of a transformation $Y$:
$$x_q(t) = Y\big(x_m(t)\big), \qquad m = 0, \ldots, M, \quad q = 0, \ldots, Q \qquad (1)$$
where $M$ and $Q$ are the total numbers of pitch periods for the source and the target vocal signals, respectively. If no spectral modification is required, then $Y(\cdot)$ is the identity function.
Finally, the synthetic ST-signals are concatenated by means of a synthesis window $h_q(t)$, whose width is twice the local synthesis pitch period $d_q$. Synthetic ST-signals are centered on the mark points of the synthesis time axis (that is, on $t_q$, with $q = 0, \ldots, Q$). Overlap generally occurs, and overlapping samples are added.
The resulting synthetic discrete-time signal is given by:
$$x_{synth}(n) = \frac{\sum_q x_q(n)\, h_q(t_q - n)}{\sum_q h_q^2(t_q - n)}$$
In this work, however, no synthesis window has been used, so that:
$$x_{synth}(n) = \sum_q x_q(n - \theta_q)$$
where $\theta_q$ ($q = 1, \ldots, Q$) is a sequence of local delays for the synthetic sequence. TABLE I shows the relations among pitch modification factor, overlap and duration of the signal (with respect to the original one) in the case where the total number of ST-signals is unchanged, so that pitch and duration are modified by the same factor.
| Pitch factor (μ: window proportionality factor, typically 2) | Overlapping | Duration |
| equal to 1/μ | No overlapping | The signal is longer than the original |
| between 1/μ and 1 | < 50% | The signal is longer than the original |
| equal to 1 | 50% | The signal is as long as the original |
| greater than 1 | > 50% | The signal is shorter than the original |
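The windowless synthesis formula and the overlap figures of TABLE I can be sketched as follows. This is an illustrative sketch only: `overlap_add` and `overlap_fraction` are hypothetical names, the ST-signals are assumed to be lists centred on integer synthesis marks, and the overlap is assumed to be measured as the share of a window of width `mu` periods that is covered by the next window.

```python
def overlap_add(st_signals, synth_marks, length):
    """Sum ST-signals centred on the synthesis marks (no synthesis window)."""
    y = [0.0] * length
    for segment, t_q in zip(st_signals, synth_marks):
        half = len(segment) // 2
        for k, value in enumerate(segment):
            n = t_q - half + k
            if 0 <= n < length:
                y[n] += value          # overlapping samples are simply added
    return y

def overlap_fraction(pitch_factor, mu=2.0):
    """Share of each ST-signal that overlaps the next one.

    The window spans mu original periods, while consecutive synthesis
    marks are one original period divided by pitch_factor apart.
    """
    return max(0.0, 1.0 - 1.0 / (mu * pitch_factor))
```

Under these assumptions `overlap_fraction(1.0)` gives 0.5, `overlap_fraction(0.5)` gives 0.0 and `overlap_fraction(2.0)` gives 0.75, in line with the rows of the table for a window twice as wide as the pitch period.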
Mapping
A delicate part of the algorithm is the mapping between the analysis and synthesis time axes. If duration and pitch period are both to be scaled by the same factor $a$, a simple "stretching" of the signal suffices: the synthetic pitch period becomes $d_q = a\, d_m$. The total duration is then $a$ times the original one and, as the number of ST-signals has not changed, the pitch period is automatically scaled by the same factor $a$ (that is, the fundamental frequency by $1/a$). If duration and pitch must be altered by different factors, a different number of elementary segments is required. In this case, the PSOLA method makes use of a very simple elimination-repetition technique. If only pitch is to be modified, the synthesis time axis keeps the same duration, but the local pitch period must be scaled, which changes the total number of ST-signals. If only duration is to be modified, ST-signals must be added (or suppressed) without altering the distance between adjacent pitch marks. If different scale factors are needed, the duration is scaled first and the pitch is scaled afterwards.
In figure 1 a mapping is shown, in which both pitch and duration are
modified (by different factors), and in which the total number of ST-signals passes from M
to Q. 
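The elimination-repetition idea can be sketched in Python as below. This is one common way of building a synthesis time axis, shown under stated assumptions rather than as the program used here: the scale factors are constant over the sequence, `duration_factor` greater than 1 lengthens the signal, `pitch_factor` greater than 1 raises the fundamental frequency, and each synthesis mark is given the analysis ST-signal whose mark is nearest in source time.

```python
def build_synthesis_axis(marks, duration_factor=1.0, pitch_factor=1.0):
    """Return (synthesis marks, index of the analysis ST-signal used at each)."""
    periods = [b - a for a, b in zip(marks, marks[1:])]
    synth_marks, mapping = [], []
    t = marks[0] * duration_factor
    end = marks[-1] * duration_factor
    while t <= end:
        # Position of this synthesis mark on the analysis time axis.
        source_time = t / duration_factor
        m = min(range(len(marks)), key=lambda i: abs(marks[i] - source_time))
        synth_marks.append(round(t))
        mapping.append(m)              # repeated or skipped indices appear here
        d_m = periods[min(m, len(periods) - 1)]
        t += d_m / pitch_factor        # scaled local pitch period
    return synth_marks, mapping
```

For analysis marks `[0, 10, 20, 30, 40]`, doubling the duration repeats ST-signals (mapping `[0, 0, 1, 1, 2, 2, 3, 3, 4]`), while halving it eliminates some of them (mapping `[0, 2, 4]`); in both cases the spacing between synthesis marks, and hence the pitch, is unchanged.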
Using a pitch-synchronous method requires every vocal signal file to be accompanied by a pitch marks file, which specifies where and how to extract the ST-signals (for the pitch mark detection method used here, see Appendix B).
Pitch mark points must be placed at the maxima of the instantaneous energy of the speech signal, so as to best preserve the part of the ST-signal which is least influenced by the neighbouring periods. In this way, the distortions due to the overlap operation are minimized.
In this work we consider that each sequence has a specific prosody and short-time spectrum envelope. With PSOLA, the two components can be separated. Therefore, it is possible to create synthetic sequences having the prosody of one sequence and the spectral features of a different one. Perceptual tests should evaluate the resulting perceived vocal effort and should point out the relative importance of prosody and spectrum. Each vowel (source) has been associated with the same vowel, uttered by the same speaker at a different voice level (target). The transformation between source and target is made in 3 different ways: by prosodic modifications, by spectral modifications, or by both. Thus, for each pair, it is possible to listen to 3 synthetic sequences.
Prosodic modifications
In this case the source sequence has to change its "rhythm": the synthetic sequence must have the spectrum of the source and the prosody of the target, which means that it must have the local waveform of the source with the pitch marks of the target sequence. This aim can be achieved by PSOLA.
In the analysis, every ST-signal is extracted from the original sequence by means of a Hanning window (with proportionality factor $\mu = 2$).
A new mapping is required between the analysis time axis and the synthesis one, i.e. between the source and target pitch marks. The mapping is executed by a specific program which calculates, for every pitch mark of the source sequence and of the target, its position relative to the duration of the entire sequence, and establishes the correspondences between the pitch mark points of the source and those of the target. Another algorithm implements the equalization of the overall energy: the global energy is calculated for each ST-signal, and a normalizing factor ($r_q$) is derived from it.
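A possible reading of this mapping is sketched below. The text does not specify how correspondences are chosen once the relative positions are known, so the nearest-neighbour rule and the function name `map_by_relative_position` are assumptions of this sketch.

```python
def map_by_relative_position(source_marks, target_marks):
    """For each target mark, return the index of the source mark whose
    relative position within its own sequence is the closest."""
    def relative(marks):
        span = marks[-1] - marks[0]
        return [(t - marks[0]) / span for t in marks]

    src_rel = relative(source_marks)
    mapping = []
    for r in relative(target_marks):
        m = min(range(len(src_rel)), key=lambda i: abs(src_rel[i] - r))
        mapping.append(m)
    return mapping
```

For example, a source with marks `[0, 10, 20, 30, 40]` matched to a longer target with marks `[0, 8, 16, 24, 32, 40, 48]` yields `[0, 1, 1, 2, 3, 3, 4]`: two source ST-signals are used twice so that the seven target positions are all filled.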
In the synthesis phase, the ST-signal $x_m(n)$ extracted from the source sequence is centered at the corresponding mark point ($t_q$) of the target sequence, after being normalized by the factor $r_q$ for energy equalization.
$$r_q = \frac{\sum_n |X_q^{target}(n)|^2}{\sum_i |X_m^{source}(i)|^2}, \qquad i = 1, \ldots, M, \quad n = 1, \ldots, Q$$
where $\sum_n |X_q^{target}(n)|^2$ is the energy of the $q$th synthetic ST-signal and $\sum_i |X_m^{source}(i)|^2$ is the energy of the $m$th analysis ST-signal which corresponds to it according to the mapping.
It should be noted that this factor represents the $Y(\cdot)$ transformation of eq. 1, that is:
$$x_q(n) = Y\big(x_m(n)\big) = r_q\, x_m(n)$$
By the above transformation, the original waveform is preserved, but
prosody has changed.
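The energy equalization can be sketched as follows. The function names are hypothetical. The normalizing factor is written in the text as a ratio of energies; for the scaled ST-signal to carry exactly the target energy, the gain applied to the samples has to be the square root of that ratio, which is what this sketch assumes.

```python
import math

def energy(segment):
    """Sum of squared samples of an ST-signal."""
    return sum(v * v for v in segment)

def energy_ratio(target_segment, source_segment):
    """Target energy divided by source energy."""
    return energy(target_segment) / energy(source_segment)

def equalize(source_segment, target_segment):
    """Scale the source ST-signal so that its energy equals the target's."""
    if energy(source_segment) == 0.0:
        return list(source_segment)          # nothing to scale
    # Assumption of this sketch: amplitude gain = sqrt(energy ratio).
    gain = math.sqrt(energy_ratio(target_segment, source_segment))
    return [gain * v for v in source_segment]
```

The waveform of the source period is left untouched apart from this single gain, so the local spectrum shape is preserved while the energy contour follows the target.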
Spectral modifications
These modifications are achieved in the same way as prosodic
modifications. In this case a one-to-one mapping is used, in order to preserve prosody. A
specific program computes the average overall spectrum (see Appendix A) for
source and target, and the ratio between these spectra represents the transfer function of
the filter which transforms the source signal into the target one.
$$Z(\omega) = \frac{|X_{target}(\omega)|}{|X_{changed}(\omega)|}$$
It must be observed that this ratio is calculated after normalization, in order to discard the intrinsic gain factor caused by energy differences between source and target, which are of no interest in this case. In fact, $X_{changed}(\omega)$ is the spectrum of the source sequence after normalization, i.e. $|X_{changed}(\omega)| = |X_{source}(\omega)|\, r$, where $r$ is the average normalization factor.
Each ST-signal is then filtered by this transfer function and put back in its original position (as the mapping is one-to-one).
In formulas, this filtering represents, once again, the $Y(\cdot)$ function of eq. 1, i.e.:
$$x_q(n) = Y\big(x_m(n)\big) = z(n) * x_m(n)$$
where the symbol $*$ stands for the convolution operation and $z(n)$ is the inverse Fourier transform of $Z(\omega)$.
It should be noted that the synthetic ST-signal is computed by multiplying the amplitude spectra, so the phase is not involved.
In this way the overall amplitude spectrum has changed (it matches that of the target, up to an amplification factor) while the prosody has remained unchanged.
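The spectral modification can be sketched with a plain discrete Fourier transform. This is an illustrative sketch, not the program used in this work: the function names are hypothetical, a direct (slow) DFT is used to stay self-contained, a small `floor` is assumed in order to avoid division by zero, and the filtering is done by multiplication in the frequency domain, which corresponds to a circular convolution over the frame.

```python
import cmath

def dft(x):
    """Direct discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n)
                for f in range(n)).real / n
            for k in range(n)]

def transfer_function(target_amp, source_amp, r=1.0, floor=1e-12):
    """Ratio of the target amplitude spectrum to the normalized source one."""
    return [t / max(r * s, floor) for t, s in zip(target_amp, source_amp)]

def filter_st_signal(segment, z):
    """Scale the amplitude spectrum of an ST-signal by z; the phase is kept."""
    spectrum = dft(segment)
    return idft([z_f * s_f for z_f, s_f in zip(z, spectrum)])
```

Because the gains in `z` are real and non-negative, each spectral component keeps its phase and only its amplitude changes, which matches the remark that the phase is not involved.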
Changing both prosody and spectrum
Both modifications (prosody and spectrum) can be obtained in two steps: first a prosodic modification is carried out, then a spectral modification is added. Synthetic sequences obtained in this way do sound fairly close to the target, even if in many cases significant differences are still perceived, suggesting that some factors remain to be considered. The most important might be the signal-to-noise ratio, since discrepancies appear when manipulating soft voices which present breathiness in their final part.
Final results
In TABLE II it is possible to listen to all the different synthetic
sequences obtained for the vowel AA of the male speaker MB, for the transition from normal
level to soft level. In TABLE III it is possible to listen to other results obtained for
other speakers.
| SOURCE - Speaker MB. Vowel AA. Normal level | |
| Spectral modifications | |
| Prosodic modifications | |
| Both modifications | |
| TARGET - Speaker MB. Vowel AA. Soft level |
The first results show that prosodic modifications are the main factors influencing vocal effort. Spectral modifications are related principally to the "colour" and the "shades" of an utterance, and they seem to be much less significant than the prosodic ones. Sequences which have been modified only spectrally generally do not lead to perceptible differences in vocal effort, though in some cases they may introduce important variations (for instance, the filtered sequence for MLAA.N-P).
| Speaker | Vowel | SOURCE | Spectral modifications | Prosodic modifications | Completely modified | TARGET |
| AB | AA | Level N | N-L | N-L | N-L | Level L |
| ML | II | Level L | L-P | L-P | L-P | Level P |
| ML | II | Level P | P-L | P-L | P-L | Level L |
| ML | AA | Level N | N-P | N-P | N-P | Level P |
Discussion
Although this work focuses especially on vocal effort, the method presented is very general, and it can be profitably employed to analyze other features.
The method establishes a correspondence, and a particular transformation, between any two vocal sequences which are not too different from one another. Therefore, it would be possible to choose other pairs, corresponding to different vowels, uttered by different speakers at different voice levels. However, in such a case, the many differences between the two sequences could prevent any significant result in evaluating the parameters playing a role in this transformation. These parameters are too numerous, and the interactions among the different factors involved in the determination of vowel, speaker and vocal effort are too complicated.
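This is the reason why, in this work, only step-by-step transformations are carried out.
Accordingly, only three kinds of pairs have been related: the same vowel uttered by the same speaker at different voice levels; different vowels uttered by the same speaker; and the same vowel uttered at the same voice level by different speakers.
In the first part of this article, only the first category has been examined; it accounts for the evaluation of vocal effort.
However, the two other associations also led to interesting results.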
Changing vowel
Relating sequences corresponding to different vowels leads to an evaluation of the factors involved in the identity of vowels.
The literature reports that "the formant frequency and the fundamental frequency are crucial factors in identifying a vowel" [Di Benedetto and Liénard, 1992], and that "it is demonstrated that the tonotopic distance between the first formant position and the fundamental frequency carries most of the information on vowel openness" [Traunmüller, 1985].
In this test, only spectral modifications have been perceived (in informal perceptual tests) as significant. It should be stressed, however, that the prosodic modifications did not involve great changes in F0 (less than 40 Hz).
| Speaker | Source vowel | Prosodic modification | Filtered vowel | Target |
| MB | AA | AA-II | AA-II | II |
| MB | AA | AA-OU | AA-OU | OU |
| AM | AA | AA-II | AA-II | II |
| AM | AA | AA-OU | AA-OU | OU |
Vowels obtained by prosodic modifications do not sound natural. Note that the identity of the vowel does not change much from token to token. Filtering has the most significant effect. The synthetic II and OU sound almost natural in the case of speaker MB, but worse in the case of speaker AM, whose vowels have much higher pitch values and much larger pitch differences between the original vowels (which distorts the difference F1-F0 too much).
Changing speaker
Relating sequences corresponding to the same vowel, uttered at the same voice level by different speakers, should lead to considerations about the importance of prosody and spectral features in the determination of speaker identity. Yet this topic is beyond the scope of the present work. In this case the results were not satisfactory.
Relating sequences from voices whose pitch is not too different results in natural-sounding utterances only if the change in intonation and prosody remains within the intra-speaker variability typical of everyday life. When pitch differences are large (typically when mixing male and female voices), the sound quality is not natural at all, because of the intrinsic limits in the range of application of PSOLA.
The PSOLA method, in fact, simulates the natural concatenation between successive elementary segments by overlapping and adding them, but the quality of the synthetic sequences is not natural if the overlap is too heavy (about 80%). Raising the pitch too much may thus cause distortions and alter the perception of the speaker's gender. Similarly, lowering the pitch below a given range leads to side effects (such as creak).
This statement is illustrated by manipulating the vowel AA uttered by the female speaker AB (sequence ABAAN, average pitch: 210 Hz) and the same vowel uttered by the male speaker JP (sequence JPAAN, average pitch: 85 Hz).
Giving JP the prosody of AB (sequence mix.JP-AB in tables V and VI) results in a distorted sound, in which the perception of the speaker's gender is no longer clear.
The opposite experiment (giving AB the prosody of JP, sequence mix.AB-JP in tables V and VI) produces holes all along the sequence (see figure in table VI). There is no alteration of the speaker's gender (as could have been expected from the first experiment), but the sound presents the characteristics of a female "creaky voice" (see [Klatt and Klatt, 1990]).
| MIX.JP-AB | MIX.AB-JP |
In the present paper a method for matching vowel signals and transforming one into the other is described and demonstrated. The transformation can be decomposed into a prosodic transformation, which may affect the total duration, local duration, or pitch evolution of the source sequence, and a spectral transformation, which may affect the frequency scale and the amplitude of the spectrum. Using this tool we could produce, in some selected cases, an almost perfect voice or vocal effort transformation. As the transformation was made in several steps, we could observe that prosodic modifications were perceptually more important than spectral modifications with respect to voice quality and vocal effort. In some cases we could not obtain a good quality of the resulting sequence. This may be attributed to some limitations of the PSOLA method, which is sensitive to the accuracy of the pitch marks, and which cannot compensate for too large gaps between the sequences to process. Thus the method looks adequate for matching and transforming sequences which are close enough. It should be further investigated in order to process sequences produced by speakers of different vocal genders (male, female, child), or representing vowels distant from each other.
Appendix A: The short term spectrum
According to Moulines and Charpentier, "the spectral envelope of the synthetic signal is identical to the short term spectrum X(w) of the original signal". In their work, they dealt with synthetic signals obtained by the repetition of a unique prototype ST-signal by means of the PSOLA method. In the present case, ST-signals are no longer equal, but a visual inspection of the isolated periods (extracted by a Hanning window) revealed that the amplitude spectra are quite stable in the central part of the sequence (i.e. approximately between 20% and 60% of the entire length of the sequence). Therefore it has been possible to extract a mean amplitude spectrum, averaged over the central periods, and to consider it as the envelope amplitude spectrum of the overall signal.
Each spectrum, thus, is calculated by the following steps:
The short-term amplitude spectrum computed in this way is considered to represent the amplitude spectrum of the entire sequence.
Point 3 deserves more detail: each period is extracted by means of a Hanning window, then a window of FFT_SIZE points is superimposed. If the ST-signal is shorter than the window, it is zero-padded. If, on the contrary, it is longer, it is simply cut symmetrically at the edges of the window. FFT_SIZE equals 256 for female voices (corresponding to an average length of 16 msec, thus "covering" voices with pitch higher than 125 Hz) and 512 for male voices.
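The fitting of each period to FFT_SIZE points and the averaging over the central periods can be sketched as follows. The names are hypothetical, the zero padding is assumed to be appended at the end of the frame, the 20% and 60% bounds are taken from the description above, and a direct DFT replaces the FFT so that the sketch stays self-contained.

```python
import cmath

def dft(x):
    """Direct discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]

def fit_to_fft_size(segment, fft_size):
    """Zero-pad a short ST-signal, or crop a long one symmetrically."""
    n = len(segment)
    if n < fft_size:
        return list(segment) + [0.0] * (fft_size - n)   # padding at the end
    start = (n - fft_size) // 2
    return list(segment[start:start + fft_size])

def average_amplitude_spectrum(st_signals, fft_size, start=0.2, stop=0.6):
    """Mean amplitude spectrum over the central ST-signals of a sequence."""
    first = int(start * len(st_signals))
    last = max(first + 1, int(stop * len(st_signals)))
    central = st_signals[first:last]
    total = [0.0] * fft_size
    for segment in central:
        frame = fit_to_fft_size(segment, fft_size)
        for f, value in enumerate(dft(frame)):
            total[f] += abs(value)
    return [t / len(central) for t in total]
```

Where the zeros are placed does not matter for this purpose: moving the padding only shifts the frame circularly, which changes the phase of the transform but not its amplitude.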
This procedure is an extreme simplification that does not involve the fundamental frequency. Nevertheless, a peak still remains in the 0-500 Hz region (visible in figures 2 and 3), due to the periodicity of the structure. This fact has been reported by Moulines and Charpentier: "the spectral envelope signal critically depends on the spectral resolution of the analysis window" [Moulines and Charpentier, 1990].
Appendix B: the pitch mark file
To apply a pitch-synchronous method, information about the evolution of the pitch is needed. In the present work, a pitch mark detection program has been used, written in C by C. D'Alessandro (LIMSI-CNRS, http://www.limsi.fr). It consists of a cascade of different filters that produces a final file containing the pitch marks.
In particular, the fundamental frequency is detected according to an algorithm ([Hermes, 1987]) based on the correlation with a comb function (a method proposed by Philip Martin in 1982 [Martin, 1982]).
Some modifications have been made, as the original programs had been written to fit phrases uttered by male speakers and recorded at a lower level.
The pitch mark files produced by the processing chain are the analysis and synthesis time axes described in the paragraph about mapping.
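The principle of correlating the spectrum with a comb can be illustrated by the heavily simplified sketch below. It is not the C program mentioned above and does not reproduce the cited algorithms in detail: the function names, the candidate grid, the number of comb teeth and the decay value of 0.84 applied to successive teeth are all assumptions made for illustration.

```python
import cmath

def amplitude_at(x, freq, fs):
    """Amplitude of the Fourier transform of x at one frequency (in Hz)."""
    return abs(sum(v * cmath.exp(-2j * cmath.pi * freq * n / fs)
                   for n, v in enumerate(x)))

def comb_pitch(x, fs, candidates, harmonics=5, decay=0.84):
    """Return the candidate F0 whose harmonic comb best matches the spectrum."""
    best_f0, best_score = None, -1.0
    for f0 in candidates:
        # Weighted sum of spectral amplitudes at the teeth of the comb;
        # decaying weights penalize candidates that are sub-multiples of F0.
        score = sum(decay ** (h - 1) * amplitude_at(x, h * f0, fs)
                    for h in range(1, harmonics + 1) if h * f0 < fs / 2)
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0
```

Once the fundamental frequency is known frame by frame, pitch marks can be searched for at intervals close to the corresponding period, which is the information the pitch marks file has to provide.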
Acknowledgements
This study has been carried out entirely at LIMSI-CNRS (Paris, France) during a five-month internship.
References
Calliope (1989), "La parole et son traitement automatique", Masson Paris, 1989.
F.Charpentier and E.Moulines (1988), "Text-to-speech algorithm based on FFT-synthesis ", CH2561-9/88/0000, IEEE, S14.4, 1988, pp.667-670.
D.G.Childers, K.Wu, D.M.Hicks and B.Yegnanarayana (1989), "Voice Conversion", Speech Communication, 8, pp. 147-158.
M.G. Di Benedetto and J.S.Liénard (1992), "Extrinsic normalization of vowel formant based on cardinal vowels mapping", ICSLP, BANFF, October 1992.
M.G. Di Benedetto and J.S.Liénard (1994), "Influence of the vocal effort on vowels", 127th meeting of the Acoustical Society of America, Cambridge, Massachusetts, June 1994.
I.Fonagy and J.Fonagy(1966), "Sound pressure level and duration", Phonetica 15, pp.14-21.
S.Handel (1989), "Listening", Massachusetts Institute of Technology, 1989.
Dik.J.Hermes (1987), "Measurement of pitch by subharmonic summation", J.Acoustical Society of America, Vol.83, N°1, January 1988, pp.257-263.
D.H.Klatt and L.Klatt (1990), "Analysis, synthesis and perception of voice quality variations among female and male talkers", Journal Acoustical Society of America, Vol.87 , No 2, February 1990, pp. 820-856.
J.S. Liénard (1977), "Les processus de la communication parlée", Masson Paris, 1977.
J.S.Liénard and M.G.Di Benedetto (1992), "Evaluation perceptive d'un corpus de voyelles françaises émises isolément par plusieurs locuteurs selon différentes forces de voix", 19èmes Journées d'Etude sur la Parole - Bruxelles, May 1992, pp. 469-474.
J.S.Liénard (1997), "Perception et variabilité de la parole", Fondements et perspectives en traitement automatique de la parole, ed. Meloni, AUPELF, 1996.
J.S.Liénard and M.G.Di Benedetto (1999), "Effect of vocal effort on spectral properties of vowels", J.Acoustical Society of America., Vol. 106, No 1, July 1999.
Ph.Martin (1981), "Extraction de la fréquence fondamentale par intercorrélation avec une fonction peigne", 12èmes Journées d'études sur la Parole - Montreal, May 1981, pp.221-232.
Ph.Martin (1982), "Comparison of pitch detection by cepstrum and spectral comb analysis", Proc. IEEE Int. Conf. Acoustical Speech Signal Process. ICASSP-82, pp.180-183
E.Moulines and F.Charpentier (1990), "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones", Elsevier Science Publishers B.V. (North Holland), Speech Communication, Vol.9, Nos. 5/6, December 1990, pp.453-467.
W.H.Press, B.P.Flannery (1990), "Numerical Recipes in C (The art of Scientific Computing)", Cambridge University Press, chapter 12, pp.398-412.
L.R.Rabiner and R.W.Schafer (1978), "Digital processing of speech signals", Prentice-Hall, Signal processing series, Inc., Alan Oppenheim editor 1987.
R.Schulmann (1985), "Articulatory targeting and perceptual constancy of loud speech", French-Swedish Seminary, ICP, Grenoble, 1985.
R.Schulmann (1988), "Articulatory dynamics of loud and normal speech", J.Acoustical Society of America, Vol.85, No 1 , January 1988, pp.295-312.
M.Slaney, M.Covell and B.Lassiter (1996), "Automatic audio morphing", IEEE-ICASSP, pp. 1001-1004.
H. Traunmüller (1981), "Perceptual dimension of openness in vocal effort", J.Acoustical Society of America, Vol.59, No 5 , May 1981, pp.1465-1474.
H.Traunmüller (1985), "The role of the fundamental and the higher formants in the perception of speaker size, vocal effort and vowel openness", French-Swedish seminary, ICP, Grénoble, 1985.
H.Valbret, E.Moulines et J.P.Tubach (1992), "Voice transformation using PSOLA technique", Speech Communication, 11, n° 2/3, June 1992, pp. 175-187.