AVSpeechSynthesizer

Though we’re a long way off from
Hal or
Her,
we shouldn’t forget about the billions of people out there for us to talk to.

Of the thousands of languages in existence,
an individual is fortunate to gain a command of just a few within their lifetime.
And yet,
over several millennia of human co-existence,
civilization has managed to make things work (more or less)
through an ad-hoc network of
interpreters, translators, scholars,
and children raised in the mixed linguistic traditions of their parents.
We’ve seen that mutual understanding fosters peace
and that conversely,
mutual unintelligibility destabilizes human relations.

It’s fitting that the development of computational linguistics
should coincide with the emergence of the international community we have today.
Working towards mutual understanding,
intergovernmental organizations like the United Nations and European Union
have produced a substantial corpus of
parallel texts,
which form the foundation of modern language translation technologies.

Computer-assisted communication
between speakers of different languages consists of three tasks:
transcribing the spoken words into text,
translating the text into the target language,
and synthesizing speech for the translated text.

This article focuses on how iOS handles the last of these: speech synthesis.


Introduced in iOS 7 and available in macOS 10.14 Mojave,
AVSpeechSynthesizer produces speech from text.

To use it,
create an AVSpeechUtterance object with the text to be spoken
and pass it to the speakUtterance(_:) method:


import AVFoundation
        let string = "Hello, World!"
        let utterance = AVSpeechUtterance(string: string)
        let synthesizer = AVSpeechSynthesizer()
        synthesizer.speakUtterance(utterance)
        
NSString *string = @"Hello, World!";
        AVSpeechUtterance *utterance = [[AVSpeechUtterance alloc] initWithString:string];
        utterance.voice = [AVSpeechSynthesisVoice voiceWithLanguage:@"en-US"];
        AVSpeechSynthesizer *synthesizer = [[AVSpeechSynthesizer alloc] init];
        [synthesizer speakUtterance:utterance];
        

You can use the adjust the volume, pitch, and rate of speech
by configuring the corresponding properties on the AVSpeechUtterance object.

When speaking,
a synthesizer can be paused on the next word boundary,
which makes for a less jarring user experience than stopping mid-vowel.


synthesizer.pauseSpeakingAtBoundary(.word)
        
[synthesizer pauseSpeakingAtBoundary:AVSpeechBoundaryWord];
        

Supported Languages

Mac OS 9 users will no doubt have fond memories of the old system voices —
especially the novelty ones, like
Bubbles, Cellos, Pipe Organ, and Bad News.

In the spirit of quality over quantity,
each language is provided a voice for each major locale region.
So instead of asking for “Fred” or “Markus”,
AVSpeechSynthesisVoice asks for en-US or de-DE.

VoiceOver supports over 30 different languages.
For an up-to-date list of what’s available,
call AVSpeechSynthesisVoice class method speechVoices()
or check this support article.

By default,
AVSpeechSynthesizer will speak using a voice
based on the user’s current language preferences.
To avoid sounding like a
stereotypical American in Paris,
set an explicit language by selecting a AVSpeechSynthesisVoice.


let string = "Bonjour!"
        let utterance = AVSpeechUtterance(string: string)
        utterance.voice = AVSpeechSynthesisVoice(language: "fr")
        
NSString *string = @"Bonjour!";
        AVSpeechUtterance *utterance = [[AVSpeechUtterance alloc] initWithString:string];
        utterance.voice = [AVSpeechSynthesisVoice voiceWithLanguage:@"fr-FR"];
        

Many APIs in foundation and other system frameworks
use ISO 681 codes to identify languages.
AVSpeechSynthesisVoice, however, takes an
IETF Language Tag,
as specified BCP 47 Document Series.
If an utterance string and voice aren’t in the same language,
speech synthesis fails.

Not all languages are preloaded on the device,
and may have to be downloaded in the background
before speech can be synthesized.

Customizing Pronunciation

A few years after it first debuted on iOS,
AVUtterance added functionality to control
the pronunciation of particular words,
which is especially helpful for proper names.

To take advantage of it,
construct an utterance using init(attributedString:)
instead of init(string:).
The initializer scans through the attributed string
for any values associated with the AVSpeechSynthesisIPANotationAttribute,
and adjusts pronunciation accordingly.

import AVFoundation
        let text = "It's pronounced 'tomato'"
        let mutableAttributedString = NSMutableAttributedString(string: text)
        let range = NSString(string: text).range(of: "tomato")
        let pronunciationKey = NSAttributedString.Key(rawValue: AVSpeechSynthesisIPANotationAttribute)
        // en-US pronunciation is /tə.ˈme͡ɪ.do͡ʊ/
        mutableAttributedString.setAttributes([pronunciationKey: "tə.ˈme͡ɪ.do͡ʊ"], range: range)
        let utterance = AVSpeechUtterance(attributedString: mutableAttributedString)
        // en-GB pronunciation is /tə.ˈmɑ.to͡ʊ/... but too bad!
        utterance.voice = AVSpeechSynthesisVoice(language: "en-GB")
        let synthesizer = AVSpeechSynthesizer()
        synthesizer.speak(utterance)
        

Beautiful. ?

Of course, this property is undocumented
at the time of writing,
so you wouldn’t know that the IPA you get from Wikipedia
won’t work correctly unless you watched
this session from WWDC 2018.

To get IPA notation that AVSpeechUtterance can understand,
you can open the Settings app,
navigate to Accessibility > VoiceOver > Speech > Pronunciations,
and… say it yourself!

Speech Pronunciation Replacement

Hooking Into Speech Events

One of the coolest features of AVSpeechSynthesizer
is how it lets developers hook into speech events.
An object conforming to AVSpeechSynthesizerDelegate can be called
when a speech synthesizer
starts or finishes,
pauses or continues,
and as each range of the utterance is spoken.

For example, an app —
in addition to synthesizing a voice utterance —
could show that utterance in a label,
and highlight the word currently being spoken:


var utteranceLabel: UILabel!
        // MARK: AVSpeechSynthesizerDelegate
        override func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
        willSpeakRangeOfSpeechString characterRange: NSRange,
        utterance: AVSpeechUtterance)
        {
        self.utterranceLabel.attributedText =
        attributedString(from: utterance.speechString,
        highlighting: characterRange)
        }
        
#pragma mark - AVSpeechSynthesizerDelegate
        
        - (void)speechSynthesizer:(AVSpeechSynthesizer *)synthesizer
        willSpeakRangeOfSpeechString:(NSRange)characterRange
        utterance:(AVSpeechUtterance *)utterance
        {
        NSMutableAttributedString *mutableAttributedString = [[NSMutableAttributedString alloc] initWithString:utterance.speechString];
        [mutableAttributedString addAttribute:NSForegroundColorAttributeName
        value:[UIColor redColor] range:characterRange];
        self.utteranceLabel.attributedText = mutableAttributedString;
        }
        

AVSpeechSynthesizer Example

Check out this Playground
for an example of live text-highlighting for all of the supported languages.


Anyone who travels to an unfamiliar place
returns with a profound understanding of what it means to communicate.
It’s totally different from how one is taught a language in High School:
instead of genders and cases,
it’s about emotions
and patience
and clinging onto every shred of understanding.
One is astounded by the extent to which two humans
can communicate with hand gestures and facial expressions.
One is also humbled by how frustrating it can be when pantomiming breaks down.

In our modern age, we have the opportunity to go out in a world
augmented by a collective computational infrastructure.
Armed with AVSpeechSynthesizer
and myriad other linguistic technologies on our devices,
we’ve never been more capable of breaking down the forces
that most divide our species.

If that isn’t universe-denting,
then I don’t know what is.

Source: NSHipster