Thursday, February 28, 2019

SSML: Google Text-to-Speech vs Amazon Alexa

After some work on Alexa flash briefing skills, and after testing how speech containing foreign words gets read, I was led down the path of using text-to-speech with SSML to create MP3s that approximate what I had originally wanted.  Below are some of my findings on the differences between Amazon Alexa/Polly and Google Text-to-Speech.


Amazon Alexa/Polly

  • Prosody tags on individual words work fine and don't create delays.
  • Limited phoneme support, but at least it has some.
  • Pronunciation of foreign-sounding words is not as good by default.  Changing the spelling of a word to coax out a better pronunciation isn't as successful either.
  • Limited voices
  • It has the W tag, which lets you indicate which usage of a word is meant when the same spelling covers multiple meanings or parts of speech.
  • Prosody pitch changes seem to sound better
  • Better delays at punctuation
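
As a rough illustration of the W tag and phoneme notes above, here is a minimal Python sketch that builds a small Alexa-style SSML string. The sentence, the function name `build_alexa_ssml`, and the IPA string are my own assumptions chosen for illustration; the `amazon:VBD` role and the `phoneme` attributes follow the Alexa SSML reference as I understand it, so check the current docs before relying on them.

```python
import xml.etree.ElementTree as ET


def build_alexa_ssml():
    """Build an SSML string using <w> and <phoneme> (illustrative only)."""
    speak = ET.Element("speak")
    speak.text = "I "

    # <w role="amazon:VBD"> asks for the past-tense reading of "read".
    w = ET.SubElement(speak, "w", {"role": "amazon:VBD"})
    w.text = "read"
    w.tail = " the book, then ate a "

    # <phoneme> supplies an explicit pronunciation (assumed IPA string).
    ph = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": "pɪˈkɑːn"})
    ph.text = "pecan"
    ph.tail = " pie."

    # encoding="unicode" returns a str and keeps the IPA characters as-is.
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_alexa_ssml())
```

Building the markup with an XML library rather than by string concatenation keeps the tags balanced, which matters because a single malformed tag can affect how the rest of the speech is read.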


Google Text-to-Speech

  • Better pronunciation by default
  • Wide range of voices
  • Using prosody tags between words creates excessive pauses.
  • Make sure the closing prosody tag comes after the punctuation, otherwise the punctuation gets pronounced.
  • Volume changes in prosody can work in unexpected ways.  An increase can lead to lower volume.
  • A malformed tag at some point in the speech will lead to tags being ignored elsewhere.
  • Short non-English words get read as acronyms when you don't want them to be.
  • Words are occasionally read as the wrong part of speech.  For example, "bow" used as a verb was read as the noun, even though the sentence should have made it obvious to any analysis engine that it was a verb.  There is no way to tag the part of speech or give a pronunciation.
  • You can play around with misspelling words to get them pronounced more accurately if the default spelling doesn't work.  This won't always work, and it can add delays mid-word if you need dashes or apostrophes to create syllables.
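
Two of the prosody notes above (keep the closing tag after the punctuation, and watch for malformed tags) are easy to automate. The following is a minimal Python sketch, not anything either service provides: `wrap_prosody` and `is_well_formed` are hypothetical helper names, and the well-formedness check only catches XML errors, not SSML-specific ones.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def wrap_prosody(sentence, target, **attrs):
    """Wrap the first occurrence of `target`, plus any punctuation that
    directly follows it, in a <prosody> element."""
    match = re.search(r"\b" + re.escape(target) + r"\b[.,;:!?]*", sentence)
    if match is None:
        return escape(sentence)
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in attrs.items())
    # The trailing punctuation stays inside the element on purpose.
    wrapped = f"<prosody{attr_text}>{escape(match.group(0))}</prosody>"
    return escape(sentence[: match.start()]) + wrapped + escape(sentence[match.end():])


def is_well_formed(ssml):
    """Return True if the SSML parses as XML (a cheap pre-flight check)."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    ssml = "<speak>" + wrap_prosody("Take a bow, then leave.", "bow", rate="slow") + "</speak>"
    print(ssml)
    print(is_well_formed(ssml))
```

Running the check before sending text to the speech service makes it easier to tell a tag that was silently ignored from one that was simply broken.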
