Microsoft Teases Advanced Speech Tech, But It's Not for Everyone


SEATTLE—Are you thinking about creating a speech bot-driven app for your business? Some of the guidelines around bot creation—as outlined at Microsoft Build by noted Swedish entrepreneur, podcaster, and Windows Platform Development MVP Jessica Engstrom—are common sense. For example, don't build a voice bot just because it's cool new technology, and make sure it fits your business model.

But there are plenty of scenarios where voice does fit. One argument is that the average person types 40 words per minute but speaks 150. Approximately 3,000 new bots are released per week on the Microsoft platform alone, and 95 percent of smartphone owners have tried a personal assistant.

It's not all smooth sailing, though. Engstrom mentioned Microsoft's own disastrous AI chatbot experiment, Tay, which the company had to pull in less than a day after the internet taught it to be racist. She also pointed to Burger King, which ran a commercial designed to trigger Google Home, only for the device to read out a Wikipedia page saying the Whopper contained cyanide.


When designing a voice assistant, limit the scope of possible answers, Engstrom said: Don't have it ask open-ended questions. Train the assistant to handle the many ways a user might phrase a question or command, and write out a full script of a conversation that makes sense for your bot. Finally, provide audio help that gives examples of the kinds of things a user can say.
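To make those guidelines concrete, here is a minimal Python sketch of a bot that keeps its scope narrow, accepts several phrasings of the same request, and falls back to help text with example phrases. Everything in it is hypothetical: the pizza-ordering scenario, the intent names, and the regular expressions are illustrative assumptions, not anything shown in the talk or part of a Microsoft API. A production bot would more likely rely on a trained language-understanding model than on hand-written patterns.

```python
import re

# Hypothetical intents for an illustrative pizza-ordering bot.
# Each intent lists several patterns so that different phrasings
# of the same request map to one handler.
INTENTS = {
    "order_status": [
        r"\bwhere('s| is) my (order|pizza)\b",
        r"\b(track|check)( on)? my (order|pizza)\b",
        r"\border status\b",
    ],
    "opening_hours": [
        r"\bwhen (do|are) you (open|close)\b",
        r"\b(opening|business) hours\b",
        r"\bare you open\b",
    ],
}

# Help text that gives the user concrete examples of what to say.
HELP_TEXT = 'You can say things like "Where is my order?" or "When do you open?"'


def match_intent(utterance):
    """Return the name of the first intent whose pattern matches, else None."""
    text = utterance.lower().strip()
    for intent, patterns in INTENTS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return intent
    return None


def respond(utterance):
    """Answer within a limited scope; anything else gets the help text."""
    intent = match_intent(utterance)
    if intent == "order_status":
        return "Your order is on its way."
    if intent == "opening_hours":
        return "We are open from 11 a.m. to 10 p.m."
    return HELP_TEXT
```

Calling `respond("Can you check on my order?")` and `respond("Where's my pizza?")` both reach the same order-status answer, while an out-of-scope request such as `respond("What's the weather?")` returns the help text instead of an open-ended question.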

New for Azure Speech Technology

One of the big announcements at the Build keynote was the ability to transcribe multiparty speech in meetings while keeping track of which speaker said what. In a separate session, Aarthy Longino, Principal Program Manager for Speech and Language at Microsoft, showed this working in a custom development interface.
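For developers wondering what "which speaker said what" looks like as data, the following Python sketch shows one plausible way to turn speaker-labeled segments into a readable transcript. It does not call the Azure Speech service; the `Segment` type, its fields, and the sample speaker labels are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical shape of one recognized utterance with a speaker label."""
    speaker: str
    start: float  # seconds from the start of the meeting
    text: str


def format_transcript(segments):
    """Order segments by time and merge consecutive turns by the same speaker."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1][0] == seg.speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1] = (seg.speaker, turns[-1][1] + " " + seg.text)
        else:
            turns.append((seg.speaker, seg.text))
    return [f"{speaker}: {text}" for speaker, text in turns]
```

Given segments from two speakers that arrive out of order, `format_transcript` returns one line per conversational turn, each prefixed with the speaker label.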

At last year's Build, the biggest hit was a meeting "cone" that recognized participants and transcribed what each said. Now that cone, which also sports a 360-degree camera, is being tested by Microsoft customers in private preview. But there are other devices that anyone can get to test the transcription, including the Roobo Smart Audio Dev Kit, which was impressively demoed in the session.

You can find these Cognitive Services Speech Devices at aka.ms/sdsdk-get.

On the other end of speech, and at least as impressive, is text to speech (TTS). Microsoft's Qinying Liao, a Principal Program Manager on Speech Services, showed off the remarkably natural-sounding new Neural Voices. One was so smooth that attendees in the room voted for it over an actual human reader.

Currently, Neural Voices are only available for nine regional English dialects, but Japanese, Spanish, and Portuguese are in the works.

Another new capability is to add emotion to the TTS: a simple keyword in code can make the generated voice sound cheerful or empathetic. That works the other way, too. In fact, Microsoft's transcription technologies for call centers can detect when an interaction starts to go negative. The Speech Services will let businesses customize recognition and TTS using their own terminology in a new Custom Speech Portal. You can read about all the Azure Speech Services at this help page.
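As a rough idea of what that "simple keyword" might look like, Azure's TTS accepts Speech Synthesis Markup Language (SSML), and its documentation describes an `mstts:express-as` element with a `style` attribute for neural voices. The Python sketch below only builds such an SSML string; it does not contact the service. The helper name `build_ssml` is hypothetical, and the default voice name is an assumption, since supported voices and styles vary by voice and change over time. Check the current Speech Services documentation before relying on either.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, style="cheerful", voice="en-US-JennyNeural"):
    """Wrap text in SSML that asks a neural voice for a speaking style.

    The voice name is an assumed example; not every voice supports
    every style.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        'xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, <, > so the markup stays well-formed
        "</mstts:express-as></voice></speak>"
    )
```

Calling `build_ssml("Thanks for calling!", style="cheerful")` yields a single SSML document whose only style-related change is the value of the `style` attribute, which is what makes switching between a cheerful and an empathetic delivery a one-word edit.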

About Our Expert

Michael Muchmore

Contributor
