Voice assistants: So, where are these mysterious voices actually taking us?

Paul Krizsan
30 August 2018

Anyone currently walking past advertising posters for Google’s voice assistants is probably wondering where this voice assistant arms race will lead. Almost every year, the digital companions are equipped with new features and better capabilities. Will we ultimately be left with a homogeneous set of voice assistants that do everything perfectly, or will we face a wild flora of smaller assistants that are highly specialized but generally weaker, albeit in symbiotic relationships with each other? Or will things turn out differently altogether?


It is no secret that established voice assistants like Amazon’s Alexa and Microsoft’s Cortana have unique strengths, but also equally individual weaknesses. While Alexa excels among voice assistants at shopping, entertainment and companionship outside the work environment, Cortana’s strengths lie in organizing daily routines and supporting the user’s productivity.

In May 2018, the manufacturers of the two assistants therefore announced a collaboration: in the future, users should be able not only to address and command Alexa via Amazon Echo, but also to call up Cortana, complete with Cortana’s voice.

This form of symbiosis is supposed to strengthen Alexa in particular, but it is also a sign that Cortana will probably not broaden its presence on the competitive stage. It is more likely that Cortana will focus on deepening the areas it already covers.

Like Alexa and Cortana, the most popular voice assistants, Google Assistant and Siri, have their strengths and weaknesses. The resulting gaps are now being filled by a new generation of voice assistants that can often do only a few things, but do them much better than general-purpose assistants such as those from Google, Apple and others.

Companies like the U.S.-based Soundhound, whose Hound assistant shines especially when it comes to complex questions and commands, are hoping to participate in the market alongside giants like Amazon by licensing their own framework. This allows corporations that would benefit from speech recognition and voice commands to use Soundhound’s technology without spending the resources to develop their own.

Voices and embodiment

For manufacturers of mobile smart devices, which lack a physical manifestation of the assistant, the voice is the primary avatar of its personality. Companies from industries such as smart home and automotive, by contrast, have the opportunity to give the assistant’s personality a visual form. Whether physical or digital, this is referred to as embodiment: lending the assistant a visual design language.

Amazon Echo

Embodiment can take different forms: Amazon can give Alexa broad character traits through the design of the Echo products. As a result, the voice assistant does not come across as exaggeratedly feminine, but rather as neutral and open, educated and likeable.

Amazon Echo, 2nd generation. Source:


A good example of exaggerated embodiment is Jibo. Jibo is a curious and always joyful five-year-old in the cute body of a table lamp. By rotating its three body segments, the fun robot can, among other things, dance and tilt its head questioningly, and it can blink and show other emotions thanks to the eye on its display.

Although Jibo’s functions are limited and not nearly as elaborate as those of its competitors, Jibo wins people over with charm thanks to its physical form.

Jibo. Source:

Nio Nomi

The automotive industry also sees a lot of potential in voice assistants. For many, our four-wheeled companions are already considered family members; you couldn’t ask for a simpler platform. Unlike in smartwatches and smartphones, and not least because cars are long-lived and do not need to be portable, AIs in cars can also take physical form. Much like Jibo, Chinese electric vehicle manufacturer Nio’s AI is intended to be perceived primarily as a social companion. Nomi, as Nio’s AI has been christened, can simulate an astonishing array of human emotions, and awaken them in people, thanks to a display above the car’s center console. It’s true that you feel like Luke Skywalker with a droid in a spaceship, but who can resist those cute eyes?

Nio’s Nomi. Source: Wall Street Journal

A Forecast

However, the biggest technological leaps in the field of voice assistants are still happening with the market leaders. For example, in May 2018, Google unveiled a demo version of Google Assistant that could independently make phone calls to humans with such authenticity that the people on the other end couldn’t tell that the caller was an artificial intelligence. Google Duplex, as the demo version was called, owed its human quality not only to the emulation of human speech, but also to regularly interspersed filler words such as hesitant “ums” and affirming “mm-hmms”.

It is obvious that this involves the deliberate deception of people and that the potential for abuse is alarmingly close at hand. One of the biggest questions in the design of voice assistants in the coming years must therefore be the question of ethics: Do we want to give users the feeling that they are talking to a real human being, or must the machine be recognizable as such?