21st International Conference on Speech and Computer


Hynek Hermansky

Professor, The Johns Hopkins University, USA

If You Can’t Beat Them, Join Them

It is often argued that in the processing of sensory signals such as speech, engineering should apply knowledge of the properties of human perception, since both have the same goal of extracting information from the signal. We show, using examples from speech technology, that perceptual research can also learn from advances in technology. After all, speech evolved to be heard, and the properties of hearing are imprinted on speech. Consequently, engineering optimizations of speech technology often yield human-like processing strategies. Our current focus is on finding support for our model of human speech communication, which suggests that the redundancies introduced in speech production to protect the message during its transmission through a realistic noisy acoustic environment are used by human speech perception for reliable decoding of the message. This leads to a particular architecture of an automatic speech recognition (ASR) system in which longer temporal segments of spectrally-smoothed temporal trajectories of spectral energies in individual frequency bands of speech are used to derive estimates of the posterior probabilities of speech sounds. These estimates from reliable frequency bands are then adaptively fused to yield the final probability vectors that best satisfy the adopted performance-monitoring criteria. Some ASR systems that already use elements of the suggested architecture are mentioned in this paper.
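The adaptive fusion step of the architecture described above can be illustrated with a minimal sketch. The function name, the entropy-based reliability measure, and the example posteriors below are all hypothetical stand-ins; the abstract does not specify the actual performance-monitoring criteria, so a low-entropy (confident) posterior is simply assumed to indicate a reliable band:

```python
import numpy as np

def fuse_band_posteriors(band_posteriors):
    """Adaptively fuse per-band phoneme posterior estimates.

    band_posteriors: array of shape (n_bands, n_classes); each row is a
    posterior vector estimated from the temporal trajectory of spectral
    energy in one frequency band (hypothetical input, for illustration).
    Low entropy is used here as a simple stand-in for the
    performance-monitoring criteria mentioned in the abstract.
    """
    band_posteriors = np.asarray(band_posteriors, dtype=float)
    eps = 1e-12
    # Entropy of each band's posterior: low entropy -> confident stream.
    entropy = -np.sum(band_posteriors * np.log(band_posteriors + eps), axis=1)
    # Convert entropies to fusion weights (confident bands weigh more).
    weights = np.exp(-entropy)
    weights /= weights.sum()
    # Weighted combination, renormalized to a proper probability vector.
    fused = weights @ band_posteriors
    return fused / fused.sum()

# Two confident bands agreeing on class 0, one noisy (near-uniform) band.
p = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.34, 0.33, 0.33]]
print(fuse_band_posteriors(p))
```

In this sketch the near-uniform band receives the smallest weight, so the fused vector is dominated by the two bands that agree, mirroring the idea of combining estimates from reliable frequency bands only.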


Hynek Hermansky (F'01, SM'92. M'83, SM'78) received the Dr. Eng. Degree from the University of Tokyo, and Dipl. Ing. Degree from Brno University of Technology, Czech Republic. He is the Julian S. Smith Professor of Electrical Engineering and the Director of the Center for Language and Speech Processing at the Johns Hopkins University in Baltimore, Maryland. He is also a Research Professor at the Brno University of Technology, Czech Republic. He is a Life Fellow of the Institute of Electrical and Electronic Engineers (IEEE) IEEE, and a Fellow of the International Speech Communication Association (ISCA), was twice an elected Member of the Board of ISCA, a Distinguished Lecturer for ISCA and for IEEE, and is the recipient of the 2013 ISCA Medal for Scientific Achievement. He has been working in speech processing for over 30 years, mainly in acoustic processing for speech recognition.

Vanessa Evers

Professor, University of Twente, the Netherlands


The classic image in the psychology of Human-Robot Interaction is that of a person who is focused and eager to learn how to work with or control a robot. The job of the roboticist, then, is primarily to avoid errors in detection, manipulation, navigation, decision making, planning, and so on, in order to optimize human-robot collaboration.


In this talk I will argue that the social norms embedded in people, in robots, and in the contexts in which robots are used make this approach obsolete. Specifically, I will address the following questions:

– How do people understand robot behaviors?

– What do we know about people and robots collaborating?

– Can a robot understand human social behaviors?

– How does knowledge about human social relationships necessitate a change in our thinking about how humans should be modeled?

– How can the design of robots and their behavior improve acceptance of robots in everyday environments such as our homes, airports, museums, schools, roads, and hospitals?

Through examples of practical deployment of robots, I will explore the fundamentally social relationship people have with autonomous robots and offer essential rules for effective human-robot collaboration.


Vanessa Evers is a full Professor of Human Media Interaction at the University of Twente. Her research focuses on the design and development of Socially Intelligent Agents. This concerns human interaction with autonomous agents such as robots or machine-learning systems, as well as cultural aspects of Human-Computer Interaction. She is best known for her work on social robots such as FROG (fun robotic outdoor guide), SPENCER (the airport service robot), and DE-ENIGMA (a robot for autism education), which can interpret human behavior automatically and respond to people in a socially acceptable way. She is very active in organizing scientific conferences and as an editor of academic journals, is a speaker on AI and robotics at international events such as the World Economic Forum, and is a frequent contributor to newspapers and TV shows.

Odette Scharenborg

Associate Professor, Delft University of Technology, the Netherlands

The representation of speech in the human and artificial brain

Speech recognition is the mapping of a continuous, highly variable speech signal onto discrete, abstract representations. In both human and automatic speech processing, the phoneme is considered to play an important role. Abstractionist theories of human speech processing assume the presence of abstract, phoneme-like units which, sequenced together, constitute words, while virtually all best-performing, large-vocabulary automatic speech recognition (ASR) systems use phoneme acoustic models. There is, however, ample evidence that phonemes might not be the unit of speech representation during human speech processing. Moreover, phoneme-based acoustic models are known not to deal well with the high variability of speech due to, e.g., coarticulation, faster speaking rates, or conversational speech. The question of how speech is represented in the human or artificial brain, although crucial to both the field of human speech processing and the field of automatic speech processing, has historically been investigated in the two fields separately. I will argue that comparisons between humans and deep neural networks (DNNs), and cross-fertilization of the two research fields, can provide valuable insights into the way humans process speech and can improve ASR technology.

Specifically, I will present results of several experiments carried out on both human listeners and DNN-based ASR systems on lexically-guided perceptual learning, i.e., the ability to adapt a sound category on the basis of new incoming information, resulting in improved processing of subsequent information. Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. Human listeners have been found to do this very fast. I will explain how listeners adapt to the speech of new speakers, and I will present the results of a lexically-guided perceptual learning study we carried out on a DNN-based ASR system, similar to the human experiments. In order to investigate the speech representations and adaptation processes in the DNN-based ASR systems, we visualized the activations in the hidden layers of the DNN. These visualizations revealed that the DNNs showed an adaptation of the phoneme categories similar to what is assumed to happen in the human brain. These visualization techniques were also used to investigate what speech representations are inherently learned by a naïve DNN. In this particular study, the input frames were labeled with different linguistic categories: sounds in the same phoneme class, sounds with the same manner of articulation, and sounds with the same place of articulation. The resulting visualizations showed that the DNN appears to learn structures that humans use to understand speech without being explicitly trained to do so.
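The kind of hidden-layer visualization described above can be sketched in a few lines: project high-dimensional activation vectors down to two dimensions so that frames can be plotted and colored by linguistic label (phoneme, manner, or place of articulation). The PCA-via-SVD projection below is a generic sketch under assumed inputs; the studies mentioned in the abstract may use a different visualization pipeline, and the toy "activations" here are synthetic clusters standing in for frames of two phoneme classes:

```python
import numpy as np

def project_activations(activations, n_components=2):
    """Project hidden-layer activation vectors to 2-D via PCA so that
    frames can be plotted and colored by linguistic category.
    A generic sketch, not the talk's actual visualization method.
    """
    X = np.asarray(activations, dtype=float)
    X = X - X.mean(axis=0)  # center the data before PCA
    # SVD yields the principal directions; project onto the top ones.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T

# Toy activations: two synthetic clusters stand in for two phoneme classes.
rng = np.random.default_rng(0)
acts = np.vstack([rng.normal(0, 0.1, (50, 16)),
                  rng.normal(1, 0.1, (50, 16))])
coords = project_activations(acts)
print(coords.shape)  # (100, 2)
```

If the network has learned a category distinction, frames from the two classes separate into visibly distinct regions of the 2-D plot, which is the kind of emergent structure the abstract reports.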


Odette Scharenborg (PhD) is an associate professor and Delft Technology Fellow at the Multimedia Computing Group at Delft University of Technology, the Netherlands. Previously, she was an associate professor at the Centre for Language Studies, Radboud University Nijmegen, the Netherlands, and a research fellow at the Donders Institute for Brain, Cognition and Behaviour at the same university. Her research interests focus on narrowing the gap between automatic and human spoken-word recognition. In particular, she is interested in the question of where the difference between human and machine recognition performance originates, and whether it is possible to narrow this performance gap. She investigates these questions using a combination of computational modelling, machine learning, behavioral experimentation, and EEG. In 2008, she co-organized the Interspeech 2008 Consonant Challenge, which aimed at promoting comparisons of human and machine speech recognition in noise in order to investigate where the human advantage in word recognition originates. She was one of the initiators of the EU Marie Curie Initial Training Network “Investigating Speech Processing In Realistic Environments” (INSPIRE, 2012-2015). In 2017, she co-organized a 6-week Frederick Jelinek Memorial Summer Workshop on Speech and Language Technology on the topic of the automatic discovery of grounded linguistic units for languages without orthography. In 2017, she was elected onto the board of the International Speech Communication Association (ISCA), and in 2018 onto the IEEE Speech and Language Processing Technical Committee.