Silicon Speech – Communication Advances Through Technology

Silicon Speech performs a broad range of research pertaining to spoken and written language. Speech synthesis, automatic speech recognition, foreign-language instruction, native-language reading instruction, hearing impairment and cognitive-linguistic function are among the applications of interest.

Steven Greenberg, founder and president, trained in linguistics (A.B. – University of Pennsylvania; Ph.D. – University of California, Los Angeles), neuroscience (UCLA, Northwestern, University of Wisconsin, Madison), psychoacoustics (Northwestern), computer science (Pennsylvania, University of California, Berkeley), electrical engineering (UCLA, Wisconsin; UC-Berkeley), and anthropology (Pennsylvania). His publications are available on this web site.

Among the topics of current interest to Silicon Speech are the following.

Speech Synthesis

STRAIGHT TALK Speech Synthesis

Current methods for creating artificial speech (synthesis) are limited in a number of ways. Vocal-tract model synthesis sounds unnatural and machine-like; moreover, it is time-consuming and tedious to fine-tune for specific applications. Concatenative synthesis, currently the leading method for commercial applications, requires large amounts of pre-recorded spoken material and is not easily extensible to speaking styles and topic domains distinct from those recorded. Concatenative methods are particularly deficient at capturing the expressive style and nuance of verbal communication.

STRAIGHT TALK represents a new method of synthesizing speech that is not subject to these limitations and constraints. Although it uses pre-recorded speech, the manner in which these materials are used differs from conventional concatenative methods. Rather than modeling the speech signal with cepstral features, STRAIGHT TALK models perceptually relevant units in a way that preserves and extends both the segmental and prosodic character of the spoken material. This project is funded by the Air Force Office of Scientific Research and the Carlsberg Foundation of Denmark, and is performed in collaboration with the Department of Electrical Engineering, University of Washington (Les Atlas, Cameron Colpitts), the University of Wakayama (Hideki Kawahara), the University of Nagoya (Hideki Banno), the Centre for Applied Hearing Research at the Technical University of Denmark (Torsten Dau and Thomas Christiansen), the Computational Linguistics Department of the Copenhagen Business School (Peter Juel Henrichsen) and the Linguistics Laboratory of the University of Copenhagen (Nina Grønnum).
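
For context, the minimal Python sketch below (NumPy assumed; frame length and sampling rate are illustrative) shows the conventional real-cepstrum computation that cepstral front ends build on – the representation STRAIGHT TALK departs from – and says nothing about STRAIGHT TALK itself.

    import numpy as np

    def real_cepstrum(frame):
        """Real cepstrum of one windowed speech frame: the inverse FFT of the
        log magnitude spectrum. The low-order coefficients summarize the
        spectral envelope that cepstral front ends model."""
        spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
        log_magnitude = np.log(np.abs(spectrum) + 1e-12)  # guard against log(0)
        return np.fft.irfft(log_magnitude)

    # Illustration only: a 25 ms frame at 16 kHz (400 samples) of noise.
    frame = np.random.randn(400)
    print(real_cepstrum(frame)[:13])  # first 13 coefficients, a common choice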

Distance Cuing in Speech Synthesis for Auditory Displays

Talkers adjust the way they speak depending on (1) the distance between talker and listener and (2) the background acoustic environment. This project studies how well listeners can judge the distance at which speech is spoken in a variety of listening conditions. The project is also developing new ways of synthesizing realistic-sounding speech material with embedded distance cues. Funded by the Air Force Office of Scientific Research, this research is performed in collaboration with the Speech and Hearing Research Center of the Martinez Veterans Administration Medical Center (Pierre Divenyi), the CIPIC Interface Laboratory at the University of California, Davis (Ralph Algazi and Richard Duda) and the Auditory Media Laboratory of Wakayama University in Japan (Hideki Kawahara).
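
As a rough illustration of the two best-known acoustic distance cues – overall level falling with distance and extra high-frequency loss from air absorption – the Python sketch below applies both to a recorded signal. It assumes NumPy and SciPy, uses illustrative cutoff values, and is not the project's synthesis method.

    import numpy as np
    from scipy.signal import lfilter

    def simulate_distance(speech, fs, distance_m, ref_m=1.0):
        """Toy rendering of two classic distance cues:
        (1) sound-pressure level falls roughly as 1/r, and
        (2) air absorption removes more high-frequency energy as distance grows."""
        gain = ref_m / max(distance_m, ref_m)      # inverse-distance attenuation
        attenuated = gain * speech

        # Crude stand-in for air absorption: a one-pole low-pass filter whose
        # cutoff drops as the talker moves farther away (values illustrative).
        cutoff_hz = max(2000.0, 16000.0 / (1.0 + distance_m / 10.0))
        alpha = np.exp(-2.0 * np.pi * cutoff_hz / fs)
        return lfilter([1.0 - alpha], [1.0, -alpha], attenuated)

    # Usage (hypothetical file): speech, fs = soundfile.read("utterance.wav")
    # distant = simulate_distance(speech, fs, distance_m=8.0)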

Speech Perception

Information-theoretic Approach to Spoken Language Processing

Conventional methods for assessing how well listeners decode the speech signal are based mostly on the number of words and phonemes identified correctly. Such analyses do not explain why listeners in general (or a specific listener) have trouble understanding spoken material across a variety of acoustic conditions. Instead, we use an information-theoretic measure associated with the basic building blocks of spoken language – articulatory-acoustic features – to evaluate how well listeners decode the speech signal. This metric has been successfully applied to Danish consonants under a broad range of listening conditions. This project is funded by the Oticon Foundation of Denmark and is performed in collaboration with the Centre for Applied Hearing Research at the Technical University of Denmark (Torsten Dau and Thomas Christiansen).
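
The Python sketch below (NumPy assumed; the confusion counts are hypothetical) shows the kind of computation such a metric builds on: the information transmitted between presented and reported categories, in the spirit of classic feature-transmission analyses of consonant confusions.

    import numpy as np

    def transmitted_information(confusions):
        """Mutual information (in bits) between stimulus and response categories,
        estimated from a confusion-count matrix
        (rows = category presented, columns = category reported)."""
        p = confusions / confusions.sum()
        px = p.sum(axis=1, keepdims=True)   # stimulus marginals
        py = p.sum(axis=0, keepdims=True)   # response marginals
        nonzero = p > 0
        return float(np.sum(p[nonzero] * np.log2(p[nonzero] / (px @ py)[nonzero])))

    # Hypothetical voiced/voiceless confusions: 90% and 85% correct responses.
    voicing = np.array([[90.0, 10.0],
                        [15.0, 85.0]])
    bits = transmitted_information(voicing)
    print(f"{bits:.2f} of a possible 1.00 bit transmitted")  # stimulus entropy = 1 bit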

Spoken Word Recognition by Humans: A Single or Multi-layer Process?

The neural mechanisms by which listeners decode spoken language are not well understood. This project, funded by the Air Force Office of Scientific Research, seeks to understand these mechanisms through a combination of perceptual experiments and computational modeling. It is performed in collaboration with Sensimetrics Corporation (Oded Ghitza).

The Role of Prosodic Information in Speech Processing Under Adverse Acoustic Conditions

We still do not know how listeners decode the speech signal in noisy conditions, particularly in the absence of visual cues. One possibility is that prosody – the rhythm, intonation and meter of speech – helps the listener. In collaboration with the Speech and Hearing Research Center of the Martinez Veterans Administration Medical Center (Pierre Divenyi), we are investigating listeners’ ability to decode the speech signal in extremely noisy conditions to ascertain whether certain prosodic properties of the acoustic signal play a significant role in understanding spoken language in such environments.
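
One routine step in experiments of this kind is mixing speech with noise at a precisely controlled signal-to-noise ratio. The minimal Python sketch below (NumPy assumed) illustrates that step; it is not the laboratory's exact protocol, and the SNR value is hypothetical.

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Scale the noise so the mixture has the requested speech-to-noise
        ratio (in dB), then add it to the speech. Assumes the noise recording
        is at least as long as the speech."""
        noise = noise[:len(speech)]
        speech_rms = np.sqrt(np.mean(speech ** 2))
        noise_rms = np.sqrt(np.mean(noise ** 2))
        target_noise_rms = speech_rms / 10.0 ** (snr_db / 20.0)
        return speech + noise * (target_noise_rms / noise_rms)

    # Example: a severely degraded stimulus at -9 dB SNR.
    # mixture = mix_at_snr(speech, babble, snr_db=-9.0)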

Auditory-Visual Processing of Speech

The eyes often contribute as much to understanding spoken language as the ears. The visual component of the speech signal is particularly significant in noisy environments and for the hearing impaired. The visual cues (“visemes”) can mean the difference between understanding most of the words spoken and virtually none.

In collaboration with the Auditory-Visual Speech Recognition Laboratory of the Army Audiology and Speech Center at the Walter Reed Army Medical Center (Ken Grant), Silicon Speech is exploring the temporal constraints that bind the acoustic and visual components of the speech signal in the brain. This research is important not only for understanding normal speech processing, but also for learning how technology can be used to ameliorate communicative deficits associated with hearing loss.
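
A standard way to probe such temporal constraints is to impose a controlled asynchrony between the audio and the video and measure how intelligibility changes. The Python sketch below (NumPy assumed; the offset value is hypothetical) shows only the audio-side manipulation.

    import numpy as np

    def offset_audio(audio, fs, offset_ms):
        """Delay the audio track relative to the video (positive offset) or
        advance it (negative offset) by padding or trimming samples,
        keeping the overall length unchanged."""
        shift = int(round(fs * offset_ms / 1000.0))
        if shift >= 0:
            return np.concatenate([np.zeros(shift), audio[:len(audio) - shift]])
        return np.concatenate([audio[-shift:], np.zeros(-shift)])

    # Example: audio lagging the video by 160 ms.
    # lagged = offset_audio(audio, fs, offset_ms=160.0)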

Brain Imaging of Spoken Language Processing

Until recently, the role of specific parts of the brain in controlling language was largely a matter of guesswork, based on decades of clinical research on brain-damaged individuals. This is no longer the case, thanks to significant advances in brain-imaging technology over the past decade. It is now possible to visualize neural activation associated with speech production and reception in fine spatial detail using functional magnetic resonance imaging (fMRI). Because the temporal resolution of fMRI is coarse, methods with finer time resolution, such as magnetoencephalography (MEG), are often used in tandem.

Silicon Speech has an on-going collaboration with the Cognitive Neuroscience of Language Laboratory at the University of Maryland, College Park (David Poeppel) concerning the neural correlates of speech perception (both auditory and auditory-visual). Of particular interest is the time course of speech decoding and understanding.

A second collaboration is with the Auditory Neuroscience and Speech Recognition Laboratory at the Center for Mind and Brain of the University of California, Davis (Lee Miller). The focus of the research in this laboratory is on auditory-visual integration during speech processing.

Hearing Impairment

Hearing Aid Fitting

Conventional methods for fitting a hearing aid are largely based on research performed in the 1930s and 1940s. In this standard paradigm, the primary deficit is assumed to be one of auditory sensitivity: provide sufficient gain to the portions of the frequency spectrum most affected by sensorineural damage, and the ability to understand speech should be restored. Unfortunately, this “audibility” model rarely works in the real world, which is why so many who purchase hearing aids are dissatisfied with the results. Silicon Speech helps Hearing Centers Network develop and disseminate novel methods for fitting hearing aids. In controlled tests, HCN fitting methods provided as much as 300% better speech intelligibility than standard procedures. The results of this study were published in the 21st Danavox Symposium [PDF].
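
For illustration, the Python sketch below encodes one of the simplest audibility-based prescriptions, the classic “half-gain” rule of thumb (insertion gain of roughly half the hearing loss at each audiometric frequency). The audiogram values are hypothetical, and the code does not represent the HCN fitting method.

    # Hypothetical audiogram: hearing thresholds in dB HL by frequency (Hz).
    audiogram_db_hl = {250: 20, 500: 30, 1000: 40, 2000: 55, 4000: 65, 8000: 70}

    # Classic half-gain rule: prescribe insertion gain equal to half the loss.
    prescribed_gain_db = {freq: 0.5 * loss for freq, loss in audiogram_db_hl.items()}

    for freq, gain in sorted(prescribed_gain_db.items()):
        print(f"{freq:>5} Hz: {gain:5.1f} dB insertion gain")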

Speech Enhancement for Auditory Prostheses

Silicon Speech has collaborated with the Acoustics Research Institute of the Austrian Academy of Sciences on speech processing strategies for cochlear implants (Werner Deutsch and Bernhard Laback).

Speech Enhancement for Hearing Aids

Reverberation and background noise (particularly speech babble) represent particular challenges for the hearing impaired, even when wearing digital hearing aids. One approach to solving this problem is to enhance the information-bearing components of the acoustic signal. Silicon Speech occasionally works with the Speech Processing Laboratory of Sophia University (Takayuki Arai) on methods to enhance speech comprehension under adverse acoustic conditions, particularly for individuals with a hearing impairment.
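
One common way to quantify those information-bearing components is to examine the slow amplitude-envelope modulations (roughly 2 to 8 Hz) that track the syllable rate. The Python sketch below (NumPy and SciPy assumed) measures them; it is an illustration only, not the laboratory's enhancement algorithm.

    import numpy as np
    from scipy.signal import hilbert, welch

    def modulation_spectrum(speech, fs):
        """Power spectrum of the slow amplitude envelope of a speech signal;
        energy near 2-8 Hz reflects syllable-rate modulations."""
        envelope = np.abs(hilbert(speech))     # amplitude envelope
        envelope = envelope - envelope.mean()  # remove the DC component
        return welch(envelope, fs=fs, nperseg=min(len(envelope), 4 * fs))

    # Usage (hypothetical recording): freqs, power = modulation_spectrum(speech, fs)
    # Inspect power where 2 Hz <= freqs <= 8 Hz.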

Reading Instruction

Even in our electronic era, reading is the primary gateway to knowledge and news. Despite intensive instruction in the early grades, many children experience difficulty learning to read; a significant proportion will never read proficiently, even as adults. The damage to the individual and to society is beyond measure.

English is a particularly difficult language to learn to read. The reasons for this difficulty are controversial; however, the factors generally cited are (1) irregular spelling, (2) many exceptions to rules, (3) a large vocabulary and (4) difficult grammar. Approximately five to ten percent of children are classified as “dyslexic,” meaning that their difficulties are considered biological in origin.

The proportion of dyslexic children depends on the language. In languages with a transparent sound-to-grapheme correspondence (such as Italian and Spanish), the proportion of dyslexic children is small. In contrast, in languages where the sound-to-grapheme relation is relatively opaque (e.g., English and Danish), the proportion of dyslexics is much higher.

In our view, English spelling is far more systematic than generally recognized. The problem lies in attempting to match each orthographic character with a specific sound or set of sounds. English spelling does not work this way (nor does Danish). Silicon Speech is working on new methods for teaching reading based on prosodic properties of spoken language. This research is being performed in collaboration with the Centre for Neuroscience in Education at the University of Cambridge (Usha Goswami).