Topics

Monday July 10th

Keynote lecture: SENSEI: Making Sense of Human Conversations (Prof. Giuseppe Riccardi)

Conversational interaction is the most natural and persistent paradigm for business relations with customers. In contact centres, millions of calls are handled daily; on social media platforms, millions of blog posts are exchanged amongst users. What can we learn from this data about human conversation at the micro and macro level? Can we make sense of such conversations and help create assets and value for decision makers in private and public organizations? And, indeed, for anyone interested in conversational content?
The SENSEI project has developed summarization technology to help users make sense of human conversation streams from diverse media channels. It has also designed and evaluated this technology in real-world environments, aiming to improve the task performance and productivity of end users.
In this talk we will review the main scientific achievements, results and use cases of the project across different scenarios.

Project Website: http://www.sensei-conversation.eu

Lecture: Deep Learning for Text-to-Speech (TTS) (Dr. Antonio Bonafonte)

During the last decade, Deep Learning has emerged as a machine learning technique that is becoming predominant in many areas, such as computer vision, natural language processing (including machine translation) and speech processing (including speech and speaker recognition and speech synthesis).

On the first day of the RTTH Summer School, we will give an introduction to Deep Learning for Speech Processing, with a particular focus on text-to-speech synthesis, both with parametric representations of the speech signal and with the new end-to-end architectures. The lecture will be divided into three sessions:

  • Session 1: Introduction to Deep Learning for Speech Processing (90 min.): This session will give an overview of the different architectures (MLP, CNN, RNN, autoencoders, …), the creation of embeddings and their use for language modelling, as well as the use of deep models for speech recognition and machine translation.
  • Session 2: Parametric speech synthesis using Deep Learning (1h.): This session will present the different parametric representations used for speech synthesis, such as vocoder parameters (for the speech signal) and linguistic labels (for the text); it will give a more in-depth view of the two-stage speech synthesis model and of the use of adaptation and interpolation for improving TTS systems (a minimal sketch of this label-to-vocoder mapping follows the list).
  • Session 3: End-to-end speech generation using Deep Learning (1h.): Different end-to-end speech generation models will be presented, such as WaveNet (Google), Deep Voice (Baidu), Tacotron (Google) or SEGAN (UPC).
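
As a taste of the two-stage model discussed in Session 2, here is a minimal, hypothetical sketch of its second stage: an acoustic model mapping linguistic label features to vocoder parameters. The dimensions and data are invented (random arrays stand in for real aligned label and vocoder frames), and a plain least-squares linear map stands in for the deep networks the lecture covers.

    import numpy as np

    # Toy stand-ins: 300 linguistic label features per frame and 60
    # vocoder parameters per frame (e.g. spectral envelope, f0,
    # aperiodicity). Random data replaces a real parallel corpus.
    rng = np.random.RandomState(0)
    X = rng.rand(1000, 300)   # label features (hypothetical)
    Y = rng.rand(1000, 60)    # vocoder parameters (hypothetical)

    # Simplest possible acoustic model: a linear map fitted by least
    # squares; DNN-based systems replace this with deep networks.
    W = np.linalg.pinv(X) @ Y

    # Synthesis: predict vocoder frames for new label frames; a vocoder
    # (e.g. WORLD or STRAIGHT) would then render them as a waveform.
    Y_hat = rng.rand(10, 300) @ W
    print(Y_hat.shape)  # (10, 60)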

Tuesday July 11th

Lecture: Information search and retrieval on big data (Dr. Laura Docío, Dr. Paula López)

The available technologies for finding information on the web or in large repositories are mostly focused on text but, nowadays, interacting with audiovisual content on the Internet is becoming more and more common. The number of videos that can be found on news websites is steadily increasing, and classical blogs are gradually turning into videoblogs. These changes in the way information is presented require new approaches for searching and retrieving it. Of special interest is the possibility of obtaining human-centered information: Who appears in the recordings? What are they talking about? The lecture will be divided into two sessions:

  • Session 1: Content search on multimedia documents (1h.): Being able to search for content in multimedia recordings in the same way as in written documents would change the way we interact with this content, but the performance of the available techniques is still very limited due to the complexity of the task. This talk will discuss the current techniques for automatic search within large multimedia databases, their limitations and their future directions (a small illustrative sketch follows this list).
  • Session 2: Person characterization in multimedia documents (1h.): Humans are naturally interested in finding out about other people: who they are, what they have been doing lately, how they are feeling, what their opinion is… This interest is hardly satisfied when searching within multimedia documents, since the only accessible information usually comes from tags, and these tags are typically added by humans, a highly time-consuming task. This talk will address the challenge of automatically tagging and labeling large multimedia collections so as to obtain information about the people present in the documents, in terms of characteristics such as their identity, age, sex or emotional state.
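
One common building block for content search in multimedia (chosen here for illustration; not necessarily the techniques the talk covers) is to transcribe the recordings with a speech recognizer and index the transcripts with word-level timestamps. The sketch below shows the mechanics with invented file names, words and times:

    from collections import defaultdict

    # Hypothetical ASR output: (word, start time in seconds) per recording.
    transcripts = {
        "news_0117.mp4": [("minister", 12.4), ("budget", 13.1), ("vote", 95.0)],
        "blog_0042.mp4": [("budget", 7.9), ("travel", 30.2)],
    }

    # Build an inverted index: word -> list of (recording, timestamp).
    index = defaultdict(list)
    for rec, words in transcripts.items():
        for word, t in words:
            index[word].append((rec, t))

    def search(term):
        """Return every recording and time offset where the term was spoken."""
        return index.get(term.lower(), [])

    print(search("budget"))
    # e.g. [('news_0117.mp4', 13.1), ('blog_0042.mp4', 7.9)]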

Lecture: Silent Speech: Reconstructing Speech from Articulators Movement Data by Machine Learning (Dr. José A. González)

Total removal of the larynx is often required to treat laryngeal cancer: every year some 17,500 people in Europe and North America lose the ability to speak in this way. Current methods for restoring speech include the electro-larynx, which produces an unnatural, electronic voice, oesophageal (belching) speech, which is difficult to learn, and fistula valve speech, which is considered to be the current gold standard but requires regular hospital visits for valve replacement and produces a masculine voice unpopular with female patients. All these methods sacrifice the patient’s spoken identity.
In this talk we describe a technique which has the potential to restore the power of speech by sensing movement of the remaining speech articulators and using machine learning algorithms to derive a transformation which converts this sensor data into an acoustic signal – ‘Silent Speech’. We report experiments with several machine learning techniques and show that the Silent Speech generated, which may be delivered in real time, is intelligible and sounds natural. The identity of the speaker is recognisable.
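
To make the idea of a learned articulatory-to-acoustic transformation concrete, here is a deliberately simplified sketch: synthetic random data stands in for real sensor and acoustic frames, and a single ridge-regression map stands in for the machine learning techniques the talk reports on.

    import numpy as np

    # Toy stand-ins for the articulatory-to-acoustic mapping: 18 sensor
    # channels per frame (e.g. articulograph coordinates) and 25 acoustic
    # features per frame (e.g. mel-cepstral coefficients). Random data
    # replaces a real parallel corpus.
    rng = np.random.RandomState(1)
    X = rng.randn(5000, 18)    # sensor frames (synthetic)
    Y = rng.randn(5000, 25)    # acoustic frames (synthetic)

    # Simplest learned transformation: ridge regression, one closed-form
    # solve covering all output features at once.
    lam = 1e-2
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

    # At run time, each incoming sensor frame is mapped to acoustic
    # features, which a vocoder can turn into audible speech in real time.
    acoustic = rng.randn(1, 18) @ W
    print(acoustic.shape)  # (1, 25)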

Finally, in the talk we also describe other potential uses of silent speech interfaces (e.g. maintaining privacy when making phone calls in public areas, or speech communication in noisy environments) and the details of an implementation of a silent speech system for Android.

Workshop: TensorFlow for Deep Learning (Dr. Ángel Gómez)

TensorFlow is a multipurpose open source library for numerical computation using data flow graphs. It has been designed with deep learning in mind, but it is applicable to a much wider range of problems. In this workshop we will cover the very basics of TensorFlow and apply them to deep learning programming. First we will explore the main aspects of the TensorFlow Python API and put them into practice with simple Python examples. Then we will learn how to build more advanced solutions for deep learning, implementing our first deep neural network and applying it to a classification problem. The workshop will be divided into three sessions:

  • Session 1: Basics of TensorFlow (30 min.): In this first part we will describe the basics of TensorFlow and graph-based programming: model definition and execution; constant, variable and placeholder tensors; basic operators; and gradient computation and optimization (see the first sketch after this list).
  • Session 2: Deep Neural Network implementation (20 min.): In this second part we will focus on a DNN implementation: training with data, weight optimization by gradient descent, and network evaluation. To this end we will consider a non-linearly separable classification problem (see the second sketch after this list).
  • Session 3: Hands-on training:
    • S3.1: Simple data regression (30 min.).
    • S3.2: DNN build-up, training and evaluation (40 min.).
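
The following minimal sketch (written for the TensorFlow 1.x API current at the time of the workshop; not the official workshop code) touches each item from Session 1: constant, variable and placeholder tensors, basic operators, gradient computation and a gradient descent step.

    import tensorflow as tf  # TensorFlow 1.x API

    # Constant, variable and placeholder tensors.
    a = tf.constant(2.0)                      # fixed value baked into the graph
    w = tf.Variable(1.0)                      # trainable state
    x = tf.placeholder(tf.float32, shape=())  # fed at execution time

    # Basic operators build the data flow graph; nothing runs yet.
    y = a * w * x + 1.0

    # Gradient computation and optimization.
    grad_w = tf.gradients(y, [w])[0]
    step = tf.train.GradientDescentOptimizer(0.1).minimize(y, var_list=[w])

    # Execution: a session runs the graph with concrete values.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y, feed_dict={x: 3.0}))       # 7.0
        print(sess.run(grad_w, feed_dict={x: 3.0}))  # dy/dw = a*x = 6.0
        sess.run(step, feed_dict={x: 3.0})           # one gradient descent step
        print(sess.run(w))                           # 1.0 - 0.1*6.0 = 0.4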
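
And here is a sketch in the spirit of Session 2: a small DNN trained on XOR, the classic non-linearly separable toy problem. Layer sizes and the optimizer are illustrative choices, not the workshop's.

    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x API

    # XOR: four points that no single linear boundary can separate.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
    Y = np.array([[0], [1], [1], [0]], dtype=np.float32)

    x = tf.placeholder(tf.float32, [None, 2])
    y = tf.placeholder(tf.float32, [None, 1])

    # A small DNN: one hidden layer is enough for XOR.
    h = tf.layers.dense(x, 8, activation=tf.nn.tanh)
    logits = tf.layers.dense(h, 1)

    loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
    train = tf.train.AdamOptimizer(0.05).minimize(loss)

    # Net evaluation: threshold the sigmoid output and compare to labels.
    correct = tf.equal(tf.cast(tf.sigmoid(logits) > 0.5, tf.float32), y)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(2000):                    # training loop
            sess.run(train, {x: X, y: Y})
        print(sess.run(accuracy, {x: X, y: Y}))  # typically 1.0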

Wednesday July 12th

Lecture: Applied research and technology transfer in speech technologies (Dr. Arantza del Pozo)

Speech technologies and applications are already present in our daily lives. We use them to interact with personal assistants (such as Siri, Cortana or Google Now), dictate text messages or reach customer services – applications that are mostly provided by large American multinationals. However, speech technologies can also be used in many more application scenarios and other companies are looking to integrate them in their products and services. In this talk, we will give an overview of the current speech technology market trends and needs. We will analyse the related technological challenges and propose a technology development roadmap according to the identified needs. Finally, we will show some concrete examples of technology transfer in speech technologies.

Lecture: Language modelling challenges for an IoT world (Dr. Jesús Andrés-Ferrer)

IoT provides an unmatched opportunity for massive data and domain adaptation. In this talk we will highlight some of the challenges of IoT, with a focus on speech technologies, from the vantage point of a leader in speech and natural language processing technologies. Some of these opportunities and challenges will be analysed from a speech recognition standpoint, with special attention to recent improvements in language modelling technologies. One of the main challenges comes from the massive amount of data and the limited resources for exploiting it; many interesting works have been proposed in the literature to address this and other challenges. Another interesting challenge is the contrast between many of the well-established corpora for language modelling in the research community and real production systems. We will analyse some of those corpora, as well as the proposed techniques aimed at production systems. Finally, we will shift our focus to the privacy concerns to which those massive amounts of data are inherently bound.
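
As one concrete example of the adaptation techniques this area deals with (chosen for illustration; the talk's own examples may differ), the sketch below linearly interpolates a general unigram language model with a tiny in-domain one. Both corpora are invented.

    from collections import Counter

    # Two tiny toy corpora: a generic one and an IoT-flavoured one.
    general = "turn the page and read the book".split()
    iot = "turn the lights off and lock the door".split()

    def unigram(corpus):
        counts = Counter(corpus)
        total = sum(counts.values())
        return lambda w: counts[w] / total if w in counts else 1e-6  # crude floor

    p_gen, p_iot = unigram(general), unigram(iot)

    # Linear interpolation: a classical way to adapt a large general LM
    # to a new domain when in-domain data is scarce.
    lam = 0.3  # interpolation weight, tuned on held-out in-domain data
    p_adapted = lambda w: lam * p_iot(w) + (1 - lam) * p_gen(w)

    for w in ("book", "lights"):
        print(w, round(p_gen(w), 3), round(p_adapted(w), 3))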

Lecture: The UPV experience in speech and video for educational content (Dr. Carlos Turró)

Over the last 10 years, the Universitat Politècnica de Valencia (UPV) has developed a large program of digital content production. That program has eased the video production process, and as a result UPV has reached substantial volumes of video production for learning. Since 2010 UPV has also been part of the Opencast consortium for video recording, and we now have 64 lecture rooms equipped for autonomous recording. The UPV environment is therefore a great testbed for the development and trial of speech technology: UPV has many needs in this area, but it also has researchers to address them. In the talk I will present the project, the content, the speech developments that we are using and some ideas about what could be used in our University.

Thursday July 13th

Lecture: Robust speech recognition in mobile devices (Prof. Antonio Peinado)

This lecture will provide an overview of the different technologies applied in mobile devices to obtain robust speech recognition. It will be divided into two sessions:

  • Session 1: Review, from the 90s until now (1h.): Initial technologies for speech recognition in mobile devices; the use of remote speech recognition (RSR): speech-enabled services, architectures, standards, robustness issues, …; RSR nowadays (Google Cloud Speech API).
  • Session 2: Microphone arrays in portable devices (1h.) (a toy beamforming sketch follows).
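
As a pointer to what Session 2 covers, here is a toy delay-and-sum beamformer (an illustrative sketch, not material from the lecture): the channels of a microphone array are aligned on the desired source and averaged, so coherent speech adds up while uncorrelated noise partially cancels. Signals, delays and noise levels are all invented.

    import numpy as np

    def delay_and_sum(channels, delays):
        """Align each channel by its steering delay (whole samples here,
        for simplicity; real systems use fractional delays derived from
        the array geometry and source direction) and average them."""
        n = min(len(c) for c in channels) - max(delays)
        aligned = [c[d:d + n] for c, d in zip(channels, delays)]
        return np.mean(aligned, axis=0)

    # Synthetic 4-microphone capture: the same "speech" (a sine wave here)
    # arrives at each mic with a different delay, plus independent noise.
    rng = np.random.RandomState(2)
    fs = 16000
    clean = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
    delays = [0, 3, 6, 9]   # hypothetical steering delays in samples
    mics = [np.pad(clean, (d, 0), "constant")[:fs] + 0.5 * rng.randn(fs)
            for d in delays]

    out = delay_and_sum(mics, delays)
    residual = out - clean[:len(out)]
    print("per-mic noise std: 0.5, after beamforming:",
          round(float(np.std(residual)), 3))
    # Averaging 4 mics cuts uncorrelated noise std roughly in half (~0.25).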