Invited Speakers

Day 1

Volker Helzle (Filmakademie Baden-Württemberg, Institute of Animation)

Volker Helzle is in charge of Research and Development at the Institute of Animation at Filmakademie Baden-Württemberg. After graduating from HDM Stuttgart Media University (Dipl. Ing. AV-Medien) in 2000, he moved to California and worked for three years at Eyematic Interfaces (later acquired by Google), where his team pioneered facial performance capture and contributed substantially to the engineering and development of the Eyematic Facestation. In 2003 he joined Filmakademie, where he supervises the research and development department at the Institute of Animation. The primary focus of his first few years at Filmakademie was the development of facial animation tools. This led to one of the first plausible technology tests, realizing a virtual actor in an exemplary VFX production. In addition to his technical research, Volker supervises the curriculum for the postgraduate Technical Director (TD) course at Filmakademie. TDs tackle the technological challenges of Animation, VFX and Transmedia productions at Filmakademie. The close relationship with the research group allows students to engage in multidisciplinary projects. As a program consultant he contributes to the organization of the annual FMX conference. As a C-64 kid of the '80s, Volker was strongly influenced by video games and early computer graphics. To this day he is a passionate gamer, but he also enjoys completely analog activities like mountain hiking, gardening and yoga.

Keynote: An artistic & tool driven approach for believable digital characters.

This talk focuses on the practical tools developed at the Filmakademie for the creation of believable digital characters. We will discuss solutions that have been implemented to achieve realistic and physically plausible facial deformations with a short setup time. We will also look into new applications that make use of these characters, such as the cloud-based animated messaging service for mobile devices (Emote), an interactive installation where animated characters recite poetry in an emotional way, and our approach to using stylized animated faces in autism research.

Veronica Orvalho (University of Porto, Department of Computer Science)

Verónica Costa Orvalho holds a Ph.D. in Software Development (Computer Graphics) from Universitat Politècnica de Catalunya (2007), where her research centred on "Facial Animation for CG Films and Videogames". Over the past 15 years she has worked at IT companies such as IBM and Ericsson, and at film companies, including Patagonik Film Argentina. She has given many workshops and has international publications related to game design and character animation in conferences such as SIGGRAPH. She has received international awards for several projects: "Photorealistic facial animation and recognition", "Face Puppet" and "Face In Motion". She received the 2010 IBM Scientific Award for her work on facial rig retargeting. She is now a full-time professor at Porto University. In 2010 she founded the Porto Interactive Center at Porto University, which hosts several international and national projects as project coordinator or participant. She has strong connections with film and game companies, and has consulted on and participated in several productions such as Fable 2 and The Simpsons Ride. She has current and past close collaborations with film and game companies such as Blur Studios, Electronic Arts and Microsoft. Her main research interests are in developing new methods related to motion capture, geometric modeling and deformation, facial emotion synthesis and analysis, real-time animation for virtual environments and the study of intelligent avatars.

Keynote: How to create a look-a-like avatar pipeline using low-cost equipment.

Creating a 3D avatar that looks like a specific person is time-consuming and requires expert artists, expensive equipment and a complex pipeline. In this talk I will walk you through the avatar animation pipeline created at PIC (Porto Interactive Center) for the VERE (Virtual Embodiment and Robotic re-Embodiment) European Project. This new pipeline does not require the user to have artistic knowledge; it uses regular cameras to create the 3D avatar and a webcam to generate the animation. I will explain how we designed and created the look-a-like system at each stage: modelling, rigging and animation. I will also describe the challenges we had to overcome and the current status of the system. I will show some of our current avatar results, which could be used, for example, in games, interactive applications and virtual reality. I look forward to seeing you at the talk!

Day 2

Jean-Luc Schwartz (GIPSA Lab, Grenoble)

Jean-Luc Schwartz, Research Director at CNRS, led ICP (Institut de la Communication Parlée, Grenoble, France) from 2003 to 2006. His main areas of research involve perceptual processing, perceptuo-motor interactions, audiovisual speech perception, phonetic bases of phonological systems and the emergence of language, with publications in cognitive psychology (e.g. Cognition, Perception & Psychophysics, Behavioral & Brain Sciences, Hearing Research), neurosciences (e.g. Neuroimage or Human Brain Mapping), signal processing and computational modelling (e.g. IEEE Trans. Speech and Audio Processing, JASA, Computer Speech and Language, Language and Cognitive Processes), and phonetics in relation to phonology (e.g. Journal of Phonetics or Phonology Laboratory). He has been involved in many national and European projects, and has been responsible for some of them. He has coordinated a number of special issues of journals such as Speech Communication, Primatology, Philosophical Transactions of the Royal Society B, Frontiers in Psychology and Journal of Phonetics. He has organized several international workshops on Audiovisual Speech Processing, Language Emergence and Face-to-Face Communication.

Keynote: Audiovisual binding in speech perception

Over the last few years in Grenoble, we have developed a series of experiments in which we attempt to show that audiovisual speech perception comprises an "audiovisual binding" stage before fusion and decision. This stage would extract and associate the auditory and visual cues corresponding to a given speech source, before further categorisation processes take place at a higher stage. We developed paradigms to characterize audiovisual binding in terms of both "streaming" and "chunking" adequate pieces of information. This can lead to elements of a possible computational model, in relation to a larger theoretical perceptuo-motor framework for speech perception, the "Perception-for-Action-Control" Theory.

Day 3

Frank Soong (Microsoft Research Asia)

Frank K. Soong is a Principal Researcher and Research Manager in the Speech Group at Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research on speech and its practical applications. His professional research career spans over 30 years, first with Bell Labs, US, then with ATR, Japan, before joining MSRA in 2004. At Bell Labs, he worked on stochastic modeling of speech signals, optimal decoding algorithms, speech analysis and coding, and speech and speaker recognition. He was responsible for developing the recognition algorithm that was built into voice-activated mobile phone products rated by Mobile Office Magazine (Apr. 1993) as "outstandingly the best". He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) software package. He has served as a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society and in other society functions, including as Associate Editor of the IEEE Transactions on Speech and Audio Processing and as an IEEE workshop chair. He has published extensively, with more than 200 papers, and co-edited a widely used reference book, Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer, 1996. He is a visiting professor at the Chinese University of Hong Kong (CUHK) and a few other top-rated universities in China. He is also the co-Director of the National MSRA-CUHK Joint Research Lab. He received his BS, MS and PhD from National Taiwan Univ., Univ. of Rhode Island and Stanford Univ., respectively, all in Electrical Engineering. He is an IEEE Fellow "for contributions to digital processing of speech".

Lijuan Wang (Microsoft Research Asia)

Lijuan Wang received her B.E. from Huazhong Univ. of Science and Technology and her Ph.D. from Tsinghua Univ., China, in 2001 and 2006, respectively. In 2006, she joined the speech group of Microsoft Research Asia, where she is currently a lead researcher. Her research areas include audio-visual speech synthesis, deep learning (feedforward and recurrent neural networks), and speech synthesis (TTS)/recognition. She has published more than 25 papers in top conferences and journals, and she is the inventor or co-inventor of more than 10 granted or pending U.S. patents. She is a senior member of IEEE and a member of ISCA.

Keynote: From Text-to-Speech (TTS) to Talking Head - A machine learning approach to A/V speech modeling and rendering

In this talk, we will present our research results in A/V speech modeling and rendering via a statistical, machine learning approach. A Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM) will be reviewed first for speech modeling, where the GMM models the stochastic nature of speech production and the HMM characterizes the Markovian nature of speech parameter trajectories. All parametric speech models are estimated via an EM-algorithm-based maximum likelihood procedure, and the resulting models are used to generate speech parameter trajectories for a given text input, say a sentence, in the maximum probability sense. The generated parameters are then used to synthesize the corresponding speech waveforms via a vocoder, or to render high-quality output speech with our "trajectory tiling algorithm", where appropriate segments of the training speech database are used to "tile" the generated trajectory optimally. Similarly, the lip movements of a talking head, along with jointly moving articulators such as the jaw, tongue and teeth, can also be trained and rendered according to the same optimization procedure. The visual parameters of a talking head can be collected via 2D or 3D video (using stereo or multi-camera recording equipment, or consumer-grade capturing devices like Microsoft Kinect), and the corresponding visual trajectories of intensity, color and spatial coordinates are modeled and synthesized similarly. Recently, feedforward Deep Neural Net (DNN) and Recurrent Neural Net (RNN) machine learning algorithms have been applied to speech modeling for both recognition and synthesis applications. We have deployed both forms of neural nets in TTS training successfully. The RNN in particular, with its longer memory, can better model speech prosody over longer contexts, say a sentence.
We will also cover the topics of cross-lingual TTS and talking head modeling, where audio and visual data collected in one source language can be used to train a TTS or talking head in a different target language. The mouth shapes of a monolingual speaker have also been found adequate for rendering synced lip movements of talking heads in different languages. Various demos of TTS and talking heads will be shown to illustrate our research findings.