Papers by Christian Raymond
Automatic learning of interpretation strategies for spoken dialogue systems
HAL (Le Centre pour la Communication Scientifique Directe), 2004
HAL (Le Centre pour la Communication Scientifique Directe), Sep 16, 2022
HAL (Le Centre pour la Communication Scientifique Directe), 2002
In this article, we describe our participation in the Défi Fouille de Texte (DeFT) 2012 text mining challenge. The challenge consisted of automatically assigning keywords to scientific articles in French, along two tracks for which we used different approaches. For the first track, a list of keywords was provided. We therefore treated the problem as an information retrieval task in which the keywords are the queries. This approach gave excellent results. For the second track, where only the articles were provided, we used an approach based on a term extractor and learning-based reranking.
One of the first steps in building a spoken language understanding (SLU) module for dialogue systems is the extraction of flat concepts out of a given word sequence, usually provided by an automatic speech recognition (ASR) system. In this paper, six different modeling approaches are investigated to tackle the task of concept tagging. These methods include classical, well-known generative and discriminative methods like Finite State Transducers

HAL (Le Centre pour la Communication Scientifique Directe), Jun 26, 2017
This article describes the participation of the LinkMedia team from IRISA in DeFT 2017. Our team took part in all three tasks: polarity classification of non-figurative tweets (task 1), identification of figurative language (task 2), and polarity classification of figurative and non-figurative tweets (task 3). For these three tasks, we adopt a machine learning approach. More precisely, we explore the value of three methods of increasing complexity: i) k-nearest neighbors derived from information retrieval, ii) boosting of decision trees, and iii) recurrent neural networks. Our approaches do not exploit any rich external resources (lexicons, annotated corpora) and rely solely on the textual content of the tweets (and of other tweets for the last approach). This allows us to assess the value of each of these methods, as well as of the representations they exploit, namely bags of words for the first, n-grams for the second, and word embeddings for the neural networks.

IEEE Transactions on Speech and Audio Processing, Nov 1, 2003
This paper introduces new recognition strategies based on reasoning about results obtained with different Language Models (LMs). Strategies are built following the conjecture that the consensus among the results obtained with different models gives rise to different situations in which hypothesized sentences have different word error rates (WER) and may be further processed with other LMs. New LMs are built by data augmentation using ideas from latent semantic analysis and trigram analogy. Situations are defined by expressing the consensus among the recognition results produced with different LMs and by the amount of unobserved trigrams in the hypothesized sentence. The diagnostic power of the use of observed trigrams or their corresponding class trigrams is compared with that of situations based on values of sentence posterior probabilities. In order to avoid or correct errors due to syntactic inconsistency of the recognized sentence, automata, obtained by explanation-based learning, are introduced and used in certain conditions. Semantic Classification Trees are introduced to provide sentence patterns expressing constraints of long-distance syntactic coherence. Results on a dialogue corpus provided by France Telecom R&D have shown that, starting with a WER of 21.87% on a test set of 1422 sentences, it is possible to subdivide the sentences into three sets characterized by automatically recognized situations. The first one has a coverage of 68% with a WER of 7.44%. The second one has various types of sentences with a WER around 20%. The third one contains 13% of the sentences, which should be rejected, with a WER around 49%. The second set characterizes sentences that should be processed with particular care by the dialogue interpreter, with the possibility of asking the user for confirmation.

Video hyperlinking represents a classical example of multimodal problems. Common approaches to such problems are early fusion of the initial modalities and crossmodal translation from one modality to the other. Recently, deep neural networks, especially deep autoencoders, have proven promising both for crossmodal translation and for early fusion via multimodal embedding. A particular architecture, bidirectional symmetrical deep neural networks, has been shown to yield improved multimodal embeddings over classical autoencoders, while also being able to perform crossmodal translation. In this work, we focus first on evaluating good single-modal continuous representations for both textual and visual information. Word2Vec and paragraph vectors are evaluated for representing collections of words, such as parts of automatic transcripts and multiple visual concepts, while different deep convolutional neural networks are evaluated for directly embedding visual information, avoiding the creation of visual concepts. Second, we evaluate methods for multimodal fusion and crossmodal translation, with different single-modal pairs, in the task of video hyperlinking. Bidirectional (symmetrical) deep neural networks were shown to successfully tackle the downsides of multimodal autoencoders and yield a superior multimodal representation. In this work, we extensively test them in different settings, with different single-modal representations, within the context of video hyperlinking.
Our novel bidirectional symmetrical deep neural networks are compared to classical autoencoders and are shown to yield multimodal embeddings that significantly (α = 0.0001) outperform those obtained by deep autoencoders, with an absolute improvement in precision at 10 of 14.1 % when embedding visual concepts and automatic transcripts, and an absolute improvement of 4.3 % when embedding automatic transcripts with features obtained with very deep convolutional neural networks, reaching a precision at 10 of 80 %.
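The precision at 10 figures quoted above can be illustrated with a minimal sketch. The function name `precision_at_k` and the boolean relevance list are hypothetical choices made for illustration; they are not taken from the paper, and the authors' evaluation protocol may differ in detail.

```python
def precision_at_k(ranked_relevance, k=10):
    """Fraction of the top-k ranked targets that are judged relevant.

    ranked_relevance: list of booleans, ordered by decreasing system score
    (an assumed input format, for illustration only).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_relevance[:k]
    # Missing positions (fewer than k results) count as non-relevant.
    return sum(1 for is_relevant in top_k if is_relevant) / k


# Example: 8 of the first 10 proposed targets are relevant.
judgments = [True] * 8 + [False] * 2
score = precision_at_k(judgments, k=10)
```

Under this definition, an absolute improvement of 14.1 % corresponds on average to roughly 1.4 additional relevant targets among the first 10 proposed per anchor.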

Common approaches to problems involving multiple modalities (classification, retrieval, hyperlinking, etc.) are early fusion of the initial modalities and crossmodal translation from one modality to the other. Recently, deep neural networks, especially deep autoencoders, have proven promising both for crossmodal translation and for early fusion via multimodal embedding. In this work, we propose a flexible crossmodal deep neural network architecture for multimodal and crossmodal representation. By tying the weights of two deep neural networks, symmetry is enforced in central hidden layers, thus yielding a multimodal representation space common to the two original representation spaces. The proposed architecture is evaluated in multimodal query expansion and multimodal retrieval tasks within the context of video hyperlinking. Our method demonstrates improved crossmodal translation capabilities and produces a multimodal embedding that significantly outperforms multimodal embeddings obtained by deep autoencoders, resulting in an absolute increase of 14.14 in precision at 10 on a video hyperlinking task (α = 10⁻⁴).
HAL (Le Centre pour la Communication Scientifique Directe), 2016
This paper presents the runs that were submitted to the TRECVid Challenge 2016 for the Video Hyperlinking task. The task aims at proposing a set of video segments, called targets, to complement a query video segment defined as anchor. The 2016 edition of the task encouraged participants to use multiple modalities. In this context, we chose to submit four runs in order to assess the pros and cons of using two modalities instead of a single one and how crossmodality differs from multimodality in terms of relevance. The crossmodal run performed best and obtained the best precision at rank 5 among participants.

Lecture Notes in Computer Science, Dec 31, 2016
Video hyperlinking is the process of creating links within a collection of videos. Starting from a given set of video segments, called anchors, a set of related segments, called targets, must be provided. In the past years, a number of content-based approaches have been proposed, with good results obtained by searching for target segments that are very similar to the anchor in terms of content and information. Unfortunately, relevance has been obtained at the expense of diversity. In this paper, we study multimodal approaches and their ability to provide a set of diverse yet relevant targets. We compare two recently introduced crossmodal approaches, namely deep auto-encoders and bimodal LDA, and experimentally show that both provide significantly more diverse targets than a state-of-the-art baseline. Bimodal auto-encoders offer the best trade-off between relevance and diversity, with bimodal LDA exhibiting slightly more diverse targets at a lower precision.
Recurrent Neural Network (RNN) architectures have recently become a very popular choice for Spoken Language Understanding (SLU) problems; however, they form a large family of different architectures that can furthermore be combined into more complex neural networks. In this work, we compare different recurrent networks, such as simple Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRU) and their bidirectional versions, on the popular ATIS dataset and on MEDIA, a more complex French dataset. Additionally, we propose a novel method where information about the presence of relevant word classes in the dialog history is combined with a bidirectional GRU, and we show that combining relevant word classes from the dialog history improves performance over recurrent networks that work by solely analyzing the current sentence.
Recently, word embedding representations have been investigated for slot filling in Spoken Language Understanding, along with the use of Neural Networks as classifiers. Neural Networks, especially Recurrent Neural Networks, which are specifically adapted to sequence labeling problems, have been applied successfully to the popular ATIS database. In this work, we compare these kinds of models with the previous state-of-the-art Conditional Random Fields (CRF) classifier on a more challenging SLU database. We show that, despite the efficient word representations used within these Neural Networks, their ability to process sequences is still significantly lower than that of CRF, with the additional drawback of higher computational costs, and that the ability of CRF to model output label dependencies is crucial for SLU.

Open Computer Science, 2019
Nowadays, real-life constraints necessitate controlling modern machines through human intervention by means of the sensory organs. Voice is one of the human modalities that can control and monitor modern interfaces. In this context, Automatic Speech Recognition is mainly used to convert natural speech into text and to perform actions based on the instructions given by the speaker. In this paper, we propose a general framework for Arabic speech recognition that uses a Long Short-Term Memory (LSTM) network and a neural network classifier (Multi-Layer Perceptron, MLP) to cope with the non-uniform sequence lengths of speech utterances produced by two feature extraction techniques: (1) Mel Frequency Cepstral Coefficients (MFCC, static and dynamic features) and (2) Filter Bank (FB) coefficients. The neural architecture recognizes isolated Arabic speech through classification. The proposed system first extracts pertinent features from the natural speech signal using MFCC (static and dynamic features) and FB. Next, the extracted features are padded to deal with the non-uniform sequence lengths. Then, a deep recurrent architecture, either LSTM or GRU (Gated Recurrent Unit), encodes the sequences of MFCC/FB features as a fixed-size vector that is fed to a Multi-Layer Perceptron (MLP) to perform the classification (recognition). The proposed system is assessed on two databases: the first concerns spoken digit recognition, for which a comparison with related work in the literature is performed, and the second contains spoken TV commands. The results obtained show the superiority of the proposed approach.

Continuous multimodal representations suitable for multimodal information retrieval are usually obtained with methods that heavily rely on multimodal autoencoders. In video hyperlinking, a task that aims at retrieving video segments, the state of the art is a variation of two interlocked networks working in opposing directions. These systems provide good multimodal embeddings and are also capable of translating from one representation space to the other. Operating on representation spaces, these networks lack the ability to operate in the original spaces (text or image), which makes it difficult to visualize the crossmodal function, and do not generalize well to unseen data. Recently, generative adversarial networks have gained popularity and have been used for generating realistic synthetic data and for obtaining high-level, single-modal latent representation spaces. In this work, we evaluate the feasibility of using GANs to obtain multimodal representations. We show that GANs can be used for multimodal representation learning and that they provide multimodal representations that are superior to representations obtained with multimodal autoencoders. Additionally, we illustrate the ability to visualize crossmodal translations that can provide human-interpretable insights on learned GAN-based video hyperlinking models.

The paper addresses the reliability of the confidence measures provided by automatic speech recognition systems for use in various spoken language processing applications. In this context, a conditional random field (CRF)-based combination of contextual features is proposed to improve word-level confidence measures. More precisely, the method consists in combining phonetic, lexical, linguistic and semantic features to enhance confidence measures, explicitly exploiting context information. The combination is performed using CRFs whose selected patterns make it possible to establish a precise diagnosis of the value of individual and contextual features. Experiments conducted on the French broadcast news corpus ESTER demonstrate the added value of the proposed CRF-based combination of contextual features, with significant improvement of the normalized cross entropy and of the equal error rate.

HAL (Le Centre pour la Communication Scientifique Directe), Jun 22, 2015
This article describes the participation of the LinkMedia team from IRISA in DeFT 2015. Our team took part in two tasks: valence classification of tweets (task 1) and fine-grained classification, itself broken down into two subtasks, namely detecting the generic classes of the information expressed in a tweet (task 2.1) and classifying the specific classes (task 2.2) of the emotion/sentiment/opinion expressed. For these three tasks, we adopt a machine learning approach. More precisely, we explore the value of three methods: i) boosting of decision trees, ii) Bayesian learning using a technique derived from information retrieval, and iii) convolutional neural networks. Our approaches do not exploit any external resources (lexicons, corpora) and rely solely on the textual content of the tweets. This allows us to assess the value of each of these methods, as well as of the representations they exploit, namely bags of words for the first two and word embeddings for the neural networks.

HAL (Le Centre pour la Communication Scientifique Directe), Mar 22, 2016
Nowadays, wind power and precise forecasting are of great importance for the development of modern electrical grids. In this paper, we propose a prediction system for time series based on Kernel Principal Component Analysis (KPCA) and Extreme Learning Machine (ELM). For comparison with the proposed approach, three alternative input configurations were used: full space (50 variables), part of the space (last four variables) and classical Principal Component Analysis (PCA). These models were compared using three evaluation criteria: mean absolute error (MAE), root mean square error (RMSE), and normalized mean square error (NMSE). The results show that reducing the original input space positively affects the wind speed prediction. Thus, it can be concluded that the nonlinear KPCA model outperforms the other reduction techniques in terms of prediction performance.
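The three evaluation criteria named in this abstract can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and NMSE is assumed here to be the mean squared error divided by the variance of the observed series, which is one common convention among several.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def nmse(y_true, y_pred):
    """MSE normalized by the variance of the observations (assumed convention)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    variance = sum((t - mean_true) ** 2 for t in y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    return mse / variance


# Toy wind-speed values (made up for illustration only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
errors = (mae(observed, predicted), rmse(observed, predicted), nmse(observed, predicted))
```

With this normalization, an NMSE below 1 means the model does better than always predicting the mean of the observed series.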
Two approaches to Spoken Language Understanding based on frames describing chunked knowledge are described. They are applied to the MEDIA corpus, annotated in terms of concepts expressing chunks of spoken sentences. General rules of knowledge composition and inference appear to be adequate for effectively applying the application ontology to obtain frame-based representations of dialogue turns. The main difficulty appears to be the characterization of the syntactic knowledge expressing semantic links between knowledge chunks. This knowledge can be hand-crafted or automatically learned from examples. It is shown that the latter approach outperforms the former when applied to error-prone ASR transcriptions.