2012 IEEE 12th International Conference on Data Mining, 2012
Kernel clustering algorithms have the ability to capture the non-linear structure inherent in many real world data sets and thereby, achieve better clustering performance than Euclidean distance based clustering algorithms. However, their quadratic computational complexity renders them nonscalable to large data sets. In this paper, we employ random Fourier maps, originally proposed for large scale classification, to accelerate kernel clustering. The key idea behind the use of random Fourier maps for clustering is to project the data into a low-dimensional space where the inner product of the transformed data points approximates the kernel similarity between them. An efficient linear clustering algorithm can then be applied to the points in the transformed space. We also propose an improved scheme which uses the top singular vectors of the transformed data matrix to perform clustering, and yields a better approximation of kernel clustering under appropriate conditions. Our empirical studies demonstrate that the proposed schemes can be efficiently applied to large data sets containing millions of data points, while achieving accuracy similar to that achieved by state-of-the-art kernel clustering algorithms.
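The projection idea in this abstract can be illustrated with a minimal sketch of a random Fourier map for the Gaussian (RBF) kernel. This is not the authors' code: the function names (`make_fourier_map`, `rbf_kernel`, `dot`), the choice of the RBF kernel, and all parameter values are assumptions made for illustration. The sketch only shows that inner products of the mapped points approximate the kernel value; a linear algorithm such as k-means would then be run on the mapped points.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def make_fourier_map(dim, n_features, gamma, seed=0):
    """Return a function z(x) whose inner products approximate the RBF kernel.

    Frequencies are drawn from N(0, 2 * gamma * I), the Fourier transform of
    exp(-gamma * ||delta||^2); phases are uniform on [0, 2 * pi).
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)
    freqs = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def z(x):
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(freqs, phases)
        ]

    return z


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 2-D points; compare the approximate and exact similarities.
z = make_fourier_map(dim=2, n_features=4000, gamma=0.5, seed=0)
x, y = [0.0, 0.0], [0.5, -0.3]
print("approximate:", dot(z(x), z(y)))
print("exact:      ", rbf_kernel(x, y, 0.5))
```

The approximation error shrinks roughly as one over the square root of the number of random features, so the feature count trades accuracy against the cost of the subsequent linear clustering step.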
Digital data explosion mandates the development of scalable tools to organize the data in a meaningful and easily accessible form. Clustering is a commonly used tool for data organization. However, many clustering algorithms designed to handle large data sets assume linear separability of data and hence do not perform well on real world data sets. While kernel-based clustering algorithms can capture the non-linear structure in data, they do not scale well in terms of speed and memory requirements when the number of objects to be clustered exceeds tens of thousands. We propose an approximation scheme for kernel k-means, termed approximate kernel k-means, that reduces both the computational complexity and the memory requirements by employing a randomized approach. We show both analytically and empirically that the performance of approximate kernel k-means is similar to that of the kernel k-means algorithm, but with dramatically reduced run-time complexity and memory requirements.
The ubiquity of personal computing technology has produced an abundance of staggeringly large data sets: the Library of Congress has stored over 160 terabytes of web data, and it is estimated that Facebook alone logs over 25 terabytes of data per day. There is a great need for systems by which one can elucidate the similarity and dissimilarity among and between groups in these data sets. Clustering is one way to find these groups. In this paper, we propose an approximation method for the fuzzy and possibilistic kernel c-means clustering algorithms. Our approximation constrains the cluster centers to be linear combinations of a size-m randomly selected subset of the n input objects, where m ≪ n. The proposed algorithm only requires an m × n rectangular portion of the full n × n kernel matrix and the n diagonal values, resulting in significant memory savings. Furthermore, the computational complexity of the c-means algorithm is substantially reduced. We demonstrate that up...
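To give a sense of the memory savings described above, the following back-of-the-envelope sketch compares storing the full n × n kernel matrix with storing only an m × n block plus the n diagonal values. The values of n and m and the 8-byte entry size are hypothetical and are not taken from the paper.

```python
def kernel_memory_bytes(n, m, bytes_per_entry=8):
    """Return (full, approximate) storage in bytes.

    full:        the complete n x n kernel matrix
    approximate: an m x n rectangular block plus the n diagonal values
    """
    full = n * n * bytes_per_entry
    approximate = (m * n + n) * bytes_per_entry
    return full, approximate


# Hypothetical sizes: one million objects, one thousand sampled objects.
full, approximate = kernel_memory_bytes(n=1_000_000, m=1_000)
print(f"full: {full / 1e12:.1f} TB, approximate: {approximate / 1e9:.2f} GB")
# prints "full: 8.0 TB, approximate: 8.01 GB"
```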
Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management, 2015
In clustering applications involving documents and images, in addition to the large number of data points (N) and their high dimensionality (d), the number of clusters (C) into which the data need to be partitioned is also large. Kernel-based clustering algorithms, which have been shown to perform better than linear clustering algorithms, have high running time complexity in terms of N, d and C. We propose an efficient sparse kernel k-means clustering algorithm, which incrementally samples the most informative points from the data set using importance sampling, and constructs a sparse kernel matrix using these sampled points. Each row in this matrix corresponds to a data point's similarity with its p-nearest neighbors among the sampled points (p ≪ N). This sparse kernel matrix is used to perform clustering and obtain the cluster labels. This combination of sampling and sparsity reduces both the running time and memory complexity of kernel clustering. In order to further enhance...
Stream clustering methods, which group continuous, temporally ordered dynamic data instances, have been used in a number of applications such as stock market analysis, network analysis, and cosmological analysis. Most of the popular stream clustering algorithms are linear in nature, i.e. they assume that the data is linearly separable in the input space and use measures such as the Euclidean distance to define the inter-point similarity. Though these linear clustering algorithms are efficient, they do not achieve acceptable cluster quality on real-world data. Kernel-based clustering algorithms, which use non-linear similarity measures, yield better cluster quality, but are unsuitable for clustering data streams due to their high running time and memory complexity. We propose an efficient kernel-based clustering algorithm, called the Approximate Stream Kernel k-means, which uses importance sampling to sample a subset of the data stream, and clusters the entire stream based on each dat...
KERNEL-BASED CLUSTERING OF BIG DATA. By Radha Chitta. There has been a rapid increase in the volume of digital data over recent years. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (1 zettabyte = 10^21 bytes) of digital data by the year 2020. Analysis of these massive amounts of data, popularly known as big data, necessitates highly scalable data analysis techniques. Clustering is an exploratory data analysis tool used to discover the underlying groups in the data. The state-of-the-art algorithms for clustering big data sets are linear clustering algorithms, which assume that the data is linearly separable in the input space, and use measures such as the Euclidean distance to define the inter-point similarities. Though efficient, linear clustering algorithms do not achieve high cluster quality on real-world data sets, which are not linearly separable. Kernel-based clustering algorithms employ non-linear similarity measures to define the inter-point similarities. As ...
Question answering is an important and difficult task in the natural language processing domain, because many basic natural language processing tasks can be cast into a question answering task. Several deep neural network architectures have been developed recently, which employ memory and inference components to memorize and reason over text information, and generate answers to questions. However, a major drawback of many such models is that they are capable of only generating single-word answers. In addition, they require a large amount of training data to generate accurate answers. In this paper, we introduce the Long-Term Memory Network (LTMN), which incorporates both an external memory module and a Long Short-Term Memory (LSTM) module to comprehend the input data and generate multi-word answers. The LTMN model can be trained end-to-end using back-propagation and requires minimal supervision. We test our model on two synthetic data sets (based on Facebook's bAbI data set) and t...
Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law
The problem of answering multiple choice questions based on the content of documents has been studied extensively in the machine learning literature. We pose the due diligence problem, where lawyers study legal contracts and assess the risk in potential mergers and acquisitions, as a multiple choice question answering problem, based on the text of the contract. Existing frameworks for question answering are not suitable for this task, due to the inherent scarcity and imbalance in the legal contract data available for training. We propose a question answering system which first identifies the excerpt in the contract which potentially contains the answer to a given question, and then builds a multi-class classifier to choose the answer to the question, based on the content of this excerpt. Unlike existing question answering systems, the proposed system explicitly handles the imbalance in the data by generating synthetic instances of the minority answer categories, using the Synthetic Minority Oversampling Technique. This ensures that the numbers of instances in all the classes are roughly equal, thus leading to more accurate and reliable classification. We demonstrate that the proposed question answering system outperforms the existing systems with a minimal amount of training data.
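The oversampling step mentioned above can be sketched as follows. This is a simplified illustration of the core idea of the Synthetic Minority Oversampling Technique (interpolating between a minority instance and one of its nearest minority neighbors), not the system described in the paper. The function name `smote`, the parameters, and the toy feature vectors are hypothetical; real contract excerpts would first have to be turned into numeric features.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbors.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(minority, n_new=3, k=2):
    print(point)
```

Because every synthetic point is a convex combination of two existing minority points, the new instances stay inside the region already occupied by the minority class rather than duplicating existing rows.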
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '17, 2017
Predicting the future health information of patients from the historical Electronic Health Records (EHR) is a core research task in the development of personalized healthcare. Patient EHR data consist of sequences of visits over time, where each visit contains multiple medical codes, including diagnosis, medication, and procedure codes. The most important challenges for this task are to model the temporality and high dimensionality of sequential EHR data and to interpret the prediction results. Existing work solves this problem by employing recurrent neural networks (RNNs) to model EHR data and utilizing simple attention mechanism to interpret the results. However, RNN-based approaches suffer from the problem that the performance of RNNs drops when the length of sequences is large, and the relationships between subsequent visits are ignored by current RNN-based approaches. To address these issues, we propose Dipole, an end-to-end, simple and robust model for predicting patients' future health information. Dipole employs bidirectional recurrent neural networks to remember all the information of both the past visits and the future visits, and it introduces three attention mechanisms to measure the relationships of different visits for the prediction. With the attention mechanisms, Dipole can interpret the prediction results effectively. Dipole also allows us to interpret the learned medical code representations which are confirmed positively by medical experts. Experimental results on two real world EHR datasets show that the proposed Dipole can significantly improve the prediction accuracy compared with the state-of-the-art diagnosis prediction approaches and provide clinically meaningful interpretation. * This work was mostly done when the first author was an intern at Xerox. † This work was done when the fifth author was part of Xerox.
Kernel-based clustering algorithms have the ability to capture the non-linear structure in real world data. Among various kernel-based clustering algorithms, kernel k-means has gained popularity due to its simple iterative nature and ease of implementation. However, its run-time complexity and memory footprint increase quadratically in terms of the size of the data set, and hence, large data sets cannot be clustered efficiently. In this paper, we propose an approximation scheme based on randomization, called the Approximate Kernel k-means. We approximate the cluster centers using the kernel similarity between a few sampled points and all the points in the data set. We show that the proposed method achieves better clustering performance than the traditional low rank kernel approximation based clustering schemes. We also demonstrate that its running time and memory requirements are significantly lower than those of kernel k-means, with only a small reduction in the clustering quality ...
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audiovisual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.
Proceedings of the 27th ACM International Conference on Information and Knowledge Management
The goal of the diagnosis prediction task is to predict the future health information of patients from their historical Electronic Healthcare Records (EHR). The most important and challenging problem of diagnosis prediction is to design an accurate, robust and interpretable predictive model. Existing work solves this problem by employing recurrent neural networks (RNNs) with attention mechanisms, but these approaches suffer from the data sufficiency problem. To obtain good performance with insufficient data, graph-based attention models are proposed. However, when the training data are sufficient, they do not offer any improvement in performance compared with ordinary attention-based models. To address these issues, we propose KAME, an end-to-end, accurate and robust model for predicting patients' future health information. KAME not only learns reasonable embeddings for nodes in the knowledge graph, but also exploits general knowledge to improve the prediction accuracy with the proposed knowledge attention mechanism. With the learned attention weights, KAME allows us to interpret the importance of each piece of knowledge in the graph. Experimental results on three real world datasets show that the proposed KAME significantly improves the prediction performance compared with the state-of-the-art approaches, guarantees the robustness with both sufficient and insufficient data, and learns interpretable disease representations.
Background: Diagnosis prediction aims to predict the future health status of patients according to their historical electronic health records (EHR), which is an important yet challenging task in healthcare informatics. Existing diagnosis prediction approaches mainly employ recurrent neural networks (RNN) with attention mechanisms to make predictions. However, these approaches ignore the importance of code descriptions, i.e., the medical definitions of diagnosis codes. We believe that taking diagnosis code descriptions into account can help the state-of-the-art models not only to learn meaningful code representations, but also to improve the predictive performance, especially when the EHR data are insufficient. Methods: We propose a simple, but general diagnosis prediction framework, which includes two basic components: diagnosis code embedding and predictive model. To learn the interpretable code embeddings, we apply convolutional neural networks (CNN) to model medical descriptions of diagnosis cod...
Partitional clustering algorithms, which partition the dataset into a pre-defined number of clusters, can be broadly classified into two types: algorithms which explicitly take the number of clusters as input and algorithms that take the expected size of a cluster as input. In this paper, we propose a variant of the k-means algorithm and prove that it is more efficient than standard k-means algorithms. An important contribution of this paper is the establishment of a relation between the number of clusters and the size of the clusters in a dataset through the analysis of our algorithm. We also demonstrate that the integration of this algorithm as a pre-processing step in classification algorithms reduces their running-time complexity.
Kernel-based clustering algorithms have the ability to capture the non-linear structure in real world data. Among various kernel-based clustering algorithms, kernel k-means has gained popularity due to its simple iterative nature and ease of implementation. However, its run-time complexity and memory footprint increase quadratically in terms of the size of the data set, and hence, large data sets cannot be clustered efficiently. In this paper, we propose an approximation scheme based on randomization, called the Approximate Kernel k-means. We approximate the cluster centers using the kernel similarity between a few sampled points and all the points in the data set. We show that the proposed method achieves better clustering performance than the traditional low rank kernel approximation based clustering schemes. We also demonstrate that its running time and memory requirements are significantly lower than those of kernel k-means, with only a small reduction in the clustering quality on several public domain large data sets. We then employ ensemble clustering techniques to further enhance the performance of our algorithm.
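The quadratic cost mentioned in this abstract is visible in a plain implementation of standard kernel k-means, sketched below. This is the exact baseline, not the approximate scheme the paper proposes: it builds the full n × n kernel matrix, which is the bottleneck the approximation is meant to avoid. The RBF kernel, the function names, and the toy data are assumptions made for illustration.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_kmeans(points, labels, n_clusters, kernel=rbf, n_iter=10):
    """Standard kernel k-means starting from the given initial labels."""
    n = len(points)
    # Full kernel matrix: O(n^2) time and memory.
    K = [[kernel(p, q) for q in points] for p in points]
    labels = list(labels)
    for _ in range(n_iter):
        members = [[j for j in range(n) if labels[j] == c] for c in range(n_clusters)]
        # Squared norm of each cluster center in feature space.
        within = [
            sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
            for m in members
        ]
        new_labels = []
        for i in range(n):
            best, best_dist = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue  # skip empty clusters
                # ||phi(x_i) - center_c||^2 expressed through kernel values only.
                dist = K[i][i] - 2.0 * sum(K[i][j] for j in m) / len(m) + within[c]
                if dist < best_dist:
                    best, best_dist = c, dist
            new_labels.append(best)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


# Hypothetical data: two tight groups, with two points initially mislabelled.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
print(kernel_kmeans(points, labels=[0, 0, 1, 1, 1, 0], n_clusters=2))
# prints [0, 0, 0, 1, 1, 1]
```

Every distance evaluation reads a row of K, so both the storage and the per-iteration work grow with the square of the data set size.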
Papers by Radha Chitta