Machine Learning Discriminates Classes, Detects Clusters in High-Dimensional Data Domains
David Miller, Professor of Electrical Engineering
Robust Machine Intelligence and Control Lab
Department of Electrical Engineering
Consider an uncatalogued database of one hundred thousand text documents. It would take substantial human resources to organize such a database ``by topic'', i.e., to identify the number of topics present, to assign documents to topics, and to identify the keywords that define each topic. Since each document may be represented by ~20,000 word (presence, absence) ``features'', this machine learning task is a formidable, high-dimensional, unsupervised clustering problem. Likewise, consider early prognostication of Alzheimer's disease, based on a baseline 3-D brain scan together with cognitive, clinical, and genetic data. With hundreds of thousands or even millions of voxels (as well as genetic loci), this is a very high-dimensional supervised learning task.
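As a rough illustration of the word (presence, absence) representation mentioned in the document-clustering example, the following minimal Python sketch turns a few toy documents into 0/1 feature vectors. The function names, the whitespace tokenization, and the three sample sentences are assumptions made only for this illustration; they are not taken from Dr. Miller's work, and a real corpus would yield a vocabulary of tens of thousands of words rather than ten.

```python
def build_vocabulary(documents):
    """Collect the sorted set of distinct lowercase words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def presence_features(document, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the document, else 0."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


# Toy corpus (hypothetical); each document becomes one binary feature vector.
docs = [
    "neural networks learn features",
    "brain scans show atrophy",
    "networks of brain regions",
]

vocab = build_vocabulary(docs)
features = [presence_features(doc, vocab) for doc in docs]

for doc, vec in zip(docs, features):
    print(vec, doc)
```

Each vector has one entry per vocabulary word, so the dimensionality of the clustering problem equals the vocabulary size, and most entries are zero for any single document.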
Dr. David J. Miller's research centers on these unsupervised and supervised machine learning problems. It addresses fundamental questions such as how to accurately estimate the number of clusters and the cluster-specific features in very high-dimensional domains, how to train supervised classifiers when labeled examples are scarce (semisupervised learning), how to achieve reliable aggregate decision-making from unreliable and possibly malicious individual voters/``experts'', and how to discover the presence of latent, unknown classes in observed data sets. His work has applications in text document modeling, medical image processing (brain and knee), bioinformatics (identifying gene-environment interactions that confer either increased risk of or protection against disease), computer network intrusion detection, crowdsourcing, and anomaly detection in general. Dr. Miller's research has been funded by NSF, NIH, AFRL, ONR, and NASA, with current funding from NSF and AFRL.
Related publications include:
Y. Aksu, D.J. Miller, G. Kesidis, D. Bigler, Q. Yang, ``An MRI-derived definition of MCI-to-AD conversion for long-term, automatic prognosis of MCI patients'', PLoS One, 2011.
D.J. Miller et al., ``An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions'', Bioinformatics, 2009.
M.W. Graham and D.J. Miller, ``Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection'', IEEE Trans. on Signal Processing, 2006.
D.J. Miller and J. Browning, ``A mixture model and EM-based algorithm for class discovery, robust classification, and outlier rejection in mixed labeled/unlabeled data sets'', IEEE Trans. on Pattern Analysis and Machine Intelligence, Nov. 2003, pp. 1468-1483.