Ph.D. Theses


The doctoral dissertation represents the culmination of the entire graduate school experience. It is a snapshot of all that a student has accomplished and learned about their dissertation topic. While we could post these on our publications page, we feel that they deserve a page of their own. Here are the Ph.D. theses from lab members, in reverse chronological order.





Deep Neural Networks for 3D Processing and High-Dimensional Filtering
by Hang Su, May 2020. [pdf]

Abstract:

Deep neural networks (DNNs) have seen tremendous success in the past few years, advancing the state of the art in many AI areas by significant margins. Part of this success can be attributed to the wide adoption of convolutional filters. These filters can effectively capture the invariance in data, leading to faster training and more compact representations, and at the same time can leverage efficient parallel implementations on modern hardware. Since convolution operates on regularly structured grids, it is a particularly good fit for text and images, where there are inherent rigid 1D or 2D structures. However, extending DNNs to 3D or higher-dimensional spaces is non-trivial, because data in such spaces often lack regular structure, and the curse of dimensionality can also adversely impact performance in multiple ways.

In this thesis, we present several new types of neural network operations and architectures for data in 3D and higher-dimensional spaces and demonstrate how we can mitigate these issues while retaining the favorable properties of 2D convolutions. First, we investigate view-based representations for 3D shape recognition. We show that a collection of 2D views can be highly informative, and we can adapt standard 2D DNNs with a simple pooling strategy to recognize objects based on their appearances from multiple viewing angles with unprecedented accuracies. Our next study makes a connection between 3D point cloud processing and sparse high-dimensional filtering. The resulting representation is highly efficient and flexible, and enables native 3D operations as well as joint 2D-3D reasoning. Finally, we show that high-dimensional filtering is also a powerful tool for content-adaptive image filtering. We demonstrate its utility in computer vision applications where preserving sharp details in output is critical, including joint upsampling and semantic segmentation.
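To make the view-based recognition idea concrete, here is a minimal sketch (illustrative Python/PyTorch, not the thesis code) of the pooling strategy: a shared 2D CNN is applied to each rendered view of a shape and the resulting features are max-pooled across views before classification. The backbone, view count, and dimensions are arbitrary choices.

    # Minimal multi-view pooling sketch; backbone and sizes are illustrative.
    import torch
    import torch.nn as nn
    from torchvision import models

    class MultiViewNet(nn.Module):
        def __init__(self, num_classes: int):
            super().__init__()
            backbone = models.resnet18(weights=None)                         # shared 2D CNN for all views
            self.features = nn.Sequential(*list(backbone.children())[:-1])   # drop the final FC layer
            self.classifier = nn.Linear(512, num_classes)

        def forward(self, views: torch.Tensor) -> torch.Tensor:
            # views: (batch, num_views, 3, H, W)
            b, v, c, h, w = views.shape
            x = self.features(views.reshape(b * v, c, h, w))                 # (b*v, 512, 1, 1)
            x = x.reshape(b, v, -1)
            x, _ = x.max(dim=1)                                              # element-wise max over views
            return self.classifier(x)

    shapes = torch.randn(2, 12, 3, 224, 224)    # 2 shapes, 12 rendered views each
    logits = MultiViewNet(num_classes=40)(shapes)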



Improving Visual Recognition with Unlabeled Data
by Aruni RoyChowdhury, May 2020. [pdf]

Abstract:

The success of deep neural networks has resulted in computer vision systems that obtain high accuracy on a wide variety of tasks such as image classification, object detection, semantic segmentation, etc. However, most state-of-the-art vision systems are dependent upon large amounts of labeled training data, which is not a scalable solution in the long run. This work focuses on improving existing models for visual object recognition and detection without being dependent on such large-scale human-annotated data.

We first show how large numbers of hard examples (cases where an existing model makes a mistake) can be obtained automatically from unlabeled video sequences by exploiting temporal consistency cues in the output of a pre-trained object detector. These examples can strongly influence a model's parameters when the network is re-trained to correct them, resulting in improved performance on several object detection tasks. Further, such hard examples from unlabeled videos can be used to address the problem of unsupervised domain adaptation. We focus on the automatic adaptation of an existing object detector to a new domain with no labeled data, assuming that a large number of unlabeled videos are readily available.
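The following toy sketch illustrates the "isolated in time" heuristic on a single tracked detection, under the simplifying assumption that detections have already been linked across frames into a boolean presence sequence; it is not the full pipeline described in the thesis.

    # Toy hard-example mining from temporal consistency (illustrative only).
    def mine_hard_examples(detected):
        """detected[t] is True if the detector fired in frame t for this track."""
        hard_positives, hard_negatives = [], []
        for t in range(1, len(detected) - 1):
            before, now, after = detected[t - 1], detected[t], detected[t + 1]
            if now and not before and not after:
                hard_negatives.append(t)   # isolated firing: likely false positive
            if not now and before and after:
                hard_positives.append(t)   # isolated miss: likely false negative
        return hard_positives, hard_negatives

    print(mine_hard_examples([True, True, False, True, True, False, True, False, False]))
    # -> ([2, 5], [6])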

Finally, we address the problem of face recognition, which has achieved high accuracy by employing deep neural networks trained on massive labeled datasets. Further improvements through supervised learning require significantly larger datasets and hence massive annotation efforts. We improve upon the performance of face recognition models trained on large-scale labeled datasets by using unlabeled faces as additional training data. We present insights and recipes for training deep face recognition models with labeled and unlabeled data at scale, addressing real-world challenges such as overlapping identities between the labeled and unlabeled datasets, as well as label noise introduced by clustering errors.



Motion Segmentation - Segmentation of Independently Moving Objects in Video
by Pia Katalin Bideau, February 2020. [pdf]

Abstract:

The ability to recognize motion is one of the most important functions of our visual system. Motion allows us both to recognize objects and to get a better understanding of the 3D world in which we are moving. Because of its importance, motion is used to answer a wide variety of fundamental questions in computer vision such as: (1) Which objects are moving independently in the world? (2) Which objects are close and which objects are far away? (3) How is the camera moving?
My work addresses the problem of moving object segmentation in unconstrained videos. I developed a probabilistic approach to segment independently moving objects [ArXiv] in a video sequence, connecting aspects of camera motion estimation, relative depth and flow statistics. My work consists of three major parts:

  • Modeling motion with simple (rigid) motion models that strictly follow the principles of perspective projection, and segmenting the video into its motion components by assigning each pixel to its most likely motion model in a Bayesian fashion (a toy sketch of this assignment step follows the list). [ECCV16]
  • Combining piecewise rigid motions into more complex, deformable, and articulated objects, guided by learned semantic object segmentations. [CVPR18]
  • Learning highly variable motion patterns using a neural network trained on (unlimited) synthetic training data. The training data is generated automatically, strictly following the principles of perspective projection, so that well-known geometric constraints are precisely characterized during training and the network learns the principles of motion segmentation rather than merely identifying structures that are likely to move. [ECCV18 workshop]
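As a rough illustration of the Bayesian assignment step referenced in the first bullet, the numpy sketch below assigns each pixel to the candidate motion model that best explains its observed optical flow; the Gaussian flow likelihood and uniform prior are illustrative assumptions rather than the exact statistics used in the thesis.

    # Assign each pixel to its most likely motion model (illustrative sketch).
    import numpy as np

    def assign_motion_labels(observed_flow, predicted_flows, sigma=1.0):
        """observed_flow: (H, W, 2); predicted_flows: (K, H, W, 2) from K rigid motion models."""
        err = observed_flow[None] - predicted_flows                  # (K, H, W, 2)
        log_lik = -np.sum(err ** 2, axis=-1) / (2 * sigma ** 2)      # Gaussian flow log-likelihood
        log_prior = np.log(np.full(len(predicted_flows), 1.0 / len(predicted_flows)))
        log_post = log_lik + log_prior[:, None, None]
        return np.argmax(log_post, axis=0)                           # (H, W) label map

    labels = assign_motion_labels(np.random.randn(4, 5, 2), np.random.randn(3, 4, 5, 2))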

This work shows that a careful analysis of the motion field not only leads to a consistent segmentation of moving objects in a video sequence, but also helps us understand the scene geometry of the world we are moving in.


Higher-Order Representations for Visual Recognition
by Tsung-Yu Lin, February 2020. [pdf]

Abstract:

In this thesis, we present a simple and effective architecture called Bilinear Convolutional Neural Networks (B-CNNs). These networks represent an image as a pooled outer product of features derived from two CNNs and capture localized feature interactions in a translationally invariant manner. B-CNNs generalize classical orderless texture-based image models such as bag-of-visual-words and Fisher vector representations. However, unlike prior work, they can be trained in an end-to-end manner. In our experiments, we demonstrate that these representations generalize well to novel domains through fine-tuning and achieve excellent results on fine-grained, texture, and scene recognition tasks. Visualizations of the fine-tuned convolutional filters show that the models are able to capture highly localized attributes. We present a texture synthesis framework that allows us to visualize the pre-images of fine-grained categories and the invariances that are captured by these models.
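The core bilinear pooling operation can be sketched in a few lines (illustrative PyTorch, not the released implementation): the outer product of two feature maps is averaged over spatial locations and then passed through the signed square-root and L2 normalization commonly used with these representations.

    # Bilinear pooling of two CNN feature maps (illustrative sketch).
    import torch

    def bilinear_pool(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        """feat_a: (B, C1, H, W), feat_b: (B, C2, H, W) from two CNN streams."""
        b, c1, h, w = feat_a.shape
        c2 = feat_b.shape[1]
        a = feat_a.reshape(b, c1, h * w)
        bb = feat_b.reshape(b, c2, h * w)
        x = torch.bmm(a, bb.transpose(1, 2)) / (h * w)        # (B, C1, C2) pooled outer product
        x = x.reshape(b, c1 * c2)
        x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-8)   # signed square-root
        return torch.nn.functional.normalize(x, dim=1)        # L2 normalization

    pooled = bilinear_pool(torch.randn(2, 64, 7, 7), torch.randn(2, 64, 7, 7))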

In order to enhance the discriminative power of the B-CNN representations, we investigate normalization techniques for rescaling the importance of individual features during aggregation. Spectral normalization scales the spectrum of the covariance matrix obtained after bilinear pooling and offers a significant improvement. However, the computation involves a singular value decomposition, which is not efficient on modern GPUs. We present an iteration-based approximation of the matrix square-root, along with its gradients, to speed up the computation, and we study its effect on fine-tuning deep neural networks. Another approach is democratic aggregation, which aims to equalize the contributions of individual feature vectors to the final pooled image descriptor. This achieves a comparable improvement and, unlike spectral normalization, can be approximated in a low-dimensional embedding, making it better suited to aggregating higher-dimensional features. We demonstrate that the two approaches are closely related and discuss their trade-offs between performance and efficiency.
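As a rough sketch of an iteration-based matrix square-root, the Newton-Schulz style routine below avoids the SVD entirely; the pre- and post-scaling by the Frobenius norm and the iteration count are standard but illustrative choices, not necessarily those used in the thesis.

    # Approximate matrix square-root via coupled Newton-Schulz iterations (sketch).
    import torch

    def matrix_sqrt_newton_schulz(A: torch.Tensor, num_iters: int = 10) -> torch.Tensor:
        """A: (n, n) symmetric positive semi-definite matrix."""
        n = A.shape[0]
        norm = torch.linalg.norm(A)              # Frobenius norm, used for pre-scaling
        Y = A / norm
        Z = torch.eye(n, dtype=A.dtype)
        I = torch.eye(n, dtype=A.dtype)
        for _ in range(num_iters):
            T = 0.5 * (3.0 * I - Z @ Y)
            Y, Z = Y @ T, T @ Z
        return Y * torch.sqrt(norm)              # approximate A^{1/2}

    A = torch.randn(8, 8); A = A @ A.T
    S = matrix_sqrt_newton_schulz(A)
    print(torch.dist(S @ S, A))                  # should be small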


Improving Face Clustering in Videos
by SouYoung Jin, February 2020. [pdf]



Abstract:

Human faces not only pose a challenging recognition problem for computer vision, but are also an important source of information about identity, intent, and state of mind. These properties make the analysis of faces important not just as an algorithmic challenge, but as a gateway to developing computer vision methods that can better follow the intent and goals of human beings.

In this thesis, we are interested in face clustering in videos. Given a raw video, with no caption or annotation, we want to group all detected faces by their identity. We address three problems in the area of face clustering and propose approaches to tackle them.

Existing link-based face-clustering systems are sensitive to false connections between two different people. We introduce a new similarity measure that helps the verification system produce very few false connections at moderate recall. We also introduce a novel clustering method called Erdos-Renyi clustering, based on the observation from random graph theory that large clusters can be fully connected by joining just a small fraction of their node pairs. Our approach achieves state-of-the-art results on multiple video data sets as well as on standard face databases.
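A toy sketch of the underlying intuition: if face pairs are linked only at a very conservative verification threshold, large identity clusters can still come out fully connected, so identities can be recovered as connected components. The union-find routine and made-up scores below stand in for the full method.

    # Connected components over high-confidence verification links (illustrative sketch).
    def connected_components(num_faces, scored_pairs, threshold):
        parent = list(range(num_faces))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i, j, score in scored_pairs:
            if score >= threshold:               # keep only high-confidence links
                parent[find(i)] = find(j)
        return [find(i) for i in range(num_faces)]

    pairs = [(0, 1, 0.97), (1, 2, 0.95), (3, 4, 0.99), (0, 3, 0.40)]
    print(connected_components(5, pairs, threshold=0.9))   # two clusters: {0, 1, 2} and {3, 4}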

What happens if faces are not sufficiently clear for direct recognition, due to small scale, occlusion, or extreme pose? We observe that when humans are uncertain about the identity of two faces, they use clothing or other contextual cues, e.g. specific objects or textures, to infer identity. Motivated by this observation, we propose the Face-Background Network (FB-Net), which takes as input not only the faces but also the entire scene to enhance the performance of face clustering. In order for the network to learn background features that are informative about identity, we introduce a new dataset that contains face identities in the context of consistent scenes. We show that, on the task of video face clustering, FB-Net outperforms the state-of-the-art method, which uses only face-level features.

The performance of face clustering depends on a good face detector, but improving a face detector requires expensive labeling of faces. In this work, we propose an approach to reduce the mistakes of an existing face detector by using many hours of freely available unlabeled videos on the web. Based on the observation that false positives and false negatives are often isolated in time, we demonstrate a method to mine hard examples automatically using temporal continuity in videos: we analyze the output of a trained detector on video sequences and mine detections that are isolated in time, which are likely to be hard examples. Our experiments show that re-training detectors on these automatically obtained examples often significantly improves performance. We present experiments on multiple architectures and multiple data sets, including face detection, pedestrian detection, and other object categories.



Integration of Robotic Perception, Action, and Memory
by Li Yang Ku, May 2018. [pdf]



Abstract:

In the book “On Intelligence”, Hawkins states that intelligence should be measured by the capacity to memorize and predict patterns. I further suggest that the ability to predict action consequences based on perception and memory is essential for robots to demonstrate intelligent behaviors in unstructured environments. However, traditional approaches generally treat perception and action separately: computer vision modules recognize objects, and planners execute actions based on the resulting labels and poses.

I propose here a more integrated approach where action and perception are combined in a memory model, in which a sequence of actions can be planned based on predicted action outcomes. In this framework, hierarchical visual features based on convolutional neural networks are introduced to capture the essential affordances. These features in different hierarchies are associated with robot controllers of corresponding kinematic subchains to support manipulation. Through learning from demonstration, both actions and informative features in the memory model can be learned efficiently. As more demonstrations are recorded and more interactions are observed, the robot becomes more capable of predicting the consequences of actions, and is thus better at planning sequences of actions to solve tasks under different circumstances.


Incorporating Boltzmann Priors for Semantic Labeling in Images and Videos
by Andrew Kae, May 2014. [pdf]



Abstract:

Semantic labeling is the task of assigning category labels to regions in an image. For example, a scene may consist of regions corresponding to categories such as sky, water, and ground, or parts of a face such as eyes, nose, and mouth. Semantic labeling is an important mid-level vision task for grouping and organizing image regions into coherent parts. Labeling these regions allows us to better understand the scene itself as well as properties of the objects in the scene, such as their parts, location, and interaction within the scene. Typical approaches for this task include the conditional random field (CRF), which is well-suited to modeling local interactions among adjacent image regions. However, the CRF is limited in dealing with complex, global (long-range) interactions between regions in an image, and between frames in a video. This thesis presents approaches to modeling long-range interactions within images and videos, for use in semantic labeling.

In order to model these long-range interactions, we incorporate priors based on the restricted Boltzmann machine (RBM). The RBM is a generative model which has demonstrated the ability to learn the shape of an object and the CRBM is a temporal extension which can learn the motion of an object. Although the CRF is a good baseline labeler, we show how the RBM and CRBM can be added to the architecture to model both the global object shape within an image and the temporal dependencies of the object from previous frames in a video. We demonstrate the labeling performance of our models for the parts of complex face images from the Labeled Faces in the Wild database (for images) and the YouTube Faces Database (for videos). Our hybrid models produce results that are both quantitatively and qualitatively better than the baseline CRF alone for both images and videos.
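Schematically, and not in the exact form used in the thesis, the combined model can be thought of as adding the free energy of an RBM over the label field y to the usual CRF energy over labels y given image features x (in LaTeX notation):

    E(\mathbf{y}, \mathbf{x}) =
        \underbrace{\sum_i \phi_i(y_i, \mathbf{x}) + \sum_{(i,j)} \psi_{ij}(y_i, y_j, \mathbf{x})}_{\text{CRF unary and pairwise terms}}
        \; \underbrace{- \, \mathbf{c}^{\top}\mathbf{y} - \sum_k \log\bigl(1 + \exp(\mathbf{w}_k^{\top}\mathbf{y} + b_k)\bigr)}_{\text{RBM shape prior (free energy)}}

where w_k, b_k, and c are RBM parameters. Low-energy label configurations are then those that both agree with the local image evidence and look like plausible global object shapes under the RBM.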


Unsupervised Joint Alignment, Clustering and Feature Learning
by Marwan Mattar, May 2014. [pdf]

Abstract:

Joint alignment is the process of transforming instances in a data set to make them more similar based on a pre-defined measure of joint similarity. This process has great utility and applicability in many scientific disciplines including radiology, psychology, linguistics, vision, and biology. Most alignment algorithms suffer from two shortcomings. First, they typically fail when presented with complex data sets arising from multiple modalities such as a data set of normal and abnormal heart signals. Second, they require hand-picking appropriate feature representations for each data set, which may be time-consuming and ineffective, or outside the domain of expertise for practitioners.

In this thesis we introduce alignment models that address both shortcomings. In the first part, we present an efficient curve alignment algorithm derived from the congealing framework that is effective on many synthetic and real data sets. We show that using the byproducts of joint alignment, the aligned data and transformation parameters, can dramatically improve classification performance. In the second part, we incorporate unsupervised feature learning based on convolutional restricted Boltzmann machines to learn a representation that is tuned to the statistics of the data set. We show how these features can be used to improve both the alignment quality and classification performance. In the third part, we present a nonparametric Bayesian joint alignment and clustering model which handles data sets arising from multiple modes. We apply this model to synthetic, curve and image data sets and show that by simultaneously aligning and clustering, it can perform significantly better than performing these operations sequentially. It also has the added advantage that it easily lends itself to semi-supervised, online, and distributed implementations.
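One standard way to write the congealing criterion that this work builds on (in LaTeX notation; the thesis extends it to curves and augments it with clustering and feature learning) is

    \min_{\tau_1, \ldots, \tau_N} \; \sum_{p} H\bigl(\{ (x_n \circ \tau_n)(p) \}_{n=1}^{N}\bigr),

where x_1, ..., x_N are the data instances, \tau_n is the transformation applied to instance n, and H is the empirical entropy of the values observed at coordinate p across the transformed instances.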

Overall this thesis takes steps towards developing an unsupervised data processing pipeline that includes alignment, clustering and feature learning. While clustering and feature learning serve as auxiliary information to improve alignment, they are important byproducts. Furthermore, we present a software implementation of all the models described in this thesis. This will enable practitioners from different scientific disciplines to utilize our work, as well as encourage contributions and extensions, and promote reproducible research.


Improving Text Recognition in Images of Natural Scenes
by Jacqueline Feild, February 2014. [pdf]

Abstract:

The area of scene text recognition focuses on the problem of recognizing arbitrary text in images of natural scenes. Examples of scene text include street signs, business signs, grocery item labels, and license plates. With the increased use of smartphones and digital cameras, the ability to accurately recognize text in images is becoming increasingly useful and many people will benefit from advances in this area.

The goal of this thesis is to develop methods for improving scene text recognition. We do this by incorporating new types of information into models and by exploring how to compose simple components into highly effective systems. We focus on three areas of scene text recognition, each with a decreasing number of prior assumptions. First, we introduce two techniques for character recognition, where word and character bounding boxes are assumed. We describe a character recognition system that incorporates similarity information in a novel way and a new language model that models syllables in a word to produce word labels that can be pronounced in English. Next we look at word recognition, where only word bounding boxes are assumed. We develop a new technique for segmenting text in these images called bilateral regression segmentation, and we introduce an open-vocabulary word recognition system that uses a very large web-based lexicon to achieve state-of-the-art recognition performance. Lastly, we remove the assumption that words have been located and describe an end-to-end system that detects and recognizes text in any natural scene image.


Probabilistic Models for Motion Segmentation in Image Sequences
by Manjunath Narayana, February 2014.

Abstract:

Motion segmentation is the task of assigning a binary label to every pixel in an image sequence, specifying whether it belongs to a moving foreground object or to the stationary background. It is an important task in many computer vision applications such as automatic surveillance and tracking systems. Depending on whether the camera is stationary or moving, different approaches to segmentation are possible. Motion segmentation with a stationary camera is a well-studied problem with many effective algorithms and systems in use today. In contrast, segmentation with a moving camera is much more complex. In this thesis, we make contributions to the problem of motion segmentation in both camera settings.

First, for the stationary camera case, we develop a probabilistic model that intuitively combines the various aspects of the problem in a system that is easy to interpret and extend. In most stationary camera systems, a distribution over feature values for the background at each pixel location is learned from previous frames in the sequence and used for classification in the current frame. These pixelwise models fail to account for the influence of neighboring pixels on each other. We propose a model that, by spatially spreading the information in the pixelwise distributions, better reflects the spatial influence between pixels. Further, we show that existing algorithms that use a constant variance value for the distributions at every pixel location are inaccurate, and we present an alternate pixelwise adaptive variance method. These improvements result in a system that outperforms all existing algorithms on a standard benchmark.

Compared to stationary camera videos, moving camera videos have fewer established solutions for motion segmentation. One of the contributions of this thesis is the development of a viable segmentation method that is effective on a wide range of videos and robust to complex background settings. In moving camera videos, motion segmentation is commonly performed using the image-plane motion of pixels, or optical flow. However, objects at different depths from the camera can exhibit different optical flows even if they share the same real-world motion, which can cause a depth-dependent segmentation of the scene. While such a segmentation is meaningful, it is ineffective for the purpose of identifying independently moving objects. Our goal is to develop a segmentation algorithm that clusters pixels that have similar real-world motion. Our solution uses optical flow orientations instead of the complete vectors and exploits the well-known property that under translational camera motion, optical flow orientations are independent of object depth. We introduce a non-parametric probabilistic model that automatically estimates the number of observed independent motions and produces a labeling that is consistent with real-world motion in the scene. Most importantly, static objects are correctly identified as one segment even if they are at different depths. Finally, we propose a rotation compensation algorithm that can be applied to real-world videos taken with hand-held cameras. We benchmark the system on over thirty videos from multiple data sets, taken in challenging scenarios. Our system is particularly robust on complex background scenes containing objects at significantly different depths.
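The depth-independence property exploited here follows directly from the perspective-projection flow equations. For a purely translating camera with velocity (T_x, T_y, T_z), focal length f, and a static point at depth Z projecting to pixel (x, y), the flow is (up to sign conventions)

    u = \frac{x T_z - f T_x}{Z}, \qquad v = \frac{y T_z - f T_y}{Z},

so the orientation atan2(v, u) depends only on the pixel position and the camera translation, while only the magnitude scales with 1/Z.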


Weakly Supervised Learning for Unconstrained Face Processing
by Gary B. Huang, May 2012. [pdf]



Abstract:

Machine face recognition has traditionally been studied under the assumption of a carefully controlled image acquisition process. By controlling image acquisition, variation due to factors such as pose, lighting, and background can be either largely eliminated or specifically limited to a study over a discrete number of possibilities. Applications of face recognition have had mixed success when deployed in conditions where the assumption of controlled image acquisition no longer holds. This dissertation focuses on this unconstrained face recognition problem, where face images exhibit the same amount of variability that one would encounter in everyday life.

We formalize unconstrained face recognition as a binary pair matching problem (verification), and present a data set for benchmarking performance on the unconstrained face verification task. We observe that it is comparatively much easier to obtain many examples of unlabeled face images than face images that have been labeled with identity or other higher level information, such as the position of the eyes and other facial features. We thus focus on improving unconstrained face verification by leveraging the information present in this source of weakly supervised data.

We first show how unlabeled face images can be used to perform unsupervised face alignment, thereby reducing variability in pose and improving verification accuracy. Next, we demonstrate how deep learning can be used to perform unsupervised feature discovery, providing additional image representations that can be combined with representations from standard hand-crafted image descriptors, to further improve recognition performance. Finally, we combine unsupervised feature learning with joint face alignment, leading to an unsupervised alignment system that achieves gains in recognition performance matching that achieved by supervised alignment.


Using Context to Enhance the Understanding of Face Images
by Vidit Jain, September 2010. [pdf]



Abstract:

Faces are special objects of interest. Developing automated systems for detecting and recognizing faces is useful in a variety of application domains including providing aid to visually-impaired people and managing large-scale collections of images. Humans have a remarkable ability to detect and identify faces in an image, but related automated systems perform poorly in real-world scenarios, particularly on faces that are difficult to detect and recognize. Why are humans so good? There is general agreement in the cognitive science community that the human brain uses the context of the scene shown in an image to solve the difficult cases of detection and recognition. This dissertation focuses on emulating this approach by using different kinds of contextual information for improving the performance of various approaches for face detection and face recognition.

For the face detection problem, we describe an algorithm that employs the easy-to-detect faces in an image to find the difficult-to-detect faces in the same image. For the face recognition problem, we present a joint probabilistic model for image-caption pairs. This model solves the difficult cases of face recognition in an image by using the context generated from the caption associated with the same image. Finally, we present an effective solution for classifying the scene shown in an image, which provides useful context for both the face detection and recognition problems.


Unified Detection and Recognition for Reading Text in Scene Images
by Jerod Weinman, May 2008. [pdf]



Abstract:

Although an automated reader for the blind first appeared nearly two hundred years ago, computers can currently “read” document text about as well as a seven-year-old. Scene text recognition brings many new challenges. A central limitation of current approaches is a feed-forward, bottom-up, pipelined architecture that isolates the many tasks and sources of information involved in reading. The result is a system that commits errors from which it cannot recover and has components that lack access to relevant information.

We propose a system for scene text reading that in its design, training, and operation is more integrated. First, we present a simple contextual model for text detection that is ignorant of any recognition. Through the use of special features and data context, this model performs well on the detection task, but limitations remain due to the lack of interpretation. We then introduce a recognition model that integrates several information sources, including font consistency and a lexicon, and compare it to approaches using pipelined architectures with similar information. Next we examine a more unified detection and recognition framework where features are selected based on the joint task of detection and recognition, rather than each task individually. This approach yields better results with fewer features. Finally, we demonstrate a model that incorporates segmentation and recognition at both the character and word levels. Text with difficult layouts and low resolution is more accurately recognized by this integrated approach. By more tightly coupling several aspects of detection and recognition, we hope to establish a new unified way of approaching the problem that will lead to improved performance. We would like computers to become accomplished grammar-school level readers.


Image Classification with Bags of Local Features
by Dima Lisin, May 2006. [pdf]



Abstract:

Many classification techniques expect class instances to be represented as feature vectors, i.e. points in a feature space. In computer vision classification problems, it is often possible to generate an informative feature vector representation of an image, for example using global texture or shape descriptors. However, in other cases, it may be beneficial to treat images as variable size unordered sets or bags of features, in which each feature represents a localized salient image structure or patch. These local features do not require a segmentation, and can be useful for object recognition in the presence of occlusion and clutter.

The local features are often used to find point correspondences between images to be later used for 3D reconstruction, object recognition, detection, or image retrieval. However, there are many cases when exact correspondences are difficult or even impossible to compute. Furthermore, point correspondences may not be necessary, unless one is interested in recovering the 3D shape of an object. If the correspondences are not computed, then this representation indeed constitutes an unordered set of local features.

In this dissertation we present methods for object class recognition using bags of features without relying on point correspondences. We also show that using bags of features together with a more traditional feature vector representation of images can improve classification accuracy, and we propose and evaluate several methods of combining the two representations. The proposed techniques are applied to a challenging marine science domain.
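As one generic illustration of combining the two representations (not necessarily the schemes evaluated in the dissertation), the late-fusion sketch below trains one classifier on a global feature vector and another on a bag-of-features histogram, then averages their predicted probabilities; the data and dimensions are synthetic.

    # Late fusion of a global feature vector and a bag-of-features histogram (sketch).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, k = 200, 50
    global_feats = rng.normal(size=(n, 16))               # e.g. a global texture/shape descriptor
    bow_hists = rng.random(size=(n, k))                   # histogram over k visual words
    bow_hists /= bow_hists.sum(axis=1, keepdims=True)
    labels = rng.integers(0, 2, size=n)

    clf_global = LogisticRegression(max_iter=1000).fit(global_feats, labels)
    clf_bow = LogisticRegression(max_iter=1000).fit(bow_hists, labels)

    # Average the per-class probabilities from the two classifiers.
    probs = 0.5 * clf_global.predict_proba(global_feats) + 0.5 * clf_bow.predict_proba(bow_hists)
    fused_prediction = probs.argmax(axis=1)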