Final exam: Shared and Distributed Memory Parallel Algorithms to Solve Big Data Problems in Biological, Social Network and Spatial Domain Applications

Date: 
November 28, 2016 - 9:15am
Location: 
2520C UCC

PhD Candidate: Rahil Sharma

Abstract

Big data refers to information which cannot be processed and analyzed using traditional approaches and tools, due to 4 V's—sheer Volume, Velocity at which data is received and processed, and data Variety and Veracity. Today massive volumes of data originate in domains such as geospatial analysis, biological and social networks, etc. Hence, scalable algorithms for efficient processing of this massive data is a significant challenge in the field of computer science. One way to achieve such efficient and scalable algorithms is by using shared and distributed memory parallel programming models. In this thesis, we present a variety of such algorithms to solve problems in various above mentioned domains. We solve five problems that fall into two categories.

The first group of problems deals with the issue of community detection. Detecting communities in real world networks is of great importance because they consist of patterns that can be viewed as independent components, each of which has distinct features and can be detected based upon network structure. For example, communities in social networks can help target users for marketing purposes, provide user recommendations to connect with and join communities or forums, etc. We develop a novel sequential algorithm to accurately detect community structures in biological protein-protein interaction (PPI) networks, where a community corresponds with a functional module of proteins. Generally, such sequential algorithms are computationally expensive, which makes them impractical to use for large real world networks. To address this limitation, we develop a new highly scalable Symmetric Multiprocessing (SMP) based parallel algorithm to detect high quality communities in large subsections of social networks like Facebook and Amazon. Due to the SMP architecture, however, our algorithm cannot process networks whose size is greater than the size of the RAM of a single machine. With the increasing size of social networks, community detection has become even more difficult, since network size can reach up to hundreds of millions of vertices and edges. Processing such massive networks requires several hundred gigabytes of RAM, which is only possible by adopting distributed infrastructure. To address this, we develop a hybrid (shared + distributed memory) parallel algorithm to efficiently detect communities in massive social networks.

The second group of problems deals with the issue of efficiently processing spatial Light Detection and Ranging (LiDAR) data. LiDAR data is widely used in forest studies, coastal change mapping, landscape classification, 3D urban modeling, agricultural crop studies, etc. Technological advancements in building LiDAR sensors have enabled highly accurate and dense LiDAR point clouds resulting in massive data volumes, which pose computing issues with processing and storage. We develop the first published landscape driven data reduction algorithm, which uses the slope-map of the terrain as a filter to reduce the data without sacrificing its accuracy. Our algorithm is highly scalable and adopts shared memory based parallel architecture. We also develop a parallel shared memory interpolation technique that is used to generate highly accurate continuous terrains, i.e. Digital Elevation Models (DEMs), from discrete LiDAR point clouds.

Advisor: Suely Oliveira