Private Outlier Detection and Content based Encrypted Search
Nisarg Raval (homepage)
Today, our daily lives are constantly driven and influenced by services like social networking, personalized advertisements, cloud infrastructure etc. This makes any form of personal or professional data vulnerable to both security and privacy threats. User looses control over the data as soon as its published on the Internet. Recently, popular photo sharing service Instagram tried to change its privacy policy, giving itself the right to sell users’ information including their photographs to advertisers without any notification or compensation. With incidents like AOL and Netflix debacle, organizations are refraining from releasing customer data even after anonymization. This hinders future research and developments in many ways due to unavailability of customer data, behavior and patterns. Techniques like k-anonymity and l-diversity help to preserve user privacy to some extent but fails to provide any solid guarantees. Recently proposed, differential privacy provides promising results in this area with theoretical bounds on information leak and preservation of privacy even against the auxiliary information. However, differential privacy considers individual’s record independent of other data and can not be applied to linked data like social graph, time series etc. Furthermore, all these techniques compromise utility of the data while maintaining privacy.
At present, two approaches are widely used to ensure security of the data. First one is based on Secure Multiparty Computations (SMC), where users communicate with each other using SMC protocol to securely evaluate any function on their data. SMC gives provable security without compromising utility of the data. However, the solutions based on the general protocol of SMC requires enormous computational and communication overhead, thus limiting the practical deployment of the secure algorithms. The second approach is to encrypt the data at the user end and perform any operations in the encrypted domain to preserve the confidentiality of the data. This leads to difficulty in performing various operations like search, retrieval and computations. Homomorphic encryption allows secure computations over encrypted data but its impractical due to high computational cost.
In this thesis, we focus on developing privacy preserving, communication and computationally efficient algorithms in the domain of Data Mining and Information Retrieval. Specifically, we design efficient privacy preserving solutions for two widely used tasks - outlier detection and content based similarity search over encrypted data. Private computation of outliers among data collected from different sources has many applications in the area of data mining. First, we propose a novel algorithm of distance based outlier detection using Locality Sensitive Hashing (LSH) in a distributed setting where data is horizontally partitioned among data owners. We show that the proposed algorithm is scalable in case of large datasets with high dimensionality. Then, we extend our distributed outlier algorithm to propose an algorithm for private outlier detection that broke the previous known bounds of quadratic cost. We compare our algorithm with the state of the art technique and show that our algorithm is better both in terms of computational as well as communication complexity.
Next, we explore the properties of hierarchical index structures and propose an algorithm for content based similarity search over encrypted data that do not reveal user query to the server. To the best of our knowledge, we are the first one to propose a definition and a scheme for the problem of Content Similarity Searchable Symmetric Encryption (CS-SSE). Finally, we extend our CS-SSE scheme for Private Content Based Image Retrieval, which is essential for many applications like searching health records using CT scans and patent search for logos. We also show that our algorithm achieves optimal computational cost with similar accuracy and privacy when compared to the existing techniques.
Year of completion: | 2013 |
Advisor : | C. V. Jawahar and Dr. Kannan Srinathan |