- Keyword Extraction
I explored machine learning techniques with a research-focused NLP team at Lexalytics to find an alternative to tree-based algorithms for sentiment analysis of a given product or company from user reviews. The motivation was that tree-based algorithms do not work well on grammatically incorrect sentences, which are common in datasets such as user reviews. These algorithms assume that sentences connect with each other structurally and exploit this hierarchical structure. For example, from the structural hierarchy of "I bought headphones today. Its sound quality is great.", a machine would learn that the sentiment great refers to the headphones. A pair like "I bought headphones today. Sound quality is great." lacks that structural dependency, so a machine would not know that sound quality is a feature of the headphones and that the sentiment great therefore refers to them.
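The core operation this project relies on is comparing word embeddings by cosine similarity and discarding candidate features that are too far from the product. A minimal sketch with small made-up embedding vectors (the vectors, words, and threshold are illustrative, not the real model):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def extract_features(product_vec, candidates, threshold=0.5):
    """Keep candidate words whose embeddings are close enough to the product."""
    return [word for word, vec in candidates.items()
            if cosine_similarity(product_vec, vec) >= threshold]

# Toy 3-d embeddings; a real model would use pretrained word vectors.
product = np.array([1.0, 0.2, 0.1])
candidates = {
    "sound_quality": np.array([0.9, 0.3, 0.2]),   # similar direction -> kept
    "yesterday":     np.array([-0.2, 1.0, 0.0]),  # dissimilar -> filtered out
}
print(extract_features(product, candidates))  # -> ['sound_quality']
```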
In the first part of the project, the goal was to find an unsupervised method to connect such unstructured sentences. These connections between sentences can be viewed as product-feature relationships: in the example above, the headphones are the product and sound quality is one of their features. Similarly, in "I work at Microsoft. Management of the company is poor.", Microsoft is the product and management is its feature. These connections, or features, can then be used in applications such as sentiment analysis. I extracted features of a given product from Amazon and Yelp user reviews by clustering words whose embeddings are most similar to the embedding of the product, and cleaned up noisy clusters by filtering out features with a large cosine distance to the product. Beyond the feature set itself, the model forms interesting clusters of features representing categories such as price, physical features, and soft features in the case of headphones. The next step is to evaluate the model both on cluster quality and on downstream tasks such as sentiment analysis.
- Entity recognition and linking
We worked in a team of four with Professor Andrew McCallum in the IESL lab, in collaboration with the Chan Zuckerberg Initiative, to explore deep learning techniques for automating entity recognition and linking in biomedical journals. We explored a Bi-LSTM-CRF model for entity recognition and a separate LSTM-based, modular neural model for entity linking, and compared their results against TaggerOne, a semi-Markov model that learns both tasks jointly. The Bi-LSTM-CRF model outperformed the TaggerOne baseline by approximately 3% on the entity recognition task.
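In a Bi-LSTM-CRF tagger, the LSTM produces per-token emission scores and the CRF layer decodes the highest-scoring tag sequence with the Viterbi algorithm. A minimal sketch of that decoding step with made-up scores (the tag set, scores, and shapes are illustrative):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions:   (seq_len, num_tags) scores from the Bi-LSTM
    transitions: (num_tags, num_tags) CRF scores; transitions[i, j]
                 is the score of moving from tag i to tag j.
    """
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag
    backpointers = []
    for t in range(1, seq_len):
        # score of every (prev_tag -> cur_tag) path at step t
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    # follow backpointers from the best final tag
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return list(reversed(path))

# Toy example: 3 tokens, 2 tags (0 = O, 1 = ENTITY).
emissions = np.array([[2.0, 0.5],
                      [0.4, 1.5],
                      [1.0, 0.9]])
transitions = np.array([[0.5, 0.0],
                        [0.0, 0.8]])   # staying on the same tag is favored
print(viterbi_decode(emissions, transitions))  # -> [0, 1, 1]
```

The transition matrix is what lets the CRF override a weak emission score, which is why the last token is tagged 1 despite its emissions slightly favoring tag 0.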
- Cross-domain image retrieval
Performed image retrieval on a fashion dataset: given a consumer (query) image, the system retrieves the most similar shop images. A Siamese network trained with triplet loss achieved an accuracy of 55.3%.
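Triplet loss pulls an anchor embedding toward a positive (same item photographed in the shop) and pushes it away from a negative (a different item) by at least a margin. A minimal sketch with made-up 2-d embeddings (the margin and vectors are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))

# Toy embeddings: the positive is close to the anchor, the negative is far,
# so this triplet already satisfies the margin and contributes zero loss.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.5])
print(triplet_loss(anchor, positive, negative))  # -> 0.0
```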
- Irony detection in English tweets
Irony is a common phenomenon on social media and is inherently difficult to analyse, not just automatically but often for humans too. The aim of this project was to experiment with different combinations of linguistic features and different machine learning models. The dataset and main inspiration came from the irony detection task at SemEval 2018. The best-performing model was logistic regression, which achieved accuracy in the range of 65-67% and an F1-score in the range of 63-65% across different combinations of linguistic features.
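Linguistic features for this kind of task are typically simple surface statistics fed into the classifier. A minimal sketch of a feature extractor (these particular features are illustrative, not the exact set used in the project):

```python
def extract_features(tweet):
    """Surface-level features of the kind often used for irony detection."""
    tokens = tweet.split()
    return {
        "num_tokens": len(tokens),
        "num_exclamations": tweet.count("!"),
        "num_hashtags": sum(1 for t in tokens if t.startswith("#")),
        # ratio of fully uppercase tokens (shouting often marks sarcasm)
        "allcaps_ratio": sum(1 for t in tokens
                             if t.isupper() and len(t) > 1) / max(len(tokens), 1),
        "has_ellipsis": "..." in tweet,
    }

print(extract_features("GREAT, another Monday... #not"))
```

Each tweet becomes a fixed-length feature vector, which is what a model like logistic regression consumes.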
- Detecting diabetic retinopathy in the eye using Transfer Learning
A convolutional neural network was used to detect the presence of diabetic retinopathy in a dataset of fluorescein angiography photographs. The dataset, provided by EyePacs, was obtained from Kaggle. A VGG19 network trained on these images achieved an accuracy of 74% and a sensitivity of 77%.
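Both reported metrics come from a binary confusion matrix; sensitivity is the recall on the positive (retinopathy) class. A minimal sketch with made-up counts, not the project's actual confusion matrix:

```python
def accuracy_and_sensitivity(tp, fp, tn, fn):
    """Accuracy and sensitivity (true positive rate) from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    return accuracy, sensitivity

# Illustrative counts for 200 images (100 positive, 100 negative).
acc, sens = accuracy_and_sensitivity(tp=77, fp=29, tn=71, fn=23)
print(f"accuracy={acc:.2f}, sensitivity={sens:.2f}")
# -> accuracy=0.74, sensitivity=0.77
```

Sensitivity matters more than raw accuracy in a screening setting, since a false negative means a missed case of retinopathy.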