**Shailesh Mani Pandey**
shailesh.pandey@utexas.edu
![ ](res/images/self.jpg width="250px" border="1")
Software Engineer at Google, Mountain View, California
Twitter: [@manishailesh96](https://twitter.com/manishailesh96) | LinkedIn: [@manishailesh](https://www.linkedin.com/in/manishailesh/) | Github: [@mani-shailesh](https://github.com/mani-shailesh)
[[Resume](https://drive.google.com/open?id=0B14zqFHFuOjudWhiQWlzdjd3b2s)]

Education
==========================================================================

Master of Science (MS) - Computer Science,
[Department of Computer Science](https://www.cs.utexas.edu/),
[The University of Texas at Austin](https://www.utexas.edu/)
Aug 2019 - Dec 2020
CGPA: 3.97/4.0

Bachelor of Technology (B.Tech.) - Computer Science and Engineering,
[Department of Computer Science and Engineering](http://cse.iitrpr.ac.in/),
[Indian Institute of Technology (IIT) Ropar](http://www.iitrpr.ac.in)
Jul 2013 - May 2017
CGPA: 9.44/10.0

Publications
==========================================================================

PANDEY, S.; AGARWAL, T.; KRISHNAN, N. C. Multi-Task Deep Learning for Predicting Poverty From Satellite Images. AAAI Conference on Artificial Intelligence, North America, Apr. 2018. Available at: [AAAI](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16441/16388). Date accessed: 22 Feb. 2020.

Work Experience
==========================================================================

SDE Intern, AmazonSmile, [Amazon.com](https://www.amazon.com): May 2020 - Aug 2020
- Made full-stack changes to redesign customers’ experience with the address book UI.
- Added support for creating military addresses and for updating addresses and related documents.
- Added support for viewing items shipped to an address and for archiving unused addresses.
- Programming languages / technologies: Java (Spring Framework) and JavaScript

Software Engineer, [Arista Networks](https://www.arista.com): Jul 2017 - Jul 2019
- Designed and developed a telemetry-based inventory management service in a team of two.
- Served as a release manager for the Foster release of CloudVision Portal (CVP).
- Enhanced the scalability of the existing code base to increase the number of supported devices by more than 100%.
- Technologies/languages used: Hadoop, Kafka, Go, Python and Java

Software Engineering Intern, [Saavn Media Pvt. Ltd](https://www.saavn.com): May 2016 - Jul 2016
- Used Simhash and an efficient indexing technique to quickly search for near-duplicates of any song (see the sketch below).
- Created a pipeline component for assigning songs to clusters of near-duplicate songs.
- Implemented a few Saavn Pro features in the UWP app.
- Technologies/languages used: Python, MongoDB, SQL, C and C#
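
As a rough illustration of the near-duplicate search idea, here is a minimal, hypothetical Python sketch of Simhash with a banded index: fingerprints are bucketed by 16-bit bands so any two 64-bit fingerprints within Hamming distance 3 must collide in at least one band. The feature names, fingerprint width, and band count are illustrative assumptions, not the production pipeline.

```python
# Simhash fingerprinting plus a band-based index for fast near-duplicate lookup.
import hashlib
from collections import defaultdict

def simhash(features, bits=64):
    """Compute a Simhash fingerprint from (feature, weight) pairs."""
    v = [0] * bits
    for feat, weight in features:
        h = int(hashlib.md5(feat.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

class NearDuplicateIndex:
    """Bucket fingerprints by four 16-bit bands; by the pigeonhole principle,
    fingerprints within Hamming distance 3 agree exactly on some band."""
    def __init__(self):
        self.bands = [defaultdict(list) for _ in range(4)]
        self.fingerprints = {}

    def add(self, song_id, fp):
        self.fingerprints[song_id] = fp
        for i, band in enumerate(self.bands):
            band[(fp >> (16 * i)) & 0xFFFF].append(song_id)

    def query(self, fp, max_dist=3):
        candidates = set()
        for i, band in enumerate(self.bands):
            candidates.update(band.get((fp >> (16 * i)) & 0xFFFF, []))
        return [s for s in candidates
                if hamming(fp, self.fingerprints[s]) <= max_dist]

idx = NearDuplicateIndex()
idx.add("song-1", simhash([("chroma:17", 2.0), ("tempo:120", 1.0)]))  # toy features
print(idx.query(simhash([("chroma:17", 2.0), ("tempo:121", 1.0)])))
```

The banding is what makes lookups fast: instead of comparing against every stored fingerprint, the index only verifies the small candidate set that shares a band with the query.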
Summer Intern, [CDEEP](http://www.cdeep.iitb.ac.in/), [IIT Bombay](http://www.iitb.ac.in): May 2015 - Jul 2015
- Designed and developed a portable educational game/quiz development platform.
- Technologies/languages used: Unity3D, JavaScript, and C#

Research
==========================================================================

**SRoCE: Software RDMA over Commodity Ethernet**
Advisors: [Prof. Chris Rossbach](http://www.cs.utexas.edu/~rossbach/) and [Prof. Simon Peter](http://www.cs.utexas.edu/~simon/)
[[Code](https://github.com/mani-shailesh/rdma-tas)] [[Paper](res/docs/SRoCE.pdf)]
- Software-based, flexible RDMA verbs implementation built on TAS, a high-performance user-space TCP stack.
- Achieved 3x the single-connection throughput of hardware RDMA NICs for 1000-byte RDMA operations.
- __Abstract__: RDMA networks are used in datacenters and high-performance computing clusters to support high-throughput, low-latency networking by allowing specialized hardware to copy directly between application memory and the network. We propose a software-based, flexible RDMA verbs implementation that uses TAS, a high-performance user-space TCP stack, as the underlying transport layer to provide similar semantics without the hardware and network requirements. We discuss the design space that we explore and evaluate our prototype. The single-connection throughput and latency achieved by our implementation indicate that providing an RDMA interface over TCP using commodity NICs is not only feasible but also comparable to the hardware implementation.
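
To make the core idea concrete, the hypothetical Python sketch below emulates a one-sided RDMA WRITE verb over a plain TCP socket: the sender posts a header plus payload, and the remote side applies the write directly into a pre-registered buffer with no application-level receive call. The wire format and function names are invented for illustration; the actual implementation sits on TAS and provides the full verbs interface.

```python
# One-sided RDMA WRITE emulated as a small message on a reliable byte stream.
import socket, struct

HDR = struct.Struct("!BQI")   # opcode, remote offset, payload length
OP_WRITE = 1

def post_write(sock, remote_offset, payload):
    """'Post' a WRITE work request: header + payload on the stream.
    Returns immediately; real verbs signal completion via a CQ."""
    sock.sendall(HDR.pack(OP_WRITE, remote_offset, len(payload)) + payload)

def serve_region(sock, region: bytearray):
    """Remote side: apply incoming WRITEs straight into a registered
    memory region, without the application ever calling receive."""
    while True:
        hdr = sock.recv(HDR.size, socket.MSG_WAITALL)
        if not hdr:
            break
        op, offset, length = HDR.unpack(hdr)
        data = sock.recv(length, socket.MSG_WAITALL)
        if op == OP_WRITE:
            region[offset:offset + length] = data
```

A real verbs layer also needs memory registration, completion queues, and READ/atomic operations; TAS supplies the high-performance reliable transport that plain TCP stands in for here.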
**Multi-task Deep Learning for Predicting Poverty from Satellite Images**
Advisor: [Prof. Narayanan C. Krishnan](http://cse.iitrpr.ac.in/ckn/people/ckn.html)
[[Code](https://github.com/mani-shailesh/satimage)] [[Paper](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16441/16388)]
- Engineered a novel multi-task fully convolutional deep neural network to process large (1920 x 1920) day-time satellite images.
- Predicted developmental statistics at the village level and income statistics at the sub-district level for several states of India.
- __Abstract__: Estimating economic and developmental parameters such as poverty levels of a region from satellite imagery is a challenging problem with many applications. We propose a two-step approach to predict poverty in a rural region from satellite imagery. First, we engineer a multi-task fully convolutional deep network that simultaneously predicts the material of the roof, the source of lighting, and the source of drinking water from satellite images. Second, we use the predicted developmental statistics to estimate poverty. Using full-size satellite imagery as input, and without pre-trained weights, our models are able to learn meaningful features, including roads, water bodies, and farmland, and achieve a performance that is close to the optimum. In addition to speeding up the training process, the multi-task fully convolutional model is able to discern task-specific and task-independent feature representations.
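
A minimal PyTorch sketch of the multi-task fully convolutional idea follows: a shared convolutional trunk feeds three classification heads, one per developmental attribute. The layer sizes and class counts are illustrative assumptions, not the published architecture.

```python
# Shared trunk + per-task 1x1-conv heads; fully convolutional, so full-size
# satellite tiles can be fed in directly.
import torch
import torch.nn as nn

class MultiTaskFCN(nn.Module):
    def __init__(self, n_roof=5, n_light=6, n_water=7):
        super().__init__()
        self.trunk = nn.Sequential(              # features shared by all tasks
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # works for any input size
        )
        def head(n): return nn.Conv2d(128, n, 1)  # one 1x1-conv head per task
        self.roof, self.light, self.water = head(n_roof), head(n_light), head(n_water)

    def forward(self, x):
        z = self.trunk(x)
        return [h(z).flatten(1) for h in (self.roof, self.light, self.water)]

# Joint training would minimize the sum of per-task cross-entropy losses.
model = MultiTaskFCN()
roof, light, water = model(torch.randn(1, 3, 1920, 1920))  # full-size tile
```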
**Semantic Audio-Visual Navigation through Distractor Silencing**
Advisor: [Prof. Yuke Zhu](https://www.cs.utexas.edu/~yukez/)
[[Paper](res/docs/SemAudioVisNav.pdf)]
- __Abstract__: Embodied AI has recently witnessed significant progress in single-source audio-visual navigation, which requires an agent to reach the only audio source in an unknown environment by relying on audio and visual cues. In this work, we generalize this setting and introduce the task of *semantic audio-visual navigation*, in which the agent must instead navigate to the audio source of a given semantic class while sources of other classes play in the environment at the same time. We propose two approaches to this task, both of which rely on extracting the target audio from the mixture, effectively silencing the distractor sources for accurate and efficient navigation. One approach performs implicit extraction by learning disentangled latent navigation features conditioned on the audio classes in the mixture. The other extracts the target audio more explicitly using an additional class-conditional extraction module. We demonstrate our approaches on Replica, a challenging dataset of real-world 3D scans. In multiple challenging evaluation settings, our approaches improve over the state-of-the-art end-to-end reinforcement-learning-based audio-visual navigation agent, customized to account for audio semantics, demonstrating the effectiveness of target-audio extraction for successful navigation in multi-audio settings.
- Project slides with navigation videos: [https://bit.ly/semanticAudioNavigation](https://bit.ly/semanticAudioNavigation) (sign in with a Google account to play the videos).
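
To make the explicit-extraction approach concrete, here is a minimal, hypothetical PyTorch sketch of a class-conditional extraction module: it conditions a per-frequency-bin soft mask on the target class embedding, keeping the target audio and silencing the distractors. Shapes, layer sizes, and the class count are assumptions for illustration.

```python
# Class-conditional soft masking of a mixed audio spectrogram.
import torch
import torch.nn as nn

class ConditionalExtractor(nn.Module):
    def __init__(self, n_classes=10, freq_bins=257, emb_dim=32):
        super().__init__()
        self.class_emb = nn.Embedding(n_classes, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(freq_bins + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, freq_bins), nn.Sigmoid(),  # per-bin soft mask
        )

    def forward(self, mixed_spec, target_class):
        # mixed_spec: (batch, time, freq); target_class: (batch,)
        emb = self.class_emb(target_class)                       # (batch, emb)
        emb = emb.unsqueeze(1).expand(-1, mixed_spec.size(1), -1)
        mask = self.net(torch.cat([mixed_spec, emb], dim=-1))
        return mask * mixed_spec    # extracted target, fed to the nav policy

extractor = ConditionalExtractor()
target = extractor(torch.rand(4, 100, 257), torch.tensor([3, 1, 0, 7]))
```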
**Designing Image Captioning Models That Can Read**
Advisor: [Prof. Raymond Mooney](https://www.cs.utexas.edu/~mooney/)
[[Paper](res/docs/ocr_image_captioning.pdf)]
- __Abstract__: Image descriptions have many applications, such as helping visually impaired people quickly understand image content. While automatic image captioning has made significant progress, current approaches are unable to incorporate text embedded in images, even though text is omnipresent in human environments and frequently critical to understanding our surroundings. In this work, we build a model-agnostic text comprehension module for image captioning systems that allows them to extract textual information from images. To demonstrate the effectiveness of our method, we augment the Bottom-Up and Top-Down (BUTD) model with our text comprehension system and show improved performance on tasks that require understanding text present in images. We show that integrating OCR tokens, attending over those tokens, using spatial information, and copying OCR tokens directly into generated captions is crucial for producing informative captions. Our proposed approach outperforms the base BUTD architecture, with nearly a 3% improvement across multiple accuracy metrics. More importantly, our BUTD + text comprehension model reduces the gap to human performance on the challenging TextCaps dataset by nearly 50%.
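
A minimal PyTorch sketch of the copy mechanism follows: each decoding step scores the fixed vocabulary jointly with the OCR tokens detected in the image, so the caption can emit scene text verbatim. The module name and sizes are illustrative assumptions, not the exact BUTD integration.

```python
# Joint scoring of vocabulary words and copyable OCR tokens at one decode step.
import torch
import torch.nn as nn

class CopyAugmentedDecoderStep(nn.Module):
    def __init__(self, hidden=512, vocab=10000, ocr_dim=300):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden, vocab)
        self.ocr_proj = nn.Linear(ocr_dim, hidden)  # score OCR tokens vs. state

    def forward(self, state, ocr_feats):
        # state: (batch, hidden); ocr_feats: (batch, n_ocr, ocr_dim)
        vocab_logits = self.vocab_proj(state)                     # (batch, vocab)
        ocr_logits = torch.bmm(self.ocr_proj(ocr_feats),          # (batch, n_ocr)
                               state.unsqueeze(-1)).squeeze(-1)
        # Softmax over this joint space picks a vocab word or a copied OCR token.
        return torch.cat([vocab_logits, ocr_logits], dim=-1)

step = CopyAugmentedDecoderStep()
joint = step(torch.randn(2, 512), torch.randn(2, 12, 300))  # (2, 10012)
```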
**Making It Easier To Publish Your Work**
Advisor: [Prof. Hovav Shacham](https://www.cs.utexas.edu/~hovav/)
[[Paper](res/docs/easier_publishing.pdf)]
- Built an app that modifies a conference submission PDF to target or avoid arbitrarily selected reviewers.
- Successfully targeted the favorable reviewer with probability > 80% and avoided the adverse reviewer with probability > 99%.
- __Abstract__: With an increasingly high number of submissions and reviewers at top research conferences, assigning papers to reviewers is a difficult problem facing conference organizers. In this paper, we show that it is possible to modify a paper submission so as to force the Toronto Paper Matching System (TPMS), a popular solution to this problem, to pick or avoid arbitrarily chosen reviewers for the paper. Specifically, we embed keywords into the PDF so that there are no visible changes to the document, but TPMS determines that the paper is related to a desired reviewer’s area of expertise or unrelated to an unwanted reviewer’s. We show the effectiveness of our attack through various experiments and find that an attacker can get her paper reviewed by a favorable reviewer with probability as high as 80% and can avoid unfavorable reviewers with probability as high as 99%. We also highlight potential mitigations against the proposed attack and exciting directions for future research in this area.
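
As a conceptual illustration (not the paper's exact injection method), the sketch below uses reportlab to render keywords in white, 1-point type: the page looks unchanged to a human reader, while text extraction of the kind TPMS-style matchers rely on still picks the words up. It writes a standalone page; injecting into an existing submission PDF is more involved.

```python
# Render visually imperceptible keywords that text extraction still sees.
from reportlab.pdfgen import canvas

def embed_hidden_keywords(path, keywords):
    c = canvas.Canvas(path)
    c.setFont("Helvetica", 1)          # 1 pt: effectively invisible
    c.setFillColorRGB(1, 1, 1)         # white text on a white page
    for i, word in enumerate(keywords):
        c.drawString(5, 5 + i, word)   # tucked near the page margin
    c.save()

embed_hidden_keywords("hidden.pdf", ["RDMA", "datacenter", "kernel bypass"])
```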
**Explaining BERT for Paraphrase Identification**
Advisor: [Prof. Greg Durrett](https://www.cs.utexas.edu/~gdurrett/)
[[Paper](res/docs/explaining_bert.pdf)]
- Analyzed the performance of BERT on paraphrase identification using model-agnostic techniques, such as LIME.
- Found that while BERT is robust to minor grammatical mistakes, it does not generalize well across datasets and has a hard time interpreting numbers.
- __Abstract__: BERT is a relatively new pre-trained language representation model that achieves near state-of-the-art performance on many NLP benchmarks. We specifically look at its performance on the paraphrase identification task. We explain the predictions made by BERT on specific examples and also test its generality and robustness. We show that although BERT achieves competitive performance and is fairly robust, it suffers from a lack of generality across datasets, behaves asymmetrically on this inherently symmetric task, and has a hard time interpreting numbers present in the input text. We believe that an exhaustive explanation of a model’s predictions can help us understand its limitations, improve its training procedure, and ultimately extract better results.
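
A minimal sketch of this analysis setup: wrap a BERT paraphrase classifier so LIME can perturb the input and report which words drive the prediction. The checkpoint name, label ordering, and the [SEP]-joined pair format are assumptions; adjust them to the model you actually probe.

```python
# Model-agnostic explanation of a BERT paraphrase classifier with LIME.
import numpy as np
from lime.lime_text import LimeTextExplainer
from transformers import pipeline

# Hypothetical checkpoint; any BERT paraphrase classifier works here.
clf = pipeline("text-classification",
               model="bert-base-cased-finetuned-mrpc", top_k=None)

def predict_proba(texts):
    """Adapter: LIME sends perturbed strings and expects an (n, 2) array."""
    rows = []
    for scores in clf(list(texts)):
        # Order probabilities by label name for a stable column order.
        rows.append([s["score"] for s in sorted(scores, key=lambda s: s["label"])])
    return np.array(rows)

explainer = LimeTextExplainer()  # class names default to column indices
pair = "The company posted record profits. [SEP] Profits hit a record high."
exp = explainer.explain_instance(pair, predict_proba, num_features=8)
print(exp.as_list())  # words that most influence the paraphrase decision
```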
**Connecting CNN Layers to Human Brain Regions**
Advisor: [Prof. Alexander Huth](https://www.cs.utexas.edu/~huth/)
[[Paper](res/docs/connecting_cnn_brain.pdf)]
- __Abstract__: Recent advances in Computer Vision tasks, such as object recognition, have been driven by deep Convolutional Neural Networks (CNNs). These models have a layered architecture and are typically trained on millions of images. It has been observed that the layers of these networks learn increasingly complex features from the image as depth increases, building on the features from previous layers. In this project, we improve our understanding of brain activations in different regions by comparing them to the layered representations of these Computer Vision models.
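
One standard recipe for this kind of comparison, sketched below on synthetic data, is an encoding model: regress each region's responses on features from one CNN layer and treat the best-predicting layer as that region's match. This is a common approach in the literature and an assumption here, not necessarily the project's exact pipeline.

```python
# Layer-to-region comparison via ridge-regression encoding models.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_images, n_voxels = 200, 50
layer_feats = {f"conv{i}": rng.standard_normal((n_images, 128)) for i in range(1, 6)}
voxels = rng.standard_normal((n_images, n_voxels))   # stand-in for fMRI responses

scores = {}
for layer, X in layer_feats.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, voxels, random_state=0)
    model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # Mean correlation between predicted and held-out voxel responses.
    r = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(n_voxels)]
    scores[layer] = float(np.mean(r))

print(max(scores, key=scores.get))   # layer that best explains this "region"
```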