Machine Learning – The Alliance-NSUT's Newspaper

By Ishan Nigam (IT, Batch of 2014)

A widely quoted definition of Machine Learning, due to Tom Mitchell, goes as follows: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. For the uninitiated though, it would suffice to know that Machine Learning and its associated branches aim to answer this simple question: How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes as the system gains in experience?

Introduction

Recent advances in artificial intelligence have been accompanied by a resurgence of interest in Machine Learning. Machine Learning is a sub-field of artificial intelligence concerned with developing computational theories and algorithms of learning processes and with building learning machines. The ability to learn is central to intelligent behavior. Thus, developments in this field are central to the progress of artificial intelligence.

Machine Learning lies at the intersection of Computer Science and Statistics. Computer scientists try to build machines that solve problems; statisticians make inferences from data using modeling assumptions. Machine Learning builds on both and focuses on how to get computers to program themselves. It incorporates questions about which computational architectures and algorithms can be used to most effectively capture, index and retrieve data, and how multiple learning subtasks can be brought together as part of a larger system.

Importance and relevance in the IT industry

Machine learning methods play a key role within a small yet important domain of computer science. Beyond their obvious role in software development, they are also likely to help reshape our view of Computer Science in the future. Machine learning emphasizes the design of self-monitoring systems that diagnose and repair themselves, and of approaches that model their users and their thinking processes.

Machine learning methods are an efficient way to develop particular types of software. A few examples are:

(1) The application is too complex for humans to manually design the algorithm. For example, software for sensor-based perception tasks, such as speech recognition and computer vision, falls into this category.

(2) The application requires that the software customize itself to its operational environment after it is released by the programmer. For example, hand gesture recognition systems customize themselves to the user who purchases the software. Machine learning here provides the mechanism for adaptation.

Implanting learning ability in computers is practically necessary. Present day computer applications require the representation of a large amount of complex knowledge and data in programs and thus perform tremendous computation tasks. Our ability to explicitly code the computers falls short of the demand for these applications. If the computers are endowed with the ability to learn, then the task of coding the machine is significantly reduced.

Important applications

• Google search has a high accuracy rate in terms of what the user “expects” from search because its search engine has learned to rank web-pages depending on many features. The “many-features” are a trade secret Google Inc. has guarded for more than a decade now!

• SMS and email spam filtering is a classification problem addressed by Machine Learning algorithms.

• Database mining: Analyzing trends or patterns in data to draw inferences from it. This technique has allowed great advancements in DNA sequencing and web-data mining.

• Recommender systems on the Internet Movie Database (IMDB), Amazon, Flipkart, and Netflix are based on Machine Learning algorithms as well.
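To make the spam classification example above concrete, here is a minimal sketch of a Naive Bayes text classifier. The class name, the tiny training set and the whitespace tokenizer are all illustrative assumptions; real spam filters use far larger datasets and richer features, and this is not a description of any particular product.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the message and split it on whitespace."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Toy multinomial Naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one labelled message; this is the 'experience'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def predict(self, text):
        """Return the more probable label for a new message.

        Assumes at least one message of each label has been seen.
        """
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Log prior: how common this label is in the training data.
            score = math.log(self.message_counts[label] / total_messages)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Laplace (add-one) smoothing avoids zero probabilities.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training messages, invented for this sketch.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free prize now", "spam")
spam_filter.train("free offer call now", "spam")
spam_filter.train("are you coming to class", "ham")
spam_filter.train("call me after the lecture", "ham")

print(spam_filter.predict("free prize offer"))
print(spam_filter.predict("are you coming after class"))
```

The more labelled messages the filter sees, the better its word statistics become, which is the sense in which such a program "learns from experience".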

Future

Present day computer programs in general, with the exception of a few advanced Machine Learning programs, cannot correct their own errors, improve from past mistakes, or learn to perform a new task by analogy to a previously seen task. In contrast, human beings are capable of all the above. Machine Learning aims to produce smarter computers capable of simulating intelligent behavior. The understanding of human learning and its computational aspect is a worthy scientific goal.

Humans have long been fascinated by their ability to behave rationally and to act on experience. Great efforts have been made to understand the basis of this intelligence. It is clear that central to our intelligence is our ability to learn. A thorough understanding of the human learning process is crucial to understanding human intelligence. Machine Learning aims to provide insight into the underlying principles of human learning, and this may lead to the discovery of more effective education techniques. It is also worth exploring other methods of learning which may be more efficient and effective than human learning.

Machine Learning has become feasible in many important applications primarily because of the recent progress in learning algorithms and theory, the rapid increase of computational power, the availability of huge amounts of data, and interest in commercial application development involving Machine Learning techniques.

PreCog

PreCog (http://precog.iiitd.edu.in) is a group of researchers at IIIT Delhi coordinated by Dr. Ponnurangam Kumaraguru (popularly known as “PK”). The research group aims to build technologies to characterise and predict acts of cybercrime, and to develop an early warning system that could help policy analysis and law enforcement agencies. The group also focuses on building security and privacy systems in the Indian context. Their work utilizes Machine Learning techniques involved in Data Mining, Text Mining, Statistics and Human Computer Interaction. The focus is on building real-time usable software based on research conducted by the group, thus creating a real, measurable impact instead of limiting work to just publications. Below are two projects PreCog is currently working on in which Machine Learning is applied.

PhishAri: Detecting Phishing in Twitter

PhishAri is a Chrome browser extension for Twitter which detects phishing tweets in real-time. Phishing is a way of attempting to acquire information such as usernames and passwords by masquerading as a trustworthy entity in an electronic communication. PhishAri makes real-time decisions to save the user from risky clicks on phish URLs. PhishAri works through instant indications.

The extension shows a legitimate or a phish indicator (next to the URL) for tweets with URLs. Phishing URLs have a red indicator, warning users not to click on them. Legitimate links have a green indicator.

PhishAri, presently available in its beta version, is downloadable from the Chrome Web Store. The research team is working on developing sophisticated algorithms to make the extension faster and more accurate.

Dr. Kumaraguru himself worked on a similar idea, PhishGuru, during his PhD at CMU. The research group he was part of built a tool to train the mailing system to classify mail as either phish or legitimate mail. That research group went on to start a company called Wombat Security Technologies (www.wombatsecurity.com), an example of how research work translates into applications in the real world.

SMSAssassin: Detecting SMS Spam using Crowd Sourcing Approach

Due to the exponential increase in the use of Short Message Service (SMS) over mobile phones in developing countries, there has been a burst of spam SMSes. The main goal of this research project is to build algorithms and solutions to reduce SMS spam in developing countries like India. The team uses a crowd-sourcing approach, applying Machine Learning techniques and keeping the user’s preferences in mind while designing the algorithms.

The team has encountered numerous challenges. For example, a considerable issue is that of regional languages. Hindi text typed in Roman script might be classified as spam even though it is legitimate.

The project has an open dataset. Anyone may contribute to it at http://precog.iiitd.edu.in/usable-security.html#smsassassin. Dr. Kumaraguru is considering putting it online for the benefit of other research that might be going on in this area. SMSAssassin is currently being evaluated for its effectiveness in the real world with a group of volunteers.

SMSAssassin will be presented as a demo at the 13th International Conference on Mobile Data Management (July 23-26, 2012) in Bengaluru, India.

Resources and Research Opportunities

Online courses have stirred the interest of the masses in this relatively unheard of field in Information Technology. Andrew Ng (Associate Professor at Stanford University) was the instructor of the free online Stanford Engineering Everywhere Machine Learning class that was hosted on https://www.coursera.org/course/ml. The course was started in October 2011, and over 100,000 students worldwide registered. Other such online courses have now sprung up, including a more elaborate course on Machine Learning by Andrew Ng hosted on academicearth.org/courses/machine-learning.

NSIT professors are involved in applied work on Machine Learning, such as Pattern Recognition and Classification as used in image processing, though not much theoretical work is being done currently.

IIIT Delhi Research Showcase is an annual two-day event to showcase the research and development efforts of students at IIIT Delhi. Projects on display this year included “Iris Recognition under Alcohol Influence” (Pattern Recognition and Classification), “PhishAri” (Spam Classifier), “What’s Next Up” and “Bon Appetite” (Recommender Systems). Students may write to IIIT professors about any topic that matches the professor’s research interests. Sarthak Kukreti, a 3rd year student from NSIT, is currently working on a classification system under Dr. Somitra Sanidhya.

“Machine Learning and data mining are areas rife with potential for developing intelligent and adaptive applications on distributed and heterogeneous platforms. Nature-inspired heuristics give wonderful opportunities to glean knowledge from nature’s adaptation and optimization mechanisms and apply them to tackle some of the issues in these fields. I undertake projects in this area. Currently, we are investigating quality aspects of e-governance and e-learning, information retrieval/ text classification and database protection.” – Dr. Shampa Chakraverty (HOD, Department of Computer Engineering)

Industry

Machine Learning skills are sought after by almost all the leading IT software giants such as Google, Microsoft, IBM and Yahoo. From the global industrial perspective, Machine Learning is seen as a great asset to the skill set of IT professionals.

IBM Research India, Yahoo! Labs in Bangalore and Microsoft Research (MSR) work on it as well.

“As companies work to build software such as collaborative filtering, spam filtering and fraud-detection applications that seek patterns in jumbo-size data sets, we are seeing a rapid increase in the need for people with machine-learning knowledge, or the ability to design and develop algorithms and techniques to improve computers’ performance. Demand for these applications is pulling up the need for data mining, statistical modeling and data structure skills, among others.” – Kevin Scott (senior engineering manager at Google Inc.)

Research @ NSIT

1. Monitoring real world market by mining Twitter feeds:

Tushar Rao, a 4th year student from NSIT, has worked at IIIT Delhi under Dr. Saket Srivastava on mining Twitter feeds to monitor movements in real-world markets.

The emerging interest of trading companies and hedge funds in mining the social web has created new avenues for intelligent systems that make use of public opinion in driving investment decisions. In high frequency trading, investors track memes and other feeds on micro-blogging forums to judge public behavior as an important feature while making short term investment decisions. The project involved identifying and modeling the complex relationship between tweet board literature (like bullishness, volume and agreement) and financial market instruments (like volatility, trading volume and stock prices).

Twitter sentiments for more than 4 million tweets were analyzed between June 2010 and July 2011 for Dow Jones Industrial Average (DJIA), National Association of Securities Dealers Automated Quotations (NASDAQ)-100, Gold, Oil, USD Forex rates and 11 other big cap technology stocks. The results show high causative correlation (up to 0.88 for returns) between stock prices and Twitter sentiment features. Monitoring social feeds provides valuable public behavior elements that can be exploited to retain a portfolio within a limited risk state (highly improved hedging bets) during typical market conditions, as described in the hedging model provided in his paper.
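As a rough illustration of what a correlation figure such as 0.88 measures, the sketch below computes a plain Pearson correlation coefficient between a daily sentiment series and a daily returns series. The data values and variable names are invented for this example, and the choice of Pearson correlation is an assumption; the paper's actual methodology may differ. Note also that a high correlation by itself does not establish causation.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalised covariance).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical values: a daily "bullishness" score derived from tweets
# and the same days' percentage returns of some stock or index.
daily_bullishness = [0.2, 0.5, 0.1, 0.7, 0.4]
daily_returns = [0.3, 0.6, -0.2, 0.9, 0.1]

print(round(pearson_correlation(daily_bullishness, daily_returns), 2))
```

A value close to +1 means the two series tend to rise and fall together, a value close to -1 means they move in opposite directions, and a value near 0 means there is no linear relationship.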

2. Social Network Analysis: The Tie Strength Problem:

Recent trends in the behaviour of people on the internet, and the push for personalization of the web, have drawn considerable attention from researchers to the problem of Tie Strength, that is, how strong the relationship between two people in a network is. Such work uses data available on social networking sites like Facebook, Twitter, etc.

Understanding Tie Strength can improve social media design elements, including privacy controls, message routing and information prioritization in databases. This work could also be used in building complex recommender systems, in lead generation marketing, and in organizational or telecom networks.

Computer Science researchers all over the world have been using Machine Learning techniques and Graph Theory, along with some other statistical techniques, to analyse this problem, which they believe can revolutionise the way we see the internet and the way companies see the market.

Arnab Kumar, a 3rd year student, had been working on Social Network Analysis (SNA) under the guidance of Mrs. Sushma Nagpal, Assistant Professor, NSIT. Their primary field of research was the Tie Strength Problem, with one of their research papers being accepted at IC3 2012 (5th International Conference on Contemporary Computing 2012), with further publication by Springer in the Communications in Computer and Information Science Series (CCIS Series) and indexing in Digital Bibliography & Library Project (DBLP), Institute for Scientific Information (ISI), and Scopus.
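To give a flavour of the graph-theoretic side of the problem, the sketch below computes neighbourhood overlap, a simple proxy for tie strength sometimes used in network analysis: two people who share most of their other friends are assumed to have a strong tie. The choice of this measure, the tiny friendship graph and all names in it are assumptions made for illustration; this is not necessarily the approach taken in the research described here.

```python
def neighborhood_overlap(graph, a, b):
    """Fraction of a's and b's other contacts that the two have in common.

    `graph` maps each person to the set of people they are connected to.
    Returns a value between 0.0 (no shared contacts) and 1.0 (all shared).
    """
    neighbors_a = graph[a] - {b}
    neighbors_b = graph[b] - {a}
    union = neighbors_a | neighbors_b
    if not union:
        # Neither person has any other contact, so there is nothing to share.
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


# Hypothetical undirected friendship graph, invented for this sketch.
friends = {
    "asha": {"bala", "chitra", "dev"},
    "bala": {"asha", "chitra", "dev"},
    "chitra": {"asha", "bala"},
    "dev": {"asha", "bala", "esha"},
    "esha": {"dev"},
}

print(neighborhood_overlap(friends, "asha", "bala"))
print(neighborhood_overlap(friends, "dev", "esha"))
```

In practice, a Machine Learning model would combine many such signals (for example message frequency or recency of contact) rather than rely on a single structural measure.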

Acknowledgements

I am grateful to Dr. Kumaraguru for sharing his personal experiences, as well as his role in PreCog and the current work going on in the research group he heads at IIIT Delhi. Dr. Shampa Chakraverty’s views were very helpful, as she has been working on Machine Learning for quite some time now in our college. I would like to thank Tushar Rao and Arnab Kumar for sharing their experience of their research work with me. I would also like to thank Samarth Bharadwaj (PhD scholar at IIIT Delhi) for informing me about the areas in which research is going on at IIIT Delhi. Finally, I would like to mention Sarthak Kukreti, a 4th year COE student, whose inputs were invaluable in shaping a few sections of this article.