Research

Detecting network attacks is a critical first step toward network security. A major branch of this area is anomaly detection. My Ph.D. research concentrates on detecting abnormal behaviors in web applications, employing the following methodology.

For a given web application, we conduct a set of measurements to reveal the existence of abnormal behaviors and to observe the differences between normal and abnormal behaviors. By applying a variety of information-extraction methods, such as heuristic algorithms, machine learning, and information theory, we extract features useful for building a classification system that detects abnormal behaviors. In particular, we have studied the following four detection problems in web security.
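
As a minimal sketch of this methodology (the session fields, feature names, and the choice of a decision-tree classifier are illustrative assumptions, not details of any project below), the measure-extract-classify pipeline looks roughly like this in Python:

# A minimal sketch of the measure -> extract features -> classify pipeline.
# The session fields and the decision-tree model are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

def extract_features(session):
    # 'session' is a hypothetical dict of raw measurements for one visitor
    return [
        session["request_rate"],          # heuristic feature
        session["avg_inter_event_gap"],   # timing feature
        session["payload_entropy"],       # information-theoretic feature
    ]

def train_detector(labeled_sessions):
    # labeled_sessions: list of (session, label) pairs; label 0 = normal, 1 = abnormal
    X = [extract_features(s) for s, _ in labeled_sessions]
    y = [label for _, label in labeled_sessions]
    return DecisionTreeClassifier().fit(X, y)

def is_abnormal(model, session):
    return bool(model.predict([extract_features(session)])[0])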

Detection of Blog Bots

Most bloggers have been victims of blog bots and their spam comments. Blog bots are automated scripts that post comments to blogs, often containing spam or trackback links. An effective defense is to detect and validate human presence during the form-filling procedure: if you know your visitor is not a human (namely, a bot), you simply reject its posting. Conventional detection methods require the direct participation of human users, such as recognizing a CAPTCHA image, which can be burdensome. As CAPTCHA images have become more and more distorted, even we humans have a hard time recognizing them.

Our work presents a new detection approach that uses behavioral biometrics, primarily mouse and keystroke dynamics, to distinguish between humans and bots. Based on passive monitoring, the proposed approach does not require any direct user participation. We collect real user input data from the Internet, and use a variety of features to characterize the behavioral differences between humans and bots, such as mouse moving speed, movement efficiency, movement angle, and typing rhythm. Our detection system consists of two main components: a webpage-embedded logger and a server-side classifier. The logger records mouse movement and keystroke data while a user is filling out a form, and streams these data in batches to the classifier, which labels the poster as human or bot. Our experimental results demonstrate an overall detection accuracy greater than 99%, with negligible overhead.
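
To give a concrete flavor of the behavioral features (the event format and the exact formulas here are simplifying assumptions, not the logger's actual code), mouse speed, movement efficiency, and typing rhythm could be derived from logged events roughly as follows:

# Simplified sketch: compute mouse speed, movement efficiency, and typing
# rhythm from logged events. Event format and formulas are assumptions.
import math

def mouse_features(events):
    # events: list of (x, y, t) tuples ordered by time, t in seconds
    path_len, speeds = 0.0, []
    for (x0, y0, t0), (x1, y1, t1) in zip(events, events[1:]):
        step = math.hypot(x1 - x0, y1 - y0)
        path_len += step
        if t1 > t0:
            speeds.append(step / (t1 - t0))
    # efficiency: straight-line displacement divided by actual path length
    (xs, ys, _), (xe, ye, _) = events[0], events[-1]
    displacement = math.hypot(xe - xs, ye - ys)
    return {
        "avg_speed": sum(speeds) / len(speeds) if speeds else 0.0,
        "efficiency": displacement / path_len if path_len else 0.0,
    }

def keystroke_features(key_times):
    # key_times: list of key-press timestamps; typing rhythm as inter-key gaps
    gaps = [b - a for a, b in zip(key_times, key_times[1:])]
    return {"mean_gap": sum(gaps) / len(gaps) if gaps else 0.0}

Feature vectors like these, pooled over a form-filling session, are what the server-side classifier consumes.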

You may not be very confident about using behavioral biometrics to identify individuals (say, differentiating Alice from Bob). However, assisted by machine learning, it works very well for our binary classification problem (bot or human). Since this project can easily be extended to other applications involving HTML forms, such as account registration, message posting, and online voting, I have been considering turning it into a Web service. For example, an email service provider could embed our JavaScript logger into its account registration page. After receiving the user input data (sensitive textual data could be masked), our server would pass the classification result back to the email provider, which then decides whether to accept the registration.
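
Sketching the envisioned service interface (the endpoint name, payload format, and the use of Flask are all hypothetical assumptions), the embedding site would simply forward the logged events and act on the verdict:

# Hypothetical sketch of the envisioned Web-service interface: the embedded
# logger POSTs behavioral data here, and the response tells the embedding
# site (e.g., an email provider) whether the visitor looks human.
from flask import Flask, request, jsonify

app = Flask(__name__)

def looks_human(events):
    # Placeholder for the trained classifier described above.
    return len(events.get("mouse", [])) > 0

@app.route("/classify", methods=["POST"])
def classify():
    # mouse/keystroke events; sensitive text would already be masked client-side
    events = request.get_json(silent=True) or {}
    verdict = "human" if looks_human(events) else "bot"
    return jsonify({"verdict": verdict})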

[ComNet'12] Blog or Block: Detecting Blog Bots through Behavioral Biometrics. [online version]
Zi Chu, Steven Gianvecchio, Aaron Koehl, Haining Wang, and Sushil Jajodia.

Detection of Social Spam Campaigns on Twitter

If you're a regular tweeter, you may have been harassed by spammers. What's worse, they may coordinate a large number of spam accounts to launch spam campaigns that storm Twitter within a short time window. Spammers have become more sophisticated: they have learned to distribute the workload across individual accounts, so that each account spams in a stealthy way and flies under the radar. This project addresses the collective campaign problem.

The popularity of Twitter greatly depends on content contributed by users. Unfortunately, Twitter has attracted spammers who post spam content that pollutes the community. Social spamming is more successful than traditional methods such as email spamming because it exploits the social relationships between users. Conventional detection methods check individual messages or accounts for the existence of spam. Our work takes a collective perspective and focuses on detecting spam campaigns that manipulate multiple accounts to spread spam on Twitter. As a complement to conventional detection methods, our work brings efficiency and robustness. More specifically, we design an automatic classification system based on machine learning and apply multiple features to classify spam campaigns. The experimental evaluation demonstrates the efficacy of the proposed classification system.
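
As a hedged illustration of campaign-level (rather than per-message) classification, the idea can be sketched as below; grouping tweets by the URL they advertise and the specific features shown are simplifying assumptions for illustration, not the paper's exact design:

# Sketch: group tweets into candidate campaigns, compute campaign-level
# features, and classify each campaign as spam or legitimate.
from collections import defaultdict
from statistics import mean
from sklearn.ensemble import RandomForestClassifier

def group_into_campaigns(tweets):
    # tweets: list of dicts with "url", "account", "text" keys (assumed schema)
    campaigns = defaultdict(list)
    for t in tweets:
        campaigns[t["url"]].append(t)
    return campaigns

def campaign_features(tweets):
    accounts = {t["account"] for t in tweets}
    texts = [t["text"] for t in tweets]
    return [
        len(tweets),                      # campaign size
        len(accounts),                    # number of participating accounts
        len(set(texts)) / len(texts),     # textual diversity (near-duplicates suggest spam)
        mean(len(t) for t in texts),      # average tweet length
    ]

def train_campaign_classifier(labeled_campaigns):
    # labeled_campaigns: list of (tweets, is_spam) pairs
    X = [campaign_features(tweets) for tweets, _ in labeled_campaigns]
    y = [is_spam for _, is_spam in labeled_campaigns]
    return RandomForestClassifier().fit(X, y)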

[ACNS'12] Detecting Social Spam Campaigns on Twitter.
Zi Chu, Indra Widjaja, and Haining Wang.
To appear in Proceedings of Conference on Applied Cryptography and Network Security, Singapore, June 2012.
Part of the work was done during my internship at Bell Labs, Alcatel-Lucent, NJ in 2011.

Detection of Twitter Bots

When I started this project in early 2009, Twitter was gaining popularity and attracting a lot of attention. Many developers were creating gadgets based on the Twitter APIs, and some of these were abused, becoming known as Twitter bots. Instead of classifying content (spam or legitimate), my goal was to determine the degree of automation of an account: whether it is manually operated, automatically piloted, or even a mix of both. Soon afterwards, Twitter began to suspend aggressive automation and tightened its regulation of third-party tools, which confirmed the direction of this project.

The popularity and open structure of Twitter have attracted a large number of automated programs, known as bots, which appear to be a double-edged sword to Twitter. Legitimate bots generate a large amount of benign tweets delivering news and updating feeds, while malicious bots spread spam or malicious content. More interestingly, in the middle ground between human and bot there has emerged the cyborg, referring to either a bot-assisted human or a human-assisted bot. To assist human users in identifying who they are interacting with, this project focuses on the classification of human, bot, and cyborg accounts on Twitter. We first conduct a set of large-scale measurements on a collection of over 500,000 accounts, and observe the differences among humans, bots, and cyborgs in terms of tweeting behavior, tweet content, and account properties. Based on the measurement results, we propose a classification system that includes four parts: (1) an entropy-based component that calculates the complexity of posting inter-arrival times, (2) a machine-learning-based component that captures spam content, (3) an account properties component, and (4) a decision maker. The system combines the features extracted from an unknown user to determine the likelihood of that user being a human, bot, or cyborg. Our experimental evaluation demonstrates the efficacy of the proposed classification system.
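
To illustrate the intuition behind the entropy component, low entropy of posting inter-arrival times indicates regular, automated behavior, while human posting is more irregular. The system's actual measure may be more refined; the plain binned Shannon entropy below is a simplification for illustration, and the bin width is an assumption:

# Sketch: Shannon entropy of binned posting inter-arrival times.
import math
from collections import Counter

def interarrival_entropy(post_times, bin_seconds=60):
    # post_times: sorted list of posting timestamps in seconds
    gaps = [b - a for a, b in zip(post_times, post_times[1:])]
    bins = Counter(int(g // bin_seconds) for g in gaps)
    total = sum(bins.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in bins.values())

# A bot posting exactly every 10 minutes yields entropy 0 (perfectly regular),
# whereas a human's irregular gaps spread over many bins and score higher.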

[ACSAC'10] Who is Tweeting on Twitter: Human, Bot, or Cyborg?
Zi Chu, Steven Gianvecchio, Haining Wang, and Sushil Jajodia.
In Proceedings of Annual Computer Security Applications Conference, Austin, TX, USA, December 2010.

Anti-Hotlinking

Unauthorized hotlinking is an unethical practice in which web resources on hosting servers are linked into web pages belonging to hotlinkers. It harms hosting servers by consuming their bandwidth and diverting site traffic. To fully understand the nature of hotlinking, we conduct a large-scale measurement-based study and observe that hotlinking is widespread over the Internet and severe in certain categories of websites. Moreover, we perform a detailed postmortem analysis of a real hotlink-victim site. After analyzing a group of commonly used hotlinking attacks and the weaknesses of current defense methods, we present an anti-hotlinking framework for protecting materials on hosting servers, built on existing network security techniques. The framework can be easily deployed at the server side with moderate modifications and is highly customizable, offering different granularities of protection. We implement a prototype of the framework and evaluate its effectiveness against hotlinking attacks.
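
As a hedged sketch of one common server-side defense in this space (this illustrates the general technique of time-limited signed resource URLs, not the specific framework described in the paper), protected resources can be served only when the request carries a valid token:

# Sketch: serve protected resources only with a valid, time-limited signature,
# so hotlinked copies of the URL go stale quickly.
import hashlib, hmac, time

SECRET = b"replace-with-server-secret"   # assumption: known only to the hosting server

def sign_resource(path, expires):
    msg = f"{path}:{expires}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def make_protected_url(path, ttl=300):
    expires = int(time.time()) + ttl
    return f"{path}?expires={expires}&sig={sign_resource(path, expires)}"

def is_request_authorized(path, expires, sig):
    if int(expires) < time.time():
        return False                      # link has expired
    expected = sign_resource(path, int(expires))
    return hmac.compare_digest(expected, sig)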

[ComCom'10] An Investigation of Hotlinking and Its Countermeasures. [online version]
Zi Chu and Haining Wang.
In Journal of Computer Communications (Elsevier), Vol. 43, No. 4, April 2011.

Web Co-browsing

When we were working on this project in 2008, PDAs played the role of today's tablets: someone on a business trip might carry a PDA instead of a heavy laptop. The project was great fun, and it was covered by MIT's Technology Review and the Computer Power User magazine [read the coverage, Traveling the Web Together, here]. If we had had an iPad back then, we might have come up with an even more interesting app.

This project allows multiple users to access the same web page simultaneously and collaboratively fulfill certain tasks. We propose a simple framework for Real-time Collaborative Browsing (RCB). The RCB framework is a purely browser-based solution: it leverages the power of Ajax techniques and the extensibility of modern web browsers to perform co-browsing. RCB achieves real-time collaboration among web users without the involvement of any third-party servers or proxies, and it enables fine-grained co-browsing on arbitrary web sites and pages. We implement RCB in Firefox and validate its efficacy through real experiments.

[USENIX ATC'09] RCB: A Simple and Practical Framework for Real-time Collaborative Browsing.
Chuan Yue, Zi Chu, and Haining Wang.
In Proceedings of USENIX Annual Technical Conference, San Diego, CA, USA, June 2009.