Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace

Ahmed Abbasi, Hsinchun Chen

Research output: Contribution to journalArticle

217 Citations (Scopus)

Abstract

One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.

Original languageEnglish (US)
Article number7
JournalACM Transactions on Information Systems
Volume26
Issue number2
DOIs
StatePublished - Mar 1 2008

Fingerprint

Electronic mail
Syntactics
Testbeds
Mathematical transformations
Feedback
Cyberspace

Keywords

  • Discourse
  • Online text
  • Style classification
  • Stylometry
  • Text mining

ASJC Scopus subject areas

  • Information Systems

Cite this

@article{b266c8ddf1204cbf87ea16fb07d26a57,
title = "Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace",
abstract = "One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94{\%} when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.",
keywords = "Discourse, Online text, Style classification, Stylometry, Text mining",
author = "Ahmed Abbasi and Hsinchun Chen",
year = "2008",
month = "3",
day = "1",
doi = "10.1145/1344411.1344413",
language = "English (US)",
volume = "26",
journal = "ACM Transactions on Information Systems",
issn = "1046-8188",
publisher = "Association for Computing Machinery (ACM)",
number = "2",

}

TY - JOUR

T1 - Writeprints

T2 - A stylometric approach to identity-level identification and similarity detection in cyberspace

AU - Abbasi, Ahmed

AU - Chen, Hsinchun

PY - 2008/3/1

Y1 - 2008/3/1

N2 - One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.

AB - One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.

KW - Discourse

KW - Online text

KW - Style classification

KW - Stylometry

KW - Text mining

UR - http://www.scopus.com/inward/record.url?scp=42049084142&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=42049084142&partnerID=8YFLogxK

U2 - 10.1145/1344411.1344413

DO - 10.1145/1344411.1344413

M3 - Article

AN - SCOPUS:42049084142

VL - 26

JO - ACM Transactions on Information Systems

JF - ACM Transactions on Information Systems

SN - 1046-8188

IS - 2

M1 - 7

ER -