Identifying language groups within multilingual cybercriminal forums

Victor Benjamin, Hsinchun Chen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

Online cybercriminal communities exist in various geopolitical regions, including America, China, Russia, and more. Some multilingual forums exist where cybercriminals of differing geopolitical origin interact and exchange hacking knowledge and cybercriminal assets. Researchers can study such forums to better understand the global cybercriminal supply chain and cybercrime trends. However, little work has focused on identifying members of different language groups and geopolitical origin within such forums. One challenge is the necessity of a technique that scales across multiple languages. We are motivated to explore computational techniques that support automated and scalable categorization of cybercriminal forum participants into varying language groups. In particular, we make use of Paragraph Vectors, a state-of-the-art neural network language model, to generate fixed-length vector representations (i.e., document embeddings) of messages posted by forum participants. Results indicate Paragraph Vectors outperforms traditional n-gram frequency approaches for generating document embeddings that are useful for clustering cybercriminals into language groups.
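As a rough illustration of the n-gram frequency baseline the abstract compares against, the sketch below builds character bigram count vectors for a few forum-style messages and groups them by cosine similarity. It is not the authors' implementation and does not reproduce Paragraph Vectors, which requires a trained neural model. The function names, the bigram size, the 0.2 threshold, the greedy grouping rule, and the sample messages are all assumptions chosen for illustration; the abstract does not specify a clustering algorithm.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a lowercased message."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_by_language(messages, threshold=0.2):
    """Greedy grouping (an assumed, simplified stand-in for clustering):
    a message joins the first group whose seed vector is similar enough,
    otherwise it starts a new group."""
    seeds, labels = [], []
    for message in messages:
        vec = char_ngrams(message)
        for label, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(label)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


# Hypothetical sample messages, invented for this example.
messages = [
    "selling fresh accounts and tools",
    "продаю свежие аккаунты и инструменты",
    "looking for tools and fresh exploits",
    "ищу инструменты и свежие эксплойты",
]
print(group_by_language(messages))
```

Messages written in different scripts share no character bigrams, so their similarity is zero and they fall into separate groups. Languages that share a script would need longer n-grams or a learned embedding to separate reliably, which is the kind of case where the abstract reports Paragraph Vectors performing better.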

Original language: English (US)
Title of host publication: IEEE International Conference on Intelligence and Security Informatics: Cybersecurity and Big Data, ISI 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 205-207
Number of pages: 3
ISBN (Electronic): 9781509038657
DOI: https://doi.org/10.1109/ISI.2016.7745471
State: Published - Nov 15 2016
Event: 14th IEEE International Conference on Intelligence and Security Informatics, ISI 2016 - Tucson, United States
Duration: Sep 28 2016 to Sep 30 2016

Other

Other: 14th IEEE International Conference on Intelligence and Security Informatics, ISI 2016
Country: United States
City: Tucson
Period: 9/28/16 to 9/30/16

Fingerprint

Supply chains
Neural networks
Language
Language model
Russia
China
Knowledge exchange
Global supply chain
Clustering
Computational techniques
Assets
Cybercrime
Online communities

Keywords

  • Cybercriminal community
  • Cybersecurity
  • Language modeling
  • Multilingual
  • Neural network

ASJC Scopus subject areas

  • Information Systems
  • Artificial Intelligence
  • Computer Networks and Communications
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality

Cite this

Benjamin, V., & Chen, H. (2016). Identifying language groups within multilingual cybercriminal forums. In IEEE International Conference on Intelligence and Security Informatics: Cybersecurity and Big Data, ISI 2016 (pp. 205-207). [7745471] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ISI.2016.7745471

@inproceedings{80497836041b43ddbdae13d39f858fec,
title = "Identifying language groups within multilingual cybercriminal forums",
abstract = "Online cybercriminal communities exist in various geopolitical regions, including America, China, Russia, and more. Some multilingual forums exist where cybercriminals of differing geopolitical origin interact and exchange hacking knowledge and cybercriminal assets. Researchers can study such forums to better understand the global cybercriminal supply chain and cybercrime trends. However, little work has focused on identifying members of different language groups and geopolitical origin within such forums. One challenge is the necessity of a technique that scales across multiple languages. We are motivated to explore computational techniques that support automated and scalable categorization of cybercriminal forum participants into varying language groups. In particular, we make use of Paragraph Vectors, a state-of-the-art neural network language model, to generate fixed-length vector representations (i.e., document embeddings) of messages posted by forum participants. Results indicate Paragraph Vectors outperforms traditional n-gram frequency approaches for generating document embeddings that are useful for clustering cybercriminals into language groups.",
keywords = "Cybercriminal community, Cybersecurity, Language modeling, Multilingual, Neural network",
author = "Victor Benjamin and Hsinchun Chen",
year = "2016",
month = "11",
day = "15",
doi = "10.1109/ISI.2016.7745471",
language = "English (US)",
pages = "205--207",
booktitle = "IEEE International Conference on Intelligence and Security Informatics: Cybersecurity and Big Data, ISI 2016",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States"
}

TY  - GEN
T1  - Identifying language groups within multilingual cybercriminal forums
AU  - Benjamin, Victor
AU  - Chen, Hsinchun
PY  - 2016/11/15
Y1  - 2016/11/15
AB  - Online cybercriminal communities exist in various geopolitical regions, including America, China, Russia, and more. Some multilingual forums exist where cybercriminals of differing geopolitical origin interact and exchange hacking knowledge and cybercriminal assets. Researchers can study such forums to better understand the global cybercriminal supply chain and cybercrime trends. However, little work has focused on identifying members of different language groups and geopolitical origin within such forums. One challenge is the necessity of a technique that scales across multiple languages. We are motivated to explore computational techniques that support automated and scalable categorization of cybercriminal forum participants into varying language groups. In particular, we make use of Paragraph Vectors, a state-of-the-art neural network language model, to generate fixed-length vector representations (i.e., document embeddings) of messages posted by forum participants. Results indicate Paragraph Vectors outperforms traditional n-gram frequency approaches for generating document embeddings that are useful for clustering cybercriminals into language groups.
KW  - Cybercriminal community
KW  - Cybersecurity
KW  - Language modeling
KW  - Multilingual
KW  - Neural network
UR  - http://www.scopus.com/inward/record.url?scp=85003874923&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85003874923&partnerID=8YFLogxK
U2  - 10.1109/ISI.2016.7745471
DO  - 10.1109/ISI.2016.7745471
M3  - Conference contribution
AN  - SCOPUS:85003874923
SP  - 205
EP  - 207
BT  - IEEE International Conference on Intelligence and Security Informatics: Cybersecurity and Big Data, ISI 2016
PB  - Institute of Electrical and Electronics Engineers Inc.
ER  -