TY - GEN
T1 - Evolution of privacy loss in Wikipedia
AU - Rizoiu, Marian Andrei
AU - Xie, Lexing
AU - Caetano, Tiberio
AU - Cebrian, Manuel
N1 - Publisher Copyright: © 2015 Copyright held by the owner/author(s).
PY - 2016/2/8
Y1 - 2016/2/8
AB - The cumulative effect of collective online participation has an important and adverse impact on individual privacy. As an online system evolves over time, new digital traces of individual behavior may uncover previously hidden statistical links between an individual's past actions and her private traits. To quantify this effect, we analyze the evolution of individual privacy loss by studying the edit history of Wikipedia over 13 years, including more than 117,523 different users performing 188,805,088 edits. We trace each Wikipedia contributor using apparently harmless features, such as the number of edits performed on predefined broad categories in a given time period (e.g., Mathematics, Culture or Nature). We show that even at this unspecific level of behavior description, it is possible to use off-the-shelf machine learning algorithms to uncover usually undisclosed personal traits, such as gender, religion or education. We provide empirical evidence that the prediction accuracy for almost all private traits consistently improves over time. Surprisingly, the prediction performance for users who stopped editing after a given time still improves. The activities performed by new users seem to have contributed more to this effect than additional activities from existing (but still active) users. Insights from this work should help users, system designers, and policy makers understand and make long-term design choices in online content creation systems.
KW - De-anonymization
KW - Online privacy
KW - Temporal loss of privacy
UR - http://www.scopus.com/inward/record.url?scp=84964380950&partnerID=8YFLogxK
U2 - 10.1145/2835776.2835798
DO - 10.1145/2835776.2835798
M3 - Conference contribution
T3 - WSDM 2016 - Proceedings of the 9th ACM International Conference on Web Search and Data Mining
SP - 215
EP - 224
BT - WSDM 2016 - Proceedings of the 9th ACM International Conference on Web Search and Data Mining
PB - Association for Computing Machinery, Inc
T2 - 9th ACM International Conference on Web Search and Data Mining, WSDM 2016
Y2 - 22 February 2016 through 25 February 2016
ER -