So, how can we verify who penned a historical text known only from fragments? How can we establish the true creator of an Internet lampoon? How can we really determine if the text of a thesis or doctoral dissertation is not plagiarized? In many cases, traditional stylometric methods fail or do not lead to sufficiently reliable conclusions.
In Information Sciences, scientists from the Institute of Nuclear Physics of the Polish Academy of Sciences (IFJ PAN) in Cracow have presented their own statistical tool for stylometric analysis. Constructed with the use of graphs, it makes it possible to look at the structure of texts in a qualitatively new way.
“The conclusions of our research are, on the one hand, encouraging. They indicate that the individuality of any person manifests itself clearly in the way they use a surprisingly small number of words. But there is also another, darker side of the coin. Since it turns out we are so original, it will be easier to identify us by our statements,” says Prof. Stanislaw Drozdz (IFJ PAN, Cracow University of Technology). ...
The method hinges on the use of networks and graphs:
“We suggested that the characteristic features of the style be sought in a network representation of the text, using graphs,” explains Tomasz Stanisz, PhD student at the IFJ PAN and the first author of the publication, and he specifies: “The graph is a collection of points, or vertices of the graph, connected by lines, i.e. the edges of the graph. In the simplest case – in the so-called unweighted network – the vertices correspond to individual words and are connected by edges if and only if two given words have occurred adjacent to each other at least once in the text. For example, for the sentence ‘Jane is hungry’, the graph would have three vertices, one for each word, but there would only be two edges, one between ‘Jane’ and ‘is’, the other between ‘is’ and ‘hungry’.” ...
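To make the unweighted network from the quote concrete, here is a minimal Python sketch of it. This is an illustration only, not the authors' code: the function name `word_adjacency_graph` is hypothetical, and the tokenization (lowercasing, dropping punctuation, ignoring sentence boundaries, skipping a word that directly follows itself) is an assumption that the excerpt does not specify.

```python
import re


def word_adjacency_graph(text):
    """Build an unweighted, undirected word-adjacency graph.

    Returns (vertices, edges): vertices is a set of words, and edges is a
    set of frozenset pairs, one for each pair of distinct words that occur
    next to each other at least once in the text.
    """
    # Assumption: words are runs of word characters, compared case-insensitively.
    words = re.findall(r"\w+", text.lower())
    vertices = set(words)
    edges = set()
    # Walk over consecutive word pairs; a set keeps each edge only once,
    # which matches the "at least once" rule of the unweighted network.
    for left, right in zip(words, words[1:]):
        if left != right:  # Assumption: no self-loops for repeated words.
            edges.add(frozenset((left, right)))
    return vertices, edges


vertices, edges = word_adjacency_graph("Jane is hungry")
print(len(vertices), "vertices,", len(edges), "edges")
```

For the example sentence from the quote this gives three vertices and two edges, one joining "jane" and "is" and one joining "is" and "hungry". A weighted variant could count how often each pair occurs instead of only recording that it does.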
Cool stuff. Read the rest at the link.