Literature Fingerprinting:
A New Method for Visual Literary Analysis

Daniel A. Keim∗ and Daniela Oelke†
University of Konstanz

ABSTRACT
In computer-based literary analysis, different types of features are used to characterize a text. Usually, only a single feature value or vector is calculated for the whole text. In this paper, we combine automatic literature analysis methods with an effective visualization technique to analyze the behavior of the feature values across the text. For an interactive visual analysis, we calculate a sequence of feature values per text and present them to the user as a characteristic fingerprint. The feature values may be calculated on different hierarchy levels, allowing the analysis to be done at different resolution levels. A case study shows several successful applications of our new method to known literature problems and demonstrates the advantage of our new visual literature fingerprinting.

Keywords: Visual literature analysis, visual analytics, literature fingerprinting

Index Terms: J.5 [Computer Applications]: Arts and Humanities—Linguistics, Literature; I.6.9 [Visualization]: Information Visualization—Visualization Techniques and Methodologies

1 INTRODUCTION
Traditional literary analysis is mostly done without the use of a computer. One of the reasons is obvious: to properly understand a text, not only the words that are used are important but also the context they are used in, and understanding them in context is difficult to achieve algorithmically. However, there are some fields of literary analysis in which computers have already proved useful in the past. This includes the classification of texts and some aspects of literary criticism. Often these methods are based on features that are supposed to characterize the text. Feature extraction methods can be as simple as calculating the average sentence length or as complicated as estimating the vocabulary richness of a text. In the case of text classification, conventional classification methods such as Support Vector Machines or Bayesian Networks, which work fully automatically, are often used. In other applications, for example in the case of authorship attribution, more transparency is required. Then, nearest-neighbor classification is a popular approach, in which an unclassified document is attributed to the author with the most similar features, given some reference documents with known authorship. For methods that are based on multidimensional feature vectors, a transformation to a low-dimensional space is often performed (using PCA, SVD, or the Karhunen-Loève transform) and the results are visualized in a two-dimensional scatterplot [3, 9, 12] or with bar charts [9]. All the approaches have in common that a single feature vector or value is used to characterize the whole text. This means that a lot of information is disregarded, since the change of the values as the text proceeds can reveal characteristic traits of an author or show interesting patterns (see fig. 1 for a first impression).

∗e-mail: keim@inf.uni-konstanz.de
†e-mail: oelke@inf.uni-konstanz.de

IEEE Symposium on Visual Analytics Science and Technology 2007
October 30 - November 1, Sacramento, CA, USA
978-1-4244-1659-2/07/$25.00 ©2007 IEEE

Figure 1: Visualization of the two novels “The Iron Heel” and “Children of the Frost” by Jack London. Color is mapped to vocabulary richness. It can easily be seen that the structure of the two novels is very different with respect to this measure. This would be camouflaged if only a single value were calculated for each book.

Our idea presented in this paper is to calculate the features for different hierarchy levels of the text (such as words, sentences, chapters, . . . ) and create a characteristic fingerprint of the text which contains significantly more information than just a single number and therefore enables the user to gain a deeper understanding. We successfully apply the method to known literature problems (e.g. authorship attribution) and show that by combining the automatic analytical methods of literature science with an effective visualization technique, new insights into literary texts can be gained.

Outline
The rest of the paper is organized as follows: In section 2, the different types of computer-based literary analysis are introduced and different variables for literary analysis are briefly reviewed. Using some of the variables, in section 3 we then test their discrimination power with respect to the authorship attribution problem using novels of Mark Twain and Jack London. In section 4, we locally analyze the literature fingerprints of two novels of Twain and London and of a much bigger and more diverse text, the bible. Section 5 introduces our framework and finally, section 6 concludes the paper and outlines some interesting future applications of our new technique that are planned together with literature scientists.

2 BASICS FROM LITERATURE SCIENCE
2.1 Computer-based literary analysis
In a number of recent digital library projects, huge amounts of literature have already been digitized. In order to be able to computationally support the analysis of these texts, algorithms are needed that can cope with all levels of natural language, namely the lexical, syntactic, and semantic aspects. Although the field of natural language processing has made significant progress over the last years, there are still a number of aspects the algorithms cannot cope with properly. This is especially true if the semantics or meaning of a text has to be taken into account, because of the vast number of words which have different meanings in different contexts, and the impressive flexibility and complexity of natural language. Still, computers are of great help whenever the lexical or syntactic structure of a text needs to be analyzed. In this section, we introduce two fields of computer-based literary analysis in which computers have already been successfully applied, namely text classification and different types of literary criticism.

Three different types of text classification can be distinguished: topic-oriented classification, classification into genres, and authorship attribution. For topic-oriented classification, TF-IDF vectors are often used to characterize a text. TF-IDF (term frequency - inverse document frequency) vectors are made up of the frequency of each term in the document, weighted by the importance of that term with respect to the other documents in the corpus. Intuitively, a term is seen as characteristic for a document if its frequency within the document is much higher than its frequency in the rest of the corpus [10]. A typical application of topic-oriented classification is, e.g., labeling newspaper articles as Politics, Business, Sports, or Entertainment. For discrimination between different literary genres, such as Fiction / Non-Fiction, Children's literature, Travel literature, Poetry, etc., it is useful to consider the grammatical structure, the parts of speech used, and even the layout of the text in addition to TF-IDF vectors. In contrast, a special requirement of authorship attribution is that the extracted features should not be consciously controllable by the writer, to prevent the method from being misdirected by a forged text. Note that for all mentioned methods the quality of the classification highly depends on the suitability of the features for discriminating the objects of the given categories. Therefore, enabling the user to understand the discrimination power of the features with respect to the classification task is of high importance.
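
The TF-IDF weighting described above can be sketched as follows. This is an illustrative minimal implementation, not the code used in any of the cited systems; the two-document corpus and the whitespace tokenization are assumptions made for the example:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    A term scores high for a document when it is frequent there
    but rare in the rest of the corpus.
    """
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = ["the match ended in a draw".split(),
        "the election ended in a runoff".split()]
w = tf_idf(docs)
# "match" is specific to the first document; "the" occurs everywhere
assert w[0]["match"] > 0 and w[0]["the"] == 0.0
```

With this weighting, terms shared by every document (such as function words) receive weight zero, which is exactly why TF-IDF suits topic-oriented classification but not authorship attribution.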

Computer-assisted literary criticism is a rather young field in the studies of literature. According to [6], a frequently mentioned objection is that the words and sentences of a text cannot be analyzed without properly taking the context they are used in into account. Therefore, most researchers in literary studies only use computers to collect data that is afterwards analyzed conventionally. Yet, there are some cases in which the computer has already proven useful, e.g., for comparing an author's revisions (from version to version) or for the analysis of prosody and poetic phonology. Computer-assisted studies have also been performed in the context of sequence analysis, such as assigning quoted passages to speakers and locating them in the sequence of the text [6]. Another interesting field for computer-assisted analysis is translation criticism, in which metrics, rhythm, style, and other variables of the original text and the translation are compared to evaluate the quality of the translation.

All mentioned approaches have in common that one feature vector or value is calculated per text or per text block. Even if the text is split into chapters and paragraphs, the value for each chapter or paragraph is usually considered as a single object. In most cases, the values are averaged over the whole text, which leads to a smoothing of passages with an unusual trend, camouflaging interesting patterns. None of the computer-based literature analysis methods used so far deals with the behavior of the values across the text, which means that this very important information is completely ignored.
In our approach, we therefore consider the literature in more detail by analyzing the texts on different hierarchy levels (i.e. calculating one value per sentence, paragraph, chapter, or text block). By visualizing the results of the detailed literature analysis together with their position in the text, even local analyses become possible. Moreover, the comparison of the visualizations for different variables leads to insights about the discrimination power of the different literature analysis variables. Since the success of each of the methods highly depends on an appropriate choice of the feature analysis variables, the possibility to efficiently compare the effectiveness of the variables with respect to a specific task provides new ways of in-depth literature analysis.

2.2 Variables for literary analysis
Different variables for literary analysis have been proposed. They can roughly be classified into three groups: statistical measures, vocabulary measures, and syntax measures. In [8], a comprehensive survey on variables for literary analysis with a focus on authorship attribution can be found. Information about variables for text theme classification can, for example, be found in [7]. In this subsection, we briefly introduce some important text analysis measures to give the reader an overview of the field and provide the necessary background knowledge for the following sections. The focus will be on variables which measure the stylistic traits of literary texts in general and the style of an author in particular.

Statistical measures
Calculating the average word length or the average number of syllables per word are two simple ways to characterize a text. While the first one does not provide reliable results, the second one can be useful to distinguish different genres. This is intuitively plausible because in poetic texts the number of syllables of a word is much more important than in prose texts.
Sentence length is an indicator of style that can be used to estimate how well the rhythm of a text is preserved in a translation. It is also used for authorship attribution studies, although in that context it can be problematic, since the length of the sentences is consciously controllable by an author and is not meaningful if the text has been edited by someone else. It has been shown that the distribution of sentence length is a more reliable marker for authorship than the average sentence length. Yet, it is also more difficult to evaluate. Here our technique proves useful, because the visualization of the results allows an effective comparison of the distributions.
Instead of working on the words directly, it is also possible to analyze the proportions of certain parts of speech (POS) (such as nouns, verbs, adjectives, . . . ) in the text. In this way, the degree of formality of a text can be measured or the style of a text can be compared to its translation in another language.
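
As an illustrative sketch, the sentence-length statistics mentioned above can be computed as follows. The naive period-based splitting is an assumption made for brevity; a real pipeline would use proper sentence segmentation:

```python
from collections import Counter

def sentence_lengths(text):
    """Sentence lengths in words, using naive end-of-sentence splitting."""
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s.strip() for s in normalized.split(".")]
    return [len(s.split()) for s in sentences if s]

text = "Call me Ishmael. It was a dark and stormy night. Why?"
lengths = sentence_lengths(text)
avg = sum(lengths) / len(lengths)     # average sentence length
dist = Counter(lengths)               # sentence-length distribution
assert lengths == [3, 7, 1]
```

The distribution `dist`, rather than the single average `avg`, is what the more reliable authorship marker discussed above is based on.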

Vocabulary measures
Vocabulary measures are based on the assumption that authors (and their texts) differ from each other with respect to vocabulary richness (how many words are in the vocabulary of the author, and is s/he able to use this vocabulary by applying new words as the text proceeds) and with respect to word usage (which words are preferred if several can be applied).

To measure the characteristic word usage of an author, the frequencies of specific words are counted. The success of this method highly depends on the appropriate choice of words for which the frequencies are compared. Different approaches have been suggested, e.g., to group the words into categories such as idiomatic expressions, scientific terminology, or formal words, and count the number of occurrences for each group or compare the frequency distributions of the words. Good results have been reported for function words such as "the, and, to, of, in, . . . " as the set of words. According to [5], function words have the advantage that writers cannot avoid using them, which means that they can be found in every text and almost every sentence. Furthermore, they have little semantic meaning and are therefore among the words that are least dependent on context. With the exception of auxiliary words, they are also not inflected, which simplifies counting them. Finally, the choice of specific function words is mainly made unconsciously, which makes it an interesting measure for authorship attribution.
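
A function-word feature vector of the kind described above can be sketched as follows. The word list is shortened for illustration (section 3 uses a list of 52 function words), and the function name is ours:

```python
from collections import Counter

# Shortened illustrative list; the case study uses 52 function words.
FUNCTION_WORDS = ["the", "and", "to", "of", "in", "a", "that", "is"]

def function_word_vector(tokens):
    """Relative frequency of each function word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in FUNCTION_WORDS]

vec = function_word_vector("the cat and the dog sat in the garden".split())
assert vec[0] == 3 / 9   # "the" occurs three times in nine tokens
```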

Measures of vocabulary richness are mainly based on the evaluation of the number of tokens and types. In the following, let N denote the number of tokens (that is, the number of word occurrences which form the sample text, i.e. the text length), V the number of types (the number of lexical units which form the vocabulary of the sample, i.e. the number of different words), and Vr the number of lexical units that occur exactly r times. A simple measure for vocabulary richness is the type-token ratio (R), defined as

R = V / N.

This measure has one severe disadvantage, namely its dependency on the length of the text. A more sophisticated method to measure vocabulary richness is Simpson's Index (D), which calculates the probability that two arbitrarily chosen words will belong to the same type. D is calculated by dividing the total number of identical pairs by the number of all possible pairs:

D = sum_{r=1}^{inf} r(r - 1)Vr / (N(N - 1)).

While Simpson's Index takes the complete frequency profile into account, there are also measures that focus on just one specific part of the profile. For example, [8] reports that Honoré suggested a measure that tests the tendency of an author to choose between a word used previously or a new word instead, which can be calculated as

R = 100 log N / (1 - V1/V)

and is based on the number of Hapax Legomena (V1) of a text, that is, the number of words that occur exactly once. The method is said to be stable for texts with N > 1300. Similarly, the Hapax Dislegomena (V2), the words that occur exactly twice, can be used to characterize the style of an author. According to [8], Sichel found that the proportion of hapax dislegomena (V2/V) is stable for a particular author for 1,000 < N < 400,000. At first this seems counterintuitive, but with increasing text length not only do more words appear twice, but words that formerly occurred twice now occur three times and therefore leave the set of hapax dislegomena.
Many other methods to measure vocabulary richness exist. The interested reader should consult [8] for a deeper investigation of the topic.

Syntax measures
Syntax-based measures analyze the syntactical structure of the text and are based on the syntax trees of the sentences. As the syntactical structure contains additional information, syntax measures have a high potential in literature analysis and have already been used in some projects. In [4], an experiment is reported in which a new syntax-based approach was tested against some word-based methods and was shown to beat them. In another approach [11], the authors build up syntax trees and develop different methods to analyze the writing style, the syntax depth, and functional dependencies by evaluating the trees. Note that, to a certain extent, the usage of function words also takes the syntax into account, because some function words mark the beginning of subordinate clauses or connect main clauses. They therefore allow inferences about the sentence structure without analyzing the syntax directly.

3 AUTHORSHIP ATTRIBUTION
3.1 The concept of authorship attribution
The goal of authorship attribution is to determine the authorship of a text when it is unknown by whom the text has been written or when the authorship is disputed. Authorship attribution can also be used when there is doubt whether the person that claims to have written the text is really the creator. One example of such a doubtful situation is the assignment of the 15th book of the Wizard of Oz series. The book was published after the death of its author L. Frank Baum and was said to have been only edited by his successor Ruth Thompson, who wrote the next books of the series. However, some literature specialists think that Ruth Thompson also wrote the 15th book and that the attribution to Baum was only due to commercial motives, to ease the transition from one author to the next without losing sales. See [5] for an interesting analysis of the problem.
Authorship attribution has also been named stylometry, because the classification is based on the distinct stylistic traits of a document and is independent of its semantic meaning. To measure style, certain features of the text are extracted that clearly discriminate the literary work of one author from another. Classical authorship attribution is mostly done on a purely statistical basis, excluding non-numeric measures. To get reliable results, enough texts of the potential writers with known authorship have to be available as a basis for attributing the text in doubt to one of them.

3.2 Case study with literature of Mark Twain and Jack London
In this subsection, we present the results of a study with literature of Mark Twain and Jack London. Our goal was to test the existing literature analysis measures and see whether our detailed visual representation leads to new insights.

In our study we used the following texts, which are all publicly available from Project Gutenberg [1]:
• Jack London:
- The Call of the Wild
- Children of the Frost
- The Iron Heel
- Jerry of the Islands
- The Sea-Wolf
- The Son of the Wolf.
• Mark Twain:
- A Connecticut Yankee in King Arthur's Court
- A Tramp Abroad
- Chapters From My Autobiography
- Following the Equator
- The Adventures of Huckleberry Finn
- The Innocents Abroad
- Life on the Mississippi
- The Prince and the Pauper
- The Gilded Age: A Tale of Today
- The Adventures of Tom Sawyer.

We preprocessed the texts by removing the preamble and other Gutenberg-specific parts of the document and by replacing short forms with the corresponding long forms (e.g. isn't → is not). Afterwards we used the Stanford POS tagger to annotate the texts [2]. For that we had to remove the chapter titles, since the tagger is only able to cope with complete sentences (though it is fault-tolerant with some grammatical errors). Finally, we split the documents into blocks with a fixed number of words each to be able to show the behavior of the variable values across the text. The number of words per block can be chosen by the user. For this paper, we set the number of words per block to 10,000, but similar results are obtained for a wide variation of this number as long as the blocks are not too small (> 1,000 words), since some literature analysis measures provide unstable results when applied to short texts. To obtain a continuous and split-point-independent series of values, we overlap the blocks with the neighboring blocks by about 9,000 words.

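
The vocabulary-richness measures introduced in section 2.2 (type-token ratio, Simpson's Index, and Honoré's R), which the case study computes per text block, can be sketched as follows. This is an illustrative implementation, not the authors' code; the base-10 logarithm in Honoré's R is an assumption:

```python
import math
from collections import Counter

def richness_measures(tokens):
    """Type-token ratio, Simpson's Index, and Honore's R for a token list."""
    n = len(tokens)                                 # N: number of tokens
    freq = Counter(tokens)                          # type -> occurrences
    v = len(freq)                                   # V: number of types
    v1 = sum(1 for c in freq.values() if c == 1)    # V1: hapax legomena
    ttr = v / n
    # Probability that two randomly chosen tokens belong to the same type
    simpson = sum(c * (c - 1) for c in freq.values()) / (n * (n - 1))
    # Undefined when every type is a hapax legomenon (V1 == V)
    honore = 100 * math.log10(n) / (1 - v1 / v)
    return ttr, simpson, honore

tokens = "to be or not to be that is the question".split()
ttr, simpson, honore = richness_measures(tokens)
```

Note how summing `c * (c - 1)` over the frequency table is equivalent to the sum over r(r - 1)Vr in the formula for D, since each type with frequency c contributes one term.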

(a) Function words (First Dimension after PCA)
(b) Function words (Second Dimension after PCA)
(c) Average sentence length
(d) Simpson's Index
(e) Hapax Legomena
(f) Hapax Dislegomena

Figure 2: Fingerprints of books of Mark Twain and Jack London. Different measures for authorship attribution are tested. If a measure is able to discriminate between the two authors, the visualizations of the books that are written by the same author will resemble each other more than the visualizations of books written by different authors. It can easily be seen that this is not true for every measure (e.g. Hapax Dislegomena). Furthermore, it is interesting to observe that the book Huckleberry Finn sticks out in a number of measures, as if it were not written by Mark Twain.


This results in a soft blending of the values instead of hard cuts and therefore enables the user to easily follow the development of the values across the text.
As a visual representation of the results, we depict each text block as a colored square and line the squares up from left to right and top to bottom. Although very simple, this is an effective visualization, since the order of the text blocks is very important and the alignment corresponds to the standard reading direction. We also experimented with other shapes such as rounded rectangles, squares with beveled borders, and circles. However, it turned out that the perception of a trend is easiest when displayed on a closed area with no borders visible. For the comparison of discrete values the other shapes are more useful. If a hierarchy has been defined on the text (made up of chapters, pages of the book, paragraphs, etc.), the pixels are visually grouped according to that hierarchy. Thereby, the structure of the text can be visually perceived, and patterns that discern one passage from the others become obvious.
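
The overlapping block scheme behind this soft blending, producing one feature value per block as the sequence for a fingerprint, can be sketched as follows. Block and overlap sizes follow the paper; the function names and the repeated toy text are ours:

```python
from collections import Counter

def overlapping_blocks(tokens, block_size=10_000, step=1_000):
    """Blocks of block_size words, each overlapping its neighbor by
    block_size - step words (9,000 for the paper's settings)."""
    last_start = max(len(tokens) - block_size, 0)
    return [tokens[s:s + block_size] for s in range(0, last_start + 1, step)]

def fingerprint(tokens, measure, block_size=10_000, step=1_000):
    """One feature value per block: the value sequence behind a fingerprint."""
    return [measure(b) for b in overlapping_blocks(tokens, block_size, step)]

def type_token_ratio(block):
    return len(Counter(block)) / len(block)

# Toy stand-in for a novel: 27,000 tokens of a repeated sentence
tokens = "the quick brown fox jumps over the lazy dog".split() * 3000
series = fingerprint(tokens, type_token_ratio)
assert len(series) == 18   # block starts at 0, 1,000, ..., 17,000
```

Because consecutive blocks share 9,000 of their 10,000 words, adjacent values in `series` change gradually, which is what makes the rendered squares blend softly rather than jump.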
Since function word analysis is known as one of the most successful methods for discriminating the texts of different authors, we started our analysis with this measure. We took a list of 52 function
words that was also used in [5]. For each text block, a feature vector was calculated by counting the frequency of each of the function words, resulting in a 52-dimensional feature vector. We then applied principal component analysis (PCA) to the feature vectors to linearly transform the data to a new coordinate system in which the first dimension accounts for the largest variance, the second dimension for the second largest variance, and so on. Figure 2(a) shows the values of the first dimension. We use a bipolar, interactively adjustable colormap to map the values to color. If a measure is able to discriminate the two authors, the books of one author will be mainly in blue and the books of the other one will be mainly in red. It is obvious that this is not the case here. What sticks out immediately is Mark Twain's The Adventures of Huckleberry Finn. This novel seems to differ more from all the other writings of Mark Twain than the writings of the two authors differ from each other. If we visualize the second dimension of the transformed function word vectors, we can see that the books of the two authors now separate from each other (figure 2(b)) - again with the exception of Huckleberry Finn (and this time also The Adventures of Tom Sawyer), which we would rather attribute to London than to Twain if its authorship were unknown. To analyze the strange behavior of Huckleberry Finn, we tested other variables such as sentence length, Simpson's Index, the Hapax Legomena measure of Honoré, and the Hapax Dislegomena ratio (see section 2.2 for an introduction of the variables). Figures 2(c) - 2(f) show the visualizations for the different measures. In fig. 2(e), Huckleberry Finn again clearly stands apart. The Simpson's Index shown in fig. 2(d) would again mislead us to attribute the book to Jack London, whereas in 2(c) it nicely fits all the other books of Mark Twain. Finally, the Hapax Dislegomena shown in 2(f) seems to have no discriminative power and is therefore not useful for the analysis. Taking all analysis measures into account, it is clear that there is something special about Mark Twain's The Adventures of Huckleberry Finn. The reasons for the exceptional behavior cannot be answered by our analysis. The potential explanations range from language particularities, such as the Southern dialect of the novel, which may mislead some of the measures, over the editing of the text in Project Gutenberg, to the surprising speculation that a ghostwriter was involved in the creation of the novel.
On the more general side, the figures show that not every variable is able to discriminate between the books of Mark Twain and those of Jack London, and this is also true if the novel Huckleberry Finn is excluded from the study. In fig. 2(f) (Hapax Dislegomena), we do not see much of a difference between the texts at all. The statement of Sichel that the proportion of Hapax Dislegomena in a text is specific for an author [8] cannot be verified, at least for these two authors. In contrast, the sentence length measure (see fig. 2(c)) allows a very nice discrimination between the two authors. Mark Twain's books on average have longer sentences than Jack London's books. Only one novel per writer, namely Jack London's Jerry of the Islands and Mark Twain's The Adventures of Tom Sawyer, breaks ranks and might be attributed to the other author. The second PCA dimension of the function word vector (fig. 2(b)) and the Simpson's Index (fig. 2(d)) also provide very nice results. Based on the Simpson's Index, we can observe a trend to a higher vocabulary richness (less repetition) in the writings of Mark Twain than in the books of Jack London.
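
The PCA step described above can be sketched as follows: a minimal NumPy implementation via the eigendecomposition of the covariance matrix. The four short input vectors stand in for the 52-dimensional per-block function-word vectors and are invented for illustration:

```python
import numpy as np

def pca(X, k=2):
    """Project the rows of X onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)                 # center the features
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :k]    # top-k principal directions
    return Xc @ components

# Stand-ins for per-block function-word frequency vectors
X = np.array([[0.30, 0.10, 0.05],
              [0.28, 0.12, 0.06],
              [0.10, 0.30, 0.20],
              [0.12, 0.28, 0.19]])
Y = pca(X, k=2)
assert Y.shape == (4, 2)
```

In the fingerprints, each row of `Y` would then be mapped to a color via the bipolar colormap; blocks from stylistically similar texts land on the same side of zero in the dominant dimension.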

Figure 3: The figure shows the fingerprints of two novels that have almost the same average sentence length. In the detailed view, the different structure of the two novels is revealed. The inhomogeneity of the travelogue Following the Equator can be explained by the alternation of dialogs, narrative parts, and quoted documents.

4 DETAIL ANALYSIS OF LITERATURE FINGERPRINTS
The task of authorship attribution is an interesting application of our visual literature fingerprints. To reveal their full power, in this section we look at the visual fingerprints in more detail.

4.1 Detail analysis of two novels
In this subsection, we analyze two books whose average sentence length is about the same. The images in Figure 3 show the result of splitting the text into overlapping text blocks of 10,000 words each (with an overlap of 9,000 words) and calculating the average sentence length block-wise. The visual fingerprints reveal that the structure of the two books is totally different despite their identical overall average values. While the average sentence length in Jerry of the Islands of Jack London does not differ much across the novel (and thus the total average value would be meaningful), there are significant variations in Following the Equator of Mark Twain. Following the Equator is a non-fiction travelogue that Mark Twain wrote as an account of his tour of the British Empire in 1895. In fig. 3, some passages stick out as they are in dark blue or dark red, respectively. Taking a closer look at the text reveals the reasons: The long stripe in dark blue in the first line, for example, represents a passage in which Mark Twain quotes the scientific text of a naturalist with rather complex and long sentences. On the other hand, in the dark red passages in the second and third line, Mark Twain noted some conversations that he had during his travels, with the short sentences of spoken language. The second dialog is directly followed by the quotation of a written report about a murder. One would rather expect such a report to be characterized by long sentences. This is probably why Twain himself utters his surprise about the text in his book. He says:

“It is a remarkable paper. For brevity, succinctness, and concentration, it is perhaps without its peer in the literature of murder. There are no waste words in it; there is no obtrusion of matter not pertinent to the occasion, nor any departure from the dispassionate tone proper to a formal business statement.” [13]

The dark blue area in the fourth line is due to a historical report of the Black Death and an official report of the trial.

4.2 Detail analysis of the bible
In a second study, we analyzed the visual fingerprint of the bible. In this case, we used the existing hierarchy of the text to define the blocks. While every text has an inherent syntactical

Figure 5: Visual Fingerprint of the Bible. More detailed view of the bible, in which each pixel represents a single verse and verses are grouped into chapters. Color is again mapped to verse length. The detailed view reveals some interesting patterns that are camouflaged in the averaged version of fig. 4.

Figure 6: The visualization of this paper helped to find the longest sentences and to improve the readability by modifying them appropriately.

can discern details that were camouflaged before. In figure 5, each pixel represents a single verse. The verses are grouped into chapters and the chapters are grouped into books. Again, verse length is mapped to color. In the detailed view, we may be able to gain additional new insights. Note, for example, that a lot of chapters in Job start with a short verse. In the book of Job, Job and his friends take turns at giving long monologues. As the chapter borders were drawn between two speeches, most of the chapters start with a short verse like “Then Job / Bildad / Zophar / Eliphaz answered and said”. Another interesting observation is the clear division of Nehemiah 10 into two parts. The first one, the one with the short verses, consists of a list of persons that signed a treaty, whereas the second part is a historical report. In the coarse representation of fig. 4, we were not able to discern that, because the average value of the whole chapter is not much different from that of other chapters. Another interesting observation is the regular pattern of Numbers 7, which appears odd. Looking into the text, we find reports of how each tribe offers its dedication for the tabernacle. Since the offerings of the tribes were all similar, almost the same text is used for every tribe and therefore the text is repeated twelve times.

4.3 Analysis of this paper
Since our visual fingerprinting can be applied to any text, we also applied it to a first draft of this paper. Figure 6 shows a visualization of the sentence length of this paper on a sentence-by-sentence level. While it was thought to be a fun experiment at first, the visualization proved to be useful for improving the paper. We used the visualization, for example, to find the extraordinarily long sentences and rechecked them for readability. Other measures help to find repeated words or other redundancies, detect incomplete or complex syntactical structures, or determine the homogeneity of the text flow, which is especially relevant for multiple-author papers.

5 THE FRAMEWORK
To ease the analysis, we set up a framework in which all the measures are implemented and that allows us to explore a text interactively. Figure 7 shows a screenshot of the tool. As can be seen, multiple texts can be loaded at a time and thus compared to each other, or - as in this case - one text can be displayed multiple times to compare different measures. Interaction with a defined hierarchy comes in several ways. First of all, th
