
Method and System for Performing Electronic Data-Gathering Across Multiple Data Sources
Business Process and Work-flow
Computer-based System
Figure 1. System Architecture
Component Details
Data Target
1. Overview
2. Role of the Analyzer
3. Figure 1. Data Acquisition Flow-chart
4. Figure 2. UML Diagram of Reference Implementation
5. Figure 3. Screen Shot of Prototype
Data Scanner
1. Overview
2. Figure 1. Work-flow Diagram
Data Object
1. Overview
2. Data Description Markup
3. Figure 1. UML Diagram of Reference Object
4. Figure 2. XML Prototype
Transmission Protocol
1. Overview
2. Figure 1. Specification for Reference Implementation
Server Module
The Role of Persistent Storage
Figure 1. Request/Response Flow Chart
Figure 2. UML Diagram of Reference Object
Figure 3. Prototypical DB Schema
Workflow - data gathering
Workflow - data retrieval
Object instances of the GemServer/GemClient data
An example interaction between a GemServer and a GemClient
Client Module
1. Overview
2. Module Roles
Data Viewer/Foldable Windows UI
1. Overview
2. Figure 1. Foldable Windows Specification
3. Figure 2. Screenshots of UI Prototype
Original Specifications
Workgroup Gems Specification
Gemteq White Paper
Data Target Specification


Method, Business Process, and Computer System for
Performing Electronic Data-Gathering
Currently, researchers using electronic media as sources are limited to file systems or
cumbersome indexing software for archiving important research data found during searches
through local networks, files, and the Internet. No current system allows for the automatic
capture, storage, and classification of individual data points, as opposed to the entire source
document, or for the acquisition and storage of the data attributing data acquired in such a
way to its original source or author.
Traditional computer systems, such as those mentioned above, are composed of an underlying
model describing the nature of the persistent data used, as well as the functions that are available
to operate on that data. The computer system then implements one or more views, typified in the
modern GUI, allowing the user to interact with the model.
Because of this traditional approach, researchers are required to stop interacting with the current
view of the program they are using to perform their research, interact with the view of the file
system browser or other indexing application, and then return to the view they were accessing
before. This interrupts the workflow of their project.
A new system is required that will allow the researcher to acquire and use important pieces of
data gleaned from files of various types available electronically without stopping their work to
interact with an additional application user-interface. Such a system would provide a conduit for
inserting research data into the corresponding system's model while negating or delaying the need
to interact with a view.
Finally, the system must allow for the search and retrieval of such stored data and allow for its use
in subsequent documents, where the data can automatically generate the required captions,
footnotes, bibliography, or other entries that attribute it to its original author.
Business Process and Work-flow
In the following discussion, the term ‘Source Document’ means the file, application, or Internet
site from which research data is being collected. The term ‘Output Document’ means the original
work in which the data gathered from the source document will be used.
Researching a topic using digital sources has become a time-intensive process due to the vast
quantity of data available, as well as legal requirements to track and attribute all such data used to
its original source. Currently, the process of collecting such data involves a complex series of
interactions with the source data itself, in addition to computer-based or manual filing systems,
word processors, and other authoring tools and applications that make the process non-linear and
difficult to manage.
The preferred process would be to organize such research efforts into a linear stream where all
relevant materials are collected from source documents at the beginning of the project. They
would then be processed and stored for future review. Further, the data stored in this phase would
be visible to other researchers working on the same, or even non-related, projects. Finally, the data
would be withdrawn as needed, along with all proper source and attribution information, during
the compilation of the output document.
Linear Research Work-flow
In order to allow for a somewhat linear flow through the research process, the system must allow
the user to continue work in the source or output documents, without stopping their work to
interact with an external system.
One illustration of this process is the time-honored tradition of students using a marker to
highlight passages in a textbook. As the textbook is first read, important and relevant topics are
‘called out’ using the highlighter: this is the gathering phase. Passages thus called out are, in
essence, stored for future reference: this is the storage phase. Finally, just before an exam, the
student re-reads the textbook, this time focusing on the called-out passages: this is the usage
phase.
This paper-based process can be adapted to electronic media using the following linear workflow:
Target (a location on the work area)
Because the sources for the type of research discussed here are digital media, this process
naturally lends itself to being modeled in a computer software application. One such
implementation is described in the following section.
A Computer-based System
This document describes a computer-based method for acquiring, categorizing, routing,
archiving, sharing, and using digital research data in a single step without leaving source
documents or applications.
Researchers using digital sources may drag or copy specific data onto a virtual ‘hotspot’ on their
computer screens and have it categorized and archived in local or internet-based storage without
being interrupted or stopping their research work in the source document, application, or internet
site.
Once data has been categorized and stored in such a manner, it may be manipulated and searched.
Ultimately, the data may be dragged or copied onto a target document, where an attribution to the
original source will automatically be made.
The system fulfills three interrelated roles by managing:
1. The acquisition, classification, routing, and storage of electronic data.
2. The search for, retrieval of, and usage of the data stored above.
3. The attribution of the data so used to its original source.
The components of the data-acquisition system include:
• A data target providing the visual ‘hotspot’ on the computer screen for collecting the gathered
  data. For gathering non-digital data, the data target may have a hardware-based
  implementation using hand-held scanning and optical character recognition technologies. (1)
• A data analyzer/conduit for acquiring meta-data describing the acquired data and routing it
  and the original data into the system's model. (2)
• System Modules (3) for:
    • Translating, transforming, or modifying the data as required.
    • Requesting additional information from the user, if required.
    • Compiling the data into an encapsulated, routable object.
    • Importing pre-compiled content consisting of collections of the objects described above
      in disk file format.
• A data object consisting of an encapsulation of the original data, user-defined meta-data, and
  attribution and source data in a single routable object. (4)
• A language for describing the data object and packaging it for transmission to a persistent
  storage mechanism. (5)
• A transmission protocol for electronically transmitting the objects defined above. (6)
• An agent for receiving and routing the data packaged as above to the appropriate persistent
  storage location according to pre-defined rules. (7)
• An underlying persistent storage mechanism, such as a file system, relational database, object
  database, or other similar system for storing the data. (8)
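The routing agent of item (7) can be pictured as an ordered list of rules, each pairing a test on the data object with a storage location. The following is a minimal sketch under stated assumptions: the `DataObject` fields, the `RoutingRule` shape, and the store names are all hypothetical and are not taken from the reference implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DataObject:
    """Hypothetical stand-in for the encapsulated, routable data object (4)."""
    name: str
    mime: str
    keywords: List[str]


@dataclass
class RoutingRule:
    """A pre-defined rule: when `matches` returns True, store in `destination`."""
    matches: Callable[[DataObject], bool]
    destination: str


def route(obj: DataObject, rules: List[RoutingRule], default: str = "inbox") -> str:
    """Return the storage location chosen by the first matching rule."""
    for rule in rules:
        if rule.matches(obj):
            return rule.destination
    return default


# Illustrative rules only; the store names are invented for this example.
RULES = [
    RoutingRule(lambda o: o.mime.startswith("image/"), "pictures"),
    RoutingRule(lambda o: "sailing" in o.keywords, "sailing-project"),
]

note = DataObject("Club notes", "text/plain", ["sailing", "bay"])
chosen_store = route(note, RULES)
```

Rules are evaluated in order, so more specific rules would be listed first; an object that matches nothing falls through to the default location.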


In order to facilitate the retrieval of the data acquired in this manner, and its use in output
documents with a simple drag-and-drop or copy-and-paste user interaction, a retrieval system
must be supplied.
Components of the retrieval system include:
• One or more views (GUI) into the underlying model. These views are designed not to
  interfere with other applications on the user's desktop by adopting a ‘folding windows’
  metaphor wherein functionality is made available by successively expanding the UI. At its
  smallest, only the Data Target ((1) above) is visible. (9)
• System Modules (3) for:
    • Interacting with the data retrieval agent (server process) described below.
    • Detecting and handling drag-and-drop or copy-and-paste interactions.
    • Exporting all or part of the current data collection to disk file format.
• A server process for intercepting and processing requests for the data objects created during
  the acquisition process, extracting them from persistent storage, and encoding them for
  transport. (7)
• A language for describing the data object and packaging it for transmission. (5)
• A transmission protocol for electronically transmitting the objects defined above. (6)
• An agent for intercepting and decoding the requested object or objects and translating,
  transforming, or modifying the data as required. (3)
• A system process for posting the original or modified data to the underlying operating
  system's clipboard or other data-sharing mechanism and for making available the attribution
  data provided by the data object. (3)
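The last retrieval step, posting data together with its attribution, might look roughly like the sketch below. It is an illustration only: the `StoredItem` fields, the reference layout, and the footnote-marker style are assumptions, and a real implementation would hand the result to the operating system's clipboard rather than return a dictionary.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StoredItem:
    """Hypothetical subset of the fields carried by a stored data object."""
    text: str
    author: str
    title: str
    publisher: str
    year: int


def format_attribution(item: StoredItem) -> str:
    """Build a simple 'Author. Title. Publisher, Year.' reference line."""
    return f"{item.author}. {item.title}. {item.publisher}, {item.year}."


def clipboard_payload(item: StoredItem, footnote_number: int) -> Dict[str, str]:
    """Return the body text with a footnote marker, plus the matching footnote."""
    return {
        "body": f"{item.text} [{footnote_number}]",
        "footnote": f"[{footnote_number}] {format_attribution(item)}",
    }


item = StoredItem(
    text="Carrie was just a normal girl.",
    author="Stephan King",
    title="Carrie",
    publisher="Random House",
    year=1978,
)
payload = clipboard_payload(item, 1)
```

Keeping the attribution formatter separate from the payload builder would let the same stored item be rendered in different bibliographic styles.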


System Architecture
The components of these sub-systems may be organized into a multi-tier architecture as follows:
[Figure: System architecture. Component labels: (7) Server Module; Data object; (4) (5) Encoded
Data Objects; (6) Data Transmission; (1) Data Target; (3) Client Module; (2) Analyzer/Conduit;
(1) Data Scanner; Language Module; Plain/formatted Text; Graphics Module; Palm-top Device;
OCR Module; System Clipboard; Data Manager; Data object; (9) Data Viewer; ‘Folding Windows’ UI.]


Component Details:
Data Target
Commercial Name: Gem Target
The data target is a free-floating, movable icon symbolizing the active research system. This presentation
eliminates the need to present additional application windows to the user, while still allowing the user to
access key features of the system. The target can provide:
• A visual indication that a larger processing system is running.
• A target for dragging and dropping (or copying and pasting) data into the underlying system.
• A menu of options for interacting with the underlying system.
• A starting point for expanding the functionality of the system via the ‘foldable windows’ UI.
Role of the Analyzer
The analyzer component provides a ‘first pass’ analysis of the acquired data. The target waits for data to be
passed into it, and then passes that data to the analyzer, which obtains whatever source and system data is
available. Examples include: the source application and URL or file, the date and time the data was
acquired, the user and machine name that acquired the data, any bibliographic data that can be deciphered,
keywords, and other indexing data. This data is then packaged into an interim object that is passed to
the underlying system via a ‘capture’ event. From this point on, the target returns to its waiting state, and
the main processing system takes over.
1. A flow-chart showing data acquisition.
2. A UML diagram of one possible implementation, an ActiveX control for the Windows platform.
3. A screen shot of the target component as it would appear on the Windows desktop.
Analyzer/Conduit
The first-pass analysis performed by the analyzer consists of these steps:
• User Data - The OS API is used to acquire data about the system user who is collecting data, the
  machine name used, and the system date and time.
• Source Data - The OS API is used to determine the source application and document providing the
  data, for example, a web browser pointing to a specific web site. As this data is not always available,
  a best-guess routine is used, which attempts to determine the application name and document name
  providing the data. The routine makes a guess based on the application window that was last active
  before the user posts data to the target.
• Bibliography Data - The collected data is scanned to determine author, publication date, document
  title, source URL, and other source data. When this data is not available, a best-guess routine is used
  which suggests a value from the source data, for example ‘meta’ tags in HTML, or embedded tags in
  Rich Text. When no data is available to make a guess, the field is left blank.
• Data Acquisition - As discussed in the Data Target specification.
• Default Name and Description - The collected data is scanned to suggest a name, typically derived
  from the first 3 to 4 words of associated text. For pictures or binary data, it defaults to
  "<data type> from <source document name>".
• Keyword Scan - Any text associated with the collected data is scanned and separated into individual
  words. A dictionary of ‘small’ words is applied to discard words that do not make meaningful
  keywords. The resulting collection of keywords is used to suggest search terms or related concepts
  for the collected data.
• URL Scan - The data is scanned for any embedded URLs. Any URLs found are stored and used to
  suggest ‘related’ web links for the collected data.
• Object Packing - The resulting collection of data is packaged into an interim object that is passed to
  the main system for further processing.
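Several of the steps above (default name, keyword scan, URL scan, object packing) are simple text operations and can be sketched briefly. The code below is an illustration under assumptions, not the reference control: the small-word list, the URL pattern, and the `InterimObject` fields are hypothetical, and the OS-dependent steps (user data, source data) are left out.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical 'small word' dictionary; a real one would be much larger.
SMALL_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "was"}

# Simple pattern for http/https URLs embedded in text (an assumption).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


@dataclass
class InterimObject:
    """Hypothetical interim object handed to the main system."""
    text: str
    default_name: str
    keywords: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)


def suggest_name(text: str, data_type: str, source_document: str, max_words: int = 4) -> str:
    """Use the first few words of text, else '<data type> from <source document name>'."""
    words = text.split()
    if words:
        return " ".join(words[:max_words])
    return f"{data_type} from {source_document}"


def scan_urls(text: str) -> List[str]:
    """Collect embedded URLs, trimming trailing sentence punctuation."""
    return [url.rstrip(".,;:)") for url in URL_PATTERN.findall(text)]


def scan_keywords(text: str) -> List[str]:
    """Split text into lowercase words, dropping URLs, small words, and duplicates."""
    without_urls = URL_PATTERN.sub(" ", text)
    seen, result = set(), []
    for word in re.findall(r"[a-z']+", without_urls.lower()):
        if word not in SMALL_WORDS and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def pack(text: str, data_type: str = "Text", source_document: str = "unknown") -> InterimObject:
    """Run the text-based scans and package the results."""
    return InterimObject(
        text=text,
        default_name=suggest_name(text, data_type, source_document),
        keywords=scan_keywords(text),
        urls=scan_urls(text),
    )


sample = pack("The club sails on the bay. See http://example.com/boats for details.")
```

Removing URLs before the keyword scan keeps fragments such as "http" or "com" out of the suggested search terms.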


[Figure: UML class diagram of the reference implementation of the Data Target. Three-Tiered Service
Model; Class Diagram: Logical View; tiers: User Services, Business Services, Data Services.
Control properties: AutoGenKeywords : Boolean = False; FindURLsInText : Boolean = False;
ShowStatusMsg : Boolean = False; Keywords : Object; ShellLinks : Object.
Events: Capture(); Error(); NewDataAvailable(); NewForegroundApp(); NewActiveURL().
Data properties: AsciiText : String; RTFText : String; HTMLText : String; Picture : Object;
ShellLinks : Object; Keywords : Object; UserName : String; MachineName : String.
Model file: C:\WINNT\Profiles\bblackburn\Desktop\Patent\control.mdl]


[Figure: Screen shot of the Data Target as it appears on the Windows desktop.]
The Data Target provides a virtual ‘hot spot’, available to all applications, allowing the user to
insert data into the research system without stopping work in the source document.


Data Scanner
Commercial Name: Gem Highlighter
The data scanner is a hardware implementation of the data target. While the data target enables the
collection and deposit of data directly into the system, it does not address data that could be acquired from
paper. The data scanner uses handheld scanning technology to acquire lines of text from printed materials.
If the scanner is detached from the main computer, or attached to a palm-top device, this data is buffered
until it is synchronized with the main system. Once a connection can be established, the Analyzer forwards
the data to the OCR module in the main system, where the data is converted into text and forwarded to the
server for processing and storage.
1. Workflow diagram.
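The buffer-then-synchronize behavior of the data scanner can be sketched as follows. This is a simplified illustration with assumed names: `ScannerBuffer` and `fake_ocr` are hypothetical, and the placeholder OCR function merely decodes bytes where a real module would process scanned images.

```python
from typing import Callable, List


class ScannerBuffer:
    """Holds scanned line images until the main system can be reached."""

    def __init__(self, ocr: Callable[[bytes], str]) -> None:
        self._ocr = ocr  # stand-in for the OCR module in the main system
        self._pending: List[bytes] = []

    def scan(self, line_image: bytes) -> None:
        """Buffer one scanned line while detached from the main system."""
        self._pending.append(line_image)

    def synchronize(self, connected: bool) -> List[str]:
        """Forward buffered scans to OCR when connected; otherwise keep them."""
        if not connected:
            return []
        texts = [self._ocr(image) for image in self._pending]
        self._pending.clear()
        return texts

    @property
    def pending_count(self) -> int:
        return len(self._pending)


def fake_ocr(image: bytes) -> str:
    """Placeholder OCR: this example pretends the image bytes are ASCII text."""
    return image.decode("ascii")


scanner = ScannerBuffer(fake_ocr)
scanner.scan(b"line one")
scanner.scan(b"line two")
offline_result = scanner.synchronize(connected=False)  # nothing is forwarded yet
online_result = scanner.synchronize(connected=True)    # buffered lines are converted
```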


Data Object
Commercial Name: Gem Object
Data Object Description Language
Commercial Name: Gem Markup Language
The data object is an implementation of the encapsulated data and functionality required to use, view, and
manipulate the data captured in the process described above. Its primary features are:
• The original piece of important research data is ‘packaged’ with additional user-defined data.
• The data includes the required information to create a proper reference to the original source of the
  data, as well as a method for formatting that information in popular bibliographic formats.
The language for transmitting data objects via standard internet protocols is designed using XML
(Extensible Markup Language), a tag-based textual representation of the data, meta-data, and bibliographic
references that make up a data object. Optionally, the raw data may be compressed and then encoded (using
uuencode or a similar scheme) to represent binary data in text form for inclusion in the XML document.
1. A UML diagram showing one possible implementation, IGem.
2. An XML prototype showing a method for encoding such an object for transport via a network.
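The optional compress-then-encode step for the raw data could be sketched as below. Assumptions are labeled: base64 stands in for "uuencode or a similar scheme", zlib is an arbitrary choice of compressor, and the attribute values "zlib" and "base64" are invented for the example. Only the element and attribute names (`Item`, `creatorName`, `data`, `mime`, `compression`, `encoding`) follow the XML prototype.

```python
import base64
import xml.etree.ElementTree as ET
import zlib


def encode_item(name: str, creator: str, mime: str, raw: bytes) -> str:
    """Serialize one item to XML, compressing and text-encoding its payload."""
    item = ET.Element("Item", {"vers": "1.0", "name": name})
    ET.SubElement(item, "creatorName").text = creator
    data = ET.SubElement(
        item,
        "data",
        # Attribute values below are assumptions made for this sketch.
        {"mime": mime, "compression": "zlib", "encoding": "base64"},
    )
    data.text = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return ET.tostring(item, encoding="unicode")


def decode_item(xml_text: str) -> bytes:
    """Recover the original payload bytes from XML produced by encode_item."""
    data = ET.fromstring(xml_text).find("data")
    return zlib.decompress(base64.b64decode(data.text))


original = b"\x89PNG binary payload \x00\x01\x02"
xml_text = encode_item("Pic", "James Bondo", "image/png", original)
round_tripped = decode_item(xml_text)
```

Encoding after compression keeps arbitrary binary data representable as XML character data.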


[Figure: UML diagram of the reference Gem Object. Legible property accessors (Let/Get pairs unless
noted): GemStatus : GemObjectStatusEnum; Name : String; GemID : Long; ParentType : String;
ParentID : Long; ParentName : String; CreateDate : Date; CreatedBy; SourceApp : String;
SourceCat : String; SourceMachineName : String; UserTag : String; Description : String;
AuthorLast : String; Source : String; FromPage : Long; ToPage : Long; PubTitle : String;
PubPlace : String; Edition : String; Publisher : String; PublicationDate : String;
ShellLinks : Collection (Get only).]


Object Instances of the GemServer/GemClient data
This is an example of instances of the objects that may be used to implement the GemServer and
GemClient concept. There are three types of objects:
Root - represents the database that contains the tree structure.
Container - used to represent inner nodes (and empty nodes) of the tree structure.
Item - used to represent leaf nodes of a tree structure.
Suppose a GemServer has an example database that contains information about scary
books. The user might see a URI like this:
The GemClient would parse the URI in the following way:
The Database, Path, and Object can then be represented as a Root, Container, and Item:
Root :
Container :
Container :
Item :
The following sections are examples of these four objects. Note that the Root is a type of container
that currently may only contain Containers. A Container may contain any number of Containers and/or
Items.
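The mapping from a URI to a Database, Path, and Object can be illustrated with a short sketch. The URI format used here is purely hypothetical: the `gem://` scheme, the host name, and the assumption that the first path segment names the database are inventions for the example.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlsplit


@dataclass
class ParsedGemUri:
    server: str
    database: str    # corresponds to the Root
    path: List[str]  # corresponds to the nested Containers
    obj: str         # corresponds to the Item


def parse_gem_uri(uri: str) -> ParsedGemUri:
    """Split a hypothetical gem:// URI into server, database, path, and object."""
    parts = urlsplit(uri)
    segments = [segment for segment in parts.path.split("/") if segment]
    if len(segments) < 2:
        raise ValueError("expected at least a database and an object name")
    return ParsedGemUri(
        server=parts.netloc,
        database=segments[0],
        path=segments[1:-1],
        obj=segments[-1],
    )


# Hypothetical URI built from the names used in the XML examples.
parsed = parse_gem_uri("gem://gemserver.example/Examples/ScaryBooks/StephanKing/Carrie")
```

With this layout the example resolves to one Root, two Containers, and one Item, matching the four-object listing above.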


<Root vers="1.0" id="218838" name="Examples" mime="application/x-gemteq-root">
  <rootName>Examples for example</rootName>
  <comments>Various things to play with that show a gem server in action</comments>
  <creationDate>11 Aug 1999 14:18</creationDate>
  <creatorName>James Bondo</creatorName>
  <modificationDate>13 Aug 1999 12:03</modificationDate>
  <defaultBib format="//" />
  <!-- all are mime:application/x-gemteq-chest by definition -->
  <obj id="210243234" name="ScaryBooks">
    <containerName>Scary Books</containerName>
  </obj>
  <obj id="2832834" name="Various" />
  <obj id="2198348" name="SallyHome">
    <containerName>Sally's Home Chest</containerName>
  </obj>
</Root>


Containers
<Container vers="1.0" id="23444543234" name="ScaryBooks" mime="application/x-gemteq-chest">
  <containerName>Some Scary Books</containerName>
  <comments>A collection of scary books</comments>
  <creationDate>11 Aug 1999 14:19</creationDate>
  <creatorName>James Bondo</creatorName>
  <modificationDate>13 Aug 1999 12:03</modificationDate>
  <defaultBib format="//" />
  <obj id="210243234" name="StephanKing" mime="application/x-gemteq-tray">
    <itemName>Misc</itemName>
  </obj>
  <obj id="21838848" name="DeanKoontz" mime="application/x-gemteq-tray">
    <itemName>Dean Koontz</itemName>
  </obj>
</Container>


<Container vers="1.0" id="210243234" name="StephanKing" mime="application/x-gemteq-tray">
  <containerName>Stephan King</containerName>
  <comments>A collection of books by Stephan King</comments>
  <creationDate>11 Aug 1999 14:19</creationDate>
  <creatorName>James Bondo</creatorName>
  <modificationDate>13 Aug 1999 12:03</modificationDate>
  <defaultBib format="//bibserver.gemteq.com/BibilographyFormats/GenericBook" />
  <obj id="120983838" name="Carrie" mime="application/x-gemteq-textgem">
  </obj>
  <obj id="65683838" name="TheStand" mime="application/x-gemteq-textgem">
    <itemName>The Stand</itemName>
  </obj>
  <obj id="777983838" name="IT" mime="application/x-gemteq-textgem">
  </obj>
  <obj id="76883838" name="PetSeminary" mime="application/x-gemteq-textgem">
    <itemName>Pet Seminary</itemName>
  </obj>
  <obj id="120983838" name="TheLangoliers" mime="application/x-gemteq-textgem">
    <itemName>The Langoliers</itemName>
  </obj>
  <obj id="298948393" name="MiscNotes" mime="application/x-gemteq-tray">
    <itemName>Misc Notes</itemName>
  </obj>
  <obj id="298948393" name="Pic" mime="application/x-gemteq-imagegem">
    <itemName>Picture of Stephan King</itemName>
  </obj>
</Container>


<Item vers="1.0" id="120983838" name="Carrie" mime="application/x-gemteq-textgem">
  <comments>Psi and teenage angst.</comments>
  <creationDate>12 Aug 1999 15:20</creationDate>
  <creatorName>Billy Joe Jim Bob</creatorName>
  <modificationDate>14 Aug 1999 16:20</modificationDate>
  <bib format="//gemserver.gemteq.com/BibilographyFormats/MLABook">
    <author>Stephan King</author>
    <publisher>Random House</publisher>
    <location>New York</location>
    <ed>3</ed>
    <vol>1</vol>
    <date>2 Aug 1978</date>
  </bib>
  <!-- an image may have the following attributes -->
  <data mime="text/plain" compression="" encoding="">
Carrie was just a normal girl. With a wonderful mother and really good
friends. For some reason she got very angry when they dumped a bucket
of goo upon her at the prom.
  </data>
</Item>


Data Object Transmission Protocol
Commercial Name: Gem Transmission Protocol
Data Transmission
The objects acquired in the research process will need to be stored. Frequently the storage location will be
network based, and this network may be slow. Although the server component described below handles the
storage process, a protocol is required for communicating with the server, as well as for the server to
respond to the client.
For a local, single-user installation, no transmission protocol is required, as method invocation may be done
using standard in-process communication mechanisms. For remote data stores maintained by a separate,
remote server module, some means of requesting data, or operations on data, and of receiving a response is
required. For this system, the protocol used is an XML-RPC-like mechanism whereby requestor objects and
response objects contain the data to be transferred between client and server.
1. Example of reference implementation.
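A request/response exchange of the kind described above can be sketched with Python's standard `xmlrpc.client` marshalling functions, without any network transport. The method name `gem.getObject`, the parameter layout, and the in-memory `STORE` are hypothetical and serve only to show the shape of the exchange.

```python
import xmlrpc.client

# Client side: encode a request as an XML-RPC method call.
request_xml = xmlrpc.client.dumps(
    ("Examples", "ScaryBooks/StephanKing", "Carrie"),
    methodname="gem.getObject",  # hypothetical method name
)

# Server side: decode the call and look up the requested object.
params, method = xmlrpc.client.loads(request_xml)
database, path, name = params

# Hypothetical in-memory stand-in for persistent storage.
STORE = {
    ("Examples", "ScaryBooks/StephanKing", "Carrie"): {
        "comments": "Psi and teenage angst.",
    },
}

# Server side: encode the response object (a one-element tuple is required).
response_xml = xmlrpc.client.dumps((STORE[(database, path, name)],), methodresponse=True)

# Client side: decode the response.
(result,), _ = xmlrpc.client.loads(response_xml)
```

In a remote deployment the two XML strings would travel as the bodies of an HTTP request and response; for a local installation the lookup could be a direct method call.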


An example interaction between a GemServer and a GemClient
At the protocol level, the GemClient communicates with the GemServer using Remote Procedure Calls in
XML (XML-RPC) over HTTP. Examples of the data objects can be found in the Data Object section.
This section shows an example of this method of communication.
As in the Data Object section, there is an Examples database on a GemServer that contains information
about scary books. The user might see a URL like this:

