EXHIBIT 2087
Facebook, Inc. et al.
v.
Software Rights Archive, LLC
CASE IPR2013-00479

Copyright © Alexander Halavais 2009

The right of Alexander Halavais to be identified as Author of this Work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.

First published in 2009 by Polity Press

Polity Press
65 Bridge Street
Cambridge CB2 1UR, UK

Polity Press
350 Main Street
Malden, MA 02148, USA

All rights reserved. Except for the quotation of short passages for the purpose of criticism and review, no part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher.

ISBN-13: 978-0-7456-4214-7
ISBN-13: 978-0-7456-4215-4 (paperback)

A catalogue record for this book is available from the British Library.

Typeset in 10.25 on 13 pt FF Scala
by Servis Filmsetting Ltd, Stockport, Cheshire
Printed and bound in Great Britain by MPG Books Ltd, Bodmin, Cornwall.

The publisher has used its best endeavours to ensure that the URLs for external websites referred to in this book are correct and active at the time of going to press. However, the publisher has no responsibility for the websites and can make no guarantee that a site will remain live or that the content is or will remain appropriate.

Every effort has been made to trace all copyright holders, but if any have been inadvertently overlooked the publishers will be pleased to include any necessary credits in any subsequent reprint or edition.

For further information on Polity, visit our website: www.polity.co.uk.

providers: finding the least expensive airfares for a given route, for example.

Most crawlers make an archival copy of some or all of a webpage, and extract the links immediately to find more pages to crawl. Some crawlers, like the Heritrix spider employed by the Internet Archive, the "wget" program often distributed with Linux, and web robots built into browsers and other web clients, are pretty much done at this stage. However, most crawlers create an archive that is designed to be parsed and organized in some way. Some of this processing (like "scraping" out links, or storing metadata) can occur within the crawler itself, but there is usually some form of processing of the text and code of a webpage afterward to try to obtain structural information about it.

The most basic form of processing, common to almost every modern search engine, is extraction of key terms to create a keyword index for the web by an "indexer." We are all familiar with how the index of a book works: it takes information about which words appear on any given page and reverses it so that you may learn which pages contain any given word. In retrospect, a full-text index of the web is one of the obvious choices for finding material online, but particularly in the early development of search engines it was not clear what parts should be indexed: the page titles, metadata, hyperlink text, or full text (Yuwono et al. 1995). If indexing the full text of a page, is it possible to determine which words are most important?
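The reversal described above, from page-to-words into word-to-pages, can be sketched as a minimal inverted index. This is an illustrative toy, not any particular engine's implementation, and the page texts are invented:

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        # Naive whitespace tokenization; real indexers do far more.
        for term in text.lower().split():
            index[term].add(page_id)
    return index

# Hypothetical documents, purely for illustration.
pages = {
    "page1": "search engines index the web",
    "page2": "the web grows quickly",
}
index = build_inverted_index(pages)
print(sorted(index["web"]))  # ['page1', 'page2']
```

A book index answers "which pages mention X?" the same way: the lookup is by term, not by page.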
In practice, even deciding what constitutes a "word" (or a "term") can be difficult. For most western languages, it is possible to look for words by finding letters between the spaces and punctuation, though this becomes more difficult in languages like Chinese and Japanese, which have no clear markings between terms. In English, contractions and abbreviations cause problems. Some spaces mean more than others; someone looking for information about "York" probably has little use for pages that mention "New York," for instance. A handful of words like "the" and "my" are often dismissed as "stop words" and not included in the index because they are so common. Further application of natural language processing (NLP) is capable of determining the parts of speech of terms, and synonyms can be identified to provide further clues for searching. At the most extreme end of indexing are efforts to allow a computer to in some way understand the genre or topic of a given page by "reading" the text to determine its meaning.1
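The letters-between-spaces heuristic and stop-word removal described above can be sketched in a few lines. The stop list here is a tiny invented sample, and the split-on-non-letters rule is exactly the naive approach the text warns about: it works passably for many western languages but fails for languages without word boundaries, and it happily splits "New York" apart:

```python
import re

# A tiny illustrative stop list; real engines use longer ones.
STOP_WORDS = {"the", "my", "a", "of", "and"}

def tokenize(text):
    """Split on anything that is not a letter, then drop stop words."""
    terms = re.findall(r"[a-z]+", text.lower())
    return [t for t in terms if t not in STOP_WORDS]

print(tokenize("The history of New York"))  # ['history', 'new', 'york']
```

Note how "New York" becomes two unrelated terms, which is precisely why a page about New York can surface for someone searching only for "York."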
An index works well for a book. Even in a fairly lengthy work, it is not difficult to check each occurrence of a keyword, but the same is not true of the web. Generally, an exhaustive examination of each of the pages containing a particular keyword is impossible, particularly when much of the material is not just unhelpful, but - as in the case of spam - intentionally misleading. This is why results must be ranked according to perceived relevance, and the process by which a particular search engine indexes its content and ranks the results is really a large part of what makes it unique. One of the ways Google leapt ahead of its competitors early on is that it developed an algorithm called PageRank that relied on hyperlinks to infer the authority of various pages containing a given keyword. Some of the problems of PageRank will be examined in chapter 4. Here, it is enough to note that the process by which an index is established, and the attributes that are tracked, make up a large part of the "secret recipes" of the various search engines.
The crawling of the web and processing of that content happens behind the scenes, and results in a database of indexed material that may then be queried by an individual. The final piece of a search engine is its most visible part: the interface, or "front end," that accepts a query, processes it, and presents the results. The presentation of an initial request can be, and often is, very simple: the search box found in the corner of a webpage, for example. The sparse home page for the Google search engine epitomizes this simplicity. However, providing people with an extensive set of tools to tailor their search, and to refine their search, can lead to interesting challenges, particularly for large search engines with an extremely diverse set of potential users.

The Engines

In some ways, the ideal interface anticipates people's behaviors, understanding what they expect and helping to reveal possibilities without overwhelming them. This can be done in a number of ways. Clearly the static design of the user interface is important, as is the process, or flow, of a search request. Westlaw, among other search engines, provides a thesaurus function to help users build more comprehensive searches. Search engines like Yahoo have experimented with auto-completing searches, anticipating what the person might be trying to type in the search box, and providing suggestions in real time (Calore 2007). It is not clear how effective these particular elements are, but they exemplify the aims of a good interface: a design that meets the user half-way.
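The auto-completion idea described above, suggesting likely queries as the user types, can be approximated with a prefix lookup over a sorted log of past queries. This is a simplified sketch, not Yahoo's actual system, and the query log is invented:

```python
import bisect

class Autocomplete:
    """Suggest completions for a prefix from a sorted query vocabulary."""

    def __init__(self, queries):
        self.queries = sorted(queries)

    def suggest(self, prefix, limit=3):
        # Binary search to the first query >= prefix, then scan forward
        # while queries still share the prefix.
        start = bisect.bisect_left(self.queries, prefix)
        results = []
        for q in self.queries[start:]:
            if not q.startswith(prefix):
                break
            results.append(q)
            if len(results) == limit:
                break
        return results

ac = Autocomplete(["weather", "web crawler", "web search", "westlaw"])
print(ac.suggest("web"))  # ['web crawler', 'web search']
```

Production systems rank suggestions by popularity rather than alphabetically, but the prefix-matching core is the same.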
Once a set of results is created, they are usually ranked in some way to provide a list of topics that present the most significant "hits" first. The most common way of displaying results is as a simple list, with some form of summary of each page. Often the keywords are presented in the context of the surrounding text. In some cases, there are options to limit or expand the search, to change the search terms, or to alter the search in some other way. More recently, some search engines provide results in categories, or mapped graphically.

All of these elements work together to keep a search engine continuously updated. The largest search engines are constantly under development to better analyze and present searchable databases of the public web. Some of this work is aimed at making search more efficient and useful, but some is required just to keep pace. The technologies used on the web change frequently, and, when they do, search engines have to change with them. As people employ Adobe Acrobat or Flash, search engines need to create tools to make sense of these formats. The sheer amount of material that must be indexed increases exponentially each year, requiring substantial

of these pages, but limited itself to the titles of the files. Nonetheless, it represented a first effort to rein in a quickly growing, chaotic information resource, not by imposing order on it from above, but by mapping and indexing the disorder to make it more usable.

The Gopher system was another attempt to bring order to the early internet. It made browsing files more practical, and represented an intermediary step in the direction of the World Wide Web. People could navigate through menus that organized documents and other files, and made it easier, in theory, to find what you might be looking for. Gopher lacked hypertext - you could not indicate a link and have that link automatically load another document in quite the same way it can be done on the web - but it facilitated working through directory structures, and insulated the individual from a command-line interface. Veronica, named after Archie's girlfriend in 1940s-era comics, was created to provide a broader index of content available on Gopher servers. Like Archie, it provided the capability of searching titles (actually, menu items), rather than the full text of the documents available, but it required a system that could crawl through the menu-structured directories of "gopherspace" to discover each of the files (Parker 1994).

In 1991, the World Wide Web first became available, and with the popularization of a graphical browser, Mosaic, in 1993, it began to grow even more quickly. The most useful tool for the web user of the early 1990s was a good bookmark file, a collection of URLs that the person had found to be useful. People began publishing their bookmark files to the web as pages, and this small gesture has had an enormous impact on how we use the web today. The collaborative filtering and tagging sites that are popular today descended from this practice, and the updating and annotating of links to interesting new websites led to some of the first proto-blogs. Most importantly, it gave rise to the first collaborative directories and search engines.

The first of these search engines, Wandex, was developed by Matthew Gray at the Massachusetts Institute of Technology,

and was based on the files gathered by his crawler, the World Wide Web Wanderer. It was, again, developed to fulfill a particular need. The web was made for browsing, but perhaps to an even greater degree than FTP and Gopher, it had no overarching structure that would allow people to locate documents easily. Many attribute the genesis of the idea of the web to an article that had appeared just after the Second World War entitled "As we may think", in which Vannevar Bush (1945) suggests that a future global encyclopedia will allow individuals to follow "associative trails" between documents. In practice, the web grows in a haphazard fashion, like a library that consists of a pile of books that grows as anyone throws anything they wish onto the pile. A large part of what an index needed to do was to discover these new documents and make sense of them. Perhaps more than any previous collection, the web cried out for indexing, and that is what Wandex did.

As with Veronica, the Wanderer had to work out a way to follow hyperlinks and crawl this new information resource, and, like its predecessors, it limited itself to indexing titles. Brian Pinkerton's WebCrawler, developed in 1994, was one of the first web-available search engines (along with the Repository-Based Software Engineering ["RBSE"] spider and indexer; see Eichmann 1994) to index the content of these pages. This was important, Pinkerton suggested, because titles provided little for the individual to go on; in fact, a fifth of the pages on the web had no titles at all (Pinkerton 1994). Receiving its millionth query near the end of 1994, it clearly had found an audience on the early web, and, by the end of 1994, more than a half-dozen search engines were indexing the web.

Searching the web

Throughout the 1990s, advances in search engine technology were largely incremental, with a few exceptions. Generally, the competitive advantage of one search engine or another had more to do with the comparative size of its database, and how quickly that database was updated.

Search

Society

The size of the web and its phenomenal growth were the most daunting technical challenge any search engine designer would have to face. But there were some advances that had a significant impact. A number of search engines, including SavvySearch, provided metasearch: the ability to query multiple search engines at once (A.E. Howe & Dreilinger 1997). Several, particularly Northern Light, included material under license as part of their search results, extending access beyond what early web authors were willing to release broadly (and without charge) to the web. Northern Light was also one of the first to experiment with clustering results by topic, something that many search engines are now continuing to develop. Ask Jeeves attempted to make the query process more user-friendly and intuitive, encouraging people to ask fully formed questions rather than use Boolean search queries, and Alta Vista provided some early ability to refine results from a search.
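A metasearch engine like SavvySearch has to merge ranked lists from several engines into one. The book does not say how SavvySearch did this; one standard approach today is reciprocal rank fusion, sketched here with invented result lists from two hypothetical engines:

```python
from collections import defaultdict

def fuse(result_lists, k=60):
    """Reciprocal rank fusion: each engine's list 'votes' for a URL
    with weight 1/(k + rank); URLs with the highest totals rank first."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Invented result lists, purely for illustration.
engine_a = ["a.com", "b.com", "c.com"]
engine_b = ["b.com", "d.com", "a.com"]
print(fuse([engine_a, engine_b]))  # b.com first: ranked high by both
```

A URL that several engines rank highly beats one that a single engine ranks first, which is the core appeal of querying multiple engines at once.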
One of the greatest challenges search engines had to face, particularly in the late 1990s, was not just the size of the web, but the rapid growth of spam and other attempts to manipulate search engines in an attempt to draw the attention of a larger audience. A later chapter will address this game of cat-and-mouse in more detail, but it is worth noting here that it represented a significant technical obstacle and resulted in a perhaps unintended advantage for Google, which began providing search functionality in 1998. It took some time for those wishing to manipulate search engines to understand how Google's reliance on hyperlinks as a measure of reputation worked, and to develop strategies to influence it.

At the same time, a number of directories presented a complementary paradigm for organizing the internet. Yahoo, LookSmart, and others, by using a categorization of the internet, gave their searches a much smaller scope to begin with. The Open Directory Project, by releasing its volunteer-edited, collaborative categorization, provided another way of mapping the space. Each of these provided the ability to search, in

It is impossible for me or anyone else to guess why this particular posting became especially popular, but every page on the web that becomes popular relies at least in part on its initial popularity for this to happen. The exact mechanism is unclear, but after some level of success, it appears that popularity in networked environments becomes "catching" (or "glomming"; Balkin 2004). The language of epidemiology is intentional. Just as social networks transmit diseases, they can also transmit ideas, and the structures that support that distribution seem to be in many ways homologous.

Does this mean that this power law distribution of the web is an unavoidable social fact? The distribution certainly seems prevalent, not just in terms of popularity on the web, but in a host of distributions that are formed under similar conditions. More exactly, the power law distribution appears to encourage its own reproduction, by providing an easy and conventional path to the most interesting material. And when individuals decide to follow this path, they further reinforce this lopsided distribution. Individuals choose their destination based on popularity, a fully intentional choice, but this results in the winner-take-all distribution, an outcome none of the contributors desired to reinforce; it is, to borrow a phrase from Giddens, "everyone's doing and no one's" (1984, p. 10). This sort of distribution existed before search engines began mining linkage data, but has been further reinforced and accelerated by a system that benefits from its reproduction.
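The rich-get-richer dynamic described above can be simulated with a preferential-attachment model (a standard model from network science, used here as an illustration; the author does not specify one). Each new page links to an existing page with probability proportional to the links that page already has, and a handful of pages end up hoarding most of the attention:

```python
import random

def preferential_attachment(n_pages, seed=42):
    """Grow a toy web one page at a time. Each new page links to an
    existing page chosen with probability proportional to its current
    in-link count (plus one, so unlinked pages can still be chosen).
    Returns the list of in-link counts."""
    random.seed(seed)
    in_links = [0]
    for _ in range(n_pages - 1):
        weights = [c + 1 for c in in_links]
        target = random.choices(range(len(in_links)), weights=weights)[0]
        in_links[target] += 1
        in_links.append(0)  # the newcomer starts with no in-links
    return in_links

counts = preferential_attachment(10000)
counts.sort(reverse=True)
print(counts[:5])  # a few pages hold a large share of all links
```

Popularity feeding on popularity is "everyone's doing and no one's": no single random choice in the simulation aims at a winner-take-all outcome, yet one reliably emerges.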
In the end, the question is probably not whether the web and the engines that search it constitute an open, even, playing field, or even, as Cooper had it with newspapers, "whether a community derives most good or evil, from the institution" (Cooper 2004, p. 113). Both questions are fairly settled: some information on the web is more visible than other information. We may leave to others whether or not the web is, in sum, a good thing; the question has little practical merit as we can hardly expect the web to quietly disappear any time soon. What we may profitably investigate is how attention is guided differently on the web from how it has been in earlier information environments, and who benefits from this.

Attention

PageRank

By the end of the 1990s search engines were being taken seriously by people who produced content for the web. This was particularly true of one of the most profitable segments of the early web: pornography. Like many advertising-driven industries, "free" pornography sites were often advertising-supported, and in order to be successful they needed to attract as many viewers as possible. It did not really matter whether or not the viewer was actually looking for pornography - by attracting them to the webpages, the site would probably be paid by the advertiser for the "hit," or might be able to entice the visitor into making a purchase. The idea of the hawker standing on the street trying to entice people into a store is hardly a new one. Search engines made the process a bit more difficult for those hawkers. In fact, for many search engines, securing their own advertising profits required them to effectively silence the pornographers' hawkers.

Search engines were trying to avoid sending people to pornography sites, at least unless the searcher wanted that, which some significant proportion did. What they especially wanted to avoid was having a school-aged child search for information on horses for her school report and be sent - thanks to aggressive hawking by pornography producers - to an explicit site, especially since in the 1990s there was a significant amount of panic (particularly in the United States) about the immoral nature of the newly popular networks. Most advertisers have a vested interest in getting people to come to their site, and are willing to do whatever they can in order to encourage this. Google became the most popular search engine, a title it retains today, by recognizing that links could make it possible to understand how a web page was regarded by other authors. They were not the first to look to the hyperlinked structure of the web to improve their search results, but they managed to do so more effectively than other search engines had. Along with good coverage of the web, a simple user interface, and other design considerations, this attention to hyperlink structure served them well.

Google and others recognized that hyperlinks were more than just connections; they could be considered votes. When one page linked to another page, it was indicating that the content there was worth reading, worth discovering. After all, this is most likely how web surfers and the search engine's crawlers encountered the page: by following links on the web that led there. If a single hyperlink constituted an endorsement, a large number of links must suggest that a page was particularly interesting or worthy of attention. This logic, probably reflecting the logic of the day-to-day web user, was amplified by the search engine. Given the need to sort through thousands of hits on most searches, looking at the links proved to be helpful.
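The links-as-votes logic, refined so that a vote from a well-linked page counts for more, is what PageRank computes. A compact power-iteration sketch follows; the toy graph is invented for illustration and the handling of details (dangling pages, convergence tests) is simplified relative to any production system:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links to. A page's score is
    shared equally among its out-links; the damping factor models a
    surfer who occasionally jumps to a random page instead."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outs in links.items():
            if not outs:  # dangling page: share its score with everyone
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                for out in outs:
                    new_rank[out] += damping * rank[page] / len(outs)
        rank = new_rank
    return rank

# A toy web: three pages all link to "hub", which links back to "a".
links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # hub
```

Note that "a" also ends up ranked well above "b" and "c" despite having the same number of in-links as they do: its single link comes from the heavily endorsed hub, which is exactly the recursive refinement that distinguishes PageRank from simply counting links.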
Take, for example, a search for "staph infections." At the time of writing, Google reports nearly 1.3 million pages include those words, and suggests some terms that would help to refine that search (a feature of their health-related index). The top result, the one that the most people will try, is from Columbia University's advice site Go Ask Alice; the hundredth result is a page with information about a pain-relieving gel that can be used for staph infections; and the last page listed (Google only provides the first several hundred hits) is a blog posting. Gathering link data from search engines is not particularly reliable (Thelwall n.d.), but Alta Vista reports that these three sites receive 143, 31, and 3 inbound hyperlinks, respectively.3 People search for things for very different reasons, but the Columbia site represents a concise, authoritative, and general overview of the condition. The last result is a blog entry by someone who is writing about her husband's recent infection, and while it certainly may be of interest to someone who is faced with similar circumstances, it does not represent the sort

CHAPTER EIGHT

Future Finding

At present, we think of search engines largely as a way to find information. In practice, we are already using them to find people and places as well. As we move to what has been termed an "internet of things," we begin to move beyond an index of knowledge, and toward an index of everything. As the sociable web grows to include not only the services we are familiar with, but collaborative virtual and augmented realities, the central position of search engines in social life will continue to gain strength.

What does that future search engine look like? There are indications both of technological alternatives to the current state of search, and of organizational differences. Experimental search engines present information in a map of clustered topics, or collect information from our life and use it to infer search restrictions. Just as the creation of content has been distributed in interesting ways over the last few years, there are indications that a centralized search engine may be just one of a number of alternatives for those engaging the social world online.

A 14-year-old subject in a study by Lewis and Fabos (2005) suggested "Everybody does it. I've grown up on it. It's like how you felt about stuff when you were growing up." She was talking about instant messaging, but the same could easily be said of search engines. They now fade into the background of our everyday activities and media use, only of note when they are frustratingly absent. Search engines remain in the news because of the clashes between the search giants and traditional sources of institutional power. While this has held our attention, research into new ways of wrangling the web continues. What does the future of search hold?

In the near term, many wonder what technologies might, as one commentator suggested, dethrone Google as the "start page of the internet" (Sterling 2007). This final chapter briefly explores some common predictions about the direction of search, and what these changes might mean for our social lives in the next decade.

Everything findable

Eric Brewer (2001) dreams of a search engine that will let him find things in his chaotic office. Because we have learned to turn to search, it can be frustrating when something is not searchable, but every day more of the world is becoming searchable.

The first step of this is to make all text searchable. The computer brought about two surprises. First, productivity did not increase, it decreased, and documents took more work to prepare instead of less. Second, the paperless office never really happened, and paper use has increased rather than decreased. Gradually, however, things are being born digital and never make their way onto paper. People bank, file their taxes, hand in their homework, distribute memos, and publish books online. All this digital media becomes fodder for search engines. While it may not yet be part of the general-purpose search engines like Google or Yahoo, eventually the contents of all of these work flows are likely to show up there as well, at least for those who are permitted to access them. There are even services that will open your mail and scan it for you, so that paper never pollutes your office. Optical character recognition (OCR) technologies are improving to such a degree that they are increasingly able to recognize written texts, as well as printed texts, allowing for at least partial indexing of hand-written documents for access in search engines (Milewski 2006). Under those conditions, Brewer's searchable office is almost here.

Things that were recorded in books, on audio tape, and on film are gradually being digitized, and often opened up to the web. Major book scanning projects by Google, Amazon, and the Internet Archive (supported by Microsoft and Yahoo) are aiming to unlock hundreds of years of printing and make it searchable. Once images, audio, and video are scanned, the question is how they will be made searchable. Especially now that do-it-yourself video is so common on the web, finding a way of searching that material - short of relying on creators and commentators to describe it in some useful way - has proven difficult. A great deal of current research is dedicated to extracting meaningful features from video, identifying faces, and recognizing music and speech.

A number of companies are working at making sense of the continual streams of data that are available. BBN Technologies, for example, has created a Broadcast Monitoring System, which detects speech and translates it in real time. It makes it possible to search for terms in your own language and find whether they have been mentioned in a broadcast anywhere in the world.

Some of the greatest innovations over the last few years have been in moving the once arcane field of geographical information systems into the public eye. Not only is there the possibility of using a searcher's geographical context to find "what's near me," but mapping and visualization products allow for far easier searching for locations, and navigating to those locations. Mapping will continue to improve with higher resolution and more quickly updated views. Google's Street View makes navigating through cities far easier, and no doubt this will continue to expand. By identifying the times and places videos and photographs were taken, it is possible to compile a profile of a location, assembled through a number of disparate records. Experimental robotic airships designed to deliver city-wide wireless connections also promise real-time aerial views of a city, a staple of science fiction (Haines 2005; Williams 2005). Google's investment in 23andMe, a site that provides