Forbes
Tech
JAN 18, 2016 @ 10:59 AM

The Internet Archive Turns 20: A Behind The Scenes Look At Archiving The Web

Kalev Leetaru, CONTRIBUTOR
I write about the broad intersection of data and society.
Opinions expressed by Forbes Contributors are their own.

Internet Archive founder Brewster Kahle and some of the Archive's servers. (AP Photo/Ben Margot)

To most of the web surfing public, the Internet Archive's Wayback Machine is the face of the Archive's web archiving activities. Via a simple interface, anyone can type in a URL and see how it has changed over the last 20 years. Yet, behind that simple search box lies an exquisitely complex assemblage of datasets and partners that make possible the Archive's vast repository of the web. How does the Archive really work, what does its crawl workflow look like, how does it handle issues like robots.txt, and what can all of this teach us about the future of web archiving?

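For readers who would like to poke at this programmatically rather than through the search box, the Wayback Machine also exposes a public "availability" endpoint that returns its closest archived capture for a given URL. The short Python sketch below is only an illustration of that documented endpoint; the exact response fields should be checked against the Archive's current API documentation before relying on them.

    # Minimal sketch: ask the Wayback Machine for its closest capture of a URL.
    # Assumes the public https://archive.org/wayback/available endpoint and its
    # documented JSON layout; verify both before building on this.
    import json
    import urllib.parse
    import urllib.request

    def closest_capture(url, timestamp=None):
        params = {"url": url}
        if timestamp:
            params["timestamp"] = timestamp  # e.g. "20160118" for "closest to Jan 18, 2016"
        query = urllib.parse.urlencode(params)
        with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
            data = json.load(resp)
        # The "closest" entry carries the archived URL and capture timestamp, if any exists.
        return data.get("archived_snapshots", {}).get("closest")

    print(closest_capture("forbes.com", "20160118"))
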
Perhaps the first and most important detail to understand about the Internet Archive's web crawling activities is that it operates far more like a traditional library archive than a modern commercial search engine. Most large web crawling operations today run vast farms of standardized crawlers working in unison, sharing a common set of rules and behaviors. They traditionally operate in continuous crawling mode, in which the goal is to scour the web 24/7/365 and attempt to identify and ingest every available URL.

In contrast, the Internet Archive comprises a myriad of independent datasets, feeds and crawls, each of which has very different characteristics and rules governing its construction, with some run by the Archive and others by its many partners and contributors. In place of a single standardized continuous crawl with stable criteria and algorithms, there is a vibrant collage of inputs that all feed into the Archive's sum holdings. As Mark Graham, Director of the Wayback Machine, put it in an email, the Internet Archive's web materials are comprised of "many different collections driven by many organizations that have different approaches to crawling." At the time of this writing, the primary web holdings of the Archive total more than 4.1 million items across 7,357 distinct collections, while its Archive-It program has over 440 partner organizations overseeing specific targeted collections. Contributors range from middle school students in Battle Ground, WA to the National Library of France.

Those 4.1 million items comprise a treasure trove covering nearly every imaginable topic and data type. There are crawls contributed by the Sloan Foundation and Alexa, crawls run by IA on behalf of NARA and the Internet Memory Foundation, mirrors of Common Crawl and even DNS inventories containing more than 2.5 billion records from 2013. Many specialty archives preserve the final snapshots of now-defunct online communities like GeoCities and Wretch. Dedicated Archive-It crawls preserve a myriad of hand-selected or sponsored websites on an ongoing basis, such as the Wake Forest University Archives. These dedicated Archive-It crawls can be accessed directly and in some cases appear to feed into the Wayback Machine, accounting for why the Wake Forest site has been captured almost every Thursday and Friday over the last two years like clockwork.

Alexa Internet has been a major source of the Archive's regular crawl data since 1996, with the Archive's FAQ page stating "much of our archived web data comes from our own crawls or from Alexa Internet's crawls… Internet Archive's crawls tend to find sites that are well linked from other sites… Alexa Internet uses its own methods to discover sites to crawl. It may be helpful to install the free Alexa toolbar and visit the site you want crawled to make sure they know about it."

Another prominent source is the Archive's "Worldwide Web Crawls," which are described as follows: "Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites. Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as 'Seed Lists'… various rules are also applied to the logic of each crawl. Those rules define things like the depth the crawler will try to reach for each host (website) it finds." With respect to how frequently the Archive crawls each site, the only available insight is "For the most part a given host will only be captured once per Worldwide Web Crawl, however it might be captured more frequently (e.g. once per hour for various news sites) via other crawls."

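The Archive has not published the full internals of these crawls, which are typically run with its open source Heritrix crawler, but the seed-list-plus-rules model is easy to illustrate. The sketch below is emphatically not the Archive's code: it is a toy breadth-first crawler that starts from a hypothetical seed list and enforces a per-host depth limit, which is the kind of rule the description above refers to.

    # Illustrative toy only: a seed-list-driven crawler with a per-host depth limit,
    # loosely mirroring the "Seed Lists" plus per-crawl rules described above.
    # Not the Internet Archive's actual crawl configuration.
    import re
    from collections import deque
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    SEEDS = ["https://example.com/"]   # hypothetical seed list
    MAX_DEPTH = 2                      # "depth the crawler will try to reach" per host

    def crawl(seeds, max_depth):
        queue = deque((url, 0) for url in seeds)
        seen = set(seeds)
        while queue:
            url, depth = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue
            yield url, html  # hand the capture off to be written to a WARC, etc.
            if depth >= max_depth:
                continue
            for link in re.findall(r'href="([^"]+)"', html):
                nxt = urljoin(url, link)
                # stay on the same host and stay within the depth budget
                if urlparse(nxt).netloc == urlparse(url).netloc and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))

    for captured_url, _ in crawl(SEEDS, MAX_DEPTH):
        print("captured", captured_url)
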
The most recent crawl appears to be Wide Crawl Number 13, created on January 9, 2015 and running through the present. Few details are available regarding the crawls, though the March 2011 crawl (Wide 2) states it ran from March 9, 2011 to December 23, 2011, capturing 2.7 billion snapshots of 2.3 billion unique URLs from a total of 29 million unique websites. The documentation notes that it used the Alexa Top 1 Million ranking as its seed list and excluded sites with robots.txt directives. As a warning for researchers, the collection notes "We also included repeated crawls of some Argentinian government sites, so looking at results by country will be somewhat skewed."

Augmenting these efforts, the Archive's No More 404 program provides live feeds from the GDELT Project, Wikipedia and WordPress. The GDELT Project provides a daily list of all URLs of online news coverage it monitors around the world, which the Archive then crawls and archives, vastly expanding the Archive's reach into the non-Western world. The Wikipedia feed monitors the "[W]ikipedia IRC channel for updated article[s], extracts newly added citations, and feed[s] those URLs for crawling," while the WordPress feed scans "WordPress's official blog update stream, and schedules each permalink URL of new post for crawling." These feeds greatly expand the Archive's holdings of news and other material relating to current events.

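The mechanics of such a feed are simple to sketch. Assuming nothing more than a plain text file of newly published URLs (one per line) and the Archive's public "Save Page Now" endpoint, a bare-bones feed consumer might look like the following; the feed file here is hypothetical, and a production pipeline like No More 404 presumably hands URL lists to the Archive's own crawlers in bulk rather than hitting the public endpoint one page at a time.

    # Hypothetical sketch: push a feed of newly published URLs to the Wayback
    # Machine's public "Save Page Now" endpoint (https://web.archive.org/save/).
    # The feed file and pacing are assumptions for illustration only.
    import time
    import urllib.parse
    import urllib.request

    FEED_FILE = "new_urls.txt"  # hypothetical: one newly published URL per line

    def save_page_now(url):
        request = urllib.request.Request(
            "https://web.archive.org/save/" + urllib.parse.quote(url, safe=":/?&="),
            headers={"User-Agent": "example-feed-submitter/0.1"},
        )
        with urllib.request.urlopen(request, timeout=60) as resp:
            return resp.status  # a 200 response generally means a capture was triggered

    with open(FEED_FILE) as feed:
        for line in feed:
            url = line.strip()
            if url:
                print(url, save_page_now(url))
                time.sleep(5)  # be polite; the public endpoint is rate limited
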
Some crawls are designed to make a single one-time capture to ensure that at least one copy of everything on a given site is preserved, while others are designed to intensively recrawl a small subset of hand-selected sites on a regular interval to ensure both that new content is found and that all previously identified content is checked for any changes and freshly archived. In terms of how frequently the Archive recrawls a given site, Mr. Graham wrote that "it is a function of the hows, whats and whys of our crawls. The Internet Archive does not crawl all sites equally nor is our crawl frequency strictly a function of how popular a site is." He goes on to caution, "I would expect any researcher would be remiss to not take the fluid nature of the web, and the crawls of the [Internet Archive], into consideration" with respect to interpreting the highly variable nature of the Archive's recrawl rate.

Though it acts as the general public's primary gateway to the Archive's web materials, the Wayback Machine is merely a public interface to a limited subset of all these holdings. Only a portion of what the Archive crawls or receives from external organizations and partners is made available in the Wayback Machine, though as Mr. Graham noted there is at present "no master flowchart of the source of captures that are available via the Wayback Machine," so it is difficult to know what percent of the holdings above can be found through the Wayback Machine's public interface. Moreover, large portions of the Archive's holdings carry notices that access to them is restricted, often due to embargos, license agreements, or other processes and policies of the Archive.

In this way, the Archive is essentially a massive global collage of crawls and datasets, some conducted by the Archive itself, others contributed by partners. Some focus on the open web, some focus on the foundations of the web's infrastructure, and others focus on very narrow slices of the web as defined by contributing sponsors or Archive staff. Some are obtained through donations, some through targeted acquisitions, and others are compiled by the Archive itself, much in the way a traditional paper archive operates. Indeed, the Archive is even more similar to traditional archives in its use of a dark archive in which only a portion of its holdings are publicly accessible, with the rest having various access restrictions and documentation ranging from detailed descriptions to simple item placeholders.

This is in marked contrast to the picture often painted of the Archive by outsiders: a traditional centralized continuous crawl infrastructure, with a large farm of standardized crawlers ingesting the open web and feeding the Wayback Machine, akin to what a traditional commercial search engine might do. The Archive has essentially taken the traditional model of a library archive and brought it into the digital era, rather than taking the model of a search engine and adding a preservation component to it.

There are likely many reasons for this architectural decision. It is certainly not the difficulty of building such systems: there are numerous open source infrastructures and technologies that make it highly tractable to build continuous web-scale crawlers given the amount of hardware available to the Archive. Indeed, I myself have been building global web-scale crawling systems since 1995, and while still a senior in high school in 2000 I launched a whole-of-web continuous crawling system with sideband recrawlers and an array of realtime content analysis and web mining algorithms running at the NSF-supported supercomputing center NCSA.

Why then has the Archive employed such a patchwork approach to web archiving, rather than the established centralized and standardized model of its commercial peers? Part of this may go back to the Archive's roots. When the Internet Archive was first formed, Alexa Internet was the primary source of its collections, donating its daily open crawl data. The Archive therefore had little need to run its own whole-of-web crawls, since it had a large commercial partner providing it such a feed. It could instead focus on supplementing that general feed with specialized crawls focusing on particular verticals and on partnering with other crawling organizations to mirror their archives.

From the chronology of datasets that make up its web holdings, the Archive appears to have evolved in this way as a central repository and custodian of web data, taking on the role of archivist and curator, rather than trying to build its own centralized continuous crawl of the entire web. Over time it appears to have taken on an ever-expanding collection role of its own, running its own general purpose web-scale crawls and bolstering them with a rapidly growing assortment of specialized crawls.

With all of this data pouring in from across the world, a key question is how the Internet Archive deals with exclusions, especially the ubiquitous "robots.txt" crawler exclusion protocol.

The Internet Archive's Archive-It program appears to strictly enforce robots.txt files, requiring special permission for a given crawl to ignore them: "By default, the Archive-It crawler honors and respects all robots.txt exclusion requests. On a case by case basis institutions can set up rules to ignore robots.txt blocks for specific sites, but this is not available in Archive-It accounts by default. If you think you may need to ignore robots.txt for a site, please contact the Archive-It team for more information or to enable this feature for your account."

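The check itself is mechanical, and Python's standard library even ships a robots.txt parser. The sketch below shows the decision a compliant crawler makes before fetching a page; the user agent string and URLs are placeholders, not the actual identifiers any of these crawlers use.

    # Sketch of the robots.txt check a compliant crawler performs before fetching.
    # The user agent and target URL are illustrative placeholders.
    from urllib import robotparser

    AGENT = "examplebot"  # hypothetical crawler user agent
    TARGET = "https://example.com/some/page.html"

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    if rp.can_fetch(AGENT, TARGET):
        print("allowed: the crawler may fetch", TARGET)
    else:
        print("blocked: robots.txt excludes", TARGET, "for", AGENT)
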
In contrast, the Library of Congress uses a strict opt-in process and "notifies each site that we would like to include in the archive (with the exception of government websites), prior to archiving. In some cases, the e-mail asks permission to archive or to provide off-site access to researchers." The Library uses the Internet Archive to perform its crawling and ignores robots.txt for those crawls: "The Library of Congress has contracted with the Internet Archive to collect content from websites at regular intervals… the Internet Archive uses the Heritrix crawler to collect websites on behalf of the Library of Congress. Our crawler is instructed to bypass robots.txt in order to obtain the most complete and accurate representation of websites such as yours." In this case, the Library views the written archival permission as taking precedence over robots.txt directives: "The Library notifies site owners before crawling which means we generally ignore robots.txt exclusions."

The British Library appears to ignore robots.txt in order to preserve page rendering elements and for selected content deemed culturally important, stating "Do you respect robots.txt? As a rule, yes: we do follow the robots exclusion protocol. However, in certain circumstances we may choose to overrule robots.txt. For instance: if content is necessary to render a page (e.g. Javascript, CSS) or content is deemed of curatorial value and falls within the bounds of the Legal Deposit Libraries Act 2003."

Similarly, the National Library of France states "In accordance with the Heritage Code (art. L132-2-1), the BnF is authorized to disregard the robot exclusion protocol, also called robots.txt. To accomplish its legal deposit mission, the BnF can choose to collect some of the files covered by robots.txt when they are needed to reconstruct the original form of the website (particularly in the case of image or style sheet files). This non-compliance with robots.txt does not conflict with the protection of private correspondence guaranteed by law, because all data made available on the Internet are considered to be public, whether they are or are not filtered by robots.txt."

The Internet Archive's general approach to handling robots.txt exclusions on the open web appears to have evolved over time. The first available snapshot of the Archive's FAQ, dating to October 4, 2002, states "The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine." This statement is preserved without modification for the next decade, through at least April 2nd, 2013. A few weeks later, on April 20th, 2013, the text had been rewritten to state "You can exclude your site from display in the Wayback Machine by placing a simple robots.txt file on your Web server." The new language removed the statement "you can exclude your site from being crawled" and replaced it with "you can exclude your site from display." Indeed, this new language has carried through to the present.

From its very first snapshot of October 4, 2002 through sometime the week of November 8th, 2015, the FAQ further stated "Alexa Internet, the company that crawls the web for the Internet Archive, does respect robots.txt instructions, and even does so retroactively. If a web site owner decides he/she prefers not to have a web crawler visiting his/her files and sets up robots.txt on the site, the Alexa crawlers will stop visiting those files and will make unavailable all files previously gathered from that site. This means that sometimes, while using the Internet Archive Wayback Machine, you may find a site that is unavailable due to robots.txt."

Yet, just a few days later, on November 14th, 2015, the FAQ had been revised to state only "Such sites may have been excluded from the Wayback Machine due to a robots.txt file on the site or at a site owner's direct request. The Internet Archive strives to follow the Oakland Archive Policy for Managing Removal Requests And Preserving Archival Integrity." The current FAQ points to an archived copy of the Oakland Archive Policy from December 2002 that states "To remove a site from the Wayback Machine, place a robots.txt file at the top level of your site… It will tell the Internet Archive's crawler not to crawl your site in the future" and notes that "ia_archiver" is the proper user agent to exclude the Archive's crawlers from accessing a site.

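Concretely, the exclusion the Oakland policy describes amounts to a two-line robots.txt placed at the top level of a site and addressed to that user agent; the file below is simply an illustration of those two directives.

    User-agent: ia_archiver
    Disallow: /

As discussed below, under the Archive's current interpretation this appears to govern display in the Wayback Machine more than it guarantees the site will never be crawled.
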
The Archive's evolving stance with respect to robots.txt files appears to explain why attempting to access the Washington Post through the Wayback Machine yields an error that it has been blocked due to robots.txt, yet the site has been crawled and preserved by the Internet Archive every few days over the last four years. Similarly, accessing USA Today or the Bangkok Post through the Wayback Machine yields the error message "This URL has been excluded from the Wayback Machine," but happily both sites are being preserved through regular snapshots. Here the robots.txt exclusion appears to be used only to govern display in the Wayback Machine's public interface, with excluded sites continuing to be crawled and preserved in the Archive's dark archive for posterity, to ensure they are not lost.

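One way to check what the index actually holds for a given URL is the Wayback Machine's CDX server API, which returns the raw list of captures rather than rendering them. The sketch below follows the publicly documented endpoint and parameters, but the exact output columns, and whether robots-excluded sites are also suppressed at this layer, are things a researcher should verify rather than assume.

    # Sketch: list captures of a URL from the public Wayback CDX index.
    # Endpoint and parameters follow the public CDX server documentation;
    # verify the response format (and any access restrictions) before relying on it.
    import json
    import urllib.parse
    import urllib.request

    def list_captures(url, limit=10):
        query = urllib.parse.urlencode({
            "url": url,
            "output": "json",
            "limit": str(limit),
            "from": "2016",  # restrict to captures from 2016 onward
        })
        with urllib.request.urlopen("https://web.archive.org/cdx/search/cdx?" + query) as resp:
            rows = json.load(resp)
        # The first row is a header such as: urlkey, timestamp, original, mimetype, ...
        return rows[1:] if rows else []

    for row in list_captures("example.com"):  # placeholder URL
        print(row)
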
Despite having several programs dedicated to crawling online news, including both International News Crawls and a special "high-value news sites" collection, not all news sites are equally represented in the Archive's stand-alone archives, whether or not they have robots.txt exclusions. The Washington Post has over 303 snapshots in its archive, while the New York Times has 124 and the Daily Mail has 196. Yet, Der Spiegel has just 34 captures in its stand-alone archive from 2012 to 2014, with none since. Just two of the five national newspapers of Japan have such archives: Asahi Shimbun (just 64 snapshots since 2012) and Nihon Keizai Shimbun (just 22 snapshots since 2012), while the other three, Mainichi Shimbun, Sankei Shimbun, and Yomiuri Shimbun, have no such archives. In India, of the top three newspapers by circulation as of 2013, The Times of India had just 32 snapshots since 2012, The Hindu does not have its own archive, and the Hindustan Times had 250 snapshots since 2012. Of those top three newspapers, one is not present at all and The Times of India has roughly one-eighth as many snapshots as the Hindustan Times, despite having 2.5 times its circulation in 2013.

Each of these newspapers is likely to be captured through any one of the Archive's many other crawls and feeds, but the lack of standalone dedicated collections for these papers and the apparent Western bias in the existence of such standalone archives suggests further community input may be required. Indeed, it appears that a number of the Archive's dedicated site archives are driven by their Alexa Top 1 Million rankings.

Why is it important to understand how web archives work? As I pointed out this past November, there has been very little information published in public forums documenting precisely how our major web archives work and what feeds into them. As the Internet Archive and its peers begin to expand their support of researcher use of their collections, it is critically important that we understand precisely how these archives have been built and the implications of those decisions and their biases for the findings we are ultimately able to derive. Moreover, given how fast the web is disappearing before our eyes, having greater transparency and community input into our web archives will help ensure that they are not overly biased towards the English-speaking Western world and are able to capture the web's most vulnerable materials.

Greater insight is not an all-or-nothing proposition of having either petabytes of crawler log files or no information at all. It is not necessary to have access to a log of every single action taken by any of the Archive's crawlers in its history. Yet, it is also the case that simply treating archives as black boxes, without the slightest understanding of how they were constructed, and basing our findings on those hidden biases is no longer feasible as the scholarly world of data analysis grows up and matures. As web archives transition from being simple "as-is" preservation and retrieval sites towards being our only records of society's online existence and powering an ever-growing fraction of scholarly research, we need to at least understand how they function at a high level and what data sources they draw from.

Putting this all together, what can we learn from these findings? Perhaps most importantly, we have seen that the Internet Archive operates far more like a traditional library archive than a modern commercial search engine. Rather than a single centralized and standardized continuous crawling farm, the Archive's holdings comprise millions of files in thousands of collections from hundreds of partners, all woven together into a rich collage which the Archive preserves as custodian and curator. The Wayback Machine is seen to be merely a public interface to an unknown fraction of these holdings, with the Archive's real treasure trove of millions of web materials being scattered across its traditional item collections. From the standpoint of scholarly research use of the Archive, the patchwork composition of its web holdings and its vast and incredibly diverse landscape of inputs present unique challenges that have not been adequately addressed or discussed. At the same time, those fearful that robots.txt exclusions are leading to whole swaths of the web being lost can breathe a bit easier given the Archive's evolving treatment of them, which appears to be in line with an industry-wide movement towards ignoring exclusions when it comes to archiving.

In the end, as the Internet Archive turns 20 this year, its evolution over the last two decades offers a fascinating look back at how the web itself has evolved, from its changing views on robots.txt to its growing transition from custodian to curator to collector. Along the way we get an incredible glimpse of just how hard it really is to try to archive the whole web in perpetuity, and of the tireless work of the Archive to build one of the Internet's most unique collections.