The Internet Archive Turns 20: A Behind The Scenes Look At Archiving The Web
JAN 18, 2016
Kalev Leetaru, CONTRIBUTOR

Internet Archive founder Brewster Kahle and some of the Archive's servers in 2006. (AP Photo/Ben Margot)
To most of the web surfing public, the Internet Archive’s Wayback Machine is the
face of the Archive’s web archiving activities. Via a simple interface, anyone can type
in a URL and see how it has changed over the last 20 years. Yet, behind that simple
search box lies an exquisitely complex assemblage of datasets and partners that
make possible the Archive’s vast repository of the web. How does the Archive really
work, what does its crawl workflow look like, how does it handle issues like
robots.txt, and what can all of this teach us about the future of web archiving?
Perhaps the first and most important detail to understand about the Internet
Archive’s web crawling activities is that it operates far more like a traditional library
archive than a modern commercial search engine. Most large web crawling
operations today run vast farms of standardized crawlers all operating in unison,
sharing a common set of rules and behaviors. They traditionally operate in
continuous crawling mode, in which the goal is to scour the web 24/7/365 and
attempt to identify and ingest every available URL.
In contrast, the Internet Archive is composed of myriad independent datasets,
feeds and crawls, each of which has very different characteristics and rules
governing its construction, with some run by the Archive and others by its many
partners and contributors. In place of a single standardized continuous crawl
with stable criteria and algorithms, there is a vibrant collage of inputs that all feed
into the Archive’s sum holdings. As Mark Graham, Director of the Wayback Machine,
put it in an email, the Internet Archive’s web materials are composed of “many
different collections driven by many organizations that have different approaches to
crawling.” At the time of this writing, the primary web holdings of the Archive total
more than 4.1 million items across 7,357 distinct collections, while its Archive-It
program has over 440 partner organizations overseeing specific targeted collections.
Contributors range from middle school students in Battle Ground, WA to the
National Library of France.
Those 4.1 million items comprise a treasure trove covering nearly every imaginable
topic and data type. There are crawls contributed by the Sloan Foundation and
Alexa, crawls run by IA on behalf of NARA and the Internet Memory Foundation,
mirrors of Common Crawl and even DNS inventories containing more than 2.5
billion records from 2013. Many specialty archives preserve the final snapshots of
now-defunct online communities like GeoCities and Wretch. Dedicated Archive-It
crawls preserve myriad hand-selected or sponsored websites on an ongoing basis,
such as the Wake Forest University Archives. These dedicated Archive-It crawls can
be accessed directly and in some cases appear to feed into the Wayback Machine,
accounting for why the Wake Forest site has been captured almost every Thursday and
Friday over the last two years like clockwork.
Alexa Internet has been a major source of the Archive's regular crawl data since
1996, with the Archive’s FAQ page stating “much of our archived web data comes
from our own crawls or from Alexa Internet's crawls ... Internet Archive's crawls
tend to find sites that are well linked from other sites ... Alexa Internet uses its own
methods to discover sites to crawl. It may be helpful to install the free Alexa toolbar
and visit the site you want crawled to make sure they know about it.”
Another prominent source is the Archive’s “Worldwide Web Crawls,” which are
described as follows: “Since September 10th, 2010, the Internet Archive has been running
Worldwide Web Crawls of the global web, capturing web elements, pages, sites and
parts of sites. Each Worldwide Web Crawl was initiated from one or more lists of
URLs that are known as ‘Seed Lists’ ... various rules are also applied to the logic of
each crawl. Those rules define things like the depth the crawler will try to reach for
each host (website) it finds.” With respect to how frequently the Archive crawls each
site, the only available insight is “For the most part a given host will only be
captured once per Worldwide Web Crawl, however it might be captured more
frequently (e.g. once per hour for various news sites) via other crawls.”
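
To make the idea of a seed list and a per-host depth rule concrete, here is a minimal sketch in Python. It is not the Archive's crawler: the link graph is a small in-memory dictionary standing in for real page fetches, and the names `crawl`, `SEEDS`, `LINKS` and `max_depth` are all hypothetical. It only assumes what the description above states, namely that a crawl starts from seed URLs, that each URL is captured at most once per crawl, and that a rule caps how deep the crawler goes within each host.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical seed list and toy link graph (url -> outgoing links).
SEEDS = ["http://a.example/"]
LINKS = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": ["http://a.example/3"],
    "http://b.example/": ["http://b.example/x"],
}


def crawl(seeds, links, max_depth):
    """Breadth-first crawl that captures each URL at most once.

    Depth is counted per host: it grows by one for every same-host hop
    and resets to zero when a link leads to a different host.
    """
    captured = []
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        captured.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue
            same_host = urlsplit(target).hostname == urlsplit(url).hostname
            next_depth = depth + 1 if same_host else 0
            if next_depth > max_depth:
                continue  # deeper than the per-host rule allows
            seen.add(target)
            queue.append((target, next_depth))
    return captured


print(crawl(SEEDS, LINKS, max_depth=2))
```

With a depth limit of 2, the page three hops into `a.example` is never captured, while the newly discovered host `b.example` starts again from depth zero.
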
The most recent crawl appears to be Wide Crawl Number 13, created on January 9,
2015 and running through the present. Few details are available regarding the crawls,
though the March 2011 crawl (Wide 2) states it ran from March 9, 2011 to December
23, 2011, capturing 2.7 billion snapshots of 2.3 billion unique URLs from a total of
29 million unique websites. The documentation notes that it used the Alexa Top 1
Million ranking as its seed list and excluded sites with robots.txt directives. As a
warning for researchers, the collection notes “We also included repeated crawls of
some Argentinian government sites, so looking at results by country will be
somewhat skewed.”
Augmenting these efforts, the Archive’s No More 404 program provides live feeds
from the GDELT Project, Wikipedia and WordPress. The GDELT Project provides a
daily list of all URLs of online news coverage it monitors around the world, which
the Archive then crawls and archives, vastly expanding the Archive’s reach into the
non-Western world. The Wikipedia feed monitors the “[W]ikipedia IRC channel for
updated article[s], extracts newly added citations, and feed[s] those URLs for
crawling,” while the WordPress feed scans “WordPress's official blog update stream,
and schedules each permalink URL of new post for crawling.” These greatly expand
the Archive’s holdings of news and other material relating to current events.
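
The Wikipedia feed's core step, extracting newly added citations from an updated article, can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the Archive's actual code: it does not listen to any IRC channel, it simply compares two revisions of an article's text, and the function name `newly_added_urls` and the URL pattern are hypothetical choices.

```python
import re

# Rough pattern for http(s) URLs inside wiki markup (an assumption, not a
# complete URL grammar): stop at whitespace and common markup delimiters.
URL_PATTERN = re.compile(r'https?://[^\s\]|<>}"]+')


def newly_added_urls(old_text, new_text):
    """Return URLs present in the new revision but not in the old one,
    in order of first appearance and without duplicates."""
    old_urls = set(URL_PATTERN.findall(old_text))
    added = []
    for url in URL_PATTERN.findall(new_text):
        if url not in old_urls and url not in added:
            added.append(url)
    return added


old_revision = "See [http://example.org/a Source A]."
new_revision = (
    "See [http://example.org/a Source A] and "
    "<ref>{{cite web|url=http://example.com/news/1|title=T}}</ref>."
)
print(newly_added_urls(old_revision, new_revision))
```

Each URL returned by such a comparison would then be handed to a crawler for capture, which is the role the quoted description assigns to the feed.
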
Some crawls are designed to make a single one-time capture to ensure that at least
one copy of everything on a given site is preserved, while others are designed to
intensively recrawl a small subset of hand-selected sites at a regular interval to
ensure both that new content is found and that all previously-identified content is
checked for any changes and freshly archived. In terms of how frequently the
Archive recrawls a given site, Mr. Graham wrote that “it is a function of the hows,
whats and whys of our crawls. The Internet Archive does not crawl all sites equally
nor is our crawl frequency strictly a function of how popular a site is.” He goes on to
caution “I would expect any researcher would be remiss to not take the fluid nature
of the web, and the crawls of the [Internet Archive], into consideration” with respect
to interpreting the highly variable nature of the Archive’s recrawl rate.
Though it acts as the general public’s primary gateway to the Archive’s web
materials, the Wayback Machine is merely a public interface to a limited subset of all
these holdings. Only a portion of what the Archive crawls or receives from external
organizations and partners is made available in the Wayback Machine, though as
Mr. Graham noted there is at present “no master flowchart of the source of captures
that are available via the Wayback Machine” so it is difficult to know what percent of
the holdings above can be found through the Wayback Machine’s public interface.
Moreover, large portions of the Archive’s holdings carry notices that access to them
is restricted, often due to embargos, license agreements, or other processes and
policies of the Archive.
In this way, the Archive is essentially a massive global collage of crawls and datasets,
some conducted by the Archive itself, others contributed by partners. Some focus on
the open web, some focus on the foundations of the web’s infrastructure, and others
focus on very narrow slices of the web as defined by contributing sponsors or
Archive staff. Some are obtained through donations, some through targeted
acquisitions, and others compiled by the Archive itself, much in the way a traditional
paper archive operates. Indeed, the Archive is even more similar to traditional
archives in its use of a dark archive in which only a portion of its holdings are
publicly accessible, with the rest having various access restrictions and
documentation ranging from detailed descriptions to simple item placeholders.
This is in marked contrast to the way the Archive is often portrayed by
outsiders as a traditional centralized continuous crawl infrastructure, with a large
farm of standardized crawlers ingesting the open web and feeding the Wayback
Machine akin to what a traditional commercial search engine might do. The Archive
has essentially taken the traditional model of a library archive and brought it into
the digital era, rather than take the model of a search engine and add a preservation
There are likely many reasons for this architectural decision. It is certainly not the
difficulty of building such systems: there are numerous open source infrastructures
and technologies that make it highly tractable to build continuous web-scale
crawlers given the amount of hardware available to the Archive. Indeed, I myself
have been building global web scale crawling systems since 1995 and while still a
senior in high school in 2000 launched a whole-of-web continuous crawling system
with sideband recrawlers and an array of realtime content analysis and web mining
algorithms running at the NSF-supported supercomputing center NCSA.
Why then has the Archive employed such a patchwork approach to web archival,
rather than the established centralized and standardized model of its commercial
peers? Part of this may go back to the Archive’s roots. When the Internet Archive
was first formed, Alexa Internet was the primary source of its collections, donating
its daily open crawl data. The Archive therefore had little need to run its own
whole-of-web crawls, since it had a large commercial partner providing it such a feed. It
could instead focus on supplementing that general feed with specialized crawls
focusing on particular verticals and partner with other crawling organizations to
mirror their archives.
From the chronology of datasets that make up its web holdings, the Archive appears
to have evolved in this way as a central repository and custodian of web data, taking
on the role of archivist and curator, rather than trying to build its own centralized
continuous crawl of the entire web. Over time it appears to have taken on an
ever-expanding collection role of its own, running its own general purpose web-scale
crawls and bolstering them with a rapidly growing assortment of specialized crawls.
With all of this data pouring in from across the world, a key question is how the
Internet Archive deals with exclusions, especially the ubiquitous “robots.txt” crawler
exclusion protocol.
The Internet Archive's Archive-It program appears to strictly enforce robots.txt files,
requiring special permission for a given crawl to ignore them: “By default, the
Archive-It crawler honors and respects all robots.txt exclusion requests. On a case
by case basis institutions can set up rules to ignore robots.txt blocks for specific
sites, but this is not available in Archive-It accounts by default. If you think you may
need to ignore robots.txt for a site, please contact the Archive-It team for more
information or to enable this feature for your account.”
In contrast, the Library of Congress uses a strict opt-in process and “notifies each
site that we would like to include in the archive (with the exception of government
websites), prior to archiving. In some cases, the e-mail asks permission to archive or
to provide off-site access to researchers.” The Library uses the Internet Archive to
perform its crawling and ignores robots.txt for those crawls: “The Library of
Congress has contracted with the Internet Archive to collect content from websites
at regular intervals ... the Internet Archive uses the Heritrix crawler to collect
websites on behalf of the Library of Congress. Our crawler is instructed to bypass
robots.txt in order to obtain the most complete and accurate representation of
websites such as yours.” In this case, the Library views the written archival
permission as taking precedence over robots.txt directives: “The Library notifies site
owners before crawling which means we generally ignore robots.txt exclusions.”
The British Library appears to ignore robots.txt in order to preserve page rendering
elements and for selected content deemed culturally important, stating “Do you
respect robots.txt? As a rule, yes: we do follow the robots exclusion protocol.
However, in certain circumstances we may choose to overrule robots.txt. For
instance: if content is necessary to render a page (e.g. Javascript, CSS) or content is
deemed of curatorial value and falls within the bounds of the Legal Deposit Libraries
Act 2003.”
Similarly, the National Library of France states “In accordance with the Heritage
Code (art L132-2-1), the BnF is authorized to disregard the robot exclusion protocol,
also called robots.txt. ... To accomplish its legal deposit mission, the BnF can choose
to collect some of the files covered by robots.txt when they are needed to reconstruct
the original form of the website (particularly in the case of image or style sheet files).
This non-compliance with robots.txt does not conflict with the protection of private
correspondence guaranteed by law, because all data made available on the Internet
are considered to be public, whether they are or are not filtered by robots.txt.”
The Internet Archive’s general approach to handling robots.txt exclusions on the
open web appears to have evolved over time. The first available snapshot of the
Archive’s FAQ, dating to October 4, 2002, states “The Internet Archive is not
interested in preserving or offering access to Websites or other Internet documents
of persons who do not want their materials in the collection. By placing a simple
robots.txt file on your Web server, you can exclude your site from being crawled as
well as exclude any historical pages from the Wayback Machine.” This statement was
preserved without modification for the next decade, through at least April 2nd,
2013. A few weeks later, on April 20th, 2013, the text had been rewritten to state
“You can exclude your site from display in the Wayback Machine by placing a simple
robots.txt file on your Web server.” The new language removed the statement “you
can exclude your site from being crawled” and replaced it with “you can exclude your
site from display.” Indeed, this new language has carried through to the present.
From its very first snapshot of October 4, 2002 through sometime the week of
November 8th, 2015, the FAQ further stated “Alexa Internet, the company that
crawls the web for the Internet Archive, does respect robots.txt instructions, and
even does so retroactively. If a web site owner decides he / she prefers not to have a
web crawler visiting his / her files and sets up robots.txt on the site, the Alexa
crawlers will stop visiting those files and will make unavailable all files previously
gathered from that site. This means that sometimes, while using the Internet
Archive Wayback Machine, you may find a site that is unavailable due to robots.txt.”
Yet, just a few days later on November 14th, 2015 the FAQ had been revised to state
only “Such sites may have been excluded from the Wayback Machine due to a
robots.txt file on the site or at a site owner’s direct request. The Internet Archive
strives to follow the Oakland Archive Policy for Managing Removal Requests And
Preserving Archival Integrity.” The current FAQ points to an archived copy of the
Oakland Archive Policy from December 2002 that states “To remove a site from the
Wayback Machine, place a robots.txt file at the top level of your site ... It will tell the
Internet Archive's crawler not to crawl your site in the future” and notes that
“ia_archiver” is the proper user agent to exclude the Archive’s crawlers from
accessing a site.
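
For readers unfamiliar with how a crawler interprets such a file, the following Python sketch uses the standard library's `urllib.robotparser` to evaluate a sample robots.txt that blocks the `ia_archiver` user agent. The robots.txt contents, the `example.com` URLs, the `OtherBot` name and the helper `allowed` are illustrative assumptions; this shows how a compliant parser reads the rules, not how the Archive's own crawlers are implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block ia_archiver everywhere, and block
# every other crawler only from /private/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("ia_archiver", "http://example.com/page"))
print(allowed("OtherBot", "http://example.com/page"))
print(allowed("OtherBot", "http://example.com/private/x"))
```

Whether a given archive acts on the answer is a policy decision, which is exactly where the institutions quoted above differ.
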
The Archive's evolving stance with respect to robots.txt files appears to explain why
attempting to access the Washington Post through the Wayback Machine yields an
error that it has been blocked due to robots.txt, yet the site has been crawled and
preserved by the Internet Archive every few days over the last four years. Similarly,
accessing USA Today or the Bangkok Post through the Wayback Machine yields the
error message “This URL has been excluded from the Wayback Machine,” but
happily both sites are being preserved through regular snapshots. Here the
robots.txt exclusion appears to be used only to govern display in the Wayback
Machine’s public interface, with excluded sites continuing to be crawled and
preserved in the Archive’s dark archive for posterity to ensure they are not lost.
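
The distinction between excluding a site from crawling and excluding it from display can be captured in a toy model. The Python sketch below is purely illustrative of the behavior inferred above: the class `ToyArchive` and its methods are hypothetical, and nothing here reflects the Archive's real storage or access systems. Only the notice text is taken from the error message quoted in this article.

```python
from urllib.parse import urlsplit

EXCLUSION_NOTICE = "This URL has been excluded from the Wayback Machine"


class ToyArchive:
    """Toy model: captures are always stored, display is gated separately."""

    def __init__(self):
        self.captures = {}             # url -> list of (timestamp, content)
        self.display_excluded = set()  # hostnames hidden from the public view

    def capture(self, url, timestamp, content):
        # Crawling and preservation continue regardless of display exclusions.
        self.captures.setdefault(url, []).append((timestamp, content))

    def exclude_from_display(self, hostname):
        self.display_excluded.add(hostname)

    def public_lookup(self, url):
        # The public interface returns a notice for excluded hosts.
        if urlsplit(url).hostname in self.display_excluded:
            return EXCLUSION_NOTICE
        return self.captures.get(url, [])

    def dark_archive_count(self, url):
        # Number of preserved snapshots, visible or not.
        return len(self.captures.get(url, []))


archive = ToyArchive()
archive.capture("http://news.example/", "20150101", "front page v1")
archive.exclude_from_display("news.example")
archive.capture("http://news.example/", "20150104", "front page v2")
print(archive.public_lookup("http://news.example/"))
print(archive.dark_archive_count("http://news.example/"))
```

In this model the exclusion changes what a visitor sees but not what is preserved, which matches the pattern observed for the news sites above.
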
Despite having several programs dedicated to crawling online news, including both
International News Crawls and a special “high-value news sites” collection, not all
news sites are equally represented in the Archive’s stand-alone archives, whether or
not they have robots.txt exclusions. The Washington Post has over 303 snapshots in
its archive, while the New York Times has 124 and the Daily Mail has 196. Yet, Der
Spiegel has just 34 captures in its stand-alone archive from 2012 to 2014, with none
since. Just two of the five national newspapers of Japan have such archives, Asahi
Shimbun (just 64 snapshots since 2012) and Nihon Keizai Shimbun (just 22 snapshots
since 2012), while the other three have no such archives: Mainichi Shimbun, Sankei
Shimbun, and Yomiuri Shimbun. In India, of the top three newspapers by
circulation as of 2013, The Times of India had just 32 snapshots since 2012, The
Hindu does not have its own archive, and the Hindustan Times had 250 snapshots
since 2012. Of the top three newspapers, one is not present at all and The Times of
India has nearly 8 times fewer snapshots than the Hindustan Times, despite having
2.5 times the circulation in 2013.
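
The "nearly 8 times" figure follows directly from the snapshot counts given above, as this small Python check shows. The dictionary is just a restatement of the two counts from the text; no circulation figures are included because the article gives only their ratio.

```python
# Snapshot counts since 2012, as reported in the text above.
snapshots = {"The Times of India": 32, "Hindustan Times": 250}

ratio = snapshots["Hindustan Times"] / snapshots["The Times of India"]
print(f"Hindustan Times has {ratio:.1f}x the snapshots of The Times of India")
```
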
Each of these newspapers is likely to be captured through any one of the Archive’s
many other crawls and feeds, but the lack of standalone dedicated collections for
these papers and the apparent Western bias in the existence of such standalone
archives suggest further community input may be required. Indeed, it appears that
a number of the Archive’s dedicated site archives are driven by their Alexa Top 1
Million rankings.
Why is it important to understand how web archives work? As I pointed out this
past November, there has been very little information published in public forums
documenting precisely how our major web archives work and what feeds into them.
As the Internet Archive and its peers begin to expand their support of researcher use
of their collections, it is critically important that we understand how precisely these
archives have been built and the implications of those decisions and their biases for
the findings we are ultimately able to derive. Moreover, given how fast the web is
disappearing before our eyes, having greater transparency and community input
into our web archives will help ensure that they are not overly biased towards the
English-speaking Western world and are able to capture the web’s most vulnerable
Greater insight is not an all-or-none proposition of having petabytes of crawler log
files or no information at all. It is not necessary to have access to a log of every single
action taken by any of the Archive’s crawlers in its history. Yet, it is also the case that
simply treating archives as black boxes without the slightest understanding of how
they were constructed and basing our findings on those hidden biases is no longer
feasible as the scholarly world of data analysis grows up and matures. As web
archives transition from being simple “as-is” preservation and retrieval sites towards
being our only records of society's online existence and powering an ever-growing
fraction of scholarly research, we need to at least understand how they function at a
high level and what data sources they draw from.
Putting this all together, what can we learn from these findings? Perhaps most
importantly, we have seen that the Internet Archive operates far more like a
traditional library archive than a modern commercial search engine. Rather than a
single centralized and standardized continuous crawling farm, the Archive’s
holdings are composed of millions of files in thousands of collections from hundreds
of partners, all woven together into a rich collage which the Archive preserves as
custodian and curator. The Wayback Machine is seen to be merely a public interface
to an unknown fraction of these holdings, with the Archive’s real treasure trove of
millions of web materials being scattered across its traditional item collections.
From the standpoint of scholarly research use of the Archive, the patchwork
composition of its web holdings and the vast and incredibly diverse landscape of inputs
present unique challenges that have not been adequately addressed or discussed. At
the same time, those fearful that robots.txt exclusions are leading to whole swaths of
the web being lost can breathe a bit easier given the Archive’s evolving treatment of
them, which appears to be in line with an industry-wide movement towards ignoring
exclusions when it comes to archival.
In the end, as the Internet Archive turns 20 this year, its evolution over the last two
decades offers a fascinating look back at how the web itself has evolved, from its
changing views on robots.txt to its growing transition from custodian to curator to
collector. Along the way we get an incredible glimpse at just how hard it really is to
try and archive the whole web for perpetuity and the tireless work of the Archive to
build one of the Internet’s most unique collections.
