`Peter Mika
`Yahoo! Research
`Diagonal 177
`Barcelona, Spain
`Tim Potter
`Yahoo! Research
`Diagonal 177
`Barcelona, Spain
`We provide an analysis of the adoption of metadata standards
`on the Web based on a large crawl of the Web. In particular,
`we look at which syntaxes and vocabularies publishers are
`using to mark up data inside HTML pages. We also describe
`the process that we followed and the difficulties involved
`in web data extraction.
`Embedding metadata inside HTML pages is one of the
`ways to publish structured data on the Web, often pre-
`ferred by publishers and consumers over other methods of
`exposing structured data, such as publishing data feeds,
`SPARQL endpoints or RDF/XML documents. Publishers
`prefer this method due to the ease of implementation and
`maintenance: since most webpages are dynamically gener-
`ated, adding markup simply requires extending the template
`that produces the pages. Consumers such as search engines
`are already accustomed to processing HTML and extraction
`fits naturally in their processing pipelines. The close coupling
`of the raw data and the HTML presentation of the data
`has other advantages; among others, it ensures that the raw
`data and the end-user presentation show the same information.
`In this paper, we describe the method by which we extracted
`metadata from a large web corpus and present some statistics.
`Results from similar experiments have already been published,
`so we also discuss the difficulty of comparing numbers
`across the various studies.
`Copyright is held by the author/owner(s).
`LDOW2012, April 16, 2012, Lyon, France.
`Previous studies have reported results on the usage of embedded
`metadata, including Bizer et al. at http://www.webdatacommons.org/.
`We also published an earlier analysis on a different corpus
`collected by Yahoo! Search. There are a number of factors
`that complicate the comparison of
`results. First, different studies use different web corpora.
`Our earlier study used a corpus collected by Yahoo!’s web
`crawler, while the current study uses a dataset collected by
`the Bing crawler. Bizer et al. analyze the data collected by
`http://www.commoncrawl.org, which has the obvious ad-
`vantage that it is publicly available. Second, the extraction
`methods may differ. For example, there is a multitude of
`microformats (one for each object type) and although most
`search engines and extraction libraries support the popular
`ones, different processors may recognize a different subset.
`Unlike the specifications of microdata and RDFa published
`by the W3C, the microformat specifications are also rather
`informal and thus different processors may extract different
`information from the same page. Further, even if the same
`information is extracted, the conversion of this information
`to RDF may differ across implementations. Third, extractors
`may differ in how lenient they are in accepting particular
`mistakes in the markup, leading to more or less information
`being extracted.
`We take as our starting point a sufficiently large sample
`of the web crawl produced by Bing’s web crawler during
`January 2012. After retaining information resources with
`a content type that includes text/html, we get a data set
`of 3,230,928,609 records with only the three fields required
`for analysis: the URL of the page, the content type and the
`downloaded content. If the crawler arrived at a page by
`following a redirect (or a chain of redirects), we considered
`the target of the redirect as the URL.
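The record selection described above can be sketched as follows. This is a minimal illustration only: the record layout and the field names (`url`, `content_type`, `content`, `redirect_chain`) are hypothetical, since the crawl format is not described in this paper.

```python
from typing import Dict, Iterable, Iterator, Tuple


def select_html_records(records: Iterable[Dict]) -> Iterator[Tuple[str, str, bytes]]:
    """Keep text/html records and reduce them to (url, content type, content)."""
    for record in records:
        content_type = record.get("content_type", "")
        # Retain only resources whose content type includes text/html.
        if "text/html" not in content_type.lower():
            continue
        # If the page was reached through redirects, use the final target as URL.
        redirects = record.get("redirect_chain") or []
        url = redirects[-1] if redirects else record["url"]
        yield url, content_type, record["content"]
```

Writing this as a generator keeps memory use flat, which matters for any input that is too large to hold in memory at once.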
`We perform our analysis in two steps. First, we use regular
`expression patterns to detect metadata in web pages.
`We use the same patterns proposed by Bizer et al., but we
`strengthen the pattern for detecting RDFa. In the form proposed
`by the authors, it matches any page that contains about
`followed by whitespace and an equal sign; we limit this pattern
`to require that the equal sign be followed by whitespace
`and a single or double quote. We also introduce a new pattern
`to specifically detect webpages using the Open Graph
`Protocol (OGP), identified by the word property followed by
`optional whitespace, single or double quote, optional whitespace
`and og:. For this analysis, we filter out pages larger
`than 3 MB and pages whose character set cannot be identified.
`The total number of URLs in the output is thus slightly
`lower than in the input.
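A minimal sketch of the two patterns in Python follows. The exact expressions are not reproduced in this paper, so the regular expressions below are reconstructions from the prose and should be read as assumptions. In particular, whitespace is treated as optional everywhere, a word boundary is required before about, and an equal sign is assumed between property and the quote. All names are hypothetical.

```python
import re

# Loose form: "about", whitespace, equal sign. Also fires on ordinary prose.
LOOSE_ABOUT = re.compile(r"\babout\s*=", re.IGNORECASE)

# Stricter form: the equal sign must be followed by a single or double quote.
STRICT_ABOUT = re.compile(r"""\babout\s*=\s*["']""", re.IGNORECASE)

# OGP: "property", an (assumed) equal sign, a quote, then the "og:" prefix.
OGP_PROPERTY = re.compile(r"""\bproperty\s*=\s*["']\s*og:""", re.IGNORECASE)


def detect_formats(html: str) -> set:
    """Return the labels of the formats whose pattern matches the page."""
    found = set()
    if STRICT_ABOUT.search(html):
        found.add("rdfa")
    if OGP_PROPERTY.search(html):
        found.add("ogp")
    return found
```

A page containing the text `what about = this` matches the loose pattern but not the strict one, which is the kind of false positive the stricter pattern is meant to avoid.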
`Table 1 shows the prevalence of each format both in terms
`of URLs that use that format, and in terms of pay-level
`domains (PLD), i.e. the domain registered directly under an
`effective top-level domain (eTLD). For computing PLDs, we used
`the Guava library version 11.0.2. For a small number of URLs
`we failed to determine the PLD, e.g. because they contain an
`IP address instead of a domain name, but we believe this does
`not influence the results significantly.
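To illustrate what computing a PLD involves, here is a small Python sketch. It is not the Guava implementation: a real implementation relies on the full Public Suffix List, whereas `TOY_SUFFIXES` below is a tiny hypothetical stand-in, and the function name is likewise an assumption.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical miniature suffix list; real code would load the Public Suffix List.
TOY_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(url: str) -> Optional[str]:
    """Return the public suffix plus one label, or None if undetermined."""
    host = urlsplit(url).hostname
    if not host:
        return None
    try:
        ipaddress.ip_address(host)
        return None  # IP addresses have no pay-level domain
    except ValueError:
        pass
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first, always leaving one label in front.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in TOY_SUFFIXES:
            return ".".join(labels[i - 1:])
    return None
```

For example, `www.example.co.uk` maps to `example.co.uk`, while a URL whose host is `192.168.0.1` yields `None`, which corresponds to the failure case mentioned above.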
`In a second step, we actually extract RDFa data from
`these pages using the Any23 library (version 0.7) as suggested
`by Bizer et al., and using the same set of extractor
`plugins. We use this library with the default configuration
`except for setting metadata nesting to off, because microformat
`extraction generates a substantial number of additional
`triples in the default setting. Before passing the content
`to Any23, we read the character set of the page from the
`content type and recode the page content to UTF-8 (we exclude
`pages where the character set cannot be identified).
`We also modify each input page that we expect to contain
`OGP markup to define the og prefix. Without this, much of
`the OGP data would not be extracted by Any23’s RDFa parser,
`and there is no specific extractor for OGP data. To
`speed up the process of extraction, we exclude some extreme
`cases: webpages larger than 3 MB, pages containing
`more than 200 VCard objects, and pages where the
`result of the extraction exceeds 64 MB. We write the data
`in a quintet format: subject, predicate, object, context and
`the name of the extractor that produced that quad.
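The two preprocessing steps (recoding to UTF-8 and defining the og prefix) could look roughly like the Python sketch below. How the prefix is injected is not specified in this paper; adding an `xmlns:og` attribute to the `html` element is an assumption, as are all function names. The namespace `http://ogp.me/ns#` is the current standard OGP namespace.

```python
import re
from typing import Optional

OG_NAMESPACE = "http://ogp.me/ns#"
CHARSET_RE = re.compile(r"charset\s*=\s*\"?([\w.:-]+)", re.IGNORECASE)
HTML_TAG_RE = re.compile(r"<html\b([^>]*)>", re.IGNORECASE)


def recode_to_utf8(raw: bytes, content_type: str) -> Optional[bytes]:
    """Recode page bytes to UTF-8; None if the charset cannot be identified."""
    match = CHARSET_RE.search(content_type)
    if match is None:
        return None
    try:
        return raw.decode(match.group(1)).encode("utf-8")
    except (LookupError, UnicodeDecodeError):
        return None  # unknown codec name or undecodable bytes


def ensure_og_prefix(html: str) -> str:
    """Declare the og prefix on the <html> element unless already present."""
    def add_prefix(match):
        attrs = match.group(1)
        if "xmlns:og" in attrs.lower():
            return match.group(0)
        return f'<html{attrs} xmlns:og="{OG_NAMESPACE}">'
    return HTML_TAG_RE.sub(add_prefix, html, count=1)
```

Returning `None` from `recode_to_utf8` gives the caller a simple way to exclude pages whose character set cannot be identified.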
`To read the data, we use the same NxParser library that
`we use to write the data. Unfortunately, there are invalid
`lines in the output that we are not able to read back (various
`exceptions are reported by NxParser). Further, some input
`lines cause the parser to enter an infinite loop. As a
`temporary measure until we find the source of these bugs,
`we run the parser in a separate thread and terminate this
`thread after 500 ms. We also limit the size of each input line
`to 5 KB and do not even attempt to parse lines longer than
`that. Due to these problems, we lose some data: the output
`contains 671,454,122 URLs compared to the 973,539,519 URLs
`that we would expect to contain some data based on the regular
`expressions, i.e. about 69%. In total, we extract 17,443,606,947
`triples. Tables 2, 3 and 4 show the top 10 sites as
`measured by the number of triples using RDFa, microdata,
`or hcard, respectively. The number of triples is an aggregate
`that reflects both the number of indexed pages in the
`crawl (a proxy for the importance of the domain) and the
`amount of data published per page. Again, we note that
`these lists are not exclusive. For example, youtube.com uses
`microformats, microdata and RDFa within the same pages.
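The guarded parsing described above (a size limit per line plus a time limit per parse) can be sketched in Python as follows. The `parse` argument is a hypothetical stand-in for the actual parser, and the 5 KB limit is interpreted here as 5 × 1024 bytes of UTF-8, which is an assumption. Note one difference from the text: Python cannot forcibly terminate a thread, so this sketch abandons a stuck daemon thread rather than killing it.

```python
import threading
from typing import Callable, List, Optional

MAX_LINE_BYTES = 5 * 1024  # do not attempt to parse longer lines
PARSE_TIMEOUT_S = 0.5      # give up on a line after 500 ms


def parse_with_guard(line: str,
                     parse: Callable[[str], List[str]]) -> Optional[List[str]]:
    """Parse one line; return None if it is too long, fails, or times out."""
    if len(line.encode("utf-8")) > MAX_LINE_BYTES:
        return None
    result = {}

    def worker():
        try:
            result["value"] = parse(line)
        except Exception:
            pass  # invalid line: leave result empty

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(PARSE_TIMEOUT_S)
    if thread.is_alive():
        return None  # timed out; the daemon thread is abandoned
    return result.get("value")
```

Because abandoned threads keep consuming CPU if the parser really loops forever, a production version would more likely isolate the parser in a separate process that can be killed.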
`Table 1: Results from pattern-based analysis (N_URL = 3,169,743,997, N_PLD = 32,339,522). Columns: Abs URL, Pct URL, Abs PLD, Pct PLD.
`Table 2: Top sites by number of triples, RDFa only
`Table 3: Top sites by number of triples, microdata
`Table 4: Top sites by number of triples, hcard only
`In terms of vocabulary usage, we show the most commonly
`used namespaces in RDFa data in Table 5. We also show the
`most frequently used classes in terms of the number of URLs
`and PLDs in Table 6 and Table 7, respectively. We omit
`the http protocol identifier, because all namespaces start
`with this protocol identifier, except for a Facebook namespace
`that appears with both http and https. The first table
`confirms that the vast majority of RDFa data on the Web is
`due to Facebook’s OGP markup. Unfortunately, OGP does
`not always conform to the letter and intent of RDFa. For
`example, type information in OGP is given using the og:type
`predicate, and not the RDF built-in rdf:type predicate. This
`explains the difference between Table 5 on the one hand and
`Table 6 and Table 7 on the other: most OGP data does not define
`instances of any RDF class. As already mentioned above, most
`users of OGP also omit the declaration of the og prefix (a problem
`we deal with in the extraction) and we can also see a number
`of variations on the current standard namespace (a problem we
`have not dealt with). Further, OGP assigns additional meaning
`to the RDFa syntax that is not reflected in the RDFa standard.
`As an example, the order in which triples are written
`on the page matters in OGP, but not in RDFa. For all these
`reasons, we believe that Any23 should be extended with a
`specific processor for OGP markup that is able to deal with
`these peculiarities.
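The effect of og:type on the class statistics can be illustrated with a small Python sketch. It assumes that the namespace of a predicate is everything up to its last `#` or `/`, and that class statistics count only rdf:type statements; neither rule is spelled out in this paper. The sample data and the extractor label are invented for illustration.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OG_TYPE = "http://ogp.me/ns#type"

# (subject, predicate, object, context, extractor); all values are made up.
SAMPLE = [
    ("http://example.org/film", OG_TYPE, "movie",
     "http://example.org/film", "rdfa"),
    ("http://example.org/film", "http://ogp.me/ns#title", "A Film",
     "http://example.org/film", "rdfa"),
    ("http://example.org/me#i", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person",
     "http://example.org/me", "rdfa"),
]


def namespace_of(iri: str) -> str:
    """Assumed rule: cut the IRI after its last '#' or '/'."""
    return iri[: max(iri.rfind("#"), iri.rfind("/")) + 1]


def namespace_usage_by_url(quints) -> Counter:
    """Number of URLs (contexts) on which each predicate namespace appears."""
    urls = {}
    for _s, predicate, _o, context, _x in quints:
        urls.setdefault(namespace_of(predicate), set()).add(context)
    return Counter({ns: len(u) for ns, u in urls.items()})


def class_usage_by_url(quints) -> Counter:
    """Number of URLs with at least one rdf:type instance of each class."""
    urls = {}
    for _s, predicate, obj, context, _x in quints:
        if predicate == RDF_TYPE:
            urls.setdefault(obj, set()).add(context)
    return Counter({cls: len(u) for cls, u in urls.items()})
```

On this sample, the OGP namespace appears in the namespace counts, but the only class counted is `foaf:Person`, because the og:type statement is invisible to a count based on rdf:type.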
`Besides OGP, a smaller amount of data can be attributed
`to Google’s Rich Snippets program and Yahoo’s
`retired SearchMonkey program. Social markup in the form
`of FOAF and SIOC is also present in a large number of domains,
`as shown in Table 7. The fact that these vocabularies
`do not show up as prominently in Table 6 suggests that they
`are used more in the less deeply crawled part of the Web.
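The difference between ranking by URLs and ranking by PLDs can be made concrete with a toy Python example. The observations below are synthetic and the class and domain names are invented; the sketch only shows how the two aggregations can disagree.

```python
from collections import defaultdict


def rank_classes(observations):
    """observations: iterable of (class_name, url, pld) tuples.

    Returns two rankings of class names: by distinct URLs and by distinct PLDs.
    """
    urls, plds = defaultdict(set), defaultdict(set)
    for cls, url, pld in observations:
        urls[cls].add(url)
        plds[cls].add(pld)
    by_url = sorted(urls, key=lambda c: len(urls[c]), reverse=True)
    by_pld = sorted(plds, key=lambda c: len(plds[c]), reverse=True)
    return by_url, by_pld


# Synthetic data: one deeply crawled shop versus three shallowly crawled blogs.
OBSERVATIONS = (
    [("ex:Product", f"http://bigshop.example/p/{i}", "bigshop.example")
     for i in range(5)]
    + [("foaf:Person", f"http://blog{i}.example/about", f"blog{i}.example")
       for i in range(3)]
)
```

Here `ex:Product` leads the URL ranking (five pages on one domain) while `foaf:Person` leads the PLD ranking (one page on each of three domains), mirroring the pattern described for the social vocabularies.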
`For microdata, we only list the top namespaces in Table 8
`and Table 9, because Any23’s microdata extractor incorporates
`the class name into the namespace. In microdata,
`only two vocabularies (schema.org and Google’s data-vocabulary.org)
`have gained significant traction so far, and
`the latter is expected to be replaced by the former.
`It holds for both RDFa and microdata that the types of
`objects that are marked up are biased by the use case of search
`engine optimization, i.e. site owners prefer to mark up data
`that is used by the search engines to enrich search result
`presentation (e.g. reviews, business listings). Schemas for
`these types of objects have also existed longer. We also ob-
`serve a natural preference to mark up simple types of objects
`(e.g. breadcrumbs), though we did not formally investigate
`the relationship between the complexity of markup and its adoption.
`We presented metadata statistics from the analysis of a
`large, recent sample of the Web, which has been extracted
`from the crawl of a search engine and therefore provides a
`search-engine centric view on the Web. Current web search
`engines are biased toward authoritative, head sites with valu-
`able textual content, and are not specifically looking for data
`on the Web. We expect that a search engine specifically built
`for data would give less weight to authority and textual con-
`tent and perform deeper crawling on sites that provide large
`and valuable data, by some measure of quantity and quality.
`Nonetheless, our work shows impressive progress in the
`adoption of markup on the Web, with over 30% of our collection
`containing some microformat, RDFa or microdata
`markup. Microformats and RDFa are the most popular
`choices of syntax. The level of microformat usage seems to
`be flat, while RDFa adoption has grown significantly compared
`to previous studies. This is due almost exclusively to
`OGP markup, though there is a variety of usage in the long
`tail, in particular of social vocabularies. On the other hand,
`the adoption of microdata is driven so far only by the success
`of schema.org.
`There is significant future work to be done in order to
`evaluate the quality and practical usefulness of data embed-
`ded in HTML, with respect to some existing or novel tasks.
`In previous work, we have looked at the extent to which em-
`bedded metadata could be used to enrich web search results
`[1], but data on the Web is likely to be useful in a much
`broader array of applications.
`[1] K. Haas, P. Mika, P. Tarjan, and R. Blanco. Enhanced
`results for web search. In W.-Y. Ma, J.-Y. Nie, R. A.
`Baeza-Yates, T.-S. Chua, and W. B. Croft, editors,
`SIGIR, pages 725–734. ACM, 2011.
`Table 5: Top namespaces in RDFa as measured by the number of URLs
`Table 6: Top classes in RDFa as measured by the number of URLs with at least one instance
`Table 7: Top classes in RDFa as measured by the number of PLDs with at least one instance
`Table 8: Top namespaces in microdata as measured by the number of URLs
`Table 9: Top namespaces in microdata as measured by the number of PLDs
