IPR2023-00458, No. 1021 Exhibit - Ex 1021 Mika (P.T.A.B. Jan. 31, 2023)

Metadata Statistics for a Large Web Corpus

Peter Mika
Yahoo! Research
Diagonal 177
Barcelona, Spain
pmika@yahoo-inc.com

Tim Potter
Yahoo! Research
Diagonal 177
Barcelona, Spain
tep@yahoo-inc.com

ABSTRACT
We provide an analysis of the adoption of metadata standards on the Web based on a large crawl of the Web. In particular, we look at which syntaxes and vocabularies publishers use to mark up data inside HTML pages. We also describe the process that we followed and the difficulties involved in web data extraction.

1. INTRODUCTION
Embedding metadata inside HTML pages is one of the ways to publish structured data on the Web, and it is often preferred by publishers and consumers over other methods of exposing structured data, such as publishing data feeds, SPARQL endpoints or RDF/XML documents. Publishers prefer this method due to the ease of implementation and maintenance: since most webpages are dynamically generated, adding markup simply requires extending the template that produces the pages. Consumers such as search engines are already accustomed to processing HTML, and extraction fits naturally in their processing pipelines. The close coupling of the raw data and the HTML presentation of the data has other advantages; among others, it ensures that the raw data and the end-user presentation show the same information.
In this paper, we describe the method by which we extracted metadata from a large web corpus and present some statistics. Results from similar experiments have already been published, so we also discuss the difficulty of comparing numbers across the various studies.

2. RELATED WORK
Previous studies have reported results on the usage of embedded metadata, including Bizer et al. at http://www.webdatacommons.org/. We also published an earlier analysis on a different corpus collected by Yahoo! Search (see footnote 1). There are a number of factors that complicate the comparison of results. First, different studies use different web corpora. Our earlier study used a corpus collected by Yahoo!'s web crawler, while the current study uses a dataset collected by the Bing crawler. Bizer et al. analyze the data collected by http://www.commoncrawl.org, which has the obvious advantage that it is publicly available. Second, the extraction methods may differ. For example, there is a multitude of microformats (one for each object type) and although most search engines and extraction libraries support the popular ones, different processors may recognize a different subset. Unlike the specifications of microdata and RDFa published by the W3C, the microformat specifications are also rather informal, and thus different processors may extract different information from the same page. Further, even if the same information is extracted, the conversion of this information to RDF may differ across implementations. Third, different extractors may be more or less lenient in accepting particular mistakes in the markup, leading to more or less information being extracted.

1 http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/

Copyright is held by the author/owner(s).
LDOW2012, April 16, 2012, Lyon, France.

3. ANALYSIS
We take as our starting point a sufficiently large sample of the web crawl produced by Bing's web crawler during January 2012. After retaining information resources with a content type that includes text/html, we get a data set of 3,230,928,609 records with only the three fields required for analysis: the URL of the page, the content type and the downloaded content. In case the crawler arrived at a page by following a redirect or a chain of redirects, we considered the target of the redirect as the URL.
We perform our analysis in two steps. First, we use regular expression patterns to detect metadata in web pages. We use the same patterns proposed by Bizer et al., but we strengthen the pattern for detecting RDFa. In the form proposed by the authors, it matches any page that contains about followed by whitespace and an equal sign; we limit this pattern to require that the equal sign be followed by whitespace and a single or double quote. We also introduce a new pattern to specifically detect webpages using the Open Graph Protocol (OGP), identified by the word property followed by optional whitespace, single or double quote, optional whitespace and og:. For this analysis, we filter out pages that are larger than 3 MB or whose character set cannot be identified. The total number of URLs in the output is thus slightly lower than in the input.
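
The two patterns described above can be sketched as follows. This is a minimal illustration rather than the expressions actually used in the study, which the text does not reproduce. It assumes that the whitespace around the equal sign is optional and that an equal sign sits between property and the quote; the names RDFA_ABOUT, OGP_PROPERTY and detect_formats are hypothetical.

```python
import re

# Hypothetical patterns reconstructed from the prose; the exact expressions
# used in the study are not given in the text.
# RDFa: "about", optional whitespace, "=", optional whitespace, then a quote.
RDFA_ABOUT = re.compile(r"""\babout\s*=\s*["']""", re.IGNORECASE)

# OGP: "property", optional whitespace, an assumed "=", optional whitespace,
# a quote, optional whitespace, then the "og:" prefix.
OGP_PROPERTY = re.compile(r"""\bproperty\s*=\s*["']\s*og:""", re.IGNORECASE)

MAX_PAGE_BYTES = 3 * 1024 * 1024  # pages larger than 3 MB are filtered out


def detect_formats(html: str) -> set:
    """Return the labels of the patterns that match the page."""
    if len(html.encode("utf-8")) > MAX_PAGE_BYTES:
        return set()
    found = set()
    if RDFA_ABOUT.search(html):
        found.add("RDFa")
    if OGP_PROPERTY.search(html):
        found.add("OGP")
    return found
```

Requiring the quote after the equal sign is what keeps running text such as "what about = nothing" from being counted as RDFa.
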

Page 1 of 6

Netskope Exhibit 1021

Table 1 shows the prevalence of each format both in terms of URLs that use that format, and in terms of effective top-level domains (eTLD), sometimes called pay-level domains (PLD; see footnote 2). For computing PLDs, we used the Guava library version 11.0.2. For a small number of URLs we failed to determine the PLD, e.g. because they contain an IP address instead of a domain name, but we believe this does not influence the results significantly.
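
As a rough sketch of how URL and PLD counts per format could be aggregated, the following uses a toy excerpt of the Public Suffix List in place of the Guava library that the study relied on. PUBLIC_SUFFIXES, pay_level_domain and count_by_format are illustrative names and not part of any library.

```python
import ipaddress
from collections import defaultdict
from urllib.parse import urlsplit

# Toy excerpt of the Public Suffix List, for illustration only; a real run
# would load the full list.
PUBLIC_SUFFIXES = {"com", "org", "net", "uk", "co.uk", "jp", "ac.jp", "de"}


def pay_level_domain(url):
    """Return the pay-level domain of a URL, or None if it cannot be found."""
    host = urlsplit(url).hostname
    if not host:
        return None
    try:
        ipaddress.ip_address(host)
        return None  # an IP address has no PLD
    except ValueError:
        pass
    labels = host.lower().rstrip(".").split(".")
    # Find the longest known public suffix, then keep one more label.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def count_by_format(detections):
    """detections: iterable of (url, set of format labels).
    Returns {format: (number of URLs, number of distinct PLDs)}."""
    urls = defaultdict(int)
    plds = defaultdict(set)
    for url, formats in detections:
        pld = pay_level_domain(url)
        for fmt in formats:
            urls[fmt] += 1
            if pld is not None:
                plds[fmt].add(pld)
    return {fmt: (urls[fmt], len(plds[fmt])) for fmt in urls}
```

URLs whose PLD cannot be determined still count toward the URL total but not toward the PLD total, which is one plausible reading of the treatment described above.
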
In a second step, we actually extract RDFa data from these pages using the Any23 library (version 0.7) as suggested by Bizer et al., and using the same set of extractor plugins. We use this library with the default configuration except for setting metadata nesting (see footnote 3) to off, because microformat extraction generates a substantial number of additional triples in the default setting. Before passing the content to Any23, we read the character set of the page from the content type and recode the page content to UTF-8 (we exclude pages where the character set cannot be identified). We also modify each input page that we expect to contain OGP markup to define the og prefix. Without this, much of the OGP data would not be extracted by Any23's RDFa parser, and there is also no specific extractor for OGP data. To speed up the process of extraction, we exclude some extreme cases: webpages larger than 3 MB, pages containing more than 200 VCard objects, and also pages where the result of the extraction exceeds 64 MB. We write the data in a quintet format: subject, predicate, object, context and the name of the extractor that produced that quad.
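
The preprocessing applied before extraction might look roughly like the sketch below. It is not the authors' code. The function name prepare_page, the regular expressions, and the choice to declare the prefix through an xmlns:og attribute on the html element are all assumptions made for illustration; the text only states that the og prefix is defined.

```python
import codecs
import re

MAX_PAGE_BYTES = 3 * 1024 * 1024      # skip pages larger than 3 MB
OG_NAMESPACE = "http://ogp.me/ns#"    # current OGP namespace
CHARSET_RE = re.compile(r"""charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)
OGP_RE = re.compile(r"""property\s*=\s*["']\s*og:""", re.IGNORECASE)
HTML_TAG_RE = re.compile(r"<html\b", re.IGNORECASE)


def prepare_page(raw: bytes, content_type: str):
    """Return the page recoded to UTF-8 bytes, or None if it is skipped."""
    if len(raw) > MAX_PAGE_BYTES:
        return None
    match = CHARSET_RE.search(content_type)
    if match is None:
        return None                    # character set not identified
    try:
        codec = codecs.lookup(match.group(1))
        text = raw.decode(codec.name)
    except (LookupError, UnicodeDecodeError):
        return None
    # Hypothetical fix-up: declare the og prefix on the <html> element when
    # the page appears to use OGP but never declares the prefix.
    if OGP_RE.search(text) and "xmlns:og" not in text:
        text = HTML_TAG_RE.sub(f'<html xmlns:og="{OG_NAMESPACE}"', text, count=1)
    return text.encode("utf-8")
```

The limits on VCard objects and on the size of the extraction result are not shown, since they depend on the output of the extractor itself.
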
To read the data, we use the same NxParser library that we use to write the data. Unfortunately, there are invalid lines in the output that we are not able to read back (NxParser reports various exceptions). Further, some input lines cause the parser to enter an infinite loop. As a temporary measure until we find the source of these bugs, we run the parser in a separate thread and terminate this thread after 500 ms. We also limit the size of each input line to 5 KB and do not even attempt to parse lines longer than that. Due to these problems, we lose some data: the output contains 671,454,122 URLs compared to the 973,539,519 URLs that we would expect to contain some data based on the regular expressions. In total, we extract 17,443,606,947 triples. Tables 2, 3 and 4 show the top sites as measured by the number of triples using RDFa, microdata, or hcard, respectively. The number of triples is an aggregate that reflects both the number of indexed pages in the crawl (a proxy for the importance of the domain) and the amount of data published per page. Again, we note that these lists are not exclusive. For example, youtube.com uses microformats, microdata and RDFa within the same pages.
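
The guard around the parser can be approximated as below. This is a sketch under stated assumptions: parse_fn stands in for the NxParser call, which is a Java library and is not shown, and parse_guarded is a hypothetical helper. Python cannot forcibly terminate a thread, so the worker is merely abandoned after the timeout, whereas the study terminates it.

```python
import threading

MAX_LINE_BYTES = 5 * 1024   # lines longer than 5 KB are not parsed
TIMEOUT_SECONDS = 0.5       # give up on a line after 500 ms


def parse_guarded(line, parse_fn, timeout=TIMEOUT_SECONDS):
    """Run parse_fn(line) in a worker thread. Return its result, or None if
    the line is too long, parsing raises, or the timeout expires."""
    if len(line.encode("utf-8")) > MAX_LINE_BYTES:
        return None
    result = {}

    def worker():
        try:
            result["value"] = parse_fn(line)
        except Exception:
            pass  # invalid line: counted as lost

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The daemon worker keeps running in the background; it is ignored
        # here rather than terminated.
        return None
    return result.get("value")
```

Every line that yields None in such a scheme is dropped, which is the kind of loss reflected in the gap between expected and actual URL counts reported above.
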

2 http://en.wikipedia.org/wiki/Public_Suffix_List
3 any23.extraction.metadata.nesting

Format             Abs URL   Pct URL      Abs PLD   Pct PLD
RDFa           795,081,604    25.08%    1,306,827     4.04%
OGP            711,747,491    22.45%    1,140,880     3.53%
microdata      226,913,004     7.16%       93,463     0.29%
microformat    272,470,501     8.60%    1,755,733     5.43%
XFN             35,344,618     4.27%    1,700,377     5.26%
no data      2,196,204,478    69.29%   30,809,476    95.27%

Table 1: Results from pattern-based analysis (N_URL = 3,169,743,997, N_PLD = 32,339,522)

Site                 Triple count
facebook.com        1,739,664,342
tabelog.com           662,028,717
venere.com            366,531,732
yahoo.com             223,125,828
tripadvisor.co.uk     195,314,434
tripadvisor.it        183,603,052
tripadvisor.com       179,970,956
tripadvisor.fr        134,442,146
tripadvisor.jp        125,976,435
tripadvisor.es        124,845,123
tripadvisor.de         96,635,499
answers.com            86,721,016
myspace.com            79,984,056
tripadvisor.in         69,763,161
daodao.com             66,014,882
tripadvisor.com.tw     63,430,680
tripadvisor.ru         41,199,304
imdb.com               40,537,631
youtube.com            39,942,197
bestbuy.com            35,910,433

Table 2: Top sites by number of triples, RDFa only

Site                 Triple count
myspace.com           133,287,800
yelp.com               94,149,823
bbb.org                85,225,323
imdb.com               37,925,513
thefreelibrary.com     37,208,120
powells.com            31,056,409
youtube.com            26,299,315
homefinder.com         25,118,391
reverbnation.com       20,331,369
kino-teatr.ru          15,550,954
eventful.com           15,078,003
cylex.de               14,288,282
goodreads.com          12,484,280
bandcamp.com           11,372,475
bizrate.com            10,716,450
businesswire.com        9,488,095
wat.tv                  9,280,173
avvo.com                9,113,367
barnesandnoble.com      8,444,559
patch.com               8,157,515

Table 3: Top sites by number of triples, microdata only

Site                 Triple count
yahoo.com             572,687,378
twitter.com           534,336,425
linkedin.com          252,481,792
yellowpages.com        97,624,187
tvtrip.com             53,746,582
youtube.com            43,330,641
myspace.com            41,110,226
nii.ac.jp              40,752,988
nj.com                 38,202,997
patch.com              38,003,049
chow.com               37,705,040
minecraftforum.net     35,891,626
oregonlive.com         33,159,011
everycarlisted.com     32,750,040
nydailynews.com        32,211,122
last.fm                30,302,919
citysearch.com         28,444,466
washingtonpost.com     27,926,328
nieuwsblad.be          27,497,607
cleveland.com          26,998,847

Table 4: Top sites by number of triples, hcard only

In terms of vocabulary usage, we show the most commonly used namespaces in RDFa data in Table 5. We also show the most frequently used classes in terms of the number of URLs and PLDs in Table 6 and Table 7, respectively. We omit the http protocol identifier, because all namespaces start with this protocol identifier, except for a Facebook namespace that appears with both http and https. The first table confirms that the vast majority of RDFa data on the Web is due to Facebook's OGP markup. Unfortunately, OGP does not always conform to the letter and intent of RDFa. For example, type information in OGP is given using the og:type predicate, and not the RDF built-in rdf:type predicate. This explains the difference between Table 5 on the one hand and Table 6 and Table 7 on the other: most OGP data does not define instances of any RDF class. As already mentioned above, most users of OGP also omit the declaration of the og prefix (a problem we deal with in the extraction), and we can also see a number of variations on the current standard namespace (a problem we have not dealt with). Further, OGP assigns additional meaning to the RDFa syntax that is not reflected in the RDFa standard. As an example, the order in which triples are written on the page matters in OGP, but not in RDFa. For all these reasons, we believe that Any23 should be extended with a specific processor for OGP markup that is able to deal with these peculiarities.
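
To illustrate why OGP data is dominant by namespace yet absent from the class statistics, the sketch below counts URLs per class using rdf:type statements only. It assumes that the context field of each quintet holds the URL of the page; the function name urls_per_class and the sample values used with it are illustrative.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OG_TYPE = "http://ogp.me/ns#type"


def urls_per_class(quints):
    """quints: iterable of (subject, predicate, obj, context, extractor),
    where context is assumed to be the URL of the page. Returns
    {class: number of URLs with at least one instance}, counting rdf:type
    statements only."""
    pages = defaultdict(set)
    for _subject, predicate, obj, context, _extractor in quints:
        if predicate == RDF_TYPE:
            pages[obj].add(context)
    return {cls: len(urls) for cls, urls in pages.items()}
```

A statement whose predicate is og:type never reaches the counter, so a page that carries only OGP markup contributes to the namespace statistics but to none of the class statistics.
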
Besides OGP, a smaller amount of data can be attributed to efforts by Google's Rich Snippet program and Yahoo's retired SearchMonkey program. Social markup in the form of FOAF and SIOC is also present in a large number of domains, as shown in Table 7. The fact that these vocabularies do not show up as prominently in Table 6 means that they are used more in the less deeply crawled part of the web.
For microdata, we only list the top namespaces in Table 8 and Table 9, because Any23's microdata extractor incorporates the class name into the namespace. In microdata, only two vocabularies (schema.org and Google's data-vocabulary.org) have gained significant traction so far, and the latter is expected to be replaced by the former.
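
One way to see how a class name ends up inside the namespace: if the extractor emits property URIs of the form http://schema.org/Product/name (an assumption suggested by the entries of Table 8, not confirmed by the text), then splitting each URI at its last "/" or "#" yields a namespace that still contains the class. The helper names below are hypothetical, and the splitting rule is a common heuristic rather than the documented procedure of the study.

```python
from collections import defaultdict


def namespace_of(uri: str) -> str:
    """Return the part of the URI up to and including its last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1] if cut >= 0 else uri


def urls_per_namespace(pairs):
    """pairs: iterable of (page_url, predicate_uri).
    Returns {namespace: number of distinct pages using it}."""
    pages = defaultdict(set)
    for page_url, predicate in pairs:
        pages[namespace_of(predicate)].add(page_url)
    return {ns: len(urls) for ns, urls in pages.items()}
```

Under this rule a property such as http://ogp.me/ns#title maps to the namespace http://ogp.me/ns#, while http://schema.org/Product/name maps to http://schema.org/Product/.
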
It holds for both RDFa and microdata that the types of objects that are marked up are biased by the use case of search engine optimization, i.e. site owners prefer to mark up data that is used by the search engines to enrich search result presentation (e.g. reviews, business listings). Schemas for these types of objects have also existed longer. We also observe a natural preference to mark up simple types of objects (e.g. breadcrumbs), though we did not formally investigate the relationship between the complexity of markup and its adoption.

4. CONCLUSIONS
We presented metadata statistics from the analysis of a large, recent sample of the Web, which has been extracted from the crawl of a search engine and therefore provides a search-engine-centric view of the Web. Current web search engines are biased toward authoritative, head sites with valuable textual content, and are not specifically looking for data on the Web. We expect that a search engine specifically built for data would give less weight to authority and textual content and perform deeper crawling on sites that provide large and valuable data, by some measure of quantity and quality.
Nonetheless, our work shows impressive progress in the adoption of markup on the Web, with over 30% of our collection containing some microformat, RDFa or microdata markup. Microformats and RDFa are the most popular choices of syntax. The level of microformat usage seems to be flat, while RDFa adoption has grown significantly compared to previous studies. This is due almost exclusively to OGP markup, though there is a variety of usage in the long tail, in particular of social vocabularies. On the other hand, the adoption of microdata is driven so far only by the success of schema.org.
There is significant future work to be done in order to evaluate the quality and practical usefulness of data embedded in HTML, with respect to some existing or novel tasks. In previous work, we have looked at the extent to which embedded metadata could be used to enrich web search results [1], but data on the Web is likely to be useful in a much broader array of applications.

5. REFERENCES
[1] K. Haas, P. Mika, P. Tarjan, and R. Blanco. Enhanced results for web search. In W.-Y. Ma, J.-Y. Nie, R. A. Baeza-Yates, T.-S. Chua, and W. B. Croft, editors, SIGIR, pages 725–734. ACM, 2011.

Namespace                                              URLs
ogp.me/ns#                                      493,443,016
www.facebook.com/2008/                          150,246,016
www.w3.org/1999/02/22-rdf-syntax-ns#             26,402,165
rdf.data-vocabulary.org/#                        19,413,470
purl.org/dc/terms/                               16,424,800
https://www.facebook.com/2008/                    7,472,815
mixi-platform.com/ns#                             6,323,861
ogp.me/ns/fb#                                     4,636,260
creativecommons.org/ns#                           4,622,272
www.w3.org/2006/vcard/ns#                         4,205,037
http://                                           3,881,321
http://www.facebook.com/                          3,126,045
http://www.w3.org/2000/01/rdf-schema#             3,042,839
http://developers.facebook.com/schema/            2,720,567
http://search.yahoo.com/searchmonkey/commerce/    2,664,743
http://purl.org/dc/elements/1.1/                  2,642,796
http://opengraphprotocol.org/schema/              2,293,024
http://search.yahoo.com/searchmonkey/media/       2,095,577
http://oexchange.org/spec/0.8/rel/                2,034,467
http://xmlns.com/foaf/0.1/                        1,837,749

Table 5: Top namespaces in RDFa as measured by the number of URLs

Class                                                          URLs
rdf.data-vocabulary.org/#Breadcrumb                      11,336,922
rdf.data-vocabulary.org/#Review-aggregate                 5,571,178
rdf.data-vocabulary.org/#Organization                     3,678,229
www.w3.org/2006/vcard/ns#VCard                            2,858,916
search.yahoo.com/searchmonkey/commerce/Business           2,727,213
rdf.data-vocabulary.org/#Review                           1,980,811
rdf.data-vocabulary.org/#Rating                           1,714,996
rdf.data-vocabulary.org/#review-aggregate                 1,453,439
xmlns.com/foaf/0.1/Image                                  1,446,290
search.yahoo.com/searchmonkey/product/Product             1,202,002
http://rdf.data-vocabulary.org/#Address                   1,087,380
http://www.purl.org/stuff/rev#Review                        746,858
http://rdf.data-vocabulary.org/#Product                     673,079
http://purl.org/goodrelations/v1#UnitPriceSpecification     648,598
http://purl.org/goodrelations/v1#Offering                   599,703
http://xmlns.com/foaf/0.1/Agent                             517,089
http://xmlns.com/foaf/0.1/Document                          441,694
http://www.w3.org/2004/02/skos/core#Concept                 406,776
http://xmlns.com/foaf/0.1/Group                             369,176
http://rdfs.org/sioc/ns#Item                                363,308

Table 6: Top classes in RDFa as measured by the number of URLs with at least one instance

Class                                                      PLDs
xmlns.com/foaf/0.1/Image                                 30,903
xmlns.com/foaf/0.1/Document                              25,090
rdfs.org/sioc/ns#Item                                    19,583
rdfs.org/sioc/ns#UserAccount                             15,058
www.w3.org/2004/02/skos/core#Concept                      9,757
rdf.data-vocabulary.org/#Breadcrumb                       5,427
rdfs.org/sioc/ns#Post                                     5,342
rdf.data-vocabulary.org/#Review-aggregate                 3,307
rdfs.org/sioc/types#BlogPost                              2,970
rdfs.org/sioc/types#Comment                               2,695
http://rdf.data-vocabulary.org/#Rating                    2,114
http://rdf.data-vocabulary.org/#Organization              1,759
http://www.w3.org/2006/vcard/ns#Address                   1,655
http://purl.org/goodrelations/v1#BusinessEntity           1,608
http://purl.org/goodrelations/v1#UnitPriceSpecification   1,385
http://rdf.data-vocabulary.org/#Review                    1,294
http://rdf.data-vocabulary.org/#Product                   1,246
http://purl.org/goodrelations/v1#QuantitativeValue        1,051
http://rdf.data-vocabulary.org/#Address                     932
http://purl.org/goodrelations/v1#Offering                   787

Table 7: Top classes in RDFa as measured by the number of PLDs with at least one instance

Namespace                                           URLs
www.w3.org/1999/xhtml/microdata#              67,087,467
www.w3.org/1999/02/22-rdf-syntax-ns#          66,745,726
purl.org/dc/terms/                            46,675,266
data-vocabulary.org/Breadcrumb/               19,368,347
schema.org/MusicGroup/                         6,699,903
schema.org/MusicRecording/                     6,591,236
schema.org/Person/                             4,650,659
schema.org/Product/                            3,667,023
schema.org/VideoObject/                        3,228,156
http://schema.org/Article/                     3,052,457
http://schema.org/WebPage/                     2,928,410
http://data-vocabulary.org/Product/            2,742,977
http://schema.org/PostalAddress/               2,736,213
http://schema.org/Offer/                       2,553,617
http://data-vocabulary.org/Review-aggregate/   2,152,533
http://schema.org/AggregateRating/             2,048,232
http://schema.org/LocalBusiness/               2,043,005
http://schema.org/Organization/                1,640,501
http://data-vocabulary.org/Offer/              1,628,027
http://schema.org/Review/                      1,281,548

Table 8: Top namespaces in microdata as measured by the number of URLs

Namespace                                      PLDs
data-vocabulary.org/Breadcrumb               14,623
schema.org/PostalAddress                     11,476
schema.org/LocalBusiness                      8,820
schema.org/Product                            6,817
data-vocabulary.org/Organization              3,765
schema.org/Offer                              3,654
schema.org/Organization                       3,614
data-vocabulary.org/Address                   3,529
schema.org/Article                            3,283
schema.org/MusicGroup                         3,253
http://schema.org/MusicAlbum                  2,974
http://www.schema.org/MusicRecording          2,941
http://schema.org/Person                      2,676
http://data-vocabulary.org/Product            2,596
http://data-vocabulary.org/Review-aggregate   2,450
http://schema.org/AggregateRating             2,380
http://schema.org/WebPage                     2,132
http://data-vocabulary.org/Rating             1,947
http://schema.org/GeoCoordinates              1,651
http://schema.org/Place                       1,634

Table 9: Top namespaces in microdata as measured by the number of PLDs
