`0:10 MALE SPEAKER: Good afternoon, my dear
`0:12 colleagues, dear friends.
`0:14 It's my real privilege and pleasure to welcome here in
`0:18 the Czech Technical University Mr. Douglas Merrill, who is
`0:24 currently a vice president of Google.
`0:27 You may see that Google is just rolling over the Czech
`0:31 Technical University, because we have got an excellent
`0:34 opportunity in April when Vinton Cerf was here and he
`0:39 was speaking about the internet on Mars, while
`0:43 Douglas is going to contribute on the
`0:46 search side of the Universe.
`0:49 So it means his lecture is about the search
`0:53 possibilities, and I think he is going to be quite excellent
`0:57 in this field, because he's not far away from your age who
`1:01 are sitting over here, and he's of your mentality.
`1:06 And I think the main point is, please, this lecture will be
`1:11 around 45 minutes.
`1:13 Then, of course, it is expected at least hundreds of
`1:17 questions rising.
`1:19 Please write them down on the paper to smooth the process,
`1:23 and send them to the girls who will be going down and up.
`1:27 So please do it in this way.
`1:30 And as well, if there is some presence paper, please sign
`1:36 it, because it is fine to know who is really interested in
`1:39 such a field.
`1:40 And in fact, I'm not really to be upon your time, Douglas.
`1:44 It's your floor, and possibly even your microphone.
`1:48 You have got it.
`1:50 So the floor is yours.
`1:52 DOUGLAS MERRILL: Thank you very much.
`1:53
`2:02 Hi, thanks for coming.
`2:03 It's a great honor to get to come to talk to a university
`2:10 that's 300 years old about a little tiny company that was
`2:13 founded eight years ago, nine years ago by two crazy
`2:18 graduate students.
`2:19 And Stanford University, where Larry and Sergey were
`2:25 students, has a bunch of classrooms that
`2:27 look just like this.
`2:29 And so I guess my deepest hope is that the next Larry and
`2:33 Sergey are sitting in the audience right now, and will
`2:35 be inspired by something stupid that I say during this
`2:38 lecture to go out and prove me wrong.
`2:40 So that's my challenge to all of you.
`2:42 Find what I say that's wrong and fix it.
`
`EXHIBIT 2078
`Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00480
`
`
`
`2:45 Thank you so much.
`2:46 My name is Douglas Merrill.
`2:47 I'm a Vice President of Engineering at Google.
`2:49 So just for those of you who are in the front, I recommend
`2:51 that you loosen up your neck a little bit, just relax.
`2:54 I pace a lot.
`2:56 And so down here, you guys, you're going to
`2:58 get a little seasick.
`2:59 It's OK.
`3:00 If you feel seasick, close your eyes and
`3:02 breathe, it's OK.
`3:03 Up there, you guys are going to forget who I look like.
`3:04 It's all fine.
`3:05 You cannot see me up there anyway, so it's irrelevant.
`3:09 This is Alex.
`3:10 Alex is right now having a nightmare that he has a test.
`3:14 Do you guys all have that nightmare where you're in the
`3:16 front of class, and you have a test,
`3:17 and you haven't prepared?
`3:18 Alex is unprepared.
`3:19 Next slide.
`3:20
`3:24 Well done, sir.
`3:26 So Larry and Sergey met at Stanford in the computer
`3:29 science school.
`3:31 They were both students in an information theory class in
`3:36 about 1998.
`3:37 And they didn't like each other.
`3:40 Larry thought that Sergey was argumentative, and Sergey
`3:45 thought that Larry was arrogant.
`3:47 They were probably both right.
`3:50 However in their class project, they came up with the
`3:55 idea to try and apply some basic principles of
`4:00 information theory to unstructured web search.
`4:03 Now, it's 1998.
`4:04 Keep in mind at the time, web search is a solved problem.
`4:09 Everybody knows how to do search.
`4:12 There's no questions left to be worked on.
`4:15 So these guys said, oh, but wait, there is.
`4:18 And they're really kind of interesting questions.
`4:20 And they set themselves a goal to organize all the world's
`4:23 information and make it universally
`4:24 accessible and useful.
`4:27 That's a kind of a small goal.
`4:29 These guys didn't shoot high.
`4:30 All the world's information, universally
`
`
`
`4:32 accessible and useful.
`4:34 What I want to talk about today is a little bit of what
`4:36 the web looked like in 1998, before most of you were born,
`4:43 and what it looks like today, and what we think it's going
`4:48 to look like in the next 10 years.
`4:49 Next slide.
`4:50
`4:53 In 1998, there were a couple of dominant search engines.
`4:58 Neither one of them exists anymore, don't worry about
`5:00 their names.
`5:03 And they knew how to do search.
`5:05 Here's what they did.
`5:07 They had these people who sat in these big rooms, kind of
`5:09 like this, with computers in front of them, kind of like
`5:11 all of you.
`5:13 And they were surfing the web-- kind of like all of you.
`5:16
`5:19 And what they would do is they would find a page and they
`5:21 would read the page.
`5:22 And they would say, oh, you know what?
`5:23 This particular page is about soccer.
`5:26 I'm American, I know you guys call the game something else.
`5:29 Sorry.
`5:31 And they would have this little toolbar that they would
`5:33 pull down and they label "soccer." And then that page
`5:35 would be indexed.
`5:37 And they knew that this was going to work.
`5:41 They were wrong.
`5:44 They were wrong because the world changes too fast. On
`5:50 average, 10% of the web changes every month.
`5:56 Here's the interactive portion, boys and girls.
`5:58 If 10% changes every month, 10 times 12,
`6:02 carry the one, right.
`6:04 Likely everything changes every year, which means that
`6:07 these poor horrible people with this awful job of surfing
`6:11 the web and marking down each page have to look at every
`6:13 page every year.
`6:15 Additionally, the web is doubling at this point in
`6:17 history about every four or five months.
`6:21 So twice a year or so you've doubled the size, everything
`6:25 that's already existed has changed at least once, and
`6:29 keep in mind that it turns out that the web's
`6:32 not entirely in English.
`6:35 Who knew?
`6:36
`6:38 So now you have to have rooms full of people who speak all
`
`
`
`6:41 these languages.
`6:42 Not a scalable model.
`6:44 Next slide, please.
`6:47 Really not scalable today.
`6:49
`6:52 So this slide is more shocking to Americans, because it turns
`6:55 out that Americans think that no one else
`6:56 in the world exists.
`6:59 You guys have all heard the joke, right?
`7:00 If you speak three languages, you're trilingual, if you
`7:02 speak two, you're bilingual, if you
`7:03 speak one, you're American.
`7:05 [LAUGHTER]
`7:13 It turns out most of us aren't American.
`7:17 So the approach used by the search engines in 1998 would
`7:20 not have gotten us today.
`7:21 They would not have gotten us here.
`7:23 What got us here was an insight that Larry--
`7:27 mostly Larry had, but Larry and Sergey had together--
`7:29 called Page Rank.
`7:32 So what does Page Rank do?
`7:34 Page Rank allows you to figure out whether a particular web
`7:39 page is interesting or not.
`7:40 That makes sense.
`7:41 Is this particular page useful?
`7:43 So it's called Page Rank.
`7:45 Obviously it's named because you're ranking web pages.
`7:50 No, Larry named it after himself.
`7:52 Larry's last name is Page.
`7:54 What is today's lesson, boys and girls?
`7:56 Computer scientists are not funny.
`7:59 Next slide, please.
`8:01
`8:09 In a second, I'm going to talk about how the
`8:11 stuff actually works.
`8:13 I'm hoping that's more interesting to you.
`8:14 But first, I want to pull back a little bit, and I want to
`8:17 talk about the context.
`8:19 So I mentioned that web search is about more than web pages
`8:23 in English.
`8:26 It matters a lot to actually understand the context within
`8:31 which you are working.
`8:33 So for example, if you do a search for BMW on google.cz,
`8:39 you ought to get different results than if you do a
`8:42 search for BMW on google.com.
`8:45 And indeed you will.
`
`
`
`8:46 We'll recognize that probably you want to go to
`8:48 the .cz site instead.
`8:50 Part of our ranking signals are more
`8:51 than just Page Rank.
`8:52 It also is about the context from which you come.
`8:54 Next slide, please.
`8:56
`8:58 Google publishes--
`8:59 [LAUGHTER]
`9:01 I'll give you a second to enjoy the list. The previous
`9:13 slide was called "Being Local Matters." It turns out it only
`9:19 matters in certain regards.
`9:21 So we publish a list called the Zeitgeist. The Zeitgeist
`9:24 captures the most actively growing and most popular
`9:28 queries, and we do it by country and by language and a
`9:31 bunch of things.
`9:32 And there's a couple of truths.
`9:34 Apparently, they are universal.
`9:36 The most popular search in every country
`9:38 is a beautiful woman.
`9:39
`9:41 And apparently game shows and television are also pretty
`9:46 popular to everyone.
`9:47 Prison Break, for those of you who don't know, is a really
`9:49 bad American television show.
`9:50 So fundamentally, it matters, if you're going to do search
`9:54 right, you need to understand that the web is growing too
`9:58 fast, it's changing too fast, and it's not all in English.
`10:04 So the lesson that I want to talk about is,
`10:06 how do we do that?
`10:07 And I'm hoping, again, to reiterate that you guys-- one
`10:11 of you, or two of you, or 10 of you-- are going to hear
`10:13 something that I get wrong.
`10:15 And you're going to say, hey, I have a better
`10:16 idea and go try it.
`10:19 OK, how does it work?
`10:22 All right, let's build a search engine.
`10:24 This is the first of the interactive portions of
`10:26 today's talk, boys and girls.
`10:28 How many of you have had to build a search engine in a
`10:33 computer science class?
`10:34
`10:37 How many of them were any good?
`10:41 Oh, good.
`10:41 OK, so back in the day, when the web was first created, the
`10:51 terms were all coined by Tim Berners-Lee.
`
`
`
`10:53 And he talked about the fact that these pages were all
`10:56 inter-linked like a world wide web.
`11:00 What goes on webs?
`11:03 Spiders.
`11:04 What do spiders do?
`11:05 They crawl.
`11:06 Hence the term of art for finding information in web
`11:09 search is called crawling.
`11:12 This would be my second instantiation of how computer
`11:14 scientists are not funny.
`11:16 That is supposed to be a joke.
`11:18 So we're going to start out by crawling information.
`11:21 How does a crawler work?
`11:23 Simple kinds of crawlers start from a known web page like
`11:27 aol.com or pick your favorite portal.
`11:32 And they go through each link, and they essentially click on
`11:36 each link, and that expands to more web pages.
`11:38 Each of those pages has links.
`11:40 You click on each link from there, and you keep doing
`11:42 depth-first work recursion until you run out of time,
`11:44 space, or the Universe ends.
`11:48 Crawling sounds easy, right?
`11:50 That's probably, what, 10 lines of Python.
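The crawl loop just described fits in about that many lines. The following is a minimal sketch, not any production crawler: `LINKS` is an invented in-memory stand-in for the web (no network access), and `crawl` and `max_pages` are hypothetical names. It follows links depth-first with an explicit stack and keeps a set of visited pages so that link cycles terminate.

```python
# Toy link graph standing in for the web: page -> pages it links to.
# All URLs here are invented for illustration.
LINKS = {
    "portal.example": ["news.example", "sports.example"],
    "news.example": ["sports.example", "portal.example"],
    "sports.example": ["scores.example"],
    "scores.example": [],
}

def crawl(start, links, max_pages=1000):
    """Depth-first crawl from `start`, returning pages in visit order."""
    seen, order, stack = set(), [], [start]
    while stack and len(order) < max_pages:
        page = stack.pop()
        if page in seen:
            continue  # already visited; skip to avoid looping forever
        seen.add(page)
        order.append(page)
        # Push links in reverse so the first link on the page is followed first.
        stack.extend(reversed(links.get(page, [])))
    return order

print(crawl("portal.example", LINKS))
```

The `max_pages` cap plays the role of "until you run out of time or space"; a real crawler would also need fetching, politeness rules, and the recrawl and duplicate handling discussed next.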
`11:54 What's hard?
`11:57 Remember, everything changes.
`11:58 So you've got to recrawl a lot.
`12:00 How often do you have to recrawl?
`12:03 If 10% of it changes every month, you have to recrawl the
`12:05 natural log of 10 times the number of months since the
`12:07 last time you completed a crawl.
`12:09 That's a very big number.
`12:10 You've got to crawl a lot, is the answer.
`12:14 Second thing that's hard.
`12:16 How do you know if you've already seen a page?
`12:20 Oh, that's easy, right?
`12:22 Take a hash of the URL.
`12:24 That would work, wouldn't it?
`12:26 What happens if they change the title of the page?
`12:28
`12:31 What happens if it's a copy of the page?
`12:34 Oh right, the hash of the URL won't work.
`12:36 OK, still no problem.
`12:38 I'll take a hash of all the content of the page.
`12:40 That will work, right?
`12:41 Won't it work?
`12:42 What happens if they've got a space?
`
`
`
`12:45 What happens if they misspelled a
`12:46 word in their copy?
`12:47 What happens if they inserted a picture in different spots?
`12:50 Naive crawlers find that roughly 25 to 40% of their
`12:54 content is content they have already seen before.
`12:56 Which means that on average, you're
`12:57 wasting one byte in four.
`12:59 If you're crawling to the end of the world, you want those
`13:02 bytes back.
`13:03 Crawls are hard.
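To make the duplicate problem concrete, here is a small sketch contrasting an exact content hash with a near-duplicate measure. The near-duplicate part uses word shingles and Jaccard similarity, a commonly taught technique chosen here for illustration; the talk does not say which method is actually used. The function names and sample sentences are assumptions.

```python
import hashlib

def exact_fingerprint(text):
    """Hash of the raw content: any single-character change alters it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of k-word windows over the lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of the two shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

original = "the quick brown fox jumps over the lazy dog near the river bank"
extra_space = "the quick brown fox  jumps over the lazy dog near the river bank"
typo = "the quick brown fox jumps over the lasy dog near the river bank"

print(exact_fingerprint(original) == exact_fingerprint(extra_space))
print(similarity(original, extra_space))
print(similarity(original, typo))
```

The extra space changes the exact hash but leaves the shingle similarity at 1.0, while the misspelling lowers the similarity yet still leaves most shingles shared. A crawler could treat pages above some similarity threshold as already seen; the threshold itself would have to be tuned.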
`13:04 Additionally, crawls are hard because how do you store the
`13:07 data once you've got it?
`13:09
`13:12 How many of you have had a database class?
`13:15 Come on, guys, I know you're out there.
`13:17 I hear you breathing.
`13:17 Come on!
`13:19
`13:20 How would you store a page in a database?
`13:23 It's hard work.
`13:24 Databases aren't optimized for this.
`13:27 And you're not going to need to do joins.
`13:28 There's no concept of query structures here.
`13:33 So we ended up having to build a file system called the
`13:36 Google File System.
`13:37 And it's that technology called Big Table that I'll
`13:39 talk about in a second.
`13:40 If you're interested, all the papers are hung off of
`13:43 google.com, and they're publicly available.
`13:46 Precisely to let us grab a piece of information, take
`13:51 what's called a hashmap of it-- which is a hash that has
`13:54 an error code in it, so that if you add spaces or move
`13:57 words around, I notice it--
`13:58 and then store them in a way which is redundant.
`14:02 Because then you always have the operational side as well.
`14:04 What happens if you lose a machine?
`14:06 Crawling seems easy.
`14:08 It's hard.
`14:08 And it's the easiest thing on this slide.
`14:11 After I crawl, remember we've got the crawl that's running
`14:15 until the end of time, until you run out of space--
`14:17 I can't remember the joke I made before, but rewind a
`14:20 little bit.
`14:21 For those of you who are surfing the web, just go find
`14:23 a crawl paper.
`14:25 Then you have to index everything you just crawled so
`
`
`
`14:29 you can find it later.
`14:31 What's the right index structure?
`14:33 Come on, this is easy.
`14:34 Come on.
`14:36 It's not easy?
`14:37 What's the right index structure?
`14:39 You could index every single word on the page.
`14:43
`14:46 Easier, you could index every character on the page.
`14:49 Pop quiz--
`14:50 what's the most common character
`14:52 in the English language?
`14:54 Space.
`14:55
`14:59 What's the second most common character
`15:01 in the English language?
`15:02
`15:07 You're going to end up with a lot of index entries for
`15:08 space, aren't you?
`15:09
`15:16 OK, so you can index every character.
`15:18 It's not very useful.
`15:19 Why is it not very useful?
`15:22 Because every single time you get a query, you're going to
`15:24 have to go through and reassemble all those
`15:26 characters into words and then map against all the documents.
`15:30 Probably the wrong index
`15:32 structure, but pretty flexible.
`15:33
`15:35 You could index trigrams, index three words at a time.
`15:40 Douglas C. Merrill, you can index that, right?
`15:44 Would that be better or worse?
`15:47 Well, different.
`15:48 What happens if you do a search for Douglas Merrill?
`15:51 Or worse, you shorten my name, which annoys the hell out of
`15:53 me, and do a search for Doug Merrill.
`15:58 A trigram index is going to break because you're not going
`16:00 to have that entry.
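The trade-off between index granularities can be seen in a toy inverted index. This is a sketch under stated assumptions: `build_index`, `lookup_all`, and the two sample documents are invented for illustration, and nothing here reflects the index structure the speaker declines to describe. With `n=1` the index is keyed by single words; with `n=3` it is keyed by runs of three consecutive words.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace, stripping periods and commas."""
    return [w.strip(".,").lower() for w in text.split() if w.strip(".,")]

def build_index(docs, n=1):
    """Map each run of n consecutive words to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = tokenize(text)
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return index

def lookup_all(index, query):
    """Doc ids containing every query word, using a word-level (n=1) index."""
    postings = [index.get((w,), set()) for w in tokenize(query)]
    return set.intersection(*postings) if postings else set()

DOCS = {
    1: "Douglas C. Merrill spoke in Prague.",
    2: "Merrill is a common surname.",
}

word_index = build_index(DOCS, n=1)
trigram_index = build_index(DOCS, n=3)

print(lookup_all(word_index, "Douglas Merrill"))               # word index finds doc 1
print(trigram_index.get(("douglas", "c", "merrill"), set()))   # exact three-word entry exists
print(trigram_index.get(("douglas", "merrill"), set()))        # no such entry in a trigram index
```

The word index answers the two-word query by intersecting posting sets, which costs work at query time. The three-word index answers its exact phrase in one lookup but has no entry at all for the shortened name, which is the failure described above.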
`16:04 If you look at all the search engines in the world today,
`16:07 they all have one or more of these index structures.
`16:10 No, I'm not going to tell you what ours is.
`16:12 But it's in the space of somewhere between characters
`16:14 and trigrams. And the index structure is going to have
`16:18 huge implications on the stuff I'm going to
`16:20 talk about in a second.
`16:23 And so far, we're still in the easy stuff.
`
`
`
`16:27 Then you get a query.
`16:30 So you go to google.cz, you key some words into the box,
`16:34 you hit enter, you get a bunch of results back.
`16:35 Simple, right?
`16:37 No problem.
`16:38 On average, we return 10 results in 400 milliseconds,
`16:43 half a second.
`16:45 That's not too bad.
`16:47 What's the speed of light, latency, from a query served
`16:51 here, if it's served from, say, Northern California?
`16:54
`16:57 About 2/3 of that time.
`16:58 So clearly we can't serve everything
`17:00 from one data center.
`17:02 Leave aside the storage and power, et cetera,
`17:04 et cetera, et cetera.
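A back-of-the-envelope way to reason about this is to compute the idealized propagation delay between the two locations. The sketch below assumes approximate coordinates for Prague and the San Francisco Bay Area, a straight great-circle path, and the common rule of thumb that light in optical fiber travels at about two thirds of its vacuum speed. It gives only a lower bound: real cable routes are longer, and routing, queuing, and connection setup add further delay.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def round_trip_ms(distance_km, speed_km_per_s):
    """Time for a signal to travel there and back, in milliseconds."""
    return 2 * distance_km / speed_km_per_s * 1000

# Approximate coordinates (assumptions): Prague and the San Francisco Bay Area.
PRAGUE = (50.08, 14.44)
BAY_AREA = (37.77, -122.42)

C_VACUUM = 299_792.458        # km/s, speed of light in vacuum
C_FIBER = C_VACUUM * 2 / 3    # rough rule of thumb for light in optical fiber

d = great_circle_km(*PRAGUE, *BAY_AREA)
print(f"distance: {d:.0f} km")
print(f"round trip in vacuum: {round_trip_ms(d, C_VACUUM):.0f} ms")
print(f"round trip in fiber:  {round_trip_ms(d, C_FIBER):.0f} ms")
```

Whatever the exact numbers, a fixed physical floor on every request is the reason to serve queries from data centers close to the user.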
`17:06 And then there's all the fun of actually doing the ranking
`17:09 and picking out which result goes to the top, et cetera.
`17:13 It turns out it's harder to build a search engine than it
`17:15 seems. Next slide, please.
`17:18
`17:21 We want to give you the right answer at the top every time.
`17:27 So there's a lip right here.
`17:30 I'm taking bets about the odds that I go head over heels over
`17:33 the lip at some point during this talk.
`17:35 I'm currently giving 5:1 that I end up on my face, just FYI.
`17:38 Any takers?
`17:40 OK, we want to give the right answer every time at the top.
`17:46 This is the art of ranking.
`17:48 How do you know what the right result is?
`17:53 Larry and Sergey came up with the concept of Page Rank.
`17:57 So have any of you read the Page Rank paper?
`18:00
`18:03 Wow.
`18:03 What classes have assigned it, or are you guys just
`18:05 over-achievers?
`18:07
`18:09 There's a lot of you.
`18:09 That's creepy.
`18:10 OK, usually like one person raises their hand, and it's
`18:14 the person you don't like.
`18:15 There are, like, 30 of you.
`18:17 Wow.
`18:18 This is cool.
`18:21 Core concept of Page Rank.
`18:24 How many of you have met me?
`
`
`
`18:26
`18:29 Come on, you guys have met me?
`18:30 Stephanie--
`18:33 the Google people should raise their hands, geez!
`18:36 And so none of you have any idea who I am.
`18:38 Why are you here?
`18:40
`18:42 Oh, right.
`18:42 You're here because somebody you trust--
`18:47 or, well--
`18:48 [LAUGHTER]
`18:52 OK, let's just pretend.
`18:56 You're here because somebody you trust said that I was
`18:59 worth listening to.
`19:00 You're here to listen to me-- and I make it--
`19:02 you're here to listen to me because somebody else
`19:05 suggested that I would have content--
`19:09 how are you--
`19:11 twice, I made it-- that I would have content worth
`19:14 listening to.
`19:16 Fundamentally, you're trusting that I have useful content
`19:18 because someone you trust said so.
`19:21 Page Rank is the same idea.
`19:24 Some arbitrary page on the web is most likely garbage.
`19:27
`19:30 However, if someone you like links to that page, basically
`19:35 saying this page isn't garbage, it's more likely that
`19:40 page is useful.
`19:41 Page Rank is simply a sum of the vertices
`19:50 of a directed graph.
`19:53 Start from a top page, make a graph downward of links.
`19:57 Edges are links, nodes are pages.
`20:00 Take a sum of the weights across those links,
`20:02 you get Page Rank.
`20:03 Thus, the more linked something is, the
`20:05 higher its Page Rank.
`20:07 Thus, the more a page is connected across the web, the
`20:10 more likely that page is good.
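The description above is deliberately simplified. A minimal sketch of the usual textbook formulation, iterating rank over a directed link graph with a damping factor, looks like the following. The damping value of 0.85, the iteration count, and the four-page graph are illustrative assumptions, and this is not presented as the ranking code any search engine runs.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping page -> list of pages it links to.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: three pages link to "hub", "hub" links to "a".
GRAPH = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
}

for page, score in sorted(pagerank(GRAPH).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

Pages with many inbound links ("hub") or with a link from a highly ranked page ("a") score highest, while pages nobody links to ("b", "c") keep only the baseline share. That matches the trust analogy: a recommendation counts for more when it comes from a well-regarded source.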
`20:15 What's wrong with this algorithm?
`20:16
`20:20 What if the links are garbage?
`20:23 So say for example, you have a blog, and your
`20:28 blog has open comments.
`20:31 And I write a bot that goes and finds your blog with all
`20:34 of its open comments and inserts a comment which is a
`20:37 link back to this page.
`
`
`
`20:40 Page Rank will see that as a link, and thus will think, oh,
`20:43 this page is better.
`20:45 Do you think that link is a useful signal?
`20:48 Probably not.
`20:50 So Page Rank was our first ranking algorithm designed to
`20:53 get the right results at the top every time.
`20:55 We now use something more than 200.
`20:57 Spam is an arms race.
`21:00 Every day, we have hundreds of engineers that work on trying
`21:03 to figure out what the person who's trying to game the
`21:06 system is going to do next.
`21:07 Now there's a fun job.
`21:09 Every day, you get to go to battle with the bad guys.
`21:13 Next slide, please.
`21:14
`21:18 And then you start thinking about, in addition to crawling
`21:29 the web, indexing the web, ranking the pages, maybe you
`21:35 ought to be nice to your users.
`21:37 Those pesky users.
`21:40 Some languages, like English, are relatively easy to enter
`21:43 search terms on.
`21:44 English doesn't have accents, I don't think.
`21:48 Do we have any?
`21:48 I don't think so.
`21:51 English doesn't have diacriticals.
`21:52
`21:55 So my English keyboard has one mode.
`21:59 Full stop.
`22:00 Not true for you guys.
`22:03 But as search engines get better and better coverage,
`22:07 they can get smarter and smarter, and they can start
`22:10 noticing things.
`22:11 For example, we can notice errors in user entries,
`22:18 specifically like you dropped the diacriticals, and we know
`22:22 it, so we can just add them back for you.
`22:25 How do we do that?
`22:26
`22:29 Come on, somebody guess.
`22:29 There's an obvious guess.
`22:30 Come on.
`22:32
`22:35 OK, I'll come out here and I'll guess.
`22:36 OK.
`22:37 OK, I'm going to come sit right next to you, and I'm
`22:38 going to guess.
`22:39 OK.
`
`
`
`22:41 I think you do it by having a bunch of
`22:42 people who speak Czech.
`22:44
`22:47 Four times, I made it without falling.
`22:48 AUDIENCE: [INAUDIBLE]
`22:50
`22:54 DOUGLAS MERRILL: That's a great guess, much
`22:56 better than my guess.
`22:57 Not right, but much better.
`22:59 Much better.
`23:00 So my guess is dumb.
`23:01 Why is my guess dumb?
`23:03 Because it doesn't scale.
`23:05 Your guess makes a lot of sense.
`23:07 Except it means that I have to teach the crawler and the
`23:11 indexer what is a diacritical.
`23:13 AUDIENCE: Is that hard?
`23:15 DOUGLAS MERRILL: Not as hard as doing it by hand.
`23:17 But you know what's easier still?
`23:20 What's easier still is watching your users.
`23:23 You take anonymized search traffic, and I can see people
`23:26 who start with that entry up top, and then go, ugh, and
`23:31 retype the entry below.
`23:35 And I can do statistical machine learning that says, oh
`23:38 right, these two are probably actually the same word.
`23:41 And then I don't have to teach it about diacriticals, I don't
`23:43 have to teach it about language, I just have to watch
`23:46 anonymized user traffic.
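The idea of learning corrections purely from what users retype can be sketched with simple counting. This is a toy illustration under stated assumptions: the sessions are invented, `learn_rewrites` and `min_count` are hypothetical names, and a real system would use far more data and a proper statistical model rather than raw pair counts.

```python
from collections import Counter, defaultdict

def learn_rewrites(sessions, min_count=2):
    """Count how often query A is immediately retyped as query B.

    `sessions` is a list of query lists, one list per anonymous session.
    Returns a dict mapping each query to its most frequent follow-up,
    keeping only pairs seen at least `min_count` times.
    """
    followups = defaultdict(Counter)
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first != second:
                followups[first][second] += 1
    rewrites = {}
    for first, counts in followups.items():
        second, count = counts.most_common(1)[0]
        if count >= min_count:
            rewrites[first] = second
    return rewrites

# Invented sample sessions; real traffic would be far larger and noisier.
SESSIONS = [
    ["skoda", "škoda"],
    ["skoda", "škoda", "škoda octavia"],
    ["skoda", "skoda octavia"],
    ["plzen", "plzeň"],
    ["plzen", "plzeň"],
]

print(learn_rewrites(SESSIONS))
```

Nothing in the code knows what a diacritical is or which language the queries are in. The mapping from the unaccented form to the accented one emerges only because enough users made the same correction.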
`23:48 AUDIENCE: Are there any users that [INAUDIBLE]
`23:51 DOUGLAS MERRILL: Say again?
`23:51 AUDIENCE: Are there any users that use diacriticals
`23:54 [UNINTELLIGIBLE] when searching?
`23:55 Because I never do.
`23:56 I always type it out, whatever it is.
`23:59 DOUGLAS MERRILL: Thank you for helping to
`24:00 improve our search quality.
`24:03 The answer oddly enough is yes, but fewer and fewer
`24:06 because we did the right thing.
`24:07 Next slide, please.
`24:08 But the next slide's the same--
`24:10 this is even better.
`24:11 This is the same problem only done from the other side.
`24:14
`24:17 We can do the same thing I just talked about, about
`24:19 diacritics and provide spell checking.
`24:22 How do we do it?
`
`
`
`24:23 The same way I just talked about.
`24:25 You see people starting at the top, which is the word for
`24:27 gym, right?
`24:28 For gymnasium.
`24:29 Apparently they're tired because they skipped a letter.
`24:32 So there's some sort of weird--
`24:34 But we can notice that you typed that word in, you
`24:39 probably will get a few results.
`24:40 In general, the other grand truth of the internet-- so
`24:43 grand truth number one was that the top-rated search is
`24:45 always about some woman.
`24:46 Grand truth two is no matter how badly you misspell a word,
`24:51 somebody's got a page that spelled it that way.
`24:53
`24:57 Anyway, it's never the right page.
`25:02 And so we always find that a couple minutes later, or a
`25:05 couple seconds later, more often, you redo the search.
`25:07 And so by doing statistical machine learning, I can learn
`25:10 how to spell in almost every language on the planet without
`25:16 having any notion of morphology, without having any
`25:18 generative grammar, without having any of the stuff that
`25:20 Steven Pinker talks about.
`25:23 All I've got is spell correction,
`25:25 which is pretty useful.
`25:26 In fact, it's so useful in English that I use it to
`25:29 actually spell check my words, because there are all these
`25:31 words I can't figure out how to spell, so it will teach me.
`25:34 And all done simply with statistical machine learning.
`25:38 So how many of you have had a statistical machine learning
`25:40 class, or has [UNINTELLIGIBLE] a topic in a class?
`25:42 Pay attention next time.
`25:43 It's important.
`25:44 Next slide.
`25:46
`25:50 OK, however to your question, it was in there someplace, I
`25:58 lost where.
`25:58 I apologize.
`25:59 Who actually does searches with diacriticals?
`26:01 Good point.
`26:02 We do, however, have more sources of
`26:04 data than just search.
`26:05 And those sources are the local products we've released
`26:08 in the market.
`26:11 The more content that gets created, the better off the
`26:15 internet is.
`26:17 But what's the interesting story of the internet?
`
`
`
`26:19 It's not actually Google or Seznam or Yahoo.
`26:22 That's not the interesting part.
`26:23 The interesting part of the story is the democratization
`26:25 of information creation.
`26:28 History has always been written by the winners.
`26:32 400 years ago, about 2% of the people could read or write.
`26:37
`26:39 And apparently all of them went to this university.
`26:41
`26:45 Now 200 years ago, between 10 and 20% of the people in the world
`26:53 could read or write, depending on your perspective.
`26:55 Nowadays, it's more than that.
`27:00 I hope a lot more, but I don't actually--
`27:03 have you ever read an American newspaper?
`27:05 It might surprise you.
`27:06 Anyway, leaving that aside, what the internet and tools
`27:12 like that have let us do is they have let everyone tell
`27:14 their story.
`27:15 So instead of history being written only by the winners,
`27:18 it's written by everyone.
`27:19 Everyone gets to tell their story, which is cool.
`27:22 Pop quiz, what's the difference between a
`27:25 revolution and a civil war?
`27:28 Who won.
`27:31 Because if the reigning government won,
`27:34 it's a civil war.
`27:35 If the reigning government lost, it's a revolution.
`27:37
`27:40 We built a bunch of tools to help people tell their story.
`27:43 We built a bunch of tools that help people tell their story
`27:45 in Czech, which allows me to improve my search quality even
`27:49 if, in fact, no one searches with diacriticals, because I'm
`27:52 getting content created that I can index.
`27:55 Next slide, please.
`27:57
`28:00 I don't really have anything to say on this slide, but it's
`28:02 a pretty picture.
`28:03
`28:06 So pretty?
`28:07 Yes?
`28:09 Anyone have any comments on this slide?
`28:11 Me neither.
`28:11 Next slide, please.
`28:13
`28:18 So the next time you have a class assignment to build a
`28:20 search engine, you know what you have to figure out.
`
`
`
`28:27 You have to figure out how to do a crawl and recognize that
`28:31 you've seen a page before and find an efficient way to store
`28:35 the page, find an efficient way to figure out if you've
`28:38 seen it before.
`28:40 And then you have to decide on an indexing scheme.
`28:42 You have to index characters, or maybe words, or maybe
`28:45 bigrams.
`28:48 You have to figure out a ranking system.
`28:49 Maybe you'll use Page Rank.
`28:51 Or maybe you'll be like us and you'll do hundreds of
`28:54 different things, some of which are fascinating computer
`28:57 science, and some of which are funny little hacks.
`29:03 But all of the things will then ultimately result in a
`29:07 search which works well in one context.
`29:11 Here's the place where I hope all of you are
`29:12 actually paying attention.
`29:14 So everyone who's asleep, please wake up.
`29:18 The last 10 years have been fascinating.
`29:21 We've done such great things worldwide in search.
`29:23 Seznam's done great things.
`29:25 We've done interesting stuff.
`29:26 There have been great companies doing
`29:28 great work for 10 years.
`29:29 The future's much harder, and much more interesting.
`29:34 Next slide, please.
`29:35
`29:37 So our mission was all the world's information
`29:39 universally accessible and useful.
`29:41 All the world's information universally
`29:44 accessible and useful.
`29:46 There are at least four huge computer science problems to
`29:51 solve in that context.
`29:53 For those of you who are interested in winning Turing
`29:55 Awards, pay attention.
`29:56 There's at least 30 of them on the next couple of slides.
`29:59 Next slide.
`30:01 Audience participation part number whatever--
`30:05 three, four, five, whatever number I'm on.
`30:08 What is this?
`30:09 AUDIENCE: The world.
`30:10 DOUGLAS MERRILL: OK.
`30:11
`30:14 OK, fair point.
`30:14 Yes, it's the world.
`30:16 I did actually give this talk once and I showed this slide,
`30:18 and someone said it's a photograph of the Earth.
`
`
`
`30:19
`30:22 And I was sort of intrigued by this.
`30:24 So how do you take a picture of the Earth and
`30:26 have it all be dark?
`30:28 But let's ignore that for now.
`30:31 Fair enough.
`30:32 It's not a photograph of the Earth, but it is
`30:33 a map of the world.
`30:34 What are the spots on it?
`30:35 What's changing?
`30:38 AUDIENCE: The number of searches conducted?
`30:39 DOUGLAS MERRILL: How did you know that?
`30:42 Nobody gets that right.
`30:43 Hey, you get out of here.
`30:44
`30:48 Well done.
`30:49 So pretend he's not here.
`30:53 Everybody says, hey look, it's city lights at night.
`30:56 It's not.
`30:58 OK, what we did is we took our query traffic for a day, and
`31:04 we put a little white dot every place that
`31:09 a query came from.
`31:12 So we geo-located the source of a query and we plotted it
`31:17 on the map over time.
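The aggregation step behind a map like this can be sketched by bucketing geolocated query origins into grid cells and counting them. The coordinates below are made up, and `bin_queries` and `cell_degrees` are hypothetical names; how the actual visualization was produced is not described in the talk.

```python
from collections import Counter

def bin_queries(points, cell_degrees=10):
    """Count query origins per latitude/longitude grid cell.

    `points` is an iterable of (latitude, longitude) pairs in degrees.
    Each key is the south-west corner of a cell.
    """
    counts = Counter()
    for lat, lon in points:
        cell = (int(lat // cell_degrees) * cell_degrees,
                int(lon // cell_degrees) * cell_degrees)
        counts[cell] += 1
    return counts

# Made-up sample of geolocated query origins.
SAMPLE = [(50.1, 14.4), (52.5, 13.4), (48.9, 2.4), (37.8, -122.4), (35.7, 139.7)]

for cell, n in sorted(bin_queries(SAMPLE).items()):
    print(cell, n)
```

Repeating the count for each time slice of the day and drawing a dot per non-empty cell would give the animated effect described, with empty cells staying dark.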
`31:18 And you see some things, like you can see the United States
`31:22 pretty clearly.
`31:23 You can see Western Europe pretty clearly.
`31:25 You can see Tokyo over there, it's [UNINTELLIGIBLE], a
`31:27 little bit of China.
`31:29 And you can see it's clearly temporal, because remember,
`31:31 time is flowing in this diagram.
`31:33 And although I've taken the scale off it, it turns out the
`31:36 people seem to search a lot in the morning and the night,
`31:38 which makes sense because we all work for a living, except
`31:41 all of you.
`31:42 But anyway what else is interesting about this slide?
`31:49 Where is Africa?
`31:50
`31:55 I flew over it a couple of days ago.
`31:57 It was there.
`32:00 Really.
`32:01 So what's going on?
`32:02 What's going on is it turns out that the continent of
`32:05 Africa is served by basically two very large
`32:08 wired internet cables.
`32:09 Two.
`
`
`
`32:10 One runs down the east coast, one runs down the west coast.
`32:13 Remarkable how that works.
`32:15 Each of those internet cables is connected to the ground by
`32:19 things called points of presence.
`32:20 Those points of presence, there are about 10 of them,
`32:22 land in governmentally controlled centers.
`32:27 What is true about the internet
`32:29 everywhere in the world?
`32:31 One, it destabilizes authoritarian governments, and
`32:35 two, it's a great source of tax revenue.
`32:40 So what does that suggest is going to be the case for the
`32:43 wired internet in Africa?
`32:45
`32:49 AUDIENCE: [INAUDIBLE]
`32:50 controlled by government.
`32:51 DOUGLAS MERRILL: Oh, well done, sir.
`32:53 It's going to be controlled by the government.
`32:54 It's going to be really, really darn spendy.
`32:56 In fact, in some parts of sub-Saharan Africa, the cost
`33:01 of an hour's internet time in an internet cafe is about the
`33:05 same as one month's total salary on average.
`33:10 That suggests there ain't going to be a whole lot of
`33:13 wired internet use, right?
`33:14 So there are about 100,000 plus/minus wired internet
`33:17 connections in Africa.
`33:19 But you know what else there are?
`33:21 10 million internet-enabled mobile phones.
`33:26 Let's say your mission is all the world's information
`33:28 universally accessible and useful.
`33:30 What would you be working on?
`33:32 Search on mobile devices.
`33:33 Next slide, please.
`33:35
`33:38 So how many of you are carrying a laptop?
`33:41 It should be almost all of you, right?
`33:44 OK, how many of you are carrying a phone?
`33:46
`33:49 Even in a classroom, there are probably 50%
`33:53 more phones than laptops.
`33:55 Imagine what it's like in places that aren't scho