[repost] The Mathematics Behind Google’s PageRank


original: http://www4.ncsu.edu/~ipsen/ps/slides_man.pdf

The Mathematics Behind Google’s PageRank
Ilse Ipsen
Department of Mathematics
North Carolina State University
Raleigh, USA
Joint work with Rebecca Wills

Two Factors
Determine where Google displays a web page on
the Search Engine Results Page:
1. PageRank (links)
A page has high PageRank
if many pages with high PageRank link to it
2. Hypertext Analysis (page contents)
Text, fonts, subdivisions, location of words,
contents of neighbouring pages
PageRank
An objective measure of the citation importance of
a web page [Brin & Page 1998]
• Assigns a rank to every web page
• Influences the order in which Google displays
search results
• Based on link structure of the web graph
• Does not depend on contents of web pages
• Does not depend on query
More PageRank, More Visitors

PageRank
“… continues to provide the basis for all of our web search tools”
(http://www.google.com/technology/)
• “Links are the currency of the web”
• Exchanging & buying of links
• BO (backlink obsession)
• Search engine optimization
Overview
• Mathematical Model of Internet
• Computation of PageRank
• Is the Ranking Correct?
• Floating Point Arithmetic Issues
Mathematical Model of Internet
1. Represent the internet as a graph
2. Represent the graph as a stochastic matrix
3. Make the stochastic matrix more convenient ⇒ Google matrix
4. Dominant eigenvector of the Google matrix ⇒ PageRank
The Internet as a Graph
Link from one web page to another web page
Web graph:
Web pages = nodes
Links = edges
[Figure: the internet as a graph]

The Web Graph as a Matrix
[Figure: a web graph with 5 pages; page 1 links to pages 2 and 4, page 2 links to pages 3, 4, and 5, page 3 links to page 4, page 4 links to page 5, and page 5 links back to page 1]

        ( 0   1/2   0   1/2   0  )
        ( 0    0   1/3  1/3  1/3 )
    S = ( 0    0    0    1    0  )
        ( 0    0    0    0    1  )
        ( 1    0    0    0    0  )
Links = nonzero elements in matrix
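
As an aside that is not in the slides: in Python one might assemble S from adjacency lists roughly as follows. The function name and the use of scipy are my own choices, and the sketch assumes every page has at least one outlink, so that each row sums to 1.

    import scipy.sparse as sp

    def link_matrix(outlinks, n):
        # outlinks maps each page to the list of pages it links to
        rows, cols, vals = [], [], []
        for i, targets in outlinks.items():
            w = 1.0 / len(targets)              # equal weight per outlink
            for j in targets:
                rows.append(i); cols.append(j); vals.append(w)
        return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

    # The 5-page example above, with pages renumbered 0..4:
    S = link_matrix({0: [1, 3], 1: [2, 3, 4], 2: [3], 3: [4], 4: [0]}, 5)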
Properties of Matrix S
• Row i of S: links from page i to other pages
• Column i of S: links into page i
• S is a stochastic matrix:
  all elements are in [0, 1], and the elements in each row sum to 1
• Dominant left eigenvector:
  ω^T S = ω^T,  ω ≥ 0,  ‖ω‖_1 = 1
• ω_i is the probability of visiting page i
• But: ω is not unique
Google Matrix
Convex combination

  G = α S + (1 − α) 𝟙 v^T

where the second term 𝟙 v^T has rank 1.

• Stochastic matrix S
• Damping factor 0 ≤ α < 1, e.g. α = 0.85
• Column vector of all ones 𝟙
• Personalization vector v ≥ 0, ‖v‖_1 = 1

The rank-1 term models teleportation.
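
A quick sanity check, mine rather than the slide’s: row i of 𝟙 v^T is v^T, which sums to ‖v‖_1 = 1, so G 𝟙 = α S 𝟙 + (1 − α) 𝟙 (v^T 𝟙) = α 𝟙 + (1 − α) 𝟙 = 𝟙, and G is indeed stochastic.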
PageRank
G = α S + (1 − α) 𝟙 v^T

• G is stochastic, with eigenvalues
  1 > α |λ_2(S)| ≥ α |λ_3(S)| ≥ …
• Unique dominant left eigenvector:
  π^T G = π^T,  π ≥ 0,  ‖π‖_1 = 1
• π_i is the PageRank of web page i

[Haveliwala & Kamvar 2003, Eldén 2003, Serra-Capizzano 2005]

How Google Ranks Web Pages
• Model:
  internet → web graph → stochastic matrix G
• Computation:
  PageRank π is the dominant left eigenvector of G;
  π_i is the PageRank of page i
• Display:
  if π_i > π_k, then page i may be displayed before page k,
  depending on hypertext analysis

Facts
• The anatomy of a large-scale hypertextual web search engine [Brin & Page 1998]
• US patent for PageRank granted in 2001
• Google indexes tens of billions of web pages (1 billion = 10^9)
• Google serves ≥ 200 million queries per day
• Each query is processed by ≥ 1000 machines
• All search engines combined process more than 500 million queries per day
[Desikan, 26 October 2006]

Computation of PageRank
The world’s largest matrix computation
[Moler 2002]
• Eigenvector computation
• Matrix dimension is in the tens of billions
• The matrix changes often:
  250,000 new domain names every day
• Fortunately: the matrix is sparse

Power Method
Want: π such that π^T G = π^T

Power method:
  Pick an initial guess x^(0)
  Repeat
    [x^(k+1)]^T := [x^(k)]^T G
  until the termination criterion is satisfied

Each iteration is a matrix-vector multiply.
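
To make the iteration concrete, here is a minimal Python sketch (mine, not the speaker’s). It never forms G explicitly; the function name, tolerance, and iteration cap are all assumptions.

    import numpy as np

    def pagerank_power(S, alpha=0.85, v=None, tol=1e-8, max_iter=1000):
        # Power method for pi with pi^T G = pi^T, G = alpha*S + (1-alpha)*1*v^T.
        # S is a sparse row-stochastic matrix (e.g. scipy CSR); G is never formed.
        n = S.shape[0]
        if v is None:
            v = np.full(n, 1.0 / n)            # uniform personalization vector
        x = v.copy()                           # initial guess x^(0)
        for _ in range(max_iter):
            # [x^(k+1)]^T = alpha*[x^(k)]^T S + (1 - alpha)*(x^T 1) v^T
            x_new = alpha * (S.T @ x) + (1 - alpha) * x.sum() * v
            if np.abs(x_new - x).sum() < tol:  # 1-norm termination criterion
                return x_new
            x = x_new
        return x

    pi = pagerank_power(S)    # S from the 5-page example built earlier

Exploiting the rank-1 structure of G is what keeps each iteration at the cost of a single sparse matrix-vector product, as the next slide spells out.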
Matrix Vector Multiply

  x^T G = x^T ( α S + (1 − α) 𝟙 v^T )
        = α (x^T S) + (1 − α)(x^T 𝟙) v^T

Cost: number of nonzero elements in S

A power method iteration is cheap.
Error Reduction in 1 Iteration

  π^T G = π^T,   G = α S + (1 − α) 𝟙 v^T

  [x^(k+1) − π]^T = [x^(k)]^T G − π^T G = α [x^(k) − π]^T S

(the rank-1 teleportation terms cancel, since [x^(k)]^T 𝟙 = π^T 𝟙 = 1)

Error:  ‖x^(k+1) − π‖_1 ≤ α ‖x^(k) − π‖_1
        (error at iteration k+1 vs. iteration k)
Error in Power Method

  π^T G = π^T,   G = α S + (1 − α) 𝟙 v^T

Error after k iterations:

  ‖x^(k) − π‖_1 ≤ α^k ‖x^(0) − π‖_1 ≤ 2 α^k

since ‖x^(0) − π‖_1 ≤ 2  [Bianchini, Gori & Scarselli 2003]

The error bound does not depend on the matrix dimension.
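
For a sense of scale (my arithmetic, not from the slides): with α = 0.85 the bound gives ‖x^(k) − π‖_1 ≤ 2 (0.85)^k, so an error below 10^−8 is guaranteed after about k ≈ log(2 × 10^8) / log(1/0.85) ≈ 118 iterations, regardless of how large the web graph is.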
Advantages of the Power Method
• Simple implementation (few decisions)
• Cheap iterations (sparse matvec)
• Minimal storage (a few vectors)
• Robust convergence behaviour
• Convergence rate independent of matrix
dimension
• Numerically reliable and accurate
(no subtractions, no overflow)
But: can be slow
PageRank Computation
• Power method
Page, Brin, Motwani & Winograd 1999
Bianchini, Gori & Scarselli 2003
• Acceleration of power method
Kamvar, Haveliwala, Manning & Golub 2003
Haveliwala, Kamvar, Klein, Manning & Golub 2003
Brezinski & Redivo-Zaglia 2004, 2006
Brezinski, Redivo-Zaglia & Serra-Capizzano 2005
• Aggregation/Disaggregation
Langville & Meyer 2002, 2003, 2006
Ipsen & Kirkland 2006
PageRank Computation
• Methods that adapt to web graph
Broder, Lempel, Maghoul & Pedersen 2004
Kamvar, Haveliwala & Golub 2004
Haveliwala, Kamvar, Manning & Golub 2003
Lee, Golub & Zenios 2003
Lu, Zhang, Xi, Chen, Liu, Lyu & Ma 2004
Ipsen & Selee 2006
• Krylov methods
Golub & Greif 2004
Del Corso, Gullí, Romani 2006
PageRank Computation
• Schwarz & asynchronous methods
Bru, Pedroche & Szyld 2005
Kollias, Gallopoulos & Szyld 2006
• Linear system solution
Arasu, Novak, Tomkins & Tomlin 2002
Arasu, Novak & Tomkins 2003
Bianchini, Gori & Scarselli 2003
Gleich, Zukov & Berkin 2004
Del Corso, Gullí & Romani 2004
Langville & Meyer 2006
PageRank Computation
• Surveys of numerical methods:
Langville & Meyer 2004
Berkhin 2005
Langville & Meyer 2006 (book)
Is the Ranking Correct?
π^T = ( .23  .24  .26  .27 )

• x^T = ( .27  .26  .24  .23 ),  ‖x − π‖_∞ = .04
  Small error, but incorrect ranking

• y^T = ( 0  .001  .002  .997 ),  ‖y − π‖_∞ = .727
  Large error, but correct ranking
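
These numbers are easy to verify; the small Python snippet below (my own) reproduces the two errors and compares the three rankings:

    import numpy as np

    pi = np.array([.23, .24, .26, .27])
    x  = np.array([.27, .26, .24, .23])
    y  = np.array([0, .001, .002, .997])
    print(np.max(np.abs(x - pi)))   # 0.04  -> small error
    print(np.max(np.abs(y - pi)))   # 0.727 -> large error
    print(np.argsort(-pi))          # [3 2 1 0]: the true ranking, best page first
    print(np.argsort(-x))           # [0 1 2 3]: x ranks the pages backwards
    print(np.argsort(-y))           # [3 2 1 0]: y gets the ranking right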

[repost] Google and Netflix Strategy: Use Partial Responses to Reduce Request Sizes


This strategy reduces the amount of protocol data in each packet by sending only the attributes that are needed. Google calls this Partial Response and Partial Update.

Netflix posted about adopting this strategy in their recent Netflix API redesign. We’ve seen previously how Netflix improved performance by creating less chatty protocols.

As a consequence, packet sizes rise as more data is stuffed into each packet to reduce the number of round trips. But we don’t like large packets either (memory usage and packet-processing overhead), so we have to think of creative ways to shrink them back down.

The change Netflix is making is to conceptualize their API as a database. What does this mean?

A partial response is like a SQL select statement in which you specify only the fields you want back; only the attributes of interest are requested. Previously, all fields of an object were returned even if the client didn’t need them, so the goal is to reduce payload sizes by being more selective about what data is returned.

An example Google uses is:

GET /myFeed?fields=id,entry(author)

The fields parameter selects the attributes to return out of a much larger feed resource.
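
A response to that request would then carry only the selected fields. The JSON below is my own illustration of the shape, not Google’s documented format:

    { "id": "myFeed", "entry": [ { "author": "Jane" }, { "author": "Ravi" } ] }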

A partial update works similarly: you send only the fields you want to update.
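
In the same spirit as the GET example above, a partial update might look like this; the verb, URL, and payload are illustrative rather than lifted from Google’s or Netflix’s docs:

    PATCH /myFeed/entry1?fields=author

    { "author": "Jane" }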

Synchronization Issues?

Clearly this will reduce the number of attributes in flight, but are there any problems with this strategy? A major problem I’ve experienced when using partial data is synchronization.

Thinking of an API as a database moves the problem into the general class of keeping two distributed models in sync when both sides are changing and the connection between them is unreliable. Now each model receives only partial data; if any data is lost or retransmitted between requests, resources drift out of sync, which can lead to integrity problems.

A user-account model on the client side, for example, could ask for the user’s preferences just once. Those preferences could then change via another client or some backend system, and the first client would never pick up the changes; during that interval it could be making a lot of bad choices. The user could then change those preferences on the client and overwrite any updates that have happened since the last request. If you are dealing with alarms and alerts, this all gets a lot worse. With a state-synchronization model, where the entire resource is returned, the window for these errors is much smaller because updated preferences are always being returned.

This opens up an entire rat’s nest of workarounds, like doing a complete resource sync before writes, but the system gets a lot more complicated. Another direction is to drop responses completely: in a database-sync model there is really no need for direct responses anymore. What’s needed is for all aggregated changes to sync back to clients, so that each client can bring its side of the model back in sync with the Netflix model.

Just some things to think about if you are considering this approach.

original: Google and Netflix Strategy: Use Partial Responses to Reduce Request Sizes