September 03, 2008

Day-to-Day Tall Head of URL Exploration

This is the 4th post on the statistics of URL exploration. In the previous three (The Long Tail of URL Exploration, What does the Nth Explorer of the Web Find? and The Tall Head of URL Exploration) I looked at how adding users grows the long tail and tall head of URLs for a single day.  Today, the data covers 20 days with a relatively constant population.

To get some idea how the tall head evolves, compare the tall head on day 0 with 19 successive days. The plot below shows the Top 10, Top 50, Top 100, Top 500 and Top 1000 URLs for day zero and the fraction of the top URLs on day zero appearing in the tall head on the nth day.

 

Tall Head URLs by day
Figure 1.  Red-Top 10 URLs; Blue Top 50 URLs;
Green-Top 100 URLs; Cyan-Top 500 URLs; Yellow-Top
1000 URLs. (URLs ranked by visits).
 

While the Top 10 and Top 50 URLs show stability day after day, the Top 500 and Top 1000 roll over at a fairly constant rate after day one.  The plot can be used to estimate the size of the persistent tall head of URLs for this population and the rate at which the tall head evolves.

First, look for a change in behavior from maintaining a constant fraction of the day-0 URLs to a steady decline from day to day. By this heuristic, estimate the persistent tall head to be between 50 and 100 URLs.

Secondly, to estimate the turnover of the tall head, choose the approximate desired tall head, e.g., the Top 500 URLs (cyan), and look at the slope of the line for days 1-19. (Alternately, choose a timescale for which the tall head should turn over to a given fraction remaining, say, 75%, giving a timescale of approximately 15 days.)

Tall Head Top 500 Fit


Figure 2.  Red-Top 500 URLs; Blue Top 1000 URLs, shown
for comparison; Green-Fit to Top 500 URLs. (Days 1-20,
URLs ranked by visits).

The plot above shows the Top 500 URLs rollover about 0.5% per day from days 1 to 20.

September 01, 2008

Read in the last 100 days...

I really enjoyed My Name is Red. It is a complicated mystery set in Istanbul and covers a lot of background on the history art in the Muslim world. It is very well done. Leinad Zeraus' book is a fun ride, a mystery novel written especially for geeks.

Energize Your Heart: In 4 di... Bair, Puran and Susanna
My Name Is Red... Pamuk, Orhan
Open Road, The: The global j... Iyer, Pico
Daemon... Zeraus, Leinad
History of Last Night's Drea... Kamenetz, Rodger
End of Certainty, The: Time,... Prigogine, Ilya
Bus Driver Who Wanted To Be ... Keret, Etgar
Gnostic Gospels, The... Pagels, Elaine
Difference, The: How the pow... Page, Scott E.

August 29, 2008

The Tall Head of URL Exploration

In The Long Tail of URL Exploration, I looked at the distribution of URL visits by 102K people in a day.  In What does the Nth Explorer of the Web Find?, I looked at how adding users grows the long tail and the number of unique URLs explored.  In this 3rd (of 4) post, I look at the how the tall head changes as we add explorers.

The tall head is the short list of sites that get the most visits.  We could call it the Top 10, Top 100 or whatever we think is relevant.  Later I propose a couple of heuristics for determining tall head membership in real time.

The tall head is made up of URLs that much of the population visits.  These are the "winner take all" URLs of user attention--URLs like cnn.com, google.com, or facebook.com.  For this reason, we might expect that while the long tail is growing with more unique URLs and the number of URLs with 2-3 hits is growing rapidly, the tall head is relatively stable.

This is the case.

One way to think about how stable the tall head might be is to ask how well a subset of the population predicts membership in the tall head for the entire sample.  The data from the last two posts is well-suited to look at this question.

Below is a plot of the accuracy of various subsets of the population (the same subsets we used previously) in predicting the Top 10, Top 20, Top 50 and Top 100 URLs of the entire population.  Just over 40% percent of the population predicts the Top 100 URLs with 90% accuracy.  The Top 10 are predicted to 90% accuracy by 10% of the population.

predicting the tall head


Figure.  Red-Top 10 URLs by visit; Blue Top 20
URLs by visit; Green-Top50 URLs by visits; Cyan-Top
100 URLs by visits.

The composition of the tall head depends relatively weakly on the subset of the population doing the predicting.

How can I predict the tall head for the day by 9 am in the morning? This is the real-time problem of long tail distributions.  The dynamics of the system are that real time Web exploration data appears as a time-ordered list of URLs from whatever users happen to be surfing.  This means that a real time heuristic for determining top URLs for the day has to rely on the properties of the time series including a small surfer sample size and recent counts of visits.

Fortunately, by the results illustrated above, a small sample size is a pretty good bet for determining tall head URLs. What we are still missing is metrics or intuition for how the long tail distribution evolves over time.

We do know that for a URL to end up in the tall head, it must be visited by many Web explorers.  This means that we can rule out all URLs that are visited by only one or two users.  This assumption also leads to a heuristic based on the time between visits--URLs visited by many people should have the same visit/time distribution as the users/time distribution of the entire sample.  More specifically, we might guess that if the time between visits has an average near 1 day/number of visits and relatively low variance, it is likely to be in the tall head.  A project for a little later...

Next post: How does the composition of the tall head change from day to day?

August 28, 2008

Worth Your Weight in Gold - Link

Evil Mad Scientist has fun analysis of this question for all kinds of materials from different forms of money (dimes = quarters by weight) to flour and culminating in antimatter (BTW, very valuable by weight!).  EMS gives the question a few entertaining twists.  Here is an excerpt:

Kopi Luwak coffee costs approximately the same amount per pound as human blood. (Knowing where it comes from, I think I'd rather drink the blood. It's been pointed out before that printer ink is also up there, but I'd rather not drink that either.)

Would you have guessed that peacock feathers can be worth more than their weight in dollar bills? Or that a fancy steak costs twice as much as its weight in dollar coins?

August 27, 2008

What does the Nth explorer of the Web find?

In The Long Tail of URL Exploration, I looked at the distribution of URLs and visits. This was on the way to trying to answer questions like:

  • How much overlap is there between the URLs 10 people visit and those in the 11th person's click stream?
  • How about the 100th or 100,000th person? Does the millionth user explore any unique URLs at all?
  • Can we build a model to answer How many people are required to crawl 10% of the Web?


The second part of the answer is to look at how the model of URLs and visits evolves as we add users.  To get samples with of different sizes using the same click stream data set, randomly select a subset of the users and run the analysis from the previous post.  Through everyone back into the pot and randomly select a slightly larger set.  Repeat.

I reran the model for 3%, 4%, 5%, 7%, 9%, 11%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of the overall users in the 1-day data set. The sample sized ranged from 3,000 to 91,200 users. For the entire data set, the average user made 184 URL visits during the day. In the randomly chosen subsets, users made an average of between 181 and 187 URL visits with most of the variation in the smaller sample size as expected.

Do I expect the number of unique URLs be linearly proportional to the number of users? Or if users are visiting many of the same URLs and URLs tend to have the "winner take all" properties we looked at before, we might expect the number of unique URLs added by the 90,000th user to be fewer than the number of unique URLs added by the 1,000th user.

I first plotted the number of unique URLs against the number of users in each sample.  The curve looks straight but may be slightly concave downward.  It is very subtly.  I needed to look at the data in a way the amplified the change over the various subsamples.

Below is a plot of the number of unique URLs/user vs. the number of users. This line is flat if the number of URLs is growing linearly with the number of users.

URLs per user


The blue curve is the best fit to another power function ( f(x)=ax^k ). The first few thousand users are contribute more original URLs (>90 URLs per user) to the sample than the 100,000th (83 URLs).  If you are the first explorer of the a new world, all of your discoveries are original; when you are a late comer, your contributions are around the margins. It may be surprising how much original content being explored by the 100,000th explorer.

Does the long tail get relatively longer or shorter?  For simplicity, I use the URLs with only one visit to represent the long tail.  Then ratio of 1-visit URLs to unique URLs decreases subtly. For the smallest samples size 70.0% of the unique URLs are hit only once; for the overall data set, the ratio is 69.2%. To amplify this change like above, the plot of 1-visit URLs per user is shown below.

long tail per user


At 100,000 users, the long tail is growing at 57 URLs per additional user.  The decrease with each additional user is slowing.  The blue curve is the best fit to another power function.

If the power function is the best explanation of the underlying dynamics, the number of unique URLs and the long tail both continue to grow no matter how many people are exploring.  Since an increasing number of people need to explore to keep the exploration rate constant, the cost of exploration per URL goes up as explorers are added.

Does anything interesting happen in the tall head where the big winners are? That will have to wait for another post.

 

The Long Tail of URL Exploration


Unlike robot crawlers, people visit only the links they think will be interesting.  People follow news stories, follow links from friends, follow links to videos or pictures.  People much more rarely follow links to boring stuff like privacy policies, lists of data or company mission statements.

In order to understand the pattern or people discovering content on the Web, I looked at the click streams of real Web users.  (For privacy hounds, I have real data, but the user names and URLs are hashed so I can't personally identify anyone or look up the actual URLs they visited.)

How much overlap is there between the URLs 10 people visit and those in the 11th person's click stream?  How about the 100th or 100,000th person? Does the millionth user explore any unique URLs at all? Can we build a model to answer How many people are required to crawl 10% of the Web?

The first part of the answer is to look at the distribution of URLs created by a group of users.  The sample has 102,000 user's click streams for 1 day.  I use only 1 day, because using multiple days complicates the estimates due to nearly all users having a set of URLs they visit daily. In the sample, users make 18.6M visits to just under 8.4M unique URLs.

When the URLs are ranked by number of visits, the distribution of visits over the 8.4M URLs shows that a few URLs get many hits while the tail of the distribution (way out to the right) is a long flat curve with many URLs getting only a handful of hits.

The distribution of visits can be fitted fairly accurately with a power law, p(x)=ax^k.  I don't plot the curve, because the head is so tall compared to the tail and the distribution falls off so quickly that the plot is a very sharp "L" shape and we don't get much from looking at it.  It is more useful to look at the cumulative distribution function (CDF) of the distribution.  This is the sum of the probabilities over the rank from highest to lowest.  Summing the probabilities to the lowest ranked URL, gives 100% of the visits recorded. Using the CDF perspective gives insight we can apply to practical situations.

Below is a plot of the CDF.  The red dots are the data points calculated from the sample while the blue line is the best fit to the CDF of the proposed power law distribution.  (The fit parameters are a=0.00219 and k=-0.690.)


CDF of URL Visits



One conclusion that comes out of this view of the data is that URL visits follow the so-called "80/20 Rule." This predicts 80% of the visits for the day went to roughly 20% of the URLs. Actually, for this data, the proportion is about 80/50--80% of the traffic when to the top 4.5M URLs or the top 57% of URLs.

This view shows that the tall head of big visit winners takes a significant fraction of the overall attention of the group.  These are likely URLs like www.google.com or www.cnn.com.

What does the long tail look like? The tail is just as surprising. For this data set, 5.8M URLs or 69% of the URLs visited during the day were visited only once. The number of URLs visited twice is 1.4M.

How do these numbers scale with the size of the group? That's coming in the next post. 

August 13, 2008

Synchronizing Fireflies Link

This is a really cool project.  It combines a great example of collective effects like the weakly coupled pendulum experiment with a simple extension of the Programmable LED project.  Nice work Tinkerlog!