In The Long Tail of URL Exploration, I looked at the distribution of URLs and visits. This was on the way to trying to answer questions like:

• How much overlap is there between the URLs 10 people visit and those in the 11th person’s click stream?
• How about the 100th or 100,000th person? Does the millionth user explore any unique URLs at all?
• Can we build a model to answer "How many people are required to crawl 10% of the Web?"

The second part of the answer is to look at how the model of URLs and visits evolves as we add users. To get samples of different sizes from the same click stream data set, randomly select a subset of the users and run the analysis from the previous post. Throw everyone back into the pot and randomly select a slightly larger set. Repeat.

I reran the model for 3%, 4%, 5%, 7%, 9%, 11%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of the overall users in the 1-day data set. The sample sizes ranged from 3,000 to 91,200 users. For the entire data set, the average user made 184 URL visits during the day. In the randomly chosen subsets, users made an average of between 181 and 187 URL visits, with most of the variation in the smaller samples, as expected.
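The resampling procedure described above can be sketched in a few lines of Python. This is a minimal sketch, not the code used for the post: it assumes the click stream is available as a list of `(user_id, url)` visit records, and the function name `sample_stats` and that record layout are hypothetical.

```python
import random
from collections import Counter

def sample_stats(clicks, fraction, seed=None):
    """Summarize URL visits for a random subset of users.

    clicks   -- list of (user_id, url) visit records (assumed layout)
    fraction -- share of all users to draw, e.g. 0.03 for 3%
    """
    rng = random.Random(seed)
    users = sorted({user for user, _ in clicks})
    k = max(1, round(fraction * len(users)))
    # Sampling without replacement from the full user list each time is
    # the "throw everyone back into the pot" step.
    chosen = set(rng.sample(users, k))
    visits = Counter(url for user, url in clicks if user in chosen)
    return {
        "users": k,
        "visits": sum(visits.values()),
        "unique_urls": len(visits),
        "one_visit_urls": sum(1 for c in visits.values() if c == 1),
    }
```

Calling `sample_stats(clicks, f)` for each fraction `f` in 0.03, 0.04, ..., 0.90 gives one row per sample; dividing `visits`, `unique_urls` and `one_visit_urls` by `users` yields the per-user quantities discussed in the rest of the post.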

Should I expect the number of unique URLs to be linearly proportional to the number of users? If users are visiting many of the same URLs, and URLs tend to have the "winner take all" properties we looked at before, we might instead expect the number of unique URLs added by the 90,000th user to be smaller than the number added by the 1,000th user.

I first plotted the number of unique URLs against the number of users in each sample. The curve looks straight but may be slightly concave downward. The effect is very subtle. I needed to look at the data in a way that amplified the change across the various subsamples.

Below is a plot of the number of unique URLs per user vs. the number of users. This line would be flat if the number of URLs grew linearly with the number of users.

The blue curve is the best fit to another power function ( f(x)=ax^k ). The first few thousand users contribute more original URLs (>90 URLs per user) to the sample than the 100,000th (83 URLs). If you are the first explorer of a new world, all of your discoveries are original; when you are a latecomer, your contributions are around the margins. It may be surprising how much original content is still being explored by the 100,000th explorer.
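For readers who want to reproduce this kind of fit, one common approach is ordinary least squares on the logarithms of both axes, since log f(x) = log a + k log x is a straight line. The post does not say which fitting method was used, so the sketch below is an assumption, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**k by least squares in log-log space.

    Requires all xs and ys to be positive. Returns (a, k).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((v - mean_x) ** 2 for v in lx)
    sxy = sum((v - mean_x) * (w - mean_y) for v, w in zip(lx, ly))
    k = sxy / sxx                      # slope of the log-log line
    a = math.exp(mean_y - k * mean_x)  # intercept, mapped back from logs
    return a, k
```

Here `xs` would be the number of users in each sample and `ys` the unique URLs per user. A slightly negative `k` corresponds to the gently falling curve described above; `k = 0` would be the flat line of purely linear growth.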

Does the long tail get relatively longer or shorter? For simplicity, I use the URLs with only one visit to represent the long tail. The ratio of 1-visit URLs to unique URLs decreases subtly. For the smallest sample size, 70.0% of the unique URLs are hit only once; for the overall data set, the ratio is 69.2%. To amplify this change as above, the plot of 1-visit URLs per user is shown below.

At 100,000 users, the long tail is growing at 57 URLs per additional user.  The decrease with each additional user is slowing.  The blue curve is the best fit to another power function.

If the power function is the best explanation of the underlying dynamics, the number of unique URLs and the long tail both continue to grow no matter how many people are exploring.  Since an increasing number of people need to explore to keep the exploration rate constant, the cost of exploration per URL goes up as explorers are added.
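The argument in the previous paragraph can be made concrete under one reading of the model. Assume the fitted curve describes the average number of unique URLs per user, a·n^k with -1 < k < 0. The total is then a·n^(k+1), which grows without bound, while the contribution of each additional user shrinks. The sketch below uses illustrative parameters only (they are not the fitted values from the plots), and the function names are hypothetical.

```python
def total_urls(n, a, k):
    """Total unique URLs after n users, if URLs per user is a * n**k."""
    return a * n ** (k + 1)

def marginal_urls(n, a, k):
    """New unique URLs contributed by the n-th user under the model."""
    return total_urls(n, a, k) - total_urls(n - 1, a, k)

def users_per_new_url(n, a, k):
    """Exploration cost: users needed per newly discovered URL at size n."""
    return 1.0 / marginal_urls(n, a, k)

# Illustrative parameters, not taken from the post's fits.
a, k = 100.0, -0.05
for n in (1_000, 100_000, 10_000_000):
    print(n, round(marginal_urls(n, a, k), 1), users_per_new_url(n, a, k))
```

Under these assumptions the marginal contribution never reaches zero, but the cost per new URL rises steadily with n, which is the trade-off described above.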

Does anything interesting happen in the tall head where the big winners are? That will have to wait for another post.