You are given the data set of all crawled URLs on the Internet; assume there are N of them. N is so large that the set cannot possibly be stored on one machine, so we've sharded the data set across S very cheap machines. Each data point (i.e., URL string) is stored on exactly one machine, perhaps chosen at random. Since each of these cheap machines has relatively low capacity, assume S itself is also very large. You're given a development machine of your own.
Design an algorithm to find the median of all known Internet URL lengths.
Note: fifteen minutes into the problem I'm going to have to remind you to read all the little assumptions carefully, so do it ahead of time. For example, I shouldn't have to tell you that the median and the mean are different things.
Paul is an engineer from Google. He asked me this question as a friendly challenge.
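One possible approach, not necessarily the intended answer: a URL length is a small bounded integer, so even though N and S are huge, each machine can summarize its shard as a histogram of lengths. The development machine merges these small histograms and scans for the median position. The sketch below simulates this with toy in-memory "shards" (the shard count, URL data, and variable names are all illustrative assumptions):

```python
from collections import Counter
import random

# Toy stand-in for the S machines: 5 shards, each holding a disjoint
# subset of the URL set. In reality each shard lives on its own machine.
random.seed(0)
urls = ["http://example.com/" + "x" * random.randint(1, 50) for _ in range(1001)]
shards = [urls[i::5] for i in range(5)]

# Step 1 (runs on each machine): build a histogram of URL lengths.
# The histogram is tiny regardless of shard size, so it's cheap to send back.
local_histograms = [Counter(len(u) for u in shard) for shard in shards]

# Step 2 (development machine): merge the per-shard histograms.
merged = Counter()
for h in local_histograms:
    merged.update(h)

# Step 3 (development machine): walk the merged histogram in length order
# until the cumulative count passes the median position.
n = sum(merged.values())
target = n // 2  # index of the median element for odd n
seen = 0
for length in sorted(merged):
    seen += merged[length]
    if seen > target:
        median = length
        break

print(median)
```

Each machine does one linear pass over its shard, and the data sent to the development machine is proportional to the number of distinct lengths, not to N. Note this relies on the note above: a histogram recovers the median exactly, whereas the mean would only need a sum and a count.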