MapReduce parallel processing of all known URLs for median length

You are given the data set of all URLs Google has crawled on the Internet -- a very large set. Write an algorithm to find the median URL length.


Problem Statement

You are given the data set of all crawled URLs on the Internet. Assume there are N of them. Obviously N is so incredibly and astonishingly large that the set cannot be stored on one machine, so we've employed S very cheap machines and sharded the data set across them. Each data point (i.e. URL string) is stored on exactly one machine, perhaps chosen at random. Since the capacity of each of these cheap machines is relatively low, assume S itself is an incredibly large number. You're also given a development machine of your own.

Design an algorithm to find the median of all known Internet URL lengths.
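To make the setup concrete, here is a minimal sketch of one MapReduce-style way to attack the problem. It is only an illustration, not necessarily the intended answer, and it rests on assumptions that are not stated in the problem: each shard can stream once over its own URLs, URL lengths are small bounded integers, and the tiny per-shard results can be shipped back to the development machine. The function names (map_shard, reduce_histograms, median_from_histogram) and the toy shards are hypothetical.

```python
from collections import Counter
from typing import Iterable

def map_shard(urls: Iterable[str]) -> Counter:
    """Map step, run independently on each shard: count URLs per length."""
    histogram = Counter()
    for url in urls:
        histogram[len(url)] += 1
    return histogram

def reduce_histograms(shard_histograms: Iterable[Counter]) -> Counter:
    """Reduce step: sum the per-shard histograms into one global histogram."""
    total = Counter()
    for h in shard_histograms:
        total.update(h)
    return total

def median_from_histogram(histogram: Counter) -> float:
    """Read the median length off the global histogram on one machine."""
    n = sum(histogram.values())
    target_lo, target_hi = (n - 1) // 2, n // 2  # handles odd and even n
    running = 0
    lo = hi = None
    for length in sorted(histogram):
        running += histogram[length]
        if lo is None and running > target_lo:
            lo = length
        if hi is None and running > target_hi:
            hi = length
            break
    return (lo + hi) / 2

# Hypothetical usage with three toy "shards" standing in for the S machines:
shards = [["http://a.com", "http://bb.com"],
          ["http://ccc.com"],
          ["http://d.io", "http://ee.io"]]
global_hist = reduce_histograms(map_shard(s) for s in shards)
print(median_from_histogram(global_hist))  # 12.0 for this toy data
```

The observation this sketch leans on is that the histogram of lengths stays tiny no matter how large N and S are, so the reduce step and the final median lookup fit comfortably on the development machine.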

Note: fifteen minutes into the problem, I'm going to have to remind you to read all the small assumptions carefully, so do that ahead of time. For example, I shouldn't have to tell you that the median and the mean are different things.


References

Paul is an engineer from Google. He asked me this question as a friendly challenge.


Written by

Amin Ariana

A software entrepreneur from San Francisco