The Wikimedia Foundation
The Wikimedia Foundation, Inc. (http://wikimediafoundation.org/) is a nonprofit charitable organization dedicated to encouraging the growth, development and distribution of free, multilingual, educational content, and to providing the full content of these wiki-based projects to the public free of charge. The Wikimedia Foundation operates some of the largest collaboratively edited reference projects in the world. You are probably most familiar with Wikipedia, a free encyclopedia available in over 50 languages (see https://meta.wikimedia.org/wiki/List_of_Wikipedias for a list of languages).
Information on all the projects that form the core of the Wikimedia Foundation is available at http://wikimediafoundation.org/wiki/Our_projects.
Aggregated page view statistics for Wikimedia projects are available at http://dumps.wikimedia.org/other/pagecounts-raw/. This page gives access to files that contain the total hourly page views for each Wikimedia project page. Information on the file format is given on the page view statistics page itself.
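As a starting point for working with these files, the following Python sketch parses a single record. It assumes each line holds four whitespace-separated fields (project code, page title, hourly view count, total bytes returned). That layout is an assumption here and should be checked against the format description on the page view statistics page. The names PageCount and parse_line are hypothetical helpers introduced only for illustration.

```python
from typing import NamedTuple, Optional


class PageCount(NamedTuple):
    """One record from an hourly page view file (assumed four-field layout)."""
    project: str     # project code, e.g. "en" (assumed to identify the site)
    title: str       # page title, assumed to contain no whitespace
    views: int       # number of requests for the page in that hour
    size_bytes: int  # total size of the content returned


def parse_line(line: str) -> Optional[PageCount]:
    """Return a PageCount, or None if the line does not match the assumed layout."""
    parts = line.split()
    if len(parts) != 4:
        return None
    project, title, views, size = parts
    try:
        return PageCount(project, title, int(views), int(size))
    except ValueError:
        # Non-numeric counts indicate a malformed record; skip it.
        return None
```

Returning None for malformed lines lets a mapper skip bad records instead of failing the whole job.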
Required Tasks
The task of this assignment is twofold:
1. Use HDFS and MapReduce to identify the popularity of Wikipedia projects by the number of pages of each Wikipedia site that were accessed over an x-hour period. The output of your job should let you directly identify the most popular Wikipedia sites accessed over the selected period. You can choose any x-hour period from the files available on the page view statistics page, with the constraint that x >= 6.
2. Use HDFS and MapReduce to identify the average page count per language over the same period, ordered by page count.
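The two tasks above can be sketched in Python as plain map and reduce functions, simulated locally before being ported to Hadoop (for example through Hadoop Streaming). This is a minimal sketch and not a reference solution. It assumes each input line has four whitespace-separated fields (project code, page title, view count, bytes), that the language is the part of the project code before the first dot, and that "average page count" means total views divided by the number of page records. Every function name below is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (project_code, view_count) for one well-formed record."""
    parts = line.split()
    if len(parts) == 4 and parts[2].isdigit():
        yield parts[0], int(parts[2])


def reduce_project(project: str, views: Iterable[int]) -> Tuple[str, int, int]:
    """Reducer: return (project, page_records, total_views) for one key."""
    pages = 0
    total = 0
    for v in views:
        pages += 1
        total += v
    return project, pages, total


def run_local(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Simulate the shuffle phase in memory: group mapper output by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return [reduce_project(k, v) for k, v in groups.items()]


def rank_projects(results: List[Tuple[str, int, int]]) -> List[Tuple[str, int, int]]:
    """Task 1: order projects by number of page records accessed, highest first."""
    return sorted(results, key=lambda r: r[1], reverse=True)


def average_per_language(results: List[Tuple[str, int, int]]) -> List[Tuple[str, float]]:
    """Task 2: average views per page record for each language, highest first."""
    pages: Dict[str, int] = defaultdict(int)
    views: Dict[str, int] = defaultdict(int)
    for project, p, t in results:
        lang = project.split(".")[0]  # assumption: "en.b" belongs to language "en"
        pages[lang] += p
        views[lang] += t
    averages = [(lang, views[lang] / pages[lang]) for lang in pages]
    return sorted(averages, key=lambda r: r[1], reverse=True)
```

Note that a page appearing in several hourly files is counted once per file here. Counting distinct pages over the whole period would need the page title in the key or an extra deduplication pass. On a real cluster the final ordering is usually done in a second job or in post-processing, since reducers only see their own keys.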