assignment7.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
    <meta name="description" content="Course homepage for CS 489 Big Data Infrastructure (Winter 2017) at the University of Waterloo">
    <meta name="author" content="Jimmy Lin">
    <title>Big Data Infrastructure</title>

    <!-- Bootstrap -->
    <link href="css/bootstrap.min.css" rel="stylesheet">

    <!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
    <link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">

    <style>
      body {
        padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
      }
    </style>

    <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
    <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
    <!--[if lt IE 9]>
      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
      <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
  </head>


  <body>

    <nav class="navbar navbar-inverse navbar-fixed-top">
      <div class="container">
        <div class="navbar-header">
          <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
            <span class="sr-only">Toggle navigation</span>
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
          </button>
        </div>
        <div id="navbar" class="collapse navbar-collapse">
          <ul class="nav navbar-nav">
            <li><a href="index.html">Overview</a></li>
            <li><a href="organization.html">Organization</a></li>
            <li><a href="syllabus.html">Syllabus</a></li>
            <li class="active"><a href="assignments.html">Assignments</a></li>
            <li><a href="software.html">Software</a></li>
          </ul>
        </div><!--/.nav-collapse -->
      </div>
    </nav>

    <div class="container">


  <div class="page-header">
    <div style="float: right"/><img src="images/waterloo_logo.png"/></div>
    <h1>Assignments <small>CS 489/698 Big Data Infrastructure (Winter 2017)</small></h1>
  </div>

  <div class="subnav">
    <ul class="nav nav-pills">
      <li><a href="assignment0.html">0</a></li>
      <li><a href="assignment1.html">1</a></li>
      <li><a href="assignment2.html">2</a></li>
      <li><a href="assignment3.html">3</a></li>
      <li><a href="assignment4.html">4</a></li>
      <li><a href="assignment5.html">5</a></li>
      <li><a href="assignment6.html">6</a></li>
      <li><a href="assignment7.html">7</a></li>
      <li><a href="project.html">Final Project</a></li>
    </ul>
  </div>

<section style="padding-top:0px">
<div>
<h3>Assignment 7: Inverted Indexing (Redux) <small>due 1:00pm April 3</small></h3>

<p>In this assignment you'll revisit the inverted indexing and boolean
retrieval program in <a href="assignment3.html">assignment 3</a>. In
assignment 3, your indexer program wrote postings to HDFS
in <code>MapFile</code>s and your boolean retrieval program read
postings from those <code>MapFile</code>s. In this assignment, you'll
write postings to and read postings from HBase instead. In other
words, the core program logic should not change, except for the backend
storage that you are using. This assignment is to be completed using
MapReduce in Java.</p>

<p><b>Note</b>: Due to the complexities of setting up HBase in
Altiscale, we have not been able to stand up an HBase cluster yet. For
now, work in the Linux Student CS environment (more details below). If
we manage to get a stable HBase cluster running on Altiscale, you will
complete the Altiscale portion of the assignment; otherwise, you will
not be responsible for getting code to run on Altiscale.</p>

<p>We have stood up a single-node HBase cluster in the Linux Student
CS environment. Check
out <a href="http://ubuntu1404-010.student.cs.uwaterloo.ca:16010/master-status"><code>http://ubuntu1404-010.student.cs.uwaterloo.ca:16010/master-status</code></a>. You
won't be able to play with HBase on sizeable collections, although the
assignment will still give you some experience of what developing
against HBase "feels like". For this assignment ssh
into <code>ubuntu1404-010.student.cs.uwaterloo.ca</code> and work
specifically on that host.</p>

<h4 style="padding-top: 10px">HBase Word Count</h4>

<p>To start, pull the Bespin repo and make sure you have the latest
version of the code. Take a look at <code>HBaseWordCount</code> in
package <code>io.bespin.java.mapreduce.wordcount</code>.
The <code>HBaseWordCount</code> program is like the basic word count
demo, except that it stores the output in an HBase table. That is, the
reducer output is directly written to an HBase table: the word serves
as the row key, "c" is the column family, "count" is the column
qualifier, and the value is the actual count.</p>

<p>The <code>HBaseWordCountFetch</code> program in the same package
illustrates how you can fetch these counts out of HBase and shows you
how to use the basic HBase "get" API.</p>

<p>Study these two programs to make sure you understand how they
work. The two sample program should give you a good introduction to
the HBase APIs. A free
online <a href="https://hbase.apache.org/book.html">HBase
book</a> is a good source of additional details.</p>

<p>Make sure you can run both
programs. Running <code>HBaseWordCount</code>:</p>

<pre>
hadoop jar target/bespin-0.1.0-SNAPSHOT.jar io.bespin.java.mapreduce.wordcount.HBaseWordCount \
 -config /u0/cs489/hbase-site.xml \
 -input data/Shakespeare.txt -table lintool-wc-shakes -reducers 5
</pre>

<p>Use the <code>-config</code> option to specify the HBase config
file: point to a version that we've prepared for you. This config file
tells the program how to connect to the HBase cluster. Use
the <code>-table</code> option to name the table you're inserting the
word counts into. The other options should be straightforward to
understand.</p>

<p><B>Note:</b> Since HBase is a shared resource, please make your
tables unique by using your username as part of the table name, per
above.</p>

<p>You should then be able to fetch the word counts from HBase:</p>

<pre>
hadoop jar target/bespin-0.1.0-SNAPSHOT.jar io.bespin.java.mapreduce.wordcount.HBaseWordCountFetch \
 -config /u0/cs489/hbase-site.xml \
 -table lintool-wc-shakes -term love
</pre>

<p>If everything works, you'll discover that the term "love" appears
2053 times in the Shakespeare collection.</p>

<h4 style="padding-top: 10px">HBase Storage</h4>

<p>Now it's time to write some code! Following past procedures,
everything should go into the package namespace
<code>ca.uwaterloo.cs.bigdata2017w.assignment7</code>. Note we're back
to coding in Java for this assignment.</p>

<p>You will write three programs: <code>BuildInvertedIndexHBase</code>,
<code>InsertCollectionHBase</code>,
and <code>BooleanRetrievalHBase</code>. Feel free to use code from
Bespin or assignment 3 starting points. Note that you don't need to
worry about index compression for this assignment.</p>

<p>The <code>BuildInvertedIndexHBase</code> program is the HBase
version of <code>BuildInvertedIndex</code> from the Bespin
demo. Instead of writing the index to HDFS, you will write the index
to an HBase table. Use the following table structure: the term will be
the row key. Your table will have a single column family called
"p". In the column family, each document id will be a column
qualifier. The value will be the term frequency.</p>

<p>We will run your code as follows:</p>

<pre>
hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
  ca.uwaterloo.cs.bigdata2017w.assignment7.BuildInvertedIndexHBase \
  -config /u0/cs489/hbase-site.xml \
  -input data/Shakespeare.txt \
  -index cs489-2017w-lintool-a7-index-shakespeare -reducers 4
</pre>

<p>The <code>-index</code> option specifies the name of the HBase
table. If it exists already, your program should drop the table and
recreate it.</p>

<p>The <code>InsertCollectionHBase</code> program will insert the
collection into HBase, where the row key is the long offset of each
line, what we've been using as the document id. The simplest
implementation is to use an identity mapper and insert into HBase in
the reducers.</p>

<p>We will run your code as follows:</p>

<pre>
hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
  ca.uwaterloo.cs.bigdata2017w.assignment7.InsertCollectionHBase \
  -config /u0/cs489/hbase-site.xml \
  -input data/Shakespeare.txt \
  -index cs489-2017w-lintool-a7-collection-shakespeare -reducers 4
</pre>

<p>The <code>BooleanRetrievalHBase</code> program is the HBase version
of <code>BooleanRetrieval</code> from the Bespin demo. This program
should read postings and documents from HBase.</p>

<p>We will run your code as follows:</p>

<pre>
hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
  ca.uwaterloo.cs.bigdata2017w.assignment7.BooleanRetrievalHBase \
  -config /u0/cs489/hbase-site.xml \
  -index cs489-2017w-lintool-a7-index-shakespeare \
  -collection cs489-2017w-lintool-a7-collection-shakespeare \
  -query "outrageous fortune AND"
</pre>

<p>Note that <code>-index</code> and <code>-collection</code> specify
HBase tables (the result of <code>BuildInvertedIndexHBase</code> and
<code>InsertCollectionHBase</code>, respectively). You should verify
that all the sample queries (from assignment 3) work. Your output
should match the output of assignment 3.</p>

<h4 style="padding-top: 10px">Search API Endpoint</h4>

<p>Finally, write a search API REST endpoint in a program
named <code>HBaseSearchEndpoint</code>, wrapping around
the <code>BooleanRetrievalHBase</code> program above. We'll start up
the server as follows:</p>

<pre>
hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
  ca.uwaterloo.cs.bigdata2017w.assignment7.HBaseSearchEndpoint \
  -config /u0/cs489/hbase-site.xml \
  -index cs489-2017w-lintool-a7-index-shakespeare \
  -collection cs489-2017w-lintool-a7-collection-shakespeare \
  -port 7890
</pre>

<p>The <code>-port</code> option specifies the port number that the
services listen on. When you are testing the server, please use random
port numbers, because otherwise everyone's service will collide.</p>

<p>The search API endpoint should behave as follows, returning
JSON.</p>

<pre>
$ curl http://localhost:7890/search?query=fair+nature+AND
[
  {"docid": 425450, "text": "  CELIA. No; when Nature hath made a fair creature, may she not by"},
  {"docid": 1553567, "text": "    Disguise fair nature with hard-favour'd rage;"},
  {"docid": 5327159, "text": "  Showing fair nature is both kind and tame;"}
]
$ curl http://localhost:7890/search?query=outrageous+fortune+AND
[
  {"docid": 1073319, "text": "    The slings and arrows of outrageous fortune"}
]
</pre>

<p>The reference solution uses Jetty, but you are welcome to use any
framework for the service, so long as the API endpoint conforms to the
behavior above.</p>

<h4 style="padding-top: 10px">Turning in the Assignment</h4>

<p>Please follow these instructions carefully!</p>

<p>Make sure your repo has the following items:</p>

<ul>

<li>If you have any notes you wish to convey to us, put it
in <code>bigdata2017w/assignment7.md</code>. Otherwise, please create
an empty file&mdash;following previous assignments, this is where your
grade with go.</li>

<li>Your implementations should go in
package <code>ca.uwaterloo.cs.bigdata2017w.assignment7</code>. At the
minimum, you should have <code>BuildInvertedIndexHBase</code>,
<code>InsertCollectionHBase</code>, <code>BooleanRetrievalHBase</code>,
and <code>HBaseSearchEndpoint</code>. Feel free to include helper
classes also.</li>

</ul>

<p>The following check script is provided for you:</p>

<ul>

<li><a href="assignments/check_assignment7_linux.py"><code>check_assignment7_linux.py</code></a></li>

</ul>

<p>Note that the check script does not check the behavior of your
search endpoint.</p>

<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we would
recommend that you verify everything above works by performing a clean
clone of your repo and run the public check scripts.</p>

<p>That's it!</p>

<h4 style="padding-top: 10px">Grading</h4>

<p>The entire assignment is worth 50 points:</p>

<ul>

  <li>The implementation of <code>BuildInvertedIndexHBase</code>,
  <code>InsertCollectionHBase</code>, <code>BooleanRetrievalHBase</code>,
  and <code>HBaseSearchEndpoint</code> are each worth 5 points.

  <li>Linux Student CS environment: Getting your code to run on sample
  queries (the same as the ones in assignment 3) is worth 10
  points. That is, to earn all 10 points, we should be able to run
  your code on the Shakespeare collection, following exactly the
  procedure above.</li>

  <li>Altiscale: Getting your code to run on sample queries (the same
  as the ones in assignment 3) is worth 10 points. That is, to earn
  all 10 points, we should be able to run your code on the Wikipedia
  collection, following exactly the procedure above.</li>

  <li>Another 10 points is allotted to us verifying the behavior and
  output of your program in ways that we will not tell you. We're
  giving you the "public" versions of the check scripts; we'll run a
  "private" version to examine your output further (i.e., think blind
  test cases).</li>

</ul>


<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>


<p style="padding-top:100px" />

    </div><!-- /.container -->


    <!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
    <!-- Include all compiled plugins (below), or include individual files as needed -->
    <script src="js/bootstrap.min.js"></script>

    <!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
    <script src="js/ie10-viewport-bug-workaround.js"></script>
  </body>

</html>