-
Notifications
You must be signed in to change notification settings - Fork 24
/
Copy pathassignment7.html
363 lines (285 loc) · 14.1 KB
/
assignment7.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Course homepage for CS 489 Big Data Infrastructure (Winter 2017) at the University of Waterloo">
<meta name="author" content="Jimmy Lin">
<title>Big Data Infrastructure</title>
<!-- Bootstrap -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li><a href="index.html">Overview</a></li>
<li><a href="organization.html">Organization</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li class="active"><a href="assignments.html">Assignments</a></li>
<li><a href="software.html">Software</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</nav>
<div class="container">
<div class="page-header">
<div style="float: right"/><img src="images/waterloo_logo.png"/></div>
<h1>Assignments <small>CS 489/698 Big Data Infrastructure (Winter 2017)</small></h1>
</div>
<div class="subnav">
<ul class="nav nav-pills">
<li><a href="assignment0.html">0</a></li>
<li><a href="assignment1.html">1</a></li>
<li><a href="assignment2.html">2</a></li>
<li><a href="assignment3.html">3</a></li>
<li><a href="assignment4.html">4</a></li>
<li><a href="assignment5.html">5</a></li>
<li><a href="assignment6.html">6</a></li>
<li><a href="assignment7.html">7</a></li>
<li><a href="project.html">Final Project</a></li>
</ul>
</div>
<section style="padding-top:0px">
<div>
<h3>Assignment 7: Inverted Indexing (Redux) <small>due 1:00pm April 3</small></h3>
<p>In this assignment you'll revisit the inverted indexing and boolean
retrieval program in <a href="assignment3.html">assignment 3</a>. In
assignment 3, your indexer program wrote postings to HDFS
in <code>MapFile</code>s and your boolean retrieval program read
postings from those <code>MapFile</code>s. In this assignment, you'll
write postings to and read postings from HBase instead. In other
words, the core program logic should not change, except for the backend
storage that you are using. This assignment is to be completed using
MapReduce in Java.</p>
<p><b>Note</b>: Due to the complexities of setting up HBase in
Altiscale, we have not been able to stand up an HBase cluster yet. For
now, work in the Linux Student CS environment (more details below). If
we manage to get a stable HBase cluster running on Altiscale, you will
complete the Altiscale portion of the assignment; otherwise, you will
not be responsible for getting code to run on Altiscale.</p>
<p>We have stood up a single-node HBase cluster in the Linux Student
CS environment. Check
out <a href="http://ubuntu1404-010.student.cs.uwaterloo.ca:16010/master-status"><code>http://ubuntu1404-010.student.cs.uwaterloo.ca:16010/master-status</code></a>. You
won't be able to play with HBase on sizeable collections, although the
assignment will still give you some experience of what developing
against HBase "feels like". For this assignment ssh
into <code>ubuntu1404-010.student.cs.uwaterloo.ca</code> and work
specifically on that host.</p>
<h4 style="padding-top: 10px">HBase Word Count</h4>
<p>To start, pull the Bespin repo and make sure you have the latest
version of the code. Take a look at <code>HBaseWordCount</code> in
package <code>io.bespin.java.mapreduce.wordcount</code>.
The <code>HBaseWordCount</code> program is like the basic word count
demo, except that it stores the output in an HBase table. That is, the
reducer output is directly written to an HBase table: the word serves
as the row key, "c" is the column family, "count" is the column
qualifier, and the value is the actual count.</p>
<p>The <code>HBaseWordCountFetch</code> program in the same package
illustrates how you can fetch these counts out of HBase and shows you
how to use the basic HBase "get" API.</p>
<p>Study these two programs to make sure you understand how they
work. The two sample program should give you a good introduction to
the HBase APIs. A free
online <a href="https://hbase.apache.org/book.html">HBase
book</a> is a good source of additional details.</p>
<p>Make sure you can run both
programs. Running <code>HBaseWordCount</code>:</p>
<pre>
hadoop jar target/bespin-0.1.0-SNAPSHOT.jar io.bespin.java.mapreduce.wordcount.HBaseWordCount \
-config /u0/cs489/hbase-site.xml \
-input data/Shakespeare.txt -table lintool-wc-shakes -reducers 5
</pre>
<p>Use the <code>-config</code> option to specify the HBase config
file: point to a version that we've prepared for you. This config file
tells the program how to connect to the HBase cluster. Use
the <code>-table</code> option to name the table you're inserting the
word counts into. The other options should be straightforward to
understand.</p>
<p><B>Note:</b> Since HBase is a shared resource, please make your
tables unique by using your username as part of the table name, per
above.</p>
<p>You should then be able to fetch the word counts from HBase:</p>
<pre>
hadoop jar target/bespin-0.1.0-SNAPSHOT.jar io.bespin.java.mapreduce.wordcount.HBaseWordCountFetch \
-config /u0/cs489/hbase-site.xml \
-table lintool-wc-shakes -term love
</pre>
<p>If everything works, you'll discover that the term "love" appears
2053 times in the Shakespeare collection.</p>
<h4 style="padding-top: 10px">HBase Storage</h4>
<p>Now it's time to write some code! Following past procedures,
everything should go into the package namespace
<code>ca.uwaterloo.cs.bigdata2017w.assignment7</code>. Note we're back
to coding in Java for this assignment.</p>
<p>You will write three programs: <code>BuildInvertedIndexHBase</code>,
<code>InsertCollectionHBase</code>,
and <code>BooleanRetrievalHBase</code>. Feel free to use code from
Bespin or assignment 3 starting points. Note that you don't need to
worry about index compression for this assignment.</p>
<p>The <code>BuildInvertedIndexHBase</code> program is the HBase
version of <code>BuildInvertedIndex</code> from the Bespin
demo. Instead of writing the index to HDFS, you will write the index
to an HBase table. Use the following table structure: the term will be
the row key. Your table will have a single column family called
"p". In the column family, each document id will be a column
qualifier. The value will be the term frequency.</p>
<p>We will run your code as follows:</p>
<pre>
hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment7.BuildInvertedIndexHBase \
-config /u0/cs489/hbase-site.xml \
-input data/Shakespeare.txt \
-index cs489-2017w-lintool-a7-index-shakespeare -reducers 4
</pre>
<p>The <code>-index</code> option specifies the name of the HBase
table. If it exists already, your program should drop the table and
recreate it.</p>
<p>The <code>InsertCollectionHBase</code> program will insert the
collection into HBase, where the row key is the long offset of each
line, what we've been using as the document id. The simplest
implementation is to use an identity mapper and insert into HBase in
the reducers.</p>
<p>We will run your code as follows:</p>
<pre>
hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment7.InsertCollectionHBase \
-config /u0/cs489/hbase-site.xml \
-input data/Shakespeare.txt \
-index cs489-2017w-lintool-a7-collection-shakespeare -reducers 4
</pre>
<p>The <code>BooleanRetrievalHBase</code> program is the HBase version
of <code>BooleanRetrieval</code> from the Bespin demo. This program
should read postings and documents from HBase.</p>
<p>We will run your code as follows:</p>
<pre>
hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment7.BooleanRetrievalHBase \
-config /u0/cs489/hbase-site.xml \
-index cs489-2017w-lintool-a7-index-shakespeare \
-collection cs489-2017w-lintool-a7-collection-shakespeare \
-query "outrageous fortune AND"
</pre>
<p>Note that <code>-index</code> and <code>-collection</code> specify
HBase tables (the result of <code>BuildInvertedIndexHBase</code> and
<code>InsertCollectionHBase</code>, respectively). You should verify
that all the sample queries (from assignment 3) work. Your output
should match the output of assignment 3.</p>
<h4 style="padding-top: 10px">Search API Endpoint</h4>
<p>Finally, write a search API REST endpoint in a program
named <code>HBaseSearchEndpoint</code>, wrapping around
the <code>BooleanRetrievalHBase</code> program above. We'll start up
the server as follows:</p>
<pre>
hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment7.HBaseSearchEndpoint \
-config /u0/cs489/hbase-site.xml \
-index cs489-2017w-lintool-a7-index-shakespeare \
-collection cs489-2017w-lintool-a7-collection-shakespeare \
-port 7890
</pre>
<p>The <code>-port</code> option specifies the port number that the
services listen on. When you are testing the server, please use random
port numbers, because otherwise everyone's service will collide.</p>
<p>The search API endpoint should behave as follows, returning
JSON.</p>
<pre>
$ curl http://localhost:7890/search?query=fair+nature+AND
[
{"docid": 425450, "text": " CELIA. No; when Nature hath made a fair creature, may she not by"},
{"docid": 1553567, "text": " Disguise fair nature with hard-favour'd rage;"},
{"docid": 5327159, "text": " Showing fair nature is both kind and tame;"}
]
$ curl http://localhost:7890/search?query=outrageous+fortune+AND
[
{"docid": 1073319, "text": " The slings and arrows of outrageous fortune"}
]
</pre>
<p>The reference solution uses Jetty, but you are welcome to use any
framework for the service, so long as the API endpoint conforms to the
behavior above.</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Please follow these instructions carefully!</p>
<p>Make sure your repo has the following items:</p>
<ul>
<li>If you have any notes you wish to convey to us, put it
in <code>bigdata2017w/assignment7.md</code>. Otherwise, please create
an empty file—following previous assignments, this is where your
grade with go.</li>
<li>Your implementations should go in
package <code>ca.uwaterloo.cs.bigdata2017w.assignment7</code>. At the
minimum, you should have <code>BuildInvertedIndexHBase</code>,
<code>InsertCollectionHBase</code>, <code>BooleanRetrievalHBase</code>,
and <code>HBaseSearchEndpoint</code>. Feel free to include helper
classes also.</li>
</ul>
<p>The following check script is provided for you:</p>
<ul>
<li><a href="assignments/check_assignment7_linux.py"><code>check_assignment7_linux.py</code></a></li>
</ul>
<p>Note that the check script does not check the behavior of your
search endpoint.</p>
<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we would
recommend that you verify everything above works by performing a clean
clone of your repo and run the public check scripts.</p>
<p>That's it!</p>
<h4 style="padding-top: 10px">Grading</h4>
<p>The entire assignment is worth 50 points:</p>
<ul>
<li>The implementation of <code>BuildInvertedIndexHBase</code>,
<code>InsertCollectionHBase</code>, <code>BooleanRetrievalHBase</code>,
and <code>HBaseSearchEndpoint</code> are each worth 5 points.
<li>Linux Student CS environment: Getting your code to run on sample
queries (the same as the ones in assignment 3) is worth 10
points. That is, to earn all 10 points, we should be able to run
your code on the Shakespeare collection, following exactly the
procedure above.</li>
<li>Altiscale: Getting your code to run on sample queries (the same
as the ones in assignment 3) is worth 10 points. That is, to earn
all 10 points, we should be able to run your code on the Wikipedia
collection, following exactly the procedure above.</li>
<li>Another 10 points is allotted to us verifying the behavior and
output of your program in ways that we will not tell you. We're
giving you the "public" versions of the check scripts; we'll run a
"private" version to examine your output further (i.e., think blind
test cases).</li>
</ul>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<p style="padding-top:100px" />
</div><!-- /.container -->
<!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>