-
Notifications
You must be signed in to change notification settings - Fork 24
/
Copy pathassignment3.html
349 lines (275 loc) · 14.1 KB
/
assignment3.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Course homepage for CS 489 Big Data Infrastructure (Winter 2017) at the University of Waterloo">
<meta name="author" content="Jimmy Lin">
<title>Big Data Infrastructure</title>
<!-- Bootstrap -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li><a href="index.html">Overview</a></li>
<li><a href="organization.html">Organization</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li class="active"><a href="assignments.html">Assignments</a></li>
<li><a href="software.html">Software</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</nav>
<div class="container">
<div class="page-header">
<div style="float: right"/><img src="images/waterloo_logo.png"/></div>
<h1>Assignments <small>CS 489/698 Big Data Infrastructure (Winter 2017)</small></h1>
</div>
<div class="subnav">
<ul class="nav nav-pills">
<li><a href="assignment0.html">0</a></li>
<li><a href="assignment1.html">1</a></li>
<li><a href="assignment2.html">2</a></li>
<li><a href="assignment3.html">3</a></li>
<li><a href="assignment4.html">4</a></li>
<li><a href="assignment5.html">5</a></li>
<li><a href="assignment6.html">6</a></li>
<li><a href="assignment7.html">7</a></li>
<li><a href="project.html">Final Project</a></li>
</ul>
</div>
<section style="padding-top:0px">
<div>
<h3>Assignment 3: Inverted Indexing <small>due 1:00pm February 2</small></h3>
<p>This assignment is to be completed in MapReduce in Java. You will
be working in the same repo as before, except that everything should
go into the package namespace
<code>ca.uwaterloo.cs.bigdata2017w.assignment3</code>.</p>
<p>Look at the inverted indexing and boolean retrieval implementation
in <a href="http://bespin.io">Bespin</a>. Make sure you understand the
code. Starting from the inverted indexing
baseline <code>BuildInvertedIndex</code>, modify the indexer code in
the following ways:</p>
<p><b>1. Index Compression.</b> The index should be compressed using
<code>VInts</code>:
see <code>org.apache.hadoop.io.WritableUtils</code>. You should also
use gap-compression (i.e., delta-compression) techniques as appropriate.</p>
<p><b>2. Buffering postings.</b> The baseline indexer implementation
currently buffers and sorts postings in the reducer, which as we
discussed in class is not a scalable solution. Address this
scalability bottleneck using techniques we discussed in class and in
the textbook.</p>
<p><b>3. Term partitioning.</b> The baseline indexer implementation
currently uses only one reducer and therefore all postings lists are
shuffled to the same node and written to HDFS in a single
partition. Change this so we can specify the number of reducers
(hence, partitions) as a command-line argument. This is, of course,
easy to do, but we need to make sure that the searcher understands
this partitioning also.</p>
<p><b>Note:</b> The major scalability issue is
buffering <i>uncompressed</i> postings in memory. In your solution,
you'll still end up buffering each postings list, but
in <i>compressed</i> form (raw bytes, no additional object
overhead). This is fine because if you use the right compression
technique, the postings lists are quite small. As a data point, on a
collection of 50 million web pages, 2GB heap is more than enough for a
full <i>positional</i> index (and in this assignment you're not asked
to store positional information in your postings).</p>
<p>To go into a bit more detail: in the reference implementation, the
final key type is <code>PairOfWritables<IntWritable,
ArrayListWritable<PairOfInts>></code>. The most obvious idea
is to change that into something
like <code>PairOfWritables<VIntWritable,
ArrayListWritable<PairOfVInts>></code>. This does not work!
The reason is that you will still be materializing each posting, i.e.,
all <code>PairOfVInts</code> objects in memory. This translates into a
Java object for every posting, which is wasteful in terms of memory
usage and will exhaust memory pretty quickly as you scale. In other
words, you're <i>still</i> buffering objects—just inside
the <code>ArrayListWritable</code>.
<p>This new indexer should be
named <code>BuildInvertedIndexCompressed</code>. This new class should
be in the
package <code>ca.uwaterloo.cs.bigdata2017w.assignment3</code>. Make
sure it works on the Shakespeare collection.</p>
<p>Modify <code>BooleanRetrieval</code> so that it works with the new
compressed indexes. Name this new
class <code>BooleanRetrievalCompressed</code>. This new class should
be in the same package as above and give the same
output as the old version.</p>
<p>Use <code>BuildInvertedIndex</code>
and <code>BooleanRetrieval</code> from Bespin as your starting
points. That is, copy over into your repo, rename, and begin your
assignment from there. Don't unnecessarily change code not directly
related to points #1-#3 above. In particular, <b>do not</b> change how
the documents are tokenized, etc. in <code>BuildInvertedIndex</code>
(otherwise there's no good way to check for the correctness of your
algorithm). Also, <b>do not</b> change the <code>fetchLine</code>
method in <code>BooleanRetrieval</code> so that everyone's output
looks the same.</p>
<p>In more detail, make sure that you can build the inverted index
with the following command (make sure your implementation runs in the
Linux student CS environment, as that is where we will be doing the
marking):</p>
<pre>
$ hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment3.BuildInvertedIndexCompressed \
-input data/Shakespeare.txt -output cs489-2017w-lintool-a3-index-shakespeare -reducers 4
</pre>
<p>We should be able to control the number of partitions (#3 above)
with the <code>-reducers</code> option. That is, the code should give
the correct results no matter what we set the value to.</p>
<p>Once we build the index, we should then be able to run a boolean
query as follows (in exactly the same manner
as <code>BooleanRetrieval</code> in Bespin</a>):</p>
<pre>
$ hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment3.BooleanRetrievalCompressed \
-index cs489-2017w-lintool-a3-index-shakespeare -collection data/Shakespeare.txt \
-query "outrageous fortune AND"
$ hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment3.BooleanRetrievalCompressed \
-index cs489-2017w-lintool-a3-index-shakespeare -collection data/Shakespeare.txt \
-query "white red OR rose AND pluck AND"
</pre>
<p>Of course, we will try your program with additional queries to
verify its correctness.</p>
<p>Answer the following question:</p>
<p><b>Question 1.</b> What is the size of your compressed
index for Shakespeare collection? Just so we're using the same units,
report the output of <code>du -h</code>.</p>
<h4 style="padding-top: 10px">Running on the Altiscale cluster</h4>
<p>Now let's try running your implementation on the Altiscale cluster,
on the sample Wikipedia
file <code>/shared/cs489/data/enwiki-20161220-sentences-0.1sample.txt</code>
on HDFS:</p>
<pre>
$ hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment3.BuildInvertedIndexCompressed \
-input /shared/cs489/data/enwiki-20161220-sentences-0.1sample.txt \
-output cs489-2017w-lintool-a3-index-wiki -reducers 4
</pre>
<p>The Wikipedia sample contains a sentence on each line, so each
"document" is actually a sentence. Each sentence begins with the
article title and the sentence id, e.g., "Anarchism.0004" is sentence
4 from the article "Anarchism".
<p>And let's try running a query:</p>
<pre>
$ hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment3.BooleanRetrievalCompressed \
-index cs489-2017w-lintool-a3-index-wiki \
-collection /shared/cs489/data/enwiki-20161220-sentences-0.1sample.txt \
-query "waterloo stanford OR cheriton AND"
$ hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
ca.uwaterloo.cs.bigdata2017w.assignment3.BooleanRetrievalCompressed \
-index cs489-2017w-lintool-a3-index-wiki \
-collection /shared/cs489/data/enwiki-20161220-sentences-0.1sample.txt \
-query "big data AND hadoop spark OR AND"
</pre>
<p>Answer the following questions:</p>
<p><b>Question 2.</b> What is the size of your compressed
index for the sample Wikipedia collection? Just so we're using the
same units, report the output of <code>hadoop fs -du -h</code>.</p>
<p><b>Question 3.</b> What are the "documents" (article + sentence)
retrieved in response to the query <code>"waterloo stanford OR
cheriton AND"</code>?</p>
<p><b>Question 4.</b> What are the "documents" (article + sentence)
retrieved in response to the query <code>"big data AND hadoop spark OR
AND"</code>?</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Please follow these instructions carefully!</p>
<p>Make sure your repo has the following items:</p>
<ul>
<li>Similar to the previous assignments, the answers to the questions go
in <code>bigdata2017w/assignment3.md</code>.</li>
<li>The implementations should be in
package <code>ca.uwaterloo.cs.bigdata2017w.assignment3</code>.</li>
</ul>
<p>Make sure your implementation runs in the Linux student CS
environment on the Shakespeare collection and also on sample Wikipedia
file <code>/shared/cs489/data/enwiki-20161220-sentences-0.1sample.txt</code>
on HDFS in the Altiscale cluster, per above.</p>
<p>Specifically, we will clone your repo and use the below check
scripts:</p>
<ul>
<li><a href="assignments/check_assignment3_public_linux.py"><code>check_assignment3_public_linux.py</code></a>
in the Linux Student CS environment.</li>
<li><a href="assignments/check_assignment3_public_altiscale.py"><code>check_assignment3_public_altiscale.py</code></a> on the Altiscale cluster.</li>
</ul>
<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we would
recommend that you verify everything above works by performing a clean
clone of your repo and run the public check scripts.</p>
<p>That's it!</p>
<h4 style="padding-top: 10px">Grading</h4>
<p>This assignment is worth a total of 50 points, broken down as
follows:</p>
<ul>
<li>Implementation correctness is worth 30 points. Note that the
questions above are not explicitly worth any points; they exist
primarily to help us gauge your implementation correctness. For
example, if your index size is larger than we expect, it's likely
you've not applied compression correctly. If your retrieved results
do not match ours, it's likely you have a bug in your retrieval
implementation.</li>
<li>Getting your code to compile and successfully run is worth
another 10 points (5 points for the Linux student CS environment and
5 points on the Altiscale cluster). We will make a minimal effort to
fix <i>trivial</i> issues with your code (e.g., a typo)—and
deduct points—but <b>will not</b> spend time debugging your
code. It is your responsibility to make sure your code runs: we have
taken care to specify exactly how we will run your code—if
anything is unclear, it is your responsibility to seek
clarification. In
order to get a perfect score of 10 for this portion of the grade, we
should be able to run the two public check
scripts: <a href="assignments/check_assignment3_public_linux.py"><code>check_assignment3_public_linux.py</code></a>
(on Linux Student CS)
an <a href="assignments/check_assignment3_public_altiscale.py"><code>check_assignment3_public_altiscale.py</code></a>
(on Altiscale cluster) successfully without any errors.</li>
<li>Another 10 points is allotted to us verifying the behavior and
output of your program in ways that we will not tell you. We're
giving you the "public" versions of the check scripts; we'll run a
"private" version to examine your output further (i.e., think blind
test cases).</li>
</ul>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<p style="padding-top:100px" />
</div><!-- /.container -->
<!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>