-
Notifications
You must be signed in to change notification settings - Fork 24
/
Copy pathassignment5.html
741 lines (583 loc) · 27.4 KB
/
assignment5.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Course homepage for CS 489 Big Data Infrastructure (Winter 2017) at the University of Waterloo">
<meta name="author" content="Jimmy Lin">
<title>Big Data Infrastructure</title>
<!-- Bootstrap -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li><a href="index.html">Overview</a></li>
<li><a href="organization.html">Organization</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li class="active"><a href="assignments.html">Assignments</a></li>
<li><a href="software.html">Software</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</nav>
<div class="container">
<div class="page-header">
<div style="float: right"/><img src="images/waterloo_logo.png"/></div>
<h1>Assignments <small>CS 489/698 Big Data Infrastructure (Winter 2017)</small></h1>
</div>
<div class="subnav">
<ul class="nav nav-pills">
<li><a href="assignment0.html">0</a></li>
<li><a href="assignment1.html">1</a></li>
<li><a href="assignment2.html">2</a></li>
<li><a href="assignment3.html">3</a></li>
<li><a href="assignment4.html">4</a></li>
<li><a href="assignment5.html">5</a></li>
<li><a href="assignment6.html">6</a></li>
<li><a href="assignment7.html">7</a></li>
<li><a href="project.html">Final Project</a></li>
</ul>
</div>
<section style="padding-top:0px">
<div>
<h3>Assignment 5: Data Warehousing <small>due 1:00pm March 2</small></h3>
<p>In this assignment you'll be hand-crafting Spark programs that
implement SQL queries in a data warehousing scenario. Various
SQL-on-Hadoop solutions share in providing an SQL query interface on
data stored in HDFS via an intermediate execution framework. For
example, Hive queries are "compiled" into MapReduce jobs; SparkSQL
queries rely on Spark processing primitives for query execution. In
this assignment, you'll play the role of mediating between SQL queries
and the underlying execution framework (Spark). In more detail, you'll
be given a series of SQL queries, and for each you'll have to
hand-craft a Spark program that corresponds to each query.</p>
<p><b>Important:</b> You are not allowed to use the Dataframe API or
Spark SQL to complete this assignment (with the exception of loading Parquet files, see "hints" below).
You must write code to manipulate raw RDDs.
Furthermore, you are not allowed to use <code>join</code> and
closely-related transformations in Spark for this assignment, because
otherwise it defeats the point of the exercise. The assignment will
guide you toward what we are looking for, but if you have any
questions as to what is allowed or not, ask!</p>
<p>We will be working with data from the TPC-H benchmark in this
assignment. The Transaction Processing Performance Council (TPC) is a
non-profit organization that defines various database benchmarks so
that database vendors can evaluate the performance of their products
fairly. TPC defines the "rules of the game", so to speak. TPC defines
various benchmarks, of which one is TPC-H, for evaluating ad-hoc
decision support systems in a data warehousing scenario. The current
version of the TPC-H benchmark is located
<a href="assignments/tpc-h_v2.17.1.pdf">here</a>. You'll want to skim
through this (long) document; the most important part is the
entity-relationship diagram of the data warehouse on page
13. Throughout the assignment you'll likely be referring to it, as it
will help you make sense of the queries you are running.</p>
<p>The TPC-H benchmark comes with a data generator, and we have
generated some data for you:
<ul>
<li><a href="assignments/TPC-H-0.1-TXT.tar.gz"><code>TPC-H-0.1-TXT.tar.gz</code></a>:
the plain-text version of the data</li>
<li><a href="assignments/TPC-H-0.1-PARQUET.tar.gz"><code>TPC-H-0.1-PARQUET.tar.gz</code></a>:
the Parquet version of the data</li>
</ul>
<p>Download and unpack the datasets above. In the first part of the
assignment where you will be working with Spark locally, you will run
your queries against both datasets.</p>
<p>For the plain-text version of the data, you will see a number of
text files, each corresponding to a table in the TPC-H schema. The
files are delimited by <code>|</code>. You'll notice that some of the
fields, especially the text fields, are gibberish—that's normal,
since the contents are randomly generated.</p>
<p>The Parquet data has the same content, but is encoded in
Parquet.</p>
<p>Implement the following seven SQL queries, running on
the <code>TPC-H-0.1-TXT</code> plain-text data as well as
the <code>TPC-H-0.1-PARQUET</code> Parquet data. For each query you
will write a program that takes the option <code>--text</code> to work
with the plain-text data and <code>--parquet</code>to work with the
Parquet data. More details will be provided below.</p>
<p>Each SQL query is accompanied by a written description of what the
query does: if there are any ambiguities in the language, you can
always assume that the SQL query is correct. Your implementation of each query will
be a separate Spark program. Put your code in the package
<code>ca.uwaterloo.cs.bigdata2017w.assignment5</code>, in the
same repo that you've been working in all semester. Since you'll be
writing Scala code, your source files should go into
<code>src/main/scala/ca/uwaterloo/cs/bigdata2017w/assignment5/</code>.</p>
<p><b>Q1:</b> How many items were shipped on a particular date? This corresponds to the following SQL query:
<pre>
select count(*) from lineitem where l_shipdate = 'YYYY-MM-DD';
</pre>
<p>Write a program such that when we execute the following
command (on the plain-text data):</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q1 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>
<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed), in a line that matches
the following regular expression:</p>
<pre>
ANSWER=\d+
</pre>
<p>The output of the query can contain logging and debug information,
but there must be a line with the answer in <b>exactly</b> the above
format.</p>
<p>And the Parquet version:</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q1 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>
<p>In both cases, the value of the
<code>--input</code> argument is the directory that contains the
data (either in plain text or in Parquet). The value of the <code>--date</code> argument
corresponds to the <code>l_shipdate</code> predicate in the where
clause in the SQL query. You need to anticipate dates of the
form <code>YYYY-MM-DD</code>, <code>YYYY-MM</code>, or
just <code>YYYY</code>, and your query needs to give the correct
answer depending on the date format. You can assume that a valid date
(in one of the above formats) is provided, so you do not need to
perform input validation.</p>
<p><b>Q2:</b> Which clerks were responsible for processing items that
were shipped on a particular date? List the first 20 by order
key. This corresponds to the following SQL query:</p>
<pre>
select o_clerk, o_orderkey from lineitem, orders
where
l_orderkey = o_orderkey and
l_shipdate = 'YYYY-MM-DD'
order by o_orderkey asc limit 20;
</pre>
<p>Write a program such that when we execute the following
command (on the plain-text data):</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q2 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>
<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed), as a sequence of tuples
in the following format:</p>
<pre>
(o_clerk,o_orderkey)
(o_clerk,o_orderkey)
...
</pre>
<p>That is, each tuple is comma-delimited and surrounded by
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>
<p>And the Parquet version:</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q2 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>
<p>In the design of this data warehouse, the <code>lineitem</code>
and <code>orders</code> tables are not likely to fit in
memory. Therefore, the only scalable join approach is the reduce-side
join. You must implement this join in Spark using
the <code>cogroup</code> transformation.</p>
<p><b>Q3:</b> What are the names of parts and suppliers of items
shipped on a particular date? List the first 20 by order key. This
corresponds to the following SQL query:</p>
<pre>
select l_orderkey, p_name, s_name from lineitem, part, supplier
where
l_partkey = p_partkey and
l_suppkey = s_suppkey and
l_shipdate = 'YYYY-MM-DD'
order by l_orderkey asc limit 20;
</pre>
<p>Write a program such that when we execute the following
command (on the plain-text data):</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q3 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>
<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed), as a sequence of tuples
in the following format:</p>
<pre>
(l_orderkey,p_name,s_name)
(l_orderkey,p_name,s_name)
...
</pre>
<p>That is, each tuple is comma-delimited and surrounded by
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>
<p>And the Parquet version:</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q3 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>
<p>In the design of this data warehouse, it is assumed that
the <code>part</code> and <code>supplier</code> tables will fit in
memory. Therefore, it is possible to implement a hash join. For this
query, you must implement a hash join in Spark with broadcast
variables.</p>
<p><b>Q4:</b> How many items were shipped to each country on a
particular date? This corresponds to the following SQL query:</p>
<pre>
select n_nationkey, n_name, count(*) from lineitem, orders, customer, nation
where
l_orderkey = o_orderkey and
o_custkey = c_custkey and
c_nationkey = n_nationkey and
l_shipdate = 'YYYY-MM-DD'
group by n_nationkey, n_name
order by n_nationkey asc;
</pre>
<p>Write a program such that when we execute the following
command (on the plain-text data):</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q4 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>
<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed). Format
the output in the same manner as with the above queries: one tuple per
line, where each tuple is comma-delimited and surrounded by
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>
<p>And the Parquet version:</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q4 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>
<p>Implement this query with different join techniques as you see
fit. You can assume that the <code>lineitem</code>
and <code>orders</code> table will not fit in memory, but you can
assume that the <code>customer</code> and <code>nation</code> tables
will both fit in memory. For this query, the performance as well as
the scalability of your implementation will contribute to the
grade.</p>
<p><b>Q5:</b> This query represents a very simple end-to-end ad hoc
analysis task: Related to <b>Q4</b>, your boss has asked you to
compare shipments to Canada vs. the United States by month, given all
the data in the data warehouse. You think this request is best
fulfilled by a line graph, with two lines (one representing the US and
one representing Canada), where the <i>x</i>-axis is the year/month and
the <i>y</i> axis is the volume, i.e., <code>count(*)</code>.
Generate this graph for your boss.</p>
<p>First, write a program such that when we execute the following
command (on plain text):</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q5 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --text
</pre>
<p>the raw data necessary for the graph will be printed to stdout (on the
console where the above command is executed).
Format the output in the same manner as with the above queries: one
tuple per line, where each tuple is comma-delimited and surrounded by
parentheses.</p>
<p>And the Parquet version:</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q5 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --parquet
</pre>
<p>Next, create this actual graph: use whatever tool you are
comfortable with, e.g., Excel, gnuplot, etc.</p>
<p><b>Q6:</b> This is a slightly modified version of TPC-H Q1 "Pricing
Summary Report Query". This query reports the amount of business that
was billed, shipped, and returned:</p>
<pre>
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where
l_shipdate = 'YYYY-MM-DD'
group by l_returnflag, l_linestatus;
</pre>
<p>Write a program such that when we execute the following
command (on the plain-text data):</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q6 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>
<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed). Format
the output in the same manner as with the above queries: one tuple per
line, where each tuple is comma-delimited and surrounded by
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>
<p>And the Parquet version:</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q6 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>
<p>Implement this query as efficiently as you can, using all of the
optimizations we discussed in lecture. You will only get full points
for this question if you exploit all the optimization opportunities
that are available.</p>
<p><b>Q7:</b> This is a slightly modified version of TPC-H Q3
"Shipping Priority Query". This query retrieves the 10 unshipped
orders with the highest value:</p>
<pre>
select
c_name,
l_orderkey,
sum(l_extendedprice*(1-l_discount)) as revenue,
o_orderdate,
o_shippriority
from customer, orders, lineitem
where
c_custkey = o_custkey and
l_orderkey = o_orderkey and
o_orderdate < "YYYY-MM-DD" and
l_shipdate > "YYYY-MM-DD"
group by
c_name,
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc
limit 10;
</pre>
<p>Write a program such that when we execute the following
command (on the plain-text data):</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q7 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01' --text
</pre>
<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed).
Format the output in the same manner as with the above queries: one
tuple per line, where each tuple is comma-delimited and surrounded by
parentheses. Here you can assume that the date argument is only in the
format <code>YYYY-MM-DD</code> and that it is a valid date.</p>
<p>And the Parquet version:</p>
<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q7 \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-PARQUET --date '1996-01-01' --parquet
</pre>
<p>Implement this query as efficiently as you can, using all of the
optimizations we discussed in lecture. Plan joins as you see fit,
keeping in mind above assumptions on what will and will not fit in
memory. You will only get full points for this question if you exploit
all the optimization opportunities that are available.</p>
<h4 style="padding-top: 10px">Scaling up on Altiscale</h4>
<p>Once you get your implementation working and debugged in the Linux
environment, run your code on a larger TCP-H dataset, located on HDFS:
<ul>
<li><code>/shared/cs489/data/TPC-H-10-TXT</code> for the plain-text data</li>
<li><code>/shared/cs489/data/TPC-H-10-PARQUET</code> for the Parquet data</li>
</ul>
<p>Make sure that all seven queries above run correctly on this larger
dataset. For your reference, the the command-line parameters
for Q1-Q7 are provided below (so you can copy and paste).</p>
<p><b>Important:</b> Note that for this assignment we are using a
custom Spark submit script <code>a5-spark-submit</code> (which should
already be on your path). The difference in this submit script is that
we set <code>--deploy-mode client</code>. This will force the driver
to run on the client (i.e., workspace), so that you will see the
output of <code>println</code>. Otherwise, the driver will run on an
arbitrary cluster node, making stdout not directly visible. Also, we
redirect stdout to a file so that it'll be easier to view the output.</p>
<p><b>Q1:</b></p>
<pre>
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q1 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-TXT \
--date '1996-01-01' --text > q1t.out
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q1 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-PARQUET \
--date '1996-01-01' --parquet > q1p.out
</pre>
<p><b>Q2:</b></p>
<pre>
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q2 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-TXT \
--date '1996-01-01' --text > q2t.out
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q2 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-PARQUET \
--date '1996-01-01' --parquet > q2p.out
</pre>
<p><b>Q3:</b></p>
<pre>
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q3 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-TXT \
--date '1996-01-01' --text > q3t.out
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q3 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-PARQUET \
--date '1996-01-01' --parquet > q3p.out
</pre>
<p><b>Q4:</b></p>
<pre>
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q4 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-TXT \
--date '1996-01-01' --text > q4t.out
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q4 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-PARQUET \
--date '1996-01-01' --parquet > q4p.out
</pre>
<p><b>Q5:</b></p>
<pre>
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q5 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-TXT --text > q5t.out
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q5 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-PARQUET --parquet > q5p.out
</pre>
<p><b>Q6:</b></p>
<pre>
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q6 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-TXT \
--date '1996-01-01' --text > q6t.out
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q6 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-PARQUET \
--date '1996-01-01' --parquet > q6p.out
</pre>
<p><b>Q7:</b></p>
<pre>
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q7 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-TXT \
--date '1996-01-01' --text > q7t.out
a5-spark-submit --class ca.uwaterloo.cs.bigdata2017w.assignment5.Q7 \
--num-executors 5 --executor-cores 2 --executor-memory 4G --driver-memory 2g \
target/bigdata2017w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-PARQUET \
--date '1996-01-01' --parquet > q7p.out
</pre>
<p>In this configuration, your programs shouldn't take more than a
couple of minutes. If it's taking more than five minutes, you're
probably doing something wrong.</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Please follow these instructions carefully!</p>
<p>Make sure your repo has the following items:</p>
<ul>
<li>Optional: put anything that you want to convey to us about your
implementation in <code>bigdata2017w/assignment5.md</code>.</li>
<li>Two files,
named <code>bigdata2017w/assignment5-Q5-small.pdf</code>
and <code>bigdata2017w/assignment5-Q5-large.pdf</code> that contains
the graphs for Q5 on the <code>TPC-H-0.1-TXT</code>
and <code>TPC-H-10-TXT</code> datasets, respectively. If you cannot
easily generate PDFs, the files should be some easily-viewable format,
e.g., <code>png</code>, <code>gif</code>, etc.</li>
<li>Your implementations for the queries should be in
package <code>ca.uwaterloo.cs.bigdata2017w.assignment5</code>. There
should be at the minimum seven classes (Q1-Q7), but you may include
helper classes as you see fit.</li>
</ul>
<p>Make sure your implementation runs in the Linux student CS
environment on <code>TPC-H-0.1-TXT</code>, and on the Alitscale
cluster on the <code>TPC-H-10-TXT</code> data.</p>
<p>Specifically, we will clone your repo and use the below check
scripts:</p>
<ul>
<li><a href="assignments/check_assignment5_public_linux.py"><code>check_assignment5_public_linux.py</code></a>
in the Linux Student CS environment.</li>
<li><a href="assignments/check_assignment5_public_altiscale.py"><code>check_assignment5_public_altiscale.py</code></a>
on the Altiscale cluster.</li>
</ul>
<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we would
recommend that you verify everything above works by performing a clean
clone of your repo and run the public check scripts.</p>
<p>That's it!</p>
<h4 style="padding-top: 10px">Hints</h4>
<p>The easiest way to read Parquet is to use the Dataframe API, as
follows:</p>
<pre>
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder.getOrCreate
val lineitemDF = sparkSession.read.parquet("TPC-H-0.1-PARQUET/lineitem")
val lineitemRDD = lineitemDF.rdd
</pre>
<p>Once you read in the table as as dataframe, convert it into an RDD
and work from there.</p>
<p>To use the above features, you'll need to add the following
dependency to your <code>pom.xml</code>:</p>
<pre>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</pre>
<p>You are not allowed to use Spark SQL in your implementations for
this assignment, but I have no problem if you use Spark SQL
to <i>check</i> your answers.</p>
<h4 style="padding-top: 10px">Grading</h4>
<p>The entire assignment is worth 100 points:</p>
<ul>
<li>Getting your code to compile is worth 10 points (by now, these
should be "free" points).</li>
<li>For Q1-Q3, each query is worth 10 points: 5 points for a correct
implementation that works in the Linux Student CS environment (both
on plain text and Parquet), 5 points for a correct implementation
that works on the Altiscale cluster (both on plain text and
Parquet).</li>
<li>Q4 and Q5 are each worth 14 points: 7 points for a correct
implementation that works in the Linux Student CS environment (both
on plain text and Parquet), 7 points for a correct implementation
that works on the Altiscale cluster (both on plain text and
Parquet).</li>
<li>Q6 and Q7 are each worth 16 points: 8 points for a correct
implementation that works in the Linux Student CS environment (both
on plain text and Parquet), 8 points for a correct implementation
that works on the Altiscale cluster (both on plain text and
Parquet).</li>
</ul>
<p>A working implementation means that your code gives the right
output according to our private check scripts, which will
contain <code>--date</code> parameters that are unknown to you (but
will nevertheless conform to our specifications above).</p>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<p style="padding-top:100px" />
</div><!-- /.container -->
<!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>