Level: Senior undergraduate and graduate (taught in Chinese)
Credits: 3
Course objectives:
1. Become familiar with data analytics platforms, software stacks, and tools for cloud computing and big data processing.
2. Write MapReduce code to solve data processing problems.
3. Read papers on state-of-the-art data analytics platforms and their optimization techniques.

Prerequisites: None

Course outline:
1. Introduction to Cloud Computing & Big Data

2. Apache Hadoop Data Analytics Software Stack
2-1: File System: HDFS, GFS (Google File System)
2-2: Parallel Processing: MapReduce, Hadoop
2-3: NoSQL Database: Bigtable, HBase, Hive
2-4: Resource Management: Hadoop YARN

3. MapReduce Basic Implementation, Algorithms & Applications
3-1: Inverted Index
3-2: PageRank
3-3: Graph Algorithms
3-4: Hadoop Programming
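
As a rough illustration of topic 3-1, the sketch below simulates the map, shuffle, and reduce phases of an inverted index in plain Python. It is not Hadoop code and not the course's reference solution; the function names (map_phase, shuffle, reduce_phase, build_inverted_index) and the two sample documents are hypothetical, chosen only for this example. On a real cluster the framework would perform the shuffle step.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit one (term, doc_id) pair per distinct term in a document."""
    for term in sorted(set(text.lower().split())):
        yield term, doc_id


def shuffle(pairs):
    """Shuffle: group mapper output by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for term, doc_id in pairs:
        groups[term].append(doc_id)
    return groups


def reduce_phase(term, doc_ids):
    """Reducer: produce the sorted, de-duplicated postings list for one term."""
    return term, sorted(set(doc_ids))


def build_inverted_index(docs):
    """Run the three phases in sequence over a dict of {doc_id: text}."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


# Hypothetical sample input, for illustration only.
docs = {
    "d1": "hadoop stores data in hdfs",
    "d2": "spark reads data from hdfs",
}
index = build_inverted_index(docs)
```

Here index maps each term to the list of documents that contain it, which is the lookup structure a search engine needs to answer keyword queries.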

4. MapReduce Advanced Scheduling & Optimization
4-1: Locality-Aware Delay Scheduling
4-2: Heterogeneous Environment Scheduling
4-3: Load Balancing

5. Query Processing Optimization
5-1: HadoopDB
5-2: Query Optimization

6. BDAS (Berkeley Data Analytics Stack) In-Memory Computing
6-1: Mesos
6-2: Spark
6-3: Shark

Required textbook: Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer. Morgan & Claypool Publishers, 2010.
Reference books:
A. 雲端程式設計入門與應用實務 (Introduction to Cloud Programming and Practical Applications)
B. Hadoop: The Definitive Guide, O'Reilly, 2009
C. Paper reading:
1. The Google File System. SOSP 2003, pages 29-43.
2. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004, pages 137-150.
3. MapReduce and Parallel DBMSs: Friends or Foes? Communications of the ACM, 53(1):64-71, 2010.
4. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006, pages 205-218.
5. Apache Hadoop YARN: Yet Another Resource Negotiator. SoCC 2013.
6. Hive – A Petabyte Scale Data Warehouse Using Hadoop. ICDE 2010.

7. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. EuroSys 2010.
8. Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters. EuroSys 2011.
9. Interference and Locality-Aware Task Scheduling for MapReduce Applications in Virtual Clusters. HPDC 2013.
11. Balancing Reducer Skew in MapReduce Workloads using Progressive Sampling. SoCC 2012.
12. Coupling Task Progress for MapReduce Resource-Aware Scheduling. INFOCOM 2013.

13. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB 2009.
14. SciHadoop: Array-Based Query Processing in Hadoop. SC 2011.
15. Spark: Cluster Computing with Working Sets. HotCloud 2009.
16. Shark: SQL and Rich Analytics at Scale. SIGMOD 2013.
17. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011.
18. HaLoop: Efficient Iterative Data Processing on Large Clusters. VLDB 2010.
19. The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM. SIGOPS 2010.

Teaching method:
Lectures with handouts and hands-on computer labs

Schedule:
One lecture or lab per week

Grading:
Lab: 30%
Lab 1: Hadoop (HDFS & HBase)
Lab 2: MapReduce
Lab 3: Spark

Programming HW: 30%
HW1: Inverted Index
HW2: PageRank

Final Project: 25%
Build a search engine using open-source software and the algorithms learned in class.

Course Participation & Quiz: 15%