A Weather Dataset

(Omitted.)

Analyzing the Data with Unix tools

(Omitted.)

Analyzing the Data with Hadoop

Map and Reduce

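This section explains how to use MapReduce to find the maximum temperature for each year. The idea is that the map side preprocesses the data: it extracts the year and temperature from each record and emits (year, temperature) pairs. The framework then sorts and groups these pairs by key, so each reduce call receives a year together with the list of all temperatures recorded for it. The reduce side iterates over that list and finds the maximum temperature.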
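The flow described above can be imitated locally in a few lines of Python. This is only a sketch of the idea, not how Hadoop runs it: the "year,temperature" record format and the function names `map_record`, `shuffle`, and `reduce_max` are made up for illustration.

```python
from collections import defaultdict


def map_record(record):
    """Map step: pull (year, temperature) out of one record.

    Assumes a hypothetical "year,temperature" line, not the real
    fixed-width NCDC format.
    """
    year, temp = record.split(",")
    return year, int(temp)


def shuffle(pairs):
    """Group map output by key, sorted by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_max(year, temps):
    """Reduce step: keep the maximum temperature for one year."""
    return year, max(temps)


records = ["1950,0", "1950,22", "1950,-11", "1949,111", "1949,78"]
mapped = [map_record(r) for r in records]
grouped = shuffle(mapped)
result = [reduce_max(year, temps) for year, temps in grouped.items()]

print(grouped)  # {'1949': [111, 78], '1950': [0, 22, -11]}
print(result)   # [('1949', 111), ('1950', 22)]
```

The reduce step never sees individual records, only one key with all of its values, which is why finding the maximum is a single pass over a list.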

Java MapReduce

This section covers how to write a MapReduce program in Java.

The mapper looks like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
	extends Mapper<LongWritable, Text, Text, IntWritable> {
	private static final int MISSING = 9999;

	@Override
	public void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String line = value.toString();
		String year = line.substring(15, 19);
		int airTemperature;
		if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
			airTemperature = Integer.parseInt(line.substring(88, 92));
		} else {
			airTemperature = Integer.parseInt(line.substring(87, 92));
		}
		String quality = line.substring(92, 93);
		// Emit only valid readings: not the missing-value sentinel, and an accepted quality code
		if (airTemperature != MISSING && quality.matches("[01459]")) {
			context.write(new Text(year), new IntWritable(airTemperature));
		}
	}
}

The reducer looks like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
	extends Reducer<Text, IntWritable, Text, IntWritable> {
	@Override
	public void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		int maxValue = Integer.MIN_VALUE;
		for (IntWritable value : values) {
			maxValue = Math.max(maxValue, value.get());
		}
		context.write(key, new IntWritable(maxValue));
	}
}

These are fairly basic concepts.

The main function contains quite a few settings; the documentation for Job (JobConf in the old API) explains them clearly.

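Typically you need to set:

job.setJarByClass // Hadoop locates the job JAR by finding the one that contains this class
job.setJobName // gives the MR job a name
FileInputFormat.addInputPath // the input path: a single file, a directory, or a file pattern (glob)
FileOutputFormat.setOutputPath // the output directory; it must not exist before the job runs
job.setOutputKeyClass/setOutputValueClass/setMapOutputKeyClass/setMapOutputValueClass // the output types of the MR job; the setMapOutput* variants are only needed when the map output types differ from the reduce output types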

Scaling Out

Data Flow

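An MR job consists of the input data, the MR program, and configuration information.
Hadoop divides the input into fixed-size pieces called input splits. By default, MR creates one map task for each split.
A split is normally the same size as an HDFS block, 128 MB by default. Splits that are too large cause load-balancing problems, while splits that are too small add a lot of overhead. Matching the block size also helps with task placement.
Task placement uses a data locality optimization. Because HDFS is distributed storage, the input is distributed as well. A map task is preferably scheduled on a node that holds its input, then on a node in the same rack, and only then on a node in another rack (Figure 2-2).
Map output is written to the local disk rather than to HDFS, because it is only intermediate output.
Reduce tasks have no data locality, because every map task produces input for every reduce task. When there are multiple reducers, the map tasks partition their output. The rule that maps keys to partitions can be customized, but all records for a given key always go to the same reducer.
The transfer of data between map and reduce tasks is the shuffle; see "Shuffle and Sort" on page 197.
There can be zero reduce tasks if you do not need to shuffle your data.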
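To make the partitioning rule concrete, here is a small sketch. Hadoop's default partitioner hashes the key and takes the result modulo the number of reduce tasks; the code below imitates that idea in Python. The `partition` function and the use of `zlib.crc32` as the hash are assumptions for illustration, not Hadoop's implementation.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always gets the same index.

    zlib.crc32 stands in for the key's hash function (an assumption,
    chosen because it is stable across runs).
    """
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


map_output = [("1950", 0), ("1949", 111), ("1950", 22), ("1949", 78), ("1951", 5)]
num_reducers = 2

# One bucket per reduce task, filled from a single map task's output.
buckets = {i: [] for i in range(num_reducers)}
for key, value in map_output:
    buckets[partition(key, num_reducers)].append((key, value))

for index, pairs in buckets.items():
    print(index, pairs)
```

Whatever the hash happens to produce, every record for a given year lands in one bucket, which is what allows a single reducer to compute that year's maximum.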

Finally, it’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle because the processing can be carried out entirely in parallel (a few examples are discussed in “NLineInputFormat” on page 234). In this case, the only off-node data transfer is when the map tasks write to HDFS (see Figure 2-5). (p. 34) I honestly did not understand this paragraph.

Combiner Functions

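A major cost in MR is moving data from the map tasks to the reduce tasks, so a combiner can be run on each map task's output to cut down the amount of data transferred. MR does not guarantee how many times the combiner is called (possibly zero), so the final result must be the same no matter how often it runs.
The combiner is defined using the Reducer class and is set with job.setCombinerClass.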
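Why the number of combiner calls must not matter can be checked with a few numbers. The sketch below uses plain Python lists to stand for two map tasks' outputs for the same year; `combine_max`, `reduce_max`, and `mean` are hypothetical helper names.

```python
def combine_max(temps):
    """Combiner for max: collapse one map task's values into a single value."""
    return [max(temps)]


def reduce_max(temps):
    """Reducer for max."""
    return max(temps)


def mean(values):
    return sum(values) / len(values)


map1 = [0, 20, 10]  # 1950 readings seen by the first map task
map2 = [25, 15]     # 1950 readings seen by the second map task

without_combiner = reduce_max(map1 + map2)
with_combiner = reduce_max(combine_max(map1) + combine_max(map2))
# Combiner applied twice on one map task and not at all on the other.
uneven = reduce_max(combine_max(combine_max(map1)) + map2)

print(without_combiner, with_combiner, uneven)  # 25 25 25

# Mean is not safe to use as its own combiner.
mean_direct = mean(map1 + map2)
mean_of_means = mean([mean(map1), mean(map2)])
print(mean_direct, mean_of_means)  # 14.0 15.0
```

Max gives the same answer in all three cases, so the reducer can double as the combiner. Mean does not, so a job computing a mean cannot simply reuse its reduce function as the combiner.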

Hadoop Streaming

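Map and reduce tasks can also be written in other languages. The book gives Ruby and Python examples; both read directly from standard input, with the key and value separated by a tab, and produce output by simply printing.
For example, in Python: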

# map.py
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
	val = line.strip()
	(year, temp, q) = (val[15:19], val[87:92], val[92:93])
	# Skip missing readings and records with a bad quality code
	if temp != "+9999" and re.match("[01459]", q):
		print("%s\t%s" % (year, temp))

# reduce.py
#!/usr/bin/env python
import sys

(last_key, max_val) = (None, -sys.maxsize)
for line in sys.stdin:
	(key, val) = line.strip().split("\t")
	if last_key and last_key != key:
		# Key changed: emit the maximum for the previous key
		print("%s\t%s" % (last_key, max_val))
		(last_key, max_val) = (key, int(val))
	else:
		(last_key, max_val) = (key, max(max_val, int(val)))

# Emit the maximum for the final key
if last_key:
	print("%s\t%s" % (last_key, max_val))
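The reducer above depends on its input arriving sorted by key, which Streaming guarantees. As an alternative sketch of the same logic, the grouping can be left to `itertools.groupby`, which also makes the reducer easy to test locally without Hadoop. The names `parse` and `max_per_key` and the sample lines are made up for illustration.

```python
from itertools import groupby


def parse(lines):
    """Turn tab-separated "key<TAB>value" lines into (key, int) pairs."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        yield key, int(value)


def max_per_key(lines):
    """Yield (key, max value) for lines that are already sorted by key."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield key, max(value for _, value in group)


# Stand-in for mapper output; sorted() plays the role of the shuffle's sort.
mapper_output = ["1950\t+0022", "1949\t+0111", "1950\t-0011", "1949\t+0078"]
results = list(max_per_key(sorted(mapper_output)))

for key, value in results:
    print("%s\t%s" % (key, value))
```

In a real Streaming script, `sys.stdin` would be passed to `max_per_key` in place of the sorted sample list.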

The launch command is:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper map.py \
-combiner reduce.py \
-reducer reduce.py

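This approach does make development easier when performance is not a concern. It probably costs some performance, though I have not measured it.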

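This article is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Reposting is welcome, but please credit http://thousandhu.github.io and keep the article content intact. The author reserves all copyright-related rights.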

Permalink: http://thousandhu.github.io/2015/10/26/Hadoop-The-Definitive-Guide-4th读书笔记-chapter2/