Hadoop: The Definitive Guide, 4th Edition — Reading Notes: Chapter 17, Hive

Running Hive

HiveQL

Data Types:
Complex data structures are supported: array, map, struct, and union.
For example:

CREATE TABLE complex (
  c1 ARRAY<INT>,
  c2 MAP<STRING, INT>,
  c3 STRUCT<a:STRING, b:INT, c:DOUBLE>,
  c4 UNIONTYPE<STRING, INT>
);

hive> SELECT c1[0], c2['b'], c3.c, c4 FROM complex;
1    2    1.0    {1:63}

Tables

Managed Tables and External Tables

A regular (managed) table keeps its data in Hive's warehouse directory.
An external table does not move the data into the warehouse; its LOCATION points at the data instead.

CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
-- the file is moved from hdfs://user/tom/data.txt into the warehouse
-- directory; dropping the table deletes the data as well

CREATE EXTERNAL TABLE external_table (dummy STRING)
  LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
-- the file is moved to the external table's LOCATION
-- DROP TABLE only removes the metadata and leaves the data in place
-- Hive does not check the data or location when an external table is
-- created, so the table can be defined before the data is produced
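
A quick way to check which kind an existing table is (a small sketch; DESCRIBE FORMATTED includes a Table Type row in its output):

DESCRIBE FORMATTED external_table;
-- look for the "Table Type:" row in the output:
-- EXTERNAL_TABLE here, versus MANAGED_TABLE for managed_table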

Partitions and Buckets

Partitioning and bucketing a table on selected columns lets Hive skip irrelevant data and answer queries faster.

Partitions:

  • A partition can be defined on a column with many distinct values, and Hive still handles queries spanning many partitions quickly (though it still seems wiser to partition on columns with a limited set of values, such as a date).

  • CREATE TABLE logs (ts BIGINT, line STRING)
      PARTITIONED BY (dt STRING, country STRING);
    LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
      INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
    -- partitions can also be added or dropped with ALTER TABLE
    -- (see the sketch after this list)
  • Internally, Hive stores a partitioned table as a tree of subdirectories, one level per partition column.
  • The partition columns are not stored in the data files themselves; their values are recorded in the directory names.
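
A short sketch of the ALTER TABLE route and of querying by partition columns, reusing the logs table above (the second date value is illustrative):

-- add a partition explicitly instead of via LOAD DATA
ALTER TABLE logs ADD PARTITION (dt='2001-01-02', country='GB');

-- list the partitions Hive has recorded (i.e. the directory names)
SHOW PARTITIONS logs;

-- partition columns can be used like ordinary columns; Hive prunes the
-- scan to the matching subdirectories
SELECT ts, dt, line FROM logs WHERE dt='2001-01-01' AND country='GB';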

Buckets:

Bucketing hashes the rows of a table on a chosen column and stores them in a fixed number of smaller files. It has two benefits:

  • Faster queries: for example, a join between two tables bucketed on the join key can be carried out bucket by bucket.
  • Faster sampling (see the sketch after the code below).
CREATE TABLE bucketed_users (id INT, name STRING)
  CLUSTERED BY (id) INTO 4 BUCKETS;

-- the rows within each bucket can additionally be kept sorted
CREATE TABLE bucketed_users (id INT, name STRING)
  CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
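
A sketch of filling the bucketed table and sampling one bucket; users is assumed to be an existing, unbucketed table with the same columns:

-- make Hive route inserted rows into the declared buckets
-- (needed on older releases; newer versions enforce bucketing automatically)
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;

-- sample a single bucket; because sampling hashes id the same way the
-- table was bucketed, only one bucket's file has to be read
SELECT * FROM bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);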

Querying Data

MapReduce Scripts

This is essentially a lightweight kind of UDF, implemented as a streaming script:

#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    (year, temp, q) = line.strip().split()
    if (temp != "9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)
hive> ADD FILE /Users/tom/book-workspace/hadoop-book/ch17-hive/src/main/python/is_good_quality.py;
hive> FROM records2
    > SELECT TRANSFORM(year, temperature, quality)
    > USING 'is_good_quality.py'
    > AS year, temperature;

Going a step further, a nested form can run a script on the reduce side as well:

FROM (
  FROM records2
  MAP year, temperature, quality
  USING 'is_good_quality.py'
  AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;

Views

Views are created with CREATE VIEW. No data is materialized; the defining query is stored in the metastore and expanded into later queries that use the view. (Using a view is like using a subquery, and it can be inspected with DESCRIBE.)
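
A minimal sketch, reusing the records2 table from the TRANSFORM example above (the view name follows the book's running example; the follow-up queries are illustrative):

CREATE VIEW valid_records AS
SELECT * FROM records2
WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9);

-- the view is queried like a table and expanded into the enclosing query
SELECT year, MAX(temperature) FROM valid_records GROUP BY year;

-- and it can be inspected with DESCRIBE
DESCRIBE EXTENDED valid_records;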

User Defined Functions

UDF

UDAF

UDTF

Write a UDF

package com.hadoopbook.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  private Text result = new Text();

  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }

  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}

A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF and must implement at least one evaluate() method.

Using a UDF takes two steps:

  1. Package the UDF into a jar: mvn package
  2. Register the UDF in Hive:
CREATE FUNCTION strip AS 'com.hadoopbook.hive.Strip'
  USING JAR '/path/to/hive-examples.jar';
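
Once registered, the function is called like a built-in; a session-only registration is also possible (a sketch, assuming dummy is a one-row helper table as in the book's examples):

-- call the UDF like any built-in function
SELECT strip('  bee  ') FROM dummy;

-- alternatively, register the UDF only for the current session
ADD JAR /path/to/hive-examples.jar;
CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';

-- and remove it again when no longer needed
DROP TEMPORARY FUNCTION strip;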

Annotating a UDF

First off, your UDF will need to inherit from the GenericUDF class. Your UDF class can then carry several annotations:

@UDFType(deterministic = true/false) will state whether your UDF is deterministic or not. This is set to true by default. Deterministic functions are those that always return the same result whenever they are called with the same set of arguments. For example avg() is a deterministic function because the result will always be the same for the same input. However, unix_timestamp() (without any arguments) is not deterministic because it returns the current time as a unix timestamp using the default timezone. Consequently, the result depends on when this function is called.

@Description(name="my_udf", value="output of describe command", extended="output of describe extended command"). This annotation is used to set the name of the UDF and the text to be displayed when DESCRIBE FUNCTION my_udf or DESCRIBE FUNCTION EXTENDED my_udf is issued. These commands print out a description of the UDF.

Once the Description annotation is in place, the UDF's documentation can be viewed with hive> DESCRIBE FUNCTION length;

UDAF

package com.hadoopbook.hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;

public class Mean extends UDAF {

  public static class MeanDoubleUDAFEvaluator implements UDAFEvaluator {

    public static class PartialResult {
      double sum;
      long count;
    }

    private PartialResult partial;

    public void init() {
      partial = null;
    }

    public boolean iterate(DoubleWritable value) {
      if (value == null) {
        return true;
      }
      if (partial == null) {
        partial = new PartialResult();
      }
      partial.sum += value.get();
      partial.count++;
      return true;
    }

    public PartialResult terminatePartial() {
      return partial;
    }

    public boolean merge(PartialResult other) {
      if (other == null) {
        return true;
      }
      if (partial == null) {
        partial = new PartialResult();
      }
      partial.sum += other.sum;
      partial.count += other.count;
      return true;
    }

    public DoubleWritable terminate() {
      if (partial == null) {
        return null;
      }
      return new DoubleWritable(partial.sum / partial.count);
    }
  }
}

The main points: extend the UDAF class, and implement the UDAFEvaluator interface in a nested static class.
The nested class implements five methods — init(), iterate(), terminatePartial(), merge(), and terminate() — and the point at which each is called is shown in the figure below.

[Figure: 151217udaf.png — when each of the five UDAF evaluator methods is called]

Another tutorial: https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy
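
A sketch of registering and calling the UDAF above; the jar path, table, and column names are placeholders, and the function is then used like a built-in aggregate such as avg():

ADD JAR /path/to/hive-examples.jar;
CREATE TEMPORARY FUNCTION my_mean AS 'com.hadoopbook.hive.Mean';

-- aggregate a numeric column with the custom UDAF
SELECT my_mean(price) FROM prices;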

The first pitfall I hit myself: when the partial result is passed between stages, the partial-result class cannot hold a plain Java array; use an ArrayList instead. See http://mail-archives.apache.org/mod_mbox/hive-user/201307.mbox/%3CCE1CA41C.1176E%25rdm@baynote.com%3E. Also, new-ing the ArrayList directly in the constructor does not feel right either; it seems better to allocate it outside.


This article is published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Reposting is welcome, but please attribute it to http://thousandhu.github.io and keep the reposted content intact. All related copyrights are reserved by the author.

Permalink: http://thousandhu.github.io/2015/12/13/Hadoop-The-Definitive-Guide-4th读书笔记-chapter-17-hive/