UDTF函数编写

如果hive的数据中有这样一类数据，他本身是一个jsonArray，array中储存着数个string（比如taglist）。当我们希望计算每个tag的一些信息时，需要将jsonarray拆开成（xxx，tag）的单独一行。这中将1行变成多行的功能就需要udtf实现。

udtf的功能和udaf正好相反，主要提供将一行拆分成多行的操作。UDTF执行过程是这样的，首先调用initialize函数，该函数每个instance调用一次，主要是准备udtf的列的状态之类的操作。然后会调用process方法，这个方法每次处理输入数据中一行数据。在process中当生成了结果中的一行时，调用forward函数，forward函数不需要自己实现，唯一需要注意的是forward接收的参数是一个数组，比如object[]。最后UDTF调用close方法做收尾工作。

实现udtf需要继承GenericUDTF并且override以下下几个函数：
initialize主要是对列名，列的内容进行设置。

@Override
   public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
       ArrayList<String> fieldNames = new ArrayList<String>();
       ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
       fieldNames.add("col1");
       fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
       return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs);
   }

process函数是具体的操作逻辑，在这个例子中就是将jsonArray解析出来然后将每个object变成单独的一行。这里在函数外面new forwardObj[]的目的是复用这个数组，不要每次forward的时候都new。

Object forwardObj[] = new String[1];
   @Override
   public void process(Object[] args) throws HiveException {
       if (args[0] == null) return;
       String input = args[0].toString();
       Object obj = JSONValue.parse(input);
       if (obj instanceof JSONArray) {
           for(Object item : ((JSONArray) obj).toArray()) {
               if (item instanceof String) {
                   forwardObj[0] = item;
               } else if (item instanceof JSONObject) {
                   forwardObj[0] = ((JSONObject) item).toJSONString();
               } else {
                   forwardObj[0] = item.toString();
               }
               forward(forwardObj);
           }
       } else {
           throw new HiveException("input item is not a jsonArray");
       }
   }

最后close在本例中不需要处理什么事情：

@Override
  public void close() throws HiveException {
  }

然后将这个函数添加一下即可：

1 2	add jar ./kg-hive-udf-1.0-SNAPSHOT.jar; Create Temporary Function SplitJsonArray as 'package.SplitJsonArray';

UDTF的使用

最基本的方法就是select SplitJsonArray(tag) as col1 from table;，注意udtf后面的as不能省略。但是udtf使用条件比较苛刻，不能在select的时候和别的列一起使用，也不能嵌套调用，也不能和group by/cluster by/distribute by/sort by一起使用。

为了和其他列一起使用。我们可以用lateral view来解决这一问题：

1 2	select a.somecol, b.col1 tag from some_table a lateral view SplitJsonArray(tag) table_alias as col1 where tag!="";

这里执行资料显示相当于读了两次表然后union到一起，但是比union写起来更清晰。

参考文献

https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
http://dev.bizo.com/2010/07/extending-hive-with-custom-udtfs.html
http://wdicc.com/udf-in-hive/
http://www.cnblogs.com/ggjucheng/archive/2013/02/01/2888819.html
http://blog.csdn.net/oopsoom/article/details/26001307

本文采用创作共用保留署名-非商业-禁止演绎4.0国际许可证，欢迎转载，但转载请注明来自http://thousandhu.github.io，并保持转载后文章内容的完整。本人保留所有版权相关权利。

本文链接：http://thousandhu.github.io/2016/01/11/hive-udtf使用/