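A while back we were migrating data centers: services had already been switched to the new data center, but the machines had not all been moved over yet, so the HBase cluster was under heavy load. To relieve it, we changed the nightly full import into HBase to use Bulk Load. Normally we write to HBase through TableOutputFormat, but that requires frequent communication with the RegionServers and consumes a lot of resources. Bulk Load instead generates HFiles on HDFS first and then moves the files directly to the corresponding RegionServers. If region boundaries have changed by the time the files are loaded, HBase automatically splits the HFiles along the new boundaries, although a load that involves splitting is not very efficient.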
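
To make the point about region boundaries concrete, here is a minimal sketch in plain Java with no HBase dependency. It is not how HBase implements the lookup; the class RegionBoundarySketch, its methods and the region names are hypothetical, and Java String ordering stands in for HBase's byte-wise key ordering (the two agree only for ASCII keys). It shows that a row belongs to the region with the greatest start key not larger than the row, and that a file whose rows span more than one region would have to be split before it can be loaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class RegionBoundarySketch {
    // Region start key -> hypothetical region name. The first region starts at the empty key.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    public void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // A row is owned by the region with the greatest start key <= row key.
    public String regionFor(String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    // Regions overlapped by a file whose rows span [firstRow, lastRow].
    // More than one entry means the file crosses a boundary and would need splitting.
    public List<String> regionsCovering(String firstRow, String lastRow) {
        String from = regionsByStartKey.floorKey(firstRow);
        return new ArrayList<>(regionsByStartKey.subMap(from, true, lastRow, true).values());
    }

    public static void main(String[] args) {
        RegionBoundarySketch table = new RegionBoundarySketch();
        table.addRegion("", "region-1");
        table.addRegion("g", "region-2");
        table.addRegion("p", "region-3");
        System.out.println(table.regionFor("apple"));
        System.out.println(table.regionsCovering("h", "q"));
    }
}
```

In this sketch a file covering rows "h" to "q" touches two regions, which is the situation where the load has to split the file and becomes slower.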

Generating HFiles

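The HFiles are written directly by the MapReduce job in the required format: the mapper's output key is an ImmutableBytesWritable holding the row key, and the output value is an HBase Put.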


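import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver (in class BulkLoadJob). Assumed: args[0] = input path, args[1] = HFile output directory.
public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    String inputPath = args[0];
    String outputPath = args[1];
    Job job = Job.getInstance(conf, "hbase-bulk-load");
    job.setJarByClass(BulkLoadJob.class);
    job.setMapperClass(BulkLoadJob.BulkLoadMap.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    // disable speculative execution
    job.setSpeculativeExecution(false);
    job.setReduceSpeculativeExecution(false);
    // The mapper below takes Text keys, which needs KeyValueTextInputFormat (key = text before
    // the first tab); TextInputFormat would supply LongWritable byte offsets instead.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setOutputFormatClass(HFileOutputFormat2.class);
    FileInputFormat.setInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    HTable hTable = new HTable(conf, "table");
    // Sets the reducer, partitioner and reducer count to match the table's current regions.
    HFileOutputFormat2.configureIncrementalLoad(job, hTable);
    job.waitForCompletion(true);
}

// Mapper (BulkLoadJob.BulkLoadMap); family and column are String fields defined elsewhere in the class.
public void map(Text row, Text value, Context context) throws InterruptedException, IOException {
    byte[] rowBytes = Bytes.toBytes(row.toString());
    ImmutableBytesWritable rowKey = new ImmutableBytesWritable(rowBytes);
    Put p = new Put(rowBytes); // Put takes the raw row bytes, not the writable
    byte[] cell = Bytes.toBytes(value.toString()); // assumed: the cell holds the input value
    p.add(Bytes.toBytes(family), Bytes.toBytes(column), cell);
    context.write(rowKey, p);
}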
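
One detail behind the job above: an HFile stores its cells sorted by row key, and HBase compares row keys as unsigned bytes, so the order is not always what a numeric reading of the key suggests. The mapper does not have to sort anything itself, since configureIncrementalLoad configures the partitioning and sorting. The following is a minimal sketch of that ordering in plain Java; the class RowKeyOrderSketch and its methods are hypothetical and are not part of HBase.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RowKeyOrderSketch {
    // Lexicographic comparison that treats every byte as unsigned (0..255).
    public static int compareRowKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal: the shorter key sorts first.
        return a.length - b.length;
    }

    // Returns a sorted copy of the given keys, compared by their UTF-8 bytes.
    public static String[] sortRowKeys(String... keys) {
        String[] sorted = keys.clone();
        Arrays.sort(sorted, (l, r) -> compareRowKeys(
                l.getBytes(StandardCharsets.UTF_8),
                r.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortRowKeys("row2", "row10", "row1")));
    }
}
```

With keys like these, "row10" sorts before "row2", which is why numeric row keys are often zero-padded when scan order matters.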

Loading the data

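Once the files are generated there are two ways to load them. The first is to call the HBase API from the program; the variables used here are the ones from the main function above. You may run into permission problems at this step, in which case the files need to be given 777 permissions.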

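import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

// Open up permissions on the generated HFiles (args[1] is the HFile output directory)
// so that HBase can move them. logger is assumed to be the class's logger.
FsShell shell = new FsShell(conf);
try {
    shell.run(new String[]{"-chmod", "-R", "777", args[1]});
} catch (Exception e) {
    logger.error("Couldn't change the file permissions", e);
    throw new IOException(e);
}
// Hand the HFiles over to the regions of the target table.
hTable = new HTable(conf, "table");
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path(outputPath), hTable);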

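The other way is to invoke the HBase command directly from the shell (here $inputPath is the directory containing the generated HFiles):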


hadoop fs -chmod -R 777 $inputPath
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles $inputPath $tableName

Reportedly, if the target table does not exist, the tool creates it automatically; I have not verified this.

For data in TSV format, HBase reportedly ships a tool that can import it directly (ImportTsv); I have not used it either.

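This article is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Reposting is welcome, but please credit http://thousandhu.github.io as the source and keep the reposted content complete. The author reserves all copyright-related rights.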

Permalink: http://thousandhu.github.io/2016/07/22/Hbase-利用Bulk-Load导入数据/