Home | 简体中文 | 繁体中文 | 杂文 | Github | 知乎专栏 | 51CTO学院 | CSDN程序员研修院 | OSChina 博客 | 腾讯云社区 | 阿里云栖社区 | Facebook | Linkedin | Youtube | 打赏(Donations) | About
知乎专栏多维度架构

第 17 章 Apache Hive

目录

17.1. 安装 Apache Hive
17.1.1. MySQL
17.1.2. Hadoop
17.1.3. Hive
17.1.4. 启动 Hive
17.1.5. 访问 Hive
17.1.6. 配置 hiveserver2
17.2. beeline
17.3. 管理 Hive
17.3.1. 表管理
17.3.2. 分区表
17.3.3. 视图管理
17.3.4. 数据管理
17.3.5. HDFS与本地文件系统管理
17.4. HiveQL - Hive查询语言
17.4.1. JOIN 连接查询
17.4.2. 子查询
17.5. FAQ
17.5.1. adoop.security.authorize.AuthorizationException): User: hadoop is not allowed to impersonate anonymous

Hive是基于Hadoop构建的一套数据仓库分析系统,它提供了丰富的SQL查询方式来分析存储在Hadoop 分布式文件系统中的数据。其在Hadoop的架构体系中承担了一个SQL解析的过程,它提供了对外的入口来获取用户的指令然后对指令进行分析,解析出一个MapReduce程序组成可执行计划,并按照该计划生成对应的MapReduce任务提交给Hadoop集群处理,获取最终的结果。

17.1. 安装 Apache Hive

安装 Apache Hive 需要 Hadoop和MySQL,这里假设你已经懂得如何安装Hadoop和MySQL,所以一下将采用Netkiller OSCM一件安装脚本来初始化Hadoop和MySQL,如果需要详细的安装步骤请参考笔者的相关文章。

17.1.1. MySQL

默认情况下, Hive 使用内嵌的 Derby 数据库保存元数据, 通常生产环境会使用 MySQL 来存放 Hive 元数据。

使用下面脚本一键安装MySQL 5.7 安装后会显示mysql的初始密码,是所有初始密码登陆后修改为你的需要密码

			
curl -s https://raw.githubusercontent.com/oscm/shell/master/database/mysql/5.7/mysql.server.sh | bash

2016-02-16T08:22:58.253030Z 1 [Note] A temporary password is generated for root@localhost: sd%%my.Ak7Ma
			
			

安装 MySQL JDBC 连接库。

			
curl -s https://raw.githubusercontent.com/oscm/shell/master/database/mysql/5.7/mysql-connector-java.sh | bash
			
			

创建一个 hive 数据库用来存储 Hive 元数据,且数据库访问的用户名和密码都为 hive。

			
mysql> CREATE DATABASE hive; 
Query OK, 1 row affected (0.03 sec)
			
			

创建用户hive并授权访问hive数据库

			
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive';
Query OK, 0 rows affected (0.04 sec)

mysql> GRANT ALL ON hive.* TO 'hive'@'localhost' IDENTIFIED BY 'hive';
Query OK, 0 rows affected (0.01 sec)

mysql> GRANT ALL ON hive.* TO 'hive'@'%' IDENTIFIED BY 'hive';
Query OK, 0 rows affected (0.00 sec)

mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)

mysql> quit;
Bye
			
			

17.1.2. Hadoop

安装 Hadoop 采用单机模式

			
curl -s https://raw.githubusercontent.com/oscm/shell/master/distributed/hadoop/hadoop-2.8.0.sh | bash
curl -s https://raw.githubusercontent.com/oscm/shell/master/distributed/hadoop/single.sh | bash
curl -s https://raw.githubusercontent.com/oscm/shell/master/distributed/hadoop/startup.sh | bash 
			
			

17.1.3. Hive

可以从 Apache 镜像站点中下载最新稳定版的 apache-hive-2.1.1-bin.tar.gz

			
cd /usr/local/src
wget http://mirrors.hust.edu.cn/apache/hive/stable-2/apache-hive-2.1.1-bin.tar.gz

tar zxf apache-hive-2.1.1-bin.tar.gz
mv apache-hive-2.1.1-bin /srv/apache-hive-2.1.1
ln -s /srv/apache-hive-2.1.1/ /srv/apache-hive
chown hadoop:hadoop -R /srv/apache-hive-2.1.1
			
			

			
cat > /srv/apache-hive/conf/hive-env.sh <<'EOF'
export JAVA_HOME=/srv/java
export HADOOP_HOME=/srv/apache-hadoop
export HBASE_HOME=/srv/apache-hbase
export HIVE_HOME=/srv/apache-hive
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
EOF

cat >> ~/.bash_profile <<'EOF'
export JAVA_HOME=/srv/java
export HADOOP_HOME=/srv/apache-hadoop
export HBASE_HOME=/srv/apache-hbase
export HIVE_HOME=/srv/apache-hive
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
EOF

source ~/.bash_profile
			
			

安装JDBC驱动

			
[root@localhost apache-hive]# ln -s  /usr/share/java/mysql-connector-java.jar /srv/apache-hive/lib/
[root@localhost apache-hive]# ll /srv/apache-hive/lib/mysql-connector-java.jar 
lrwxrwxrwx 1 root root 40 Jun 29 01:59 /srv/apache-hive/lib/mysql-connector-java.jar -> /usr/share/java/mysql-connector-java.jar
			
			

修改 hive-site.xml 配置文件,配置工作目录

			
<property>
    <name>system:java.io.tmpdir</name>
    <value>/tmp/hive</value>
</property>
<property>
    <name>system:user.name</name>
    <value>hadoop</value>
</property>

  <property>
    <name>hive.querylog.location</name>
    <value>/tmp/live/hadoop</value>
    <description>Location of Hive run time structured log file</description>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/hive</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/tmp/hive/${hive.session.id}_resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>

  <property>
    <name>hive.querylog.location</name>
    <value>/user/hive/log</value>
    <description>Location of Hive run time structured log file</description>
  </property>
			
			

把默认的 Derby 修改为 MySQL 需要在该文件中配置 MySQL 数据库连接信息。

			
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>APP</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mine</value>
    <description>password to use against metastore database</description>
  </property>
			
			

将上面配置项 value 改为下面的配置

			
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
			
			

17.1.4. 启动 Hive

启动 Hive 前你必须做两件事,一是创建HDFS目录,二是初始化 MySQL 数据库。

为 Hive 创建 HDFS 工作目录并给它们赋相应的权限。

			
[root@localhost ~]$ su - hadoop
[hadoop@localhost ~]$ /srv/apache-hadoop/bin/hdfs dfs -mkdir -p /user/hive/warehouse
[hadoop@localhost ~]$ /srv/apache-hadoop/bin/hdfs dfs -mkdir -p /tmp/hive
[hadoop@localhost ~]$ /srv/apache-hadoop/bin/hdfs dfs -chmod g+w /user/hive/warehouse
[hadoop@localhost ~]$ /srv/apache-hadoop/bin/hdfs dfs -chmod 777 /tmp/hive
			
			

初始化 MySQL 数据库

			
[hadoop@localhost ~]$ /srv/apache-hive/bin/schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/apache-hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:	 jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&characterEncoding=UTF-8&useSSL=false
Metastore Connection Driver :	 com.mysql.jdbc.Driver
Metastore connection User:	 hive
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.mysql.sql
Initialization script completed
schemaTool completed
			
			

如果使用内嵌数据库 derby 请使用下面命令初始化

schematool -initSchema -dbType derby
			

17.1.5. 访问 Hive

启动 Hadoop

			
[hadoop@localhost ~]$ /srv/apache-hadoop/sbin/start-all.sh 
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /srv/apache-hadoop-2.8.0/logs/hadoop-hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /srv/apache-hadoop-2.8.0/logs/hadoop-hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /srv/apache-hadoop-2.8.0/logs/hadoop-hadoop-secondarynamenode-localhost.localdomain.out
starting yarn daemons
starting resourcemanager, logging to /srv/apache-hadoop-2.8.0/logs/yarn-hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /srv/apache-hadoop-2.8.0/logs/yarn-hadoop-nodemanager-localhost.localdomain.out
			
			

进入 Hive 然后输入 show databases; 测试安装是否正常。

			
[hadoop@localhost conf]$ /srv/apache-hive/bin/hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/apache-hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/srv/apache-hive-2.1.1/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
Time taken: 0.264 seconds, Fetched: 1 row(s)
hive>			
			
			

至此 Apache Hive 已经安装配置完成!

尝试执行下面的SQL命令测试Hive是否正常

create database test;
use test;
create table test_table (id int,name string) row format delimited fields terminated by ',' stored as textfile
insert into test_table values (1, 'Neo');
select * from test_table;
			

17.1.6. 配置 hiveserver2

hiveserver2 提供远程访问 Hive 服务,用户可以通过IP地址和端口号连接到Hive,类似mysql client

			
[hadoop@localhost ~]$ /srv/apache-hive/bin/hiveserver2 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/apache-hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
			
			

检查端口

[hadoop@localhost bin]$ ss -lnt | grep 10000
LISTEN     0      50           *:10000                    *:* 
			

测试 beeline 是否可以正常进入

			
[hadoop@localhost ~]$ /srv/apache-hive/bin/beeline -u jdbc:hive2://
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/apache-hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://
17/06/29 22:01:16 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.1.1)
Driver: Hive JDBC (version 2.1.1)
17/06/29 22:01:16 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.1.1 by Apache Hive
0: jdbc:hive2://> show databases;
OK
+----------------+--+
| database_name  |
+----------------+--+
| default        |
+----------------+--+
1 row selected (1.318 seconds)
			
			

如果是生产环境启动请使用下面的方法

			
[hadoop@localhost ~]$ /srv/apache-hive/bin/hive --service hiveserver2 &
[1] 20375
[hadoop@localhost ~]$ SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/apache-hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]