
Hive - Data Upload

Excelsior-JH 2016. 11. 26. 15:47


1. Data Upload

Hive can load both data on the local file system and data already stored in HDFS.

Here we will load the airline on-time performance data. (The dataset can be downloaded from http://stat-computing.org/dataexpo/2009/.)


In the Hive CLI, enter a LOAD DATA statement like the one below. The OVERWRITE keyword replaces any data already in the target partition rather than appending to it. The PARTITION clause loads the rows with the partition key delayYear set to 2008. Since the table was created with a partition in the previous post, running LOAD DATA without a PARTITION clause will fail with an error.

hive> LOAD DATA LOCAL INPATH '/usr/local/hadoop/study/dataexpo/2008.csv'
    > OVERWRITE INTO TABLE airline_delay
    > PARTITION (delayYear='2008');
Loading data to table default.airline_delay partition (delayyear=2008)
OK
Time taken: 5.699 seconds
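
For reference, the table from the previous post is assumed to have been created roughly as follows. This is a minimal sketch with the column list abbreviated to the columns used in this post, so the actual DDL may differ:

hive> CREATE TABLE airline_delay (
    >   year INT, month INT,
    >   deptime INT, arrtime INT,
    >   uniquecarrier STRING, flightnum INT,
    >   arrdelay INT)
    > PARTITIONED BY (delayYear INT)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;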

Once LOAD DATA finishes, run a SELECT query like the one below to confirm that the data was loaded correctly. The basic SELECT syntax is similar to RDBMS SQL, and as in MySQL, LIMIT is used here to fetch only the first 10 rows.

hive> SELECT year, month, deptime, arrtime, uniquecarrier, flightnum
    > FROM airline_delay
    > WHERE delayYear='2008'
    > LIMIT 10;
OK
year    month   deptime arrtime uniquecarrier   flightnum
2008    1       2003    2211    WN      335
2008    1       754     1002    WN      3231
2008    1       628     804     WN      448
2008    1       926     1054    WN      1746
2008    1       1829    1959    WN      3920
2008    1       1940    2121    WN      378
2008    1       1937    2037    WN      509
2008    1       1039    1132    WN      535
2008    1       617     652     WN      11
2008    1       1620    1639    WN      810
Time taken: 0.854 seconds, Fetched: 10 row(s)
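
Another quick sanity check is to list the partitions Hive now tracks for the table; after the load above, this should print something like the following:

hive> SHOW PARTITIONS airline_delay;
delayyear=2008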


2. Aggregate Functions


The next query uses the COUNT aggregate function. It counts the January 2008 records in the US airline delay data.

(I originally meant to run this over the entire dataset, but my machine kept shutting down during the MapReduce stage, so I limited it to January. I installed Hadoop in pseudo-distributed mode and have no idea how to fix this; if anyone knows, please leave a comment.)


Hive parses the HiveQL the user submits, compiles it into a MapReduce job, and runs it. The progress of the map and reduce phases is shown as percentages, and when the job completes, Hive reports how many map and reduce tasks ran and how much HDFS I/O and CPU time was consumed. The query below runs 3 map tasks and 1 reduce task and returns 605,765.

hive> SELECT COUNT(1)
    > FROM airline_delay
    > WHERE delayYear=2008 and month=1
    > ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20161126150713_72e18d59-e920-4353-a806-16fcce376b3b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1480139944401_0002, Tracking URL = http://0.0.0.0:8089/proxy/application_1480139944401_0002/
Kill Command = /usr/local/hadoop-2.7.3/bin/hadoop job  -kill job_1480139944401_0002
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2016-11-26 15:07:21,345 Stage-1 map = 0%,  reduce = 0%
2016-11-26 15:07:30,813 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 4.93 sec
2016-11-26 15:07:31,859 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 16.42 sec
2016-11-26 15:07:36,052 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 18.26 sec
MapReduce Total cumulative CPU time: 18 seconds 260 msec
Ended Job = job_1480139944401_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 18.26 sec   HDFS Read: 689462775 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 260 msec
OK
605765
Time taken: 23.97 seconds, Fetched: 1 row(s)
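
If you are curious what Hive will compile a query into before actually running it, you can prefix the query with EXPLAIN; this prints the stage plan, including the map/reduce stages, without launching the job:

hive> EXPLAIN
    > SELECT COUNT(1)
    > FROM airline_delay
    > WHERE delayYear=2008 AND month=1;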

Hive queries also support SQL's GROUP BY. The query below uses GROUP BY to count arrival delays (ArrDelay > 0) by year and month.

hive> SELECT Year, Month, COUNT(*) AS arrive_delay_count
    > FROM airline_delay
    > WHERE delayYear=2008 and month<=6
    > AND ArrDelay > 0
    > GROUP BY Year, Month;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20161126153024_64f46af6-924b-4ee5-ae09-b1d5203a5857
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1480139944401_0003, Tracking URL = http://0.0.0.0:8089/proxy/application_1480139944401_0003/
Kill Command = /usr/local/hadoop-2.7.3/bin/hadoop job  -kill job_1480139944401_0003
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2016-11-26 15:30:32,544 Stage-1 map = 0%,  reduce = 0%
2016-11-26 15:30:43,057 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 4.85 sec
2016-11-26 15:30:44,086 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 17.55 sec
2016-11-26 15:30:45,138 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 18.26 sec
2016-11-26 15:30:50,370 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 22.83 sec
2016-11-26 15:30:51,412 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 24.77 sec
MapReduce Total cumulative CPU time: 24 seconds 770 msec
Ended Job = job_1480139944401_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 24.77 sec   HDFS Read: 689480099 HDFS Write: 84 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 770 msec
OK
year    month   arrive_delay_count
2008    2       278902
2008    5       254673
2008    3       294556
2008    6       295897
2008    1       279427
2008    4       256142
Time taken: 28.285 seconds, Fetched: 6 row(s)
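
Note that the months come back unordered, since the rows were produced by three independent reducers. If sorted output is needed, an ORDER BY can be appended; Hive performs this as a global sort through a single reducer, so it adds a stage to the job:

hive> SELECT Year, Month, COUNT(*) AS arrive_delay_count
    > FROM airline_delay
    > WHERE delayYear=2008 AND month<=6
    > AND ArrDelay > 0
    > GROUP BY Year, Month
    > ORDER BY Month;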


3. Hive Built-in Functions

Hive provides the following built-in functions, listed here as return type and signature, followed by a description.

BIGINT round(double a)
Returns the rounded BIGINT value of the double.

BIGINT floor(double a)
Returns the maximum BIGINT value that is equal to or less than the double.

BIGINT ceil(double a)
Returns the minimum BIGINT value that is equal to or greater than the double.

double rand(), rand(int seed)
Returns a random number that changes from row to row. Specifying the seed makes the generated random number sequence deterministic.

string concat(string A, string B, ...)
Returns the string resulting from concatenating B after A; for example, concat('foo', 'bar') results in 'foobar'. Accepts an arbitrary number of arguments and returns the concatenation of all of them.

string substr(string A, int start)
Returns the substring of A from the start position to the end of A; for example, substr('foobar', 4) results in 'bar'.

string substr(string A, int start, int length)
Returns the substring of A from the start position with the given length; for example, substr('foobar', 4, 2) results in 'ba'.

string upper(string A)
Returns the string resulting from converting all characters of A to upper case; for example, upper('fOoBaR') results in 'FOOBAR'.

string ucase(string A)
Same as upper.

string lower(string A)
Returns the string resulting from converting all characters of A to lower case; for example, lower('fOoBaR') results in 'foobar'.

string lcase(string A)
Same as lower.

string trim(string A)
Returns the string resulting from trimming spaces from both ends of A; for example, trim(' foobar ') results in 'foobar'.

string ltrim(string A)
Returns the string resulting from trimming spaces from the beginning (left-hand side) of A; for example, ltrim(' foobar ') results in 'foobar '.

string rtrim(string A)
Returns the string resulting from trimming spaces from the end (right-hand side) of A; for example, rtrim(' foobar ') results in ' foobar'.

string regexp_replace(string A, string B, string C)
Returns the string resulting from replacing all substrings in A that match the Java regular expression B with C; for example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'.

int size(Map<K.V>)
Returns the number of elements in the map type.

int size(Array<T>)
Returns the number of elements in the array type.

value of <type> cast(<expr> as <type>)
Converts the result of the expression expr to <type>; for example, cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.

string from_unixtime(int unixtime)
Converts a number of seconds since the UNIX epoch (1970-01-01 00:00:00 UTC) to a string representing that moment in the current system time zone, in the format "1970-01-01 00:00:00".

string to_date(string timestamp)
Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".

int year(string date)
Returns the year part of a date or timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.

int month(string date)
Returns the month part of a date or timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.

int day(string date)
Returns the day part of a date or timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.

string get_json_object(string json_string, string path)
Extracts a JSON object from a JSON string based on the specified JSON path and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid.
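
A quick way to try a few of these is to call them in a SELECT; the airline_delay table loaded above is used here only to supply a row:

hive> SELECT round(3.7), concat('foo', 'bar'), substr('foobar', 4),
    >        upper('fOoBaR'), cast('1' AS BIGINT)
    > FROM airline_delay
    > LIMIT 1;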

