Hive - Data Upload
1. Data Upload
Hive can load data from both the local file system and data already stored in HDFS.
Here we will load the airline on-time performance data file. (The dataset can be downloaded from http://stat-computing.org/dataexpo/2009/.)
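This walkthrough assumes the partitioned table from the previous post already exists. For reference, here is a minimal sketch of what such a definition looks like; the column list is hypothetical and abbreviated, so see the previous post for the actual DDL.

```sql
-- Hypothetical, abbreviated sketch of the partitioned table assumed below;
-- the actual DDL from the previous post lists every column in the dataset.
CREATE TABLE airline_delay (
    year          INT,
    month         INT,
    deptime       INT,
    arrtime       INT,
    uniquecarrier STRING,
    flightnum     INT,
    arrdelay      INT
)
PARTITIONED BY (delayYear INT)      -- partition key used by LOAD DATA below
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','            -- the dataexpo files are CSV
STORED AS TEXTFILE;
```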
In the Hive CLI, enter a LOAD DATA statement like the following. The OVERWRITE INTO clause means that any data already in the target is replaced by the newly loaded data rather than appended to. The PARTITION clause loads the rows with the partition key delayYear set to 2008. Since the table was created with a partition in the previous post, LOAD DATA will fail with an error if the PARTITION clause is omitted here.
```
hive> LOAD DATA LOCAL INPATH '/usr/local/hadoop/study/dataexpo/2008.csv'
    > OVERWRITE INTO TABLE airline_delay
    > PARTITION (delayYear='2008');

# Result #
Loading data to table default.airline_delay partition (delayyear=2008)
OK
Time taken: 5.699 seconds
```
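As noted above, Hive can also load files that already sit in HDFS: dropping the LOCAL keyword makes INPATH refer to an HDFS path. Below is a minimal sketch with a hypothetical HDFS path; note that an HDFS load moves the source file into the table's warehouse directory instead of copying it.

```sql
-- Hedged sketch: without LOCAL, the INPATH below is resolved against HDFS.
-- The HDFS path is hypothetical; the source file is moved, not copied.
LOAD DATA INPATH '/user/hive/dataexpo/2008.csv'
OVERWRITE INTO TABLE airline_delay
PARTITION (delayYear='2008');
```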
After LOAD DATA completes, run a SELECT query like the following to check that the data was registered correctly. The basic SELECT syntax is similar to RDBMS SQL, and as in MySQL, LIMIT can be used to fetch only the top 10 rows.
```
hive> SELECT year, month, deptime, arrtime, uniquecarrier, flightnum
    > FROM airline_delay
    > WHERE delayYear='2008'
    > LIMIT 10;
OK
year  month  deptime  arrtime  uniquecarrier  flightnum
2008  1      2003     2211     WN             335
2008  1      754      1002     WN             3231
2008  1      628      804      WN             448
2008  1      926      1054     WN             1746
2008  1      1829     1959     WN             3920
2008  1      1940     2121     WN             378
2008  1      1937     2037     WN             509
2008  1      1039     1132     WN             535
2008  1      617      652      WN             11
2008  1      1620     1639     WN             810
Time taken: 0.854 seconds, Fetched: 10 row(s)
```
2. Aggregate Functions
The next query uses COUNT, one of the aggregate functions. It counts the records for January 2008 in the US flight delay data.
(I originally planned to run it over the whole dataset, but my machine kept shutting down during the MapReduce stage, so I limited it to January. I installed Hadoop in pseudo-distributed mode and have no idea how to fix this; if anyone knows, please leave a comment ㅜㅜ)
Hive analyzes the submitted HiveQL, compiles it into a MapReduce job, and runs it. The mapper and reducer progress is displayed as percentages, and when the job completes, Hive reports how many map and reduce tasks ran and how much HDFS I/O and CPU time was consumed. The query below runs 3 map tasks and 1 reduce task, then prints 605,765 as the result.
```
hive> SELECT COUNT(1)
    > FROM airline_delay
    > WHERE delayYear=2008 and month=1
    > ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20161126150713_72e18d59-e920-4353-a806-16fcce376b3b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1480139944401_0002, Tracking URL = http://0.0.0.0:8089/proxy/application_1480139944401_0002/
Kill Command = /usr/local/hadoop-2.7.3/bin/hadoop job -kill job_1480139944401_0002
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2016-11-26 15:07:21,345 Stage-1 map = 0%, reduce = 0%
2016-11-26 15:07:30,813 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 4.93 sec
2016-11-26 15:07:31,859 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 16.42 sec
2016-11-26 15:07:36,052 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 18.26 sec
MapReduce Total cumulative CPU time: 18 seconds 260 msec
Ended Job = job_1480139944401_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1  Cumulative CPU: 18.26 sec  HDFS Read: 689462775 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 260 msec
OK
605765
Time taken: 23.97 seconds, Fetched: 1 row(s)
```
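To see the MapReduce plan Hive compiles a query into without actually running the job, the query can be prefixed with EXPLAIN; a minimal sketch:

```sql
-- Prints the compiled stage plan (map/reduce operators) without executing it.
EXPLAIN
SELECT COUNT(1)
FROM airline_delay
WHERE delayYear=2008 AND month=1;
```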
Hive queries also support SQL's GROUP BY. The query below uses GROUP BY to count arrival delays by year and month.
```
hive> SELECT Year, Month, COUNT(*) AS arrive_delay_count
    > FROM airline_delay
    > WHERE delayYear=2008 and month<=6
    > AND ArrDelay > 0
    > GROUP BY Year, Month;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20161126153024_64f46af6-924b-4ee5-ae09-b1d5203a5857
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1480139944401_0003, Tracking URL = http://0.0.0.0:8089/proxy/application_1480139944401_0003/
Kill Command = /usr/local/hadoop-2.7.3/bin/hadoop job -kill job_1480139944401_0003
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2016-11-26 15:30:32,544 Stage-1 map = 0%, reduce = 0%
2016-11-26 15:30:43,057 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 4.85 sec
2016-11-26 15:30:44,086 Stage-1 map = 78%, reduce = 0%, Cumulative CPU 17.55 sec
2016-11-26 15:30:45,138 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 18.26 sec
2016-11-26 15:30:50,370 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 22.83 sec
2016-11-26 15:30:51,412 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 24.77 sec
MapReduce Total cumulative CPU time: 24 seconds 770 msec
Ended Job = job_1480139944401_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3  Cumulative CPU: 24.77 sec  HDFS Read: 689480099 HDFS Write: 84 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 770 msec
OK
year  month  arrive_delay_count
2008  2      278902
2008  5      254673
2008  3      294556
2008  6      295897
2008  1      279427
2008  4      256142
Time taken: 28.285 seconds, Fetched: 6 row(s)
```
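Note that the grouped rows above come back unordered, since three reducers wrote their outputs independently. If sorted output is wanted, a hedged variant appends ORDER BY, which adds a final single-reducer sort:

```sql
-- Same aggregation as above, with a global sort on month appended.
SELECT Year, Month, COUNT(*) AS arrive_delay_count
FROM airline_delay
WHERE delayYear=2008 AND month<=6
  AND ArrDelay > 0
GROUP BY Year, Month
ORDER BY Month;
```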
3. Hive Built-in Functions
Return Type | Function Name (Signature) | Description |
---|---|---|
BIGINT | round(double a) | returns the rounded BIGINT value of the double |
BIGINT | floor(double a) | returns the maximum BIGINT value that is equal or less than the double |
BIGINT | ceil(double a) | returns the minimum BIGINT value that is equal or greater than the double |
double | rand(), rand(int seed) | returns a random number (that changes from row to row). Specifying the seed will make sure the generated random number sequence is deterministic. |
string | concat(string A, string B,...) | returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them. |
string | substr(string A, int start) | returns the substring of A starting from start position till the end of string A. For example, substr('foobar', 4) results in 'bar' |
string | substr(string A, int start, int length) | returns the substring of A starting from start position with the given length. For example, substr('foobar', 4, 2) results in 'ba' |
string | upper(string A) | returns the string resulting from converting all characters of A to upper case, for example, upper('fOoBaR') results in 'FOOBAR' |
string | ucase(string A) | Same as upper |
string | lower(string A) | returns the string resulting from converting all characters of A to lower case, for example, lower('fOoBaR') results in 'foobar' |
string | lcase(string A) | Same as lower |
string | trim(string A) | returns the string resulting from trimming spaces from both ends of A, for example, trim(' foobar ') results in 'foobar' |
string | ltrim(string A) | returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar ') results in 'foobar ' |
string | rtrim(string A) | returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar' |
string | regexp_replace(string A, string B, string C) | returns the string resulting from replacing all substrings in A that match the Java regular expression B (see Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo\|ar', '') returns 'fb' |
int | size(Map<K,V>) | returns the number of elements in the map type |
int | size(Array<T>) | returns the number of elements in the array type |
value of <type> | cast(<expr> as <type>) | converts the results of the expression expr to <type>, for example, cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed. |
string | from_unixtime(int unixtime) | convert the number of seconds from the UNIX epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00" |
string | to_date(string timestamp) | Return the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" |
int | year(string date) | Return the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970 |
int | month(string date) | Return the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11 |
int | day(string date) | Return the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1 |
string | get_json_object(string json_string, string path) | Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input json string is invalid. |
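As a quick illustration, several of the built-ins above can be combined in one query against the airline_delay table from the earlier sections; a hypothetical sketch (the output is not shown or verified here):

```sql
-- Hypothetical sketch combining concat, cast, upper, and round
-- on the airline_delay table used earlier in this post.
SELECT concat(uniquecarrier, '-', cast(flightnum AS STRING)) AS flight_id,
       upper(uniquecarrier) AS carrier,
       round(arrdelay / 60.0) AS arrdelay_hours
FROM airline_delay
WHERE delayYear=2008
LIMIT 5;
```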