Apache Hive uses tables as the storage format for its data. This essential functionality is at the core of this data warehouse software. Let’s learn how to create a Hive table and take advantage of its SQL-like interface.
Tables In Hive
In the data warehouse Apache Hive, as in other database applications, a table is a set of rows and columns holding related data. Hive stores its data and records in this tabular format. Tables fall into two categories: internal and external.
Internal tables, also known as managed tables, store their data in a folder inside HDFS named after the table, managed by Hive itself. When you specify your own HDFS location to store the data, you create an external table instead.
Depending on the version of your Hive and Hadoop installation, these tables may have different default storage locations. With Hive 3.0 or newer, for instance, external tables are located under “/warehouse/tablespace/external/hive/” by default, while the folder “/warehouse/tablespace/managed/hive/” is reserved for managed tables. Earlier versions of Hive use “/apps/hive/warehouse/” as their default storage location.
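To make the distinction concrete, here is a minimal sketch of an external table definition. The table name, columns, and path are all illustrative, not part of any standard layout:

```sql
-- Hypothetical example: an external table whose data lives at a path
-- we manage ourselves. Dropping the table later removes only the
-- metadata, not the files under /data/logs.
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  event string,
  ts    bigint )
LOCATION '/data/logs';
```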
Hive Create Table
We need to use the CREATE TABLE statement to create a table with the Hive format.
Syntax:
CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
[ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
[ COMMENT table_comment ]
[ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
| ( col_name1, col_name2, ... ) ]
[ CLUSTERED BY ( col_name1, col_name2, ...)
[ SORTED BY ( col_name1 [ ASC | DESC ], col_name2 [ ASC | DESC ], ... ) ]
INTO num_buckets BUCKETS ]
[ ROW FORMAT row_format ]
[ STORED AS file_format ]
[ LOCATION path ]
[ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
[ AS select_statement ]
Where:
- table_identifier: the name of your table. You can also qualify it with a schema by entering [schema_name.] table_name
- EXTERNAL: insert this if you want to create an external table instead of one in the default storage location. Remember to provide a path to your folder when using this clause.
- COMMENT: the description of your table. It should be a string literal.
- TBLPROPERTIES: use this clause to tag your table definition by providing a list of key-value pairs.
- PARTITIONED BY: the partitions you want to create on your table. The statement uses the columns you provide to create the partitions; note that a partition column must not also appear in the main column list.
- CLUSTERED BY: use this clause to divide the table (or each partition) into a fixed number of buckets. This technique can optimize your applications because the data is already divided by the bucketing columns, avoiding data shuffles.
- SORTED BY: the order in which you want to sort the bucket columns. By default, CREATE TABLE uses ASC (ascending order), but you can switch to descending order with the DESC keyword.
- row_format: the row format for output and input.
- STORED AS: the file format in which you want to store the table, such as PARQUET, ORC, or TEXTFILE.
- LOCATION: the path to the folder that stores your table data. It should be located on a file system Hadoop supports, such as HDFS.
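The clauses above can be combined in one statement. The following sketch is a hypothetical example (the table name, columns, and property are assumptions, not taken from the article) showing partitioning, bucketing, and a storage format together:

```sql
-- Hypothetical table: partitioned by date, bucketed by user_id.
-- Note that the partition column view_date is NOT repeated in the
-- main column list.
CREATE TABLE IF NOT EXISTS page_views (
  user_id string,
  url     string )
COMMENT 'Hypothetical page views table'
PARTITIONED BY ( view_date string )
CLUSTERED BY ( user_id ) SORTED BY ( user_id ASC ) INTO 4 BUCKETS
STORED AS ORC;
```

Bucketing by user_id means all rows for the same user land in the same bucket file, which can speed up joins and sampling on that column.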
Examples
If you don’t have a working installation of Hadoop or Hive, you need to set them up and configure them first.
Once both Hadoop and Hive have been properly installed, start a distributed file system with this command and make sure there is no error:
start-dfs.sh
Launch Hive:
hive
And create a table:
CREATE TABLE IF NOT EXISTS site (
site string,
ranking int,
type string )
COMMENT 'Website table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
This statement creates a Hive table containing information on several websites, such as their name, ranking, and type. You can verify this table with the DESCRIBE statement:
DESCRIBE site;
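Because the table was declared with FIELDS TERMINATED BY ',', you can populate it from a comma-separated file and query it. The file path and its contents below are assumptions for illustration:

```sql
-- Assuming a local CSV file /tmp/sites.csv with rows such as:
--   example.com,1,blog
-- LOAD DATA copies it into the table's storage location.
LOAD DATA LOCAL INPATH '/tmp/sites.csv' INTO TABLE site;

-- Then query it like any SQL table:
SELECT site, ranking FROM site WHERE type = 'blog';
```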
Conclusion
You can create a Hive table with a simple statement: CREATE TABLE. The process is very similar to SQL, and once the table exists you can carry out many of the same data manipulation operations.