. Advertisement .
..3..
. Advertisement .
..4..
Hive is one of the most popular database manipulation programs in the world. However, it’s also among the most challenging, with its tables and partitions being some prime examples. To make it easier for you, we will explain Hive – INSERT INTO vs INSERT OVERWRITE in this article.
Your life will become much easier if you can grasp these useful functions.
Hive – INSERT INTO vs INSERT OVERWRITE – INSERT INTO
INSERT INTO is a relatively new addition to Hive, appearing in version 0.8. Its main use is to append any records, data, and rows into a partition or table. True to its name, this function only inserts without modifying any existing data.
Here is its syntax:
INSERT INTO TABLE table1
[PARTITION (partcol1=val1, partcol2=val2 ...)]
select_statement1 FROM from_statement;
Let’s try it out with an example. You start by creating a table that lists students’ names, their GPA, and their age.
CREATE TABLE pupils (name VARCHAR(64), gpa DECIMAL(3,2), age INT)
With this code, we have a table with 3 columns, name, GPA, and age. Now, to manipulate this table, you can start adding values with INSERT INTO.
INSERT INTO pupils VALUES
('John',3.9,12)
('Alice',3.7,11)
You should always keep in mind that to ensure the validity of the statement, each table column must have a value. In other words, if you use the code below, it won’t work.
INSERT INTO pupils VALUES
('John',3.9)
The computer will only throw back an error message. However, you can work around this issue by adding null values for any column you don’t want a value assigned.
It differs a little bit when we talk about partitioned tables. Let’s take the example and partition the table with the age column.
INSERT INTO pupils PARTITION(age=12) VALUES (‘Susan’, 3.7, 12);
It’s not clear in this example, but the partitioned column must always be the last one added in.
If you can master this function, joining two data frames together is no longer a challenge.
Hive – INSERT INTO vs INSERT OVERWRITE – INSERT OVERWRITE
As the name suggests, while both of them insert values into a table, INSERT OVERWRITE will straight-up replace the existing data. Despite its space-saving advantage, you should be extremely careful when using this function.
We recommend always putting in an IF NOT EXIST case to only overwrite if there is a partition.
INSERT OVERWRITE Syntax:
INSERT OVERWRITE TABLE table1
[PARTITION (partcol1=val1, partcol2=val2 ...)
[IF NOT EXISTS]]
select_statement1 FROM from_statement;
Let’s take the example above again.
CREATE TABLE pupils (name VARCHAR(64), gpa DECIMAL(3,2), age INT)
INSERT INTO pupils VALUES
('John',3.9,12)
(‘Alice’,3.7,11)
INSERT OVERWRITE pupils VALUES
(‘Mike’,3.6,11)
Instead of having two rows with two students named John and Mike, there is only Mike. The cause of this phenomenon is the function deleting all the existing Hive table data and putting in the rows within VALUES.
When there is a partition clause, the code will only remove records within that partition without touching any other data.
INSERT OVERWRITE pupils PARTITION(age=12) VALUES
(‘Mike’,3.6,12)
This code will delete the ‘John’ record but leave the ‘Alice’ record alone.
Conclusion
We hope that after reading through our article, you can now compare Hive – INSERT INTO vs INSERT OVERWRITE easily. Once you master both of these functions, manipulating SQL data and tables will become a lot more manageable.
Please look forward to our next publication, as they will only be more interesting than this article.
Leave a comment