In this blog I will focus on incremental (delta) load, i.e. updating existing records, inserting new records, and dynamic partition table loading. In the BI world the delta/incremental load is a very common activity.
We can implement it on a Hive table using the steps below. In a relational database we would update existing records using a unique index/primary key, but in Hive we have to do it a different way.
There are many ways to do it; let us see one of them using the steps below. I am assuming that you have data in HDFS which is updated hourly/daily from your Sqoop/Flume/Kafka ingestion.
Design considerations:
- I am assuming that adequate temp space is maintained for your data volume, so that we do not face temp-space issues during the process.
- I am assuming that you have ingestion logic set up in your code for full and incremental loads, and that the data is available in HDFS.
- Partition column values should be converted to string and should not contain special characters like ‘/’ or ‘…’.
Note:
Hive user: There is no specific command to create a Hive user. Any HDFS user can connect to Hive, and if the user is authorized to access a table per the permissions set in Ranger, the user can access it.
Step1: Create a hive target table and do a full load from your source.
My target table is orders and its create statement:
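The original create statement did not survive here; a minimal sketch, assuming column names inferred from the merge query in Step 3 (order_no, quantity, amount, last_update_date, with order_date as the partition column), might look like:

```sql
-- Hypothetical target table; columns are inferred from the
-- merge query in Step 3, not from the original post.
CREATE TABLE orders (
  order_no         STRING,
  quantity         INT,
  amount           DOUBLE,
  last_update_date TIMESTAMP
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;
```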
Let's say the full load is done. Now we have data in our target table orders (for this example I have loaded only 5 records).
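As an illustration, a small full load could be done with an INSERT ... VALUES into the dynamic partition (all values here are hypothetical sample data, not the rows from the original post):

```sql
-- Dynamic partitioning must be enabled for partition-column inserts.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical sample rows; last column feeds the order_date partition.
INSERT INTO orders PARTITION (order_date)
VALUES ('O1', 10, 100.0, '2018-01-01 10:00:00', '2018-01-01'),
       ('O2',  5,  50.0, '2018-01-01 11:00:00', '2018-01-01');
```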
Step2: Create a stage table to hold all the delta data (records which need to be updated, plus new records which need to be inserted into the DW table).
I created a stage table orders_stage; its table creation script:
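The stage table mirrors the target layout so the UNION ALL in Step 3 works; a sketch using the same assumed columns:

```sql
-- Hypothetical stage table mirroring the assumed orders layout.
CREATE TABLE orders_stage (
  order_no         STRING,
  quantity         INT,
  amount           DOUBLE,
  last_update_date TIMESTAMP
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;
```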
Now load the delta data:
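If the delta files have already landed in HDFS, one way to load them is LOAD DATA (the path and partition value below are illustrative, not from the original post):

```sql
-- Illustrative only: adjust the HDFS path and partition value
-- to match your ingestion location.
LOAD DATA INPATH '/data/orders/delta/'
INTO TABLE orders_stage PARTITION (order_date='2018-01-02');
```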
Step3: Create one more temp table to hold the merged delta records from orders_stage and orders.
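Since the temp table must match the target's layout, the simplest sketch is to clone it (assuming the orders schema above):

```sql
-- Temp table with the same schema and partitioning as orders.
CREATE TABLE orders_temp LIKE orders;
```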
Now let's merge the orders and orders_stage data and load it into the temp table orders_temp using the script below:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert into orders_temp partition (order_date)
select t1.*
from (select * from orders
      union all
      select * from orders_stage) t1
join (select order_no, max(last_update_date) as last_update_date
      from (select * from orders
            union all
            select * from orders_stage) t2
      group by order_no) t3
  on t1.order_no = t3.order_no
 and t1.last_update_date = t3.last_update_date;
Step4: Overwrite the main Hive table orders from the temp table using dynamic partitioning.
insert overwrite table orders partition (order_date) select * from orders_temp;
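To confirm the merge worked, you can spot-check that each order now carries only its latest version (query is illustrative):

```sql
-- Each order_no should appear once, with its max last_update_date.
SELECT order_no, quantity, amount, last_update_date
FROM orders
ORDER BY order_no;
```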
Hoping this blog helps you understand incremental load in Hive. Please post your comments if you have any questions.