In this blog I will focus on incremental (delta) load, i.e. updating existing records, inserting new records, and dynamic partition table loading. In the BI world the delta/incremental load is a very common activity.
We can implement it on a Hive table using the steps below. In a relational database we would update existing records using a unique index/primary key, but in Hive we have to do it a different way.
There are many ways to do it; let us see one of them using the steps below. I am assuming that you have data in HDFS which is updated hourly/daily from your Sqoop/Flume/Kafka ingestion.
Design considerations:
- I am assuming that adequate temp space is maintained for your data volume, so that we do not face temp-space issues during the process.
- I am assuming that you have ingestion logic set up in your code for full and incremental loads, and that the data is available in HDFS.
- Partition column values should be converted to string and should not contain special characters like ‘/’ or ‘…’.
Note:
Hive user: There is no specific command to create a Hive user. Any HDFS user can connect to Hive, and if the user is authorized to access a table per the permissions set in Ranger, the user can access it.
Step1: Create a hive target table and do a full load from your source.
My target table is orders and its create statement:
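The original create statement did not survive here; a minimal sketch, assuming column names inferred from the merge query in Step 3 (order_no, quantity, amount, last_update_date, with order_date as the partition column), might look like:

```sql
-- Hypothetical target table; columns are inferred from the
-- merge query in Step 3, not from the original post.
CREATE TABLE orders (
  order_no         STRING,
  quantity         INT,
  amount           DOUBLE,
  last_update_date TIMESTAMP
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;
```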
Let's say the full load is done. Now we have data in our target table orders (for this example I have loaded only 5 records).
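As an illustration, a small full load could be done with an INSERT ... VALUES into the dynamic partition (all values here are hypothetical sample data, not the rows from the original post):

```sql
-- Dynamic partitioning must be enabled for partition-column inserts.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical sample rows; last column feeds the order_date partition.
INSERT INTO orders PARTITION (order_date)
VALUES ('O1', 10, 100.0, '2018-01-01 10:00:00', '2018-01-01'),
       ('O2',  5,  50.0, '2018-01-01 11:00:00', '2018-01-01');
```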
Step2: Create a stage table to hold all the delta data (records which need to be updated, plus new records which need to be inserted into the DW table).
I created a stage table orders_stage; its table creation script:
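The stage table mirrors the target layout so the UNION ALL in Step 3 works; a sketch using the same assumed columns:

```sql
-- Hypothetical stage table mirroring the assumed orders layout.
CREATE TABLE orders_stage (
  order_no         STRING,
  quantity         INT,
  amount           DOUBLE,
  last_update_date TIMESTAMP
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;
```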
Now load the delta data:
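If the delta files have already landed in HDFS, one way to load them is LOAD DATA (the path and partition value below are illustrative, not from the original post):

```sql
-- Illustrative only: adjust the HDFS path and partition value
-- to match your ingestion location.
LOAD DATA INPATH '/data/orders/delta/'
INTO TABLE orders_stage PARTITION (order_date='2018-01-02');
```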
Step3: Create one more temp table to hold the merged delta records from orders_stage and orders.
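Since the temp table must match the target's layout, the simplest sketch is to clone it (assuming the orders schema above):

```sql
-- Temp table with the same schema and partitioning as orders.
CREATE TABLE orders_temp LIKE orders;
```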
Now let's merge the orders and orders_stage data and load it into the temp table orders_temp using the script below:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert into orders_temp partition (order_date)
select t1.*
from (select * from orders
      union all
      select * from orders_stage) t1
join (select order_no, max(last_update_date) as last_update_date
      from (select * from orders
            union all
            select * from orders_stage) t2
      group by order_no) t3
  on t1.order_no = t3.order_no
 and t1.last_update_date = t3.last_update_date;
Step4: Overwrite the main Hive table orders from the temp table using dynamic partitioning.
insert overwrite table orders partition (order_date) select * from orders_temp;
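To confirm the merge worked, you can spot-check that each order now carries only its latest version (query is illustrative):

```sql
-- Each order_no should appear once, with its max last_update_date.
SELECT order_no, quantity, amount, last_update_date
FROM orders
ORDER BY order_no;
```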
Hoping this blog helps you understand incremental load in Hive. Please post your comments if you have any questions.