HBase Major Compaction

This post continues from my last two posts.

Each HBase table has

  • 1 or more Column-families – which group columns and determine the physical layout of the stored data
  • 1 or more Regions – which are akin to shards in the RDBMS world, i.e. a contiguous set of the table's rows bounded by a StartKey and an EndKey

For every Column-family of a table in a Region, we have a Store, which has

  • 1 MemStore – a buffer that holds modifications in memory (until it is flushed to store files)
  • 0 or more Store files (HFiles) – one of which is created each time the MemStore is flushed, e.g. when it fills up

These store files are immutable: HBase creates a new file on every MemStore flush and never writes to an existing HFile.
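
To make the flush behaviour concrete, here is a minimal toy model in Python. It is only a sketch of the idea and not how HBase is implemented: the class `ToyStore` and its methods are names I made up for illustration, and a "file" here is just an immutable tuple held in memory.

```python
class ToyStore:
    """Toy model of one Store (one column family in one region). Hypothetical."""

    def __init__(self):
        self.memstore = {}      # in-memory buffer: (row, column) -> value
        self.store_files = []   # every entry stands for one immutable store file

    def put(self, row, column, value):
        # Writes only touch the in-memory buffer, never an existing file.
        self.memstore[(row, column)] = value

    def flush(self):
        # Write the buffered cells, sorted by key, to a brand-new "file".
        if not self.memstore:
            return
        self.store_files.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}


store = ToyStore()
store.put("abhi", "info:name", "abhishek")
store.put("abhi", "info:age", "30")
store.flush()
store.put("avi", "info:name", "avinash")
store.flush()

print(len(store.store_files))  # 2, i.e. one new file per flush
```

Every flush appends a new entry and leaves the earlier ones untouched, which is why the number of store files keeps growing until a compaction runs.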

Compaction combines the store files of a Store into fewer store files, so that a read has fewer files to consult. There are two types of compaction.

  • Minor Compaction – combines several store files of a Store into fewer, larger store files
  • Major Compaction – reads all the store files of a Store and writes them to a single store file. When it is run on a Region, this happens for every Store in that Region, i.e. once per Column-family.
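
The difference between the two types can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not HBase code: a store file is modelled as a sorted list of (key, value) pairs, and the "merge the first two files" rule in `minor_compact` is invented for the example (the real file selection policy is more involved).

```python
import heapq


def merge_files(files):
    """Merge several sorted store files into one sorted store file."""
    return list(heapq.merge(*files))


def minor_compact(files, how_many=2):
    """Toy minor compaction: merge only the first `how_many` files."""
    return [merge_files(files[:how_many])] + files[how_many:]


def major_compact(files):
    """Toy major compaction: merge all files into a single file."""
    return [merge_files(files)]


files = [
    [("abhi/age", "30"), ("abhi/name", "abhishek")],
    [("avi/name", "avinash")],
    [("avi/age", "20")],
]

after_minor = minor_compact(files)
after_major = major_compact(files)
print(len(files), len(after_minor), len(after_major))  # 3 2 1
```

Because every input file is already sorted, the merge never has to re-sort anything; it only interleaves the inputs, and the result is again one sorted file.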

Let us see how Major Compaction impacts HBase storage.

Create a table and insert data.


hbase(main):021:0> create 'users','info'
0 row(s) in 1.0540 seconds

hbase(main):022:0> list
TABLE
tbl1
users
2 row(s) in 0.0160 seconds

hbase(main):023:0> put 'users','abhi','info:name','abhishek'
0 row(s) in 0.0730 seconds

hbase(main):024:0> put 'users','abhi','info:age','30'
0 row(s) in 0.0120 seconds

Let us browse the HBase Root Directory and see how the data gets persisted physically on the filesystem.


abhi@hbase2:~$ ls -ltha /opt/hbase/data/
total 48K
drwxrwxr-x 4 hbase hbase 4.0K Nov  3 14:50 users
drwxr-xr-x 8 hbase users 4.0K Nov  3 14:50 .
drwxrwxr-x 4 hbase hbase 4.0K Nov  3 07:43 tbl1
drwxrwxr-x 2 hbase hbase 4.0K Nov  3 05:35 .oldlogs
drwxrwxr-x 3 hbase hbase 4.0K Nov  3 05:34 .logs
drwxrwxr-x 4 hbase hbase 4.0K Oct 30 12:00 -ROOT-
drwxrwxr-x 3 hbase hbase 4.0K Oct 30 12:00 .META.
-rwxr-xr-x 1 hbase hbase   38 Oct 30 12:00 hbase.id
-rw-rw-r-- 1 hbase hbase   12 Oct 30 12:00 .hbase.id.crc
-rwxr-xr-x 1 hbase hbase    3 Oct 30 12:00 hbase.version
-rw-rw-r-- 1 hbase hbase   12 Oct 30 12:00 .hbase.version.crc
drwxr-xr-x 3 abhi  users 4.0K Oct 11 08:10 ..
abhi@hbase2:~$
abhi@hbase2:~$ ls -ltha /opt/hbase/data/users/
total 24K
drwxrwxr-x 4 hbase hbase 4.0K Nov  3 14:50 6dda0024cbf8619a9c823e6ebbf78888
drwxrwxr-x 4 hbase hbase 4.0K Nov  3 14:50 .
-rwxr-xr-x 1 hbase hbase  515 Nov  3 14:50 .tableinfo.0000000001
-rw-rw-r-- 1 hbase hbase   16 Nov  3 14:50 ..tableinfo.0000000001.crc
drwxrwxr-x 2 hbase hbase 4.0K Nov  3 14:50 .tmp
drwxr-xr-x 8 hbase users 4.0K Nov  3 14:50 ..
abhi@hbase2:~$ ls -ltha /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/
total 24K
drwxrwxr-x 4 hbase hbase 4.0K Nov  3 14:50 .
drwxrwxr-x 2 hbase hbase 4.0K Nov  3 14:50 .oldlogs
drwxrwxr-x 2 hbase hbase 4.0K Nov  3 14:50 info
-rwxr-xr-x 1 hbase hbase  222 Nov  3 14:50 .regioninfo
-rw-rw-r-- 1 hbase hbase   12 Nov  3 14:50 ..regioninfo.crc
drwxrwxr-x 4 hbase hbase 4.0K Nov  3 14:50 ..
abhi@hbase2:~$ ls -ltha /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/
total 8.0K
drwxrwxr-x 4 hbase hbase 4.0K Nov  3 14:50 ..
drwxrwxr-x 2 hbase hbase 4.0K Nov  3 14:50 .

As you can see above, HBase created

  • a directory ‘users’ for the table and under it
  • a sub-directory ‘6dda0024cbf8619a9c823e6ebbf78888’ for the Region and under it
  • a sub-directory ‘info’ for the Column-family

Once flushed, all modifications to columns that belong to the ‘info’ column-family are stored as store files under ‘/opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/’
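
The directory layout seen in the listings above can be written down as a small helper. This is just a sketch of the layout observed on this particular installation; the function name `store_dir` is my own, and newer HBase releases arrange the root directory differently, so treat the path scheme as an assumption tied to this setup.

```python
from pathlib import PurePosixPath


def store_dir(root_dir, table, encoded_region, family):
    """Build <root dir>/<table>/<encoded region name>/<column family>."""
    return PurePosixPath(root_dir) / table / encoded_region / family


path = store_dir(
    "/opt/hbase/data",                    # HBase root directory used in this post
    "users",                              # table
    "6dda0024cbf8619a9c823e6ebbf78888",   # encoded region name from the listing
    "info",                               # column family
)
print(path)  # /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info
```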

Although we have entered data into the table, we don’t see any store files yet, because all the data is still in the MemStore and has not been flushed. So let us flush the MemStore and view the contents of the ‘info’ directory.


hbase(main):025:0> flush 'users'
0 row(s) in 0.0390 seconds

abhi@hbase2:~$ ls -ltha /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/
total 16K
drwxrwxr-x 2 hbase hbase 4.0K Nov  3 14:52 .
drwxrwxr-x 5 hbase hbase 4.0K Nov  3 14:52 ..
-rwxrwxrwx 1 hbase hbase  660 Nov  3 14:52 32f19d12583a46b98211ee77311f48eb
-rw-rw-r-- 1 hbase hbase   16 Nov  3 14:52 .32f19d12583a46b98211ee77311f48eb.crc

Notice how the store file /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/32f19d12583a46b98211ee77311f48eb got created. Let us add some more data to our table, flush again and view the filesystem.


hbase(main):026:0> put 'users','avi','info:name','avinash'
0 row(s) in 0.0050 seconds

hbase(main):027:0> flush 'users'
0 row(s) in 0.0490 seconds
abhi@hbase2:~$ ls -ltha /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/
total 24K
drwxrwxr-x 2 hbase hbase 4.0K Nov  3 14:52 .
-rwxrwxrwx 1 hbase hbase  623 Nov  3 14:52 ecc5f02da6234ac397d25bee6df0d019
-rw-rw-r-- 1 hbase hbase   16 Nov  3 14:52 .ecc5f02da6234ac397d25bee6df0d019.crc
drwxrwxr-x 5 hbase hbase 4.0K Nov  3 14:52 ..
-rwxrwxrwx 1 hbase hbase  660 Nov  3 14:52 32f19d12583a46b98211ee77311f48eb
-rw-rw-r-- 1 hbase hbase   16 Nov  3 14:52 .32f19d12583a46b98211ee77311f48eb.crc

Let us add some more data and flush once more.

hbase(main):028:0> put 'users','avi','info:age','20'
0 row(s) in 0.0040 seconds

hbase(main):029:0> flush 'users'
0 row(s) in 0.1040 seconds
abhi@hbase2:~$ ls -ltha /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/
total 32K
drwxrwxr-x 2 hbase hbase 4.0K Nov  3 14:53 .
-rwxrwxrwx 1 hbase hbase  615 Nov  3 14:53 ebda0cc0af9a4d9e803a10cce27c52b6
-rw-rw-r-- 1 hbase hbase   16 Nov  3 14:53 .ebda0cc0af9a4d9e803a10cce27c52b6.crc
-rwxrwxrwx 1 hbase hbase  623 Nov  3 14:52 ecc5f02da6234ac397d25bee6df0d019
-rw-rw-r-- 1 hbase hbase   16 Nov  3 14:52 .ecc5f02da6234ac397d25bee6df0d019.crc
drwxrwxr-x 5 hbase hbase 4.0K Nov  3 14:52 ..
-rwxrwxrwx 1 hbase hbase  660 Nov  3 14:52 32f19d12583a46b98211ee77311f48eb
-rw-rw-r-- 1 hbase hbase   16 Nov  3 14:52 .32f19d12583a46b98211ee77311f48eb.crc
abhi@hbase2:~$

Notice how for each flush, a new store file gets created. Let us view the contents of these store files.

abhi@hbase2:~$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/ebda0cc0af9a4d9e803a10cce27c52b6 -p
12/11/03 14:55:59 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
12/11/03 14:55:59 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/11/03 14:56:00 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 247.9m
K: avi/info:age/1351979593884/Put/vlen=2 V: 20
Scanned kv count -> 1
abhi@hbase2:~$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/ecc5f02da6234ac397d25bee6df0d019 -p
12/11/03 14:56:19 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
12/11/03 14:56:19 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/11/03 14:56:20 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 247.9m
K: avi/info:name/1351979559394/Put/vlen=7 V: avinash
Scanned kv count -> 1
abhi@hbase2:~$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/32f19d12583a46b98211ee77311f48eb -p
12/11/03 14:56:31 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
12/11/03 14:56:31 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/11/03 14:56:31 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 247.9m
K: abhi/info:age/1351979477099/Put/vlen=2 V: 30
K: abhi/info:name/1351979467158/Put/vlen=8 V: abhishek
Scanned kv count -> 2
abhi@hbase2:~$

Alternatively, the long-form options --printkv and --file can be used to view the store file contents:

abhi@hbase2:~$ hbase org.apache.hadoop.hbase.io.hfile.HFile --printkv --file /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/ebda0cc0af9a4d9e803a10cce27c52b6
12/11/03 14:56:57 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
12/11/03 14:56:57 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/11/03 14:56:58 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 247.9m
K: avi/info:age/1351979593884/Put/vlen=2 V: 20
Scanned kv count -> 1
abhi@hbase2:~$
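
Each `K:` line printed above packs the row, column family, qualifier, timestamp, operation type and value into one string. If you want to work with these dumps programmatically, a small parser along the following lines may help. It is a sketch that only targets the output format shown in this post, and it assumes that row keys contain no `/` and family names contain no `:`; the names `Cell` and `parse_line` are my own.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    row: str
    family: str
    qualifier: str
    timestamp_ms: int
    op: str
    value: str


# Matches lines such as: K: avi/info:age/1351979593884/Put/vlen=2 V: 20
LINE = re.compile(
    r"^K: (?P<row>[^/]+)/(?P<family>[^:]+):(?P<qualifier>[^/]*)/"
    r"(?P<ts>\d+)/(?P<op>\w+)/vlen=(?P<vlen>\d+) V: (?P<value>.*)$"
)


def parse_line(line: str) -> Optional[Cell]:
    """Return a Cell for a 'K: ...' line, or None for any other line."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    return Cell(m["row"], m["family"], m["qualifier"],
                int(m["ts"]), m["op"], m["value"])


cell = parse_line("K: avi/info:age/1351979593884/Put/vlen=2 V: 20")
print(cell.row, cell.qualifier, cell.value)  # avi age 20
```

Log lines such as the WARN/INFO messages and the "Scanned kv count" summary do not match the pattern, so `parse_line` returns None for them and they can simply be skipped.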

Let us invoke Major Compaction to combine these files into a single new file.

hbase(main):030:0> major_compact 'users'
0 row(s) in 0.1000 seconds

hbase(main):031:0>
abhi@hbase2:~$
abhi@hbase2:~$ ls -ltha /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/
total 16K
drwxrwxr-x 2 hbase hbase 4.0K Nov  3 14:57 .
-rwxrwxrwx 1 hbase hbase  731 Nov  3 14:57 6a65463fa2814751b255fdcf1542cd0d
-rw-rw-r-- 1 hbase hbase   16 Nov  3 14:57 .6a65463fa2814751b255fdcf1542cd0d.crc
drwxrwxr-x 5 hbase hbase 4.0K Nov  3 14:52 ..
abhi@hbase2:~$
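
It is worth comparing the sizes before and after. The arithmetic below uses only the byte counts from the `ls` listings in this post (the small `.crc` checksum files are left out).

```python
# Store file sizes in bytes, copied from the `ls` listings in this post.
before = {
    "32f19d12583a46b98211ee77311f48eb": 660,
    "ecc5f02da6234ac397d25bee6df0d019": 623,
    "ebda0cc0af9a4d9e803a10cce27c52b6": 615,
}
after = {"6a65463fa2814751b255fdcf1542cd0d": 731}

total_before = sum(before.values())
total_after = sum(after.values())
saved = total_before - total_after

print(f"{len(before)} files, {total_before} bytes -> "
      f"{len(after)} file, {total_after} bytes (saved {saved} bytes)")
# 3 files, 1898 bytes -> 1 file, 731 bytes (saved 1167 bytes)
```

The single compacted file is much smaller than the three originals put together, even though it holds the same four cells. A likely explanation, and this is my reading rather than something measured here, is that every HFile carries some fixed bookkeeping of its own, so three tiny files pay that cost three times.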

Let us view the contents of the new file that got created as a result of major compaction.

abhi@hbase2:~$
abhi@hbase2:~$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f /opt/hbase/data/users/6dda0024cbf8619a9c823e6ebbf78888/info/6a65463fa2814751b255fdcf1542cd0d -p
12/11/03 14:58:23 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
12/11/03 14:58:23 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/11/03 14:58:23 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 247.9m
K: abhi/info:age/1351979477099/Put/vlen=2 V: 30
K: abhi/info:name/1351979467158/Put/vlen=8 V: abhishek
K: avi/info:age/1351979593884/Put/vlen=2 V: 20
K: avi/info:name/1351979559394/Put/vlen=7 V: avinash
Scanned kv count -> 4
abhi@hbase2:~$
abhi@hbase2:~$