GGFS class that provides the ability to group a file's data blocks together on one node. All blocks within the same group are guaranteed to be cached together on the same node. The group size parameter controls how many sequential blocks will be cached together on the same node. For example, if the block size is 64kb and the group size is 256, then each group will contain 64kb * 256 = 16Mb of data. Larger group sizes reduce the number of splits required to run map-reduce tasks, but increase the inequality of data sizes stored on different nodes.
Note that the #groupSize() parameter must correlate with the Hadoop split size, defined in Hadoop via the mapred.max.split.size property. Ideally, all blocks accessed within one split should map to a single group, so they can be located on the same grid node.
For example, the default Hadoop split size is 64mb and the default GGFS block size is 64kb. This means that to make sure each split goes only through blocks on the same node (without hopping between nodes over the network), the #groupSize() value has to equal 64mb / 64kb = 1024.
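The calculation above can be sketched as a small helper. This is purely illustrative arithmetic; the class and method names here are not part of the GGFS API:

```java
// Sketch: deriving a suitable #groupSize() value from the Hadoop split
// size and the GGFS block size. Class and method names are illustrative.
public class GroupSizeCalc {
    /** Number of sequential blocks that one split covers. */
    public static int groupSize(long splitSizeBytes, long blockSizeBytes) {
        return (int)(splitSizeBytes / blockSizeBytes);
    }

    public static void main(String[] args) {
        long splitSize = 64L * 1024 * 1024; // default Hadoop split size: 64mb
        long blockSize = 64L * 1024;        // default GGFS block size: 64kb
        System.out.println(groupSize(splitSize, blockSize)); // prints 1024
    }
}
```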
The GGFS data cache is required to be configured with this mapper. Here is an example of how it can be specified in XML configuration:
<bean id="cacheCfgBase" class="org.gridgain.grid.cache.GridCacheConfiguration" abstract="true">
    ...
    <property name="affinityMapper">
        <bean class="org.gridgain.grid.ggfs.GridGgfsGroupDataBlocksKeyMapper">
            <!-- How many sequential blocks will be stored on the same node. -->
            <constructor-arg value="512"/>
        </bean>
    </property>
    ...
</bean>
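The same mapper can also be set programmatically on the cache configuration. A minimal sketch, assuming the GridGain classes shown are on the classpath (this is a configuration fragment, not a complete runnable application):

```java
import org.gridgain.grid.cache.GridCacheConfiguration;
import org.gridgain.grid.ggfs.GridGgfsGroupDataBlocksKeyMapper;

// Sketch: configuring the GGFS data cache with the group mapper in code,
// mirroring the XML example above. The group size of 512 matches the XML;
// use 1024 for the default 64mb split / 64kb block combination.
GridCacheConfiguration cacheCfg = new GridCacheConfiguration();

cacheCfg.setAffinityMapper(new GridGgfsGroupDataBlocksKeyMapper(512));
```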