InputFormat in MapReduce

Preface

MapReduce can read data in many different formats, all of which are handled by classes that extend InputFormat.
Below we look at the classes Hadoop provides and what each one does.

InputFormat

InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:

  • Validate the input-specification of the job.
  • Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
  • Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of file-based InputFormats, typically sub-classes of FileInputFormat, is to split the input into logical InputSplits based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.

Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases the application also has to implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.
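
To make the division of labor concrete, here is a minimal driver sketch (the class name InputFormatDemo and the input path are placeholders) showing where the InputFormat and the split-size lower bound plug into a job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputFormatDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Lower bound on split size, in bytes (the property mentioned above).
            conf.setLong("mapreduce.input.fileinputformat.split.minsize", 64L * 1024 * 1024);
            Job job = Job.getInstance(conf, "inputformat-demo");
            // The framework asks this class to validate the input, compute
            // the logical InputSplits, and create a RecordReader per split.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... set mapper, reducer and output types as usual ...
        }
    }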

Subclasses of InputFormat

FileInputFormat

FileInputFormat is the base class for all file-based InputFormats. It provides a generic implementation of getSplits(JobContext). Subclasses of FileInputFormat can also override the isSplitable(JobContext, Path) method to ensure input files are not split up and are processed whole by Mappers.
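
For instance, disabling splitting in a subclass is a one-method override; a minimal sketch (the subclass name is hypothetical):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Hypothetical subclass: reuses TextInputFormat's RecordReader but
    // forces each input file to be processed as a whole by one Mapper.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split, regardless of file size or blocksize
        }
    }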

CombineFileInputFormat

An abstract InputFormat that returns CombineFileSplits from the InputFormat.getSplits(JobContext) method. Splits are constructed from the files under the input paths. A split cannot have files from different pools. Each split returned may contain blocks from different files. If a maxSplitSize is specified, then blocks on the same node are combined to form a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If the maxSplitSize is equal to the block size, then this class is similar to the default splitting behavior in Hadoop: each block is a locally processed split. Subclasses implement InputFormat.createRecordReader(InputSplit, TaskAttemptContext) to construct RecordReaders for CombineFileSplits.
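
In practice the concrete subclass CombineTextInputFormat is often enough for the classic small-files problem; a rough usage sketch, assuming a 128 MB cap per combined split (the input path is a placeholder):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // Pack many small files into fewer splits, capping each
            // combined split at roughly 128 MB.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            CombineTextInputFormat.addInputPath(job, new Path("/data/small-files"));
        }
    }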

FixedLengthInputFormat

FixedLengthInputFormat is an input format used to read input files which contain fixed-length records. The content of a record need not be text; it can be arbitrary binary data. Users must configure the record length property by calling:

    FixedLengthInputFormat.setRecordLength(conf, recordLength);

or:

    conf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, recordLength);
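
A minimal driver sketch, assuming 100-byte records (the job name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;

    public class FixedLengthDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Every record is exactly 100 bytes; no delimiters are expected.
            FixedLengthInputFormat.setRecordLength(conf, 100);
            Job job = Job.getInstance(conf, "fixed-length-demo");
            job.setInputFormatClass(FixedLengthInputFormat.class);
        }
    }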

KeyValueTextInputFormat

An InputFormat for plain text files. Files are broken into lines. Either line feed or carriage return is used to signal end of line. Each line is divided into key and value parts by a separator byte. If no such byte exists, the key will be the entire line and the value will be empty. The separator byte can be specified in the config file under the attribute name mapreduce.input.keyvaluelinerecordreader.key.value.separator. The default is the tab character ('\t').
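
For example, to split lines at the first ':' instead of the default tab (a minimal sketch; the job name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    public class KeyValueDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Split each line at the first ':' instead of the default tab.
            conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ":");
            Job job = Job.getInstance(conf, "kv-demo");
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            // Mappers now receive (Text key, Text value) pairs.
        }
    }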

NLineInputFormat

NLineInputFormat splits N lines of input into one split. In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters (referred to as "parameter sweeps"). One way to achieve this is to specify a set of parameters (one set per line) as input in a control file, which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf. NLineInputFormat can be used in such applications: it splits the input file such that, by default, one line is fed as the value to one map task, and the key is the offset, i.e. (k, v) is (LongWritable, Text). The location hints will span the whole mapred cluster.
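
A minimal sketch of feeding N lines per mapper instead of the default single line (10 is an arbitrary example):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class NLineDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setInputFormatClass(NLineInputFormat.class);
            // Each map task now receives 10 lines of the control file
            // (sets mapreduce.input.lineinputformat.linespermap).
            NLineInputFormat.setNumLinesPerSplit(job, 10);
        }
    }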

SequenceFileInputFormat

An InputFormat for SequenceFiles.
Direct Known Subclasses:
SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat, SequenceFileInputFilter
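
A minimal sketch; the mapper's input key/value types must match whatever types the SequenceFile was written with, so the (Text, IntWritable) pairing mentioned in the comment is an assumption:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

    public class SequenceFileDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // Keys and values are deserialized with the types stored in the
            // SequenceFile header, so a matching Mapper would be declared as
            // Mapper<Text, IntWritable, ...> for a (Text, IntWritable) file.
            job.setInputFormatClass(SequenceFileInputFormat.class);
        }
    }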

TextInputFormat

An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage return is used to signal end of line. Keys are the position in the file, and values are the lines of text.
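
A matching mapper signature looks like this (a minimal sketch; the emit logic is a placeholder):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // With TextInputFormat the key is the byte offset of the line in the
    // file (LongWritable) and the value is the line itself (Text).
    public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, offset); // placeholder: emit line -> its offset
        }
    }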

DBInputFormat

An InputFormat that reads input data from an SQL table.
DBInputFormat emits LongWritables containing the record number as key and DBWritables as value. The SQL query and input class can be specified using one of the two setInput methods.
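
A rough sketch of the table-based setInput variant; the JDBC driver, connection string, credentials, table and column names are all placeholders, and UserRecord is a hypothetical record class (a real job's value class would typically also implement Writable):

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    public class DBDemo {
        // Hypothetical record class mapping one row of the "users" table.
        public static class UserRecord implements DBWritable {
            long id;
            String name;
            public void readFields(ResultSet rs) throws SQLException {
                id = rs.getLong("id");
                name = rs.getString("name");
            }
            public void write(PreparedStatement ps) throws SQLException {
                ps.setLong(1, id);
                ps.setString(2, name);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // JDBC driver, connection string and credentials (placeholders).
            DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://localhost/mydb", "user", "password");
            Job job = Job.getInstance(conf, "db-demo");
            job.setInputFormatClass(DBInputFormat.class);
            DBInputFormat.setInput(job, UserRecord.class,
                    "users",       // table name
                    null,          // WHERE conditions
                    "id",          // ORDER BY column
                    "id", "name"); // columns to read
        }
    }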

CompositeInputFormat

An InputFormat capable of performing joins over a set of data sources sorted and partitioned the same way. A user may define new join types by setting the property mapreduce.join.define.<ident> to a classname. In the expression mapreduce.join.expr, the identifier will be assumed to be a ComposableRecordReader. mapreduce.join.keycomparator can be a classname used to compare keys in the join.
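
A rough sketch of an inner join over two identically sorted and partitioned inputs (the paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;

    public class JoinDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // compose() builds the join expression stored in mapreduce.join.expr.
            conf.set("mapreduce.join.expr", CompositeInputFormat.compose(
                    "inner", KeyValueTextInputFormat.class,
                    new Path("/data/a"), new Path("/data/b")));
            Job job = Job.getInstance(conf, "join-demo");
            job.setInputFormatClass(CompositeInputFormat.class);
        }
    }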

ComposableInputFormat

Refinement of InputFormat requiring implementors to provide ComposableRecordReader instead of RecordReader.