file_sorter(3erl) Erlang Module Definition file_sorter(3erl)
NAME
file_sorter - File sorter.
DESCRIPTION
This module contains functions for sorting terms on files, merging al-
ready sorted files, and checking files for sortedness. Chunks contain-
ing binary terms are read from a sequence of files, sorted internally
in memory and written on temporary files, which are merged producing
one sorted file as output. Merging is provided as an optimization; it
is faster when the files are already sorted, but it always works to
sort instead of merge.
On a file, a term is represented by a header and a binary. Two options
define the format of terms on files:
{header, HeaderLength}:
HeaderLength determines the number of bytes preceding each binary
and containing the length of the binary in bytes. Defaults to 4.
The order of the header bytes is defined as follows: if B is a bi-
nary containing a header only, size Size of the binary is calcu-
lated as <<Size:HeaderLength/unit:8>> = B.
{format, Format}:
Option Format determines the function that is applied to binaries
to create the terms to be sorted. Defaults to binary_term, which is
equivalent to fun binary_to_term/1. Value binary is equivalent to
fun(X) -> X end, which means that the binaries are sorted as they
are. This is the fastest format. If Format is term, io:read/2 is
called to read terms. In that case, only the default value of op-
tion header is allowed.
Option format also determines what is written to the sorted output
file: if Format is term, then io:format/3 is called to write each
term, otherwise the binary prefixed by a header is written. Notice
that the binary written is the same binary that was read; the re-
sults of applying function Format are thrown away when the terms
have been sorted. Reading and writing terms using the io module is
much slower than reading and writing binaries.
Other options are:
{order, Order}:
The default is to sort terms in ascending order, but that can be
changed by value descending or by specifying an ordering function
Fun. An ordering function is antisymmetric, transitive, and total.
Fun(A, B) is to return true if A comes before B in the ordering,
otherwise false. An example of a typical ordering function is less
than or equal to, =</2. Using an ordering function slows down the
sort considerably. Functions keysort, keymerge and keycheck do not
accept ordering functions.
{unique, boolean()}:
When sorting or merging files, only the first of a sequence of
terms that compare equal (==) is output if this option is set to
true. Defaults to false, which implies that all terms that compare
equal are output. When checking files for sortedness, a check that
no pair of consecutive terms compares equal is done if this option
is set to true.
{tmpdir, TempDirectory}:
The directory where temporary files are put can be chosen explic-
itly. The default, implied by value "", is to put temporary files
on the same directory as the sorted output file. If output is a
function (see below), the directory returned by file:get_cwd() is
used instead. The names of temporary files are derived from the Er-
lang nodename (node()), the process identifier of the current Er-
lang emulator (os:getpid()), and a unique integer (er-
lang:unique_integer([positive])). A typical name is fs_mynode@my-
host_1763_4711.17, where 17 is a sequence number. Existing files
are overwritten. Temporary files are deleted unless some uncaught
EXIT signal occurs.
{compressed, boolean()}:
Temporary files and the output file can be compressed. Defaults
false, which implies that written files are not compressed. Regard-
less of the value of option compressed, compressed files can always
be read. Notice that reading and writing compressed files are sig-
nificantly slower than reading and writing uncompressed files.
{size, Size}:
By default about 512*1024 bytes read from files are sorted inter-
nally. This option is rarely needed.
{no_files, NoFiles}:
By default 16 files are merged at a time. This option is rarely
needed.
As an alternative to sorting files, a function of one argument can be
specified as input. When called with argument read, the function is as-
sumed to return either of the following:
* end_of_input or {end_of_input, Value}} when there is no more input
(Value is explained below).
* {Objects, Fun}, where Objects is a list of binaries or terms de-
pending on the format, and Fun is a new input function.
Any other value is immediately returned as value of the current call to
sort or keysort. Each input function is called exactly once. If an er-
ror occurs, the last function is called with argument close, the reply
of which is ignored.
A function of one argument can be specified as output. The results of
sorting or merging the input is collected in a non-empty sequence of
variable length lists of binaries or terms depending on the format. The
output function is called with one list at a time, and is assumed to
return a new output function. Any other return value is immediately re-
turned as value of the current call to the sort or merge function. Each
output function is called exactly once. When some output function has
been applied to all of the results or an error occurs, the last func-
tion is called with argument close, and the reply is returned as value
of the current call to the sort or merge function.
If a function is specified as input and the last input function returns
{end_of_input, Value}, the function specified as output is called with
argument {value, Value}. This makes it easy to initiate the sequence of
output functions with a value calculated by the input functions.
As an example, consider sorting the terms on a disk log file. A func-
tion that reads chunks from the disk log and returns a list of binaries
is used as input. The results are collected in a list of terms.
sort(Log) ->
{ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
Input = input(Log, start),
Output = output([]),
Reply = file_sorter:sort(Input, Output, {format,term}),
ok = disk_log:close(Log),
Reply.
input(Log, Cont) ->
fun(close) ->
ok;
(read) ->
case disk_log:chunk(Log, Cont) of
{error, Reason} ->
{error, Reason};
{Cont2, Terms} ->
{Terms, input(Log, Cont2)};
{Cont2, Terms, _Badbytes} ->
{Terms, input(Log, Cont2)};
eof ->
end_of_input
end
end.
output(L) ->
fun(close) ->
lists:append(lists:reverse(L));
(Terms) ->
output([Terms | L])
end.
For more examples of functions as input and output, see the end of the
file_sorter module; the term format is implemented with functions.
The possible values of Reason returned when an error occurs are:
* bad_object, {bad_object, FileName} - Applying the format function
failed for some binary, or the key(s) could not be extracted from
some term.
* {bad_term, FileName} - io:read/2 failed to read some term.
* {file_error, FileName, file:posix()} - For an explanation of
file:posix(), see file(3erl).
* {premature_eof, FileName} - End-of-file was encountered inside some
binary term.
DATA TYPES
file_name() = file:name()
file_names() = [file:name()]
i_command() = read | close
i_reply() =
end_of_input |
{end_of_input, value()} |
{[object()], infun()} |
input_reply()
infun() = fun((i_command()) -> i_reply())
input() = file_names() | infun()
input_reply() = term()
o_command() = {value, value()} | [object()] | close
o_reply() = outfun() | output_reply()
object() = term() | binary()
outfun() = fun((o_command()) -> o_reply())
output() = file_name() | outfun()
output_reply() = term()
value() = term()
options() = [option()] | option()
option() =
{compressed, boolean()} |
{header, header_length()} |
{format, format()} |
{no_files, no_files()} |
{order, order()} |
{size, size()} |
{tmpdir, tmp_directory()} |
{unique, boolean()}
format() = binary_term | term | binary | format_fun()
format_fun() = fun((binary()) -> term())
header_length() = integer() >= 1
key_pos() = integer() >= 1 | [integer() >= 1]
no_files() = integer() >= 1
order() = ascending | descending | order_fun()
order_fun() = fun((term(), term()) -> boolean())
size() = integer() >= 0
tmp_directory() = [] | file:name()
reason() =
bad_object |
{bad_object, file_name()} |
{bad_term, file_name()} |
{file_error,
file_name(),
file:posix() | badarg | system_limit} |
{premature_eof, file_name()}
EXPORTS
check(FileName) -> Reply
check(FileNames, Options) -> Reply
Types:
FileNames = file_names()
Options = options()
Reply = {ok, [Result]} | {error, reason()}
Result = {FileName, TermPosition, term()}
FileName = file_name()
TermPosition = integer() >= 1
Checks files for sortedness. If a file is not sorted, the first
out-of-order element is returned. The first term on a file has
position 1.
check(FileName) is equivalent to check([FileName], []).
keycheck(KeyPos, FileName) -> Reply
keycheck(KeyPos, FileNames, Options) -> Reply
Types:
KeyPos = key_pos()
FileNames = file_names()
Options = options()
Reply = {ok, [Result]} | {error, reason()}
Result = {FileName, TermPosition, term()}
FileName = file_name()
TermPosition = integer() >= 1
Checks files for sortedness. If a file is not sorted, the first
out-of-order element is returned. The first term on a file has
position 1.
keycheck(KeyPos, FileName) is equivalent to keycheck(KeyPos,
[FileName], []).
keymerge(KeyPos, FileNames, Output) -> Reply
keymerge(KeyPos, FileNames, Output, Options) -> Reply
Types:
KeyPos = key_pos()
FileNames = file_names()
Output = output()
Options = options()
Reply = ok | {error, reason()} | output_reply()
Merges tuples on files. Each input file is assumed to be sorted
on key(s).
keymerge(KeyPos, FileNames, Output) is equivalent to
keymerge(KeyPos, FileNames, Output, []).
keysort(KeyPos, FileName) -> Reply
Types:
KeyPos = key_pos()
FileName = file_name()
Reply = ok | {error, reason()} | input_reply() | output_re-
ply()
Sorts tuples on files.
keysort(N, FileName) is equivalent to keysort(N, [FileName],
FileName).
keysort(KeyPos, Input, Output) -> Reply
keysort(KeyPos, Input, Output, Options) -> Reply
Types:
KeyPos = key_pos()
Input = input()
Output = output()
Options = options()
Reply = ok | {error, reason()} | input_reply() | output_re-
ply()
Sorts tuples on files. The sort is performed on the element(s)
mentioned in KeyPos. If two tuples compare equal (==) on one el-
ement, the next element according to KeyPos is compared. The
sort is stable.
keysort(N, Input, Output) is equivalent to keysort(N, Input,
Output, []).
merge(FileNames, Output) -> Reply
merge(FileNames, Output, Options) -> Reply
Types:
FileNames = file_names()
Output = output()
Options = options()
Reply = ok | {error, reason()} | output_reply()
Merges terms on files. Each input file is assumed to be sorted.
merge(FileNames, Output) is equivalent to merge(FileNames, Out-
put, []).
sort(FileName) -> Reply
Types:
FileName = file_name()
Reply = ok | {error, reason()} | input_reply() | output_re-
ply()
Sorts terms on files.
sort(FileName) is equivalent to sort([FileName], FileName).
sort(Input, Output) -> Reply
sort(Input, Output, Options) -> Reply
Types:
Input = input()
Output = output()
Options = options()
Reply = ok | {error, reason()} | input_reply() | output_re-
ply()
Sorts terms on files.
sort(Input, Output) is equivalent to sort(Input, Output, []).
Ericsson AB stdlib 3.13 file_sorter(3erl)