aboutsummaryrefslogtreecommitdiffstats
path: root/lib/stdlib/doc/src/file_sorter.xml
diff options
context:
space:
mode:
authorErlang/OTP <[email protected]>2009-11-20 14:54:40 +0000
committerErlang/OTP <[email protected]>2009-11-20 14:54:40 +0000
commit84adefa331c4159d432d22840663c38f155cd4c1 (patch)
treebff9a9c66adda4df2106dfd0e5c053ab182a12bd /lib/stdlib/doc/src/file_sorter.xml
downloadotp-84adefa331c4159d432d22840663c38f155cd4c1.tar.gz
otp-84adefa331c4159d432d22840663c38f155cd4c1.tar.bz2
otp-84adefa331c4159d432d22840663c38f155cd4c1.zip
The R13B03 release.OTP_R13B03
Diffstat (limited to 'lib/stdlib/doc/src/file_sorter.xml')
-rw-r--r--lib/stdlib/doc/src/file_sorter.xml390
1 files changed, 390 insertions, 0 deletions
diff --git a/lib/stdlib/doc/src/file_sorter.xml b/lib/stdlib/doc/src/file_sorter.xml
new file mode 100644
index 0000000000..b3f4da294c
--- /dev/null
+++ b/lib/stdlib/doc/src/file_sorter.xml
@@ -0,0 +1,390 @@
+<?xml version="1.0" encoding="latin1" ?>
+<!DOCTYPE erlref SYSTEM "erlref.dtd">
+
+<erlref>
+ <header>
+ <copyright>
+ <year>2001</year><year>2009</year>
+ <holder>Ericsson AB. All Rights Reserved.</holder>
+ </copyright>
+ <legalnotice>
+ The contents of this file are subject to the Erlang Public License,
+ Version 1.1, (the "License"); you may not use this file except in
+ compliance with the License. You should have received a copy of the
+ Erlang Public License along with this software. If not, it can be
+ retrieved online at http://www.erlang.org/.
+
+ Software distributed under the License is distributed on an "AS IS"
+ basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
+ the License for the specific language governing rights and limitations
+ under the License.
+
+ </legalnotice>
+
+ <title>file_sorter</title>
+ <prepared>Hans Bolinder</prepared>
+ <responsible>nobody</responsible>
+ <docno></docno>
+ <approved>nobody</approved>
+ <checked>no</checked>
+ <date>2001-03-13</date>
+ <rev>PA1</rev>
+ <file>file_sorter.sgml</file>
+ </header>
+ <module>file_sorter</module>
+ <modulesummary>File Sorter</modulesummary>
+ <description>
+ <p>The functions of this module sort terms on files, merge already
+ sorted files, and check files for sortedness. Chunks containing
+ binary terms are read from a sequence of files, sorted
+ internally in memory and written on temporary files, which are
+ merged producing one sorted file as output. Merging is provided
+ as an optimization; it is faster when the files are already
+ sorted, but it always works to sort instead of merge.
+ </p>
+ <p>On a file, a term is represented by a header and a binary. Two
+ options define the format of terms on files:
+ </p>
+ <list type="bulleted">
+ <item><c>{header, HeaderLength}</c>. HeaderLength determines the
+ number of bytes preceding each binary and containing the
+ length of the binary in bytes. Default is 4. The order of the
+ header bytes is defined as follows: if <c>B</c> is a binary
+ containing a header only, the size <c>Size</c> of the binary
+ is calculated as
+ <c><![CDATA[<<Size:HeaderLength/unit:8>> = B]]></c>.
+ </item>
+ <item><c>{format, Format}</c>. The format determines the
+ function that is applied to binaries in order to create the
+ terms that will be sorted. The default value is
+ <c>binary_term</c>, which is equivalent to
+ <c>fun&nbsp;binary_to_term/1</c>. The value <c>binary</c> is
+ equivalent to <c>fun(X) -> X end</c>, which means that the
+ binaries will be sorted as they are. This is the fastest
+ format. If <c>Format</c> is <c>term</c>, <c>io:read/2</c> is
+ called to read terms. In that case only the default value of
+ the <c>header</c> option is allowed. The <c>format</c> option
+ also determines what is written to the sorted output file: if
+ <c>Format</c> is <c>term</c> then <c>io:format/3</c> is called
+ to write each term, otherwise the binary prefixed by a header
+ is written. Note that the binary written is the same binary
+ that was read; the results of applying the <c>Format</c>
+ function are thrown away as soon as the terms have been
+ sorted. Reading and writing terms using the <c>io</c> module
+ is very much slower than reading and writing binaries.
+ </item>
+ </list>
+ <p>Other options are:
+ </p>
+ <list type="bulleted">
+ <item><c>{order, Order}</c>. The default is to sort terms in
+ ascending order, but that can be changed by the value
+ <c>descending</c> or by giving an ordering function <c>Fun</c>.
+ An ordering function is antisymmetric, transitive and total.
+ <c>Fun(A,&nbsp;B)</c> should return <c>true</c> if <c>A</c>
+ comes before <c>B</c> in the ordering, <c>false</c> otherwise.
+ Using an ordering function will slow down the sort
+ considerably. The <c>keysort</c>, <c>keymerge</c> and
+ <c>keycheck</c> functions do not accept ordering functions.
+ </item>
+ <item><c>{unique, bool()}</c>. When sorting or merging files,
+ only the first of a sequence of terms that compare equal is
+ output if this option is set to <c>true</c>. The default
+ value is <c>false</c> which implies that all terms that
+ compare equal are output. When checking files for
+ sortedness, a check that no pair of consecutive terms
+ compares equal is done if this option is set to <c>true</c>.
+ </item>
+ <item><c>{tmpdir, TempDirectory}</c>. The directory where
+ temporary files are put can be chosen explicitly. The
+ default, implied by the value <c>""</c>, is to put temporary
+ files on the same directory as the sorted output file. If
+ output is a function (see below), the directory returned by
+ <c>file:get_cwd()</c> is used instead. The names of
+ temporary files are derived from the Erlang nodename
+ (<c>node()</c>), the process identifier of the current Erlang
+ emulator (<c>os:getpid()</c>), and a timestamp
+ (<c>erlang:now()</c>); a typical name would be
+ <c>fs_mynode@myhost_1763_1043_337000_266005.17</c>, where
+ <c>17</c> is a sequence number. Existing files will be
+ overwritten. Temporary files are deleted unless some
+ uncaught EXIT signal occurs.
+ </item>
+ <item><c>{compressed, bool()}</c>. Temporary files and the
+ output file may be compressed. The default value
+ <c>false</c> implies that written files are not
+ compressed. Regardless of the value of the <c>compressed</c>
+ option, compressed files can always be read. Note that
+ reading and writing compressed files is significantly slower
+ than reading and writing uncompressed files.
+ </item>
+ <item><c>{size, Size}</c>. By default approximately 512*1024
+ bytes read from files are sorted internally. This option
+ should rarely be needed.
+ </item>
+ <item><c>{no_files, NoFiles}</c>. By default 16 files are
+ merged at a time. This option should rarely be needed.
+ </item>
+ </list>
+ <p>To summarize, here is the syntax of the options:</p>
+ <list type="bulleted">
+ <item>
+ <p><c>Options = [Option] | Option</c></p>
+ </item>
+ <item>
+ <p><c>Option = {header, HeaderLength} | {format, Format} | {order, Order} | {unique, bool()} | {tmpdir, TempDirectory} | {compressed, bool()} | {size, Size} | {no_files, NoFiles}</c></p>
+ </item>
+ <item>
+ <p><c>HeaderLength = int() > 0</c></p>
+ </item>
+ <item>
+ <p><c>Format = binary_term | term | binary | FormatFun</c></p>
+ </item>
+ <item>
+ <p><c>FormatFun = fun(Binary) -> Term</c></p>
+ </item>
+ <item>
+ <p><c>Order = ascending | descending | OrderFun</c></p>
+ </item>
+ <item>
+ <p><c>OrderFun = fun(Term, Term) -> bool()</c></p>
+ </item>
+ <item>
+ <p><c>TempDirectory = "" | file_name()</c></p>
+ </item>
+ <item>
+ <p><c>Size = int() >= 0</c></p>
+ </item>
+ <item>
+ <p><c>NoFiles = int() > 1</c></p>
+ </item>
+ </list>
+ <p>As an alternative to sorting files, a function of one argument
+ can be given as input. When called with the argument <c>read</c>
+ the function is assumed to return <c>end_of_input</c> or
+ <c>{end_of_input, Value}}</c> when there is no more input
+ (<c>Value</c> is explained below), or <c>{Objects, Fun}</c>,
+ where <c>Objects</c> is a list of binaries or terms depending on
+ the format and <c>Fun</c> is a new input function. Any other
+ value is immediately returned as value of the current call to
+ <c>sort</c> or <c>keysort</c>. Each input function will be
+ called exactly once, and should an error occur, the last
+ function is called with the argument <c>close</c>, the reply of
+ which is ignored.
+ </p>
+ <p>A function of one argument can be given as output. The results
+ of sorting or merging the input is collected in a non-empty
+ sequence of variable length lists of binaries or terms depending
+ on the format. The output function is called with one list at a
+ time, and is assumed to return a new output function. Any other
+ return value is immediately returned as value of the current
+ call to the sort or merge function. Each output function is
+ called exactly once. When some output function has been applied
+ to all of the results or an error occurs, the last function is
+ called with the argument <c>close</c>, and the reply is returned
+ as value of the current call to the sort or merge function. If a
+ function is given as input and the last input function returns
+ <c>{end_of_input, Value}</c>, the function given as output will
+ be called with the argument <c>{value, Value}</c>. This makes it
+ easy to initiate the sequence of output functions with a value
+ calculated by the input functions.
+ </p>
+ <p>As an example, consider sorting the terms on a disk log file.
+ A function that reads chunks from the disk log and returns a
+ list of binaries is used as input. The results are collected in
+ a list of terms.</p>
+ <pre>
+sort(Log) ->
+ {ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
+ Input = input(Log, start),
+ Output = output([]),
+ Reply = file_sorter:sort(Input, Output, {format,term}),
+ ok = disk_log:close(Log),
+ Reply.
+
+input(Log, Cont) ->
+ fun(close) ->
+ ok;
+ (read) ->
+ case disk_log:chunk(Log, Cont) of
+ {error, Reason} ->
+ {error, Reason};
+ {Cont2, Terms} ->
+ {Terms, input(Log, Cont2)};
+ {Cont2, Terms, _Badbytes} ->
+ {Terms, input(Log, Cont2)};
+ eof ->
+ end_of_input
+ end
+ end.
+
+output(L) ->
+ fun(close) ->
+ lists:append(lists:reverse(L));
+ (Terms) ->
+ output([Terms | L])
+ end. </pre>
+ <p>Further examples of functions as input and output can be found
+ at the end of the <c>file_sorter</c> module; the <c>term</c>
+ format is implemented with functions.
+ </p>
+ <p>The possible values of <c>Reason</c> returned when an error
+ occurs are:</p>
+ <list type="bulleted">
+ <item>
+ <p><c>bad_object</c>, <c>{bad_object, FileName}</c>.
+ Applying the format function failed for some binary,
+ or the key(s) could not be extracted from some term.</p>
+ </item>
+ <item>
+ <p><c>{bad_term, FileName}</c>. <c>io:read/2</c> failed
+ to read some term.</p>
+ </item>
+ <item>
+ <p><c>{file_error, FileName, Reason2}</c>. See
+ <c>file(3)</c> for an explanation of <c>Reason2</c>.</p>
+ </item>
+ <item>
+ <p><c>{premature_eof, FileName}</c>. End-of-file was
+ encountered inside some binary term.</p>
+ </item>
+ </list>
+ <p><em>Types</em></p>
+ <pre>
+Binary = binary()
+FileName = file_name()
+FileNames = [FileName]
+ICommand = read | close
+IReply = end_of_input | {end_of_input, Value} | {[Object], Infun} | InputReply
+Infun = fun(ICommand) -> IReply
+Input = FileNames | Infun
+InputReply = Term
+KeyPos = int() > 0 | [int() > 0]
+OCommand = {value, Value} | [Object] | close
+OReply = Outfun | OutputReply
+Object = Term | Binary
+Outfun = fun(OCommand) -> OReply
+Output = FileName | Outfun
+OutputReply = Term
+Term = term()
+Value = Term</pre>
+ </description>
+ <funcs>
+ <func>
+ <name>sort(FileName) -> Reply</name>
+ <name>sort(Input, Output) -> Reply</name>
+ <name>sort(Input, Output, Options) -> Reply</name>
+ <fsummary>Sort terms on files.</fsummary>
+ <type>
+ <v>Reply = ok | {error, Reason} | InputReply | OutputReply</v>
+ </type>
+ <desc>
+ <p>Sorts terms on files.
+ </p>
+ <p><c>sort(FileName)</c> is equivalent to
+ <c>sort([FileName], FileName)</c>.
+ </p>
+ <p><c>sort(Input, Output)</c> is equivalent to
+ <c>sort(Input, Output, [])</c>.
+ </p>
+ <p></p>
+ </desc>
+ </func>
+ <func>
+ <name>keysort(KeyPos, FileName) -> Reply</name>
+ <name>keysort(KeyPos, Input, Output) -> Reply</name>
+ <name>keysort(KeyPos, Input, Output, Options) -> Reply</name>
+ <fsummary>Sort terms on files by key.</fsummary>
+ <type>
+ <v>Reply = ok | {error, Reason} | InputReply | OutputReply</v>
+ </type>
+ <desc>
+ <p>Sorts tuples on files. The sort is performed on the
+ element(s) mentioned in <c>KeyPos</c>. If two tuples
+ compare equal on one element, next element according to
+ <c>KeyPos</c> is compared. The sort is stable.
+ </p>
+ <p><c>keysort(N, FileName)</c> is equivalent to
+ <c>keysort(N, [FileName], FileName)</c>.
+ </p>
+ <p><c>keysort(N, Input, Output)</c> is equivalent to
+ <c>keysort(N, Input, Output, [])</c>.
+ </p>
+ <p></p>
+ </desc>
+ </func>
+ <func>
+ <name>merge(FileNames, Output) -> Reply</name>
+ <name>merge(FileNames, Output, Options) -> Reply</name>
+ <fsummary>Merge terms on files.</fsummary>
+ <type>
+ <v>Reply = ok | {error, Reason} | OutputReply</v>
+ </type>
+ <desc>
+ <p>Merges terms on files. Each input file is assumed to be
+ sorted.
+ </p>
+ <p><c>merge(FileNames, Output)</c> is equivalent to
+ <c>merge(FileNames, Output, [])</c>.
+ </p>
+ </desc>
+ </func>
+ <func>
+ <name>keymerge(KeyPos, FileNames, Output) -> Reply</name>
+ <name>keymerge(KeyPos, FileNames, Output, Options) -> Reply</name>
+ <fsummary>Merge terms on files by key.</fsummary>
+ <type>
+ <v>Reply = ok | {error, Reason} | OutputReply</v>
+ </type>
+ <desc>
+ <p>Merges tuples on files. Each input file is assumed to be
+ sorted on key(s).
+ </p>
+ <p><c>keymerge(KeyPos, FileNames, Output)</c> is equivalent
+ to <c>keymerge(KeyPos, FileNames, Output, [])</c>.
+ </p>
+ <p></p>
+ </desc>
+ </func>
+ <func>
+ <name>check(FileName) -> Reply</name>
+ <name>check(FileNames, Options) -> Reply</name>
+ <fsummary>Check whether terms on files are sorted.</fsummary>
+ <type>
+ <v>Reply = {ok, [Result]} | {error, Reason}</v>
+ <v>Result = {FileName, TermPosition, Term}</v>
+ <v>TermPosition = int() > 1</v>
+ </type>
+ <desc>
+ <p>Checks files for sortedness. If a file is not sorted, the
+ first out-of-order element is returned. The first term on a
+ file has position 1.
+ </p>
+ <p><c>check(FileName)</c> is equivalent to
+ <c>check([FileName], [])</c>.
+ </p>
+ </desc>
+ </func>
+ <func>
+ <name>keycheck(KeyPos, FileName) -> CheckReply</name>
+ <name>keycheck(KeyPos, FileNames, Options) -> Reply</name>
+ <fsummary>Check whether terms on files are sorted by key.</fsummary>
+ <type>
+ <v>Reply = {ok, [Result]} | {error, Reason}</v>
+ <v>Result = {FileName, TermPosition, Term}</v>
+ <v>TermPosition = int() > 1</v>
+ </type>
+ <desc>
+ <p>Checks files for sortedness. If a file is not sorted, the
+ first out-of-order element is returned. The first term on a
+ file has position 1.
+ </p>
+ <p><c>keycheck(KeyPos, FileName)</c> is equivalent
+ to <c>keycheck(KeyPos, [FileName], [])</c>.
+ </p>
+ <p></p>
+ </desc>
+ </func>
+ </funcs>
+</erlref>
+