From 84adefa331c4159d432d22840663c38f155cd4c1 Mon Sep 17 00:00:00 2001 From: Erlang/OTP Date: Fri, 20 Nov 2009 14:54:40 +0000 Subject: The R13B03 release. --- lib/stdlib/doc/src/file_sorter.xml | 390 +++++++++++++++++++++++++++++++++++++ 1 file changed, 390 insertions(+) create mode 100644 lib/stdlib/doc/src/file_sorter.xml (limited to 'lib/stdlib/doc/src/file_sorter.xml') diff --git a/lib/stdlib/doc/src/file_sorter.xml b/lib/stdlib/doc/src/file_sorter.xml new file mode 100644 index 0000000000..b3f4da294c --- /dev/null +++ b/lib/stdlib/doc/src/file_sorter.xml @@ -0,0 +1,390 @@ + + + + +
+ + 20012009 + Ericsson AB. All Rights Reserved. + + + The contents of this file are subject to the Erlang Public License, + Version 1.1, (the "License"); you may not use this file except in + compliance with the License. You should have received a copy of the + Erlang Public License along with this software. If not, it can be + retrieved online at http://www.erlang.org/. + + Software distributed under the License is distributed on an "AS IS" + basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See + the License for the specific language governing rights and limitations + under the License. + + + + file_sorter + Hans Bolinder + nobody + + nobody + no + 2001-03-13 + PA1 + file_sorter.sgml +
+ file_sorter + File Sorter + +

The functions of this module sort terms on files, merge already + sorted files, and check files for sortedness. Chunks containing + binary terms are read from a sequence of files, sorted + internally in memory and written on temporary files, which are + merged producing one sorted file as output. Merging is provided + as an optimization; it is faster when the files are already + sorted, but it always works to sort instead of merge. +

+

On a file, a term is represented by a header and a binary. Two + options define the format of terms on files: +

+ + {header, HeaderLength}. HeaderLength determines the + number of bytes preceding each binary and containing the + length of the binary in bytes. Default is 4. The order of the + header bytes is defined as follows: if B is a binary + containing a header only, the size Size of the binary + is calculated as + <<Size:HeaderLength/unit:8>> = B. + + {format, Format}. The format determines the + function that is applied to binaries in order to create the + terms that will be sorted. The default value is + binary_term, which is equivalent to + fun binary_to_term/1. The value binary is + equivalent to fun(X) -> X end, which means that the + binaries will be sorted as they are. This is the fastest + format. If Format is term, io:read/2 is + called to read terms. In that case only the default value of + the header option is allowed. The format option + also determines what is written to the sorted output file: if + Format is term then io:format/3 is called + to write each term, otherwise the binary prefixed by a header + is written. Note that the binary written is the same binary + that was read; the results of applying the Format + function are thrown away as soon as the terms have been + sorted. Reading and writing terms using the io module + is very much slower than reading and writing binaries. + + +
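The default file layout described above can be sketched as follows. This is only an illustration under stated assumptions: the module name fs_format_demo, its helper functions and the scratch file name are hypothetical and not part of file_sorter; only file_sorter:sort/1 and the default options {header, 4} and {format, binary_term} come from the text.

```erlang
-module(fs_format_demo).
-export([encode/1, decode/1, sort_file/1]).

%% Encode one term the way the default options expect it on file:
%% a 4-byte length header followed by the binary from term_to_binary/1.
encode(Term) ->
    Bin = term_to_binary(Term),
    <<(byte_size(Bin)):4/unit:8, Bin/binary>>.

%% Decode a binary holding zero or more such records into a list of terms.
decode(<<Size:4/unit:8, Bin:Size/binary, Rest/binary>>) ->
    [binary_to_term(Bin) | decode(Rest)];
decode(<<>>) ->
    [].

%% Write Terms to a scratch file (hypothetical name), sort the file in
%% place with the default options, read it back and remove it again.
sort_file(Terms) ->
    File = "fs_format_demo.tmp",
    ok = file:write_file(File, [encode(T) || T <- Terms]),
    ok = file_sorter:sort(File),
    {ok, Data} = file:read_file(File),
    ok = file:delete(File),
    decode(Data).
```

Since the binaries written are the same binaries that were read, the output file can be decoded with the same helper that was used to check the input.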

Other options are: +

+ + {order, Order}. The default is to sort terms in + ascending order, but that can be changed by the value + descending or by giving an ordering function Fun. + An ordering function is antisymmetric, transitive and total. + Fun(A, B) should return true if A + comes before B in the ordering, false otherwise. + Using an ordering function will slow down the sort + considerably. The keysort, keymerge and + keycheck functions do not accept ordering functions. + + {unique, bool()}. When sorting or merging files, + only the first of a sequence of terms that compare equal is + output if this option is set to true. The default + value is false which implies that all terms that + compare equal are output. When checking files for + sortedness, a check that no pair of consecutive terms + compares equal is done if this option is set to true. + + {tmpdir, TempDirectory}. The directory where + temporary files are put can be chosen explicitly. The + default, implied by the value "", is to put temporary + files on the same directory as the sorted output file. If + output is a function (see below), the directory returned by + file:get_cwd() is used instead. The names of + temporary files are derived from the Erlang nodename + (node()), the process identifier of the current Erlang + emulator (os:getpid()), and a timestamp + (erlang:now()); a typical name would be + fs_mynode@myhost_1763_1043_337000_266005.17, where + 17 is a sequence number. Existing files will be + overwritten. Temporary files are deleted unless some + uncaught EXIT signal occurs. + + {compressed, bool()}. Temporary files and the + output file may be compressed. The default value + false implies that written files are not + compressed. Regardless of the value of the compressed + option, compressed files can always be read. Note that + reading and writing compressed files is significantly slower + than reading and writing uncompressed files. + + {size, Size}. 
By default approximately 512*1024 + bytes read from files are sorted internally. This option + should rarely be needed. + + {no_files, NoFiles}. By default 16 files are + merged at a time. This option should rarely be needed. + + +
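As a hedged illustration of how some of these options combine, the sketch below sorts a small text file in descending order and drops duplicates. The module name fs_options_demo and the scratch file names are assumptions made for the example; the term format is used only because it keeps the input readable.

```erlang
-module(fs_options_demo).
-export([run/0]).

%% Sort a file of Erlang terms in descending order, keeping only the
%% first of each run of equal terms. File names are hypothetical.
run() ->
    In = "fs_options_demo.in",
    Out = "fs_options_demo.out",
    ok = file:write_file(In, "3.\n1.\n2.\n3.\n1.\n"),
    Options = [{format, term}, {order, descending}, {unique, true}],
    ok = file_sorter:sort([In], Out, Options),
    {ok, Sorted} = file:consult(Out),
    ok = file:delete(In),
    ok = file:delete(Out),
    Sorted.
```

For larger inputs the default binary_term format is the better choice, since reading and writing with the io module is much slower.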

To summarize, here is the syntax of the options:

+ + +

Options = [Option] | Option

+
+ +

Option = {header, HeaderLength} | {format, Format} | {order, Order} | {unique, bool()} | {tmpdir, TempDirectory} | {compressed, bool()} | {size, Size} | {no_files, NoFiles}

+
+ +

HeaderLength = int() > 0

+
+ +

Format = binary_term | term | binary | FormatFun

+
+ +

FormatFun = fun(Binary) -> Term

+
+ +

Order = ascending | descending | OrderFun

+
+ +

OrderFun = fun(Term, Term) -> bool()

+
+ +

TempDirectory = "" | file_name()

+
+ +

Size = int() >= 0

+
+ +

NoFiles = int() > 1

+
+
+

As an alternative to sorting files, a function of one argument + can be given as input. When called with the argument read + the function is assumed to return end_of_input or + {end_of_input, Value} when there is no more input + (Value is explained below), or {Objects, Fun}, + where Objects is a list of binaries or terms depending on + the format and Fun is a new input function. Any other + value is immediately returned as value of the current call to + sort or keysort. Each input function is + called exactly once. Should an error occur, the last + function is called with the argument close, the reply of + which is ignored. +

+

A function of one argument can be given as output. The results + of sorting or merging the input are collected in a non-empty + sequence of variable length lists of binaries or terms depending + on the format. The output function is called with one list at a + time, and is assumed to return a new output function. Any other + return value is immediately returned as value of the current + call to the sort or merge function. Each output function is + called exactly once. When some output function has been applied + to all of the results or an error occurs, the last function is + called with the argument close, and the reply is returned + as value of the current call to the sort or merge function. If a + function is given as input and the last input function returns + {end_of_input, Value}, the function given as output will + be called with the argument {value, Value}. This makes it + easy to initiate the sequence of output functions with a value + calculated by the input functions. +
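The interplay between input and output functions, including the {end_of_input, Value} and {value, Value} handshake, might look like the sketch below. The module fs_fun_demo and its helpers are hypothetical; the input function counts the terms it hands out and passes the count on as Value, and the output function returns it together with the sorted terms.

```erlang
-module(fs_fun_demo).
-export([run/0]).

%% Input function over a list of chunks. Count tallies the terms handed
%% out so far and is passed on as Value when the chunks are exhausted.
input(Chunks, Count) ->
    fun(close) ->
            ok;
       (read) ->
            case Chunks of
                [] ->
                    {end_of_input, Count};
                [Chunk | Rest] ->
                    {Chunk, input(Rest, Count + length(Chunk))}
            end
    end.

%% Output function: remembers the Value from the input side and
%% collects the sorted lists until it is called with close.
output(Count, Acc) ->
    fun({value, Value}) ->
            output(Value, Acc);
       (close) ->
            {Count, lists:append(lists:reverse(Acc))};
       (Terms) when is_list(Terms) ->
            output(Count, [Terms | Acc])
    end.

%% Sort three in-memory chunks; with {format, term} the objects
%% exchanged with the functions are terms rather than binaries.
run() ->
    Input = input([[3, 1], [5, 4], [2]], 0),
    file_sorter:sort(Input, output(undefined, []), {format, term}).
```

Because every call returns a fresh fun carrying its own state, no process dictionary or other side channel is needed to thread the accumulator through the sort.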

+

As an example, consider sorting the terms on a disk log file. + A function that reads chunks from the disk log and returns a + list of terms is used as input. The results are collected in + a list of terms.

+
+sort(Log) ->
+    {ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
+    Input = input(Log, start),
+    Output = output([]),
+    Reply = file_sorter:sort(Input, Output, {format,term}),
+    ok = disk_log:close(Log),
+    Reply.
+
+input(Log, Cont) ->
+    fun(close) ->
+            ok;
+       (read) ->
+            case disk_log:chunk(Log, Cont) of
+                {error, Reason} ->
+                    {error, Reason};
+                {Cont2, Terms} ->
+                    {Terms, input(Log, Cont2)};
+                {Cont2, Terms, _Badbytes} ->
+                    {Terms, input(Log, Cont2)};
+                eof ->
+                    end_of_input
+            end
+    end.
+
+output(L) ->
+    fun(close) ->
+            lists:append(lists:reverse(L));
+       (Terms) ->
+            output([Terms | L])
+    end.    
+

Further examples of functions as input and output can be found + at the end of the file_sorter module; the term + format is implemented with functions. +

+

The possible values of Reason returned when an error + occurs are:

+ + +

bad_object, {bad_object, FileName}. + Applying the format function failed for some binary, + or the key(s) could not be extracted from some term.

+
+ +

{bad_term, FileName}. io:read/2 failed + to read some term.

+
+ +

{file_error, FileName, Reason2}. See + file(3) for an explanation of Reason2.

+
+ +

{premature_eof, FileName}. End-of-file was + encountered inside some binary term.

+
+
+
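A caller will typically pattern match on these reasons. The sketch below is one possible way to turn a reply into a readable message; the module name fs_error_demo, the function explain/1 and the message texts are assumptions made for illustration, while file:format_error/1 is the existing function for describing Reason2.

```erlang
-module(fs_error_demo).
-export([explain/1]).

%% Map a reply from a sort, merge or check call to a short message.
%% The wording of the messages is made up for this example.
explain(ok) ->
    "ok";
explain({error, bad_object}) ->
    "format function or key extraction failed";
explain({error, {bad_object, File}}) ->
    fmt("~s: format function or key extraction failed", [File]);
explain({error, {bad_term, File}}) ->
    fmt("~s: io:read/2 could not read a term", [File]);
explain({error, {file_error, File, Reason2}}) ->
    fmt("~s: ~s", [File, file:format_error(Reason2)]);
explain({error, {premature_eof, File}}) ->
    fmt("~s: end-of-file inside a binary term", [File]).

%% Format to a flat string.
fmt(Format, Args) ->
    lists:flatten(io_lib:format(Format, Args)).
```

Replies produced by user-supplied input or output functions are not covered; a real caller would add a catch-all clause for them.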

Types

+
+Binary = binary()
+FileName = file_name()
+FileNames = [FileName]
+ICommand = read | close
+IReply = end_of_input | {end_of_input, Value} | {[Object], Infun} | InputReply
+Infun = fun(ICommand) -> IReply
+Input = FileNames | Infun
+InputReply = Term
+KeyPos = int() > 0 | [int() > 0]
+OCommand = {value, Value} | [Object] | close
+OReply = Outfun | OutputReply
+Object = Term | Binary
+Outfun = fun(OCommand) -> OReply
+Output = FileName | Outfun
+OutputReply = Term
+Term = term()
+Value = Term
+
+ + + sort(FileName) -> Reply + sort(Input, Output) -> Reply + sort(Input, Output, Options) -> Reply + Sort terms on files. + + Reply = ok | {error, Reason} | InputReply | OutputReply + + +

Sorts terms on files. +

+

sort(FileName) is equivalent to + sort([FileName], FileName). +

+

sort(Input, Output) is equivalent to + sort(Input, Output, []). +

+

+
+
+ + keysort(KeyPos, FileName) -> Reply + keysort(KeyPos, Input, Output) -> Reply + keysort(KeyPos, Input, Output, Options) -> Reply + Sort terms on files by key. + + Reply = ok | {error, Reason} | InputReply | OutputReply + + +

Sorts tuples on files. The sort is performed on the + element(s) mentioned in KeyPos. If two tuples + compare equal on one element, the next element according to + KeyPos is compared. The sort is stable. +

+

keysort(N, FileName) is equivalent to + keysort(N, [FileName], FileName). +

+

keysort(N, Input, Output) is equivalent to + keysort(N, Input, Output, []). +
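A small sketch of the difference between a single key position and a list of positions follows. The module name fs_keysort_demo and the scratch file names are hypothetical, and the term format is chosen only to keep the input readable.

```erlang
-module(fs_keysort_demo).
-export([run/0]).

%% Sort the same tuples first on element 2 only, then on element 2
%% with element 1 as tie-breaker. File names are hypothetical.
run() ->
    In = "fs_keysort_demo.in",
    Out = "fs_keysort_demo.out",
    ok = file:write_file(In, "{b,2}.\n{a,1}.\n{c,1}.\n{a,2}.\n"),
    ok = file_sorter:keysort(2, [In], Out, {format, term}),
    {ok, ByKey} = file:consult(Out),
    ok = file_sorter:keysort([2, 1], [In], Out, {format, term}),
    {ok, ByBoth} = file:consult(Out),
    ok = file:delete(In),
    ok = file:delete(Out),
    {ByKey, ByBoth}.
```

In the first result the tuples {b,2} and {a,2} keep their input order because the sort is stable; in the second they are ordered by the first element as well.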

+

+
+
+ + merge(FileNames, Output) -> Reply + merge(FileNames, Output, Options) -> Reply + Merge terms on files. + + Reply = ok | {error, Reason} | OutputReply + + +

Merges terms on files. Each input file is assumed to be + sorted. +

+

merge(FileNames, Output) is equivalent to + merge(FileNames, Output, []). +
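Merging two files that are already sorted could be sketched like this; the module name fs_merge_demo and the file names are assumptions made for the example.

```erlang
-module(fs_merge_demo).
-export([run/0]).

%% Merge two already sorted files of Erlang terms into one output file.
%% File names are hypothetical.
run() ->
    A = "fs_merge_demo.a",
    B = "fs_merge_demo.b",
    Out = "fs_merge_demo.out",
    ok = file:write_file(A, "1.\n4.\n6.\n"),
    ok = file:write_file(B, "2.\n3.\n5.\n"),
    ok = file_sorter:merge([A, B], Out, {format, term}),
    {ok, Merged} = file:consult(Out),
    lists:foreach(fun(F) -> ok = file:delete(F) end, [A, B, Out]),
    Merged.
```

If it is not known whether the input files are sorted, sorting them instead of merging always works, at the cost of some speed.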

+
+
+ + keymerge(KeyPos, FileNames, Output) -> Reply + keymerge(KeyPos, FileNames, Output, Options) -> Reply + Merge terms on files by key. + + Reply = ok | {error, Reason} | OutputReply + + +

Merges tuples on files. Each input file is assumed to be + sorted on key(s). +

+

keymerge(KeyPos, FileNames, Output) is equivalent + to keymerge(KeyPos, FileNames, Output, []). +

+

+
+
+ + check(FileName) -> Reply + check(FileNames, Options) -> Reply + Check whether terms on files are sorted. + + Reply = {ok, [Result]} | {error, Reason} + Result = {FileName, TermPosition, Term} + TermPosition = int() > 1 + + +

Checks files for sortedness. If a file is not sorted, the + first out-of-order element is returned. The first term on a + file has position 1. +

+

check(FileName) is equivalent to + check([FileName], []). +
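The following sketch checks one sorted and one unsorted file in a single call. The module name fs_check_demo and the file names are hypothetical; the file name reported in the result is matched loosely in the test since it may not be spelled exactly as it was given.

```erlang
-module(fs_check_demo).
-export([run/0]).

%% Check a sorted and an unsorted file of Erlang terms.
%% File names are hypothetical.
run() ->
    Sorted = "fs_check_demo.sorted",
    Unsorted = "fs_check_demo.unsorted",
    ok = file:write_file(Sorted, "1.\n2.\n3.\n"),
    ok = file:write_file(Unsorted, "1.\n3.\n2.\n"),
    Reply = file_sorter:check([Sorted, Unsorted], {format, term}),
    ok = file:delete(Sorted),
    ok = file:delete(Unsorted),
    Reply.
```

Only the unsorted file is expected to contribute a Result, naming the first term that is out of order together with its position.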

+
+
+ + keycheck(KeyPos, FileName) -> Reply + keycheck(KeyPos, FileNames, Options) -> Reply + Check whether terms on files are sorted by key. + + Reply = {ok, [Result]} | {error, Reason} + Result = {FileName, TermPosition, Term} + TermPosition = int() > 1 + + +

Checks files for sortedness. If a file is not sorted, the + first out-of-order element is returned. The first term on a + file has position 1. +

+

keycheck(KeyPos, FileName) is equivalent + to keycheck(KeyPos, [FileName], []). +

+

+
+
+
+
+