Package com.google.code.externalsorting
Class ExternalSort
- java.lang.Object
-
- com.google.code.externalsorting.ExternalSort
-
public class ExternalSort extends java.lang.ObjectGoal: offer a generic external-memory sorting program in Java. It must be : - hackable (easy to adapt) - scalable to large files - sensibly efficient. This software is in the public domain. Usage: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt You can change the default maximal number of temporary files with the -t flag: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt -t 3 For very large files, you might want to use an appropriate flag to allocate more memory to the Java VM: java -Xms2G com/google/code/externalsorting/ExternalSort somefile.txt out.txt By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon Elsas, Christan Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, Amit Jain, Thomas Mueller, Jerry Yang, First published: April 2010 originally posted at http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
-
-
Field Summary
Fields Modifier and Type Field Description static java.util.Comparator<java.lang.String>defaultcomparatordefault comparator between strings.static intDEFAULTMAXTEMPFILESDefault maximal number of temporary files allowed.
-
Constructor Summary
Constructors Constructor Description ExternalSort()
-
Method Summary
All Methods Static Methods Concrete Methods Modifier and Type Method Description private static voiddisplayUsage()static longestimateAvailableMemory()This method calls the garbage collector and then returns the free memory.static longestimateBestSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory)we divide the file into small blocks.static voidmain(java.lang.String[] args)static longmergeSortedFiles(java.io.BufferedWriter fbw, java.util.Comparator<java.lang.String> cmp, boolean distinct, java.util.List<IOStringStack> buffers)This merges several BinaryFileBuffer to an output writer.static longmergeSortedFiles(java.util.List<java.io.File> files, java.io.BufferedWriter fbw, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct, boolean usegzip)This merges a bunch of temporary flat filesstatic longmergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile)This merges a bunch of temporary flat filesstatic longmergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp)This merges a bunch of temporary flat filesstatic longmergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, boolean distinct)This merges a bunch of temporary flat filesstatic longmergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs)This merges a bunch of temporary flat filesstatic longmergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct)This merges a bunch of temporary flat filesstatic longmergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct, boolean append, boolean usegzip)This merges a bunch of temporary flat filesstatic voidsort(java.io.File input, java.io.File output)This sorts a file (input) to an output file (output) using default parametersstatic voidsort(java.io.File input, java.io.File output, java.util.Comparator<java.lang.String> cmp)This sorts a file (input) to an output file (output) using customized comparatorstatic java.io.FilesortAndSave(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory)Sort a list and save it to a temporary filestatic java.io.FilesortAndSave(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, boolean usegzip, boolean parallel)Sort a list and save it to a temporary filestatic java.util.List<java.io.File>sortInBatch(java.io.BufferedReader fbr, long datalength)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.BufferedReader fbr, long datalength, java.util.Comparator<java.lang.String> cmp, boolean distinct)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.BufferedReader fbr, long datalength, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.File file)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, boolean distinct)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, java.io.File tmpdirectory, boolean distinct, int numHeader)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader)This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
-
-
-
Field Detail
-
defaultcomparator
public static java.util.Comparator<java.lang.String> defaultcomparator
default comparator between strings.
-
DEFAULTMAXTEMPFILES
public static final int DEFAULTMAXTEMPFILES
Default maximal number of temporary files allowed.- See Also:
- Constant Field Values
-
-
Method Detail
-
displayUsage
private static void displayUsage()
-
estimateAvailableMemory
public static long estimateAvailableMemory()
This method calls the garbage collector and then returns the free memory. This avoids problems with applications where the GC hasn't reclaimed memory and reports no available memory.- Returns:
- available memory
-
estimateBestSizeOfBlocks
public static long estimateBestSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory)we divide the file into small blocks. If the blocks are too small, we shall create too many temporary files. If they are too big, we shall be using too much memory.- Parameters:
sizeoffile- how much data (in bytes) can we expectmaxtmpfiles- how many temporary files can we create (e.g., 1024)maxMemory- Maximum memory to use (in bytes)- Returns:
- the estimate
-
main
public static void main(java.lang.String[] args) throws java.io.IOException- Parameters:
args- command line argument- Throws:
java.io.IOException- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.io.BufferedWriter fbw, java.util.Comparator<java.lang.String> cmp, boolean distinct, java.util.List<IOStringStack> buffers) throws java.io.IOExceptionThis merges several BinaryFileBuffer to an output writer.- Parameters:
fbw- A buffer where we write the data.cmp- A comparator object that tells us how to sort the lines.distinct- Passtrueif duplicate lines should be discarded.buffers- Where the data should be read.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile) throws java.io.IOExceptionThis merges a bunch of temporary flat files- Parameters:
files- TheListof sortedFiles to be merged.outputfile- The outputFileto merge the results to.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp) throws java.io.IOExceptionThis merges a bunch of temporary flat files- Parameters:
files- TheListof sortedFiles to be merged.outputfile- The outputFileto merge the results to.cmp- TheComparatorto use to compareStrings.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, boolean distinct) throws java.io.IOExceptionThis merges a bunch of temporary flat files- Parameters:
files- TheListof sortedFiles to be merged.outputfile- The outputFileto merge the results to.cmp- TheComparatorto use to compareStrings.distinct- Passtrueif duplicate lines should be discarded.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs) throws java.io.IOExceptionThis merges a bunch of temporary flat files- Parameters:
files- TheListof sortedFiles to be merged.outputfile- The outputFileto merge the results to.cmp- TheComparatorto use to compareStrings.cs- TheCharsetto be used for the byte to character conversion.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct) throws java.io.IOExceptionThis merges a bunch of temporary flat files- Parameters:
files- TheListof sortedFiles to be merged.distinct- Passtrueif duplicate lines should be discarded.outputfile- The outputFileto merge the results to.cmp- TheComparatorto use to compareStrings.cs- TheCharsetto be used for the byte to character conversion.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException- generic IO exception- Since:
- v0.1.2
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct, boolean append, boolean usegzip) throws java.io.IOExceptionThis merges a bunch of temporary flat files- Parameters:
files- TheListof sortedFiles to be merged.distinct- Passtrueif duplicate lines should be discarded.outputfile- The outputFileto merge the results to.cmp- TheComparatorto use to compareStrings.cs- TheCharsetto be used for the byte to character conversion.append- Passtrueif result should append toFileinstead of overwrite. Default to be false for overloading methods.usegzip- assumes we used gzip compression for temporary files- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException- generic IO exception- Since:
- v0.1.4
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.BufferedWriter fbw, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct, boolean usegzip) throws java.io.IOExceptionThis merges a bunch of temporary flat files- Parameters:
files- TheListof sortedFiles to be merged.distinct- Passtrueif duplicate lines should be discarded.fbw- The outputBufferedWriterto merge the results to.cmp- TheComparatorto use to compareStrings.cs- TheCharsetto be used for the byte to character conversion.usegzip- assumes we used gzip compression for temporary files- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException- generic IO exception- Since:
- v0.1.4
-
sort
public static void sort(java.io.File input, java.io.File output) throws java.io.IOExceptionThis sorts a file (input) to an output file (output) using default parameters- Parameters:
input- source fileoutput- output file- Throws:
java.io.IOException- generic IO exception
-
sort
public static void sort(java.io.File input, java.io.File output, java.util.Comparator<java.lang.String> cmp) throws java.io.IOExceptionThis sorts a file (input) to an output file (output) using customized comparator- Parameters:
input- source fileoutput- output filecmp- TheComparatorto use to compareStrings.- Throws:
java.io.IOException- generic IO exception
-
sortAndSave
public static java.io.File sortAndSave(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory) throws java.io.IOExceptionSort a list and save it to a temporary file- Parameters:
tmplist- data to be sortedcmp- string comparatorcs- charset to use for output (can use Charset.defaultCharset())tmpdirectory- location of the temporary files (set to null for default location)- Returns:
- the file containing the sorted data
- Throws:
java.io.IOException- generic IO exception
-
sortAndSave
public static java.io.File sortAndSave(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, boolean usegzip, boolean parallel) throws java.io.IOExceptionSort a list and save it to a temporary file- Parameters:
tmplist- data to be sortedcmp- string comparatorcs- charset to use for output (can use Charset.defaultCharset())tmpdirectory- location of the temporary files (set to null for default location)distinct- Passtrueif duplicate lines should be discarded.usegzip- set totrueif you are using gzip compression for the temporary filesparallel- set totruewhen sorting in parallel- Returns:
- the file containing the sorted data
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.BufferedReader fbr, long datalength) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
fbr- data sourcedatalength- estimated data volume (in bytes)- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.BufferedReader fbr, long datalength, java.util.Comparator<java.lang.String> cmp, boolean distinct) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
fbr- data sourcedatalength- estimated data volume (in bytes)cmp- string comparatordistinct- Passtrueif duplicate lines should be discarded.- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.BufferedReader fbr, long datalength, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
fbr- data sourcedatalength- estimated data volume (in bytes)cmp- string comparatormaxtmpfiles- maximal number of temporary filesmaxMemory- maximum amount of memory to use (in bytes)cs- character set to use (can use Charset.defaultCharset())tmpdirectory- location of the temporary files (set to null for default location)distinct- Passtrueif duplicate lines should be discarded.numHeader- number of lines to preclude before sorting startsusegzip- use gzip compression for the temporary filesparallel- sort in parallel- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
file- some flat file- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
file- some flat filecmp- string comparator- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, boolean distinct) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
file- some flat filecmp- string comparatordistinct- Passtrueif duplicate lines should be discarded.- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, java.io.File tmpdirectory, boolean distinct, int numHeader) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file- some flat filecmp- string comparatortmpdirectory- location of the temporary files (set to null for default location)distinct- Passtrueif duplicate lines should be discarded.numHeader- number of lines to preclude before sorting starts- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file- some flat filecmp- string comparatormaxtmpfiles- maximal number of temporary filescs- character set to use (can use Charset.defaultCharset())tmpdirectory- location of the temporary files (set to null for default location)distinct- Passtrueif duplicate lines should be discarded.- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file- some flat filecmp- string comparatorcs- character set to use (can use Charset.defaultCharset())tmpdirectory- location of the temporary files (set to null for default location)distinct- Passtrueif duplicate lines should be discarded.numHeader- number of lines to preclude before sorting starts- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file- some flat filecmp- string comparatormaxtmpfiles- maximal number of temporary filescs- character set to use (can use Charset.defaultCharset())tmpdirectory- location of the temporary files (set to null for default location)distinct- Passtrueif duplicate lines should be discarded.numHeader- number of lines to preclude before sorting starts- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file- some flat filecmp- string comparatormaxtmpfiles- maximal number of temporary filescs- character set to use (can use Charset.defaultCharset())tmpdirectory- location of the temporary files (set to null for default location)distinct- Passtrueif duplicate lines should be discarded.numHeader- number of lines to preclude before sorting startsusegzip- use gzip compression for the temporary files- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) throws java.io.IOExceptionThis will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file- some flat filecmp- string comparatormaxtmpfiles- maximal number of temporary filescs- character set to use (can use Charset.defaultCharset())tmpdirectory- location of the temporary files (set to null for default location)distinct- Passtrueif duplicate lines should be discarded.numHeader- number of lines to preclude before sorting startsusegzip- use gzip compression for the temporary filesparallel- whether to sort in parallel- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException- generic IO exception
-
-