Class SimilarityIndex
- java.lang.Object
-
- org.eclipse.jgit.diff.SimilarityIndex
-
public class SimilarityIndex extends java.lang.Object
Index structure of lines/blocks in one file.
This structure can be used to compute an approximation of the similarity between two files. The index is used by SimilarityRenameDetector to compute scores between files.
To save space in memory, this index uses a space-efficient encoding which will not exceed 1 MiB per instance. The index starts out at a smaller size (closer to 2 KiB), but may grow as more distinct blocks within the scanned file are discovered.
- Since:
- 4.0
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | SimilarityIndex.TableFullException | Thrown by create() when file is too large.
-
Field Summary
Fields
Modifier and Type | Field | Description
private long | hashedCnt | Total amount of bytes hashed into the structure, including \n.
private int | idGrowAt |
private long[] | idHash | Pairings of content keys and counters.
private int | idHashBits | idHash.length == 1 << idHashBits.
private int | idSize | Number of non-zero entries in idHash.
private static int | KEY_SHIFT | Shift to apply before storing a key.
private static long | MAX_COUNT | Maximum value of the count field, also mask to extract the count.
static SimilarityIndex.TableFullException | TABLE_FULL_OUT_OF_MEMORY | A special SimilarityIndex.TableFullException used in place of OutOfMemoryError.
-
Constructor Summary
Constructors
Constructor | Description
SimilarityIndex() |
-
Method Summary
Modifier and Type | Method | Description
(package private) void | add(int key, int cnt) |
private static long | common(long[] srcHash, int srcIdx, long[] dstHash, int dstIdx) |
(package private) long | common(SimilarityIndex dst) |
private static long | common(SimilarityIndex src, SimilarityIndex dst) |
(package private) long | count(int idx) |
private static long | countOf(long v) |
static SimilarityIndex | create(ObjectLoader obj) | Create a new similarity index for the given object.
(package private) int | findIndex(int key) |
private void | grow() |
private static int | growAt(int idHashBits) |
(package private) void | hash(byte[] raw, int ptr, int end) |
(package private) void | hash(java.io.InputStream in, long remaining, boolean text) |
(package private) void | hash(ObjectLoader obj) |
private void | hashLargeObject(ObjectLoader obj) |
(package private) int | key(int idx) |
private static int | keyOf(long v) |
private int | packedIndex(int idx) |
private static long | pair(int key, long cnt) |
int | score(SimilarityIndex dst, int maxScore) | Compute the similarity score between this index and another.
(package private) int | size() |
private int | slot(int key) |
(package private) void | sort() | Sort the internal table so it can be used for efficient scoring.
-
-
-
Field Detail
-
TABLE_FULL_OUT_OF_MEMORY
public static final SimilarityIndex.TableFullException TABLE_FULL_OUT_OF_MEMORY
A special SimilarityIndex.TableFullException used in place of OutOfMemoryError.
-
KEY_SHIFT
private static final int KEY_SHIFT
Shift to apply before storing a key.
Within the 64-bit table record space, we leave the highest bit unset so all values are positive. The lower 32 bits are used to count bytes.
- See Also:
- Constant Field Values
-
MAX_COUNT
private static final long MAX_COUNT
Maximum value of the count field, also mask to extract the count.
- See Also:
- Constant Field Values
-
hashedCnt
private long hashedCnt
Total number of bytes hashed into the structure, including \n. This is usually the size of the file minus the number of CRLF sequences encountered.
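The relationship between file size and the hashed byte count can be illustrated with a minimal sketch. It assumes that each CRLF pair contributes only its LF to the count, which is one reading of the description above; the class and method names are hypothetical and are not part of JGit.

```java
// Hypothetical illustration; not the JGit implementation.
final class HashedCountSketch {
    // Counts the bytes that would be hashed if every CRLF pair is
    // treated as a single LF (assumption based on the field description).
    static long hashedCount(byte[] raw) {
        long n = 0;
        for (int i = 0; i < raw.length; i++) {
            boolean crBeforeLf = raw[i] == '\r'
                    && i + 1 < raw.length
                    && raw[i + 1] == '\n';
            if (crBeforeLf) {
                continue; // drop the CR, the following LF is still counted
            }
            n++;
        }
        return n;
    }
}
```

Under this assumption a 5-byte input containing one CRLF yields a count of 4, that is, the file size minus the number of CRLF sequences.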
-
idSize
private int idSize
Number of non-zero entries in idHash.
-
idGrowAt
private int idGrowAt
-
idHash
private long[] idHash
Pairings of content keys and counters.
Slots in the table are actually two ints wedged into a single long. The upper 32 bits store the content key, and the remaining lower bits store the number of bytes associated with that key. Empty slots are denoted by 0, a value that a populated slot can never hold because its count cannot be 0. Values can only be positive, which we enforce during key addition.
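The packing scheme described above can be sketched as follows. The constant values (a shift of 32 and a 32-bit count mask) are assumptions inferred from the field descriptions, since this page does not list them; the class name is hypothetical, and the sketch throws IllegalArgumentException where the real class declares SimilarityIndex.TableFullException.

```java
// Hypothetical illustration of the key/count packing; not the JGit source.
final class PackedPairSketch {
    // Assumed values: the page only says the lower 32 bits hold the count.
    static final int KEY_SHIFT = 32;
    static final long MAX_COUNT = (1L << KEY_SHIFT) - 1;

    // Packs a non-negative key and a positive count into one long.
    static long pair(int key, long cnt) {
        if (key < 0) {
            throw new IllegalArgumentException("key must be non-negative");
        }
        if (cnt <= 0 || cnt > MAX_COUNT) {
            throw new IllegalArgumentException("count out of range");
        }
        return (((long) key) << KEY_SHIFT) | cnt;
    }

    // Extracts the key from the upper 32 bits.
    static int keyOf(long v) {
        return (int) (v >>> KEY_SHIFT);
    }

    // Extracts the count from the lower 32 bits.
    static long countOf(long v) {
        return v & MAX_COUNT;
    }
}
```

Because the key is non-negative, the highest bit of the packed value stays unset, so every populated slot is strictly positive and 0 remains free to mark an empty slot.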
-
idHashBits
private int idHashBits
idHash.length == 1 << idHashBits.
-
-
Method Detail
-
create
public static SimilarityIndex create(ObjectLoader obj) throws java.io.IOException, SimilarityIndex.TableFullException
Create a new similarity index for the given object.
- Parameters:
obj - the object to hash
- Returns:
- similarity index for this object
- Throws:
java.io.IOException - file contents cannot be read from the repository.
SimilarityIndex.TableFullException - object hashing overflowed the storage capacity of the SimilarityIndex.
-
hash
void hash(ObjectLoader obj) throws MissingObjectException, java.io.IOException, SimilarityIndex.TableFullException
- Throws:
MissingObjectException
java.io.IOException
SimilarityIndex.TableFullException
-
hashLargeObject
private void hashLargeObject(ObjectLoader obj) throws java.io.IOException, SimilarityIndex.TableFullException
- Throws:
java.io.IOException
SimilarityIndex.TableFullException
-
hash
void hash(byte[] raw, int ptr, int end) throws SimilarityIndex.TableFullException
-
hash
void hash(java.io.InputStream in, long remaining, boolean text) throws java.io.IOException, SimilarityIndex.TableFullException
- Throws:
java.io.IOException
SimilarityIndex.TableFullException
-
sort
void sort()
Sort the internal table so it can be used for efficient scoring.
Once sorted, additional lines/blocks cannot be added to the index.
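One way to see why sorting helps scoring: when two tables are sorted by key, the matching amount can be accumulated in a single merge pass. The sketch below is a simplified illustration under stated assumptions: each table holds unique, non-negative keys, has no empty slots, and packs the key in the upper 32 bits with the count in the lower 32 bits. The class and helper names are hypothetical, and the private common(...) methods of the real class may work differently.

```java
import java.util.Arrays;

// Hypothetical illustration of a merge over two sorted tables.
final class SortedCommonSketch {
    static final int KEY_SHIFT = 32;                      // assumed layout
    static final long MAX_COUNT = (1L << KEY_SHIFT) - 1;  // assumed mask

    static long pair(int key, long cnt) {
        return (((long) key) << KEY_SHIFT) | cnt;
    }

    // Sums, over keys present in both tables, the smaller of the two counts.
    static long common(long[] src, long[] dst) {
        long[] a = src.clone();
        long[] b = dst.clone();
        // With non-negative keys in the upper bits, sorting the packed
        // longs also orders the entries by key.
        Arrays.sort(a);
        Arrays.sort(b);

        long total = 0;
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            int ka = (int) (a[i] >>> KEY_SHIFT);
            int kb = (int) (b[j] >>> KEY_SHIFT);
            if (ka == kb) {
                total += Math.min(a[i] & MAX_COUNT, b[j] & MAX_COUNT);
                i++;
                j++;
            } else if (ka < kb) {
                i++;
            } else {
                j++;
            }
        }
        return total;
    }
}
```

The merge visits each entry at most once, so the cost after sorting is linear in the combined table size.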
-
score
public int score(SimilarityIndex dst, int maxScore)
Compute the similarity score between this index and another.
A region of a file is defined as a line in a text file or a fixed-size block in a binary file. To prepare an index, each region in the file is hashed; the values and counts of hashes are retained in a sorted table. Define the similarity fraction F as the count of matching regions between the two files divided by the maximum count of regions in either file. The similarity score is F multiplied by the maxScore constant, yielding a range [0, maxScore]. It is defined as maxScore for the degenerate case of two empty files.
The similarity score is symmetrical; i.e. a.score(b) == b.score(a).
- Parameters:
dst - the other index
maxScore - the score representing a 100% match
- Returns:
- the similarity score
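The scoring rule described for score(SimilarityIndex, int) can be sketched as plain arithmetic. This is a minimal illustration, not the JGit source: the class name and parameters are hypothetical, and the inputs are taken as precomputed totals in whatever unit the index tracks.

```java
// Hypothetical illustration of the scoring formula.
final class ScoreSketch {
    // common:   amount matched between the two files
    // srcTotal: total amount in the first file
    // dstTotal: total amount in the second file
    // maxScore: the score representing a 100% match
    static int score(long common, long srcTotal, long dstTotal, int maxScore) {
        long max = Math.max(srcTotal, dstTotal);
        if (max == 0) {
            // Degenerate case: two empty files count as a full match.
            return maxScore;
        }
        // F = common / max, scaled by maxScore; multiply first to keep
        // precision in integer arithmetic.
        return (int) ((common * maxScore) / max);
    }
}
```

For example, with 50 matched out of totals of 100 and 200, F is 50/200, so a maxScore of 100 gives a score of 25. Swapping the two totals gives the same result, which reflects the symmetry noted above.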
-
common
long common(SimilarityIndex dst)
-
common
private static long common(SimilarityIndex src, SimilarityIndex dst)
-
common
private static long common(long[] srcHash, int srcIdx, long[] dstHash, int dstIdx)
-
size
int size()
-
key
int key(int idx)
-
count
long count(int idx)
-
findIndex
int findIndex(int key)
-
packedIndex
private int packedIndex(int idx)
-
add
void add(int key, int cnt) throws SimilarityIndex.TableFullException
-
pair
private static long pair(int key, long cnt) throws SimilarityIndex.TableFullException
-
slot
private int slot(int key)
-
growAt
private static int growAt(int idHashBits)
-
grow
private void grow() throws SimilarityIndex.TableFullException
-
keyOf
private static int keyOf(long v)
-
countOf
private static long countOf(long v)
-
-