Class FeatureBasedClusterer
- All Implemented Interfaces:
ClusteringAlgorithm<Point>
- Direct Known Subclasses:
AutomaticClusterer, GreedyClusterer, KMeansClusterer, SpectralClusterer
Provides utilities to cluster arbitrary data by mapping items to Points (immutable float[] feature
vectors).
Usage: Use cluster(Collection, Function) to cluster your own data by providing an extractor
that produces the feature vector for each item. The result is a list of clusters, each represented as a
Map<T, float[]> containing the original items and their extracted features.
Available clustering algorithms:
- newAutomatic(DistanceMeasure): Automatic clustering that determines the number of clusters and thresholds based on distance statistics.
- newGreedy(DistanceMeasure, double): Greedy, single-pass clustering using a distance threshold.
- newKMeans(DistanceMeasure, int): K-means style clustering with a specified number of clusters.
- newSpectral(DistanceMeasure, int): Spectral clustering using a Gaussian kernel and Laplacian embedding.
Performance: Internally, distances are cached for efficiency. All clustering is performed on
Point objects with unique ids and float[] coordinates.
Extensibility: Subclasses implement ClusteringAlgorithm.cluster(Collection) to provide concrete clustering
strategies over Points.
Thread safety: Not thread-safe. Each instance maintains internal state for distance caching.
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| (package private) Function<Collection<Point>, Point> | centroid() | Returns a function that computes the centroid of a collection of points. |
| | cluster(Collection<T> input, Function<T, float[]> extractor) | Clusters arbitrary items by first extracting their float feature representation. |
| (package private) ToDoubleBiFunction<Point, Point> | distance() | Returns a function that computes the distance between two points. |
| (package private) double | distance | Returns the distance between two points. |
| (package private) double | getThreshold() | Returns the median distance threshold used for greedy clustering and initialisation. |
| (package private) Function<Collection<Point>, List<Point>> | initialiser() | Returns a function that generates an initial set of centroids from the input points. |
| (package private) boolean | isSquared() | Returns true if the configured distance measure is squared Euclidean. |
| static FeatureBasedClusterer | newAutomatic() | Returns a new automatic clusterer using squared Euclidean distance. |
| static FeatureBasedClusterer | newAutomatic(DistanceMeasure measure) | Returns a new automatic clusterer using the specified distance measure. |
| static FeatureBasedClusterer | newGreedy(double threshold) | Returns a new greedy, single-pass clusterer using squared Euclidean distance and the given threshold. |
| static FeatureBasedClusterer | newGreedy(DistanceMeasure measure, double threshold) | Returns a new greedy, single-pass clusterer using the supplied distance and threshold. |
| static FeatureBasedClusterer | newKMeans(int k) | Returns a new k-means–style clusterer using squared Euclidean distance and the given number of clusters. |
| static FeatureBasedClusterer | newKMeans(DistanceMeasure measure, int k) | Returns a new k-means–style clusterer using the supplied distance measure and number of clusters. |
| static FeatureBasedClusterer | newSpectral(int k) | Returns a new spectral clusterer using squared Euclidean distance and the given number of clusters. |
| static FeatureBasedClusterer | newSpectral(DistanceMeasure measure, int k) | Returns a new spectral clusterer using the supplied distance measure and number of clusters. |
| (package private) void | setup(Collection<Point> input) | Prepares the internal distance cache for the given input points and distance measure. |

Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ClusteringAlgorithm
cluster
-
Field Details
-
myCache
-
myMeasure
-
-
Constructor Details
-
FeatureBasedClusterer
FeatureBasedClusterer(DistanceMeasure measure)
-
-
Method Details
-
newAutomatic
Returns a new automatic clusterer using squared Euclidean distance. Equivalent to newAutomatic(DistanceMeasure) with DistanceMeasure.SQUARED_EUCLIDEAN.
- Returns:
- a new automatic clusterer
-
newAutomatic
Returns a new automatic clusterer using the specified distance measure.
The algorithm:
- Extracts features
- Caches all pairwise distances
- Performs statistical analysis to determine a distance threshold
- Performs greedy clustering to get initial centroids
- Filters out very small clusters (determining k)
- Performs k-means clustering to refine clusters and centroids
- Parameters:
measure - the distance measure to use
- Returns:
- a new automatic clusterer
-
newGreedy
Returns a new greedy, single-pass clusterer using the supplied distance and threshold.
Each item is assigned to the nearest existing centroid if its distance is <= threshold; otherwise a new cluster is created. The threshold must be in the same units as the chosen distance measure.
- Parameters:
measure - the distance measure
threshold - the maximum allowed distance to join an existing cluster
- Returns:
- a new greedy clusterer
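The assignment rule described for this method can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's GreedyClusterer: the class GreedySketch and its methods are hypothetical, plain float[] arrays stand in for Points, the distance is hard-coded to squared Euclidean, and the first member of each cluster is assumed to remain its centroid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of threshold-based, single-pass clustering. */
final class GreedySketch {

    /** Squared Euclidean distance between two equally sized vectors. */
    static double squaredEuclidean(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Visits each point once: joins the nearest cluster when the distance is
     * within the threshold, otherwise starts a new cluster.
     * Assumption: the first member of a cluster acts as its fixed centroid.
     */
    static List<List<float[]>> cluster(List<float[]> points, double threshold) {
        List<float[]> centroids = new ArrayList<>();
        List<List<float[]>> clusters = new ArrayList<>();
        for (float[] p : points) {
            int best = -1;
            double bestDistance = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.size(); c++) {
                double d = squaredEuclidean(p, centroids.get(c));
                if (d < bestDistance) {
                    bestDistance = d;
                    best = c;
                }
            }
            if (best >= 0 && bestDistance <= threshold) {
                clusters.get(best).add(p);
            } else {
                centroids.add(p);
                List<float[]> created = new ArrayList<>();
                created.add(p);
                clusters.add(created);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<float[]> points = Arrays.asList(
                new float[] {0f, 0f}, new float[] {1f, 0f},
                new float[] {10f, 10f}, new float[] {10f, 11f});
        // The threshold is a squared distance here: 4.0 corresponds to a Euclidean radius of 2.
        List<List<float[]>> clusters = cluster(points, 4.0);
        System.out.println(clusters.size() + " clusters"); // prints "2 clusters"
    }
}
```

Because the pass is sequential, the outcome of such a scheme depends on the order in which the points are visited.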
-
newGreedy
Returns a new greedy, single-pass clusterer using squared Euclidean distance and the given threshold. Because the distance is squared Euclidean, the threshold is compared against squared distances.
- Parameters:
threshold - the maximum allowed distance to join an existing cluster
- Returns:
- a new greedy clusterer
-
newKMeans
Returns a new k-means–style clusterer using the supplied distance measure and number of clusters.
- Parameters:
measure - the distance measure
k - the number of clusters (k >= 1)
- Returns:
- a new k-means clusterer
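For readers unfamiliar with the k-means iteration, here is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm). It is not the library's KMeansClusterer: the class KMeansSketch is hypothetical, plain float[][] arrays stand in for Points, the distance is hard-coded to squared Euclidean, and seeding with the first k points is an assumption made only to keep the example deterministic.

```java
import java.util.Arrays;

/** Hypothetical sketch of the k-means assign/update loop. */
final class KMeansSketch {

    static double squaredEuclidean(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns, for each point, the index of the cluster it ends up in. */
    static int[] cluster(float[][] points, int k, int maxIterations) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("need 1 <= k <= number of points");
        }
        int dim = points[0].length;
        float[][] centroids = new float[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // assumption: the first k points seed the centroids
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDistance = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = squaredEuclidean(points[i], centroids[c]);
                    if (d < bestDistance) {
                        bestDistance = d;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                int c = assignment[i];
                counts[c]++;
                for (int j = 0; j < dim; j++) {
                    sums[c][j] += points[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < dim; j++) {
                        centroids[c][j] = (float) (sums[c][j] / counts[c]);
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        float[][] points = {{0f, 0f}, {10f, 10f}, {1f, 0f}, {10f, 11f}};
        System.out.println(Arrays.toString(cluster(points, 2, 10))); // prints [0, 1, 0, 1]
    }
}
```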
-
newKMeans
Returns a new k-means–style clusterer using squared Euclidean distance and the given number of clusters.
- Parameters:
k - the number of clusters (k >= 1)
- Returns:
- a new k-means clusterer
-
newSpectral
Returns a new spectral clusterer using the supplied distance measure and number of clusters.
Uses a Gaussian kernel and the symmetric normalised Laplacian.
- Parameters:
measure - the distance measure for the kernel
k - the number of clusters (k >= 1)
- Returns:
- a new spectral clusterer
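To make the two ingredients named above concrete, the following sketch builds a Gaussian affinity matrix W with entries exp(-d / (2 * sigma^2)), where d is the squared Euclidean distance, and from it the symmetric normalised Laplacian I - D^(-1/2) W D^(-1/2), with D the diagonal matrix of row sums of W. It is an illustration only, not the library's SpectralClusterer: the class LaplacianSketch, the explicit sigma parameter, and the choice of a zero diagonal in W are all assumptions. The eigen-decomposition that would normally follow is left out.

```java
/** Hypothetical sketch: Gaussian affinities and the symmetric normalised Laplacian. */
final class LaplacianSketch {

    /** W[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for i != j; the diagonal is left at zero. */
    static double[][] affinity(float[][] points, double sigma) {
        int n = points.length;
        double[][] w = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double squared = 0.0;
                for (int d = 0; d < points[i].length; d++) {
                    double diff = points[i][d] - points[j][d];
                    squared += diff * diff;
                }
                double value = Math.exp(-squared / (2.0 * sigma * sigma));
                w[i][j] = value;
                w[j][i] = value;
            }
        }
        return w;
    }

    /** L = I - D^(-1/2) W D^(-1/2), where D holds the row sums of W on its diagonal. */
    static double[][] normalisedLaplacian(double[][] w) {
        int n = w.length;
        double[] invSqrtDegree = new double[n];
        for (int i = 0; i < n; i++) {
            double degree = 0.0;
            for (int j = 0; j < n; j++) {
                degree += w[i][j];
            }
            invSqrtDegree[i] = degree > 0.0 ? 1.0 / Math.sqrt(degree) : 0.0;
        }
        double[][] laplacian = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double identity = (i == j) ? 1.0 : 0.0;
                laplacian[i][j] = identity - invSqrtDegree[i] * w[i][j] * invSqrtDegree[j];
            }
        }
        return laplacian;
    }

    public static void main(String[] args) {
        float[][] points = {{0f, 0f}, {1f, 0f}, {5f, 5f}};
        double[][] laplacian = normalisedLaplacian(affinity(points, 1.0));
        for (double[] row : laplacian) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

In the usual formulation of spectral clustering, the eigenvectors belonging to the k smallest eigenvalues of this matrix provide the embedding, and the embedded rows are then clustered with k-means.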
-
newSpectral
Returns a new spectral clusterer using squared Euclidean distance and the given number of clusters.
- Parameters:
k - the number of clusters (k >= 1)
- Returns:
- a new spectral clusterer
-
cluster
Clusters arbitrary items by first extracting their float feature representation.
Each item is wrapped as a Point using the extractor output. Clustering is then performed by ClusteringAlgorithm.cluster(Collection). The result mirrors the internal clusters but maps back to the original items along with their feature vectors.
- Type Parameters:
T - the item type
- Parameters:
input - the items to cluster (not null)
extractor - a function that returns a non-null float[] feature vector for an item
- Returns:
- a list of clusters, each as a map from the original item to its feature vector, sorted by decreasing size
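The wrap, cluster, and map-back flow described for this method can be sketched as follows. Everything here is a hypothetical stand-in rather than the library's code: ExtractorSketch, its IndexedPoint class (a simplified substitute for Point), and the placeholder splitAtFour algorithm exist only to show how the original items and their feature vectors are recovered and how the clusters are ordered by decreasing size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of clustering arbitrary items via a feature extractor. */
final class ExtractorSketch {

    /** Simplified stand-in for a point: an id plus a feature vector. */
    static final class IndexedPoint {
        final int id;
        final float[] features;

        IndexedPoint(int id, float[] features) {
            this.id = id;
            this.features = features;
        }
    }

    static <T> List<Map<T, float[]>> cluster(
            Collection<T> input,
            Function<T, float[]> extractor,
            Function<List<IndexedPoint>, List<List<IndexedPoint>>> algorithm) {

        // Wrap: the id of each point is the position of its item.
        List<T> items = new ArrayList<>(input);
        List<IndexedPoint> points = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            points.add(new IndexedPoint(i, extractor.apply(items.get(i))));
        }

        // Cluster the points, then map every point back to its original item.
        List<Map<T, float[]>> result = new ArrayList<>();
        for (List<IndexedPoint> cluster : algorithm.apply(points)) {
            Map<T, float[]> mapped = new LinkedHashMap<>();
            for (IndexedPoint p : cluster) {
                mapped.put(items.get(p.id), p.features);
            }
            result.add(mapped);
        }

        // Largest cluster first.
        result.sort((a, b) -> Integer.compare(b.size(), a.size()));
        return result;
    }

    /** Placeholder algorithm: splits on whether the first feature is below 4. */
    static List<List<IndexedPoint>> splitAtFour(List<IndexedPoint> points) {
        List<IndexedPoint> low = new ArrayList<>();
        List<IndexedPoint> high = new ArrayList<>();
        for (IndexedPoint p : points) {
            if (p.features[0] < 4f) {
                low.add(p);
            } else {
                high.add(p);
            }
        }
        List<List<IndexedPoint>> clusters = new ArrayList<>();
        clusters.add(high);
        clusters.add(low);
        return clusters;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "ccc", "dddddd");
        List<Map<String, float[]>> clusters =
                cluster(words, w -> new float[] {w.length()}, ExtractorSketch::splitAtFour);
        for (Map<String, float[]> c : clusters) {
            System.out.println(c.keySet()); // prints [a, bb, ccc] then [dddddd]
        }
    }
}
```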
-
centroid
Function<Collection<Point>, Point> centroid()
Returns a function that computes the centroid of a collection of points.
- Returns:
- centroid function
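A centroid is commonly taken to be the component-wise mean of the member vectors, which is the point minimising the sum of squared Euclidean distances to them. The sketch below shows that computation on plain float[] arrays. CentroidSketch is a hypothetical helper, and whether the library computes centroids this way for every distance measure is an assumption.

```java
import java.util.Arrays;
import java.util.Collection;

/** Hypothetical sketch: centroid as the component-wise mean. */
final class CentroidSketch {

    static float[] centroid(Collection<float[]> points) {
        if (points.isEmpty()) {
            throw new IllegalArgumentException("cannot compute the centroid of no points");
        }
        double[] sums = null;
        for (float[] p : points) {
            if (sums == null) {
                sums = new double[p.length]; // dimension taken from the first point
            }
            for (int j = 0; j < sums.length; j++) {
                sums[j] += p[j];
            }
        }
        float[] mean = new float[sums.length];
        for (int j = 0; j < mean.length; j++) {
            mean[j] = (float) (sums[j] / points.size());
        }
        return mean;
    }

    public static void main(String[] args) {
        float[] c = centroid(Arrays.asList(new float[] {0f, 0f}, new float[] {2f, 4f}));
        System.out.println(Arrays.toString(c)); // prints [1.0, 2.0]
    }
}
```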
-
distance
ToDoubleBiFunction<Point,Point> distance()
Returns a function that computes the distance between two points.
- Returns:
- distance function
-
distance
-
getThreshold
double getThreshold()
Returns the median distance threshold used for greedy clustering and initialisation.
- Returns:
- median distance threshold
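One plausible reading of a "median distance threshold" is the median over all pairwise distances in the input. The sketch below computes exactly that, using squared Euclidean distance on plain float[] arrays. It is offered only as an illustration: MedianDistanceSketch is hypothetical, and which distances the library actually takes the median over is not stated in this page.

```java
import java.util.Arrays;

/** Hypothetical sketch: median of all pairwise squared Euclidean distances. */
final class MedianDistanceSketch {

    static double medianPairwiseDistance(float[][] points) {
        int n = points.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two points");
        }
        double[] distances = new double[n * (n - 1) / 2];
        int index = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double sum = 0.0;
                for (int d = 0; d < points[i].length; d++) {
                    double diff = points[i][d] - points[j][d];
                    sum += diff * diff;
                }
                distances[index++] = sum;
            }
        }
        Arrays.sort(distances);
        int mid = distances.length / 2;
        // Odd count: the middle element; even count: the mean of the two middle elements.
        return distances.length % 2 == 1
                ? distances[mid]
                : (distances[mid - 1] + distances[mid]) / 2.0;
    }

    public static void main(String[] args) {
        float[][] points = {{0f}, {1f}, {3f}};
        // Pairwise squared distances are 1, 9 and 4, so the median is 4.
        System.out.println(medianPairwiseDistance(points)); // prints 4.0
    }
}
```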
-
initialiser
Function<Collection<Point>, List<Point>> initialiser()
Returns a function that generates an initial set of centroids from the input points.
- Returns:
- initialiser function
-
isSquared
boolean isSquared()
Returns true if the configured distance measure is squared Euclidean.
- Returns:
- true if squared Euclidean, false otherwise
-
setup
Prepares the internal distance cache for the given input points and distance measure.
- Parameters:
input - the points to cache distances for
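A pairwise distance cache of this kind can be sketched as a packed upper-triangular array indexed by point position, so that each distance is computed once and looked up afterwards. The class DistanceCacheSketch is hypothetical and is not the library's cache: it assumes points are identified by consecutive indices starting at 0 and hard-codes squared Euclidean distance.

```java
/** Hypothetical sketch: precomputed symmetric distance cache, stored as a packed triangle. */
final class DistanceCacheSketch {

    private final int n;
    private final double[] packed; // entries for i < j, row by row, diagonal excluded

    DistanceCacheSketch(float[][] points) {
        n = points.length;
        packed = new double[n * (n - 1) / 2];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double sum = 0.0;
                for (int d = 0; d < points[i].length; d++) {
                    double diff = points[i][d] - points[j][d];
                    sum += diff * diff;
                }
                packed[index(i, j)] = sum;
            }
        }
    }

    /** Position of the pair (i, j) with i < j inside the packed array. */
    private int index(int i, int j) {
        return i * n - i * (i + 1) / 2 + (j - i - 1);
    }

    /** Cached distance lookup; symmetric, and zero for identical indices. */
    double distance(int i, int j) {
        if (i == j) {
            return 0.0;
        }
        return i < j ? packed[index(i, j)] : packed[index(j, i)];
    }

    public static void main(String[] args) {
        DistanceCacheSketch cache =
                new DistanceCacheSketch(new float[][] {{0f, 0f}, {3f, 4f}, {6f, 8f}});
        System.out.println(cache.distance(0, 1)); // prints 25.0
        System.out.println(cache.distance(2, 0)); // prints 100.0
    }
}
```

Storing only the upper triangle roughly halves the memory of a full n-by-n matrix, at the cost of an index calculation on every lookup.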
-