Class RegexBasedLocationExtractionStrategy
- java.lang.Object
  - com.itextpdf.kernel.pdf.canvas.parser.listener.RegexBasedLocationExtractionStrategy

- All Implemented Interfaces:
- IEventListener, ILocationExtractionStrategy
public class RegexBasedLocationExtractionStrategy extends java.lang.Object implements ILocationExtractionStrategy
This class searches for occurrences of a regular expression and returns the resultant rectangles. Note that this class holds all text locations it has processed, so a single instance can't be used for processing multiple pages. To extract text from several pages of a PDF document, create a new instance of RegexBasedLocationExtractionStrategy for each page. Here is an example of usage with a new instance for each page:
PdfDocument document = new PdfDocument(new PdfReader("..."));
for (int i = 1; i <= document.getNumberOfPages(); ++i) {
    RegexBasedLocationExtractionStrategy extractionStrategy = new RegexBasedLocationExtractionStrategy("");
    PdfCanvasProcessor processor = new PdfCanvasProcessor(extractionStrategy);
    processor.processPageContent(document.getPage(i));
    for (IPdfTextLocation location : extractionStrategy.getResultantLocations()) {
        // process locations ...
    }
}
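The reason for the one-instance-per-page rule can be illustrated without iText at all. The following is a minimal sketch, assuming a listener that appends everything it sees to an internal buffer and never clears it; `AccumulatingMatcher` and `PerPageInstanceDemo` are hypothetical names invented for this illustration and are not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a stateful extraction strategy; not an iText class.
class AccumulatingMatcher {
    private final Pattern pattern;
    // Everything processed so far; deliberately never cleared.
    private final StringBuilder seenText = new StringBuilder();

    AccumulatingMatcher(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Called once per page: appends the page text to the shared buffer.
    void processPage(String pageText) {
        seenText.append(pageText);
    }

    // Runs the pattern over all text seen so far.
    List<String> matches() {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(seenText);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}

class PerPageInstanceDemo {
    public static void main(String[] args) {
        String[] pages = {"total: 10", "total: 20"};

        // Reused instance: after page 2 the results still contain the page 1 match.
        AccumulatingMatcher reused = new AccumulatingMatcher("total: \\d+");
        for (String page : pages) {
            reused.processPage(page);
        }
        System.out.println(reused.matches());

        // Fresh instance per page: each result list belongs to one page only.
        for (String page : pages) {
            AccumulatingMatcher fresh = new AccumulatingMatcher("total: \\d+");
            fresh.processPage(page);
            System.out.println(fresh.matches());
        }
    }
}
```

With a reused instance there is no way to tell which page a match came from, which is why the usage example creates the strategy inside the page loop.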
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
private static class | RegexBasedLocationExtractionStrategy.PdfTextLocationComparator |
-
Field Summary
Fields
Modifier and Type | Field | Description
private static float | EPS |
private java.util.List<CharacterRenderInfo> | parseResult |
private java.util.regex.Pattern | pattern |
-
Constructor Summary
Constructors
Constructor | Description
RegexBasedLocationExtractionStrategy(java.lang.String regex) |
RegexBasedLocationExtractionStrategy(java.util.regex.Pattern pattern) |
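The two constructors differ only in what they accept. As a hedged illustration, and assuming (this page does not confirm it) that the String overload compiles the expression with default flags, passing a precompiled Pattern is the way to control flags such as case-insensitivity. The sketch below uses only java.util.regex; `PatternFlagsDemo` and `countMatches` are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternFlagsDemo {
    // Counts non-overlapping matches of the pattern in the text.
    static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Invoice 42, invoice 43";

        // Default flags: the match is case-sensitive.
        Pattern caseSensitive = Pattern.compile("invoice \\d+");
        // Explicit flag: only possible when the Pattern is built by the caller.
        Pattern caseInsensitive = Pattern.compile("invoice \\d+", Pattern.CASE_INSENSITIVE);

        System.out.println(countMatches(caseSensitive, text));
        System.out.println(countMatches(caseInsensitive, text));
    }
}
```

A Pattern built this way could then be handed to RegexBasedLocationExtractionStrategy(java.util.regex.Pattern pattern).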
-
Method Summary
Methods
Modifier and Type | Method | Description
void | eventOccurred(IEventData data, EventType type) | Called when some event occurs during parsing a content stream.
private static java.lang.Integer | getEndIndex(java.util.Map<java.lang.Integer,java.lang.Integer> indexMap, int index) |
java.util.Collection<IPdfTextLocation> | getResultantLocations() | Returns the Rectangles that have been processed so far.
private static java.lang.Integer | getStartIndex(java.util.Map<java.lang.Integer,java.lang.Integer> indexMap, int index, java.lang.String txt) |
java.util.Set<EventType> | getSupportedEvents() | Provides the set of event types this listener supports.
private void | removeDuplicates(java.util.List<IPdfTextLocation> sortedList) |
protected java.util.List<CharacterRenderInfo> | toCRI(TextRenderInfo tri) | Converts TextRenderInfo to CharacterRenderInfo. This method is protected and not final so that custom implementations can choose to override it.
protected java.util.List<Rectangle> | toRectangles(java.util.List<CharacterRenderInfo> cris) | Converts CharacterRenderInfo objects to Rectangles. This method is protected and not final so that custom implementations can choose to override it.
-
-
-
Field Detail
-
EPS
private static final float EPS
- See Also:
- Constant Field Values
-
pattern
private final java.util.regex.Pattern pattern
-
parseResult
private final java.util.List<CharacterRenderInfo> parseResult
-
-
Method Detail
-
getResultantLocations
public java.util.Collection<IPdfTextLocation> getResultantLocations()
Returns the Rectangles that have been processed so far.
- Specified by:
- getResultantLocations in interface ILocationExtractionStrategy
- Returns:
- Collection<IPdfTextLocation> instance with the current resultant IPdfTextLocations
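To make the idea of "regex match to rectangle" concrete, here is a minimal sketch of one way such a mapping can work: concatenate the characters into a string, run the pattern, and take the union of the bounding boxes of the characters covered by each match. This is not iText's implementation; `CharBox` and `MatchBoxSketch` are hypothetical simplified types, and the sketch leaves out sorting, duplicate removal, and matches that span several lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simplified type: one character plus its bounding box.
class CharBox {
    final char ch;
    final float left, bottom, right, top;

    CharBox(char ch, float left, float bottom, float right, float top) {
        this.ch = ch;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }
}

class MatchBoxSketch {
    // Returns one {left, bottom, right, top} array per non-empty regex match.
    static List<float[]> locate(Pattern pattern, List<CharBox> chars) {
        // Build the searchable text; index i in the text is chars.get(i).
        StringBuilder text = new StringBuilder();
        for (CharBox c : chars) {
            text.append(c.ch);
        }

        List<float[]> boxes = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // an empty match covers no characters
            }
            float left = Float.MAX_VALUE, bottom = Float.MAX_VALUE;
            float right = -Float.MAX_VALUE, top = -Float.MAX_VALUE;
            // Union of the boxes of every character inside the match.
            for (int i = m.start(); i < m.end(); i++) {
                CharBox c = chars.get(i);
                left = Math.min(left, c.left);
                bottom = Math.min(bottom, c.bottom);
                right = Math.max(right, c.right);
                top = Math.max(top, c.top);
            }
            boxes.add(new float[] {left, bottom, right, top});
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Four characters, each 10 units wide and 12 units tall, on one line.
        List<CharBox> chars = new ArrayList<>();
        String s = "ab12";
        for (int i = 0; i < s.length(); i++) {
            chars.add(new CharBox(s.charAt(i), i * 10f, 0f, (i + 1) * 10f, 12f));
        }
        for (float[] box : locate(Pattern.compile("\\d+"), chars)) {
            System.out.println(java.util.Arrays.toString(box));
        }
    }
}
```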
-
eventOccurred
public void eventOccurred(IEventData data, EventType type)
Called when some event occurs during parsing a content stream.
- Specified by:
- eventOccurred in interface IEventListener
- Parameters:
- data - Combines the data required for processing corresponding event type.
- type - Event type.
-
getSupportedEvents
public java.util.Set<EventType> getSupportedEvents()
Provides the set of event types this listener supports. Returns null if all possible event types are supported.
- Specified by:
- getSupportedEvents in interface IEventListener
- Returns:
- Set of event types supported by this listener or null if all possible event types are supported.
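The "null means all event types" convention described above can be sketched as a small dispatch check. This is an illustration under assumptions, not the library's dispatcher: `SketchEvent` is a hypothetical enum standing in for EventType, and `shouldDispatch` is a hypothetical helper.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical event enum; iText's EventType defines its own constants.
enum SketchEvent { RENDER_TEXT, RENDER_IMAGE, RENDER_PATH }

class EventFilterSketch {
    // A null set is treated as "the listener accepts every event type".
    static boolean shouldDispatch(Set<SketchEvent> supported, SketchEvent event) {
        return supported == null || supported.contains(event);
    }

    public static void main(String[] args) {
        Set<SketchEvent> textOnly = EnumSet.of(SketchEvent.RENDER_TEXT);

        System.out.println(shouldDispatch(textOnly, SketchEvent.RENDER_TEXT));
        System.out.println(shouldDispatch(textOnly, SketchEvent.RENDER_IMAGE));
        // null: no filtering, every event is delivered.
        System.out.println(shouldDispatch(null, SketchEvent.RENDER_IMAGE));
    }
}
```

A caller that consumes getSupportedEvents() therefore has to check for null before calling contains on the result.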
-
toCRI
protected java.util.List<CharacterRenderInfo> toCRI(TextRenderInfo tri)
Converts a TextRenderInfo object to a list of CharacterRenderInfo objects. This method is protected and not final so that custom implementations can choose to override it. Other implementations of CharacterRenderInfo may choose to store different properties than merely the Rectangle describing the bounding box. E.g. a custom implementation might choose to store Color information as well, to better match the content surrounding the redaction Rectangle.
- Parameters:
- tri - TextRenderInfo object
- Returns:
- a list of CharacterRenderInfos which represents the passed TextRenderInfo
-
toRectangles
protected java.util.List<Rectangle> toRectangles(java.util.List<CharacterRenderInfo> cris)
Converts CharacterRenderInfo objects to Rectangles. This method is protected and not final so that custom implementations can choose to override it. E.g. other implementations may choose to add padding/margin to the Rectangles. This method also offers a convenient access point to the mapping of CharacterRenderInfo to Rectangle. This mapping enables custom implementations to match the color of text in redacted Rectangles, or to match the color of the background, by the mere virtue of offering access to the CharacterRenderInfo objects that generated the Rectangle.
- Parameters:
- cris - list of CharacterRenderInfo objects
- Returns:
- a list of Rectangles generated from the passed CharacterRenderInfo objects
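The padding use case mentioned above amounts to growing each rectangle by a fixed margin on every side. The sketch below shows only that arithmetic, using a hypothetical `PaddedRect` type in place of iText's Rectangle. In an actual subclass, the same step would presumably be applied to the list returned by super.toRectangles(cris), which is an assumption about how one might structure the override rather than something this page specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified rectangle: lower-left corner plus size.
class PaddedRect {
    final float x, y, width, height;

    PaddedRect(float x, float y, float width, float height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

class PaddingSketch {
    // Returns new rectangles grown by `padding` on all four sides.
    static List<PaddedRect> pad(List<PaddedRect> rects, float padding) {
        List<PaddedRect> padded = new ArrayList<>();
        for (PaddedRect r : rects) {
            padded.add(new PaddedRect(
                    r.x - padding,             // move left edge outward
                    r.y - padding,             // move bottom edge outward
                    r.width + 2 * padding,     // left + right margin
                    r.height + 2 * padding));  // bottom + top margin
        }
        return padded;
    }

    public static void main(String[] args) {
        List<PaddedRect> input = new ArrayList<>();
        input.add(new PaddedRect(10f, 20f, 30f, 12f));
        for (PaddedRect r : pad(input, 2f)) {
            System.out.println(r.x + " " + r.y + " " + r.width + " " + r.height);
        }
    }
}
```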
-
removeDuplicates
private void removeDuplicates(java.util.List<IPdfTextLocation> sortedList)
-
getStartIndex
private static java.lang.Integer getStartIndex(java.util.Map<java.lang.Integer,java.lang.Integer> indexMap, int index, java.lang.String txt)
-
getEndIndex
private static java.lang.Integer getEndIndex(java.util.Map<java.lang.Integer,java.lang.Integer> indexMap, int index)
-
-