Class WebcrawlerConnector

  • All Implemented Interfaces:
    org.apache.manifoldcf.core.interfaces.IConnector, org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector

    public class WebcrawlerConnector
    extends org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
    This is the Web Crawler implementation of the IRepositoryConnector interface. This connector may be superseded by one that calls out to Python, or by an entirely Python-based Connector Framework, depending on how the winds blow.
    • Method Summary

      Modifier and Type Method Description
      java.lang.String addSeedDocuments​(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities, org.apache.manifoldcf.core.interfaces.Specification spec, java.lang.String lastSeedVersion, long seedTime, int jobMode)
      Queue "seed" documents.
      protected java.lang.String[] calculateDocumentEvents​(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String documentIdentifier)
      Calculate events that should be associated with a document.
      java.lang.String check()
      Check status of connection.
      protected int checkFetchAllowed​(java.lang.String documentIdentifier, java.lang.String protocol, java.lang.String hostIPAddress, int port, PageCredentials credential, org.apache.manifoldcf.connectorcommon.interfaces.IKeystoreManager trustStore, java.lang.String hostName, java.lang.String[] binNames, long currentTime, java.lang.String pathString, org.apache.manifoldcf.crawler.interfaces.IProcessActivity versionActivities, int connectionLimit, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword)
      Check robots to see if fetch is allowed.
      void clearThreadContext()
      Clear out any state information specific to a given thread.
      protected static void compileList​(java.util.List<java.util.regex.Pattern> output, java.util.List<java.lang.String> input)
      Compile all regexp entries in the passed in list, and add them to the output list.
      void deinstall​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
      Uninstall the connector.
      void disconnect()
      Close the connection.
      protected java.lang.String doCanonicalization​(WebcrawlerConnector.DocumentURLFilter filter, WebURL url)
      Code to canonicalize a URL.
      protected java.lang.String documentIdentifiertoFileName​(java.lang.String documentIdentifier)
      Convert a document identifier to filename.
      protected static java.lang.String extractContentType​(java.lang.String contentType)  
      protected static java.lang.String extractEncoding​(java.lang.String contentType)  
      protected boolean extractLinks​(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, WebcrawlerConnector.DocumentURLFilter filter)
      Code to extract links from an already-fetched document.
      protected static java.lang.String extractMimeType​(java.lang.String contentType)  
      protected static java.util.Set<java.lang.String> findExcludedHeaders​(org.apache.manifoldcf.core.interfaces.Specification spec)
      Read a document specification to get a set of excluded headers
      protected FormData findHTMLForm​(java.lang.String currentURI, LoginParameters lp)
      Find matching HTML form data, if present.
      protected java.lang.String findHTMLLinkURI​(java.lang.String currentURI, LoginParameters lp)
      Find HTML link URI, if present, making sure specified preference is matched.
      protected java.lang.String findPreferredRedirectionURI​(java.lang.String currentURI, LoginParameters lp)
      Find a preferred redirection URI, if it exists
      protected java.lang.String findRedirectionURI​(java.lang.String currentURI)
      Find a redirection URI, if it exists
      protected java.lang.String findSpecifiedContent​(java.lang.String currentURI, LoginParameters lp)
      Find existence of specific content on the page (never finds a URL)
      protected static java.lang.String[] getAcls​(org.apache.manifoldcf.core.interfaces.Specification spec)
      Get the forced ACLs from the document specification.
      java.lang.String[] getActivitiesList()
      Return the list of activities that this connector supports (i.e. writes into the log).
      java.lang.String[] getBinNames​(java.lang.String documentIdentifier)
      Get the bin name string for a document identifier.
      int getConnectorModel()
      Tell the world what model this connector uses for getDocumentIdentifiers().
      int getMaxDocumentRequest()
      Get the maximum number of documents to amalgamate together into one batch, for this connector.
      protected PageCredentials getPageCredential​(java.lang.String documentIdentifier)
      Get the page credentials for a given document identifier (URL)
      java.lang.String[] getRelationshipTypes()
      Return the list of relationship types that this connector recognizes.
      protected SequenceCredentials getSequenceCredential​(java.lang.String documentIdentifier)
      Get the sequence credentials for a given document identifier (URL)
      protected void getSession()
      Start a session
      protected org.apache.manifoldcf.connectorcommon.interfaces.IKeystoreManager getTrustStore​(java.lang.String documentIdentifier)
      Get the trust store for a given document identifier (URL)
      protected void handleHTML​(java.lang.String documentURI, IHTMLHandler handler)
      Handle document references from HTML
      protected static void handleIOException​(java.io.IOException e, java.lang.String context)  
      protected void handleRedirects​(java.lang.String documentURI, IRedirectionHandler handler)
      Handle extracting the redirect link from a redirect response.
      protected void handleXML​(java.lang.String documentURI, IXMLHandler handler)
      Handle document references from XML.
      void install​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
      Install the connector.
      protected boolean isContentInteresting​(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities, java.lang.String documentIdentifier, int response, java.lang.String contentType)
      Code to check if data is interesting, based on response code and content type.
      protected boolean isDocumentText​(java.lang.String documentURI)
      Is the document text, as far as we can tell?
      protected static boolean isStrange​(byte x)
      Check if a byte is not typical ASCII or UTF-8.
      protected static boolean isText​(byte[] beginChunk, int chunkLength)
      Test to see if a document is text or not.
      protected static boolean isWhiteSpace​(byte x)
      Check if a byte is a whitespace character.
      protected void loginAndFetch​(WebcrawlerConnector.FetchStatus fetchStatus, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, java.lang.String documentIdentifier, SequenceCredentials sessionCredential, java.lang.String globalSequenceEvent)  
      protected int lookupIPAddress​(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, java.lang.String hostName, long currentTime, java.lang.StringBuilder ipAddressBuffer)
      Look up an IP address given a non-canonical host name.
      protected java.lang.String makeDNSEventName​(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String hostNameKey)
      Calculate the event name for DNS access.
      protected java.lang.String makeDocumentIdentifier​(java.lang.String parentIdentifier, java.lang.String rawURL, WebcrawlerConnector.DocumentURLFilter filter, org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities)
      Convert an absolute or relative URL to a document identifier.
      protected java.lang.String makeRobotsEventName​(org.apache.manifoldcf.crawler.interfaces.INamingActivity versionActivities, java.lang.String robotsKey)
      Construct a name for the global web-connector robots event.
      protected static java.lang.String makeRobotsKey​(java.lang.String protocol, java.lang.String hostName, int port)
      Construct the robots key for a host.
      protected java.lang.String makeSessionLoginEventName​(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String sequenceKey)
      Calculate the event name for session login.
      void outputConfigurationBody​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.lang.String tabName)
      Output the configuration body section.
      void outputConfigurationHeader​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.util.List<java.lang.String> tabsArray)
      Output the configuration header section.
      void outputSpecificationBody​(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber, int actualSequenceNumber, java.lang.String tabName)
      Output the specification body section.
      void outputSpecificationHeader​(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber, java.util.List<java.lang.String> tabsArray)
      Output the specification header section.
      void poll()
      This method is periodically called for all connectors that are connected but not in active use.
      java.lang.String processConfigurationPost​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
      Process a configuration post.
      protected void processDocument​(org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, java.lang.String documentIdentifier, java.lang.String versionString, boolean indexDocument, java.util.Map<java.lang.String,​java.util.Set<java.lang.String>> metaHash, java.lang.String[] acls, WebcrawlerConnector.DocumentURLFilter filter)  
      void processDocuments​(java.lang.String[] documentIdentifiers, org.apache.manifoldcf.crawler.interfaces.IExistingVersions statuses, org.apache.manifoldcf.core.interfaces.Specification spec, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
      Process a set of documents.
      java.lang.String processSpecificationPost​(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber)
      Process a specification post.
      protected static java.util.List<java.lang.String> stringToArray​(java.lang.String input)
      Read a string as a sequence of individual expressions, urls, etc.
      void viewConfiguration​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
      View configuration.
      void viewSpecification​(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, java.util.Locale locale, org.apache.manifoldcf.core.interfaces.Specification ds, int connectionSequenceNumber)
      View specification.
      • Methods inherited from class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector

        getFormCheckJavascriptMethodName, getFormPresaveCheckJavascriptMethodName, requestInfo
      • Methods inherited from class org.apache.manifoldcf.core.connector.BaseConnector

        connect, getConfiguration, isConnected, outputConfigurationBody, outputConfigurationHeader, outputConfigurationHeader, pack, packFixedList, packList, packList, processConfigurationPost, setThreadContext, unpack, unpackFixedList, unpackList, viewConfiguration
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • Methods inherited from interface org.apache.manifoldcf.core.interfaces.IConnector

        connect, getConfiguration, isConnected, setThreadContext
    • Field Detail

      • RESULTSTATUS_NOTYETDETERMINED

        protected static final int RESULTSTATUS_NOTYETDETERMINED
        See Also:
        Constant Field Values
      • interestingMimeTypeArray

        protected static final java.lang.String[] interestingMimeTypeArray
        This represents a list of the mime types that this connector knows how to extract links from. Documents that are indexable are described by the output connector.
      • interestingMimeTypeMap

        protected static final java.util.Set<java.lang.String> interestingMimeTypeMap
      • understoodProtocols

        protected static final java.util.Set<java.lang.String> understoodProtocols
      • ACTIVITY_PROCESS

        public static final java.lang.String ACTIVITY_PROCESS
        See Also:
        Constant Field Values
      • ACTIVITY_ROBOTSPARSE

        public static final java.lang.String ACTIVITY_ROBOTSPARSE
        See Also:
        Constant Field Values
      • ACTIVITY_LOGON_START

        public static final java.lang.String ACTIVITY_LOGON_START
        See Also:
        Constant Field Values
      • ACTIVITY_LOGON_END

        public static final java.lang.String ACTIVITY_LOGON_END
        See Also:
        Constant Field Values
      • reservedHeaders

        protected static final java.util.Set<java.lang.String> reservedHeaders
      • potentiallyExcludedHeaders

        protected static final java.util.List<java.lang.String> potentiallyExcludedHeaders
      • robotsUsage

        protected int robotsUsage
        Robots usage flag
      • metaRobotsTagsUsage

        protected int metaRobotsTagsUsage
        Meta robots tag usage flag
      • userAgent

        protected java.lang.String userAgent
        The user-agent for this connector instance
      • from

        protected java.lang.String from
        The email address for this connector instance
      • connectionTimeoutMilliseconds

        protected int connectionTimeoutMilliseconds
        Connection timeout, milliseconds.
      • socketTimeoutMilliseconds

        protected int socketTimeoutMilliseconds
        Socket timeout, milliseconds
      • throttleGroupName

        protected java.lang.String throttleGroupName
        Throttle group name
      • throttleDescription

        protected ThrottleDescription throttleDescription
        The throttle description
      • trustsDescription

        protected TrustsDescription trustsDescription
        The trusts description
      • robotsManager

        protected RobotsManager robotsManager
        The robots manager currently used by this instance
      • dnsManager

        protected DNSManager dnsManager
        The DNS manager currently used by this instance
      • cookieManager

        protected CookieManager cookieManager
        The cookie manager used by this instance
      • isInitialized

        protected boolean isInitialized
        This flag is set when the instance has been initialized
      • cache

        protected static DataCache cache
        This is where we keep data around between the getVersions() phase and the processDocuments() phase.
      • proxyHost

        protected java.lang.String proxyHost
        Proxy host
      • proxyPort

        protected int proxyPort
        Proxy port
      • proxyAuthDomain

        protected java.lang.String proxyAuthDomain
        Proxy auth domain
      • proxyAuthUsername

        protected java.lang.String proxyAuthUsername
        Proxy auth user name
      • proxyAuthPassword

        protected java.lang.String proxyAuthPassword
        Proxy auth password
      • SESSIONSTATE_NORMAL

        protected static final int SESSIONSTATE_NORMAL
        Normal fetch of content document. (For all we know, we're logged in already).
        See Also:
        Constant Field Values
      • SESSIONSTATE_LOGIN

        protected static final int SESSIONSTATE_LOGIN
        We're in 'login mode'
        See Also:
        Constant Field Values
      • RESULT_VERSION_NEEDED

        protected static final int RESULT_VERSION_NEEDED
        See Also:
        Constant Field Values
      • RESULT_RETRY_DOCUMENT

        protected static final int RESULT_RETRY_DOCUMENT
        See Also:
        Constant Field Values
    • Constructor Detail

      • WebcrawlerConnector

        public WebcrawlerConnector()
        Constructor.
    • Method Detail

      • getConnectorModel

        public int getConnectorModel()
        Tell the world what model this connector uses for getDocumentIdentifiers(). This must return one of the connector model values defined in IRepositoryConnector.
        Specified by:
        getConnectorModel in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        getConnectorModel in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Returns:
        the model type value.
      • install

        public void install​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
                     throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Install the connector. This method is called to initialize persistent storage for the connector, such as database tables etc. It is called when the connector is registered.
        Specified by:
        install in interface org.apache.manifoldcf.core.interfaces.IConnector
        Overrides:
        install in class org.apache.manifoldcf.core.connector.BaseConnector
        Parameters:
        threadContext - is the current thread context.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • deinstall

        public void deinstall​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Uninstall the connector. This method is called to remove persistent storage for the connector, such as database tables etc. It is called when the connector is deregistered.
        Specified by:
        deinstall in interface org.apache.manifoldcf.core.interfaces.IConnector
        Overrides:
        deinstall in class org.apache.manifoldcf.core.connector.BaseConnector
        Parameters:
        threadContext - is the current thread context.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • getActivitiesList

        public java.lang.String[] getActivitiesList()
        Return the list of activities that this connector supports (i.e. writes into the log).
        Specified by:
        getActivitiesList in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        getActivitiesList in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Returns:
        the list.
      • getRelationshipTypes

        public java.lang.String[] getRelationshipTypes()
        Return the list of relationship types that this connector recognizes.
        Specified by:
        getRelationshipTypes in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        getRelationshipTypes in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Returns:
        the list.
      • clearThreadContext

        public void clearThreadContext()
        Clear out any state information specific to a given thread. This method is called when this object is returned to the connection pool.
        Specified by:
        clearThreadContext in interface org.apache.manifoldcf.core.interfaces.IConnector
        Overrides:
        clearThreadContext in class org.apache.manifoldcf.core.connector.BaseConnector
      • getSession

        protected void getSession()
                           throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Start a session
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • poll

        public void poll()
                  throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        This method is periodically called for all connectors that are connected but not in active use.
        Specified by:
        poll in interface org.apache.manifoldcf.core.interfaces.IConnector
        Overrides:
        poll in class org.apache.manifoldcf.core.connector.BaseConnector
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • check

        public java.lang.String check()
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Check status of connection.
        Specified by:
        check in interface org.apache.manifoldcf.core.interfaces.IConnector
        Overrides:
        check in class org.apache.manifoldcf.core.connector.BaseConnector
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • disconnect

        public void disconnect()
                        throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Close the connection. Call this before discarding the repository connector.
        Specified by:
        disconnect in interface org.apache.manifoldcf.core.interfaces.IConnector
        Overrides:
        disconnect in class org.apache.manifoldcf.core.connector.BaseConnector
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • getBinNames

        public java.lang.String[] getBinNames​(java.lang.String documentIdentifier)
        Get the bin name string for a document identifier. The bin name describes the queue to which the document will be assigned for throttling purposes. Throttling controls the rate at which items in a given queue are fetched; it does not say anything about the overall fetch rate, which may operate on multiple queues or bins. For example, if you implement a web crawler, a good choice of bin name would be the server name, since that is likely to correspond to a real resource that will need real throttle protection.
        Specified by:
        getBinNames in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        getBinNames in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Parameters:
        documentIdentifier - is the document identifier.
        Returns:
        the bin name.
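        The server-name suggestion above can be sketched as follows. This is a minimal illustration, not the connector's actual code: the class BinNameSketch and the method binNamesFor are hypothetical names, and the empty-string fallback bin for unparseable identifiers is an assumption made only for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/**
 * Illustrative sketch only: derives a throttling bin name from a URL-style
 * document identifier. BinNameSketch and binNamesFor are hypothetical names,
 * not part of ManifoldCF.
 */
final class BinNameSketch {
  private BinNameSketch() {}

  /** Returns a one-element array holding the lower-cased host, or "" if no host can be parsed. */
  static String[] binNamesFor(String documentIdentifier) {
    try {
      String host = new URI(documentIdentifier).getHost();
      if (host == null) {
        // Identifiers without a host (assumption) share one fallback bin.
        return new String[] {""};
      }
      // Lower-case so that differently cased host names land in the same bin.
      return new String[] {host.toLowerCase(Locale.ROOT)};
    } catch (URISyntaxException e) {
      // Unparseable identifiers also go to the fallback bin in this sketch.
      return new String[] {""};
    }
  }
}
```

        With this sketch, http://Example.COM:8080/docs/index.html and http://example.com/other would be assigned to the same bin, so both would be throttled together as requests to one server.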
      • addSeedDocuments

        public java.lang.String addSeedDocuments​(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities,
                                                 org.apache.manifoldcf.core.interfaces.Specification spec,
                                                 java.lang.String lastSeedVersion,
                                                 long seedTime,
                                                 int jobMode)
                                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                                 org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        Queue "seed" documents. Seed documents are the starting places for crawling activity. Documents are seeded when this method calls appropriate methods in the passed-in ISeedingActivity object.

        This method can choose to find repository changes that happen only during the specified time interval. The seeds recorded by this method will be viewed by the framework based on what the getConnectorModel() method returns. It is not a big problem if the connector chooses to create more seeds than are strictly necessary; it is merely a question of overall work required.

        The end time and seeding version string passed to this method may be interpreted for greatest efficiency. For continuous crawling jobs, this method will be called once, when the job starts, and at various periodic intervals as the job executes.

        When a job's specification is changed, the framework automatically resets the seeding version string to null. The seeding version string may also be set to null on each job run, depending on the connector model returned by getConnectorModel().

        Note that it is always OK to send more documents rather than fewer to this method. The connector will be connected before this method can be called.
        Specified by:
        addSeedDocuments in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        addSeedDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Parameters:
        activities - is the interface this method should use to perform whatever framework actions are desired.
        spec - is a document specification (that comes from the job).
        seedTime - is the end of the time range of documents to consider, exclusive.
        lastSeedVersion - is the last seeding version string for this job, or null if the job has no previous seeding version string.
        jobMode - is an integer describing how the job is being run, whether continuous or once-only.
        Returns:
        an updated seeding version string, to be stored with the job.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
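        The interplay of lastSeedVersion, seedTime, and the returned version string can be sketched as follows. This is a hedged illustration of the general contract, not a description of how WebcrawlerConnector itself seeds. SeedingSketch, SeedSink, and the modifiedTimes map are hypothetical stand-ins for the connector, the ISeedingActivity object, and a repository; storing the end time as the version string is an assumption of this example.

```java
import java.util.Map;

/** Illustrative sketch only; none of these names are part of ManifoldCF. */
final class SeedingSketch {
  private SeedingSketch() {}

  /** Hypothetical stand-in for the framework's seeding activity object. */
  interface SeedSink {
    void addSeedDocument(String documentIdentifier);
  }

  /**
   * Queues every document modified in [start, seedTime), where start is parsed
   * from lastSeedVersion (or 0 when it is null), and returns the new seeding
   * version string to be stored with the job.
   */
  static String addSeeds(SeedSink activities, Map<String, Long> modifiedTimes,
      String lastSeedVersion, long seedTime) {
    // A null version string means "no previous seeding": consider everything.
    long start = (lastSeedVersion == null) ? 0L : Long.parseLong(lastSeedVersion);
    for (Map.Entry<String, Long> entry : modifiedTimes.entrySet()) {
      long modified = entry.getValue();
      // seedTime is the exclusive end of the interval.
      if (modified >= start && modified < seedTime) {
        activities.addSeedDocument(entry.getKey());
      }
    }
    // The next call starts where this interval ended.
    return Long.toString(seedTime);
  }
}
```

        Because the framework resets the version string to null when the job specification changes, a sketch like this would fall back to seeding every document on the next call.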
      • processDocuments

        public void processDocuments​(java.lang.String[] documentIdentifiers,
                                     org.apache.manifoldcf.crawler.interfaces.IExistingVersions statuses,
                                     org.apache.manifoldcf.core.interfaces.Specification spec,
                                     org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                                     int jobMode,
                                     boolean usesDefaultAuthority)
                              throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                     org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        Process a set of documents. This method should cause each document to be fetched and processed, with the results added to the queue of documents for the current job and/or entered into the incremental ingestion manager. The document specification allows this class to filter what is done based on the job. The connector will be connected before this method can be called.
        Specified by:
        processDocuments in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        processDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Parameters:
        documentIdentifiers - is the set of document identifiers to process.
        statuses - are the currently-stored document versions for each document in the set of document identifiers passed in above.
        activities - is the interface this method should use to queue up new document references and ingest documents.
        jobMode - is an integer describing how the job is being run, whether continuous or once-only.
        usesDefaultAuthority - will be true only if the authority in use for these documents is the default one.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
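        One common use of the stored versions passed in as statuses is to skip documents whose version string has not changed. The sketch below illustrates that idea only; it is an assumption about typical usage, not the body of this method. VersionCheckSketch and both maps are hypothetical stand-ins for the IExistingVersions object and for freshly computed version strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these names are part of ManifoldCF. */
final class VersionCheckSketch {
  private VersionCheckSketch() {}

  /**
   * Returns, in input order, the identifiers whose current version string
   * differs from the stored one or that have no stored version at all.
   */
  static List<String> documentsNeedingWork(String[] documentIdentifiers,
      Map<String, String> storedVersions, Map<String, String> currentVersions) {
    List<String> result = new ArrayList<>();
    for (String id : documentIdentifiers) {
      String stored = storedVersions.get(id);   // null: never processed before
      String current = currentVersions.get(id);
      if (stored == null || !stored.equals(current)) {
        result.add(id);
      }
    }
    return result;
  }
}
```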
      • loginAndFetch

        protected void loginAndFetch​(WebcrawlerConnector.FetchStatus fetchStatus,
                                     org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                                     java.lang.String documentIdentifier,
                                     SequenceCredentials sessionCredential,
                                     java.lang.String globalSequenceEvent)
                              throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                     org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
      • processDocument

        protected void processDocument​(org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                                       java.lang.String documentIdentifier,
                                       java.lang.String versionString,
                                       boolean indexDocument,
                                       java.util.Map<java.lang.String,​java.util.Set<java.lang.String>> metaHash,
                                       java.lang.String[] acls,
                                       WebcrawlerConnector.DocumentURLFilter filter)
                                throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                       org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
      • extractContentType

        protected static java.lang.String extractContentType​(java.lang.String contentType)
      • extractEncoding

        protected static java.lang.String extractEncoding​(java.lang.String contentType)
      • extractMimeType

        protected static java.lang.String extractMimeType​(java.lang.String contentType)
      • handleIOException

        protected static void handleIOException​(java.io.IOException e,
                                                java.lang.String context)
                                         throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                                org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
      • getMaxDocumentRequest

        public int getMaxDocumentRequest()
        Get the maximum number of documents to amalgamate together into one batch, for this connector.
        Specified by:
        getMaxDocumentRequest in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        getMaxDocumentRequest in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Returns:
        the maximum number. 0 indicates "unlimited".
      • outputConfigurationHeader

        public void outputConfigurationHeader​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                              org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                              java.util.Locale locale,
                                              org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
                                              java.util.List<java.lang.String> tabsArray)
                                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                              java.io.IOException
        Output the configuration header section. This method is called in the head section of the connector's configuration page. Its purpose is to add the required tabs to the list, and to output any JavaScript methods that might be needed by the configuration editing HTML.
        Specified by:
        outputConfigurationHeader in interface org.apache.manifoldcf.core.interfaces.IConnector
        Overrides:
        outputConfigurationHeader in class org.apache.manifoldcf.core.connector.BaseConnector
        Parameters:
        threadContext - is the local thread context.
        out - is the output to which any HTML should be sent.
        parameters - are the configuration parameters, as they currently exist, for this connection being configured.
        tabsArray - is the list of tab names. Add to this list any tab names that are specific to the connector.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        java.io.IOException
      • outputConfigurationBody

        public void outputConfigurationBody​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                            org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                            java.util.Locale locale,
                                            org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
                                            java.lang.String tabName)
                                     throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                            java.io.IOException
        Output the configuration body section. This method is called in the body section of the connector's configuration page. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editconnection".
        Specified by:
        outputConfigurationBody in interface org.apache.manifoldcf.core.interfaces.IConnector
        Overrides:
        outputConfigurationBody in class org.apache.manifoldcf.core.connector.BaseConnector
        Parameters:
        threadContext - is the local thread context.
        out - is the output to which any HTML should be sent.
        parameters - are the configuration parameters, as they currently exist, for this connection being configured.
        tabName - is the current tab name.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        java.io.IOException
      • processConfigurationPost

        public java.lang.String processConfigurationPost​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                                         org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
                                                         java.util.Locale locale,
                                                         org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
                                                  throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Process a configuration post. This method is called at the start of the connector's configuration page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the configuration parameters accordingly. The name of the posted form is "editconnection".
        Specified by:
        processConfigurationPost in interface org.apache.manifoldcf.core.interfaces.IConnector
        Overrides:
        processConfigurationPost in class org.apache.manifoldcf.core.connector.BaseConnector
        Parameters:
        threadContext - is the local thread context.
        variableContext - is the set of variables available from the post, including binary file post information.
        parameters - are the configuration parameters, as they currently exist, for this connection being configured.
        Returns:
        null if all is well, or a string error message if there is an error that should prevent saving of the connection (and cause a redirection to an error page).
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • viewConfiguration

        public void viewConfiguration​(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                      org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                      java.util.Locale locale,
                                      org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                      java.io.IOException
        View configuration. This method is called in the body section of the connector's view configuration page. Its purpose is to present the connection information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.
        Specified by:
        viewConfiguration in interface org.apache.manifoldcf.core.interfaces.IConnector
        Overrides:
        viewConfiguration in class org.apache.manifoldcf.core.connector.BaseConnector
        Parameters:
        threadContext - is the local thread context.
        out - is the output to which any HTML should be sent.
        parameters - are the configuration parameters, as they currently exist, for this connection being configured.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        java.io.IOException
      • outputSpecificationHeader

        public void outputSpecificationHeader​(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                              java.util.Locale locale,
                                              org.apache.manifoldcf.core.interfaces.Specification ds,
                                              int connectionSequenceNumber,
                                              java.util.List<java.lang.String> tabsArray)
                                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                              java.io.IOException
        Output the specification header section. This method is called in the head section of a job page which has selected a repository connection of the current type. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the job editing HTML. The connector will be connected before this method can be called.
        Specified by:
        outputSpecificationHeader in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        outputSpecificationHeader in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Parameters:
        out - is the output to which any HTML should be sent.
        locale - is the locale the output is preferred to be in.
        ds - is the current document specification for this job.
        connectionSequenceNumber - is the unique number of this connection within the job.
        tabsArray - is the list of tab names. Add to this list any tab names that are specific to the connector.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        java.io.IOException
      • outputSpecificationBody

        public void outputSpecificationBody​(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                            java.util.Locale locale,
                                            org.apache.manifoldcf.core.interfaces.Specification ds,
                                            int connectionSequenceNumber,
                                            int actualSequenceNumber,
                                            java.lang.String tabName)
                                     throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                            java.io.IOException
        Output the specification body section. This method is called in the body section of a job page which has selected a repository connection of the current type. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is always "editjob". The connector will be connected before this method can be called.
        Specified by:
        outputSpecificationBody in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        outputSpecificationBody in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Parameters:
        out - is the output to which any HTML should be sent.
        locale - is the locale the output is preferred to be in.
        ds - is the current document specification for this job.
        connectionSequenceNumber - is the unique number of this connection within the job.
        actualSequenceNumber - is the connection within the job that has currently been selected.
        tabName - is the current tab name. (actualSequenceNumber, tabName) form a unique tuple within the job.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        java.io.IOException
      • processSpecificationPost

        public java.lang.String processSpecificationPost​(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
                                                         java.util.Locale locale,
                                                         org.apache.manifoldcf.core.interfaces.Specification ds,
                                                         int connectionSequenceNumber)
                                                  throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Process a specification post. This method is called at the start of a job's edit or view page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the document specification accordingly. The name of the posted form is always "editjob". The connector will be connected before this method can be called.
        Specified by:
        processSpecificationPost in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        processSpecificationPost in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Parameters:
        variableContext - contains the post data, including binary file-upload information.
        locale - is the locale the output is preferred to be in.
        ds - is the current document specification for this job.
        connectionSequenceNumber - is the unique number of this connection within the job.
        Returns:
        null if all is well, or a string error message if there is an error that should prevent saving of the job (and cause a redirection to an error page).
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • viewSpecification

        public void viewSpecification​(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                      java.util.Locale locale,
                                      org.apache.manifoldcf.core.interfaces.Specification ds,
                                      int connectionSequenceNumber)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                      java.io.IOException
        View specification. This method is called in the body section of a job's view page. Its purpose is to present the document specification information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags. The connector will be connected before this method can be called.
        Specified by:
        viewSpecification in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
        Overrides:
        viewSpecification in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
        Parameters:
        out - is the output to which any HTML should be sent.
        locale - is the locale the output is preferred to be in.
        ds - is the current document specification for this job.
        connectionSequenceNumber - is the unique number of this connection within the job.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        java.io.IOException
      • makeSessionLoginEventName

        protected java.lang.String makeSessionLoginEventName​(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
                                                             java.lang.String sequenceKey)
        Calculate the event name for session login.
      • makeDNSEventName

        protected java.lang.String makeDNSEventName​(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
                                                    java.lang.String hostNameKey)
        Calculate the event name for DNS access.
      • lookupIPAddress

        protected int lookupIPAddress​(java.lang.String documentIdentifier,
                                      org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                                      java.lang.String hostName,
                                      long currentTime,
                                      java.lang.StringBuilder ipAddressBuffer)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                      org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        Look up an IP address given a non-canonical host name.
        Returns:
        appropriate status.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
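        The general shape of such a lookup (write the resolved address into a caller-supplied buffer, return a status code) can be sketched with the standard java.net.InetAddress API. This is only an illustration: the class name and the two status constants below are hypothetical, and the real method also takes the document identifier, activities object, and current time, which are omitted here.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical sketch; not the connector's actual implementation.
class IpLookupSketch {
    static final int RESULT_OK = 0;            // assumed status value
    static final int RESULT_NO_SUCH_HOST = 1;  // assumed status value

    // Resolve hostName and append the textual IP address to ipAddressBuffer.
    static int lookupIPAddress(String hostName, StringBuilder ipAddressBuffer) {
        try {
            InetAddress address = InetAddress.getByName(hostName);
            ipAddressBuffer.append(address.getHostAddress());
            return RESULT_OK;
        } catch (UnknownHostException e) {
            // The host could not be resolved; leave the buffer untouched.
            return RESULT_NO_SUCH_HOST;
        }
    }

    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder();
        int status = lookupIPAddress("127.0.0.1", buffer);
        System.out.println(status + " " + buffer);
    }
}
```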
      • makeRobotsKey

        protected static java.lang.String makeRobotsKey​(java.lang.String protocol,
                                                        java.lang.String hostName,
                                                        int port)
        Construct the robots key for a host. This is used to look up robots info in the database, and to form the corresponding event name.
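        As a rough illustration of the idea, a key of this kind could be built by normalizing and joining the three components. This page does not document the actual key layout, so the class name and the "protocol://host:port" format below are assumptions.

```java
import java.util.Locale;

// Hypothetical sketch; the key format actually used by the connector may differ.
class RobotsKeySketch {
    // One key per (protocol, host, port), so that robots data is shared by
    // every URL served from the same origin.
    static String makeRobotsKey(String protocol, String hostName, int port) {
        String p = protocol.toLowerCase(Locale.ROOT);
        String h = hostName.toLowerCase(Locale.ROOT);
        return p + "://" + h + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(makeRobotsKey("HTTP", "Example.COM", 80));
    }
}
```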
      • makeRobotsEventName

        protected java.lang.String makeRobotsEventName​(org.apache.manifoldcf.crawler.interfaces.INamingActivity versionActivities,
                                                       java.lang.String robotsKey)
        Construct a name for the global web-connector robots event.
      • checkFetchAllowed

        protected int checkFetchAllowed​(java.lang.String documentIdentifier,
                                        java.lang.String protocol,
                                        java.lang.String hostIPAddress,
                                        int port,
                                        PageCredentials credential,
                                        org.apache.manifoldcf.connectorcommon.interfaces.IKeystoreManager trustStore,
                                        java.lang.String hostName,
                                        java.lang.String[] binNames,
                                        long currentTime,
                                        java.lang.String pathString,
                                        org.apache.manifoldcf.crawler.interfaces.IProcessActivity versionActivities,
                                        int connectionLimit,
                                        java.lang.String proxyHost,
                                        int proxyPort,
                                        java.lang.String proxyAuthDomain,
                                        java.lang.String proxyAuthUsername,
                                        java.lang.String proxyAuthPassword)
                                 throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        Check robots to see if fetch is allowed.
        Returns:
        appropriate result status code.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
      • makeDocumentIdentifier

        protected java.lang.String makeDocumentIdentifier​(java.lang.String parentIdentifier,
                                                          java.lang.String rawURL,
                                                          WebcrawlerConnector.DocumentURLFilter filter,
                                                          org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities)
                                                   throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Convert an absolute or relative URL to a document identifier. This may involve several steps at some point, but right now it does NOT involve converting the host name to a canonical host name. (Doing so would destroy the ability of virtually hosted sites to do the right thing, since the original host name would be lost.) Thus, we do the conversion to IP address right before we actually fetch the document.
        Parameters:
        parentIdentifier - the identifier of the document in which the raw url was found, or null if none.
        rawURL - the starting, un-normalized, un-canonicalized URL.
        filter - the filter object, used to remove unmatching URLs.
        Returns:
        the canonical URL (the document identifier), or null if the url was illegal.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
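        The conversion described above can be sketched with java.net.URI: resolve the raw URL against its parent, normalize the path, drop the fragment, and leave the host name exactly as written. This is a simplified illustration under stated assumptions, not the connector's code: the class and method names are hypothetical, and the filter and activities arguments are left out.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of URL-to-identifier conversion.
class DocumentIdentifierSketch {
    // Returns an absolute URL string, or null if the URL is illegal.
    static String toIdentifier(String parentIdentifier, String rawURL) {
        try {
            URI uri = new URI(rawURL.trim());
            if (parentIdentifier != null) {
                // Relative references are resolved against the parent document.
                uri = new URI(parentIdentifier).resolve(uri);
            }
            uri = uri.normalize();
            if (!uri.isAbsolute() || uri.getHost() == null) {
                return null;
            }
            // Rebuild without the fragment. The host name is deliberately NOT
            // converted to a canonical name or IP address at this stage.
            StringBuilder sb = new StringBuilder();
            sb.append(uri.getScheme()).append("://").append(uri.getRawAuthority());
            sb.append(uri.getRawPath());
            if (uri.getRawQuery() != null) {
                sb.append('?').append(uri.getRawQuery());
            }
            return sb.toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toIdentifier("http://example.com/a/b.html", "../c.html#top"));
    }
}
```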
      • doCanonicalization

        protected java.lang.String doCanonicalization​(WebcrawlerConnector.DocumentURLFilter filter,
                                                      WebURL url)
                                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                                      java.net.URISyntaxException
        Code to canonicalize a URL. If the URL cannot be canonicalized (that is, it is illegal), return null.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        java.net.URISyntaxException
      • isContentInteresting

        protected boolean isContentInteresting​(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities,
                                               java.lang.String documentIdentifier,
                                               int response,
                                               java.lang.String contentType)
                                        throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption,
                                               org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Code to check if data is interesting, based on response code and content type.
        Throws:
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • documentIdentifiertoFileName

        protected java.lang.String documentIdentifiertoFileName​(java.lang.String documentIdentifier)
                                                         throws java.net.URISyntaxException
        Convert a document identifier to filename.
        Parameters:
        documentIdentifier - is the document identifier to convert.
        Throws:
        java.net.URISyntaxException
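        This page does not say how the file name is derived. One plausible approach, shown here purely as an assumption, is to parse the identifier as a URI and take the last segment of its path. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch; the connector's actual rule is not documented here.
class FileNameSketch {
    // Returns the last path segment, or "" if the path is empty or ends in '/'.
    static String toFileName(String documentIdentifier) throws URISyntaxException {
        String path = new URI(documentIdentifier).getPath();
        if (path == null || path.isEmpty() || path.endsWith("/")) {
            return "";
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(toFileName("http://example.com/docs/report.pdf?x=1"));
    }
}
```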
      • findRedirectionURI

        protected java.lang.String findRedirectionURI​(java.lang.String currentURI)
                                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Find a redirection URI, if it exists.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • findHTMLForm

        protected FormData findHTMLForm​(java.lang.String currentURI,
                                        LoginParameters lp)
                                 throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Find matching HTML form data, if present. Return null if not.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • findPreferredRedirectionURI

        protected java.lang.String findPreferredRedirectionURI​(java.lang.String currentURI,
                                                               LoginParameters lp)
                                                        throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Find a preferred redirection URI, if it exists.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • findSpecifiedContent

        protected java.lang.String findSpecifiedContent​(java.lang.String currentURI,
                                                        LoginParameters lp)
                                                 throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Find existence of specific content on the page (never finds a URL).
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • findHTMLLinkURI

        protected java.lang.String findHTMLLinkURI​(java.lang.String currentURI,
                                                   LoginParameters lp)
                                            throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Find HTML link URI, if present, making sure specified preference is matched.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • extractLinks

        protected boolean extractLinks​(java.lang.String documentIdentifier,
                                       org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                                       WebcrawlerConnector.DocumentURLFilter filter)
                                throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                       org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        Code to extract links from an already-fetched document.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
      • handleRedirects

        protected void handleRedirects​(java.lang.String documentURI,
                                       IRedirectionHandler handler)
                                throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Handle extracting the redirect link from a redirect response.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • handleXML

        protected void handleXML​(java.lang.String documentURI,
                                 IXMLHandler handler)
                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                 org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        Handle document references from XML. Right now we only understand RSS.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
      • handleHTML

        protected void handleHTML​(java.lang.String documentURI,
                                  IHTMLHandler handler)
                           throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Handle document references from HTML.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • isDocumentText

        protected boolean isDocumentText​(java.lang.String documentURI)
                                  throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Is the document text, as far as we can tell?
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • isText

        protected static boolean isText​(byte[] beginChunk,
                                        int chunkLength)
        Test to see if a document is text or not. The first n bytes are passed in, and this code returns "true" if it thinks they represent text. The code has been lifted algorithmically from products/Sharecrawler/Fingerprinter.pas, which was based on "perldoc -f -T".
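        A heuristic in the spirit of "perldoc -f -T" can be sketched as follows: a zero byte means binary, and otherwise the chunk counts as text unless more than a third of its bytes look strange. The one-third threshold, the definition of a strange byte, and the class and method names are assumptions for illustration. The sketch also treats every high-bit byte as strange, so it does not attempt the UTF-8 recognition that the real code may perform.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a text/binary sniffing heuristic.
class TextSniffSketch {
    // Control bytes (other than common whitespace), DEL, and high-bit bytes.
    static boolean isStrange(byte x) {
        int v = x & 0xFF;
        if (v == '\t' || v == '\n' || v == '\f' || v == '\r') {
            return false;
        }
        return v < 0x20 || v >= 0x7F;
    }

    // Examine the first chunkLength bytes of beginChunk.
    static boolean looksLikeText(byte[] beginChunk, int chunkLength) {
        int strange = 0;
        for (int i = 0; i < chunkLength; i++) {
            if (beginChunk[i] == 0) {
                return false; // any zero byte is treated as binary
            }
            if (isStrange(beginChunk[i])) {
                strange++;
            }
        }
        // Text unless more than one third of the bytes are strange.
        return strange * 3 <= chunkLength;
    }

    public static void main(String[] args) {
        byte[] sample = "Hello, world\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeText(sample, sample.length));
    }
}
```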
      • isStrange

        protected static boolean isStrange​(byte x)
        Check if character is not typical ASCII or UTF-8.
      • isWhiteSpace

        protected static boolean isWhiteSpace​(byte x)
        Check if a byte is a whitespace character.
      • stringToArray

        protected static java.util.List<java.lang.String> stringToArray​(java.lang.String input)
        Read a string as a sequence of individual expressions, urls, etc.
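        A minimal sketch of this kind of parsing, assuming one entry per line, with blank lines and lines starting with "#" skipped. Those rules, and the class and method names, are assumptions; this page does not specify the actual delimiter or comment handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real parsing rules are not documented here.
class StringToListSketch {
    static List<String> stringToList(String input) {
        List<String> entries = new ArrayList<>();
        for (String line : input.split("\\r?\\n")) {
            String trimmed = line.trim();
            // Assumed convention: ignore empty lines and '#' comments.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            entries.add(trimmed);
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(stringToList("http://a/\n\n# note\n  http://b/  \r\n"));
    }
}
```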
      • compileList

        protected static void compileList​(java.util.List<java.util.regex.Pattern> output,
                                          java.util.List<java.lang.String> input)
                                   throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Compile all regexp entries in the passed in list, and add them to the output list.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
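        The pattern of compiling into a caller-supplied output list can be sketched with java.util.regex. In this standalone illustration, IllegalArgumentException stands in for ManifoldCFException so that the example has no ManifoldCF dependency; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch; the real method throws ManifoldCFException instead.
class CompileListSketch {
    // Compile each expression in input and append the result to output.
    static void compileList(List<Pattern> output, List<String> input) {
        for (String expression : input) {
            try {
                output.add(Pattern.compile(expression));
            } catch (PatternSyntaxException e) {
                throw new IllegalArgumentException(
                    "Bad regular expression '" + expression + "': " + e.getDescription(), e);
            }
        }
    }

    public static void main(String[] args) {
        List<Pattern> patterns = new ArrayList<>();
        compileList(patterns, Arrays.asList(".*\\.pdf$", "^https?://"));
        System.out.println(patterns.size());
    }
}
```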
      • getPageCredential

        protected PageCredentials getPageCredential​(java.lang.String documentIdentifier)
        Get the page credentials for a given document identifier (URL).
      • getSequenceCredential

        protected SequenceCredentials getSequenceCredential​(java.lang.String documentIdentifier)
        Get the sequence credentials for a given document identifier (URL).
      • getTrustStore

        protected org.apache.manifoldcf.connectorcommon.interfaces.IKeystoreManager getTrustStore​(java.lang.String documentIdentifier)
                                                                                           throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Get the trust store for a given document identifier (URL).
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • getAcls

        protected static java.lang.String[] getAcls​(org.apache.manifoldcf.core.interfaces.Specification spec)
        Grab the forced ACLs out of the document specification.
        Parameters:
        spec - is the document specification.
        Returns:
        the acls.
      • findExcludedHeaders

        protected static java.util.Set<java.lang.String> findExcludedHeaders​(org.apache.manifoldcf.core.interfaces.Specification spec)
                                                                      throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Read a document specification to get the set of excluded headers.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • calculateDocumentEvents

        protected java.lang.String[] calculateDocumentEvents​(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
                                                             java.lang.String documentIdentifier)
        Calculate events that should be associated with a document.