  1. ManifoldCF
  2. CONNECTORS-16

JCIFS connector's document fingerprinting feature is not general enough


Details

    Description

      The JCIFS connector has a feature called "fingerprinting", which lets it classify documents according to whether the back end can index their content. At the moment, the fingerprinter recognizes PDFs, Microsoft Office files, and text files as indexable. One could imagine, though, that different SOLR plugins and the like might handle more than that. Other connectors could also benefit from similar technology, specifically any connector that deals with binary documents.

      One approach to solving this problem would be to remove the feature entirely and let whatever pipeline exists in SOLR determine indexability after the fact. The feature was added at MetaCarta, however, because it may be possible to exclude a useless document without fetching the whole thing, and (at least for MetaCarta clients) the number of gigantic unindexable files was a big concern.
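
      To illustrate why fingerprinting can avoid a full fetch, here is a minimal sketch that classifies a document from its first few bytes only. It is not the JCIFS connector's actual code: the class name, method names, and the 8-byte prefix length are all assumptions made for illustration. The only facts relied on are that PDF files begin with "%PDF-" and that legacy (OLE2) Microsoft Office files begin with the bytes D0 CF 11 E0 A1 B1 1A E1. The plain-text test is a crude heuristic, and newer ZIP-based Office formats are not handled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch; not the JCIFS connector's real fingerprinter.
class PrefixFingerprinter {
    // Assumed number of leading bytes to read; enough for both signatures below.
    static final int PREFIX_LENGTH = 8;

    // "%PDF-" header that starts every PDF file.
    static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};
    // OLE2 compound-file signature used by legacy .doc/.xls/.ppt files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Reads at most max bytes from the stream and returns only what was read.
    static byte[] readPrefix(InputStream in, int max) throws IOException {
        byte[] buf = new byte[max];
        int total = 0;
        while (total < max) {
            int n = in.read(buf, total, max - total);
            if (n < 0) {
                break; // end of stream before max bytes were available
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Crude heuristic: non-empty, and no control bytes other than tab, LF, CR.
    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        for (byte b : data) {
            int v = b & 0xFF;
            boolean whitespace = v == '\t' || v == '\n' || v == '\r';
            if (v < 0x20 && !whitespace) {
                return false;
            }
        }
        return true;
    }

    // True if the prefix looks like a PDF, a legacy Office file, or plain text.
    static boolean looksIndexable(byte[] prefix) {
        return startsWith(prefix, PDF_MAGIC)
            || startsWith(prefix, OLE2_MAGIC)
            || looksLikeText(prefix);
    }
}
```

      A caller would open the remote file, pass the stream to readPrefix with PREFIX_LENGTH, and go on to fetch the rest only when looksIndexable returns true, so a gigantic unindexable file costs only a few bytes of transfer.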

      Another approach might be to tie the functionality to the output connector interface, so that an output connector would (somehow) determine whether a document is applicable. This would take some care, so that fingerprinting remains possible without downloading the entire document, but it would otherwise have the correct overall structure.
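
      One possible shape for this second approach is sketched below. Everything here is hypothetical: IndexabilityCheck, PdfOnlyOutput, and RepositoryFetcher are invented names and are not part of the ManifoldCF API. The idea is that the output connector states how many leading bytes it needs and then judges the document from that prefix alone, so the repository connector never has to download the whole file just to ask the question.

```java
import java.util.Arrays;

// Hypothetical hook an output connector could implement; not a ManifoldCF interface.
interface IndexabilityCheck {
    // Number of leading bytes the output connector needs in order to decide.
    int prefixBytesNeeded();

    // Decides from the prefix alone whether the document is worth sending.
    boolean acceptsPrefix(byte[] prefix);
}

// Example output-side implementation that only accepts PDF documents.
class PdfOnlyOutput implements IndexabilityCheck {
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};

    @Override
    public int prefixBytesNeeded() {
        return PDF_MAGIC.length;
    }

    @Override
    public boolean acceptsPrefix(byte[] prefix) {
        if (prefix.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(prefix, PDF_MAGIC.length), PDF_MAGIC);
    }
}

// Repository-side helper: hands over only as many bytes as the check asked for.
class RepositoryFetcher {
    static boolean shouldFetchFully(byte[] documentStart, IndexabilityCheck check) {
        int n = Math.min(check.prefixBytesNeeded(), documentStart.length);
        return check.acceptsPrefix(Arrays.copyOf(documentStart, n));
    }
}
```

      With this split, the repository connector stays ignorant of file formats, and a more capable back end could accept more document types simply by supplying a different IndexabilityCheck.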

          People

            kwright@metacarta.com Karl Wright
