- add metadata extraction to msooxml.c

- add metadata extraction to koffice.c

- create a speed comparison script between pdftotext and texterize
  using the 1000 pdf files which i downloaded.

- add a test case for device files, pipes, and other nonstandard files,
  as well as files for which we don't have permission, etc., to see if
  texterize and the library itself handle all such cases well.

- create a memory stream: a stream which only reads and seeks within a
  buffer that is supplied at openfile time and can be modified from the
  outside. This can be used to write unit tests.
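  A minimal sketch of what such a stream could look like (the struct and
  function names are illustrative only, not the real arfile/arstream API):

      #include <stdio.h>
      #include <string.h>

      /* reads and seeks only within a caller-supplied buffer; the caller
       * keeps ownership and may change the contents between calls */
      typedef struct {
          const unsigned char *buf;
          size_t size;
          size_t pos;
      } MemStream;

      static size_t memstream_read(MemStream *m, void *out, size_t n)
      {
          size_t left = m->size - m->pos;
          if (n > left)
              n = left;
          memcpy(out, m->buf + m->pos, n);
          m->pos += n;
          return n;
      }

      static int memstream_seek(MemStream *m, long offset, int whence)
      {
          long base = (whence == SEEK_SET) ? 0 :
                      (whence == SEEK_CUR) ? (long)m->pos : (long)m->size;
          if (base + offset < 0 || (size_t)(base + offset) > m->size)
              return -1;
          m->pos = (size_t)(base + offset);
          return 0;
      }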

- add a converter for Lotus Word Pro and AmiWord

- find mswrite documents with ole objects

- figure out if we need to handle mac encoded word files specially;
  does that happen at all?

- fix text extraction of russian (KOI8?) text from
  RolSredstv30_01_02.doc
  OpenOffice and texterize get it wrong but Abiword and Google do it
  right.

- run a conversion of a big (known) set of files and check with gcov
  which code does not get run.

- create a script which searches google for xls/doc/pdf/rtf, etc.,
  downloads those, and then runs texterize on them.
  It is important to keep the downloaded files.

- handle user-defined metadata from ole files
  We only parse the SummaryInformation stream now. Also parse the
  DocumentSummaryInformation stream.

- handle extraction from Form XObjects in pdf, see U20D0.pdf
  (see problems/readme for more info)

- Fix speed of olepack for larger files. Maybe use profiling
  to find out what makes it so slow.

- Add tests to arlib_test.sh
    * (maybe) add test for sparse files
    * add test with multiple "increasing value" files in one archive
      check streaming two streams at once (this will fail initially)
    * test the generated filenames by the archive layers
    * add test which opens an arstream directly without recurseArchive()
      (using openstream() or a new function still to be developed which
      can take in those URIs we generate).
    * add test for fseek to a negative offset (before start of file)
      (this will fail for gzip at least)

- Check out what the text-master and text-web documents (opendocument)
  are about.

- maybe the default output should not be utf-8 but the current locale

- extend html metadata support (only title is supported now).

- add a test for proper html text extraction (easy)

- add a test for charset detection code (easy)

- find a big-endian encoded tiff/jpeg (with exif) file (maybe create one
  on the mac?) and test that everything still works ok.

- create tiff.c and tiff.h and put the tiff extraction code from jpeg there

- find out what is in jfif and whether there are tags there.
  Find out what gimp does with its comment when saving a jpeg file.

- make a php library of this
  PHP wrappers to be able to write the same examples (or different ones)
  using this PHP API. The best way to test this would be to make a php
  page which runs the library function(s) after someone uploads a file
  and then displays the plain text (or xml) (it would also be nice to
  show the metadata), or even better just dumps the xml and hooks up
  some css/xsl (whichever is what we need) to display it nicely.

  This would also make sure we get some testing documents, gives the
  wrapper some testing, and makes sure i keep it up to date.

  Also make a version of this which uses the latest version of the code
  (people should normally use the stable one of course).

- perl wrapper (yikes, maybe do python first to promote the better language)

- python, ruby, etc wrappers (much later)

- make the --charset option use a wrapper around g_iconv which has extra
  handlers for ligature characters like "fi" (used in PDFReference.pdf).
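  Roughly the idea, as a pass over the UTF-8 text before g_iconv converts
  it; the table contents and the function name are just illustrative:

      #include <glib.h>
      #include <string.h>

      /* ligatures that plain g_iconv cannot always express in the target
       * charset, expanded to their ASCII spellings beforehand */
      static const struct { const char *from; const char *to; } lig_map[] = {
          { "\xEF\xAC\x80", "ff"  },   /* U+FB00 */
          { "\xEF\xAC\x81", "fi"  },   /* U+FB01 */
          { "\xEF\xAC\x82", "fl"  },   /* U+FB02 */
          { "\xEF\xAC\x83", "ffi" },   /* U+FB03 */
          { "\xEF\xAC\x84", "ffl" },   /* U+FB04 */
      };

      static gchar *expand_ligatures(const gchar *utf8)
      {
          GString *out = g_string_new(NULL);
          const gchar *p = utf8;
          while (*p) {
              gboolean matched = FALSE;
              gsize i;
              for (i = 0; i < G_N_ELEMENTS(lig_map); i++) {
                  gsize len = strlen(lig_map[i].from);
                  if (strncmp(p, lig_map[i].from, len) == 0) {
                      g_string_append(out, lig_map[i].to);
                      p += len;
                      matched = TRUE;
                      break;
                  }
              }
              if (!matched)
                  g_string_append_c(out, *p++);
          }
          return g_string_free(out, FALSE);
      }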

- make the output functions expand \u2028 (line separator) and \u2029
  (paragraph separator). In the output we need extra newlines for
  readability.
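  For UTF-8 output this amounts to replacing the three-byte sequences for
  U+2028/U+2029; a small sketch (function name is illustrative):

      #include <glib.h>
      #include <string.h>

      /* replace U+2028 (line separator) with "\n" and U+2029 (paragraph
       * separator) with "\n\n" in a UTF-8 buffer before writing it out */
      static gchar *expand_separators(const gchar *utf8)
      {
          GString *out = g_string_new(NULL);
          const gchar *p = utf8;
          while (*p) {
              if (strncmp(p, "\xE2\x80\xA8", 3) == 0) {        /* U+2028 */
                  g_string_append_c(out, '\n');
                  p += 3;
              } else if (strncmp(p, "\xE2\x80\xA9", 3) == 0) { /* U+2029 */
                  g_string_append(out, "\n\n");
                  p += 3;
              } else {
                  g_string_append_c(out, *p++);
              }
          }
          return g_string_free(out, FALSE);
      }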

- add checksum testing to arlib for the formats which have this
  (zip/tar/gz at least). This would make us more confident that arlib
  is working properly.
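  For tar this is cheap: the header stores an octal byte sum of the
  512-byte header with the checksum field itself counted as spaces. For
  zip/gz the stored CRC-32 could be compared against zlib's crc32() over
  the uncompressed data. A tar sketch:

      #include <stdlib.h>

      /* verify the checksum of a 512-byte tar header block: the stored
       * value is the octal byte sum of the header with the 8-byte
       * checksum field (offset 148) counted as spaces */
      static int tar_checksum_ok(const unsigned char header[512])
      {
          unsigned long sum = 0;
          int i;
          for (i = 0; i < 512; i++)
              sum += (i >= 148 && i < 156) ? ' ' : header[i];
          return sum == strtoul((const char *)&header[148], NULL, 8);
      }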

- create a proper testset of documents
  only include documents which can be shared with the rest of the world

- fix recognition of staroffice files (.sxw and others)
  some work (OO1.1) with the mimetype in the beginning of the zipfile
  but most (OO1.0) don't

- look into ID3v2 TCON tag. If it is the same as genre (probably) we
  should check if the content is a number and resolve it through the
  ID3v1 genre list.
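  A sketch of the numeric resolution; only the first entries of the ID3v1
  genre table are listed here:

      #include <stdlib.h>

      /* start of the ID3v1 genre list, index = genre number
       * (the full list has roughly 80 more entries) */
      static const char *id3v1_genres[] = {
          "Blues", "Classic Rock", "Country", "Dance", "Disco", "Funk",
          "Grunge", "Hip-Hop", "Jazz", "Metal", "New Age", "Oldies",
          "Other", "Pop", "R&B", "Rap", "Reggae", "Rock", "Techno",
          "Industrial",
      };

      /* resolve a TCON value like "(17)" or "17" to a genre name;
       * anything that is not a plain number is returned unchanged */
      static const char *resolve_tcon(const char *tcon)
      {
          const char *p = (*tcon == '(') ? tcon + 1 : tcon;
          char *end;
          unsigned long n = strtoul(p, &end, 10);
          if (end == p || (*end != '\0' && *end != ')'))
              return tcon;
          if (n < sizeof(id3v1_genres) / sizeof(id3v1_genres[0]))
              return id3v1_genres[n];
          return tcon;
      }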

- add AC_FUNC_STRNLEN to automake/autoconf scripts
  (see
  http://www.gnu.org/software/autoconf/manual/html_node/Particular-Functions.html
  for details).
  maybe needs an AC_LIBOBJ replacement as explained in
  http://www.gnu.org/software/autoconf/manual/html_node/Generic-Functions.html
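  If a replacement turns out to be needed, the fallback function itself
  is tiny; something like this in a file picked up via AC_LIBOBJ (whether
  we need it at all depends on what autoconf finds):

      #include <stddef.h>

      /* fallback strnlen for platforms that lack it: like strlen, but
       * never looks past maxlen bytes */
      size_t strnlen(const char *s, size_t maxlen)
      {
          size_t i;
          for (i = 0; i < maxlen && s[i] != '\0'; i++)
              ;
          return i;
      }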

- add wp6 handling of prefix packet type 6 (0x6)
  This sets the character mapping. It also refers to MAP files. Find out
  how to handle MAP files as well.

- attempt to map existing meta data items to dublin core. This might include
  a change to the cb_meta callback to add a scheme. Also the output of cb_meta
  in texterize should be changed to the dublin core format.
  An example of how to do this is at dc-dot
  (http://www.ukoln.ac.uk/metadata/dcdot/)
  It also takes msword doc urls
  (e.g. http://ech.eastkingdom.org/ILoIs/2003-Nov-10/2003-Nov-10-ILoI.pdf)
  Don't forget that there is more than the iso part of the spec and there
  are some more dc elements which we can use here. Look at
  http://dublincore.org/documents/dcmi-terms/
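  The mapping could start out as a plain lookup table; the left-hand
  names below are guesses at what the extractors pass to cb_meta, not
  the exact strings:

      /* first stab at mapping extractor meta names to Dublin Core; the
       * left-hand names are assumptions, not the exact strings passed
       * to cb_meta today */
      static const struct { const char *ours; const char *dc; } dc_map[] = {
          { "title",        "dc.title"         },
          { "author",       "dc.creator"       },
          { "subject",      "dc.subject"       },
          { "keywords",     "dc.subject"       },
          { "comments",     "dc.description"   },
          { "creationdate", "dcterms.created"  },
          { "moddate",      "dcterms.modified" },
          { "language",     "dc.language"      },
      };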

- create a wordperfect testfile with user-defined document summary items
  I don't have a testfile where name is set. (search wordperfect.c for
  name_char and fix the code to properly handle it and pass the name
  to cb_meta).

- implement msword/msexcel meta data extraction for non-ole files
  (check which of the two (ole/file itself) is more accurate)

PDF items:
- remove '-' at the end of a line when the next line contains the
  continuation of the word. The whole word should go on this line (or
  the next, whatever).
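  Roughly: if a line ends in '-' and the next line starts with a letter,
  drop the '-' and glue the halves together. A sketch over two
  already-extracted lines (naive: it would also join words that
  legitimately end in a hyphen):

      #include <glib.h>
      #include <string.h>

      /* if the previous line ends in '-' and the next line starts with a
       * letter, drop the '-' and the line break so the word ends up whole */
      static gchar *join_hyphenated(const gchar *prev, const gchar *next)
      {
          size_t len = strlen(prev);
          if (len > 1 && prev[len - 1] == '-' && g_ascii_isalpha(next[0]))
              return g_strdup_printf("%.*s%s", (int)(len - 1), prev, next);
          return g_strdup_printf("%s\n%s", prev, next);
      }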

- vertical writing mode (lots of work i guess)

- check pdf_parse_cid_font(). Need to properly handle Type2 CID fonts
  with the truetype font embedded and a CIDToGidMap (which needs to
  be parsed).
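  The CIDToGIDMap stream itself is easy once decoded: two bytes per CID,
  big-endian glyph index, so the lookup is just:

      #include <stddef.h>

      /* look up the glyph index for a CID in a decoded CIDToGIDMap
       * stream: an array of 2-byte big-endian glyph indices, indexed
       * by CID */
      static unsigned int cid_to_gid(const unsigned char *map, size_t maplen,
                                     unsigned int cid)
      {
          if ((size_t)cid * 2 + 1 >= maplen)
              return 0;   /* out of range: glyph 0 (.notdef) */
          return ((unsigned int)map[cid * 2] << 8) | map[cid * 2 + 1];
      }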

- make pdf_parse_cid_font() able to use predefined orderings.  These are
  specified as "/Ordering (Japan1)" or something similar.  For these we
  could use the poppler-data package. Look into why this is distributed
  separately. I want to build this into the static binary if possible.
  If this is not possible the whole thing should be optional (and then
  the program would have to read separate files from
  /usr/local/share/texterize or something).


===================DONE===================================================
- options handling in texterize (and maybe unpack as well)

- proper filenames in archive streams (uri) + fix 0Table/1Table handling
  to use this filename. This fixes a bug where a WordDocument is placed
  in an objectpool subdir and the Table file as well (currently it tries
  to open the toplevel Table for this WordDocument).

- output formatting

- fix ole reader to return a short item count and not 0 when not
  everything that is asked for is available.
  This problem is partly situated in the ole_stream_read_bb() and
  ole_stream_read_sb() functions.
  We can work around this by trying it once as a whole (for speed) and,
  if that fails, trying it one member at a time up to nmemb times and
  returning how many members succeeded.
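  The workaround, roughly (the olefile function names below are
  illustrative, not the real ones):

      #include <stdio.h>

      /* hypothetical olefile primitives, not the actual API names */
      typedef struct OleStream OleStream;
      extern size_t ole_read(OleStream *s, void *buf, size_t size, size_t nmemb);
      extern int    ole_seek(OleStream *s, long offset, int whence);
      extern long   ole_tell(OleStream *s);

      /* try the whole read first (fast path); if that does not deliver
       * everything, seek back and read one member at a time, returning
       * how many members actually arrived */
      static size_t robust_read(OleStream *s, void *buf, size_t size, size_t nmemb)
      {
          long start = ole_tell(s);
          size_t i;
          if (ole_read(s, buf, size, nmemb) == nmemb)
              return nmemb;
          ole_seek(s, start, SEEK_SET);
          for (i = 0; i < nmemb; i++)
              if (ole_read(s, (char *)buf + i * size, size, 1) != 1)
                  break;
          return i;
      }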

- make the archive streams produce a proper filename (uri)

- make a good file test for the file wrappers with:
  files/small  (< 64K file)
  files/small.gz
  files/small.tar.gz

  files/big (> 64K file, i.e. 1MB file)
  files/big.bz
  files/big.tar.gz

  files/huge (> 2GB file)
  files/huge.gz
  files/huge.tar.gz

  (the huge one is for later as are the .tar.gz files)

  the test should use and test all functions which arfile/arstream has.

  The files should be generated by a small program which i will make.
  This program will generate files which contain integers with the
  position in the file. This way we can seek to a position and read the
  integer there and see if it corresponds to where we think we are in
  the file. This makes verification much easier.
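  The generator boils down to something like this (32-bit values assumed;
  the real program would also have to pick a fixed byte order so the
  files stay portable):

      #include <stdio.h>
      #include <stdint.h>
      #include <stdlib.h>

      /* write `count` 32-bit integers to stdout; each integer holds its
       * own byte offset, so after any seek the value read back tells us
       * whether we are where we think we are */
      int main(int argc, char **argv)
      {
          uint32_t i, count = (argc > 1) ? (uint32_t)atoi(argv[1]) : 16384;
          for (i = 0; i < count; i++) {
              uint32_t value = i * (uint32_t)sizeof(uint32_t);
              fwrite(&value, sizeof(value), 1, stdout);
          }
          return 0;
      }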

- replace g_string_sprintf by g_string_printf
  (g_string_sprintf is deprecated since glib version ...)

- Fix ole fseek beyond end of file. Normal fseek() supports this
  and libole2 should support this as well.
  Or work around it in olefile.c (this is probably best because
  modifying libole2 is asking for trouble).

- Fix gzip fseek with whence==SEEK_END (currently unsupported)

- Add tests to arlib_test.sh
    * add seeking from end (SEEK_END) test (will fail for gz files)
    * add test with multiple layers of archiving (e.g. .ole.zip.tar.gz.zip)
    * add the "increasing value file" to baselib/magic.mime
      and test that it's detected properly in arlib_test.c
    * add test for large files (>2GB, >4GB, >8GB)
      which should be tested once in a while (not always because
      it takes long to test for this)
    * make sure that the file is big enough to do the tests in arlib_test.c

- find the html character set (encoding) and pass this to libxml2
  We should use ascmagic.c from file-4.17 (or later) for this. Maybe we
  can copy the file into baselib or ext and make it work for this
  purpose.  Later it could even be extended to recognize more.

- add support for other opendocument formats (presentation and
  spreadsheet are the most important)

- find out the endianness of jpeg files (i am not sure but i suspect
  that it might be big endian because it's an ITU (telecom) standard).
  It is Big Endian (the spec says "the most significant byte comes first").

- Fix bug with one time -v argument to texterize (it stops instantly)

- check if it's better to accumulate the text in pdf-text.c:write_string()
  or to actually call the callback cb_text() directly (which it does now).
  (doing accumulate per page now).

- pdf encryption support

- use BaseFont and map to pdf-encodings.c builtin_uni_fonts table

- pdf: parse Differences in font dictionary / encoding

- pdf: fix charname_to_charno() when unimap is defined

- fix pdfGetEntity() to support nested arrays and ']' in strings embedded
  in TJ arrays (currently breaks, old code and new nested-array code).
  Use files/PDFReference.pdf to test.

- hexstring instead of string for printing
  (see files/terrorismekrantburojansen\&janssen.pdf)

- multibyte fonts (see files/terrorismekrantburojansen\&janssen.pdf)

- parse embedded fontfiles (but only when there is no ToUnicode cmap and
  no Differences)

- finish truetype font parser

- allow opening two (or more) streams within the same arfile and
  alternating reading from them. This means we have to restore the
  file position in the upper stream when switching reading between
  the two streams.
  This is needed for msword files (at least) because we have to open
  the WordDocument and the 0Table (or 1Table) stream from the same
  ole file.

- handle wordperfect files inside ole files
  [worked without problems]

- add pdf meta data extraction

- pdf 1.6 specific features (xref streams, etc)

- find out what goes wrong with PCP142.doc
  it looks like it is endless looping in the arlib/olefile.c
  (and allocating a lot of memory in the process!!!)
  [FIXED]

- make a C library version
  The most important thing is one or two .h files which define the API;
  then we at least have an API which we can stick to. And we can try to
  use it from texterize and other test programs to show what can be done
  with it.

- microsoft word metadata extraction (through DocumentSummary)
  [Added metadata extraction for word/excel/powerpoint now]

- fix edittime in msole meta data extraction
  display as hours/minutes/seconds i guess, not as a date
  [FIXED by printing in the ISO8601 duration format]

- msole-summary.c: find out why the length of a section is sometimes 4
                   bytes off, my parser trips over this. (search for
                   "HACK", the -4 after that should be unneeded).
  [handled by reading what is there and fixing parse() to be able to cope]

- handle codepage != 1252 in msole-summary.c
  I have seen codepages of 949 (CP949 korean), 950 (CP950 Chinese Big5)
  and 1250 (CP1250 Latin2 Central European)
  [fixed by using the codepage to convert strings to Utf8 now]

- check tarfile_nextfile() in tarfile.c
  it doesn't work when used directly. I think it skips files because
  it accidentally already read the file header when processing the
  previous file and then it loses track.
  [fixed]

- handle non-standard (non-ustar) tar files (example is Core14_AFMs.tar)
  Other examples can be created by using the option -o on the tar command
  line. Really awkward is the fact that 'make dist' does this so a
  run of ./txtools/texterize texterize-0.0.8.tar.gz doesn't work (extracts
  only the Core14_AFMs.tar which is in there).
  [fixed]

- handle the pdf date format and format it as ISO-8601 before calling
  cb_meta on it
  [fixed]
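  The conversion is mostly reordering the digits of the D:YYYYMMDDHHmmSS
  prefix; a minimal sketch (ignoring the timezone suffix, which the next
  item covers):

      #include <stdio.h>
      #include <string.h>

      /* turn a PDF date like "D:20031110143000" into ISO-8601
       * "2003-11-10T14:30:00"; missing trailing fields get defaults */
      static void pdf_date_to_iso8601(const char *pdf, char *out, size_t outlen)
      {
          int Y = 0, M = 1, D = 1, h = 0, m = 0, s = 0;
          if (strncmp(pdf, "D:", 2) == 0)
              pdf += 2;
          sscanf(pdf, "%4d%2d%2d%2d%2d%2d", &Y, &M, &D, &h, &m, &s);
          snprintf(out, outlen, "%04d-%02d-%02dT%02d:%02d:%02d",
                   Y, M, D, h, m, s);
      }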

- use timezone information when printing the pdf ModDate and CreationDate
  meta fields.
  [fixed]

- add code to msword.c to handle HYPERLINK/EMBED, etc tags
  read the spec to figure out how this should be handled and what
  other things to expect
  [fixed by skipping the first part of a "field", codes 19,20 and 21]

- investigate inflateEnd() problems
  extracting postfix.zip misses master.cf now since 0.0.9 (find out why)
  [fixed by adding MAGIC_NO_CHECK_TOKENS to magic_open flags.
   The master.cf was identified as text/x-pascal before]

- handle comments in pdf properly (see PDFReference16.pdf paragraph 3.1.2)
  test with /mnt/windows/any2xml-testfiles/pdf/PCP133.doc.pdf (meta data
  extraction triggers it).
  [fixed]

- find out why Yugoslavia_UN_757.pdf and Approvedtanks-WebsiteList.pdf
  give random data (last character(s)) in the metadata
  [fixed by better checking for length in convertDocEncodingString]

- see if SEARCH_TF_BEFORE_BT influences performance badly
  [not really, the small differences i see are within the measurement error]

- fix problem with "get_biff_string(): not enough data" which causes
  us to miss quite a few strings from the SST table.
  [fixed, we used len where we should have used len*2 in a calculation]

- handle codepage -535 (65001 unsigned) which is utf-8 i think
  also handle some other codepages (1200=UTF16_LE 1201=UTF16_BE)
  Maybe nice to create a function to convert code page numbers to
  character set names (strings). This function should be defined
  in baselib i think.
  [fixed, added that function CodepageToCharset() in baselib/codepage.c]

- implement rtf text and metadata extraction
  [done, basic rtf text and metadata extraction is working now]

- check that libxml parses
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">
  and takes the charset specification.
  If it doesn't (which i suspect) txlib/html.c needs to check if the first
  4K contains one of these. We already read the first 4K to guess
  a character set (guess_charset() does this based on occurring
  characters).
  [added the function get_html_charset() in html.c which handles this]

- fix problem with SP-599-05-04E.pdf on macosx (Bus Error)
  It looks like it gets Length from a stream dictionary but the entry
  returned is NULL and we don't check for that (easy to fix).
  The scary part however is that this does not happen on linux, why?
  [fixed by using object generation numbers in pdf files, big change]
  [fixed some more by not adding duplicate xref entries]

- use generation number for pdf objects to make SP-599-05-04E.pdf work
  (see problems/readme for more info)
  [fixed, see previous item]

- remove all instances of MSG_FATAL in things which end up in the
  library. Best to remove the definition of MSG_FATAL; that way
  we are sure this does not happen.
  [fixed]

- map codes between 128 and 160 in msword.c (default case in the switch)
  [fixed by using toUtf8 at the right moment]

- remove CFLAGS modifications from configure.ac. Make a script which
  calls configure with these flags so that configure.ac doesn't need
  to be changed every time.
  [fixed by putting the CFLAGS in devel-configure.sh]

- create a fault injection stream.
  Cool to do long test runs with faults occurring in the data to test if
  all the extractors can handle all cases of messed up files.
  [added the fuzzstream]

- Don't create temporary file in current working directory but in
  /tmp or other system-wide temporary file directory. Maybe
  glib has a function for this.
  [using g_get_tmp_dir() from glib for this now]

- add copyright for 2008
  [done]

- add a converter for MS Write documents
  [done]

- check 'man 3 iconv_open' and read about the //TRANSLIT option,
  that just sounds great.
  [it does and it works already from the command line of texterize]

- find mswrite documents with graphics in them
  [see files/wri/filtertest.wri found on
   http://websvn.kde.org/trunk/tests/kofficetests/documents/import/mswrite/filtertest.wri]

- add a converter for KWord files (at least 1.2 and 1.3, 1.1 is a bit weird)
  [done, >= 1.2 is supported now. Very simple though]
