TODO.txt

ToDo:

Immediate tasks:
- Change file access to read the source file sequentially (without a File object) and map the file into memory
  only on a successful match. This will result in lots of mappings, so it should only be done for files > 1 GB or so.
  For a REAL solution, see "Major changes" below.
- Make MPEG Ripper correctly identify and preserve ID3v2 tags (currently they are stripped off)
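
A minimal sketch for the ID3v2 item above: before the MPEG ripper looks for the first audio frame,
it could check for an ID3v2 header and keep the whole tag in the extracted file. The function name
id3v2_tag_size is hypothetical and not part of the current sources. The layout used here (the "ID3"
marker, two version bytes, one flags byte, a 4-byte "synchsafe" size, and an optional 10-byte footer
in v2.4) follows the ID3v2 specification.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the existing ripper sources.
// Returns the total size in bytes of an ID3v2 tag starting at p
// (10-byte header + tag body + optional 10-byte footer), or 0 if the
// bytes at p do not look like a valid ID3v2 header.
std::size_t id3v2_tag_size(const unsigned char* p, std::size_t avail)
{
    if (avail < 10) return 0;                          // header is 10 bytes
    if (p[0] != 'I' || p[1] != 'D' || p[2] != '3') return 0;
    if (p[3] == 0xFF || p[4] == 0xFF) return 0;        // version bytes are never 0xFF

    // The size field is a "synchsafe" integer: 4 bytes with the top bit
    // of each byte clear, giving 28 significant bits.
    if ((p[6] | p[7] | p[8] | p[9]) & 0x80) return 0;
    std::size_t body = (static_cast<std::size_t>(p[6]) << 21)
                     | (static_cast<std::size_t>(p[7]) << 14)
                     | (static_cast<std::size_t>(p[8]) << 7)
                     |  static_cast<std::size_t>(p[9]);

    // ID3v2.4 may append a 10-byte footer, signalled by flag bit 0x10.
    std::size_t footer = (p[3] >= 4 && (p[5] & 0x10)) ? 10 : 0;
    return 10 + body + footer;
}
```

If the function returns a non-zero size, the ripper would start the output file at the tag and
expect the first MPEG frame right after it. This does not cover ID3v1 tags at the end of a file.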

Small Changes:
- Optionally write a .log file with all (hex-)offsets of the found files, and with debug info of the modules

Major changes:
- Remove/change "class File" so it does not map the whole file into memory at once.
  The input file is read bytewise (block-buffered) and not memory-mapped as a whole.
  Only on a possible match is the file mapped into memory and the corresponding
  ripper module invoked. The mapping can start where the pattern is (optionally
  X bytes earlier; this offset is already specified in the header and is needed for MOD files, for example).
  Need to check whether mapping and unmapping the file thousands of times in a row is too slow.
  If it turns out to be too slow, the mappings could be reused in a high-watermark-based scheme:
  on the first HIT, map as much of the file as possible into memory, starting from
  current_position - header_offset, and mapping MIN(filesize, 2 GB). Then, on a later HIT, check
  whether the mapping does not reach the end of the file AND the current position is about 50% into the
  mapping (for example, 1 GB into the mapping for a really huge file). If so, remap with the new
  current_position - header_offset as base.
  We also need to check on each HIT whether current_position - header_offset lies before the start of the
  currently mapped file portion (which can happen if header_offset is > 0), and remap in that case too.
- Make the program 64-bit aware (by changing hardcoded assumptions about sizeof(short) and sizeof(int))
- Make the program big-endian aware (by wrapping all data accessors in wrapper functions)
- Change the Aho-Corasick algorithm from object-based to table-based, where all "pointers" to the next state
  are simply indices into the table instead of "real" pointers. Should be a bit faster (esp. for bigger graphs)
  because of data locality
- Maybe: change the current system of "strong" or "weak" hit to a continuous one, where each ripper module
  awards "points" for each header-feature it tests. +1 or -1, depending on whether the feature indicates a
  good or "bad/weak" file. So a bit-depth of 8/15/16/24/32 bit would award a "+1", while a bit-depth of 13 bits
  would award "-1" (or even a few "-1"'s). Since we only do +1/-1 in each step we can later calculate a percentage
  of hits vs. misses, and we could flag e.g. all files with <80% confidence as "weak" and reject all with <40% 
  confidence.
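
The "points" idea in the last item could look roughly like the sketch below. All names (Confidence,
Verdict, award) are hypothetical and do not exist in the current sources. The 80% and 40% thresholds
are the example values from the item above. Treating a module that tested nothing as 0% confidence
is an assumption made here.

```cpp
// Hypothetical sketch, not part of the existing ripper sources.
enum class Verdict { Reject, Weak, Strong };

class Confidence {
public:
    // Call once per tested header feature: true = +1, false = -1.
    void award(bool good)
    {
        if (good) ++hits_;
        else      ++misses_;
    }

    // Percentage of hits among all tested features (0 if nothing was tested).
    int percent() const
    {
        const int total = hits_ + misses_;
        return total == 0 ? 0 : (100 * hits_) / total;
    }

    // Below 40% the file is rejected, below 80% it is flagged as weak.
    Verdict verdict() const
    {
        const int p = percent();
        if (p < 40) return Verdict::Reject;
        if (p < 80) return Verdict::Weak;
        return Verdict::Strong;
    }

private:
    int hits_ = 0;
    int misses_ = 0;
};
```

A ripper module would call award() once per feature, e.g. award(true) for a bit-depth of 16 and
award(false) (possibly several times) for a bit-depth of 13, and return verdict() at the end.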

New Ripper Modules:
- Microsoft DOC/XLS/... (this is a special chunked file format unofficially called LAOLA which is used by all office programs and some other programs too)
- PNG (should be easy)
- OGM (should be straightforward)
- FLI/FLC should be straightforward but not 100% foolproof (false positives?)
- more MODULE formats (XM, IT, S3M, ULT, DMF, MTM, NST, STM, ...)
- ZIP files? (detect from the beginning and walk from entry to entry, like pkzipfix)
- TIFF, TGA detection
- Maybe EXE detection? At least MZ/NE/PE headers, probably LE/LX/P3 headers for old games?

Better detection:
- JPEG (see source code comments)
- GIF needs more testing (esp. with images generated by FractInt for example, as they use some undocumented extensions)
- MOD needs testing as well (I don't have *that* many different MOD file formats here). There seems to be an issue with some(?) 6CHN modules which are detected too short (from 6 missing bytes to over 7k of missing data)
- some more 669 testing (especially extended-669's)
- more IFF formats/subtypes?
- Test IFF extraction with some PBM IFF images (I have some of them *somewhere*), instead of the common ILBM types
- PDF needs more testing to make sure it's foolproof (esp. PDF-1.6)
- DDS needs some more testing, I think it's a bit broken, but I don't have adequate tools to check
- MP3 needs testing (VBR files at least; Layer I/II files and MPEG-2/2.5 files too, although those should currently be skipped)