Parser Implementation Plan 

Format  PlainText  DOMParser  SpreadsheetParser  DocumentParser  BusinessCardsParser
.txt    0.5        -          -                  -               -
.xml    0.5        1.1        -                  -               -
.csv    0.5        -          0.5                -               -
.xls    1.2        -          0.5                -               -
.xlsx   1.2        -          0.5                -               -
.docx   1.1        -          -                  0.5             -
.rtf    0.7        1.2        -                  -               -
.eml    1.1        -          -                  0.5             -
.htm    0.5        0.8        -                  -               -
.html   0.5        0.8        -                  -               -
.pdf    1.0        -          -                  1.1             -
.vcf    1.1        -          -                  -               0.6
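
The table above can also be read as a lookup from file extension to the version in which each parser kind is planned. The following is a minimal sketch of that reading in Python, not Toxy's actual API: the names PARSER_PLAN and parsers_available are hypothetical, and only a few rows of the table are transcribed.

```python
# Hypothetical transcription of a few rows of the plan table.
# Keys are file extensions; values map a parser kind to the Toxy
# version in which it is planned. None of these names come from Toxy.
PARSER_PLAN = {
    ".txt": {"PlainText": "0.5"},
    ".xml": {"PlainText": "0.5", "DOMParser": "1.1"},
    ".csv": {"PlainText": "0.5", "SpreadsheetParser": "0.5"},
    ".pdf": {"PlainText": "1.0", "DocumentParser": "1.1"},
    ".vcf": {"PlainText": "1.1", "BusinessCardsParser": "0.6"},
}


def _version_key(version):
    """Turn a dotted version such as '1.1' into a comparable tuple (1, 1)."""
    return tuple(int(part) for part in version.split("."))


def parsers_available(extension, toxy_version):
    """Return the parser kinds planned for `extension` at or before `toxy_version`."""
    plan = PARSER_PLAN.get(extension.lower(), {})
    target = _version_key(toxy_version)
    return sorted(
        kind for kind, since in plan.items() if _version_key(since) <= target
    )
```

For example, parsers_available(".xml", "0.5") yields only the plain-text parser, while the same call with "1.1" also includes DOMParser.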

 

Toxy 0.5
Goal: Make text/data extraction possible
Supported formats: .txt, .xml (plain text), .csv, .xls, .xlsx, .docx

Toxy 0.8
Newly supported formats: .htm/.html (plain text), .vcard, .rtf (plain text)

Toxy 1.0
Goal: Convert documents to unified structures
Supported formats: .txt, .xml, .csv, .xls, .xlsx, .docx, .rtf, .eml, .htm/.html, .pdf, .vcf

Toxy 1.5

  • Support more document formats, such as .doc, .ppt/.pptx, and .xps
  • Support extracting metadata from image formats such as JPEG, GIF, and PNG
  • Integrate with Lucene.NET

Toxy 1.8

  • Support extracting metadata from audio formats such as WAV and MP3
  • Support extracting metadata from video formats such as MP4 and MPEG

Toxy 2.0

  • Automatic language identification
  • Automatic content type detection
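
One common way to detect a content type without trusting the file extension is to inspect the leading bytes of the file. The sketch below illustrates that general technique in Python. It is an assumption about how such a detector could work, not a description of Toxy's design, and the function name sniff_content_type and the returned labels are hypothetical.

```python
# Well-known leading byte signatures. The labels on the right are
# illustrative only and are not identifiers used by Toxy.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"{\\rtf", "rtf"),
    # .docx and .xlsx are ZIP packages, so the signature alone cannot
    # tell them apart; a real detector would look inside the archive.
    (b"PK\x03\x04", "zip-container"),
    # Legacy .xls and .doc files use the OLE2 compound file format.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-container"),
]


def sniff_content_type(data: bytes) -> str:
    """Return a label for the first matching signature, or 'unknown'."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"
```

Plain-text formats such as .txt, .csv, and .vcf have no fixed signature, so a detector along these lines would still need the extension or a content heuristic for them.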

Toxy 2.5

  • Entity recognition for specific languages, such as Chinese and English

 

Last edited Dec 12, 2014 at 6:54 PM by tonyqus, version 14