PDFHTML

Program for extracting words from an HTML file which was generated by a PDF to HTML converter.

The available programs that convert files from PDF to HTML produce only lists of graphical elements with absolute screen coordinates in pixels; they do not reconstruct the character sequence. This program takes as input the file produced by a PDF to HTML converter, extracts the actual strings, and uses them to recreate the text, to build an index of the page, and to add the meta tags that improve indexing by web search engines.
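
The reconstruction step can be illustrated with a minimal Python sketch. This is not PDFHTML's actual code: it assumes the converter emits each text fragment inside an element whose inline style carries `left` and `top` values in pixels, and the names `PositionedTextParser`, `rebuild_text`, `extract_words`, and the `line_tolerance` parameter are hypothetical. Fragments are sorted by position, grouped into lines, and joined in reading order.

```python
import re
from html.parser import HTMLParser

# Matches "left: 10px" / "top: 30px" but not "margin-left: 10px".
_POS = re.compile(r"(?<![\w-])(left|top)\s*:\s*(-?\d+(?:\.\d+)?)px")


class PositionedTextParser(HTMLParser):
    """Collects (top, left, text) for text inside absolutely positioned elements."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._current = None  # (left, top) of the open positioned element
        self._tag = None      # tag name of that element
        self._depth = 0       # nesting depth of same-named tags inside it

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        pos = {name: float(value) for name, value in _POS.findall(style)}
        if "left" in pos and "top" in pos:
            self._current = (pos["left"], pos["top"])
            self._tag, self._depth = tag, 1
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._current = self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and text:
            left, top = self._current
            self.fragments.append((top, left, text))


def rebuild_text(html, line_tolerance=2.0):
    """Joins fragments into lines: top to bottom, then left to right."""
    parser = PositionedTextParser()
    parser.feed(html)
    parser.close()
    lines = []  # each entry: [top of first fragment, [(left, text), ...]]
    for top, left, text in sorted(parser.fragments):
        if lines and abs(top - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append((left, text))
        else:
            lines.append([top, [(left, text)]])
    return "\n".join(" ".join(t for _, t in sorted(words)) for _, words in lines)


def extract_words(text):
    """Returns the lowercase words of the rebuilt text, e.g. for a page index."""
    return [word.lower() for word in re.findall(r"\w+", text)]
```

Given fragments that appear out of order in the HTML, `rebuild_text` returns them in reading order, and `extract_words` turns the result into a word list that could feed an index or a keywords meta tag. Real converter output may differ (other units, CSS classes instead of inline styles, multi-column pages), so the parsing rules would need adjusting to the converter in use.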