KC Lemson

By KC Lemson [MS]

More on cleaning up Word's HTML

Brian Alvey points to a Word HTML cleaner and explains why Word puts all those extra tags into HTML when you save a Word document as HTML or copy from Word into another HTML editor.

Also see my previous post about reducing the size of the HTML that Word creates when used as an email editor.

[via Scoble]

Comments
  • Too bad there is no source code to clean up Word's HTML. Or is there somewhere? Someone?

  • I think HtmlTidy can clean up Word HTML, and it has .NET and COM bindings. You can look at the source too if you really want to.

    http://tidy.sourceforge.net/

  • I found that the combination of the Office 2000 HTML Filter 2.0 (http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN) and the Textism web tool worked well for me. I had more than 20K, so the Office tool got me part way and the Textism filter did the remaining cleanup.

    The tool says it's for Office 2000, but worked okay on Office XP.

    This should be a feature of Word - export as clean HTML or something.

  • Lucky you, Word does have this feature. Save a file as "Web page, filtered" and it will strip out all Office-specific tags.

    Using the sample of HTML from my previous post (linked above):

    word.doc: 24KB
    word-as-web-page.htm: 7KB
    word-as-filtered-web-page.htm: 3KB

  • Oops. Lucky. Thanks. You can delete my post to cover for me any time... (in my defence, I'm stuck using Office 2000 at work)

  • I have already used the service (Textism), paying for an annual subscription, and it is not what it promises.

    The servers are always down and it only cleans part of the code.

    There are also still a lot of login problems and misrepresentation.

    Because it is a web-based service, he can cut off the service whenever he likes, manipulating access. I only got to use it for 4 months, and we were cut off without prior notice.

    He lacks ethics and professionalism.

    I would never recommend it. But I will recommend Word HTML Cleaner by Mambosoft; at least you get to keep the software, and server downtime won't interfere. You also get free lifetime updates.

    My suggestion: use Textism only to clean small files for free, that is, one-page documents; it is not worth paying for.

    Magnolia
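
For readers asking about source code, here is a rough Python sketch of the kind of cleanup discussed in the comments above. It is not how HTML Tidy, the Office HTML Filter, Textism, or Word's "Web page, filtered" option actually work; the function name `strip_word_markup` and the patterns it uses are assumptions made for illustration, and regular expressions are only a crude stand-in for a real HTML parser.

```python
import re

def strip_word_markup(html: str) -> str:
    """Hypothetical helper: remove some common Word-specific markup.

    A crude, regex-based sketch; it is not a full HTML parser and will
    miss or mangle unusual input.
    """
    # Drop conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop XML processing instructions such as <?xml:namespace ... />
    html = re.sub(r"<\?xml.*?>", "", html, flags=re.DOTALL | re.IGNORECASE)
    # Drop namespaced tags such as <o:p> and </o:p>, keeping their inner text
    html = re.sub(r"</?\w+:\w+[^>]*>", "", html)
    # Drop Word's class names (class=MsoNormal, class="MsoBodyText", ...)
    html = re.sub(r'\s+class="?Mso\w*"?', "", html, flags=re.IGNORECASE)
    # Drop mso-* style declarations (note: this also hits matching body text)
    html = re.sub(r"mso-[\w-]+\s*:[^;\"']*;?\s*", "", html,
                  flags=re.IGNORECASE)
    # Drop style attributes left empty by the previous step
    html = re.sub(r"\s+style=(\"\s*\"|'\s*')", "", html, flags=re.IGNORECASE)
    return html

sample = ('<p class=MsoNormal style="mso-margin-top-alt:auto;color:red">'
          '<o:p>Hi</o:p></p>')
print(strip_word_markup(sample))  # prints: <p style="color:red">Hi</p>
```

For anything beyond a quick experiment, the tools mentioned in the comments (or Word's own filtered save) are likely a better choice than hand-rolled patterns like these.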