This is SerkanYersen's TypePad Profile.
Recent Activity
Jeff, I have ported your code to JavaScript.
Thank you, it helped so much.
function cleanWord(str) {
  // Get rid of unnecessary tag spans (comments and title)
  str = str.replace(/<!--[\s\S]+?-->/gi, '');
  str = str.replace(/<title>[\s\S]+?<\/title>/gi, '');
  // Get rid of classes and styles
  str = str.replace(/\s?class=\w+/gi, '');
  str = str.replace(/\s+style='[^']+'/gi, '');
  // Get rid of unnecessary tags
  str = str.replace(/<(meta|link|\/?o:|\/?style|\/?div|\/?st\d|\/?head|\/?html|\/?body|\/?span|!\[)[^>]*?>/gi, '');
  // Get rid of empty paragraph tags (the filler may be &nbsp;,
  // a raw non-breaking space, or a plain space)
  str = str.replace(/(<[^>]+>)+(?:&nbsp;|\u00A0| )(<\/\w+>)/gi, '');
  // Remove bizarre v: attributes attached to <img> tags
  str = str.replace(/\s+v:\w+="[^"]+"/gi, '');
  // Collapse runs of blank lines into a single line break
  str = str.replace(/(\r?\n){2,}/g, '\n');
  // Fix entities: a regex with the g flag replaces every occurrence,
  // whereas a string pattern would replace only the first one
  str = str.replace(/\u201C/g, '"');   // left curly quote
  str = str.replace(/\u201D/g, '"');   // right curly quote
  str = str.replace(/\u2014/g, '\u2013'); // em dash to en dash
  return str;
}
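One thing to watch when reusing the entity step: in JavaScript, replace() with a plain string pattern swaps only the first match, so a regex with the g flag is needed to catch every curly quote. Here is a minimal sketch of that difference. The helper name fixEntities and the sample string are hypothetical, made up for illustration; the replacements mirror the three done at the end of cleanWord.

```javascript
// Hypothetical helper (not part of the original port): applies the same
// three character replacements as the "Fix entities" step in cleanWord.
function fixEntities(str) {
  return str
    .replace(/\u201C/g, '"')        // left curly quote  -> straight quote
    .replace(/\u201D/g, '"')        // right curly quote -> straight quote
    .replace(/\u2014/g, '\u2013');  // em dash -> en dash
}

// Made-up sample text containing two pairs of curly quotes.
var sample = '\u201Cone\u201D and \u201Ctwo\u201D';

// A string pattern replaces only the first left curly quote.
var firstOnly = sample.replace('\u201C', '"');

// The global regexes replace every occurrence.
var all = fixEntities(sample);

console.log(firstOnly);
console.log(all); // "one" and "two"
```

The \u escapes are used instead of the literal characters so the script does not depend on the file's encoding.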
Cleaning Word's Nasty HTML
I recently wrote a Word 2003 document that I later turned into a blog post. The transition between Word doc and HTML presented some problems. Word offers two HTML options in its save dialog: "Save as HTML" and "Save as Filtered HTML". In practice, that means you get to choose between totally na...
SerkanYersen is now following The Typepad Team
Apr 22, 2010
