This is D7samurai's Typepad Profile.
D7samurai
Recent Activity
LOL.. now I understand bobince's persistence in MY post about regex vs HTML: http://stackoverflow.com/questions/3951485/regex-extracting-only-the-visible-page-text-from-a-html-source-document (..and maybe some of you would be amused by my own persistence, too :) ). However, as I stated numerous times in my comments there, I wasn't out to parse the HTML per se, but "merely" interested in a much coarser extraction of the visible page text. And for my purposes, the regex approach works - it's a tradeoff between efficiency and total robustness, but the outcome is surprisingly solid. The final implementation can be found here: http://www.martinwardener.com/regex/ Mind you, regarding the "secondary" issue (extracting all links/URLs from an HTML document), it is of no concern that this implementation is over-eager (by design, btw) and picks out a few invalid URLs (mostly from script blocks) - those will be filtered out during the subsequent URL validation anyway.
Commented Oct 25, 2010 on Parsing Html The Cthulhu Way at Coding Horror
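To make the "over-eager extraction, then validation" idea in the comment above concrete, here is a minimal Python sketch. It is not the implementation linked in the comment: the pattern, the function names (`extract_candidates`, `is_valid_url`, `extract_urls`) and the validation rule are all assumptions chosen for illustration, and the validation step in particular is far simpler than anything a real crawler would need.

```python
import re
from urllib.parse import urlparse

# Deliberately loose pattern (an assumption, not the commenter's regex):
# grab anything that starts like an http(s) URL and runs until whitespace,
# a quote, or an angle bracket. It will happily match junk in script blocks.
CANDIDATE_URL = re.compile(r"""https?://[^\s"'<>]+""", re.IGNORECASE)
HOST_LABEL = re.compile(r"[A-Za-z0-9-]+")


def extract_candidates(html):
    """Over-eager pass: return every URL-looking substring, valid or not."""
    return CANDIDATE_URL.findall(html)


def is_valid_url(url):
    """Hypothetical second pass: keep only URLs with a plausible host name."""
    try:
        parts = urlparse(url)
        host = parts.hostname or ""
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or "." not in host:
        return False
    # Every dot-separated label must be non-empty letters, digits or hyphens.
    return all(HOST_LABEL.fullmatch(label) for label in host.split("."))


def extract_urls(html):
    """Combine both passes: extract loosely, then filter strictly."""
    return [u for u in extract_candidates(html) if is_valid_url(u)]


html = (
    '<a href="http://example.com/page">x</a>'
    '<script>var v = "http://{placeholder}/y";</script>'
)
print(extract_candidates(html))  # includes the script-block fragment
print(extract_urls(html))        # the fragment is dropped by validation
```

The point of the split is that the first pass can stay cheap and sloppy, since anything it wrongly picks up from script blocks is discarded by the second pass.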
D7samurai is now following The Typepad Team
Oct 25, 2010