8/01/2002

Regular Expressions::Removing HTML



ASPN : Rx Cookbook
From: http://aspn.activestate.com/ASPN/Cookbook/Rx/Recipe/59820 - a site containing excellent examples of useful regular expressions

When writing CGI scripts that accept text from users (discussion threads, for example), it is often useful to detect or remove HTML tags in the submitted content. This regular expression, documented in perlfaq9, is relatively effective at stripping HTML:
while (<>) {
    # Remove tags; quoted attribute values may themselves contain '>'
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    print;
}
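
Because the loop above reads one line at a time, a tag that spans a line break is never seen whole. One way around that is to apply the same substitution to a complete string. The following is only a minimal sketch: the helper name strip_tags and the sample input are made up for illustration and are not part of the original recipe.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original recipe): returns a copy of
# the given string with everything that looks like an HTML tag removed.
sub strip_tags {
    my ($text) = @_;
    # Same pattern as above. /s lets '.' match newlines inside quoted
    # attribute values; /g removes every tag, not just the first one.
    $text =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $text;
}

# Illustrative input: a quoted '>' inside an attribute, and a tag that
# is split across two lines.
my $sample = qq{<p class="a>b">Hello,\n<b\n>world</b>!</p>};
print strip_tags($sample), "\n";
```

Run as written, this should print "Hello," and "world!" on two lines. The pattern is still a heuristic rather than a parser. For example, an HTML comment that contains a ">" is only partly removed, so a real HTML parser such as HTML::Parser from CPAN is the safer choice when the input cannot be trusted.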
