Ever had the pleasure of extracting some <img src="..."> from the HTML tag soup called the Internet? I did recently. The number of invalid or “custom” ways people embed an image in an HTML document is just mind-blowing. Here are some examples:
<img src="http://www.asdf.com/some-image.gif" width="x" height="y" border="0" />
<img border="0" alt="" src="http://www.asdf.com/some-image.gif"/>
<IMG ALT="" src='http://www.asdf.com/some-image.gif'>
<IMG ALT="" SRC = 'http://www.asdf.com/some-image.gif' >
And so on, and so forth… /sigh
Just imagine every possible combination of mixed quoting styles, arbitrary spacing between attributes, lowercase, uppercase, etc. The differences are sometimes subtle (not to mention that some of it is invalid HTML), but people are blind to the fact that browsers are correcting their tag soup for them.
But I was on a mission: given some HTML string, I needed to find the first occurrence of an <img> element and get that <img> element’s src attribute value.
To keep things short, sweet and to the point, here’s the PHP function I slapped together:
/**
 * Searches for the first occurrence of an HTML <img> element in a string
 * and extracts the src if it finds it. Returns boolean false in case an
 * <img> element is not found.
 *
 * @param string $html An HTML string
 * @return mixed The contents of the src attribute in the
 *               found <img> or boolean false if no <img>
 *               is found
 */
function str_img_src($html) {
    // Cheap pre-check: skip the regex entirely when there is no "<img" at all.
    if (stripos($html, '<img') === false) {
        return false;
    }
    $imgsrc_regex = '#<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1#im';
    // $matches[1] holds the quote character, $matches[2] the src value.
    if (preg_match($imgsrc_regex, $html, $matches)) {
        return $matches[2];
    }
    return false;
}
The crux is the regular expression: <\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1, which passed several simple test cases (and a few edge cases) for me.
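To make the pattern's behavior a bit more concrete, here is a small sketch that runs it against the four sample tags from the top of this post. The $samples and $results variables are just names I picked for this illustration. This is not the test suite I actually used, only a quick sanity check.

```php
<?php
// The pattern from str_img_src(), unchanged.
$imgsrc_regex = '#<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1#im';

// The four sample tags from the top of this post.
$samples = array(
    '<img src="http://www.asdf.com/some-image.gif" width="x" height="y" border="0" />',
    '<img border="0" alt="" src="http://www.asdf.com/some-image.gif"/>',
    '<IMG ALT="" src=\'http://www.asdf.com/some-image.gif\'>',
    '<IMG ALT="" SRC = \'http://www.asdf.com/some-image.gif\' >',
);

$results = array();
foreach ($samples as $sample) {
    // Group 1 captures the opening quote so that \1 can demand the same
    // closing quote; group 2 is the (lazily matched) src value.
    $results[] = preg_match($imgsrc_regex, $sample, $matches) ? $matches[2] : false;
}

// Each entry should be the same URL: http://www.asdf.com/some-image.gif
var_dump($results);
```

The backreference \1 is what lets a single pattern cope with both single- and double-quoted values, and the i flag takes care of IMG vs. img and SRC vs. src.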
I’m no regex master, so it can probably be improved and/or optimized (performance-wise).
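As one possible direction for such an improvement, here is a hedged sketch of a looser variant. The function name str_img_src_loose is hypothetical and is not the function above. It assumes a PCRE build that supports branch-reset groups, i.e. (?|...). Compared to the original, it accepts any whitespace after img, requires whitespace directly before src (so an attribute like data-src is skipped), takes the first src in the tag rather than the last, and also accepts unquoted values. It is still a regex over tag soup, so a src= buried inside another attribute's value can fool it.

```php
<?php
/**
 * Hypothetical, looser variant of str_img_src(); a sketch, not a drop-in
 * replacement. Returns the src value of the first <img> or boolean false.
 */
function str_img_src_loose($html) {
    // (?| ... ) is a branch-reset group: whichever alternative matches
    // (double-quoted, single-quoted, or unquoted), the value lands in group 1.
    $regex = '#<img\b[^>]*?\ssrc\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+))#i';
    if (preg_match($regex, $html, $matches)) {
        return $matches[1];
    }
    return false;
}

// Usage:
var_dump(str_img_src_loose('<img src="real.gif" data-src="lazy.gif">')); // string(8) "real.gif"
var_dump(str_img_src_loose('<IMG SRC=pic.gif alt=x>'));                  // string(7) "pic.gif"
var_dump(str_img_src_loose('<p>no image</p>'));                          // bool(false)
```

For comparison, the greedy [^\>]* in the original pattern backtracks from the end of the tag, so for the first usage line above it would return lazy.gif instead.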
P.S.
Test data was obtained by grepping some 80+ different RSS and Atom feeds, which contained (X)HTML escaped and unescaped in so many ways it was just hilarious.
So, depending on your (X)HTML corpus, YMMV, but I hope you find this function useful somehow.