Extracting images from HTML using regular expressions
Ever had the pleasure of extracting some <img src="...">
from the HTML tag soup called the Internet? I did recently. The number of invalid or “custom” ways people embed an image inside an HTML document is just mind-blowing. Here are some examples:
<img src="http://www.asdf.com/some-image.gif" width="x" height="y" border="0" />
<img border="0" alt="" src="http://www.asdf.com/some-image.gif"/>
<IMG ALT="" src='http://www.asdf.com/some-image.gif'>
<IMG ALT="" SRC = 'http://www.asdf.com/some-image.gif' >
And so on, and so forth… /sigh
Just imagine every possible combination of mixed quoting styles, arbitrary spacing between attributes, lowercase, uppercase, etc. The differences are sometimes subtle (not to mention that some of it is invalid HTML), but people are blind to the fact that browsers are correcting their tag soup for them.
But I was on a mission: given some HTML string, I needed to find the first occurrence of an <img>
element and get that <img> element’s src
attribute value.
To keep things short, sweet and to the point, here’s the PHP function I slapped together:
/**
 * Searches for the first occurrence of an HTML <img> element in a string
 * and extracts its src attribute value.
 *
 * @param string $html An HTML string
 * @return string|false The contents of the src attribute of the first
 *                      matching <img>, or boolean false if no <img>
 *                      with a quoted src is found
 */
function str_img_src($html) {
    // Cheap case-insensitive check before running the regex at all.
    if (stripos($html, '<img') === false) {
        return false;
    }
    // Group 1 captures the opening quote, group 2 the src value;
    // \1 requires the closing quote to be the same as the opening one.
    $imgsrc_regex = '#<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1#im';
    if (preg_match($imgsrc_regex, $html, $matches)) {
        return $matches[2];
    }
    return false;
}
The crux is the regular expression used: <\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1, which passed several simple test cases (and a few edge cases) for me.
I’m no regex master, so it can probably be improved and/or optimized (performance-wise).
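For anyone who wants to poke at it, here is a quick sketch that runs the function against the four example tags from the top of the post. The function is repeated so the snippet runs on its own; the $samples array and the loop are only an illustration, not the test data mentioned in the P.S.

```php
<?php
// Copy of the function from the post so this snippet is self-contained.
function str_img_src($html) {
    if (stripos($html, '<img') === false) {
        return false;
    }
    $imgsrc_regex = '#<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1#im';
    if (preg_match($imgsrc_regex, $html, $matches)) {
        return $matches[2];
    }
    return false;
}

// The four example tags from the post (illustrative sample set).
$samples = array(
    '<img src="http://www.asdf.com/some-image.gif" width="x" height="y" border="0" />',
    '<img border="0" alt="" src="http://www.asdf.com/some-image.gif"/>',
    '<IMG ALT="" src=\'http://www.asdf.com/some-image.gif\'>',
    '<IMG ALT="" SRC = \'http://www.asdf.com/some-image.gif\' >',
);

foreach ($samples as $sample) {
    // Each of these should print the same image URL.
    var_dump(str_img_src($sample));
}

// No <img> at all: the function returns boolean false.
var_dump(str_img_src('<p>no image here</p>'));
```

Note that the uppercase variants match because of the i flag at the end of the pattern.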
P.S.
Test data was obtained by grepping some 80+ different RSS and ATOM feeds which contained (x)html escaped and unescaped in so many ways it was just hilarious.
So, depending on your (X)HTML corpus, YMMV, but I hope you find this function useful somehow.
speaking of hilarious RSS feeds… omg look at this one’s source :)
i think your script could come very handy in situations like this one
WOOT! My first post on your blog! :) Ha! :) Btw, blog is great, I love it!!
And now to the pressing matters. Based on my recent findings, I wouldn’t recommend parsing (X)HTML with regexps, especially in your case. Why?
a) it’s too slow
b) lots of things can go wrong.
Instead I would suggest you try parsing with ordinary DOM and SimpleXML… or you can go Rambo style and use a streaming parser with XMLReader. It’s x times faster and you can get something out of shitty broken xhtml.
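To make the DOM suggestion concrete, here is a minimal sketch of what it could look like for the "first img src" problem. It assumes the dom extension is available; dom_first_img_src is a made-up name, and this is one possible way to do it, not necessarily what the commenter had in mind. No claim is made here about speed compared to the regex.

```php
<?php
// Hypothetical helper: returns the src of the first <img> that has one,
// or boolean false if there is none.
function dom_first_img_src($html) {
    // loadHTML() refuses empty input, so bail out early.
    if (trim($html) === '') {
        return false;
    }

    // Collect parser complaints about tag soup internally instead of
    // emitting a PHP warning for each one.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return false;
    }

    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            return $img->getAttribute('src');
        }
    }
    return false;
}

var_dump(dom_first_img_src('<p>text <img border="0" alt="" src="http://www.asdf.com/some-image.gif"/></p>'));
```

The parser does the work of normalizing case, spacing and quoting, which is exactly the part the regex has to guess at.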
Yo seven!
Thx for the feedback.
I read about XMLReader previously, and noticed it is primarily aimed at efficient parsing of large (well-formed) xml documents, which is not my use case. I just have random chunks of (sometimes invalid) (x)html, which are usually just several kilobytes in size.
http://www.ibm.com/developerworks/library/x-pullparsingphp.html#N10253
Modifying php.ini to turn track_errors off (because XMLReader::read() returns false on non-well-formed structures and emits a warning), then capturing the error, and then still not being able to do shit about the contents just doesn’t work for my use case.
Do you have some working code examples of faster ways of doing this kind of stuff?
Because profiling results for my approach (with my corpus) using KCacheGrind on my local dev machine turned out to be rather fast…
I have to use XMLReader for parsing a 500 MB XML feed (actually 40 of them). Can’t use anything else because it dies. :) Just checked, it’s XMLReader->expand() that fails if the DOM is f*cked up. You’ll see lots of errors like this:
Warning: (php_errorHandler) XMLReader::expand() [function.XMLReader-expand]: <size/> in
My problem now is how to detect why dom is broken? Is it because xml file didn’t download correctly or somebody missed a tag. Any ideas?
Does it really matter why it’s broken? :)
And how come you even get to calling expand()? Did you try validating the whole document first?
http://hr.php.net/manual/en/function.xmlreader-isvalid.php (check first comment too)
If all else fails, maybe http://vtd-xml.sourceforge.net/ can help… you can probably parse it with vtd externally and then do with the contents whatever you needed doing in the first place?
As to how to check if an xml file has been tampered with… If the tool that produces them is *positive* about their well-formedness, make the tool generate a hash. If it can’t, have a script that generates a hash for all the files periodically or smtn. After download, compare the hashes. If the hashes are identical and the doc doesn’t validate — somebody missed something (tool’s fault?). Hash mismatch – download trouble. Something in between – just figure out who to blame :) The hash thing is critical, everything else is just about interpreting the fact that there haven’t been any download issues :p
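A rough sketch of that decision logic, assuming the producer publishes a SHA-256 hash next to each file (that part is an assumption, as are the function name diagnose_feed and the three return labels). The well-formedness check streams through the file with XMLReader and looks at the collected libxml errors, so nothing large is held in memory.

```php
<?php
// Hypothetical helper: $expectedHash is the SHA-256 the feed producer
// published for this file. Returns 'download', 'producer' or 'ok'.
function diagnose_feed($path, $expectedHash) {
    $actualHash = hash_file('sha256', $path);
    if ($actualHash === false || !hash_equals(strtolower($expectedHash), $actualHash)) {
        // The bytes we have are not the bytes that were published.
        return 'download';
    }

    // Same bytes as the producer's: now see whether they are well-formed.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();
    $reader = new XMLReader();
    $wellFormed = $reader->open($path);
    if ($wellFormed) {
        while ($reader->read()) {
            // Just walk the document; read() stops at the end or on an error.
        }
        $wellFormed = count(libxml_get_errors()) === 0;
        $reader->close();
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Hash matches but the XML is broken: somebody upstream missed a tag.
    return $wellFormed ? 'ok' : 'producer';
}
```

Usage would be something like diagnose_feed('/path/to/feed.xml', $publishedHash), where both the path and the published hash are placeholders.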
validation = too slow, unnecessary, I’m raising my own exceptions.
can’t influence 3rd party who delivers feeds. screw them. I parse what I get. :)
btw vtd-xml probably have their purpose, but not for my special need. XMLReader still outperforms it.
enough with the darn commenting. drag your ass out for a beer, man!! :)
Just wanted to say thanks, I’ve been looking for a method of doing this for quite a long time.
Just wanted to point out that this doesn’t account for various capitalization.
For example:
it also doesn’t account for no quotation after the equals sign, e.g. (if that’s still allowed)
pretty easily fixed, though, I think.
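One possible way to handle the unquoted case is sketched below. This is a variation on the post's regex, not the author's code: str_img_src_loose is a made-up name, the space after img is loosened to \s, and a second alternative is added that takes everything up to the next whitespace or '>' when there is no opening quote.

```php
<?php
// Hypothetical variant of str_img_src() that also accepts unquoted values.
function str_img_src_loose($html) {
    // Alternative 1: quoted value (groups 1 and 2, as in the post).
    // Alternative 2: unquoted value, up to whitespace or '>' (group 3).
    $regex = '#<\s*img\s[^>]*src\s*=\s*(?:(["\'])(.*?)\1|([^\s>"\']+))#is';
    if (!preg_match($regex, $html, $m)) {
        return false;
    }
    // Group 3 is only set (and non-empty) when the unquoted branch matched.
    return (isset($m[3]) && $m[3] !== '') ? $m[3] : $m[2];
}

var_dump(str_img_src_loose('<img src=http://www.asdf.com/some-image.gif alt=x>'));
var_dump(str_img_src_loose('<IMG ALT="" SRC = \'http://www.asdf.com/some-image.gif\' >'));
```

As with the original, this has only been thought through for simple tags, so treat it as a starting point.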
Hi,
I’ve just started learning regex’s and understand most of
'#<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1#im';
but can someone shed some light on a few issues?
1. Why does it use '#' instead of '/' as delimiters?
2. Why does the not-angle-bracket part ( [^\>] ) need an '*' afterwards?
3. Why does the (.*?) need the '?'
4. Oh and what does '\1' at the end mean?
Ok that was more questions than I thought but would be grateful if someone could clarify?
@Richard:
1. Using ‘#’ instead of a forward slash is an old habit I picked up. You can use almost any character, as long as the opening and closing delimiters are the same. With forward slashes being a rather common occurrence inside a src attribute, I chose to avoid them as a delimiter. It’s true that a ‘#’ might also appear inside a src, but I didn’t test if the regex compilation in php would fail in such cases. It could.
2. Because I wanted to capture zero or more of that preceding block. Learning by example might be easier: go to http://erik.eae.net/playground/regexp/regexp.html , paste in the 4 example img tags from the post, and paste in the regex from the post. Now you can change the regex there interactively and see what works and what doesn’t.
3. It doesn’t necessarily. The question mark makes it non-greedy, and since I wasn’t interested in the remaining attributes, I figured it would make it faster. It could very well make it slower — I didn’t test for that :)
4. That’s a backreference. Backreferences allow you to reuse a part of the regex match. It can get complicated, but it’s also quite powerful. Google for more.
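The four answers can be checked with a few lines of PHP. This is only a sketch with made-up sample strings. On point 1: the delimiter only marks where the pattern itself starts and ends, so a ‘#’ in the HTML being searched does not affect compilation; it would only matter if a ‘#’ had to appear inside the pattern.

```php
<?php
$regex = '#<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1#im';

// 1. A '#' in the subject string is harmless; only the pattern is delimited.
preg_match($regex, '<img src="http://www.asdf.com/a.gif#top">', $m);
$with_hash = $m[2];

// 2. [^\>]* means "zero or more characters that are not '>'", so src may
//    come right after "img " or after any number of other attributes.
preg_match($regex, '<img src="a.gif">', $m);
$zero_before = $m[2];
preg_match($regex, '<img border="0" alt="" src="b.gif">', $m);
$some_before = $m[2];

// 3. Lazy (.*?) stops at the first closing quote; greedy (.*) runs on
//    to the last quote it can find.
$two_tags = '<img src="a.gif" alt="x"> <img src="b.gif" alt="y">';
preg_match('#src="(.*?)"#', $two_tags, $m);
$lazy = $m[1];
preg_match('#src="(.*)"#', $two_tags, $m);
$greedy = $m[1];

// 4. \1 must match whatever group 1 captured (here a double quote), so
//    the single quote inside the value does not end the match.
preg_match($regex, '<img src="it\'s.gif">', $m);
$backref = $m[2];

var_dump($with_hash, $zero_before, $some_before, $lazy, $greedy, $backref);
```

In the greedy case the capture swallows everything between the first src=" and the last double quote in the string, which is why the non-greedy form is the right one for correctness, whatever it does for speed.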
Thank you so much … this was extremely helpful :)
Nice. Here is another script using biterscripting that will extract the images. To make sure it handles all the input formats you listed (thanks), I am using
– case-insensitive search
– regular expression with charsets (so char can be either ‘ or “, etc.)
# Script ExtractImages.txt
var str page, content, image
cat $page > $content
while ( { sen -r -c “^ 0 )
do
stex -r -c “^ null # Discards portion upto ” $image # Extracts “src…=…(‘”)…(‘”)” into $image
stex -r -c “^(‘\”)^]” $image > null # Discards portion upto the opening ‘ or “.
stex -r -c “[^(‘\”)^” $image > null # Discards portion starting at the closing ‘ or “.
echo $image # Lists the image path
done
Save the script in file C:/Scripts/ExtractImages.txt, run it as
script “C:/Scripts/ExtractImages.txt” page(“http://www.somesite.com/somepage.html”)
The documentation for the sen, stex, etc. commands is at http://www.biterscripting.com/helppages_editors.html . You may find some useful goodies in there.
I found the following regex works quite well for extracting the image source path:
\]+src\s*=\s*”([^”]+?)”[^\>]+\>
Try it on this page: http://heritage.stsci.edu/gallery/gallery_category.html
with http://erik.eae.net/playground/regexp/regexp.html
Looks like the first part of my regex got eaten for some reason…
Hi… this gets the first occurrence… what if I want to get all occurrences of the image tag… can you please help me… I want to extract image tags saved in my database by TinyMCE…
Don’t use regexes for that, go with something like this: http://stackoverflow.com/a/15895882
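For the "all occurrences" case, a parser-based version might look like the sketch below. It assumes the dom extension is available and uses a made-up function name (dom_all_img_srcs); it is not necessarily what the linked answer does.

```php
<?php
// Hypothetical helper: returns the src values of all <img> elements
// in document order, or an empty array if there are none.
function dom_all_img_srcs($html) {
    $srcs = array();
    // loadHTML() refuses empty input, so bail out early.
    if (trim($html) === '') {
        return $srcs;
    }

    // Keep the parser quiet about broken markup.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    if ($doc->loadHTML($html)) {
        $xpath = new DOMXPath($doc);
        // Select the src attribute node of every <img> that has one.
        foreach ($xpath->query('//img/@src') as $attr) {
            $srcs[] = $attr->value;
        }
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $srcs;
}

print_r(dom_all_img_srcs('<p><img src="a.gif"> some text <img alt="" src="b.gif"></p>'));
```

This works on fragments too, such as the chunks of HTML an editor like TinyMCE stores, since loadHTML() wraps them in a document for you.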