Comments on: Extracting images from HTML using regular expressions

By: zytzagoo

zytzagoo — Wed, 10 Sep 2014 15:55:56 +0000

In reply to Christy.

Don’t use regexes for that, go with something like this: http://stackoverflow.com/a/15895882
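
A minimal sketch of what a parser-based alternative to a regex can look like, here in Python with the standard library's `html.parser`. This is not the code from the linked answer; the class name `ImgSrcCollector`, the function `extract_image_sources`, and the sample markup are made up for illustration. It collects the `src` of every `img` tag, not only the first one.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <IMG SRC=...> works too.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)

    # Self-closing tags such as <img ... /> go through handle_startendtag,
    # whose default implementation calls handle_starttag, so no override is needed.


def extract_image_sources(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


sample = '<p><img src="a.png"><IMG SRC=\'b.jpg\' alt="x" /><img alt="no source"></p>'
print(extract_image_sources(sample))
```

Because the parser handles quoting and capitalization itself, the same function should work on markup stored by an editor such as TinyMCE, though that has not been tested here.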

By: Christy

Christy — Wed, 10 Sep 2014 15:22:39 +0000

Hi, this gets the first occurrence. What if I want to get all occurrences of the image tag? Can you please help me? I want to extract image tags saved in my database by TinyMCE.

By: Robert

Robert — Wed, 17 Mar 2010 12:19:31 +0000

Looks like the first part of my regex got eaten for some reason…

By: Robert

Robert — Wed, 17 Mar 2010 12:18:30 +0000

I found the following regex works quite well for extracting the image source path:

\]+src\s*=\s*"([^"]+?)"[^\>]+\>

Try it on this page: http://heritage.stsci.edu/gallery/gallery_category.html

with http://erik.eae.net/playground/regexp/regexp.html

By: JenniC

JenniC — Thu, 11 Feb 2010 15:18:21 +0000

Nice. Here is another script using biterscripting that will extract the images. To make sure it handles all the input formats you listed (thanks), I am using

– case-insensitive search
– regular expression with charsets (so char can be either ‘ or “, etc.)

# Script ExtractImages.txt
var str page, content, image
cat $page > $content
while ( { sen -r -c “^ 0 )
do
stex -r -c “^ null # Discards portion upto ” $image # Extracts “src…=…(‘”)…(‘”)” into $image
stex -r -c “^(‘\”)^]” $image > null # Discards portion upto the opening ‘ or “.
stex -r -c “[^(‘\”)^” $image > null # Discards portion starting at the closing ‘ or “.
echo $image # Lists the image path
done

Save the script in file C:/Scripts/ExtractImages.txt, run it as

script "C:/Scripts/ExtractImages.txt" page("http://www.somesite.com/somepage.html")

The documentation for the sen, stex, etc. commands is at http://www.biterscripting.com/helppages_editors.html . You may find some useful goodies in there.

By: clouseau

clouseau — Mon, 19 Jan 2009 00:58:30 +0000

Thank you so much … this was extremely helpful :)

By: zytzagoo

zytzagoo — Thu, 18 Sep 2008 23:19:43 +0000

@Richard:

1. Using ‘#’ instead of a forward slash is an old habit I picked up. You can use almost any character as the delimiter, as long as the opening and closing delimiters are identical. With forward slashes being a rather common occurrence inside a src attribute, I chose to avoid them as a delimiter. It’s true that a ‘#’ might also appear inside a src, but I didn’t test if the regex compilation in php would fail in such cases. It could.

2. Because I wanted to capture zero or more of that preceding block. Learning by example might be easier: go to http://erik.eae.net/playground/regexp/regexp.html , paste in the 4 example img tags from the post, and paste in the regex from the post. Now you can change the regex there interactively and see what works and what doesn’t.

3. It doesn’t necessarily. The question mark makes it non-greedy, and since I wasn’t interested in the remaining attributes, I figured it would make it faster. It could very well make it slower — I didn’t test for that :)

4. That’s a backreference. Backreferences allow you to reuse a part of the regex match. It can get complicated, but it’s also quite powerful. Google for more.
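
To make points 3 and 4 concrete, here is a small sketch using Python's `re` module (Python has no pattern delimiters, so point 1 does not apply there). The pattern is an assumed reconstruction, since the regex quoted in this thread lost its opening part, and the sample tags are invented.

```python
import re

# Assumed reconstruction of the pattern discussed above, not the exact original.
# (["'])  captures the opening quote character as group 1
# (.*?)   lazily captures the path as group 2
# \1      backreference: must match the same quote character that opened the value
IMG_SRC = re.compile(r'<img[^>]*src\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)

html = """
<img src="one.png" alt="first">
<IMG class='x' SRC='two.jpg'>
<img src="it's.gif">
"""

# finditer walks every match, so all image tags are reported, not only the first.
paths = [m.group(2) for m in IMG_SRC.finditer(html)]
print(paths)

# Without the '?', .* is greedy and runs to the last matching quote on the line.
greedy = re.search(r'src=(["\'])(.*)\1', '<img src="one.png" alt="first">')
print(greedy.group(2))
```

The third sample tag shows why the backreference is useful: the apostrophe inside the double-quoted value does not end the match, because `\1` only accepts the quote character that opened it.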

By: Richard

Richard — Thu, 18 Sep 2008 22:23:48 +0000

Hi, I've just started learning regexes and understand most of

'#]*src\s*=\s*(["\'])(.*?)\1#im'; but can someone shed some light on a few issues?

1. Why does it use '#' instead of '/' as delimiters?

2. Why does the "not an angle bracket" part ( [^/>] ) need a '*' afterwards?

3. Why does the (.*?) need the '?'?

4. Oh and what does '\1' at the end mean?

OK, that was more questions than I thought, but I would be grateful if someone could clarify.



By: mintyfresh
mintyfresh — Tue, 05 Aug 2008 20:26:22 +0000
Just wanted to point out that this doesn’t account for differences in capitalization.

For example:
It also doesn’t account for a missing quotation mark after the equals sign, e.g.  (if that’s still allowed).
Pretty easily fixed, though, I think.
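
One way to cover both cases, sketched with Python's `re` module: a case-insensitive flag for the tag and attribute names, plus an alternation that accepts double-quoted, single-quoted, or unquoted values. The pattern, the function name `image_sources`, and the sample tags are illustrative assumptions, not the regex from the post.

```python
import re

# Illustrative pattern (an assumption, not the post's regex).
IMG_SRC = re.compile(
    r'''<img\b[^>]*?\bsrc\s*=\s*   # tag name, any earlier attributes, then src=
        (?:"([^"]*)"               # group 1: double-quoted value
          |'([^']*)'               # group 2: single-quoted value
          |([^\s"'>]+))            # group 3: unquoted value, up to whitespace or >
    ''',
    re.IGNORECASE | re.VERBOSE,
)


def image_sources(html):
    # Exactly one of the three groups takes part in each match; pick that one.
    return [
        next(g for g in m.groups() if g is not None)
        for m in IMG_SRC.finditer(html)
    ]


sample = '<IMG SRC="a.png"> <Img Src=\'b.jpg\'> <img src=c.gif> <img alt=x src=d.png>'
print(image_sources(sample))
```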



By: Ben
Ben — Fri, 25 Jul 2008 20:20:47 +0000
Just wanted to say thanks, I’ve been looking for a method of doing this for quite a long time.