<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Extracting images from HTML using regular expressions</title>
	<atom:link href="http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/feed/" rel="self" type="application/rss+xml" />
	<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/</link>
	<description>On life, web dev and everything in between.</description>
	<lastBuildDate>Thu, 11 Feb 2010 15:18:21 +0100</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: JenniC</title>
		<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/comment-page-1/#comment-1783</link>
		<dc:creator>JenniC</dc:creator>
		<pubDate>Thu, 11 Feb 2010 15:18:21 +0000</pubDate>
		<guid isPermaLink="false">http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/#comment-1783</guid>
		<description>Nice. Here is another script using biterscripting that will extract the images. To make sure it handles all the input formats you listed (thanks), I am using

- case-insensitive search
- regular expression with charsets (so char can be either &#039; or &quot;, etc.)

# Script ExtractImages.txt
var str page, content, image
cat $page &gt; $content
while ( { sen -r -c &quot;^ 0 )
do
    stex -r -c &quot;^ null # Discards portion up to &quot; $image # Extracts &quot;src...=...(&#039;&quot;)...(&#039;&quot;)&quot; into $image
    stex -r -c &quot;^(&#039;\&quot;)^]&quot; $image &gt; null # Discards portion up to the opening &#039; or &quot;.
    stex -r -c &quot;[^(&#039;\&quot;)^&quot; $image &gt; null # Discards portion starting at the closing &#039; or &quot;.
    echo $image   # Lists the image path
done



Save the script in file C:/Scripts/ExtractImages.txt, run it as

script &quot;C:/Scripts/ExtractImages.txt&quot; page(&quot;http://www.somesite.com/somepage.html&quot;)


The documentation for the sen, stex, etc. commands is at http://www.biterscripting.com/helppages_editors.html . You may find some useful goodies in there.</description>
		<content:encoded><![CDATA[<p>Nice. Here is another script using biterscripting that will extract the images. To make sure it handles all the input formats you listed (thanks), I am using</p>
<p>- case-insensitive search<br />
- regular expression with charsets (so char can be either ' or ", etc.)</p>
<p># Script ExtractImages.txt<br />
var str page, content, image<br />
cat $page &gt; $content<br />
while ( { sen -r -c "^ 0 )<br />
do<br />
    stex -r -c "^ null # Discards portion up to " $image # Extracts "src...=...('")...('")" into $image<br />
    stex -r -c "^('\")^]" $image &gt; null # Discards portion up to the opening ' or ".<br />
    stex -r -c "[^('\")^" $image &gt; null # Discards portion starting at the closing ' or ".<br />
    echo $image   # Lists the image path<br />
done</p>
<p>Save the script in file C:/Scripts/ExtractImages.txt, run it as</p>
<p>script "C:/Scripts/ExtractImages.txt" page("http://www.somesite.com/somepage.html")</p>
<p>The documentation for the sen, stex, etc. commands is at <a href="http://www.biterscripting.com/helppages_editors.html" rel="nofollow">http://www.biterscripting.com/helppages_editors.html</a> . You may find some useful goodies in there.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: clouseau</title>
		<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/comment-page-1/#comment-1500</link>
		<dc:creator>clouseau</dc:creator>
		<pubDate>Mon, 19 Jan 2009 00:58:30 +0000</pubDate>
		<guid isPermaLink="false">http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/#comment-1500</guid>
		<description>Thank you so much ... this was extremely helpful :)</description>
		<content:encoded><![CDATA[<p>Thank you so much &#8230; this was extremely helpful :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: zytzagoo</title>
		<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/comment-page-1/#comment-1471</link>
		<dc:creator>zytzagoo</dc:creator>
		<pubDate>Thu, 18 Sep 2008 23:19:43 +0000</pubDate>
		<guid isPermaLink="false">http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/#comment-1471</guid>
		<description>@Richard: 

1. Using &#039;#&#039; instead of a forward slash is an old habit I picked up. You can use almost any character as a delimiter, as long as the opening and closing ones are identical. With forward slashes being a rather common occurrence inside a src attribute, I chose to avoid them as a delimiter. It&#039;s true that a &#039;#&#039; might also appear inside a src, but I didn&#039;t test if the regex compilation in php would fail in such cases. It could.

2. Because I wanted to capture zero or more of that preceding block. Learning by example might be easier: go to http://erik.eae.net/playground/regexp/regexp.html , paste in the 4 example img tags from the post, and paste in the regex from the post. Now you can change the regex there interactively and see what works and what doesn&#039;t.

3. It doesn&#039;t necessarily. The question mark makes it non-greedy, and since I wasn&#039;t interested in the remaining attributes, I figured it would make it faster. It could very well make it slower -- I didn&#039;t test for that :)

4. That&#039;s a backreference. Backreferences allow you to reuse a part of the regex match. It can get complicated, but it&#039;s also quite powerful. Google for more.</description>
		<content:encoded><![CDATA[<p>@Richard: </p>
<p>1. Using &#8216;#&#8217; instead of a forward slash is an old habit I picked up. You can use almost any character as a delimiter, as long as the opening and closing ones are identical. With forward slashes being a rather common occurrence inside a src attribute, I chose to avoid them as a delimiter. It&#8217;s true that a &#8216;#&#8217; might also appear inside a src, but I didn&#8217;t test if the regex compilation in php would fail in such cases. It could.</p>
<p>2. Because I wanted to capture zero or more of that preceding block. Learning by example might be easier: go to <a href="http://erik.eae.net/playground/regexp/regexp.html" rel="nofollow">http://erik.eae.net/playground/regexp/regexp.html</a> , paste in the 4 example img tags from the post, and paste in the regex from the post. Now you can change the regex there interactively and see what works and what doesn&#8217;t.</p>
<p>3. It doesn&#8217;t necessarily. The question mark makes it non-greedy, and since I wasn&#8217;t interested in the remaining attributes, I figured it would make it faster. It could very well make it slower &#8212; I didn&#8217;t test for that :)</p>
<p>4. That&#8217;s a backreference. Backreferences allow you to reuse a part of the regex match. It can get complicated, but it&#8217;s also quite powerful. Google for more.</p>
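<p>To make points 3 and 4 concrete, here is a minimal Python sketch (not the PHP from the post). The pattern is an assumption modelled on the one under discussion, not a verbatim copy, and the names IMG_SRC and extract_srcs are hypothetical. It shows the non-greedy group stopping at the first closing quote, and the backreference forcing that closing quote to match the opening one.</p>

```python
import re

# Assumed pattern, modelled on the regex discussed in the post (not a verbatim copy).
# Group 1 captures the opening quote; \1 requires the same character to close the value.
# (.*?) is non-greedy, so it stops at the first matching closing quote.
IMG_SRC = re.compile(r"""<img[^>]*src\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def extract_srcs(html):
    """Return the src values of all quoted <img> tags found in html."""
    return [match.group(2) for match in IMG_SRC.finditer(html)]


sample = (
    '<img src="a.png">'
    "<IMG alt='x' SRC = 'b#c.gif' />"
    '<img src="it\'s.jpg" width="10">'
)
print(extract_srcs(sample))
```

<p>The third tag is the interesting one: because the value was opened with a double quote, the single quote inside it does not end the match. Python has no pattern delimiters, so the '#' question from point 1 does not arise here.</p>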
]]></content:encoded>
	</item>
	<item>
		<title>By: Richard</title>
		<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/comment-page-1/#comment-1470</link>
		<dc:creator>Richard</dc:creator>
		<pubDate>Thu, 18 Sep 2008 22:23:48 +0000</pubDate>
		<guid isPermaLink="false">http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/#comment-1470</guid>
		<description>Hi,
I&#039;ve just started learning regexes and understand most of &lt;code&gt;&#039;#]*src\s*=\s*([&quot;\&#039;])(.*?)\1#im&#039;;&lt;/code&gt; but can someone shed some light on a few issues?

1. Why does it use &#039;#&#039; instead of &#039;/&#039; as delimiters?

2. Why does the not angle bracket part( [^/&gt;] ) need an &#039;*&#039; afterwards?

3. Why does the (.*?) need the &#039;?&#039;

4. Oh and what does &#039;\1&#039; at the end mean?

Ok that was more questions than I thought but would be grateful if someone could clarify?</description>
		<content:encoded><![CDATA[<p>Hi,<br />
I&#8217;ve just started learning regexes and understand most of <code>'#]*src\s*=\s*(["\'])(.*?)\1#im';</code> but can someone shed some light on a few issues?</p>
<p>1. Why does it use '#' instead of '/' as delimiters?</p>
<p>2. Why does the not angle bracket part( [^/&gt;] ) need an '*' afterwards?</p>
<p>3. Why does the (.*?) need the '?'</p>
<p>4. Oh and what does '\1' at the end mean?</p>
<p>Ok that was more questions than I thought but would be grateful if someone could clarify?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mintyfresh</title>
		<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/comment-page-1/#comment-1442</link>
		<dc:creator>mintyfresh</dc:creator>
		<pubDate>Tue, 05 Aug 2008 20:26:22 +0000</pubDate>
		<guid isPermaLink="false">http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/#comment-1442</guid>
		<description>Just wanted to point out that this doesn&#039;t account for various capitalization.
For example:




it also doesn&#039;t account for no quotation after the equals sign, e.g.  (if that&#039;s still allowed)

pretty easily fixed, though, I think.</description>
		<content:encoded><![CDATA[<p>Just wanted to point out that this doesn&#8217;t account for various capitalization.<br />
For example:</p>
<p>it also doesn&#8217;t account for no quotation after the equals sign, e.g.  (if that&#8217;s still allowed)</p>
<p>pretty easily fixed, though, I think.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ben</title>
		<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/comment-page-1/#comment-1434</link>
		<dc:creator>Ben</dc:creator>
		<pubDate>Fri, 25 Jul 2008 20:20:47 +0000</pubDate>
		<guid isPermaLink="false">http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/#comment-1434</guid>
		<description>Just wanted to say thanks, I&#039;ve been looking for a method of doing this for quite a long time.</description>
		<content:encoded><![CDATA[<p>Just wanted to say thanks, I&#8217;ve been looking for a method of doing this for quite a long time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: guycalledseven</title>
		<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/comment-page-1/#comment-21</link>
		<dc:creator>guycalledseven</dc:creator>
		<pubDate>Tue, 29 Jan 2008 00:35:10 +0000</pubDate>
		<guid isPermaLink="false">http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/#comment-21</guid>
		<description>validation = too slow, unnecessary, I&#039;m raising my own exceptions.

can&#039;t influence 3rd party who delivers feeds. screw them. I parse what I get. :)

btw vtd-xml probably has its purpose, but not for my special need. XMLReader still outperforms it.

enough with the darn commenting. drag your ass for a beer man!! :)</description>
		<content:encoded><![CDATA[<p>validation = too slow, unnecessary, I&#8217;m raising my own exceptions.</p>
<p>can&#8217;t influence 3rd party who delivers feeds. screw them. I parse what I get. :)</p>
<p>btw vtd-xml probably has its purpose, but not for my special need. XMLReader still outperforms it.</p>
<p>enough with the darn commenting. drag your ass for a beer man!! :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: zytzagoo</title>
		<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/comment-page-1/#comment-17</link>
		<dc:creator>zytzagoo</dc:creator>
		<pubDate>Mon, 28 Jan 2008 09:19:28 +0000</pubDate>
		<guid isPermaLink="false">http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/#comment-17</guid>
		<description>Does it really matter why it&#039;s broken? :)

And how come you even get to calling expand()? Did you try validating the whole document first?
http://hr.php.net/manual/en/function.xmlreader-isvalid.php (check first comment too)

If all else fails, maybe http://vtd-xml.sourceforge.net/ can help... you can probably parse it with vtd externally and then do with the contents whatever you needed doing in the first place?

As to how to check if an xml file has been tampered with... If the tool that produces them is *positive* about their well-formedness, make the tool generate a hash. If it can&#039;t, have a script that generates a hash for all the files periodically or something. After download, compare the hashes. If the hashes are identical and the doc doesn&#039;t validate, somebody missed something (tool&#039;s fault?). Hash mismatch: download trouble. Something in between: just figure out who to blame :) The hash thing is critical, everything else is just about interpreting the fact that there haven&#039;t been any download issues :p</description>
		<content:encoded><![CDATA[<p>Does it really matter why it&#8217;s broken? :)</p>
<p>And how come you even get to calling expand()? Did you try validating the whole document first?<br />
<a href="http://hr.php.net/manual/en/function.xmlreader-isvalid.php" rel="nofollow">http://hr.php.net/manual/en/function.xmlreader-isvalid.php</a> (check first comment too)</p>
<p>If all else fails, maybe <a href="http://vtd-xml.sourceforge.net/" rel="nofollow">http://vtd-xml.sourceforge.net/</a> can help&#8230; you can probably parse it with vtd externally and then do with the contents whatever you needed doing in the first place?</p>
<p>As to how to check if an xml file has been tampered with&#8230; If the tool that produces them is *positive* about their well-formedness, make the tool generate a hash. If it can&#8217;t, have a script that generates a hash for all the files periodically or something. After download, compare the hashes. If the hashes are identical and the doc doesn&#8217;t validate, somebody missed something (tool&#8217;s fault?). Hash mismatch: download trouble. Something in between: just figure out who to blame :) The hash thing is critical, everything else is just about interpreting the fact that there haven&#8217;t been any download issues :p</p>
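<p>The hash comparison could look roughly like this. It is a Python sketch under assumptions: SHA-256 is my choice of hash (the comment does not name one), the function names are hypothetical, and the expected digest is assumed to be published by whoever produces the feed. The file is read in chunks so a 500 MB feed never has to fit in memory.</p>

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large feeds are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, expected_hex):
    """True if the downloaded file matches the digest published by the producer."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

<p>If download_is_intact() returns True and the document still fails to parse, the problem is on the producing side; if it returns False, suspect the transfer.</p>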
]]></content:encoded>
	</item>
	<item>
		<title>By: guycalledseven</title>
		<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/comment-page-1/#comment-16</link>
		<dc:creator>guycalledseven</dc:creator>
		<pubDate>Sun, 27 Jan 2008 13:56:52 +0000</pubDate>
		<guid isPermaLink="false">http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/#comment-16</guid>
		<description>I have to use XMLReader for parsing 500mb large xml feed (actually 40 of them). Can&#039;t use anything else because it dies. :)  Just checked, it&#039;s XMLReader-&gt;expand() that fails if dom is f*cked up. You&#039;ll see lots of errors like this:
Warning: (php_errorHandler) XMLReader::expand() [&lt;a href=&#039;function.XMLReader-expand&#039; rel=&quot;nofollow&quot;&gt;function.XMLReader-expand&lt;/a&gt;]: &lt;size/&gt; in

My problem now is how to detect why dom is broken? Is it because xml file didn&#039;t download correctly or somebody missed a tag. Any ideas?</description>
		<content:encoded><![CDATA[<p>I have to use XMLReader for parsing 500mb large xml feed (actually 40 of them). Can&#8217;t use anything else because it dies. :)  Just checked, it&#8217;s XMLReader-&gt;expand() that fails if dom is f*cked up. You&#8217;ll see lots of errors like this:<br />
Warning: (php_errorHandler) XMLReader::expand() [<a href='function.XMLReader-expand' rel="nofollow">function.XMLReader-expand</a>]: &lt;size/&gt; in</p>
<p>My problem now is how to detect why dom is broken? Is it because xml file didn&#8217;t download correctly or somebody missed a tag. Any ideas?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: zytzagoo</title>
		<link>http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/comment-page-1/#comment-11</link>
		<dc:creator>zytzagoo</dc:creator>
		<pubDate>Fri, 25 Jan 2008 13:45:34 +0000</pubDate>
		<guid isPermaLink="false">http://zytzagoo.net/blog/2008/01/23/extracting-images-from-html-using-regular-expressions/#comment-11</guid>
		<description>Yo seven!

Thx for the feedback.
I read about XMLReader previously, and noticed it is primarily aimed at efficient parsing of large (well-formed) xml documents, which is not my use case. I just have random chunks of (sometimes invalid) (x)html, which are usually just several kilobytes in size.

http://www.ibm.com/developerworks/library/x-pullparsingphp.html#N10253

Modifying php.ini to turn track_errors off (because XMLReader::read() returns false on non-well-formed structures and emits a warning), then capturing the error, and still not being able to do shit about the contents just doesn&#039;t work for my use case.

Do you have some working code examples of faster ways of doing this kind of stuff?
Because profiling results for my approach (with my corpus) using KCacheGrind on my local dev machine turned out to be rather fast...</description>
		<content:encoded><![CDATA[<p>Yo seven!</p>
<p>Thx for the feedback.<br />
I read about XMLReader previously, and noticed it is primarily aimed at efficient parsing of large (well-formed) xml documents, which is not my use case. I just have random chunks of (sometimes invalid) (x)html, which are usually just several kilobytes in size.</p>
<p><a href="http://www.ibm.com/developerworks/library/x-pullparsingphp.html#N10253" rel="nofollow">http://www.ibm.com/developerworks/library/x-pullparsingphp.html#N10253</a></p>
<p>Modifying php.ini to turn track_errors off (because XMLReader::read() returns false on non-well-formed structures and emits a warning), then capturing the error, and still not being able to do shit about the contents just doesn&#8217;t work for my use case.</p>
<p>Do you have some working code examples of faster ways of doing this kind of stuff?<br />
Because profiling results for my approach (with my corpus) using KCacheGrind on my local dev machine turned out to be rather fast&#8230;</p>
]]></content:encoded>
	</item>
</channel>
</rss>
