ID:199063
 
Keywords: export, html, parse, website
(See the best response by Nadrew.)
Code:
client/verb
	Test_Any_Site(var/url as text) // eg: www.google.com
		var/site = ""
		world << "TESTING - [url]"
		site = get_website("http://[url]")
		world << "site length [length(site)]"

proc
	get_website(url)
		var/list/site = new()
		var/site_text = ""
		var/refresh = 0
		while(!site_text && refresh < 10)
			refresh++
			site = world.Export(url)

			if(!site) // Site is not loading at all
				world << "ATTEMPT [refresh] FAILED @ <[url]> - NO SITE - NEXT TRY [2*refresh] seconds"
				sleep(refresh*20) // Pause between retries for a better chance of getting a page (longer each time)
				continue

			site = site["CONTENT"]
			site_text = file2text(site)

			if(length(site_text) < 10) // Arbitrary length to catch bad pages
				world << "ATTEMPT [refresh] FAILED @ <[url]> - BAD SITE - [site_text] - NEXT TRY [2*refresh] seconds"
				site_text = ""
				sleep(refresh*20) // Pause between retries for a better chance

		if(!site_text) // If, even after all that, no page was loaded, return null
			world << "# ERROR # ATTEMPT [refresh] - NO SITE @ <[url]>"
			return

		return site_text


Problem description:
While trying to retrieve a webpage in order to parse its contents, I occasionally receive a bad page when I don't believe I should.

I am not sure if this is a code problem or a web page problem. Occasionally, instead of getting the webpage, I get something along the lines of:

and that is it.

It doesn't happen all the time (maybe 10% of the time) and it doesn't happen with all sites. One site I seem to have the most trouble with is: http://www.politifact.com/personalities/


Why are you storing the site within a list?
It should be coming back as a plain text string, shouldn't it?
You should try looking at the other data returned by the call to Export(); maybe it has more information on why it failed. I know it has a STATUS value or something that gives a code indicating the success of the call (like 200 OK, meaning everything went as expected) sent with the response.

I'd just log all of the data returned by Export() when you get that weird result.
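
Something along these lines would dump everything the call returns (a rough sketch; the proc name is just a placeholder, and CONTENT goes through file2text() since it comes back as a file):

proc
	dump_export(url)
		var/list/response = world.Export(url)
		if(!response)
			world << "No response at all from [url]"
			return
		for(var/key in response) // walk every key the server sent back
			if(key == "CONTENT")
				world << "CONTENT = [length(file2text(response[key]))] characters of text"
			else
				world << "[key] = [response[key]]"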
Here's a list of the response codes Keeth was talking about:
http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
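
As a rough sketch of using that, you could check the STATUS entry before trusting the content (assuming STATUS looks like the "200 OK" shown in the dumps further down; the helper name is just a placeholder):

proc
	export_succeeded(list/response)
		if(!response) return 0
		return copytext(response["STATUS"], 1, 4) == "200" // first three characters of "200 OK"

// usage:
//   var/list/response = world.Export(url)
//   if(!export_succeeded(response))
//       world << "Request failed: [response ? response["STATUS"] : "no response"]"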
Best response
world.Export() is entirely dependent on your connection. If it's failing, something is preventing it from connecting to the site in question -- this isn't overly uncommon, because sometimes connections get refused by the remote box or just don't go through as expected. Most browsers and the like have error handling to correct for this by retrying the request a few times -- if it doesn't work after a few attempts, they give you the error page.

What you should be doing here is validating the content of the page: if it doesn't come back as expected the first time, try a few more times -- if those fail, spit out an error. If it gives you that garbage text, you should handle that too; I imagine it's being caused by some kind of hiccup in either your network or theirs.

In short, don't expect the connection to work properly 100% of the time; expect it to fail at least occasionally. That way, if it does error out, you'll know it did so after several attempts and not just one goofy one.

And yes, you should heed the other replies and check the other values in the list that Export() returns; they could hold a clue as to why it's failing.
Nadrew, I figured that if he were having trouble viewing the site in his web browser, he would have said something.

From reading his post, it seems like this only happens with Export()?
I never said he would be having trouble viewing it in his browser; I said the browser handles errors better than Export(), so you tend not to encounter the issues that crop up when the initial connection fails or returns garbage. The browser tends to try a few times before giving up and spitting out an error.
Thanks, guys, for your replies. Yeah, I am not having problems accessing the page in my browser, only when using Export(). I added the error processing in the proc specifically because of this issue; by the 10th try the site usually loads, but I just wondered if there was something else going on that I should look for.

I will look up the HTTP status codes and see what else is being returned.

Flamesage, you said I shouldn't store the site as a list? Doesn't world.Export() return an associative list with the HTML stored under "CONTENT"? If there is a better way to do this, I am interested. If you are just referring to the line:

site = site["CONTENT"]


I am not really sure why I did that. I suppose I could remove that line and just do

site_text = file2text(site["CONTENT"])




EDIT:
I just ran it with a check to see what was actually being returned, and this is the dump from world.Export() failing:

STATUS = 200 OK
SERVER = nginx/0.7.67
DATE = Thu, 26 Jan 2012 10:24:21 GMT
CONTENT-TYPE = text/html; charset=utf-8
CONNECTION = close
CONTENT-ENCODING = gzip
X-CACHEABLE = YES(null)
CONTENT-LENGTH = 36232
X-VARNISH = 2000088232 2000086929
AGE = 289
VIA = 1.1 varnish
X-SERVED-BY = ip-10-90-154-68
X-CACHE = HIT
X-CACHE-HITS = 10
X-CACHE-BACKEND = default
CONTENT = .html



And this is what I get when it is successful:

STATUS = 200 OK
SERVER = nginx/0.7.67
DATE = Thu, 26 Jan 2012 10:24:24 GMT
CONTENT-TYPE = text/html; charset=utf-8
CONNECTION = close
X-CACHEABLE = YES(null)
CONTENT-LENGTH = 182340
X-VARNISH = 1487477892 1487476926
AGE = 223
VIA = 1.1 varnish
X-SERVED-BY = ip-10-90-174-200
X-CACHE = HIT
X-CACHE-HITS = 1
X-CACHE-BACKEND = default
CONTENT = .html


Does that mean anything to anyone? They look basically the same to me.
CONTENT-ENCODING = gzip << from this it seems the 'failing' call is returning a compressed file of sorts. Maybe you could try ftp()ing site["CONTENT"] to yourself and see what's up?
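
A quick sketch of doing that, assuming you trigger it from a verb so the file gets offered to you for download (the verb name is made up):

client/verb
	Dump_Site(url as text)
		var/list/response = world.Export("http://[url]")
		if(response && response["CONTENT"])
			src << ftp(response["CONTENT"], "site_dump.html") // prompts you to download the raw body; if it's the gzip case you may need to rename and extract it
		else
			src << "No content returned from [url]"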
In response to Metamorphman
Thank you for this. It seems that the site is sending me a gzip-compressed version of itself instead of just plain HTML, which really just makes this even more confusing.

I ftp()'d it to myself and extracted the file, and it was just a compressed copy of the HTML. I did some digging on CONTENT-ENCODING: gzip and found a little information. It seems that accepting compressed pages is something most modern browsers can do but BYOND can't. The pages I found explain that whether a server sends a gzip file depends on the "Accept-Encoding: gzip, deflate" header the browser (in this case BYOND) sends with the request. Is this something BYOND can change? Could we perhaps get support for compressed pages? Or am I asking the wrong questions?

Here is the site I found.

http://betterexplained.com/articles/how-to-optimize-your-site-with-gzip-compression/

I am not sure I understand all of it, or at least how it relates to this problem.

Also:
http://www.websiteoptimization.com/speed/tweak/compress/
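
For reference, the negotiation those articles describe boils down to two headers (illustrative only): the browser sends

Accept-Encoding: gzip, deflate

with its request, and the server marks a compressed reply with

Content-Encoding: gzip

which matches the extra line that shows up only in the failing dump above.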
Anyone have any ideas? I kind of feel this should now be either a feature request or a bug report, but I am not sure which is more appropriate.

It seems that either BYOND shouldn't be sent compressed pages through world.Export() (which is a matter of the request headers), or it should know how to handle them and decompress them.
This is most likely not a problem with compressed pages; it's a problem with the communication. You now know how to determine whether a request has failed -- the CONTENT-ENCODING value will be "gzip". When you get this response, log it away and make a few more attempts. If it still fails after four or five more tries, then mark it down as a failed request that isn't going to start working any time soon.

From what you said, it seems to just do this from time to time, meaning even when it does fail, it starts working again right after, right? If that is the case, the "retry" option seems to be your best bet for making your system "work". Just log the failure away, and try again.
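
A minimal sketch of that retry logic, using the key names from the dumps above (the proc name, retry count, and delay are just placeholders):

proc
	fetch_page(url, max_tries = 5)
		for(var/i = 1 to max_tries)
			var/list/response = world.Export(url)
			if(response && response["CONTENT"] && response["CONTENT-ENCODING"] != "gzip")
				return file2text(response["CONTENT"]) // looks like a normal, uncompressed page
			world << "Attempt [i] at [url] came back empty or compressed, retrying"
			sleep(i * 20) // back off a little longer each attempt
		return null // still failing after max_tries; log it as a bad request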

As for the statement that "it (BYOND) should know how to handle them and decompress them," the idea that the site you're querying only periodically sends out compressed pages seems a bit odd. I'm pretty sure most web servers, like Apache, compress all outgoing data.

Though, I suppose the only person who can say what is really happening here is someone like Tom or Lummox.
Thank you for your response, Keeth. At this point, it really comes down more to my curiosity than anything else. It is an annoying issue, since the entire point of my program is to compile data from certain websites and then run some statistical analysis on it. If I can't access the sites, then I really can't do much of anything.

The workarounds work most of the time and I am happy to use them; however, if this is something that BYOND can check out in the future, I would like to put in a request. The problem is, I really don't know enough about the issue to know whether this is a bug or a missing feature.

The 10% figure is an estimate based on all the sites I have tried. On some sites (like the one listed above, http://www.politifact.com/personalities/) it happens about 60% of the time. On others, not at all.