badly formed HTML...if statements are server side directives?

sneakyimp

I was recently tasked with parsing some HTML documents to try and remove a particular section and replace it with some kind of server-side-include or PHP code. I tried to parse these documents using DOMDocument but I get all kinds of complaints. Looking at the HTML, I see some HTML5 stuff which looks a bit new to me but I also see this:

<!--[if IE 7]>
<html class="ie ie7" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html class="ie ie8" lang="en-US">
<![endif]-->
<!--[if !(IE 7) | !(IE 8) ]><!-->
<html lang="en-US" style="background: none repeat scroll 0% 0% transparent;">
<!--<![endif]-->

<!--[if gte IE 9]>
  <style type="text/css">
    .gradient {
       filter: none;
    }
  </style>
<![endif]-->

This obviously seems like code to check for versions of IE and respond accordingly but it seems to be neither HTML nor Javascript to me. What on earth are these commented bits of code? Is this some kind of server-side include?

Also, I noticed that one can specify a version and encoding when using the DOMDocument constructor, but the documentation only provides a weird, non-informative description of what the version value indicates:

php docs wrote:

version

The version number of the document as part of the XML declaration.

First, the example in the docs is supplying the string "1.0" as the version number. Second, version of what? Would I put "5" in here for HTML 5?

Weedpacket

The second question is more straightforward to answer than the first so I'll answer it first: it refers to the version of XML used (1.0 or 1.1).

Details of the difference are in the 1.1 spec of course, but basically it's in how one decides whether a given character is allowable in a given situation: 1.0 lists the characters that are allowed and forbids all others, 1.1 lists characters that aren't and allows all others; but if you're not getting wacky with your element names (and in compliant HTML you won't be) there's no difference: stick with 1.0 here.
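To make that concrete, here is a minimal sketch of where the two constructor arguments end up. The element name "root" is only a placeholder for illustration.

```php
<?php
// First argument: XML version. Second argument: document encoding.
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->appendChild($doc->createElement('root'));

// saveXML() writes the XML declaration built from those two values.
$xml = $doc->saveXML();
echo $xml;
```

The declaration at the top of the output is the only place the version shows up, so "5" for HTML5 would not mean anything here.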

Pedantic niggle: the HTML5 spec defines two syntaxes for HTML5 documents, the HTML-based "HTML" syntax and the XML-based "XHTML" syntax. You might want to see which one you're producing but unless you're getting fancy with namespaces the most you'd need to do is check the doctype. (I don't actually know the situation re: the DOM extension's support for HTML5).

The first one is Microsoft's client-side mechanism used for browser-sniffing in Internet Explorer (https://msdn.microsoft.com/en-us/library/ms537512%28v=vs.85%29.aspx, http://www.quirksmode.org/css/condcom.html) as their solution for handling the incompatibilities between Internet Explorer n and other browsers (including Internet Explorer n-1 and Internet Explorer n+1). It never caught on because (as well as breaking the HTML - two <html> opening tags?) no other browser had so many compatibility problems.

It was introduced in IE5, and I think discontinued in IE10 - (not to mention that IE11 or maybe 12 is supposed to be the last version of Internet Explorer - but it wouldn't be the first time Microsoft has announced the end of life for something that they then kept going).
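If the conditional comments get in the way of your parsing, one option is to strip them before handing the markup to a parser. What follows is a rough sketch rather than a robust solution: stripConditionalComments() is a hypothetical helper name, and regular expressions on HTML are fragile, so treat the patterns as assumptions that fit the snippet you posted and check them against the real files.

```php
<?php
/**
 * Hypothetical helper: remove IE conditional comments from an HTML string.
 */
function stripConditionalComments(string $html): string
{
    $patterns = [
        // "Revealed" opening marker, e.g. <!--[if !(IE 7) | !(IE 8) ]><!-->
        // Only the marker is removed; the markup after it is kept.
        '/<!--\[if\s[^\]]*\]><!-->/i',
        // "Revealed" closing marker: <!--<![endif]-->
        '/<!--<!\[endif\]-->/i',
        // "Hidden" block: <!--[if IE 7]> ... <![endif]-->
        // Non-IE browsers see the whole block as one comment, so drop it all.
        '/<!--\[if\s[^\]]*\]>.*?<!\[endif\]-->/is',
    ];

    // An array of patterns is applied in order, which matters here:
    // the revealed markers must go before the hidden-block pattern runs.
    return preg_replace($patterns, '', $html) ?? $html;
}
```

Run on the header you posted, this should leave only the plain <html lang="en-US" ...> opening tag and remove the gradient style block.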

Bonesnap

This wouldn't happen to be a WordPress site, would it? That code looks pretty much identical to a lot of the default themes' headers. I normally nuke about 95% of the default crap and use my own, partially for this reason.

And yeah, according to Microsoft, the conditional comments are ignored by IE10 and later. IE11 is the last version of Internet Explorer and their new browser, "Spartan", will be its successor. I have a hunch they're going to stick with the name Spartan much like how they stuck with Windows 7.

sneakyimp

Thanks for the input, guys.

Weedpacket;11046425 wrote:
The second question is more straightforward to answer than the first so I'll answer it first: it refers to the version of XML used (1.0 or 1.1).

Thanks for the links here. I think it's curious that DOMDocument has both a loadHTML method and also a loadXML method and both seem to choke on this HTML source file I'm working with. The loadXML method throws a LOT more errors than the loadHTML method does which seems backwards to me. I also find it bothersome that all these parse errors appear as E_WARNING notifications, thereby sneaking their way into the script output. Seems it would be better to collect such warnings in an array or something rather than barfing them out to STDOUT or STDERR.

loadHTML complains about:
* a <head> tag before the opening <body> tag, which makes sense ("htmlParseStartTag: misplaced <head> tag in Entity")
* a <header> tag, which is valid HTML5 unless I'm mistaken ("Tag header invalid in Entity")
* a <nav> tag (also valid HTML5?? --- "Tag nav invalid in Entity")
* a <section> tag (also valid HTML5?? --- "Tag section invalid in Entity")
* a <footer> tag (also valid HTML5?? --- "Tag footer invalid in Entity")

This makes me think that DOMDocument->loadHTML leaves something to be desired as far as parsing valid HTML documents. These errors seem entirely unrelated to choice of character set.

Weedpacket;11046425 wrote:
Pedantic niggle: the HTML5 spec defines two syntaxes for HTML5 documents, the HTML-based "HTML" syntax and the XML-based "XHTML" syntax. You might want to see which one you're producing but unless you're getting fancy with namespaces the most you'd need to do is check the doctype. (I don't actually know the situation re: the DOM extension's support for HTML5).

Hahaha which one I'm producing bwahahhaHAHAHA. No, this is someone else's formerly-WordPress site which some genius decided to export to static HTML rather than battle WordPress. I'm expecting it will be my job to turn it back into some kind of dynamically generated site at some point.

Weedpacket;11046425 wrote:
The first one is Microsoft's client-side mechanism used for browser-sniffing in Internet Explorer (https://msdn.microsoft.com/en-us/library/ms537512%28v=vs.85%29.aspx, http://www.quirksmode.org/css/condcom.html) as their solution for handling the incompatibilities between Internet Explorer n and other browsers (including Internet Explorer n-1 and Internet Explorer n+1). It never caught on because (as well as breaking the HTML - two <html> opening tags?) no other browser had so many compatibility problems.

For the love of GOD this microsoft-breaking-**** saga is still going on? When will it end?

Weedpacket;11046425 wrote:
It was introduced in IE5, and I think discontinued in IE10 - (not to mention that IE11 or maybe 12 is supposed to be the last version of Internet Explorer - but it wouldn't be the first time Microsoft has announced the end of life for something that they then kept going).

DIE DIE DIE DIE DIE.

Bonesnap wrote:
This wouldn't happen to be a WordPress site, would it? That code looks pretty much identical to a lot of the default themes' headers. I normally nuke about 95% of the default crap and use my own, partially for this reason.

You hit the nail on the head. The decision was made that WP presented a security problem so the dynamic website was sacrificed and mummified into static HTML. Brilliant, don't you think?

Bonesnap wrote:
And yeah, according to Microsoft, the conditional comments are ignored by IE10 and later. IE11 is the last version of Internet Explorer and their new browser, "Spartan", will be its successor. I have a hunch they're going to stick with the name Spartan much like how they stuck with Windows 7.

You'd think with that $40B cash hoard MS used to have that they could have written a decent browser without ruining the internet for everyone.

Weedpacket

sneakyimp wrote:
The loadXML method throws a LOT more errors than the loadHTML method does which seems backwards to me.

Not really: XML is a lot more strict than HTML; that's why the HTML syntax gets a full parser-level description while the XHTML syntax just gets "oh, just look at the XML spec". In HTML, for example, you can have opening tags without closing tags, some elements don't have closing tags, and sometimes the opening tag can be missing as well. Try validating this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<title</><body>

I wouldn't be surprised if something similar could be done in HTML5 (though I haven't tried to put it together. Yet).
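A quick way to see the difference in strictness is to feed the same sloppy markup to both loaders. This is only a sketch: the sample markup is made up, and the exact error counts will depend on your libxml version.

```php
<?php
// Collect parser errors internally instead of emitting E_WARNINGs.
libxml_use_internal_errors(true);

// An unclosed <p> and a bare <br>: routine in HTML, not well-formed XML.
$markup = '<html><body><p>one<br>two</body></html>';

$asHtml = (new DOMDocument())->loadHTML($markup);
$htmlErrors = count(libxml_get_errors());
libxml_clear_errors();

$asXml = (new DOMDocument())->loadXML($markup);
$xmlErrors = count(libxml_get_errors());
libxml_clear_errors();

var_dump($asHtml, $asXml);
printf("HTML errors: %d, XML errors: %d\n", $htmlErrors, $xmlErrors);
```

The HTML parser repairs the markup and carries on, while the XML parser gives up at the first tag mismatch and loadXML returns false.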

This makes me think that DOMDocument->loadHTML leaves something to be desired as far as parsing valid HTML documents.

Sounds like it (or rather libxml) isn't up to HTML5 standards yet (indeed, it only claims to understand HTML4 and work on 5 seems to have stalled for the last few years).
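In the meantime, the unknown-tag complaints can be filtered out so that only the other problems remain. A minimal sketch, assuming the messages keep the "Tag xyz invalid" wording you quoted; withoutUnknownTagErrors() is a hypothetical helper name and the sample markup is invented.

```php
<?php
/**
 * Hypothetical helper: drop "Tag xyz invalid" complaints, keep everything else.
 *
 * @param iterable $errors objects with a string ->message (e.g. LibXMLError)
 */
function withoutUnknownTagErrors(iterable $errors): array
{
    $kept = [];
    foreach ($errors as $error) {
        if (!preg_match('/^Tag \S+ invalid/', $error->message)) {
            $kept[] = $error;
        }
    }
    return $kept;
}

libxml_use_internal_errors(true); // buffer errors rather than raise warnings
$doc = new DOMDocument();
$doc->loadHTML('<!DOCTYPE html><html><body><nav><p>Menu</p></nav></body></html>');
$errors = withoutUnknownTagErrors(libxml_get_errors());
libxml_clear_errors();

// The parser may complain about <nav>, but the element is still in the tree.
echo $doc->getElementsByTagName('nav')->length, "\n";
```

The point is that these warnings are noise rather than failures: the HTML5 elements still end up in the DOM and can be queried like any others.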

I also find it bothersome that all these parse errors appear as E_WARNING notifications, thereby sneaking their way into the script output. Seems it would be better to collect such warnings in an array or something rather than barfing them out to STDOUT or STDERR.

Since DOMDocument runs on top of libxml, [man]libxml_use_errors[/man] should catch at least most of them. I don't know if that's before or after the message is generated, though; if it's after, then you might need to suppress them manually.

You'd think with that $40B cash hoard MS used to have that they could have written a decent browser without ruining the internet for everyone.

How do you think they made that $40B? It was cheaper to invent conditional comments: turn the job of making the browser compatible with the web into the job of making the web compatible with the browser - i.e., someone else's problem.

johanafm

Weedpacket;11046463 wrote:
Since DOMDocument runs on top of libxml, [man]libxml_use_errors[/man] should catch at least most of them. I don't know if that's before or after the message is generated, though; if it's after, then you might need to suppress them manually.

There is [man]libxml_use_internal_errors[/man] which means no such errors will show in the regular error log. You will have to check for them actively using libxml functions. Without using internal errors, they are all passed to regular error handling functions.
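Roughly, the pattern looks like the sketch below. loadHtmlQuietly() is a hypothetical wrapper name and the sample markup is invented; the libxml_* calls are the standard ones.

```php
<?php
/**
 * Hypothetical wrapper: parse HTML and return [document, errors]
 * without any warnings reaching the script output.
 */
function loadHtmlQuietly(string $html): array
{
    $previous = libxml_use_internal_errors(true); // buffer errors internally
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $errors = libxml_get_errors();          // array of LibXMLError objects
    libxml_clear_errors();                  // empty the buffer for the next parse
    libxml_use_internal_errors($previous);  // restore the caller's setting
    return [$doc, $errors];
}

[$doc, $errors] = loadHtmlQuietly('<html><body><p>Hello</p></bogus></body></html>');
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Each LibXMLError carries the level, code, line, column and message, so you can log them, ignore them, or decide for yourself which ones are serious.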

sneakyimp

Thanks for the input, both of you. Clear and helpful as usual.