Help: divide html code in units

dagon · Apr 16, 2009

What are you actually trying to achieve? What's the purpose of this?

GoRide_ · Apr 16, 2009

I am making a site where users can register and create HTML/WML pages, with options to edit them. One of the edit options must be a UNITS edit, so I need to divide the HTML code of that file into units.

That's why I asked for help. It is PROBABLY possible with preg_replace, so please, if somebody can help...

These are the tags which need to be recognized and divided into units:

<a href=""> </a> - UNIT LINK
<img src="" alt=""> - UNIT IMAGE
<p> </p> - UNIT PARAGRAPH
<br> - UNIT BREAK

When there is some text (without tags), it should be recognized as UNIT TEXT.
If a tag from the HTML file is not listed above, it should be shown as UNKNOWN UNIT.

That's it...

dagon · Apr 16, 2009

How are you defining UNIT? It does not exist in the specs in the way you are using the word.

GoRide_ · Apr 16, 2009

OK, my English is bad, and I am being confusing. I will try to explain it more simply.

I need a PHP script which will open an HTML file located on the same server ...fopen("test.html")..

Then I need to read it line by line, and if the PHP script finds in the HTML file, for example,

<a href="http://google.com">GOOGLE</a>, I want to print that in the following format:

LINK: google(http://google.com)

So the script needs to recognize that it is a link and print it in the format above.

I need the same for the <img> tag, <p>, <br>, ... If the script finds BR in the text, it will show BREAK LINE.

I did that for A HREF:

if(preg_match('/<a href=\"(.*?)\">.*?<\/a>/i', $buff2))
{
	$buff2 = preg_replace('/<a href=\"(.*?)\">(.*?)<\/a>/i', "\n".'LINK: $1 ($2)', $buff2);
}

But I tried to do the same for <img> and it doesn't work.

elseif(preg_match('/<img src=\"(.*?)\" alt=\"(.*?)\"\/>/i', $buff2))
{
	$buff2 = preg_replace('/<img src=\"(.*?)\" alt=\"(.*?)\"\/>/i', "\n".'IMAGE: $1', $buff2);
}

Also, when there is only text which is not located between any tags, it needs to be recognized as TEXT: followed by part of the text (10 characters, for example).

$buff2 = preg_replace("/(.+)\n/i", 'TEXT: $1' . "\n", $buff2);

The problem is that when there is an empty line in the HTML file, the PHP script shows it as TEXT: with nothing after it...

Maybe now it's easier to understand.

Weedpacket · Apr 16, 2009

GoRide! wrote:
Can I load a WML file using dom->loadXML()?

Since WML is an application of XML, the answer is yes.

Surprisingly, the DOM extension can also read HTML, even though HTML is not XML.
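
As a rough illustration of the two loaders, here is a minimal sketch. The inline WML and HTML strings are made-up stand-ins for real files, and the variable names are only for this example.

```php
<?php
// Hypothetical inline documents standing in for real files on the server.
$wml  = '<?xml version="1.0"?><wml><card id="main"><p>Hello</p></card></wml>';
$html = '<html><body><p>Hello<br>world</p></body></html>';

// Collect parser complaints internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// WML is XML, so the XML loader applies.
$wmlDoc = new DOMDocument();
$wmlDoc->loadXML($wml);
$wmlCards = $wmlDoc->getElementsByTagName('card')->length;

// The HTML loader accepts markup that is not well-formed XML, such as <br>.
$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML($html);
$htmlBreaks = $htmlDoc->getElementsByTagName('br')->length;

echo "cards: $wmlCards, breaks: $htmlBreaks\n";
```

For files on disk, load() and loadHTMLFile() are the corresponding methods.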

GoRide_ · Apr 17, 2009

Hm, okay... but can anybody tell me what's the problem with this code?

elseif(preg_match('/<img src=\"(.*?)\" alt=\"(.*?)\"\/>/i', $buff2))
{
	$buff2 = preg_replace('/<img src=\"(.*?)\" alt=\"(.*?)\"\/>/i', "\n".'IMAGE: $1', $buff2);
}

Weedpacket · Apr 17, 2009

Apart from the fact that you're not using DOM?

The most obvious thing to note is that the preg_match() test gains you nothing except having to do everything twice if there is a match.

It also requires

both src= and alt= attributes
in that order
no other attributes
a single space between the "img" and the start of the src= attribute
double-quoted attribute values
no spaces around the '=' separating either attribute name from its value
a single space between the ending quote on the src= attribute and the start of the alt= attribute
no space after the end of the alt= attribute
nothing between the end of the alt= attribute and the / at the end of the tag
and a / at the end of the tag.

If any of these conditions are not met then the pattern will not match.

Apart from that, there's nothing wrong with the pattern (the double quotes don't need to be escaped as they're not considered significant, but since they aren't significant the extra escapes get ignored anyway). Obviously it won't work if $buff2 is the wrong variable to begin with, but like I said, that's obvious.
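
To make those conditions concrete, here is a small sketch that runs the same pattern (without the redundant escapes) against a few made-up img tags. The sample strings are hypothetical; only the first has the exact shape the pattern demands.

```php
<?php
$pattern = '/<img src="(.*?)" alt="(.*?)"\/>/i';

// Hypothetical sample tags for illustration.
$samples = array(
    '<img src="a.png" alt="A"/>',   // exact shape the pattern expects
    '<img src="a.png" alt="A">',    // no / at the end of the tag
    '<img src="a.png" alt="A" />',  // space after the alt= attribute
    '<img alt="A" src="a.png"/>',   // attributes in the other order
    "<img src='a.png' alt='A'/>",   // single-quoted attribute values
);

$results = array();
foreach ($samples as $tag) {
    $matched = preg_match($pattern, $tag);   // 1 on a match, 0 otherwise
    $results[] = $matched;
    echo $tag, ' => ', $matched ? 'match' : 'no match', "\n";
}
```

All of the variants are ordinary ways to write an img tag, which is why a parser is a better fit here than a single fixed pattern.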

GoRide_ · Apr 17, 2009

OK, so I need to work with PHP DOM?
Is there any example of how to do this with DOM?

Weedpacket · Apr 18, 2009

$doc = new DOMDocument();
$doc->loadHTMLFile('source.html');

// The node list is live: replacing an <img> removes it from the list,
// so keep taking the first item until none are left.
$images = $doc->getElementsByTagName('img');
while($images->length)
{
	$image = $images->item(0);
	$label = "IMAGE: ".$image->getAttribute('src');
	$image->parentNode->replaceChild($doc->createTextNode($label), $image);
}

$doc->saveHTMLFile('target.html');

Of course, that's only an example. The real thing would no doubt require a bit of thought.

GoRide_ · Apr 18, 2009

Hm, but how can I get attributes for all tags?

I am selecting all tags from the HTML file with

$tags = $doc->GetElementsByTagName('*');

The problem is that I cannot select only images or some other single tag; the result must be ordered as in the file.

If the first line in the file is bla, it must be first in the "unit" list.

Any help?

Weedpacket · Apr 18, 2009

The list of elements returned by GetElementsByTagName is in document order.

Now that you've got a list of all the elements in the document, the obvious thing to do would be to go through each element of the list and, depending on what sort of element it is, do something appropriate.

GoRide_ · Apr 18, 2009

Sorry to bother you,

but if you have time, could you please write an example of working with some tag (no matter which) once all tags are loaded?

Weedpacket · Apr 19, 2009

$doc = new DOMDocument();
$doc->loadHTMLFile('test.html');

// '*' selects every element; the list is in document order.
$elements = $doc->getElementsByTagName('*');
$length = $elements->length;
for($i=0; $i<$length; ++$i)
{
	$element = $elements->item($i);
	// Decide what to print based on the element's tag name.
	switch($element->tagName)
	{
	case 'img':
		echo "An image, src = ".$element->getAttribute('src')."\n";
		break;
	case 'p':
		echo "A paragraph\n";
		break;
	case 'br':
		echo "Break\n";
		break;
	}
}
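
Note that getElementsByTagName only returns elements, so text that sits between tags never shows up in that list. One possible way to see it is to walk childNodes recursively and check each node's type. The following is only a sketch: listUnits() and the sample markup are hypothetical, and the labels follow the LINK/IMAGE/TEXT wording used earlier in the thread.

```php
<?php
// Hypothetical helper: walks the tree in document order and collects one
// label per unit, including text that is not wrapped in its own tag.
function listUnits(DOMNode $node, array &$units)
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text = trim($child->nodeValue);
            if ($text !== '') {                 // skip whitespace-only text
                $units[] = 'TEXT: ' . $text;
            }
            continue;
        }
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;                           // comments and so on
        }
        $tag = $child->tagName;
        if ($tag === 'a') {
            $units[] = 'LINK: ' . trim($child->textContent)
                     . ' (' . $child->getAttribute('href') . ')';
            continue;                           // link text is already reported
        } elseif ($tag === 'img') {
            $units[] = 'IMAGE: ' . $child->getAttribute('src');
        } elseif ($tag === 'br') {
            $units[] = 'BREAK';
        } elseif ($tag === 'p') {
            $units[] = 'PARAGRAPH';
        } else {
            $units[] = 'UNKNOWN UNIT: ' . $tag;
        }
        listUnits($child, $units);              // descend into the element
    }
}

$doc = new DOMDocument();
// Made-up sample markup; loadHTML wraps it in <html><body> automatically.
$doc->loadHTML('<p>Hello <a href="http://google.com">GOOGLE</a><br>world</p>');

$units = array();
listUnits($doc->getElementsByTagName('body')->item(0), $units);
echo implode("\n", $units), "\n";
```

Starting the walk at the body element keeps the html and body wrappers out of the unit list.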

GoRide_ · Apr 19, 2009

thank you very much!!!! you have made my day

GoRide_ · Apr 19, 2009

Uhh, I forgot something!
How do I recognize plain text with no tags?

GoRide_ · Apr 19, 2009

It's going well for me...
Is it possible, when I select some tag, to get all the HTML code inside that tag?

For example, if I call getElementsByTagName("form"), can I get the code between <form> and </form>? All the inputs...

Weedpacket · Apr 20, 2009

See the user notes on the DOMElement manual page.
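
A typical approach of that kind is to serialise each child of the element and join the results. The sketch below assumes a PHP version where saveHTML() accepts a node argument (5.3.6 or later); on older versions saveXML($child) is the usual substitute. The innerHTML() name and the sample form are made up for illustration.

```php
<?php
// Hypothetical helper: serialises every child of $element, which yields the
// markup between its opening and closing tags.
function innerHTML(DOMElement $element)
{
    $html = '';
    foreach ($element->childNodes as $child) {
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
// Made-up sample form standing in for the real page.
$doc->loadHTML('<form action="save.php"><input type="text" name="title">'
             . '<input type="submit" value="Save"></form>');

$form  = $doc->getElementsByTagName('form')->item(0);
$inner = innerHTML($form);
echo $inner, "\n";
```

The result contains the two input elements but not the form tag itself.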

GoRide_ · Apr 20, 2009

OK, OK, here we go again... I have been trying to do this for 4 hours, and I don't know how.
I have this code:

<a href="http://google.com"><font size="5">google link</font></a>

I made it so that you can change the href, link title, font size... but when there is only

<a href="http://google.com">google</a>

how can I add <font>, and put the current link title between the font tags? I did that, but the whole link title is recognized as one node, and I need to separate it so it can be changed later...
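
One way this could be approached with DOM, as a sketch only: create the font element, move the link's existing children into it, then append it to the link. The size value and the sample link are assumptions for illustration.

```php
<?php
$doc = new DOMDocument();
// Made-up sample link with no <font> element inside it.
$doc->loadHTML('<a href="http://google.com">google</a>');
$link = $doc->getElementsByTagName('a')->item(0);

// Only wrap when the link does not already contain a <font> element.
if ($link->getElementsByTagName('font')->length === 0) {
    $font = $doc->createElement('font');
    $font->setAttribute('size', '5');      // assumed default size
    // appendChild moves a node, so this empties the link one child at a time.
    while ($link->firstChild) {
        $font->appendChild($link->firstChild);
    }
    $link->appendChild($font);
}

$result = $doc->saveHTML($link);
echo $result, "\n";
```

After this the link title is still its own text node, now a child of the font element, so it can be read or changed later through $font->firstChild->nodeValue.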

GoRide_ · May 5, 2009
