Hi

I have two questions:

  1. How do you remove special (non-standard) characters from a string?
  2. How do you remove all HTML from a string?

For example, let's say you have the following string:

$str = "Check this out <a href=&#65533;http://www.somewebsite.com&#65533;>Somewebsite</a>, this is a great website";

How do you remove characters such as "&#65533;" from this string, as well as the HTML code?

Thank you

    1. Allowing only alphanumeric data:
      $output = preg_replace("/[^A-Za-z0-9]/","",$input);
      
    2. Stripping HTML and PHP tags: strip_tags()
      Remove tags first and then any remaining non-alphanumeric characters.

    EDIT:

    //this may be better and includes punctuation as well
    $output = preg_replace("/[^[:alnum:][:punct:]]/","",$input);
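A minimal sketch of the two steps together, assuming you also want to keep ordinary spaces (the pattern above removes them, because a space is neither alphanumeric nor punctuation). The sample string is made up for illustration. Full stops survive because they belong to [:punct:].

```php
<?php
// Hypothetical input modelled on the string from the question.
$input = 'Check this out <a href="http://www.somewebsite.com">Somewebsite</a>, this is a great website.';

// Step 1: drop the HTML/PHP tags but keep the text between them.
$text = strip_tags($input);

// Step 2: drop everything except letters, digits, punctuation and spaces.
// The trailing space inside the character class is an addition to the
// pattern above, so that words stay separated.
$output = preg_replace('/[^[:alnum:][:punct:] ]/', '', $text);

echo $output, "\n"; // Check this out Somewebsite, this is a great website.
```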
    

      Depending on exactly what you want to do, you might also want to look into htmlentities() and htmlspecialchars().

        Thanks for your reply, wilku. The output now seems to include ASCII numerals in place of the characters that were taken out, such as:

        "60 a href http www somewebsite com 62 somewebsite 60 a 62"

        Is there a way to remove these?

        Furthermore, how do I prevent it from removing full stops?

        Thanks again.

          I would just do:

          echo htmlentities(strip_tags($str), ENT_QUOTES);
          
            NogDog wrote:

            I would just do:

            echo htmlentities(strip_tags($str), ENT_QUOTES);
            

            Thanks for this. However, now the output resembles this:

            " & # 60;a href=&#65533;http://www.somewebsite.com&#65533;& # 62;Somewebsite&# 60;/a &# 62"

            (I've added spaces otherwise the characters here get modified by this forum script)

            What I would like is for it to take out all these types of characters (such as "&#60" as well as HTML references) and keep the output clean. Any suggestions?

              What I would like is for it to take out all these types of characters (such as "&#60" as well as HTML references) and keep the output clean. Any suggestions?

              What characters do you want to allow? Remove everything else.

              Actually, why do you want to do this? Removing characters can be harmful to the data.

                laserlight wrote:

                What characters do you want to allow? Remove everything else.

                I would like to remove non-standard characters such as "& # 60;a" and "&#65533;"

                Actually, why do you want to do this? Removing characters can be harmful to the data.

                I am removing these characters only to generate validated XML output (I do not manipulate the data in the database in which the data is stored).

                  I would like to remove non-standard characters such as "& # 60;a" and "&#65533;"

                  You did not answer my question 😉
                  I asked you what you wanted to keep, not what you wanted to remove.

                  I am removing these characters only to generate validated XML output (I do not manipulate the data in the database in which the data is stored).

                  It sounds like you do not actually want to remove these characters. Functions like htmlspecialchars() and htmlentities() should be what you want since they substitute the special characters with their escape sequences.
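One possible way to combine the suggestions so far, as a rough sketch rather than a definitive answer: decode any entities already in the text, strip the tags, drop the replacement character that &#65533; stands for, and only then escape for XML. The function name cleanForXml is made up for this example, and the type declarations assume PHP 7 or later. htmlspecialchars() is used here because XML only predefines the five entities it produces.

```php
<?php
// Hypothetical helper name; not part of PHP itself.
function cleanForXml(string $raw): string
{
    // Turn entities such as &#60; or &amp; back into real characters,
    // so that tags hidden behind entities are visible to strip_tags().
    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Remove the markup but keep the text inside it.
    $text = strip_tags($decoded);

    // Drop U+FFFD (bytes EF BF BD in UTF-8), the replacement
    // character that &#65533; refers to.
    $text = str_replace("\xEF\xBF\xBD", '', $text);

    // Escape the characters that XML treats as markup.
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

$str = 'Check this out &#60;a href=&#65533;http://www.somewebsite.com&#65533;&#62;Somewebsite&#60;/a&#62;, this is a great website';

echo cleanForXml($str), "\n"; // Check this out Somewebsite, this is a great website
```

Whether decoding first is right depends on how the text is stored. If the data never contains pre-encoded entities, that step can be left out.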

                    Perhaps your problems could be addressed via the use of CDATA tags in your XML?

                    $str = '<![CDATA[' . strip_tags($str) . ']]>';
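One caveat with the line above: a CDATA section ends at the first "]]>", so text that happens to contain that sequence would break the XML. A small sketch of the usual workaround, which splits the sequence across two adjacent sections (wrapInCdata is a made-up name, and the type declarations assume PHP 7 or later):

```php
<?php
// Hypothetical helper name; not part of PHP itself.
function wrapInCdata(string $text): string
{
    // Close the section after "]]" and reopen it before ">", so the
    // terminator never appears inside a single CDATA section.
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);

    return '<![CDATA[' . $safe . ']]>';
}

echo wrapInCdata('a ]]> b'), "\n"; // <![CDATA[a ]]]]><![CDATA[> b]]>
```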
                    

                    Also, make sure that the encoding attribute of your XML declaration matches the character encoding of the source text. For example, if the text comes from an input form on a web page that uses UTF-8, your resulting XML document should begin with:

                    <?xml version="1.0" encoding="UTF-8"?>
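If you are not sure the text really is UTF-8, a quick check before writing it under that declaration can help. This sketch relies on preg_match() returning false when the /u modifier is given a subject that is not valid UTF-8. The helper name isValidUtf8 is made up, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper name; not part of PHP itself.
function isValidUtf8(string $text): bool
{
    // An empty pattern always matches, so the only way to get anything
    // other than 1 is an invalid UTF-8 subject (preg_match returns false).
    return preg_match('//u', $text) === 1;
}

var_dump(isValidUtf8("caf\xC3\xA9")); // bool(true):  "é" encoded as UTF-8
var_dump(isValidUtf8("caf\xE9"));     // bool(false): "é" encoded as ISO-8859-1
```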
                    
                      3 months later

                      <?php echo htmlspecialchars($string); ?> produces valid XML output

                        Sorry to say, B1sh0p, but you were beaten to that suggestion by yours truly more than three months ago. Kindly do not resurrect old threads without good reason.
