What I want to do is limit text input to my form to basic Plain Text. This IS working code, but I'm wondering whether my function is redundant or not useful for my purposes.
I want to eliminate any tags, hex digits, etc. and ensure as best I can that I simply get printable, expected results, since it's rather difficult to imagine the various ways it may fail during an attack, cross-site or otherwise. I fully understand there is no such thing as 100% security for forms, but I'd like to at least have some decent security in place. In particular, the function doesn't seem to inhibit hex numbers.

The function is:

function check_input($data)
{
    $data = trim($data);
    $data = stripslashes($data);
    $data = strip_tags($data);
    $data = htmlspecialchars($data);
    $data = htmlentities($data);  // In the case of a foreign language NOT English!
    return $data;
}

Any comments, advice or critiques appreciated; I've no problem with being shown that I'm wrong 🙂. Am I? lol

TIA,

Rivet`
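
On the "is it redundant?" question, a quick way to find out is to run the function on a sample string. The sketch below reproduces the function from the post unchanged; the sample string is only an assumption for illustration. Because htmlentities runs on text that htmlspecialchars has already escaped, the ampersand appears to be encoded twice.

```php
<?php
// The pipeline from the post, reproduced unchanged so its output can be inspected.
function check_input($data)
{
    $data = trim($data);
    $data = stripslashes($data);
    $data = strip_tags($data);
    $data = htmlspecialchars($data);
    $data = htmlentities($data);
    return $data;
}

// Hypothetical sample input, chosen because it contains an ampersand.
$sample = ' Tom & Jerry ';

// Both escaping calls run, so the "&" is escaped and then escaped again.
$twice = check_input($sample);

// For comparison: a single escape of the trimmed text.
$once = htmlspecialchars(trim($sample), ENT_QUOTES, 'UTF-8');

echo $twice, "\n"; // Tom &amp;amp; Jerry
echo $once, "\n";  // Tom &amp; Jerry
```

If that reading is right, one of the two escaping calls is enough, and it probably belongs at the point where the text is printed rather than where it is received.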

    Rivet wrote:

    What I want to do is limit text input to my form to basic Plain Text.

    Define "basic Plain Text". For example, maybe to you that means "alphanumeric characters and whitespace". Then you can write a regex pattern to match that. This way, you can choose to either check that the input validates according to your regex pattern, e.g., by using [man]preg_match[/man] and checking that the pattern is matched from the start to the end. If not, you get the user to change the input. Or, you can remove parts of the input that do not match the pattern, e.g., by negating the pattern and using [man]preg_replace[/man] to replace with an empty string.

    Rivet wrote:

    I want to eliminate any tags, hex digits, etc. and insure the best I can that I simply get printable, expected results as it's rather difficult to imagine the various ways it may fail during an attack or cross-site, whatever.

    Yeah, that's why I suggest that you define "basic Plain Text", i.e., come up with a whitelist of what you want to accept rather than trying to come up with a blacklist of what you want to reject.

    You should still use htmlspecialchars or htmlentities when printing the text to some HTML page though, just like how you should still prevent SQL injection if you are going to store the text in a relational database. This ensures that even if the whitelist changes, your code will remain secure.
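
A minimal sketch of the whitelist idea, assuming "basic Plain Text" means ASCII letters, digits and whitespace. That definition and the function names is_plain_text, strip_to_plain_text and escape_for_html are assumptions for illustration, not anything from the original post.

```php
<?php
// Validation: true only if the WHOLE input matches the whitelist.
// \A and \z anchor the pattern to the very start and end of the string.
function is_plain_text($input)
{
    return preg_match('/\A[A-Za-z0-9\s]+\z/', $input) === 1;
}

// Filtering: negate the character class and replace everything else with ''.
function strip_to_plain_text($input)
{
    return preg_replace('/[^A-Za-z0-9\s]+/', '', $input);
}

// Output escaping: still applied when printing, whatever the whitelist allows.
function escape_for_html($text)
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

var_dump(is_plain_text('Hello world 42'));    // bool(true)
var_dump(is_plain_text('<b>Hi</b>'));         // bool(false)
echo strip_to_plain_text('<b>Hi</b> there!'), "\n"; // bHib there
echo escape_for_html('a < b'), "\n";          // a &lt; b
```

Note how stripping turns the tagged input into "bHib there": removing characters can mangle the message, which is one reason to prefer rejecting the input and asking the user to fix it.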

      laserlight;11032855 wrote:

      Define "basic Plain Text". For example, maybe to you that means "alphanumeric characters and whitespace". Then you can write a regex pattern to match that. This way, you can choose to either check that the input validates according to your regex pattern, e.g., by using [man]preg_match[/man] and checking that the pattern is matched from the start to the end. If not, you get the user to change the input. Or, you can remove parts of the input that do not match the pattern, e.g., by negating the pattern and using [man]preg_replace[/man] to replace with an empty string.

      Yeah, that's why I suggest that you define "basic Plain Text", i.e., come up with a whitelist of what you want to accept rather than trying to come up with a blacklist of what you want to reject.

      You should still use htmlspecialchars or htmlentities when printing the text to some HTML page though, just like how you should still prevent SQL injection if you are going to store the text in a relational database. This ensures that even if the whitelist changes, your code will remain secure.

      Well, by Plain Text, I mean it in the sense of plain text vs. HTML or rich text for e-mails: letters, digits, punctuation, dashes, underscores, & probably a few things that haven't yet occurred to me. In other cases, like text boxes, it would be whatever ctype_alnum and ctype_digit accept, plus a dash in unpredictable locations.
      I tried regex methods at first but it quickly got so unwieldy I tried ctype_ ... and had some success but not enough. So far in my testing it's working but there's no way I'm good enough at testing to trust myself much; too much of a newbie in several areas of PHP. So perhaps regex is going to be the way to go, at least partially here.

      Thanks much,

      Rivet`

        I think the important question here is what is this form for? Or more specifically, this text box or text field? Are you echoing the result back to the user? Storing it in a database? Emailing it? Until we know this, the question is too open-ended.

          So you're saying if I chose a password of "H3||() W<>r1d¡", you would reject that because it's not "plain text" ?

          (In other words, I too am struggling to find any legitimate use of what it is you're describing.)

            Just to pick another nit: you are not checking the input, as implied by the name of the function, but changing the input, or you might say "filtering" it. Which raises the question, have you considered the [man]filter_var[/man] function along with either validate filters or sanitize filters?
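
To make the check-versus-filter distinction concrete, here is a small sketch using filter_var with built-in filter constants. The sample values are assumptions for illustration only. A validate filter returns the value (or false) without altering it; a sanitize filter returns a modified copy.

```php
<?php
// Validate filters: the result is the accepted value, or false on failure.
$goodEmail = filter_var('user@example.com', FILTER_VALIDATE_EMAIL);
$badEmail  = filter_var('not an email', FILTER_VALIDATE_EMAIL);
$goodInt   = filter_var('42', FILTER_VALIDATE_INT);     // converted to int
$badInt    = filter_var('42abc', FILTER_VALIDATE_INT);

// Sanitize filter: characters not permitted in an e-mail address are removed,
// so the result can differ from what the user typed.
$sanitized = filter_var('us er@exa(mple).com', FILTER_SANITIZE_EMAIL);

var_dump($goodEmail); // string(16) "user@example.com"
var_dump($badEmail);  // bool(false)
var_dump($goodInt);   // int(42)
var_dump($badInt);    // bool(false)
var_dump($sanitized); // string(16) "user@example.com"
```

Compare the results with === false rather than a loose check, since a valid integer 0 would otherwise look like a failure.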

              Bonesnap;11032895 wrote:

              I think the important question here is what is this form for? Or more specifically, this text box or text field? Are you echoing the result back to the user? Storing it in a database? Emailing it? Until we know this, the question is too open-ended.

              Well, there are about 3 or 4 total forms that will be created from this one. This particular form is as much for learning & experience as it is for the actual current goal of:

              1. A General form which asks for a randomly generated "code" which contains dashes to be entered, first, last names, a valid email, whether the user wants a response or not, and a textarea for their message which will be in this case limited to about 200 words. As such, there is no need for anything more than standard punctuation (periods, comma, semi-colon, colon and apostrophe, parens, say).
                So nothing more than conversational English is the target and any accented letters are also not to be allowed (foreign languages).
              2. I will show the collected information and the Message, after cleanup, to the user so they can see what they entered before Submitting in the event my filtering ruined their message or they simply decided not to send it.
              3. So far there are no includes or require_once type things; I'll do that after I'm as satisfied as I can get about protection against miscreants and bots.
              4. I use SESSIONS where I've moved far enough that $_POST won't work and unset them as soon as I no longer need them. There's more but I think I've gone well beyond the original question you asked.
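
The requirements in the list above could be sketched roughly as follows. The function names is_valid_message and is_valid_code, the exact punctuation set, and the rule that dashes may only sit between alphanumeric groups are all assumptions made for illustration; only the 200-word limit and the "conversational English" target come from the list.

```php
<?php
// Message: ASCII letters, digits, whitespace and . , ; : ' ( ) only,
// with at most $maxWords whitespace-separated words.
function is_valid_message($message, $maxWords = 200)
{
    $message = trim($message);
    if ($message === '') {
        return false;
    }
    if (preg_match('/\A[A-Za-z0-9\s.,;:\'()]+\z/', $message) !== 1) {
        return false;
    }
    $words = preg_split('/\s+/', $message, -1, PREG_SPLIT_NO_EMPTY);
    return count($words) <= $maxWords;
}

// Code: alphanumeric groups separated by single dashes (assumed format).
function is_valid_code($code)
{
    return preg_match('/\A[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\z/', $code) === 1;
}

var_dump(is_valid_message("Hello, world. It's fine (really).")); // bool(true)
var_dump(is_valid_message('<script>alert(1)</script>'));         // bool(false)
var_dump(is_valid_code('AB12-CD34-EF'));                         // bool(true)
var_dump(is_valid_code('AB12--CD'));                             // bool(false)
```

Both functions only report whether the input is acceptable; they never change it, so the confirmation page can show exactly what the user typed (escaped for HTML).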

              I'm hoping to re-use modified pieces of these particular scripts for other, more technical uses such as reporting bad links, webmaster & abuse forms and a couple of others, so I realize not everything I'm doing here will apply to those forms.

              If this doesn't cover what you need to know, just ask; I'll respond promptly as I can. The only limitation I have on what I can say is I can't provide complete scripts nor certain portions of the overall scripts.

              I'm on win 7 Home Premium, XAMPP, PHP 5.3.x on both local and remote servers.

              Thanks for asking for clarification; hope I've answered what you need.

              Regards & thanks,

              Rivet`

                No, I'm not saying that. This is not being used for passwords. More generally, YES, I would reject that in a text box or textarea.

                  NogDog;11032901 wrote:

                  Just to pick another nit: you are not checking the input, as implied by the name of the function, but changing the input, or you might say "filtering" it. Which raises the question, have you considered the [man]filter_var[/man] function along with either validate filters or sanitize filters?

                  Go ahead and nitpick; that's fine by me as long as it's relevant 🙂,

                  I have considered, and tried, filter_var. In fact, I use it for e-mail, will use it for URLs and with FILTER_VALIDATE_INT, and I also use the ctype_... functions. But either they don't always cover what I want, or I can't get them to do what I want.
                  As for Sanitizing, no, I haven't used it much. I'm not interested much in correcting typos for instance; If someone can't spell in Plain Text without using tags et al, that's NOT typos; It's a "harm-attempt" of some sort. There is one in particular I'm trying to get to work though, and that's "Remove all characters except digits, +- and optionally .,eE. ".

                  Thanks for asking; I'm always open to suggestions and good advice!

                  Regards,

                  Rivet`

                    You might wanna check out [man]chr[/man] and [man]ord[/man] ... they come in useful sometimes in low-level character hacking....

                    Seems like there are some functions there in the user notes, even...
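
As a rough idea of what that low-level approach looks like, here is a sketch that uses ord to find bytes outside the printable ASCII range (32 to 126). The function name non_printable_positions and the choice of range are assumptions for illustration; note that this range also flags newlines and tabs.

```php
<?php
// Returns the byte offsets whose value falls outside printable ASCII (32..126).
function non_printable_positions($text)
{
    $positions = array();
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $code = ord($text[$i]); // numeric value of the byte at offset $i
        if ($code < 32 || $code > 126) {
            $positions[] = $i;
        }
    }
    return $positions;
}

var_dump(ord('A'));  // int(65)
var_dump(chr(65));   // string(1) "A"

// A NUL byte hidden in the middle of the string is reported at offset 1.
print_r(non_printable_positions("a\x00b"));

// A UTF-8 "é" is two bytes (0xC3 0xA9), so both offsets are reported.
print_r(non_printable_positions("\xC3\xA9"));
```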

                      dalecosp;11032997 wrote:

                      You might wanna check out [man]chr[/man] and [man]ord[/man] ... they come in useful sometimes in low-level character hacking....

                      Seems like there are some functions there in the user notes, even...

                      I'm peripherally familiar with them but hadn't thought about them recently. Thanks for the suggestion.

                      Rivet`

                        Someone might regard "Parece haver um – sapo – na minha bidé." to be "plain text"...

                          Weedpacket;11033019 wrote:

                          Someone might regard "Parece haver um – sapo – na minha bidé." to be "plain text"...

                          If they can't read "English Only" then it's not my problem if it becomes mangled. Their message would be meaningless to me and thus of no use so even if it got thru, it'd get dropped on the floor as soon as I saw it.

                          A bit nit-picky, don't you think? Or was that meant to be rhetorical?

                            "Remove all characters except digits, +- and optionally .,eE. "

                            $filteredText = preg_replace('#[^0-9\.,eE +-]+#', '', $inputText);
                            

                            (You don't actually need to escape the "." within a character class like that. The "-" does need escaping unless it is the first or last character in the class; otherwise it is read as a range.)
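
The quoted description matches the PHP manual's wording for FILTER_SANITIZE_NUMBER_FLOAT, so a hedged comparison of the two approaches might look like this. The sample input is an assumption, and the regex here drops the space from the character class so that both approaches keep the same set of characters.

```php
<?php
// Hypothetical input mixing a number with surrounding text.
$inputText = 'Total: 1,234.5e+3 USD';

// Regex version: "." needs no escaping inside a character class,
// and "-" is literal because it is the last character in the class.
$viaRegex = preg_replace('#[^0-9.,eE+-]+#', '', $inputText);

// filter_var version: digits and +- are always kept; the flags
// additionally allow "." (fraction), "," (thousand) and "eE" (scientific).
$viaFilter = filter_var(
    $inputText,
    FILTER_SANITIZE_NUMBER_FLOAT,
    FILTER_FLAG_ALLOW_FRACTION | FILTER_FLAG_ALLOW_THOUSAND | FILTER_FLAG_ALLOW_SCIENTIFIC
);

echo $viaRegex, "\n";  // 1,234.5e+3
echo $viaFilter, "\n"; // 1,234.5e+3
```

Neither version checks that what remains is a well-formed number; it only removes the other characters, so a validate step would still be needed afterwards.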

                              Rivet;11033029 wrote:

                              If they can't read "English Only" then it's not my problem if it becomes mangled.

                              What if I'm trying to talk to you about the résumé I've prepared for a job application? Or I'd like to get your opinion on how piña coladas taste? Or what if I'm writing on behalf of my über annoyed fiancée in complaint about the naïveté of your form's input-mangler? 😉

                                Parece haver um – sapo – na minha bidé

                                There are also a couple of en-dashes in there – not to be confused with hyphens…. Would those be sufficient reason to discard it?

                                  Weedpacket;11033019 wrote:

                                  Someone might regard "Parece haver um – sapo – na minha bidé." to be "plain text"...

                                  I suppose the same could be said for "若かった時の写真その２。高校の修学旅行。場所不明" as well, but I don't think he's too worried about those fringe cases.

                                    Weedpacket;11033041 wrote:

                                    There are also a couple of en-dashes in there – not to be confused with hyphens…. Would those be sufficient reason to discard it?

                                    IMHO, yes, they're sufficient reason to discard, given the purpose and use of the form plus the instructions within it: people can write in North American English if they want to be heard or read, as the case may be. Characters with decorations just aren't needed to make one's point known, such as for "resume" writing.

                                    I take your input as food for thought, even though it doesn't change my opinions. Right now at least 😃 - one never knows, right? Besides, it's something I hadn't thought of for my 'list' of characters used or not used.

                                    Cheers,

                                    Rivet`

                                      There is nothing even remotely similar to résumé, resumé, or the also misspelled 'resume' on the site. And if there were, the form 'meat' would be quite different to start with. Not relevant, really.