I'd like to match a string that may be quoted with either single or double quotes, and that may also contain the other type of quote symbol.

E.g. "weren't", 'the "man" said', "plain", 'simple'

I thought this fragment of a PCRE pattern would do it:

(['\"]?)([^ >]*?)\1

where the first parenthesised expression captures the quote symbol used to open the string, and the \1 back reference requires that the string ends with that same symbol.

But it doesn't work. It returns no matches.

Am I using the back reference incorrectly?
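For reference, here is a minimal sketch of how the pattern behaves once the back reference actually reaches the regex engine. It uses Python's `re` module as a stand-in for PCRE, which is an assumption (the thread itself concerns PHP), and `first_match` is a hypothetical helper name.

```python
import re

# The pattern from the question, written as a raw string so that \1
# reaches the regex engine as a back reference.
PATTERN = re.compile(r"""(['"]?)([^ >]*?)\1""")

def first_match(text):
    """Return (quote, body) for the first match found in text."""
    m = PATTERN.search(text)
    return m.group(1), m.group(2)

print(first_match('"weren\'t"'))          # ('"', "weren't")
print(first_match("'simple'"))            # ("'", 'simple')
print(first_match("'the \"man\" said'"))  # ('', '')
```

The last case comes back empty because `[^ >]` cannot cross the space, so the engine falls back to an empty quote, an empty body, and an empty back reference.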

    Replace smart double quotes with straight double quotes.
    Unicode version for use with Unicode regex engines.

    [\u201C\u201D\u201E\u201F\u2033\u2036]

    Replace smart double quotes with straight double quotes.
    ANSI version for use with 8-bit regex engines and the Windows code page 1252.

    [\x84\x93\x94]

    Replace smart single quotes and apostrophes with straight single quotes.
    ANSI version for use with 8-bit regex engines and the Windows code page 1252.

    [\x82\x91\x92]

    Replace smart single quotes and apostrophes with straight single quotes.
    Unicode version for use with Unicode regex engines.

    [\u2018\u2019\u201A\u201B\u2032\u2035]

    Replace straight apostrophes with smart apostrophes.

    \b'\b

    Replace straight double quotes with smart double quotes.
    Unicode version for use with Unicode regex engines.

    \B"\b([^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+)\b"\B

    Replace straight double quotes with smart double quotes.
    ANSI version for use with 8-bit regex engines and the Windows code page 1252.

    \B"\b([^"\x84\x93\x94\r\n]+)\b"\B

    Replace straight single quotes with smart single quotes.
    ANSI version for use with 8-bit regex engines and the Windows code page 1252.

    \B'\b([^'\x82\x91\x92\r\n]+)\b'\B

    Replace straight single quotes with smart single quotes.
    Unicode version for use with Unicode regex engines.

    \B'\b([^'\u2018\u2019\u201A\u201B\u2032\u2035\r\n]+)\b'\B

    Hope this helps.
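As a rough illustration of the smart-to-straight replacements listed above, the two Unicode character classes could be applied like this. The sketch uses Python's `re` module, and `straighten_quotes` is a hypothetical helper name, not something from the snippets.

```python
import re

# Character classes copied from the Unicode snippets above.
SMART_DOUBLE = re.compile("[\u201C\u201D\u201E\u201F\u2033\u2036]")
SMART_SINGLE = re.compile("[\u2018\u2019\u201A\u201B\u2032\u2035]")

def straighten_quotes(text):
    """Replace smart quotes and apostrophes with their straight forms."""
    text = SMART_DOUBLE.sub('"', text)
    return SMART_SINGLE.sub("'", text)

print(straighten_quotes("\u201cweren\u2019t\u201d"))  # "weren't"
```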

      I'm grateful for the help, but I don't have the experience to understand what you're saying.

      Are you telling me to alter the text that will be searched? I cannot modify the input text, only the regex pattern. And if it's the pattern you're telling me to modify, I've never seen anything like the above appear in a regular expression before.

        Sorry, I think there was some character replacement after I posted the snippets as code.

          I finally found out why the back reference wasn't working.

          PHP requires a double backslash for back references when the pattern is written in a double-quoted string. So \1 in a pattern should be written as \\1 in PHP code.
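The same pitfall can be reproduced in Python, used here only as an analogue (an assumption; the thread is about PHP). In an ordinary string literal, `\1` is read as an octal escape before the regex engine sees it, much as in a PHP double-quoted string. The simplified pattern and the variable names below are illustrative.

```python
import re

text = "'simple'"

# Ordinary literal: "\1" becomes the control character U+0001 before the
# regex engine sees it, so this pattern contains no back reference at all.
broken = "(['\"])(.*?)\1"

# Doubling the backslash keeps a literal backslash followed by 1, which
# the regex engine then reads as a back reference to group 1.
fixed = "(['\"])(.*?)\\1"

print(len("\1"))                        # 1
print(re.search(broken, text))          # None
print(re.search(fixed, text).group(2))  # simple
```

In Python a raw string (`r"...\1"`) would have the same effect as doubling the backslash.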

            And I just got the regex I think I need, using a conditional expression.

            /(['\"])?([^ >]*?)(?(1)\1|(?:>|\/>|\s))/si

            This looks horrendous, but it's the simplest thing I've come up with. The first parentheses contain a possible match for either an apostrophe or a double-quote symbol. The next parentheses contain the meat that we're interested in, and the matching continues non-greedily until a space or a right-bracket is encountered. The third parentheses contain the rather ugly conditional expression:

            (?(1)\1|(?:>|\/>|\s))

            which checks whether or not the first parentheses matched anything (i.e. whether an apostrophe or double-quote was found). If so, then the same symbol must be matched, using the back reference \1. If not, then a right-bracket, a slash followed by a right-bracket, or whitespace must be matched to cause success. (The right-bracket symbols are needed in my case because I am using HTML elements as input.)

            This way, a quoted or unquoted string can be matched so that it must end with the same quote-symbol (if any) it began with, and the string contained in those quote symbols can contain the other type of symbol without problem.
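The conditional pattern can be tried out with a short sketch. Python's `re` module is used as a stand-in for PCRE (an assumption), since it supports the same `(?(1)yes|no)` syntax. `attribute_value` is a hypothetical helper, and the inputs are made-up attribute fragments.

```python
import re

# Same structure as the pattern above: optional opening quote, lazy body,
# then either the same quote (if one was opened) or a terminator.
VALUE = re.compile(r"""(['"])?([^ >]*?)(?(1)\1|(?:>|/>|\s))""", re.S | re.I)

def attribute_value(text):
    """Return the body captured by group 2, or None if nothing matches."""
    m = VALUE.search(text)
    return m.group(2) if m else None

print(attribute_value('"weren\'t">'))   # weren't
print(attribute_value("'say\"hi\"'>"))  # say"hi"
print(attribute_value("plain>"))        # plain
print(attribute_value("simple/>"))      # simple
```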
