ansi escape sequence - regex

math · Oct 21, 2014

Hi,

I hate regex and I'm a noob with it...
In a websock-telnet app I receive data from the client.
The data can contain ANSI color escape sequences, BUT the escape data does not always arrive complete. Sometimes the escape sequence is truncated.

For example:
First I receive "\e[" and only afterwards I receive "0;37m".

I need a regex that can handle the case where the end of the received data contains an incomplete ANSI escape sequence.

The available sequences that I found are: http://bluesock.org/~willg/dev/ansi.html#sequences

My try:

if ( preg_match("/^[0-9;\e[]+$/", substr($data, -1) ) == 1 )
{
   /* If the last $data char is a number,
       or a ";",
       or a "\e",
       or a "[",
       then the end of the data may contain a truncated ANSI color sequence.
  */
}

Can someone help me to build a better regex to handle chunked escape data? (with comments if possible).

Thank you in advance
math

johanafm · Oct 29, 2014

First off, a lot of escape sequences, such as ESC [m (formatting), are specified as having a set of characters with specific meaning, while it is explicitly stated that "any other input should be ignored". At least for VT-100. As such, you might receive ESC [ =# - 0;1;5m, which should be interpreted as
ESC [0;1;5m
It is also specified that ESC [m can carry multiple commands at once, as in ESC [0;1;5m. And while it makes no difference compared to the previous example, ESC[7;8;0;0;01;9;0005;1;05m would achieve the same thing: go back to normal format mode, then activate bold + blink (or whatever 5 means). And yes, leading zeros are allowed but carry no meaning.
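
To make the "ignore everything else" rule concrete, here is a minimal sketch in PHP. The function name sgrParameters() is made up for illustration, and treating an empty parameter as 0 is my assumption rather than something stated in this thread.

```php
<?php
// Hypothetical helper (not from the thread): extract the numeric parameters
// of an "ESC [ ... m" sequence, ignoring any character that is not a digit
// or ";" as described above.
function sgrParameters(string $sequence): array
{
    // Only look at sequences of the form ESC [ <anything> m.
    if (preg_match('/^\x1b\[(.*)m$/s', $sequence, $match) !== 1) {
        return [];
    }
    // Drop everything except digits and the ";" separator.
    $clean = preg_replace('/[^0-9;]/', '', $match[1]);
    // intval() discards leading zeros; an empty parameter becomes 0
    // (assumption: empty means "reset", as in ESC [m).
    return array_map('intval', explode(';', $clean));
}
```

With this sketch, "\e[ =# - 0;1;5m" and "\e[0;01;0005m" both come out as the parameter list 0, 1, 5.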

So, trying to match against only the allowed characters, byte by byte from the end, will not work. The one exception is if the buffer ends with ESC [ itself, which you can check immediately. But because all other cases still require you to find the last occurrence of ESC [ and then consume characters until you find a terminating character, such as 'm', you might as well start with strrpos($buffer, chr(033)) to get the starting index "$lastIndex" of the last escape sequence in the buffer (strrpos() returns false if the buffer contains no escape character at all). You may then take the substring from $lastIndex and try to find a complete escape sequence in it. If you find no such sequence, your sequence is incomplete and you will have to wait for more data. If you find a complete sequence, you know that it is the last possible sequence in the current buffer and can immediately output the entire buffer.
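
The strrpos() approach could be sketched roughly like this. The function name splitAtIncompleteEscape() is hypothetical, and the sketch assumes that a sequence is complete as soon as a letter follows ESC [ (other kinds of escape sequences are simply passed through).

```php
<?php
// Hypothetical helper (not from the thread): returns [outputNow, keepForLater].
// keepForLater holds a trailing, still incomplete "ESC [ ..." sequence.
function splitAtIncompleteEscape(string $buffer): array
{
    // Find the last escape character; note the order (haystack, needle).
    $lastIndex = strrpos($buffer, chr(033));
    if ($lastIndex === false) {
        return [$buffer, ''];              // no escape character at all
    }
    $tail = substr($buffer, $lastIndex);
    // ESC followed by something other than "[" is out of scope for this sketch.
    if (strlen($tail) > 1 && $tail[1] !== '[') {
        return [$buffer, ''];
    }
    // Assumption: ESC [ followed by non-letters and then one letter is complete.
    if (preg_match('/^\x1b\[[^A-Za-z]*[A-Za-z]/', $tail) === 1) {
        return [$buffer, ''];              // last sequence is complete
    }
    // Incomplete: output everything before it, keep the tail for the next chunk.
    return [substr($buffer, 0, $lastIndex), $tail];
}
```

On the next chunk you would prepend the kept tail to the newly received data and call the function again.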

Considering that you had the characters \e and [ in your character class above, you should be aware that "ESC" is just a human-readable way to write the escape character (octal 033), while \e is the PCRE escape sequence for that same character. Thus, in "ESC [", \e is the escape character, and [ is just a [.
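
A quick way to convince yourself of this (the variable names are just for illustration):

```php
<?php
// In a double-quoted PHP string (PHP 5.4 and later), "\e" is already the
// single escape byte, the same as chr(033).
$sameByte = ("\e" === chr(033));

// In a single-quoted string PHP leaves \e alone, and PCRE interprets it
// as the escape character. The "[" has to be escaped outside a character
// class because it would otherwise open one.
$matches = preg_match('/\e\[/', chr(033) . '[0;37m');

var_dump($sameByte, $matches);
```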

math · Oct 29, 2014

johanafm;11043615 wrote:
First off, a lot of escape sequences, such as ESC [m (formatting) are specified as having a set of characters with specific meaning, while it is explicitly stated that "any other input should be ignored". At least for VT-100. As such, you might receive ESC [ =# - 0;1;5m which should be interpreted as
ESC [0;1;5m

Thank you for the reply, I'll try again now.

bye
