trying to reverse-engineer the twitter tweet-length calculator

sneakyimp · Jan 24, 2014

I'm making a personal utility so that I can line up tweets in advance and have them automatically tweeted for me. I think we have the algorithm discussion well in hand, but I'm in the process of building my tweet-input form and I want to be sure I don't try to create any tweets that are too long. To that end, I'm trying to extract the javascript from the twitter website that calculates the length of a tweet typed in a textarea. This is a bit trickier than it sounds in that the tweet length calculation automatically takes into account when a user types in a URL. All tweeted URLS are shortened to a url with a length of 22 chars and the textarea length counter takes this into account. Likewise, if you specify an image, the counter knows this and will factor in a twitter image url of length 22 plus one space for a total of 23 chars. The url-sniffing javascript they have recognizes a variety of url types like these:
https://example.com (starts with HTTPS)
http://example.com (starts with HTTP)
example.com (domain with no subdomain)
foo.bar.foobar.barfoo.example.com (arbitrary number of subdomains)

It does not recognize any domains or url specifications that begin with slashes or ftp://. Interestingly, even really short urls (like foo.com) are expanded to 22 chars.

I'd certainly appreciate any help in figuring out how to reverse-engineer their site's javascript. It is apparently minified and I can't seem to figure out how to search all the javascript for references to the id of the html elements that are in play. If anyone has any tips about how to halt JS when loading a page or on a particular event such that I can step through the code, that would be nice.

Alternatively, I'm thinking I could concoct a Javascript function that uses a regex split along urls and with the resulting array, calculate the length myself. Could use a little help concocting a regex in JS that locates urls that a) start the tweet, b) are in the middle of the tweet and c) end the tweet.

johanafm · Jan 24, 2014

sneakyimp;11037643 wrote:
Alternatively, I'm thinking I could concoct a Javascript function that uses a regex split along urls and with the resulting array, calculate the length myself. Could use a little help concocting a regex in JS that locates urls that a) start the tweet, b) are in the middle of the tweet and c) end the tweet.

I'm not familiar with Twitter, so I might be missing some complications. But from what I understand…

If it's a personal utility, intended for someone who knows how to type in a fully qualified url… you can make the recognition process that much easier. If it starts with http(s)://… it's a URL. All you have to do is always type that part (at least until you grow tired of it and wish to add more recognition power). Moreover, what you are describing for "images" and "urls" seems to be identical: 22 chars for http, 23 for https.

The easiest is probably
1. On some trigger, such as all keyups, or keyup + timer…
2. explode contents on whitespaces ("SPACE" and LF; tab?) into array
3. for each element, add its computed length:
starts-with "http(s)://" ? Math.max(string-length, 22) : string-length
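
A minimal sketch of the three steps above, in plain JavaScript. The function name `estimateTweetLength` and the constant `SHORT_URL_LENGTH` are made up for illustration, and it only recognizes tokens that start with http:// or https://, so it will not match everything Twitter's own counter treats as a URL.

```javascript
// Illustrative constant: the 22-character figure comes from the discussion above.
var SHORT_URL_LENGTH = 22;

// Hypothetical helper implementing the whitespace-split idea.
function estimateTweetLength(text) {
  // Start from the raw length so the whitespace between tokens is still counted.
  var total = text.length;
  // Step 2: split on runs of whitespace (space, tab, line feed).
  var tokens = text.split(/\s+/);
  for (var i = 0; i < tokens.length; i++) {
    var token = tokens[i];
    // Step 3: only tokens starting with http:// or https:// count as URLs.
    if (/^https?:\/\//i.test(token)) {
      // Replace the token's own length with the computed one.
      total += Math.max(token.length, SHORT_URL_LENGTH) - token.length;
    }
  }
  return total;
}
```

Since the opening post says long URLs are shortened as well, swapping the Math.max call for a fixed SHORT_URL_LENGTH may be closer to what Twitter does.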

May not be that efficient but it's a simple starting point. Could probably be improved rather easily by replacing 3 with
3. for each element replace string by object containing
- the replaced string
- start index
- end index
- computed length

Then if you move around inside the input / textarea, you find the element in question and only recalculate its values. And update indices for all following elements - and this can actually be done by adding a new property to each object: "length-modifier". As you traverse the array to check start / end positions to find out where you are, you also keep adding the value of length-modifier and use that to modify the start / end positions.

The only special case I see when using a purely whitespace-split approach is for urls that are not followed by whitespace, such as one that ends a sentence: http://example.com. But then you only need to inspect the last character to find this out.

sneakyimp;11037643 wrote:
I'd certainly appreciate any help in figuring out how to reverse-engineer their site's javascript. It is apparently minified and I can't seem to figure out how to search all the javascript for references to the id of the html elements that are in play. If anyone has any tips about how to halt JS when loading a page or on a particular event such that I can step through the code, that would be nice.

Actually it doesn't seem to be that hard, assuming I'm in the right place. I googled twitter or tweet or some such and found: https://twitter.com/intent/tweet. Even if that's not the page you mean, the same principles apply.

1. Inspect page source
2. locate the jQuery that does "stuff" to the input/textarea ($/jQuery('#status').each) and look at some of the surrounding code
3. Rclick and inspect element (I am using chrome, but your browser of choice should provide the same features, albeit perhaps differently than described here)
4. sources tab - expand the folders to find tfw/intents/tweetbox.js
5. semi-read / scan through the code until you find parts that seem to make some kind of sense. The interesting part seems to start around this.$textarea.bind("keyup",
which is later on followed by such function calls as getTextLength, getTweetLength, updateCounter.
6. Looking at getTweetLength which seems appropriate, you can also see on which object it is defined: twttr.txt.getTweetLength
7. Top right corner, "watch expressions", click the '+' and enter "twttr"
8. reload page
9. expand twttr, expand txt
10. double click getTweetLength to get the function definition and copy paste elsewhere for inspection
11. repeat for its function calls.
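
Once the object shows up in the watch list, the same source text can also be pulled out from the console, since calling toString() on a JavaScript function returns its source. A small sketch, where `getSource` is a hypothetical helper and `fakeTwttr` merely stands in for the real `twttr` object on the page:

```javascript
// Hypothetical helper: walks a dotted path on an object and returns the
// source text of the function found there, or null if it is not a function.
function getSource(root, path) {
  var parts = path.split(".");
  var current = root;
  for (var i = 0; i < parts.length; i++) {
    if (current == null) {
      return null;
    }
    current = current[parts[i]];
  }
  return typeof current === "function" ? current.toString() : null;
}

// Stand-in object so the sketch runs on its own; on the real page you would
// pass the global twttr object instead.
var fakeTwttr = { txt: { getTweetLength: function (text) { return text.length; } } };
var source = getSource(fakeTwttr, "txt.getTweetLength");
```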

If you want to inspect the code as it is running, you would need to insert line-breaks to get meaningful break points. Not sure if it is allowed or not. But if it is, copy the appropriate js file, add line breaks, add break-points and go.

sneakyimp;11037643 wrote:
Could use a little help concocting a regex in JS that locates urls that a) start the tweet, b) are in the middle of the tweet and c) end the tweet.

Unless I miss something, the differences are trivial. start-of-string, white-space, end-of-string.

/* reads as: no-capture: start or whitespace, http or https ://, everything up until whitespace or end
 */
/(?:^|\s)http(s)?:\/\/.+(?:$|\s)/

Just to be sure, check whether it should be .+ or .+? (after https://).
I don't remember if js regexp is greedy or not. Greedy would make it match everything until the end, non-greedy until the first occurrence of white-space or end-of-string. It should be non-greedy.
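
For what it's worth, JavaScript quantifiers are greedy by default, and adding ? makes them lazy. A quick sketch with a made-up sample string showing the difference; the third pattern, using \S+, is an alternative not mentioned above that stops at whitespace without needing a lazy quantifier:

```javascript
var sample = "see http://example.com/a and http://example.org/b now";

// Greedy: .+ runs to the end of the line, swallowing both URLs.
var greedy = sample.match(/https?:\/\/.+(?:$|\s)/)[0];

// Lazy: .+? stops at the first whitespace (which is included in the match).
var lazy = sample.match(/https?:\/\/.+?(?:$|\s)/)[0];

// \S+ matches only non-whitespace, so no trailing space is captured.
var nonSpace = sample.match(/https?:\/\/\S+/)[0];
```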

Also note that a url followed by punctuation most likely has to be treated differently. Look at a url ending a sentence, such as http://example.com. Using my regexp, the . would be part of the url, but the full stop doesn't belong to the url. Thus, check for trailing punctuation and remove it from the url.
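
A small sketch of that clean-up step. The function name and the exact set of punctuation characters are assumptions; Twitter's own rules for where a URL ends may well differ.

```javascript
// Hypothetical helper: drops sentence punctuation from the end of a matched URL.
// The character set here is a guess, not Twitter's actual rule.
function stripTrailingPunctuation(url) {
  return url.replace(/[.,;:!?]+$/, "");
}
```

Whatever gets stripped still has to be counted as ordinary characters when computing the tweet length.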

Derokorian · Jan 24, 2014

Here is a function I use on my blog to turn urls into links. Maybe it will help you do what you're after?

function linkify(str) {
    return str.replace(/((https?:\/\/)?([a-z0-9_-]+\.)+[a-z]{2,6}\/[^\s]*)/g, '<a href="$1">$1</a>');
}

sneakyimp · Jan 24, 2014

johanafm;11037647 wrote:
If it's a personal utility, intended for someone who knows how to type in a fully qualified url… you can make the recognition process that much easier. If it starts with http(s)://… it's a URL. All you have to do is always type that part (at least until you grow tired of it and wish to add more recognition power). Moreover, what you are describing for "images" and "urls" seems to be identical: 22 chars for http, 23 for https.

Despite it being a personal utility, it's very important for the tool to calculate tweet length identically to how twitter does it -- otherwise, I will risk problems if tweets fail for being too long. Your point that http gets 22 chars and https gets 23 chars is correct!

johanafm;11037647 wrote:
The easiest is probably
1. On some trigger, such as all keyups, or keyup + timer…
2. explode contents on whitespaces ("SPACE" and LF; tab?) into array
3. for each element, add its computed length:
starts-with "http(s)://" ? Math.max(string-length, 22) : string-length

So far this sounds like the best suggestion. I was hoping for some way to regex-split the string at each url and calculate that way, but I'm having great difficulty in getting a regex that captures the different types of urls and also returns a useful array. string.split(regex) in javascript creates a crazily structured array if you have multiple parentheses in your regex. I still can't figure out the logic of it. A regex split using whitespace, case insensitivity, and multiline flags sounds like it might work.
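
The odd arrays come from a documented behavior of split(): when the separator regex contains capturing groups, the text matched by each group is spliced into the result. A short sketch with a made-up sample string:

```javascript
var sample = "a http://x.com b";

// Two capturing groups: both captures are inserted into the result.
var twoGroups = sample.split(/(https?:\/\/)(\S+)/);
// -> ["a ", "http://", "x.com", " b"]

// Non-capturing groups: the URLs are simply dropped.
var noGroups = sample.split(/(?:https?:\/\/)(?:\S+)/);
// -> ["a ", " b"]

// One capturing group around the whole URL: text and URLs alternate.
var oneGroup = sample.split(/((?:https?:\/\/)\S+)/);
// -> ["a ", "http://x.com", " b"]
```

With the single-group form, the odd-numbered elements are the urls and the even-numbered ones are the surrounding text, which is a shape that is easy to loop over.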

johanafm;11037647 wrote:
1. Inspect page source
2. locate the jQuery that does "stuff" to the input/textarea ($/jQuery('#status').each) and look at some of the surrounding code
3. Rclick and inspect element (I am using chrome, but your browser of choice should provide the same features, albeit perhaps differently than described here)
4. sources tab - expand the folders to find tfw/intents/tweetbox.js
5. semi-read / scan through the code until you find parts that seem to make some kind of sense. The interesting part seem to start around this.$textarea.bind("keyup",
which is later on followed by such function calls as getTextLength, getTweetLength, updateCounter.
6. Looking at getTweetLength which seems appropriate, you can also see on which object it is defined: twttr.txt.getTweetLength
7. Top right corner, "watch expressions", click the '+' and enter "twttr"
8. reload page
9. expand twttr, expand txt
10. double click getTweetLength to get the function definition and copy paste elsewhere for inspection
11. repeat for its function calls

I'm pretty familiar with most of these steps, but I was looking at https://twitter.com and not the url you specified. I have had trouble for a variety of reasons:
* locating javascript that attacks the textarea has not been easy: multiple js files, can't seem to find reference to textarea's id in them
* javascript uses lazy loading module type thing so js is not always explicitly mentioned in page
* code is minified
I'll follow your instructions and see what I can find...

johanafm;11037647 wrote:
If you want to inspect the code as it is running, you would need to insert line-breaks to get meaningful break points. Not sure if it is allowed or not. But if it is, copy the appropriate js file, add line breaks, add break-points and go.

With minified code I've seen so far, it's really hard to parse out what the heck is going on.

johanafm;11037647 wrote:
Unless I miss something, the differences are trivial. start-of-string, white-space, end-of-string.
/* reads as: no-capture: start or whitespace, http or https ://, everything up until whitespace or end
 */
/(?:^|\s)http(s)?:\/\/.+(?:$|\s)/
Just to be sure, check whether it should be .+ or .+? (after https://).
I don't remember if js regexp is greedy or not. Greedy would make it match everything until the end, non-greedy until the first occurrence of white-space or end-of-string. It should be non-greedy.

Unless I'm mistaken, that pattern won't capture "foo.com" or other urls which don't have http(s) specified. The twitter code seems to catch these too.

The code on twitter also allows slashes, periods, dashes, question marks, ampersands, hash marks, etc.

sneakyimp · Jan 24, 2014

Thanks for the pattern. Looks to me like it will only recognize domains and not extended urls with ?, #, &, etc.

sneakyimp · Jan 24, 2014

johanafm, your instructions are very helpful. The other page I had been looking at was considerably more complex than this simple tweet page. I'm slowly digging out the JS functions like this one:

<script>
twttr.txt.getTweetLength=function(text,options){
  if(!options){
    options={
      short_url_length:22,
      short_url_length_https:23
    }
  }
  var textLength=twttr.txt.getUnicodeTextLength(text), urlsWithIndices=twttr.txt.extractUrlsWithIndices(text);

  twttr.txt.modifyIndicesFromUTF16ToUnicode(text,urlsWithIndices);
  for(var i=0; i<urlsWithIndices.length; i++){
    textLength+=urlsWithIndices[i].indices[0]-urlsWithIndices[i].indices[1];
    if(urlsWithIndices[i].url.toLowerCase().match(twttr.txt.regexen.urlHasHttps)){
      textLength+=options.short_url_length_https
    }else{
      textLength+=options.short_url_length
    }
  }
  return textLength
};

</script>
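
To make the arithmetic in the loop concrete: each URL first has its own length subtracted (indices[0] - indices[1] is negative) and then a fixed 22 or 23 added back. A stand-alone sketch of just that step, where `adjustLength`, the https test, and the hand-written indices are illustrative; the real code gets the indices from extractUrlsWithIndices and uses its own urlHasHttps regex:

```javascript
// Hypothetical stand-in for the loop inside getTweetLength.
function adjustLength(textLength, urls) {
  var total = textLength;
  for (var i = 0; i < urls.length; i++) {
    // Remove the URL's real length (start minus end is negative)...
    total += urls[i].indices[0] - urls[i].indices[1];
    // ...and add the fixed shortened length instead.
    total += /^https:\/\//i.test(urls[i].url) ? 23 : 22;
  }
  return total;
}

var text = "Read http://example.com/some/long/path now";
// The URL occupies characters 5 through 37, so indices are [5, 38].
var urls = [{ url: "http://example.com/some/long/path", indices: [5, 38] }];
var result = adjustLength(text.length, urls); // 42 - 33 + 22 = 31
```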

Derokorian · Jan 24, 2014

So my pattern was 2 characters off, needed to make / optional after tld, and add the insensitive flag. That's what I get for trying to recreate a function on the fly instead of just going to look at it. However, here is an example of it working pretty well with a bunch of URLs: http://jsfiddle.net/derokorian/UWFtk/
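
The fiddle itself isn't reproduced here, but going by that description (slash made optional after the TLD, case-insensitive flag added), the corrected function would look roughly like this. This is a guess at the fix, not a copy of the fiddle, and `linkifyFixed` is just a name for the sketch:

```javascript
// Guess at the corrected pattern: "\/" became "\/?" and the "i" flag was added.
function linkifyFixed(str) {
  return str.replace(/((https?:\/\/)?([a-z0-9_-]+\.)+[a-z]{2,6}\/?[^\s]*)/gi, '<a href="$1">$1</a>');
}
```

Note that a pattern like this still swallows any punctuation that directly follows the url, so the trailing full stop issue mentioned earlier applies here too.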

sneakyimp · Jan 24, 2014

twttr.txt.regexen.non_bmp_code_pairs=/[\uD800-\uDBFF][\uDC00-\uDFFF]/gm;

twttr.txt.getUnicodeTextLength=function(text){
  return text.replace(twttr.txt.regexen.non_bmp_code_pairs," ").length
};

twttr.txt.regexen.extractUrl=regexSupplant("("+"(#{validUrlPrecedingChars})"+"("+"(https?:\\/\\/)?"+"(#{validDomain})"+"(?::(#{validPortNumber}))?"+"(\\/#{validUrlPath}*)?"+"(\\?#{validUrlQueryChars}*#{validUrlQueryEndingChars})?"+")"+")","gi");


twttr.txt.extractUrlsWithIndices=function(text,options){
  if(!options){
    options={
      extractUrlsWithoutProtocol:true
    }
  }
  if(!text||(options.extractUrlsWithoutProtocol ? !text.match(/\./) : !text.match(/:/))){
    return[]
  }
  var urls=[];
  while(twttr.txt.regexen.extractUrl.exec(text)){
    var before=RegExp.$2,url=RegExp.$3,protocol=RegExp.$4,domain=RegExp.$5,path=RegExp.$7;
    var endPosition=twttr.txt.regexen.extractUrl.lastIndex,startPosition=endPosition-url.length;

    if(!protocol){
      if(!options.extractUrlsWithoutProtocol||before.match(twttr.txt.regexen.invalidUrlWithoutProtocolPrecedingChars)){
        continue
      }
      var lastUrl=null,lastUrlInvalidMatch=false,asciiEndPosition=0;
      domain.replace(twttr.txt.regexen.validAsciiDomain,function(asciiDomain){
        var asciiStartPosition=domain.indexOf(asciiDomain,asciiEndPosition);
        asciiEndPosition=asciiStartPosition+asciiDomain.length;
        lastUrl={
          url:asciiDomain,
          indices:[startPosition+asciiStartPosition,startPosition+asciiEndPosition]
        };
        lastUrlInvalidMatch=asciiDomain.match(twttr.txt.regexen.invalidShortDomain);
        if(!lastUrlInvalidMatch){
          urls.push(lastUrl)
        }
      });
      if(lastUrl==null){
        continue
      }
      if(path){
        if(lastUrlInvalidMatch){
          urls.push(lastUrl)
        }
        lastUrl.url=url.replace(domain,lastUrl.url);
        lastUrl.indices[1]=endPosition
      }
    }else{
      if(url.match(twttr.txt.regexen.validTcoUrl)){
        url=RegExp.lastMatch;
        endPosition=startPosition+url.length
      }
      urls.push({
        url:url,
        indices:[startPosition,endPosition]
      })
    }
  }
  return urls
};

OH GAWD is this really worth it?
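
One piece of that is easier than it looks: getUnicodeTextLength just counts each surrogate pair (a character outside the Basic Multilingual Plane, such as most emoji) as one character instead of the two that .length reports. A stand-alone sketch of the same idea, with `unicodeLength` as an illustrative name:

```javascript
// Same pattern as non_bmp_code_pairs: a high surrogate followed by a low surrogate.
var nonBmpPairs = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// Hypothetical stand-in for twttr.txt.getUnicodeTextLength.
function unicodeLength(text) {
  // Collapse each surrogate pair to a single placeholder character, then count.
  return text.replace(nonBmpPairs, " ").length;
}

// "a", one emoji written as a surrogate pair, then "b".
var withEmoji = "a\uD83D\uDE00b";
// withEmoji.length is 4, while unicodeLength(withEmoji) is 3.
```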

sneakyimp · Jan 25, 2014

EDIT: pasted wrong code, tweetbox.js is too long. will attach.

sneakyimp · Jan 25, 2014

OK so i reformatted the twitter code that is relevant to my task at hand and am trying to attach it here:
[ATTACH]4995[/ATTACH]

If anyone is interested, it works something like this:

<!-- jquery -->
<script type="text/javascript" src="/js/jquery-2.0.3.min.js"></script>
<!-- the twitter js -->
<script type="text/javascript" src="/js/tweetbox.js"></script>
<script type="text/javascript">

// OK tried to lift twitter code, gonna try this now
var twttr = twttr || {};
twttr.tco = {
	length: 22
};

// use .load() instead if you want to wait for css or images to load
$( document ).ready(function() {
	$("#tweet_textarea").keyup(function(evtObj) {
		display_tweet_length();
	});

	display_tweet_length();
});

function display_tweet_length() {
	var currentString = $("#tweet_textarea").val();
	$("#tweet_length").html(140-twttr.txt.getTweetLength(currentString));
}
</script>

tweetbox.js.zip

sneakyimp · Jan 25, 2014

Derokorian, thanks for sharing your pattern. It does look like it's working quite well.

Johanafm, thanks so much for your help, you were instrumental in locating the appropriate JS -- really elaborate, as it turns out.
