I'm working on a page which uses AJAX to display viewer counts for a video. I'm told we might have 1000 concurrent viewers watching a video, with each of their browsers polling my AJAX endpoint to check the viewer count every 10 or 15 seconds. That could easily result in 60-100 requests per second to my AJAX endpoint.

My AJAX endpoint, a PHP script, fetches the viewer counts by contacting the MUX[.]com API. I cache the API response by storing it in a file. This cache is considered stale after API_RESPONSE_CACHE_LIFETIME seconds (currently 15 seconds). This mostly works fine.

HOWEVER, concurrency, coupled with the modest latency of the MUX API, has resulted in bursts of requests going out to the API. When we have enough viewers, the requests come in fast enough that multiple invocations of the script detect a stale cache and they all launch an API request. This is problematic, so I want to use some kind of locking to control access to the API. I have a variety of questions.

For starters, I want to use flock, which requires you to first call fopen. My testing (Ubuntu 20) indicates this will NOT alter the mtime of a file. Can I be sure of that for other file systems?

$fp = fopen($jwt_cache_file, "c");

I think to be safe I need a more comprehensive understanding of the behavior of fopen with param 'c'.
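For what it's worth, the documented semantics of mode 'c' are: open for writing only; create the file if it doesn't exist; do NOT truncate; place the file pointer at the beginning. Below is a quick self-contained sketch (using a throwaway temp file, not the real cache path) that can be run on any target filesystem to confirm the no-truncate and no-mtime-change behavior there:

```php
<?php
// Quick check of fopen() mode 'c' semantics: write-only, create if
// missing, NO truncation, pointer at the start. Merely opening an
// existing file this way should not touch its mtime (POSIX updates
// mtime on writes, not opens) -- run this on the actual target
// filesystem to confirm. The temp-file path is just for the demo.
$path = sys_get_temp_dir() . '/fopen_c_demo_' . getmypid() . '.txt';
file_put_contents($path, 'existing contents');
clearstatcache(true, $path);
$mtime_before = filemtime($path);

$fp = fopen($path, 'c'); // open for write, create if missing, no truncate
fclose($fp);             // no write performed

clearstatcache(true, $path);
$mtime_after = filemtime($path);
$contents    = file_get_contents($path);

var_dump($contents === 'existing contents'); // true if 'c' did not truncate
var_dump($mtime_before === $mtime_after);    // true if the open left mtime alone
```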

I was also hoping for some critique of my approach with flock and a shared resource (i.e., a cache file). This function, get_mux_jwt, takes care to use flock to control access to the JWT cache file. The JWT is generated locally, but gets cached in a file. I hope to use a nearly identical approach when hitting the MUX API. If anyone sees a problem, or room for improvement, I'd be grateful for advice. I'm especially concerned about deadlock gumming up the entire web server.

/**
 * Attempts to retrieve JWT for the specified id_type and video_id, checks cache first
 * and if that is missing, empty, or stale, will contact API to get fresh one
 * @param string $id_type
 * @param string $video_id
 * @throws Exception
 * @return NULL|string
 */
function get_mux_jwt($id_type, $video_id) {
	
	if (!is_dir(JWT_CACHE_DIR)) {
		// NOTE: two processes can race here; re-check is_dir so the loser
		// of the race doesn't throw when the dir was just created by the winner
		if (!mkdir(JWT_CACHE_DIR) && !is_dir(JWT_CACHE_DIR)) {
			throw new Exception('unable to create jwt cache dir');
		}
	}
	if (!is_writable(JWT_CACHE_DIR)) {
		throw new Exception('jwt cache dir is not writable');
	}
	// we have to generate a different JWT for each video id / id type
	$jwt_cache_file = JWT_CACHE_DIR . $id_type . '_' . $video_id;
	
	
	$retval = NULL;
	$jwt_loop_start = microtime(TRUE);
	$jwt_fp = NULL;
	while (!$retval) {
		
		
		// FIRST, check to see if we need to time out
		$elapsed_time = microtime(TRUE) - $jwt_loop_start;
		if ($elapsed_time >= JWT_RENEWAL_TIMEOUT) {
			// TOO MUCH TIME ELAPSED; important to free any locked resources to prevent deadlock
			if (!is_null($jwt_fp)) {
				@flock($jwt_fp, LOCK_UN); // fclose is supposed to unlock, so this may be gratuitous, but just to be safe
				@fclose($jwt_fp);
			}
			// TODO perhaps we should return some JSON response here indicating error condition or N/A?
			// as currently written, $retval will likely be NULL, and certainly empty
			// NOTE: the break must happen even when $jwt_fp is NULL; leaving it inside the
			// is_null() check would let a fresh-but-empty cache file spin this loop forever
			break; // exit the while loop
		}
		
		
		// SECOND, check for a fresh jwt in the cache file
		clearstatcache(true, $jwt_cache_file); // PHP caches stat results; without this, a looping process may keep seeing a stale mtime
		$jwt_mtime = @filemtime($jwt_cache_file); // returns FALSE if no such file exists or it is in an unreachable dir
		if ($jwt_mtime) {
			if (!is_readable($jwt_cache_file)) {
				throw new Exception("$jwt_cache_file is not readable");
			}
			
			// check if cache file too old, leave 10 second buffer to give us time to hit API endpoint
			$max_age = JWT_LIFETIME - 10;
			$file_age = time() - $jwt_mtime;
			
			if ($file_age < $max_age) {
				// cache file is fresh, however this could be empty
				$retval = trim(file_get_contents($jwt_cache_file));
				//echo "CACHED JWT: $retval\n";
				
				// TODO - validate the JWT...could be non-empty but still invalid
				
				// rather than return or break, we continue, because retval might be empty
				continue; // skip the rest of the while loop and start next iteration
			}
		}
		
		
		
		// if we reach this point, the JWT cache is nonexistent/empty/stale, and we must generate a new jwt
		// echo "JWT CACHE IS NONEXISTENT/EMPTY/STALE\n";
		
		
		// IMPORTANT! To prevent a swarm on the API due to high concurrency, we implement this looping/locking
		// open cache file using "c" (write only, create if doesn't exist, DO NOT TRUNCATE!!!)
		// IMPORTANT -- 'r+' will fail if the file does not exist, and testing suggests this 'c' fopen
		// call will NOT affect the filemtime of the file,
		// which is good because we don't want other processes thinking the file is fresh
		$jwt_fp = fopen($jwt_cache_file, "c");
		if ($jwt_fp === FALSE) {
			throw new Exception("unable to open $jwt_cache_file");
		}
		// try to lock the JWT cache file
		if (flock($jwt_fp, LOCK_EX | LOCK_NB)) { // USE LOCK_NB SO THIS DOESN'T BLOCK
			// WE HAVE EXCLUSIVELY LOCKED THE JWT CACHE FILE. Let us proceed with API request
			
			// IMPORTANT: these are *SIGNING KEYS* and are distinct from the API secret + key
			$payload = array(
					"sub" => $video_id,
					"aud" => $id_type,
					"exp" => time() + JWT_LIFETIME, // Expiry time in epoch - in this case now + 10 mins
					"kid" => MUX_SIGNING_KEY_ID
			);
			
			// TODO might this return a nonempty-but-invalid jwt? should we validate the jwt? could this function call throw an exception?
			$retval = JWT::encode($payload, base64_decode(MUX_SIGNING_PRIVATE_KEY), 'RS256');
			//echo "NEW JWT: $jwt\n";
			
			
			// truncate the file...NOTE there's a slim chance some process will jump in and read
			// the truncated/empty file before we write the new contents...processes should
			// take care to check for empty/degenerate JWT when reading this file
			if (!ftruncate($jwt_fp, 0)) {
				throw new Exception('Unable to truncate jwt_cache_file');
			}
			if (!fwrite($jwt_fp, $retval, 8192)) { // some googling says a JWT should be under 8K bytes
				throw new Exception('Unable to write JWT to jwt_cache_file');
			}
			if (!fflush($jwt_fp)) {
				throw new Exception('Unable to flush JWT to jwt_cache_file');
			}
			// unlock and close the cache file
			flock($jwt_fp, LOCK_UN);
			fclose($jwt_fp);
			
		} else {
			// unable to lock the file, which means that someone else has locked it
			// NOTE that the jwt was empty/stale, so let's sleep and try again
			// echo "UNABLE TO LOCK $jwt_cache_file\n";
			fclose($jwt_fp); // close the file first
			sleep(1);
		}
	}
	
	return $retval;
} // get_mux_jwt

sneakyimp: // USE LOCK_NB SO THIS DOESN'T BLOCK

Wonder if you actually should do that: If a process has reason to lock it, then maybe you want it to block anything else that is trying to do so? I.e.: if you try to lock it but are blocked, should that be a sign that you should wait a few milliseconds and then see if there is now an up-to-date "thing" for you and you don't have to update it now?

NogDog, the point of getting an exclusive lock on the file is to write the file. If some other process has obtained the lock, that process is performing the update, and we want to avoid repeating the update process. It's a bit like: "hey, can I get a lock on this file? No? Ah, OK, I see [processX] is already updating it. I'll wait till they finish and just use whatever up-to-date information they fetch."

If you follow the code carefully, you'll see that any process failing to obtain the lock will close the file pointer it had to open to attempt the lock and then sleep for 1 second. My thinking was that 1 second would typically be enough time for the process that did lock the file to complete its API request, but would not be an inordinate amount of time for an AJAX request to complete.

EDIT: I feel I should add -- and I might be wrong about this -- that if every process attempting to flock does in fact block, waiting to obtain exclusive access to the file, then we might a) have a bunch of processes piled up waiting for the exclusive lock and b) introduce unwanted congestion in the file system. On a very busy server, if we suddenly had ten processes trying to update the stale cache file, we really only want ONE of those processes to exclusively lock the file and update its contents. All the other nine should be content to just read the updated contents once that ONE process fetches the updated value.
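For comparison, here is a rough sketch of the blocking variant being discussed: the single process that wins a non-blocking exclusive lock does the refresh, while every other process blocks on a shared lock (rather than sleep-and-retry) and then simply reads whatever the winner wrote. All names here ($cache_file, $refresh, read_through_cache) are hypothetical, not from the code above.

```php
<?php
// Sketch of the blocking alternative: the one process that wins a
// non-blocking exclusive lock refreshes the cache; every other process
// blocks on a shared lock until the winner releases, then just reads
// the winner's result instead of refreshing again. Hypothetical names.
function read_through_cache(string $cache_file, int $max_age, callable $refresh): ?string {
    $fp = fopen($cache_file, 'c+'); // read/write, create if missing, no truncate
    if ($fp === false) {
        return null;
    }
    if (flock($fp, LOCK_EX | LOCK_NB)) {
        // We are the designated writer; refresh only if actually stale/empty.
        clearstatcache(true, $cache_file);
        $mtime = filemtime($cache_file);
        if ($mtime === false || (time() - $mtime) >= $max_age || filesize($cache_file) === 0) {
            $fresh = $refresh();   // e.g. hit the remote API
            ftruncate($fp, 0);
            rewind($fp);
            fwrite($fp, $fresh);
            fflush($fp);
        }
        flock($fp, LOCK_UN);
    } else {
        // Another process is refreshing: block until it releases the
        // exclusive lock, then fall through and read its result.
        flock($fp, LOCK_SH);
        flock($fp, LOCK_UN);
    }
    $value = trim((string) file_get_contents($cache_file));
    fclose($fp);
    return $value !== '' ? $value : null;
}
```

The trade-off is exactly the congestion concern above: under heavy load, many readers pile up inside flock() instead of sleeping, though they wake as soon as the writer finishes rather than waiting out a full sleep(1).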

    Probably me misunderstanding what process is affected by "It is also possible to add LOCK_NB as a bitmask to one of the above operations, if flock() should not block during the locking attempt". (I vaguely recall using flock() for some reason or other years ago, but have entirely forgotten why/how I did. 🙂 The only thing I've done recently that is vaguely similar is database-driven, storing a JWT with an expiration timestamp, and to-date not having to really worry if a couple or 3 processes all try to refresh it.)

    NogDog Yeah in my experience, it's fairly unusual to have to worry about concurrency issues. In this particular situation, I'm generating the JWTs to hit an API (the API-locking code is not in this post, but looks almost identical to JWT code). I had noticed that a server that was only mildly busy was hitting the API about 3 times as much as it needed to. I had this bad feeling that a really busy server might be hitting the API about 60 or 100 times as much as it needed to. This might not only jam up the file system on my server with hundreds of processes trying to write the cache file at once, it might also result in a nasty billing surprise at the end of the month due to an outrageous number of completely unnecessary API requests.

    I could use a database to store the JWT and/or the API response instead, but my feeling was that this would involve more effort. The flock code seemed simpler. If anyone thinks the DB might be more effective or efficient, I'd be interested in hearing more detail about why that might be.

      Yeah, in our case, we peak at around 50 rpm for a given client instance of the app, so it's probably not likely that more than a couple or 3 processes might try to update a token when it's close to expiring. (We have to call an external API to get a new JWT, which has a 30-minute lifetime, so that's probably much more significant than any DB processing.)

        It occurred to me that the $api_abuse_factor -- the multiple indicating how many times too often we are hitting the API -- is roughly $requests_per_second times $seconds_per_api_response. I.e., if you have 10 requests per second and it takes one second for the API to return a response to your query, your $api_abuse_factor would be ten; if the API only takes half a second, it'd be five. Given that I was told to expect 1,000 users, and we are polling about every 10 seconds, that comes out to about 100 requests per second. The API tends to return a result in somewhere between 300 and 1000 milliseconds, so I was looking at an $api_abuse_factor of somewhere between 30 and 100.
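        The estimate above can be sanity-checked in a couple of lines, using the figures quoted in this thread (1,000 viewers polling every 10 seconds; 300-1000ms API latency):

```php
<?php
// Back-of-envelope version of the estimate: the number of redundant,
// overlapping API calls is roughly requests-per-second times API
// latency in seconds. Figures are the ones quoted in the thread.
$requests_per_second = 1000 / 10;  // 1000 viewers polling every 10s -> ~100 req/sec
$api_abuse_low  = $requests_per_second * 0.3; // 300ms API latency
$api_abuse_high = $requests_per_second * 1.0; // 1000ms API latency
printf("api abuse factor: roughly %.0f to %.0f\n", $api_abuse_low, $api_abuse_high);
```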

        I tried some apache bench testing, sending 500 requests with a concurrency of 20, and the locking code seems to be working. My function that retrieves the API response (first by checking for cache, then hitting API if necessary) does seem to correctly lock access to the API, and runs completely in an average of 83ms.

          I'm still dreading the appearance of some novel/unusual/irreproducible problem in the production environment. It will surely be some deadlock that prevents any request from successfully completing. If anyone has any hard-won concurrency battle stories to contribute, I'm all ears.
