Building a scalable PHP application using the Cloud

sneakyimp

I'm essentially an atheist, but this seemed appropriate:

1 Corinthians 13:11 wrote:
When I was a child, I spake as a child, I understood as a child, I thought as a child: but when I became a man, I put away childish things.

I've spent some time looking at Amazon EC2 and I've got some experience with Rackspace and I want to understand now how I might use the cloud to build a PHP application that can serve millions of users per day. I've noticed a couple of things about cloud services:

They let you instantiate virtual servers either using PHP code or manually through a control panel.
EC2 offers a Load Balancer which 'automatically' distributes requests across some collection of virtual servers. Apparently it can also create new computing instances using Auto Scaling. Rackspace has a short tutorial on load balancing but this does not appear to take into account any session-related info so I'm not sure how well it will work if you have a distributed database situation.
I haven't seen any kind of auto-scaling mass storage system for Rackspace, but Amazon has SimpleDB which is non-relational.
Zend Framework tries to abstract interfacing with the cloud for File Storage, DB Storage, and Job Queuing.

I want to know the ins and outs of this before attempting to build anything. If anyone has additional input here, I'd sure love to get some more information. My questions:

Non-relational DB -- what does it mean?
SimpleDB is non-relational. As I understand it, this means no JOINs, no GROUP BY, no ORDER BY, no indexes. You can get more than one record at once with some simplified WHERE-type stuff. What other sorts of limitations can I expect? Is this in any way advantageous? I guess it's kind of nice that I can just create some kind of arbitrary record with multiple fields and associate it with some ID.

Is this SimpleDB beast going to respond as quickly as MySQL does?
Seems to me that a globally available distributed database which is being accessed by potentially hundreds of thousands of computers might have some latency issues. Additionally, the access methods I've seen would appear to be routed through DNS requests, through load balancers, etc. I'm guessing that I can't expect the tens of milliseconds access times that I get from a MySQL database running on localhost or on the same LAN as my server.

What about data coherency?
I'm also guessing that this black box SimpleDB might have some issues in keeping data consistent between users in China and users in Los Angeles. For instance, if I enter a new status in my social networking application, how soon before my pen pal in Shanghai can see it? Are there problems that people have experienced due to this data incoherency?
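To make the worry concrete, here is a toy model of a store whose replicas lag behind the primary. It is only a sketch in Python (the idea ports directly to PHP), it is not how SimpleDB is actually implemented, and every name in it (`ReplicatedStore`, `propagate`, the `consistent` flag) is made up for illustration.

```python
class ReplicatedStore:
    """Toy store: writes land on a primary and reach replicas later."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the replicas

    def put(self, key, value):
        # The write is visible on the primary immediately.
        self.primary[key] = value
        self.pending.append((key, value))

    def get(self, key, replica=0, consistent=False):
        # A normal read hits a replica and may be stale;
        # a "consistent" read goes to the primary instead.
        source = self.primary if consistent else self.replicas[replica]
        return source.get(key)

    def propagate(self):
        # Stand-in for whatever background replication the service runs.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


store = ReplicatedStore()
store.put("status:la_user", "Hello from Los Angeles")
stale = store.get("status:la_user", replica=1)         # replica has not caught up
fresh = store.get("status:la_user", consistent=True)   # read from the primary
store.propagate()
later = store.get("status:la_user", replica=1)         # now visible everywhere
```

In this model the pen pal reading from replica 1 sees nothing until `propagate()` runs. How long that window is in a real service is exactly the open question.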

What about sessions?
Is it reasonable to handle sessions using SimpleDB or do I need to account for them in both the local system and also make my load balancer session-aware? I know pretty little about session-aware load balancing so any input here would be much appreciated -- especially if it describes an application architecture to handle sessions in a cloud system.

How does one deploy an application to an automatically created computing instance?
If I'm using some kind of load balancer to create and destroy virtual servers in response to fluctuating demand, I will need to deploy my PHP application (and possibly an httpd.conf with rewrite rules in it, database, etc.) to each new instance that gets created. I know that you can create instance 'images' which have some flavor of linux and all of your other server config stuff, but what if you have machine-specific values? How does one handle setup of applications or data that are specific to a machine instance?
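One possible approach, sketched here in Python under stated assumptions: keep the image generic, bake a config template into it, and have a boot script fill in the machine-specific values when the instance starts. The template contents, placeholder names, and sample values below are all invented for the example. Where the values come from (EC2 exposes per-instance metadata over HTTP at 169.254.169.254, for instance) is left out of the sketch.

```python
from string import Template

# Hypothetical config template baked into the machine image; the
# placeholder names are made up for this example.
VHOST_TEMPLATE = Template(
    "ServerName $hostname\n"
    "SetEnv APP_INSTANCE_ID $instance_id\n"
    "SetEnv APP_DB_HOST $db_host\n"
)


def render_config(template, instance_values):
    """Fill the image's generic template with this machine's values.

    substitute() raises KeyError when a placeholder has no value, so a
    half-configured instance fails at boot instead of serving traffic.
    """
    return template.substitute(instance_values)


# Values a boot script might collect for one instance (invented here).
values = {
    "hostname": "web-03.example.com",
    "instance_id": "i-0abc123",
    "db_host": "db.internal.example.com",
}
config_text = render_config(VHOST_TEMPLATE, values)
```

The boot script would then write `config_text` wherever the web server expects it and start the server. The same image can back any number of instances because nothing machine-specific is stored in it.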

I'm really hoping this thread might serve as a resource for folks who would like to learn about implementing scalable PHP applications using the Cloud. Any responses or anecdotes would be greatly appreciated.

jazz_snob

I'm confused about what "cloud computing" is. Heck, I'm still trying to figure out what web 2.0 is supposed to be.

sneakyimp

I never paid much attention to "Web 2.0" and I agree there's some kind of 'emperor's new clothes' aspect to this cloud computing buzzword. Or maybe it's meet the new boss, same as the old boss. However, I'd be happy to learn enough about this charade to trick some prospective employers into paying me more for awhile.

The most salient features I have been able to find are non-relational data storage and the promise of massive scalability. I haven't bothered looking into Google App Engine much because it requires Python or Java code. The documentation on these data storage systems promises worldwide availability and independence from concern about hardware or queries at the expense of all those handy relational features.

I have spent some time looking at Amazon EC2. They have some nice systems in place wherein you can allocate a virtual server with your choice of many linux variants, make an image of whatever software you install on it, and then -- via an API -- you can allocate as many clones of this virtual machine as you like. Alternatively, you can set up a load balancer instance which will automatically keep tabs on the 'health' of your server instances and allocate more or deallocate them as necessary within certain parameters you set.

The most obvious thing I need to figure out here is how to take the applications I've traditionally dealt with and construct them in such a way that they work in this situation. My hope is that I'll get some good advice or see some examples before I actually have to set one up.

jazz_snob

to trick some prospective employers into paying me more for awhile

I like that idea.

What I've seen touted as cloud computing is essentially making API requests to some URL, not much different than webservices or REST or _______ (proverbial blank to fill in). I've also seen where you can deploy testing or production environments quickly in "the cloud" which to me is the same as getting a VPS somewhere (virtualized hardware).

sneakyimp

The Amazon Web Services SDK for PHP is a PEAR-like SDK that allows you to manipulate Amazon's various cloud-related services using PHP. I've played around with it and from what I can tell it uses REST and/or SOAP and/or _____ so that you can write PHP code to interact with such Amazon services as Simple Storage Service (S3), Elastic Compute Cloud (EC2), SimpleDB, CloudFront, etc. This lets you do things in PHP like:
* Allocate a new virtual machine or "computing instance". Handy if you need to automate the creation of new computing instances from within a PHP application.
* Store a file using S3 -- handy if you want to host movies or set up a CDN.
* Interact with SimpleDB. It's not relational, but it is global, 'scalable', and simple. It basically allows storage of arbitrary data sets with a unique key. It allows some simple queries based on the fields in your data set.

I have used S3 to host large flash movies and this has resulted in faster response/downloads for visitors because instead of one puny server hosted in some small town, Amazon's big, crazy global network somehow manages to push this file out to various data centers.

I've used EC2 to allocate one of several types of 'computing instances' running Debian Lenny which as far as I can tell behaves identically to a dedicated server. It has an IP address. You can log in via SSH and install Apache and stuff. You can take a snapshot of all the software you've installed and save it as an Amazon Machine Image (AMI) which you can use as a template for allocating additional instances -- either through the web interface or by using the SDK or through Elastic Load Balancing or Auto Scaling.

Still wondering a few things:
1) how do I point my domain to this 'cloud' ? I'm guessing I change DNS to point to a load balancer that I create thru AWS or something like that. Have yet to see how this is done.
2) Once my domain points to this load balancer, do I have any control over how it allocates visitors to a particular 'compute instance' ? Does this result in a redirect to a new subdomain or does the visitor see 'mydomain.com' ?
3) What if I need to bind a particular session to a particular server? The Elastic Load Balancing info page says "Elastic Load Balancing supports the ability to stick user sessions to specific EC2 instances" but I haven't seen any more detail.
4) How is a PHP app that works on this system different than a traditional one? Obviously such an app would either need to use the SimpleDB or some other massively distributed data system OR I would need to set up my own MySQL replication/clustering situation by hand.
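For question 3, here is a minimal sketch of what "sticking" a session to a server could look like. This is a toy model in Python and not Elastic Load Balancing's actual mechanism; the class and method names are hypothetical. It pins each session ID to an instance the first time it is seen and only reassigns when that instance goes away.

```python
class StickyBalancer:
    """Toy balancer that pins each session to one backend instance."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.assignments = {}  # session id -> instance name
        self._next = 0         # round-robin cursor for new sessions

    def route(self, session_id):
        if not self.instances:
            raise RuntimeError("no instances available")
        instance = self.assignments.get(session_id)
        # Assign on first sight, or reassign if the pinned instance is gone.
        if instance is None or instance not in self.instances:
            instance = self.instances[self._next % len(self.instances)]
            self._next += 1
            self.assignments[session_id] = instance
        return instance

    def remove_instance(self, instance):
        # E.g. the instance was deallocated when demand dropped.
        self.instances.remove(instance)


balancer = StickyBalancer(["web-1", "web-2"])
first = balancer.route("session-a")
again = balancer.route("session-a")   # same instance as before
other = balancer.route("session-b")   # next instance in rotation
balancer.remove_instance(first)
moved = balancer.route("session-a")   # pinned instance vanished, so it moves
```

The last line is the catch: when an instance is destroyed, any session data stored only on that machine is gone, which is one argument for keeping session data in a shared store rather than on the local disk.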

sneakyimp

Very interesting synopsis of SimpleDB here. Some interesting things about SimpleDB:
* no normalization. you can store multiple distinct values in a 'column' or whatever they are calling it and then run a query which will check all values in that column. for instance record X has field PhoneNumber containing 3 phone numbers. You can grab it thusly:

SELECT * FROM MyRecords WHERE PhoneNumber='555 555-1212'

* no joins which is good and bad i suppose. you kind of have to determine what your primary data type is and structure things around it. any data field can store multiple values for a single record so you don't need to bother with joins or additional queries to get them.
* no schemas. you want to change the data in a record you just do so without changing any schemas or anything.
* Simpler SQL. Because it's non relational, select syntax boils down to this:

	SELECT output_list
	FROM domain_name
	[where expression]
	[sort_instructions]
	[limit limit]

* EVERYTHING IS A UTF-8 STRING. This is a big deal. It means this query:

SELECT * FROM Sample_Qty WHERE Quantity= '1'

will not find any records with Quantity of 1.0 or 1.00. Strings have to match exactly! Sorting can also be screwy. For instance, a value of 100 will come before a value of 2. You will have to normalize/sanitize data values when inserting or extracting from db.
* Eventual consistency - This can also be a big deal and must be considered at design time. Because the data storage system is distributed, it cannot guarantee data consistency across all nodes immediately. The basic idea is that when you retrieve data, there's a chance it could be slightly out of date. Not a big deal for facebook wall postings, but definitely a big deal for session stuff or transactions. You must plan for this in your applications. In February 2010, Amazon announced extensions to allow consistent reads. When using GetAttributes or SELECT, you can set ConsistentRead = true to force a read of the most current value. This tells SimpleDB to read the items from the master database rather than from one of the slaves, guaranteeing the latest updates or deletes. This does not mean you can use it on all reads and still get the extreme scaling. Conditional PUT and DELETE were also announced; these execute a PUT or DELETE only if a consistent read of a specific attribute returns a specific value or finds that the attribute does not exist. This is useful for concurrency control or counter primitives.
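Two of the points above are easy to try out in a few lines. The sketch below is plain Python with an in-memory stand-in, not the SimpleDB API: `encode_int` shows one common workaround for string-only values (zero-padding so that string order matches numeric order), and `TinyStore.conditional_put` imitates a conditional PUT so that a counter can be incremented safely with a retry loop. All names here are hypothetical.

```python
def encode_int(value, width=10):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(value).zfill(width)


# Plain strings sort "100" before "2"; padded strings sort numerically.
raw_sorted = sorted(["100", "2", "30"])
padded_sorted = sorted(encode_int(n) for n in (100, 2, 30))


class ConditionFailed(Exception):
    """Raised when the stored value is not what the caller expected."""


class TinyStore:
    """In-memory stand-in for a store with conditional puts."""

    def __init__(self):
        self.items = {}

    def get(self, key):
        return self.items.get(key)  # None means "does not exist"

    def conditional_put(self, key, new_value, expected):
        # Write only if the current value still matches what the caller read.
        if self.items.get(key) != expected:
            raise ConditionFailed(key)
        self.items[key] = new_value


def increment(store, key, max_attempts=5):
    """Read, add one, and write back only if nobody else wrote in between."""
    for _ in range(max_attempts):
        current = store.get(key)
        number = 0 if current is None else int(current)
        try:
            store.conditional_put(key, encode_int(number + 1), expected=current)
            return number + 1
        except ConditionFailed:
            continue  # someone else won the race; re-read and retry
    raise RuntimeError("too much contention on " + key)


store = TinyStore()
increment(store, "page_hits")
hits = increment(store, "page_hits")
```

The padding width has to be chosen up front and used everywhere, since "0000000002" and "002" are different strings as far as an exact-match query is concerned.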

If SimpleDB bothers you, there's always Amazon Relational Database Service (RDS). It's a virtual MySQL server instance with extra support features:
* Provides MySQL 5.1 functionality
* automatically patches database software
* automatically backs up the database and stores backups for user-specified retention periods
* point-in-time recovery
* offers API call to scale the db capabilities
* facilitates easy replication to multiple machines for read-heavy workloads
* Monitor compute and storage resource utilization via Amazon CloudWatch for no extra charge. Lets you monitor it through the AWS console (browser-based interface) or via API calls.

jazz_snob

I remember reading somewhere that "big" apps like facebook use denormalized tables (no joins). So far I've never done that. I've not tried it: http://cassandra.apache.org/

sneakyimp

I wonder if there's any kind of koan-like or catechism-like means by which we might indoctrinate ourselves with the tradeoffs involved in these nonrelational db systems relative to their relational buddies.

No joins, for instance? What does it mean in practice? Does it mean we can only have one big table (they call them Domains in SimpleDB parlance)? Or if I have more than one Domain, they sure as heck better not require any joins. A particularly messy issue that comes to mind is that I have one Domain for Users, another Domain for StatusUpdates. I can't join the two, but I can run a query to get all StatusUpdates with UserID of 'x'. How might this impact the code of a page displaying UserUpdates?
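My guess at the answer, as a rough sketch: the join moves into application code, so the page does two lookups and stitches the results together itself. The example below models the two Domains as plain Python data rather than calling SimpleDB, and the sample records and function names are invented for illustration.

```python
# Two "domains" modelled as plain Python data (sample values invented).
users = {
    "u1": {"Name": "Alice"},
    "u2": {"Name": "Bob"},
}
status_updates = [
    {"UserID": "u1", "Text": "Trying out SimpleDB"},
    {"UserID": "u2", "Text": "Still on MySQL"},
    {"UserID": "u1", "Text": "No joins, apparently"},
]


def updates_for_user(updates, user_id):
    """Stands in for: select * from StatusUpdates where UserID = '<id>'."""
    return [u for u in updates if u["UserID"] == user_id]


def render_updates_page(users, updates, user_id):
    user = users[user_id]                      # lookup 1: the user record
    rows = updates_for_user(updates, user_id)  # lookup 2: that user's updates
    # The "join" is just combining the two results in application code.
    return [f"{user['Name']}: {row['Text']}" for row in rows]


page = render_updates_page(users, status_updates, "u1")
```

The other option is to denormalize: copy the user's name into each StatusUpdates item when it is written, so the page needs only one query, at the cost of rewriting those items if the name ever changes.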

I wonder if you can do DISTINCT selects...