Tuesday, December 12, 2006
Space vs Speed
In the old days you used to have to worry about whether to prioritize storage space or response speed. Basically, you can code a database to be as fast as you want if you have unlimited space. You can create a file for absolutely everything - so when data is retrieved, the database goes to just one file or directory and that's it. Conversely, if you minimize the space you consume, your database is forced to do more active searching through piles of data, slowing it down.
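The trade-off above can be sketched in a few lines. This is purely illustrative (not a production design, and all file names are invented): one layout spends disk space so a lookup touches exactly one file, the other packs everything into a single file and pays for it with a linear scan.

```python
import os
import tempfile

# Invented sample data standing in for listing records.
records = {f"listing-{i}": f"data for listing {i}" for i in range(1000)}

base = tempfile.mkdtemp()

# Space-hungry but fast: one file per record, so a lookup is a single open().
fast_dir = os.path.join(base, "one_file_per_record")
os.makedirs(fast_dir)
for key, value in records.items():
    with open(os.path.join(fast_dir, key), "w") as f:
        f.write(value)

def fast_lookup(key):
    with open(os.path.join(fast_dir, key)) as f:
        return f.read()

# Compact but slower: everything in one file, so a lookup is a linear scan.
compact_path = os.path.join(base, "all_records.dat")
with open(compact_path, "w") as f:
    for key, value in records.items():
        f.write(f"{key}\t{value}\n")

def compact_lookup(key):
    with open(compact_path) as f:
        for line in f:
            k, v = line.rstrip("\n").split("\t", 1)
            if k == key:
                return v
    return None
```

Both lookups return the same data; they just spend different resources getting there.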
Nowadays space is so plentiful and cheap that you don't really need to worry about space. You can run amok and create the most relational database ever, where every file relates to another. Sweet.
Bandwidth, however, is not yet unlimited, though it soon will be. So your main constraint right now is how long it takes to back your database up over the Internet. If you design your backup system well, the whole architecture can be regenerated from a few core backup data files. Then the issue becomes the processing required to unpack a million-user database. And that's a whole new story.
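The "regenerate from a few core files" idea can be illustrated with a toy sketch. Everything here is hypothetical (the record format and the locality index are my own invention): only the compact core data goes over the wire, and any derived structure, such as an index, is rebuilt locally from it.

```python
import json

# The "core backup": a compact list of listings, the only thing shipped offsite.
core_backup = json.dumps([
    {"id": 1, "locality": "Austin", "price": 250000},
    {"id": 2, "locality": "Dallas", "price": 310000},
    {"id": 3, "locality": "Austin", "price": 199000},
])

def restore(backup):
    """Regenerate a derived locality index from the core data alone."""
    index = {}
    for rec in json.loads(backup):
        index.setdefault(rec["locality"], []).append(rec["id"])
    return index

index = restore(core_backup)
```

The point is that the index never needs to be backed up at all; it is a pure function of the core data.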

Commercial Ware, Open Source, or In-House
When developing a database, you basically have a choice of [1] commercial applications (such as Oracle), [2] stuff from the Open Source community (more power!), and [3] writing something from scratch.
The problem with [1] is that it's expensive and takes forever to learn how to use. In fact, for most applications, you need to be certified. Another problem is that you are limited to what the database offers. With [2] the problems are similar, though there is more flexibility and the quality of the product tends to be better. But you do find yourself drawn into the development community. I would not recommend getting involved when working on a commercial project with deadlines and so on, unless the client is supportive of the culture. [3] is insane, like machining your own nuts and bolts to build a car instead of leasing one from Hertz.
Being the radical nutcase that I am, I opted for [3]. We can add the exact functionalities that we need. We can design the architecture exactly how we want it. And we do not have to spend a lot of time asking people how stuff works, because every single line of code was written in-house. So far, we're neither ahead nor behind, but exactly on schedule - this despite three day-long brownouts (power outages). Right now I'm pretty happy.
Real Estate Databases in Action
OK, before I go any further, let's take a look at some real estate databases in action. Nowadays, all databases are online, so basically we are talking about websites here.
One of the best known is Loopnet. Loopnet specializes in commercial real estate in North America. As far as I can tell, their US geographical database is two-tier, STATE and LOCALITY, with counties (or boroughs or parishes) ignored. Most categorization decisions seem to be left to user discretion, and that approach appears to have worked well for Loopnet.
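As I read it, a two-tier STATE/LOCALITY scheme boils down to a nested map and at most two lookups per query. This is a hypothetical sketch of how such a scheme might work; the data is invented and I have no inside knowledge of Loopnet's actual implementation.

```python
# Hypothetical two-tier geography: STATE -> LOCALITY -> listing ids.
# Counties are deliberately absent from the hierarchy.
geo = {
    "TX": {"Austin": [101, 102], "Houston": [103]},
    "NY": {"New York": [201], "Buffalo": [202]},
}

def listings_in(state, locality=None):
    """At most two dictionary lookups: state, then (optionally) locality."""
    localities = geo.get(state, {})
    if locality is not None:
        return localities.get(locality, [])
    # No locality given: gather every listing in the state.
    return [lid for ids in localities.values() for lid in ids]
```

The appeal of the flat two-tier design is that no query ever has to walk a county layer that users mostly don't think in anyway.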
Then there is Trulia. The interesting thing about Trulia is that they created an interconnected database, which pipes queries to broker-owned websites rather than holding everything in their own database. Frankly I am amazed this worked, but I must say they have done a fantastic job. This architectural feature impedes rapid growth, as new brokers cannot be listed instantly (I assume), but it sure hasn't dented Trulia's growth. In the long run, this model could prove to be the most efficient and elegant of all.
Next, often mentioned in the same breath as Trulia, is Zillow, who started out as a home valuation (assessment) site. Now they offer listings. Judging from blog comments across the Internet, it would seem that Zillow's inherent role in the marketplace, whether intentional or not, is to cut out brokers, and allow direct buyer/seller contacts. This model has worked in the transportation industry - airlines now sell tickets to passengers rather than via travel agents. But will it work in the real estate industry?
Rounding out the big four is Point2, or homes.point2.com, a Canadian outfit which started out as a broker of heavy machinery. They have been growing quickly thanks to good site design, superb SEO, and global reach. For some reason Point2 are mentioned less often than Trulia, Zillow, etc. I have yet to figure out why this is.
In addition, there are a gazillion regional MLS (Multiple Listing Services) which are pretty crummy in terms of functionality. In fact, you could say that almost all of them do not even work. This is probably because, in terms of their origins, they are the equivalent of Soviet collective farms. Those that do work are proprietary systems offered by individual brokerages. The best come close to Trulia and Loopnet but are hindered by their limited geographical scope.
Database Architecture
There are essentially three issues to be taken into consideration when designing the architecture of a real estate database, or any other database for that matter.
1. Backup. This is typically listed last, but I choose to list it as the top priority. Rather than bolting backup on as an afterthought, it is best to design the architecture with backup in mind from the very beginning. The most important thing about backup is that it is automated: systems tend to fail precisely when you happen to forget to back up your data.
2. Speed. While Google's magic algorithm doubtless contributed to its success, my theory is that the blinding speed of the Google website was the top factor in its early popularity. Google would return queries - and still does - so quickly it takes your breath away. Nowadays all engines are fast, but Google is still the fastest by a clearly discernible margin.
3. Elegance. In programming terms, elegance can be defined as "lack of duplication". This matters because if you change one thing in one place and forget to change it elsewhere, you are looking at a swarm of bugs. Hence, the less duplication you have, and the more elegant your architecture, the better.
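The elegance point can be made concrete with a tiny sketch, using invented data: store each locality name exactly once and reference it by id, so a rename happens in one place instead of in every listing.

```python
# Single source of truth: locality names live in exactly one table.
localities = {1: "Saint Petersburg"}

# Listings reference the locality by id rather than duplicating its name.
listings = [
    {"id": 101, "locality_id": 1, "price": 120000},
    {"id": 102, "locality_id": 1, "price": 95000},
]

def describe(listing):
    return f"#{listing['id']} in {localities[listing['locality_id']]}"

# One change in one place...
localities[1] = "St. Petersburg"
# ...and every listing picks it up, with no second copy to forget about.
```

Had the name been copied into each listing, the rename would have been a search-and-replace across the whole database, and every missed copy a bug.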
About This Blog
I will use this space to theorize - think out loud - about the intricacies of relational database architecture and management. This is for a generic real estate listings directory, intended for easy localization and worldwide distribution.
We will not consider, at this time, the meta requirements imposed by marketing, branding, and so on. This blog will be focused on the nitty-gritty engineering aspects of developing a high-performance database structure.
However, it goes without saying that ease of use will be one of the most important conditions, if not the primary one, that must be fulfilled.