The T-Files


Sun, 02 Sep 2007

Immutable object love

I have spent a lot of time recently at $day_job (actually more like work.DayJob, I am 100% Java these days) trying to come up with an easy-to-use, yet powerful, flexible and efficient object persistence scheme. Traditionally, it has all been hand-rolled SQL here, which is repetitive (hence both boring and error-prone), verbose, and has poor tool support. For a new project, we started to use Hibernate, which is very powerful, and widely used and supported, but it seems to suffer from the "leaky abstractions" problem: While you can happily declare all your entities and relationships at a very high level, if you want to have any semblance of reasonably efficient storage and scalable query performance, you have to understand the inner workings of Hibernate. Unlike other OR-mappers, Hibernate knows about this problem, and does not try to hide the underlying database: It does not claim exclusive ownership of the data, it encourages you to tune the interactions between the layers, and it even supports direct SQL queries. All this makes Hibernate a great tool for people who know what they are doing, but it probably does not work as the magic black box we would have liked to have. This is of course not a shortcoming of Hibernate. With a problem domain as complex as generic object persistence, there can be no simple solution. What I am looking for now is a tool that limits itself to a narrower set of tasks and, by thus eliminating complexities in the problem domain, also eliminates complexities for its users.

Specifically, I have developed a potentially unhealthy fascination with write-once-never-update ways to store data. By introducing this one simple constraint, everything becomes so much easier:

  • Transactions: Because data is never updated, there can be no dirty reads. You still want atomic inserts, unique primary keys, and guaranteed durability from your database, but there is not much need for long-running transactions.
  • Lazy-loading: Because data never changes, if you obtain a reference (like a primary key) to the data, you can easily defer access to the content to a later time, without having to worry about getting an inconsistent view.
  • Caching: Because data never changes, you can cache everything aggressively, at every layer, without having to worry about the cache becoming stale.
  • Shared storage for nested objects: If you have a nested data structure, there is no need to deep-clone anything when writing it to the database. It is totally safe to share pointers, because none of the involved objects can ever change the contents of that pointer. By extension, you only have to store each unique value once. This can save a lot of disk space.
  • Backup: Because existing data never changes, every backup after the first one only has to be an incremental backup of the newly inserted data.
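
To make a few of these points concrete, here is a minimal in-memory sketch in Java. All names in it (WriteOnceStore, LazyRef, physicalWrites) are made up for illustration, and deriving the key from a SHA-256 hash of the content is just one assumed way to get "store every unique value only once"; a real implementation would sit on top of a database.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical write-once store: values are inserted, never updated or deleted. */
final class WriteOnceStore {
    private final Map<String, String> rows = new HashMap<>();
    private int physicalWrites = 0;

    /** Inserts the value unless an identical one is already stored; returns its key. */
    String put(String value) {
        String key = sha256Hex(value); // assumption: the key is derived from the content
        if (rows.putIfAbsent(key, value) == null) {
            physicalWrites++; // only the first copy of a value is actually written
        }
        return key;
    }

    /** Returns the value for a key; the result can never differ between two calls. */
    String get(String key) {
        String value = rows.get(key);
        if (value == null) {
            throw new IllegalArgumentException("unknown key: " + key);
        }
        return value;
    }

    int physicalWrites() {
        return physicalWrites;
    }

    private static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}

/** Hypothetical lazy reference: holds only the key and loads the content on first use. */
final class LazyRef {
    private final WriteOnceStore store;
    private final String key;
    private String cached; // never invalidated, because the stored value cannot change

    LazyRef(WriteOnceStore store, String key) {
        this.store = store;
        this.key = key;
    }

    String get() {
        if (cached == null) {
            cached = store.get(key);
        }
        return cached;
    }
}
```

Two objects that embed the same nested value simply end up holding the same key, so the second put() costs no extra storage, and a LazyRef can be handed around and resolved whenever it is convenient.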

So how can one store real data, which happens to change over time? One way to tackle this would be the approach of a versioning system: Regard every revision of the object as its own piece of data, and store those. An update to the object becomes an insert of a new revision. This way, you also get an update history, undo functionality, and an audit trail, all of which you probably want anyway. It may be necessary to layer a few normal (updatable) tables on top, to maintain pointers to the current versions in order to achieve acceptable query performance, but those should be very simple (and small).
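
As a rough sketch of that idea (again in-memory Java, with hypothetical names like RevisionStore and save; the two collections stand in for an append-only revisions table and a small updatable "current version" table):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical revision store: an update is an insert of a new revision. */
final class RevisionStore {
    /** One immutable revision of one object. */
    static final class Revision {
        final int revisionId;
        final String objectId;
        final int previousRevisionId; // -1 if this is the first revision
        final String content;

        Revision(int revisionId, String objectId, int previousRevisionId, String content) {
            this.revisionId = revisionId;
            this.objectId = objectId;
            this.previousRevisionId = previousRevisionId;
            this.content = content;
        }
    }

    // Append-only: entries are added, never modified or removed.
    private final List<Revision> revisions = new ArrayList<>();
    // The only updatable part: object id -> id of its current revision.
    private final Map<String, Integer> currentRevision = new HashMap<>();

    /** Stores a new revision and moves the "current" pointer to it. */
    int save(String objectId, String content) {
        int id = revisions.size();
        Integer previous = currentRevision.get(objectId);
        revisions.add(new Revision(id, objectId, previous == null ? -1 : previous, content));
        currentRevision.put(objectId, id);
        return id;
    }

    /** Returns the current content, or null if the object was never saved. */
    String current(String objectId) {
        Integer id = currentRevision.get(objectId);
        return id == null ? null : revisions.get(id).content;
    }

    /** Returns all contents of the object, newest first (the update history). */
    List<String> history(String objectId) {
        List<String> result = new ArrayList<>();
        Integer start = currentRevision.get(objectId);
        int id = start == null ? -1 : start;
        while (id != -1) {
            Revision r = revisions.get(id);
            result.add(r.content);
            id = r.previousRevisionId;
        }
        return result;
    }
}
```

The history falls out of the previous-revision pointers for free, and the pointer map is the only thing that ever gets overwritten.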