I have spent a lot of time recently at $day_job
(actually more like work.DayJob, I am 100% Java these days)
trying to come up with an easy-to-use, yet powerful, flexible and efficient
object persistence scheme. Traditionally, it has all been hand-rolled
SQL here, which is repetitive (hence both boring and error-prone), verbose,
and has poor tool support. For a new project, we started to use Hibernate,
which is very powerful, and widely used and supported,
but it seems to suffer from the "leaky abstractions" problem:
While you can happily declare all your entities and relationships
at a very high level, if you want to have any semblance of reasonably
efficient storage and scalable query performance, you have to
understand the inner workings of Hibernate. Unlike other OR-mappers,
Hibernate knows about this problem, and does not try to hide the
underlying database: It does not claim exclusive ownership of the data,
it encourages you to tune the interactions
between the layers, and it even supports direct SQL queries.
All this makes Hibernate a great tool for people who know what they are doing,
but it probably does not work as the magic black box we would
have liked to have.
This is of course not a shortcoming of Hibernate. With a problem
domain as complex as generic object persistence, there can be
no simple solution. What I am looking for now is a tool that
limits itself to a narrower set of tasks and, by eliminating
complexities in the problem domain, also eliminates complexities
for its users.
Specifically, I have developed a potentially unhealthy fascination with write-once-never-update ways to store data. By introducing this one simple constraint, everything becomes so much easier:
- Transactions: Because data is never updated, there can be no dirty reads. You still want atomic inserts, unique primary keys, and guaranteed durability from your database, but there is not much need for long-running transactions.
- Lazy-loading: Because data never changes, if you obtain a reference (like a primary key) to the data, you can easily defer access to the content to a later time, without having to worry about getting an inconsistent view.
- Caching: Because data never changes, you can cache everything aggressively, at every layer, without having to worry about the cache becoming stale.
- Shared storage for nested objects: If you have a nested data structure, there is no need to deep-clone anything when writing it to the database. It is totally safe to share pointers, because none of the involved objects can ever change the data a pointer refers to. By extension, you only have to store each unique value once. This can save a lot of disk space.
- Backup: Because existing data never changes, every backup after the first one only needs to be an incremental backup of the newly inserted data.
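To make the list above a bit more concrete, here is a minimal in-memory sketch of such a write-once store. Everything in it is my own assumption for illustration: the class name `WriteOnceStore`, the use of plain strings as values, and in particular the choice to derive the key from a SHA-256 hash of the content (a real schema could just as well use a database-generated primary key). It is meant to show the idea, not a production design.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical write-once store: values can be inserted and read, but never updated. */
final class WriteOnceStore {
    // Stands in for a database table; rows are only ever added.
    private final Map<String, String> rows = new ConcurrentHashMap<>();

    /** Inserts a value and returns its key. Equal values end up sharing one row. */
    String put(String value) {
        String key = sha256Hex(value);
        rows.putIfAbsent(key, value); // atomic insert; an existing row is never overwritten
        return key;
    }

    /** Reads a value. The result for a given key can never change, so it is safe to cache. */
    String get(String key) {
        String value = rows.get(key);
        if (value == null) {
            throw new IllegalArgumentException("unknown key: " + key);
        }
        return value;
    }

    /** Number of distinct values stored. */
    int size() {
        return rows.size();
    }

    // Assumption: the key is the hex-encoded SHA-256 hash of the content.
    private static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the key is derived from the content, two parent objects that contain the same nested value can simply store the same key, and the value itself is stored only once. A caller holding a key can also defer calling `get` for as long as it likes without risking an inconsistent view.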
So how can one store real data, which happens to change over time?
One way to tackle this would be the approach
of a versioning system: Regard every revision of the object as
its own piece of data, and store those. An update to the object
becomes an insert of a new revision.
This way, you also get an update history, undo functionality, and
an audit trail, all of which you probably want anyway. It may be
necessary to layer a few normal (updatable) tables on top, to maintain
pointers to the current versions in order to achieve acceptable query
performance, but those should be very simple (and small).
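
As a rough sketch of the versioning idea, assume an append-only list of revisions plus one small, updatable map that points to the current revision of each object. The names (`RevisionStore`, `Revision`, `save`, `history`) and the in-memory collections are hypothetical stand-ins for real tables; this is only one possible way to lay it out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical versioned store: updates are inserts of new revisions. */
final class RevisionStore {

    /** One immutable revision of an object. */
    static final class Revision {
        final int id;
        final String objectId;
        final String content;
        final int previousId; // -1 if this is the first revision of the object

        Revision(int id, String objectId, String content, int previousId) {
            this.id = id;
            this.objectId = objectId;
            this.content = content;
            this.previousId = previousId;
        }
    }

    // Append-only "table": revisions are added, never changed or removed.
    private final List<Revision> revisions = new ArrayList<>();

    // The only updatable "table": object id -> id of its current revision.
    private final Map<String, Integer> current = new HashMap<>();

    /** Stores a new revision and makes it the current one. Returns the revision id. */
    synchronized int save(String objectId, String content) {
        int id = revisions.size();
        int previousId = current.getOrDefault(objectId, -1);
        revisions.add(new Revision(id, objectId, content, previousId));
        current.put(objectId, id);
        return id;
    }

    /** Returns the current revision of an object via the pointer table. */
    synchronized Revision current(String objectId) {
        Integer id = current.get(objectId);
        if (id == null) {
            throw new IllegalArgumentException("unknown object: " + objectId);
        }
        return revisions.get(id);
    }

    /** Returns the contents of all revisions of an object, newest first. */
    synchronized List<String> history(String objectId) {
        List<String> contents = new ArrayList<>();
        Integer id = current.get(objectId);
        int cursor = (id == null) ? -1 : id;
        while (cursor >= 0) {
            Revision r = revisions.get(cursor);
            contents.add(r.content);
            cursor = r.previousId;
        }
        return contents;
    }
}
```

Looking up the current state of an object touches only the pointer map and one revision, while the history (and with it the audit trail) falls out of following the `previousId` links. The pointer map holds one entry per object, which is why it stays small compared to the revision data.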



