Best Way to Handle Dirty State in an Orm Model

Best way to handle dirty state in an ORM model

_{Attention!
My opinion on the subject has somewhat changed in the past month. While the answer where is still valid, when dealing with large object graphs, I would recommend using Unit-of-Work pattern instead. You can find a brief explanation of it in this answer}

I'm kinda confused how what-you-call-Model is related to ORM. It's kinda confusing. Especially since in MVC the Model is a layer (at least, thats how I understand it, and your "Models" seem to be more like Domain Objects).

I will assume that what you have is a code that looks like this:

  $model = new SomeModel;
  $mapper = $ormFactory->build('something');

  $model->setId( 1337 );
  $mapper->pull( $model );

  $model->setPayload('cogito ergo sum');

  $mapper->push( $model );

And, i will assume that what-you-call-Model has two methods, designer to be used by data mappers: getParameters() and setParameters(). And that you call isDirty() before mapper stores what-you-call-Model's state and call cleanState() - when mapper pull data into what-you-call-Model.

_{BTW, if you have a better suggestion for getting values from-and-to data mappers instead of setParameters() and getParameters(), please share, because I have been struggling to come up with something better. This seems to me like encapsulation leak.}

This would make the data mapper methods look like:

  public function pull( Parametrized $object )
  {
      if ( !$object->isDirty() )
      {
          // there were NO conditions set on clean object
          // or the values have not changed since last pull
          return false; // or maybe throw exception
      }

      $data = // do stuff which read information from storage

      $object->setParameters( $data );
      $object->cleanState();

      return $true; // or leave out ,if alternative as exception
  }

  public static function push( Parametrized $object )
  {
      if ( !$object->isDirty() )
      {
          // there is nothing to save, go away
          return false; // or maybe throw exception
      }

      $data = $object->getParameters();
      // save values in storage
      $object->cleanState();

      return $true; // or leave out ,if alternative as exception
  }

_{In the code snippet Parametrized is a name of interface, which object should be implementing. In this case the methods getParameters() and setParameters(). And it has such a strange name, because in OOP, the implements word means has-abilities-of , while the extends means is-a.}

Up to this part you should already have everything similar...

Now here is what the isDirty() and cleanState() methods should do:

  public function cleanState()
  {
      $this->is_dirty = false;
      $temp = get_object_vars($this);
      unset( $temp['variableChecksum'] );
      // checksum should not be part of itself
      $this->variableChecksum = md5( serialize( $temp ) );
  }

  public function isDirty()
  {
      if ( $this->is_dirty === true )
      {
          return true;
      }

      $previous = $this->variableChecksum;

      $temp = get_object_vars($this);
      unset( $temp['variableChecksum'] );
      // checksum should not be part of itself
      $this->variableChecksum = md5( serialize( $temp ) );

      return $previous !== $this->variableChecksum;
  }

force object to be `dirty` in sqlalchemy

I came across the same problem recently and it was not obvious.

Objects themselves are not dirty, but their attributes are. As SQLAlchemy will write back only changed attributes, not the whole object, as far as I know.

If you set an attribute using set_attribute and it is different from the original attribute data, SQLAlchemy founds out the object is dirty (TODO: I need details how it does the comparison):

   from sqlalchemy.orm.attributes import set_attribute
   set_attribute(obj, data_field_name, data)

If you want to mark the object dirty regardless of the original attribute value, no matter if it has changed or not, use flag_modified:

   from sqlalchemy.orm.attributes import flag_modified
   flag_modified(obj, data_field_name)

Different ways to implement 'dirty'-flag functionality

For my DAO I keep a copy of the original values as retrieved from the database. When I send it to be updated, I simply compare the original values with the current. It costs a little in processing but it is a lot better than having a dirty flag per property.

EDIT to further justify not having a dirty flag: if the property returns to its original value, there is no way to reflect that, the dirty flag continues dirty because the original value was lost.

How does Hibernate detect dirty state of an entity object?

Hibernate uses a strategy called inspection, which is basically this: when an object is loaded from the database a snapshot of it is kept in memory. When the session is flushed Hibernate compares the stored snapshot with the current state. If they differ the object is marked as dirty and a suitable SQL command is enqueued. If the object is still transient then it is always dirty.

Source: book Hibernate in Action (appendix B: ORM implementation strategies)

It's important to notice however that Hibernate's dirty-checking is independent of the methods equals/hascode. Hibernate does not look at these methods at all (except when using java.util.Set's, but this is unrelated to dirty-checking, only to the Collections API) The state snapshot I mentioned earlier is something similar to an array of values. It would be a very bad decision to leave such a core aspect of the framework in the hands of developers (to be honest, developers should not care about dirty-checking). Needless to say that equals/hascode can be implemented in many ways according to your needs. I recommend you to read the cited book, there the author discuss equals/hascode implementation strategies. Very insightful reading.

Doctrine2 Entites - Is it possible to compare a dirty object to one from the database

After you flush the $em it happens (it is commited) in database.. so... you might want to retrieve the $db_entity before flush()

I am not sure what you want.. but you can also use merge instead of persist.
- merge is returning the object modified - id generated and setted
- persist is modifying your instance
If you want to have the object modified and not persisted use it before flush.
EntityManager is giving you the same instance because you didn't $em->clear()
- flush is commiting all changes (all dirty objects)
- clear is clearing memory cache. so when you find(..., $id) , you will get a brand new instance
Is clone keyword working for you? like in this example:

$entity = $em->find('My\Entity', $id);
$clonedEntity = clone $entity;

And you might also want to read this: Implementing Wakeup or Clone

Who should handle the conditions in complex queries, the data mapper or the service layer?

_{The data mapper pattern only tells you, what it is supposed to do, not how it should be implemented.

Therefore all the answers in this topic should be treated as subjective, because they reflect each authors personal preferences.}

I usually try to keep mapper's interface as simple as possible:

fetch(), retrieves data in the domain object or collection,
save(), saves (updates existing or inserts new) the domain object or collection
remove(), deletes the domain object or collection from storage medium

I keep the condition in the domain object itself:

$user = new User;
$user->setName( 'Jedediah' );

$mapper = new UserMapper;
$mapper->fetch( $user );

if ( $user->getFlags() > 5  )
{
    $user->setStatus( User::STATUS_LOCKED );
}

$mapper->save( $user );

This way you can have multiple conditions for the retrieval, while keeping the interface clean.

The downside to this would be that you need a public method for retrieving information from the domain object to have such fetch() method, but you will need it anyway to perform save().

There is no real way to implement the "Tell Don't Ask" rule-of-thumb for mapper and domain object interaction.

As for "How to make sure that you really need to save the domain object?", which might occur to you, it has been covered here, with extensive code examples and some useful bits in the comments.

Update

I case if you expect to deal with groups of objects, you should be dealing with different structures, instead of simple Domain Objects.

$category = new Category;
$category->setTitle( 'privacy' );

$list = new ArticleCollection;

$list->setCondition( $category );
$list->setDateRange( mktime( 0, 0, 0, 12, 9, 2001) );
// it would make sense, if unset second value for range of dates 
// would default to NOW() in mapper

$mapper = new ArticleCollectionMapper;
$mapper->fetch( $list );

foreach ( $list as $article )
{
    $article->setFlag( Article::STATUS_REMOVED );
}

$mapper->store( $list );

In this case the collection is glorified array, with ability to accept different parameters, which then are used as conditions for the mapper. It also should let the mapper to acquired list changed domain objects from this collection, when mapper is attempting to store the collection.

The mapper in this case should be capable of building (or using preset ones) queries with all the possible conditions (as a developer you will know all of those conditions, therefore you do not need to make it work with infinite set of conditions) and update or create new entries for all the unsaved domain object, that collection contains.

_{Note: In some aspect you could say, that the mapper are related to builder/factory patterns. The goal is different, but the approach to solving the problems is very similar.}

What is the best way to implement record updates on an ORM?

If you're writing this as an educational exercise or as a means for further honing your design skills, then great! If you're writing this because you actually need an ORM, I would suggest that looking at one of the many existing ORM's would be a much wiser idea. These products--Entity Framework, NHinbernate, etc.--already have people dedicated to maintaining them, so they provide a much more viable option than trying to roll your own ORM.

That said, I would shy away from any automatic database updates. Most existing ORM's follow a pattern of storing state information at the entity level (typically an entity represents a single row in a table, though entities can relate to other entities, of course), and changes are committed by the developer explicitly calling a function to do so. This is similar to your timer approach, but without the...well...timer. It can be nice to have changes committed automatically if you're writing something like a Winforms application and the user is updating properties through data binding, but that is generally better accomplished by having a utility class (such as a custom binding list implementation) that detects changes and commits them automatically.

On observing an execution tree of interdependent models in MVC

The cause

The root of your problem is misuse of active record pattern. AR is meant for simple domain entities with only basic CRUD operations. When you start adding large amount of validation logic and relations between multiple tables, the pattern starts to break apart.

Active record, at its best, is a minor SRP violation, for the sake of simplicity. When you start piling on responsibilities, you start to incur severe penalties.

Solution(s)

Level 1:

The best option is the separate the business and storage logic. Most often it is done by using domain object and data mappers:

Domain objects (in other materials also known as business object or domain model objects) deal with validation and specific business rules and are completely unaware of, how (or even "if") data in them was stored and retrieved. They also let you have object that are not directly bound to a storage structures (like DB tables).
For example: you might have a LiveReport domain object, which represents current sales data. But it might have no specific table in DB. Instead it can be serviced by several mappers, that pool data from Memcache, SQL database and some external SOAP. And the LiveReport instance's logic is completely unrelated to storage.
Data mappers know where to put the information from domain objects, but they do not any validation or data integrity checks. Thought they can be able to handle exceptions that cone from low level storage abstractions, like violation of UNIQUE constraint.
Data mappers can also perform transaction, but, if a single transaction needs to be performed for multiple domain object, you should be looking to add Unit of Work (more about it lower).
In more advanced/complicated cases data mappers can interact and utilize DAOs and query builders. But this more for situation, when you aim to create an ORM-like functionality.
Each domain object can have multiple mappers, but each mapper should work only with specific class of domain objects (or a subclass of one, if your code adheres to LSP). You also should recognize that domain object and a collection of domain object are two separate things and should have separate mappers.
Also, each domain object can contain other domain objects, just like each data mapper can contain other mappers. But in case of mappers it is much more a matter of preference (I dislike it vehemently).

Another improvement, that could alleviate your current mess, would be to prevent application logic from leaking in the presentation layer (most often - controller). Instead you would largely benefit from using services, that contain the interaction between mappers and domain objects, thus creating a public-ish API for your model layer.

Basically, services you encapsulate complete segments of your model, that can (in real world - with minor effort and adjustments) be reused in different applications. For example: Recognition, Mailer or DocumentLibrary would all services.

Also, I think I should not, that not all services have to contain domain object and mappers. A quite good example would be the previously mentioned Mailer, which could be used either directly by controller, or (what's more likely) by another service.

Level 2:

If you stop using the active record pattern, this become quite simple problem: you need to make sure, that you save only data from those domain objects, which have actually changed since last save.

As I see it, there are two way to approach this:

Quick'n'Dirty
If something changed, just update it all ...
The way, that I prefer is to introduce a checksum variable in the domain object, which holds a hash from all the domain object's variables (of course, with the exception of checksum it self).
Each time the mapper is asked to save a domain object, it calls a method isDirty() on this domain object, which checks, if data has changed. Then mapper can act accordingly. This also, with some adjustments, can be used for object graphs (if they are not to extensive, in which case you might need to refactor anyway).
Also, if your domain object actually gets mapped to several tables (or even different forms of storage), it might be reasonable to have several checksums, for each set of variables. Since mapper are already written for specific classes of domain object, it would not strengthen the existing coupling.
For PHP you will find some code examples in this ansewer.

Note: if your implementation is using DAOs to isolate domain objects from data mappers, then the logic of checksum based verification, would be moved to the DAO.
Unit of Work
This is the "industry standard" for your problem and there is a whole chapter (11th) dealing with it in PoEAA book.
The basic idea is this, you create an instance, that acts like controller (in classical, not in MVC sense of the word) between you domain objects and data mappers.
Each time you alter or remove a domain object, you inform the Unit of Work about it. Each time you load data in a domain object, you ask Unit of Work to perform that task.
There are two ways to tell Unit of Work about the changes:
- caller registration: object that performs the change also informs the Unit of Work
- object registration: the changed object (usually from setter) informs the Unit of Work, that it was altered
When all the interaction with domain object has been completed, you call commit() method on the Unit of Work. It then finds the necessary mappers and store stores all the altered domain objects.

Level 3:

At this stage of complexity the only viable implementation is to use Unit of Work. It also would be responsible for initiating and committing the SQL transactions (if you are using SQL database), with the appropriate rollback clauses.

P.S.

Read the "Patterns of Enterprise Application Architecture" book. It's what you desperately need. It also would correct the misconception about MVC and MVC-inspired design patters, that you have acquired by using Rails-like frameworks.

Tracking an objects state

Option 1 - using a boolean flag - is the best way. In terms of performance, as well as general usability and portability.

All the other options incur excessive overhead which is simply not needed in this case.

Best Way to Handle Dirty State in an Orm Model