October 29, 2005

The fragility of Google Base and Ning

Nova Spivack argues that services such as Ning will be brittle because of the interdependencies between data schemas:


Briefly stated: As the number of unique data schemas created in such systems grows, the probability of applications that use those schemas breaking also grows (perhaps exponentially).

Here's why:

Let's say that Sue creates a new schema in Ning (or Google Base) for a "Person." They make an app that uses this record structure. Now Joe makes a calendar app that takes Sue's Person record and connects it with his own unique "Event" record schema. Joe's app relies on Sue's Person schema to work. Next, Bob makes a To-Do list app that uses Joe's Event schema and Sue's Person Schema and pumps out "To-Do-Entry" records. Finally, Lisa creates a Project manager app that uses Sue's Person schema, Joe's Event schema, and Bob's To-Do-Entry schema, to pump out "Project" records.

So we have a network of apps that rely on data schemas from other apps. Next, let's say that Sue decides to change one of the attribute-value pairs in her Person schema -- perhaps changing it to map to a string instead of an integer value. That 1 simple change has huge ripple effects. First it causes Joe's app to break, which then causes Bob's app to break, which causes Lisa's app to break, etc. In other words, we have a chain reaction of broken apps.

As the number of unique schemas increases, the likelihood that a given schema will be modified in a given time frame also increases. At the extreme end of this curve, with large numbers of users, schemas and apps, the likelihood approaches 100% that at any given time some schema that is directly or indirectly required by a given app will have changed, causing that app to break. So in other words if such services are successful, apps within them will break ever more frequently, causing endless problems for developers.
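
To make the quoted argument concrete, here is a minimal sketch of the Sue / Joe / Bob / Lisa example as a dependency graph. The `DEPENDS_ON` map and the function names are my own illustration, not anything Ning or Google Base actually exposes. The probability function assumes each schema changes independently with the same probability in a given period, which is a simplification the original argument does not state.

```python
from collections import deque

# Hypothetical dependency map: each schema lists the schemas it relies on.
DEPENDS_ON = {
    "Person": [],                                   # Sue
    "Event": ["Person"],                            # Joe
    "To-Do-Entry": ["Event", "Person"],             # Bob
    "Project": ["Person", "Event", "To-Do-Entry"],  # Lisa
}

def upstream(schema, depends_on):
    """Return every schema that `schema` relies on, directly or indirectly."""
    seen = set()
    queue = deque(depends_on[schema])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(depends_on[current])
    return seen

def affected_by(changed, depends_on):
    """Return every schema whose app may break when `changed` is modified."""
    return {s for s in depends_on if changed in upstream(s, depends_on)}

def break_probability(n_upstream, p_change):
    """Chance that at least one of n upstream schemas changes in a period,
    assuming (hypothetically) independent changes with probability p_change."""
    return 1 - (1 - p_change) ** n_upstream

print(sorted(affected_by("Person", DEPENDS_ON)))
print(break_probability(len(upstream("Project", DEPENDS_ON)), 0.05))
```

Under that independence assumption the risk for a single app climbs towards 1 as its upstream set grows, which is the shape of the claim. Whether real schema owners change things at anything like a constant, independent rate is exactly what is in question.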


I think this is very much something that will have to be seen in practice rather than reasoned out beforehand.

If Sue's schema is relied on by others, will she be so cavalier about making arbitrary changes? Or rather, there are two scenarios:


  • One is that only Sue's schema is copied, and if she then wants to change it, presumably Joe can just stick with the original schema.


  • Alternatively, Sue's data is being actively consumed by Joe, and the applications will need to be kept in sync. In this case, things will depend on whether Sue actively wants her data consumed by Joe:


    • If she does, she'll have a strong incentive not to break his application, either by leaving her schema unchanged or by co-ordinating a negotiated change with him.


    • In the worst case, where we presume Sue is not actively helping Joe, Joe will have to keep his application tracking the vagaries of Sue's updates, and will probably try to insulate the applications downstream from him by wrapping Sue's data in a more stable format. Even in this worst case, we presume Sue is not going to be changing her schema arbitrarily every week.

      Note that consuming e.g. XML data is not really like scraping HTML. HTML can change rapidly because site owners experiment with the appearance of their pages. A pure data format, on the other hand, is only likely to change when the application needs to represent new information.
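
As a rough illustration of the "wrapping" idea in the worst case above, here is a minimal sketch of how Joe might insulate downstream apps. The `StablePerson` type, the `adapt_person` function and the field names `name` and `age` are all hypothetical. The change it absorbs is the one from the quoted example, where a value switches from an integer to a string.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StablePerson:
    """Joe's own stable view of a person; downstream apps depend only on this."""
    name: str
    age: int

def adapt_person(raw: dict) -> StablePerson:
    """Translate Sue's record, in whatever shape she currently publishes,
    into Joe's stable format. Only this function needs updating when
    Sue's schema changes."""
    age = raw["age"]
    # Hypothetical change: Sue switches `age` from an integer to a string.
    if isinstance(age, str):
        age = int(age.strip())
    return StablePerson(name=raw["name"], age=age)

old_style = adapt_person({"name": "Sue", "age": 30})
new_style = adapt_person({"name": "Sue", "age": "30"})
print(old_style == new_style)
```

Bob and Lisa would then build on `StablePerson` rather than on Sue's raw record, so a change on Sue's side costs one fix in one place rather than a chain reaction.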




Spivack points out that:


This is the very problem that the Semantic Web was created to solve. The Semantic Web provides tools for data schema integration and interoperability. The base value of RDF and OWL is that they provide a means to define, publish and map between data schemas in an open way. So for example, application creators can map their unique schemas to centrally agreed upon ontologies enabling the best of both worlds: individual developer freedom and global standards.


But let's look at what has to happen for the SemWeb version to take place.

Someone has to define the ontology. Who is going to do that? We can imagine one of two scenarios. Either Sue is going to define the ontology by herself, or she is going to sit down with Joe, Lisa and Bob and define it communally.

Either case raises awkward questions.


  • If Sue is working alone, for her own benefit:

    • a) what's her incentive to do the extra work of defining an ontology over and above her schema?


    • b) given that Sue is defining her schema and the ontology, it seems likely she'll define the ontology to have roughly the same representational capacity as her schema. But, as noted above, data schemas are normally only changed when you discover you need new representational capacity. When Sue updates her schema, it's likely that this is going to be due to a new requirement which also isn't captured in the ontology.




  • If, on the other hand, Sue is explicitly working with Joe et al., then defining a shared ontology for their work is just one way of defining a common exchange format. For years, common formats have satisfactorily allowed different applications to work together without a combinatorial explosion of incompatibility. It's not clear why we should imagine Ning-like programs being unable to do the same. (Although I confess my ignorance of Ning here; perhaps there are technical restrictions that prevent this?)



All the arguments I've seen for the SemWeb fall into this dilemma. Either there's explicit co-operation, in which case SynWeb solutions would work just as well, or there's no explicit co-operation, in which case you're going to have to be extremely lucky to find that the ontology is sufficient to make interesting inferences to combine the data (in this example, to translate between Sue's and Joe's respective schemas).
