Archetype_tool, QueueCatalog: be careful when indexing with Plone’s portal_catalog!

Written on 14 Dec 2008

The portal_catalog in Plone is the index of all objects. This means the portal_catalog stores some data (index entries plus metadata) for every object in the website, so as the number of objects in the site grows, the size of the portal_catalog grows with it. The post by Tarek, Indexing, explains it well:

50% of the size of the ZODB was the catalog (I am talking about gigas here)

And that is also the case for some of the websites I’m working on.

If you store about 10 indexes and 10 metadata columns in it, the size of a brain plus its index entries for one object is not far from the size of the real object (brains are stored as persistent objects!). So think twice before adding an index or a metadata column.

But, like me, you may want to index a new piece of data from a new content type so that you can search for objects matching your criteria. For example, you have a type “Contact” with a field “EMail”. Many developers will just add a getEmail index to the portal_catalog. Please don’t do it! Instead, add a new catalog tool that inherits from the CMFPlone CatalogTool and change its index initialization method (Plone 2.5 doesn’t support the GenericSetup way of managing catalog indexes and metadata well). Next, add the path index, which the Archetypes CatalogMultiplex class needs for reindexObject, and register this catalog in the archetype_tool so that it is automatically kept in sync with your content.
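The registration step can be pictured with a small stand-alone model. This is only a sketch in plain Python, not Plone code: `ToyCatalog`, `ToyArchetypeTool` and `ToyContact` are hypothetical stand-ins, and the method names (`setCatalogsByType`, `getCatalogsByType`, `catalog_object`) are borrowed from the real tools as I remember them, with simplified signatures that are assumptions.

```python
# Toy model (plain Python, no Zope/Plone imports): every class here is a
# hypothetical stand-in that only mimics the dispatch logic.

class ToyCatalog:
    """A catalog that indexes a fixed set of attributes, keyed by path."""

    def __init__(self, name, indexes):
        self.name = name
        self.indexes = tuple(indexes)   # attribute names to index
        self.entries = {}               # path -> {index name: value}

    def catalog_object(self, obj, path):
        # Store only the attributes this catalog knows about.
        self.entries[path] = {
            idx: getattr(obj, idx, None) for idx in self.indexes
        }

    def search(self, **query):
        # Return the paths whose indexed values match every criterion.
        return sorted(
            path for path, values in self.entries.items()
            if all(values.get(k) == v for k, v in query.items())
        )


class ToyArchetypeTool:
    """Remembers which catalogs each portal type must be indexed in."""

    def __init__(self):
        self._catalogs_by_type = {}

    def setCatalogsByType(self, portal_type, catalogs):
        self._catalogs_by_type[portal_type] = list(catalogs)

    def getCatalogsByType(self, portal_type):
        return self._catalogs_by_type.get(portal_type, [])


class ToyContact:
    portal_type = "Contact"

    def __init__(self, tool, path, title, email):
        self._tool = tool
        self.path = path
        self.Title = title
        self.getEmail = email

    def reindexObject(self):
        # Multiplexing: push the object to every catalog registered
        # for its portal type.
        for catalog in self._tool.getCatalogsByType(self.portal_type):
            catalog.catalog_object(self, self.path)


# The main catalog stays small: it does not know about getEmail.
portal_catalog = ToyCatalog("portal_catalog", ["path", "Title"])
contact_catalog = ToyCatalog("contact_catalog", ["path", "getEmail"])

tool = ToyArchetypeTool()
tool.setCatalogsByType("Contact", [portal_catalog, contact_catalog])

contact = ToyContact(tool, "/site/contacts/jane", "Jane", "jane@example.org")
contact.reindexObject()

print(contact_catalog.search(getEmail="jane@example.org"))
```

In this model the e-mail value only ever lands in the dedicated catalog, which is the whole point: the main catalog does not grow a new index for a field that a single content type uses.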

You can query the new catalog just like the portal_catalog. If you don’t want to wake up the object and would rather get its metadata from the portal_catalog, just query the portal_catalog using the path index. Recently I completely removed a content type from the portal_catalog. That was possible because this content type was not displayed through the classic Plone folder_contents view. As a result, 30,000 brains were removed from the portal_catalog.
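The two-step lookup (dedicated catalog first, then the portal_catalog by path) can be illustrated with plain dictionaries. This is a simplified sketch, not the ZCatalog API: the catalogs are modeled as dicts keyed by path, and the function names and sample data are hypothetical.

```python
# Toy data (hypothetical): each catalog is modeled as a dict mapping an
# object's path to the values stored for it.
contact_catalog = {
    "/site/contacts/jane": {"getEmail": "jane@example.org"},
    "/site/contacts/john": {"getEmail": "john@example.org"},
}
portal_catalog = {
    "/site/contacts/jane": {"Title": "Jane Doe", "review_state": "published"},
    "/site/contacts/john": {"Title": "John Doe", "review_state": "private"},
    "/site/news/launch": {"Title": "Launch", "review_state": "published"},
}


def find_paths_by_email(email):
    """Step 1: query the small dedicated catalog on its own index."""
    return sorted(
        path for path, values in contact_catalog.items()
        if values["getEmail"] == email
    )


def metadata_for_paths(paths):
    """Step 2: fetch metadata from the main catalog using the path only,
    so the content objects themselves are never loaded."""
    return [portal_catalog[path] for path in paths if path in portal_catalog]


paths = find_paths_by_email("jane@example.org")
titles = [meta["Title"] for meta in metadata_for_paths(paths)]
print(titles)
```

The path acts as the join key between the two catalogs, which is why the dedicated catalog needs a path index even though the searches themselves run on other fields.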

Think about why you would add an index to the portal_catalog. Does every object in your website have that field, or could every object have it? If the answer is yes, go ahead. A good example is rating information: if you want to be able to sort search results by rating, you must use the portal_catalog.

Another very important point about indexing with Plone concerns full-text indexing and its performance. As you will find on the mailing lists, lots of people complain about ZCTextIndex: it is slow! I ran a benchmark this week with PloneQueueCatalog and the factory hack. I delayed every ZCTextIndex and put a Python timing decorator on the indexObject, reindexObject and unindexObject methods of the CatalogMultiplex class. The results were:

  • 0.039 seconds with the ZCTextIndexes indexed immediately, and 0.025 seconds with the ZCTextIndexes delayed, measured on the last indexObject call.
  • 0.5 seconds for the useless index + reindex + unindex cycle in portal_factory without the factory hack, and 0.11 seconds with the hack. This one was not improved by QueueCatalog.
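For readers who want to take similar measurements, a timing decorator can be as small as the sketch below. It is not the decorator used for the figures above (that code is not shown in this post); `timed` and `ToyContent` are hypothetical names, and the sketch uses Python 3's `time.perf_counter`, which a Plone 2.5 era Python would not have.

```python
import functools
import time


def timed(func):
    """Record the wall-clock duration of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are timed too.
            wrapper.durations.append(time.perf_counter() - start)
    wrapper.durations = []
    return wrapper


class ToyContent:
    """Hypothetical stand-in for a content object with indexing methods."""

    def __init__(self):
        self.indexed = False

    @timed
    def indexObject(self):
        self.indexed = True

    @timed
    def unindexObject(self):
        self.indexed = False


item = ToyContent()
item.indexObject()
item.unindexObject()

calls = ToyContent.indexObject.durations
print("indexObject: %d call(s), %.6f s total" % (len(calls), sum(calls)))
```

Keeping the durations in a list on the wrapper makes it easy to compute an average over many calls, which is more meaningful than a single measurement.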

The portal_catalog has 130,000 objects indexed, and the Data.fs is about 1.3 GB. This benchmark was run on my laptop (Intel Core 2 Duo, 2 GB RAM).

So I’m now using those products in production.

TEST PROPOSAL: I would like to try replacing the portal_catalog with a relational database. PostgreSQL now supports full-text indexing, and it would be fairly easy to return basic objects that expose the results of an SQL query as attributes. A good JMeter test plan would show the results. If anyone has already done this kind of test, I’m really interested; please tell me.