Uploaded image for project: 'eZ Publish / Platform'
  1. eZ Publish / Platform
  2. EZP-16340

DB Deadlocks on ezcontentobject_tree when updating subnode

    Details

      Description

      From customer:
      -----------------------------------------------
      We are currently encountering a high number of db deadlocks resulting in fatal errors, usually when publishing content. This appears to be due to the update of the subnode modified time in the ezcontentobject_tree table.

      During publishing this sometimes fails to gain a lock and results in a fatal error which is very disconcerting for editors within the system.

      Is there a way to optimize this query using an IN statement instead of multiple node_id= in the WHERE clause? My explain query showed that the current statement is affecting all table rows, whereas the IN would affect only a handful.

      This is becoming a real issue from a scalability perspective, is there any other mysql settings or changes we can make to avoid or remedy this deadlock situation?
      -----------------------------------------------

      Attached a patch that changes multiple "OR node_id = " to "node_id IN()" using eZDB::generateSQLINStatement().

      1. ezcontentobjecttreenode.php-issue-6647.diff
        1 kB
        (inactive) Gunnstein Lye
      2. issue-16340-modified_subnode-master.diff
        3 kB
        (inactive) Gunnstein Lye

        Issue Links

          Activity

          Hide
          Patrick Allaert added a comment -

          Fixed in rev. 25148:
          trunk (4.3.0beta2)
          stable/4.2 (4.2.1)
          stable/4.1 (4.1.5)

          Show
          Patrick Allaert added a comment - Fixed in rev. 25148: trunk (4.3.0beta2) stable/4.2 (4.2.1) stable/4.1 (4.1.5)
          Hide
          Gabriel Finkelstein added a comment -

          I have ez 4.2.0 and have applied this patch, but now I get:

          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1271943672 WHERE node_id IN ( 1, 2, 11355, 3083, 3072, 56286 )

          Will an upgrade fix this? I mean, is there an additional fix besides this patch in 4.2.1 and 4.3.0 regarding this issue? (I'm trying to avoid going through the entire upgrade process in vain.)

          Show
          Gabriel Finkelstein added a comment - I have ez 4.2.0 and have applied this patch, but now I get: Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1271943672 WHERE node_id IN ( 1, 2, 11355, 3083, 3072, 56286 ) Will an upgrade fix this? I mean, is there an additional fix besides this patch in 4.2.1 and 4.3.0 regarding this issue? (I'm trying to avoid going through the entire upgrade process in vain.)
          Hide
          Gaetano Giunta added a comment -

          Having the optimizer work out different execution plans for the two queries is surprising - and a sure sign of a subpar optimizer - no wonder this might have been fixed in recent sql versions.

          Another interesting approach is the one suggested here: http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/ - but I guess it does not apply to our particular case, since we will rarely have big depths in the node tree, and we update as many nodes in an update as the depth of the current one.

          A novel approach to avoid lock contents: do not update last-updated for nodes 1 and 2 - demand that to a cronjob script (doing that for every node would lead to too many things breaking and probably to a too long / complex query)

          Novel approach 2: stop storing that info in the db, make it a dynamic field that can be gotten via a simple query (select max(modified) where path like ...)

          Show
          Gaetano Giunta added a comment - Having the optimizer work out different execution plans for the two queries is surprising - and a sure sign of a subpar optimizer - no wonder this might have been fixed in recent sql versions. Another interesting approach is the one suggested here: http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/ - but I guess it does not apply to our particular case, since we will rarely have big depths in the node tree, and we update as many nodes in an update as the depth of the current one. A novel approach to avoid lock contents: do not update last-updated for nodes 1 and 2 - demand that to a cronjob script (doing that for every node would lead to too many things breaking and probably to a too long / complex query) Novel approach 2: stop storing that info in the db, make it a dynamic field that can be gotten via a simple query (select max(modified) where path like ...)
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050954
          Novel approach 1 makes a lot of sense to me. Novel approach 2 might break some existing code?

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050954 Novel approach 1 makes a lot of sense to me. Novel approach 2 might break some existing code?
          Hide
          Gaetano Giunta added a comment -

          In reply to comment #050955
          Approach 1 is good, but I guess we will still get a lot of contention on the 1st and 2nd category container objects.
          I imagine a site with lots of dynamically updated content such as a newspaper or a blog farm would have the 'hot' contents stored mostly in the same subtree.

          As to what might break using the 2nd approach: a quick scan of the sources revealed that the modified_subnode info is used by:

          • some forum templates. Here switching to a dynamic attribute might work easily
          • in some sort_by clauses. this is tougher
          • it is mentioned to be related to cache handling (argh!)
          Show
          Gaetano Giunta added a comment - In reply to comment #050955 Approach 1 is good, but I guess we will still get a lot of contention on the 1st and 2nd category container objects. I imagine a site with lots of dynamically updated content such as a newspaper or a blog farm would have the 'hot' contents stored mostly in the same subtree. As to what might break using the 2nd approach: a quick scan of the sources revealed that the modified_subnode info is used by: some forum templates. Here switching to a dynamic attribute might work easily in some sort_by clauses. this is tougher it is mentioned to be related to cache handling (argh!)
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050956
          The solution for approach 1 is then to have a setting for which nodes we should skip updating, defaulting to none, like (needs better name): DoNotUpdateModifiedForNodeIDs[]=42. The cronjob would read the same values and update those nodes.

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050956 The solution for approach 1 is then to have a setting for which nodes we should skip updating, defaulting to none, like (needs better name): DoNotUpdateModifiedForNodeIDs[]=42. The cronjob would read the same values and update those nodes.
          Hide
          Tony Chan added a comment - - edited

          Neither Approach 1 nor Approach 2 is good.

          Take a look to the original problem statement: "We are currently encountering a high number of db deadlocks resulting in fatal errors, usually when publishing content. This appears to be due to the update of the subnode modified time in the ezcontentobject_tree table......"

          Probably we need to be very careful when it is mentioned as deadlocks, after investigating this issue for a number of days, certainly it is not deadlock, more important, trying to use php code to solve so-called-deadlock but in fact non-deadlock issue would never be easy, it would create more side effect issues and we would lose more hair.

          Just thinking to optimize the query is not the right direction, "WHERE node_id IN ( 1, 2, 11355, 3083, 3072, 56286 )" vs " WHERE node_id='1' OR node_id='2' OR node_id='220' OR node_id='84458'", using IN or using OR in WHERE-cause has no major difference in this case.

          The issue is about:
          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1279768018 WHERE node_id='1' OR node_id='2' OR node_id='220' OR node_id='84458'

          • Shouldn't interpret this as deadlock then we could find simple and effective solution. *
          Show
          Tony Chan added a comment - - edited Neither Approach 1 nor Approach 2 is good. Take a look to the original problem statement: "We are currently encountering a high number of db deadlocks resulting in fatal errors, usually when publishing content. This appears to be due to the update of the subnode modified time in the ezcontentobject_tree table......" Probably we need to be very careful when it is mentioned as deadlocks, after investigating this issue for a number of days, certainly it is not deadlock, more important, trying to use php code to solve so-called-deadlock but in fact non-deadlock issue would never be easy, it would create more side effect issues and we would lose more hair. Just thinking to optimize the query is not the right direction, "WHERE node_id IN ( 1, 2, 11355, 3083, 3072, 56286 )" vs " WHERE node_id='1' OR node_id='2' OR node_id='220' OR node_id='84458'", using IN or using OR in WHERE-cause has no major difference in this case. The issue is about: Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1279768018 WHERE node_id='1' OR node_id='2' OR node_id='220' OR node_id='84458' Shouldn't interpret this as deadlock then we could find simple and effective solution. *
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050958
          Hi Tony,
          What makes you say it is not a deadlock? Why do you think Gaetanos suggestions will not help? Please elaborate.

          The change from OR to IN() will not be noticeable in single queries, but in high load situations it may matter a lot (depending on which database engine is used, since optimizers are different). Other good reasons for this change is to follow our coding standards, and avoid trouble with Oracle.

          I believe in Gaetanos approach 1, because when you have many editors publishing articles at the same time, they are all trying to update the same table rows and this causes lock timeouts. If we postpone the updating of for example node 1 and 2 to a cronjob, it will be easier to achieve the write lock on the other rows, I think.

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050958 Hi Tony, What makes you say it is not a deadlock? Why do you think Gaetanos suggestions will not help? Please elaborate. The change from OR to IN() will not be noticeable in single queries, but in high load situations it may matter a lot (depending on which database engine is used, since optimizers are different). Other good reasons for this change is to follow our coding standards, and avoid trouble with Oracle. I believe in Gaetanos approach 1, because when you have many editors publishing articles at the same time, they are all trying to update the same table rows and this causes lock timeouts. If we postpone the updating of for example node 1 and 2 to a cronjob, it will be easier to achieve the write lock on the other rows, I think.
          Hide
          Tony Chan added a comment -

          Having many editors edit or publish articles is true, while each editor is working on his own article, probably not the case that many editors work on the same article and publish it at the same time, so from logical point of view, each editor's process should try to lock its own record, make no sense to me that I need to lock your record in order to publish my article.

          From my observation, the editor using ezflow to do the layout, I am quite sure no one else is doing the same thing, while it still gets timeout.

          Show
          Tony Chan added a comment - Having many editors edit or publish articles is true, while each editor is working on his own article, probably not the case that many editors work on the same article and publish it at the same time, so from logical point of view, each editor's process should try to lock its own record, make no sense to me that I need to lock your record in order to publish my article. From my observation, the editor using ezflow to do the layout, I am quite sure no one else is doing the same thing, while it still gets timeout.
          Hide
          Tony Chan added a comment -

          Gabriel Finkelstein has applied the fix to use IN, I have applied the fix to use IN too, both of us saw the issue still exist, that is why I think using IN or using OR has no difference in this case.

          Show
          Tony Chan added a comment - Gabriel Finkelstein has applied the fix to use IN, I have applied the fix to use IN too, both of us saw the issue still exist, that is why I think using IN or using OR has no difference in this case.
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050963
          There are technical design reasons for why editors of different articles need to update the same table rows. For example, it is needed to make caches for the front page expire when an article is published.

          Publishing content affects the same tables, no matter if an article or an eZ Flow layout is published.

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050963 There are technical design reasons for why editors of different articles need to update the same table rows. For example, it is needed to make caches for the front page expire when an article is published. Publishing content affects the same tables, no matter if an article or an eZ Flow layout is published.
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050965
          Ok, I understand. So the IN() fix is not a complete solution, but we still think it is a step in the right direction (for reasons mentioned above).

          We'll continue to work on this.

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050965 Ok, I understand. So the IN() fix is not a complete solution, but we still think it is a step in the right direction (for reasons mentioned above). We'll continue to work on this.
          Hide
          Gaetano Giunta added a comment - - edited

          Interesting blog post on debugging lock waits:
          http://www.chriscalender.com/?p=91

          1. it might be that the lock being held here is one from a replication thread, not from ez code - if you use replication

          2. you'll need recent builds of innodb plugin to get that info unfortunately

          btw: lock monitoring tools:
          http://dev.mysql.com/doc/refman/5.1/en/innodb-monitors.html
          http://code.google.com/p/innotop/

          Show
          Gaetano Giunta added a comment - - edited Interesting blog post on debugging lock waits: http://www.chriscalender.com/?p=91 1. it might be that the lock being held here is one from a replication thread, not from ez code - if you use replication 2. you'll need recent builds of innodb plugin to get that info unfortunately btw: lock monitoring tools: http://dev.mysql.com/doc/refman/5.1/en/innodb-monitors.html http://code.google.com/p/innotop/
          Hide
          Tony Chan added a comment - - edited

          I still don't think it is true deadlock in this case, seems to me it just one-way locking held by other party too long which might be because of some reason, for example slow/busy disk i/o.

          If it is the case that the lock is held by replication then this is still one-way locking as replication probably does not need to wait any lock held by us.

          Show
          Tony Chan added a comment - - edited I still don't think it is true deadlock in this case, seems to me it just one-way locking held by other party too long which might be because of some reason, for example slow/busy disk i/o. If it is the case that the lock is held by replication then this is still one-way locking as replication probably does not need to wait any lock held by us.
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050968
          You're right, deadlock is not the correct word.

          I am preparing to test Gaetanos suggestions.

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050968 You're right, deadlock is not the correct word. I am preparing to test Gaetanos suggestions.
          Hide
          Gaetano Giunta added a comment -

          Novel approach 3:
          . do not remove that col from the db
          . do not update it in real time
          . have a cronjob fix it from time to time

          algorithm for cronjob is easy:

          • select all nodes updates since last cronjob run
          • for each of them
          • update all nodes in their path chain whose subtree_updated value is less or equal to current node update time

          If the subtree_updated value is not used for cache calculations but only for sorting things in templates, it could be viable to have it updated on a delay.

          Horrible hack: have a new ini setting allowing this data update realtime or via cronjob...

          Show
          Gaetano Giunta added a comment - Novel approach 3: . do not remove that col from the db . do not update it in real time . have a cronjob fix it from time to time algorithm for cronjob is easy: select all nodes updates since last cronjob run for each of them update all nodes in their path chain whose subtree_updated value is less or equal to current node update time If the subtree_updated value is not used for cache calculations but only for sorting things in templates, it could be viable to have it updated on a delay. Horrible hack: have a new ini setting allowing this data update realtime or via cronjob...
          Hide
          Gabriel Finkelstein added a comment - - edited

          In reply to comment #050970
          The downside is that cache-blocks with subtree_expiry won't refresh instantly, but only after the cronjob has run.

          Show
          Gabriel Finkelstein added a comment - - edited In reply to comment #050970 The downside is that cache-blocks with subtree_expiry won't refresh instantly, but only after the cronjob has run.
          Hide
          Tony Chan added a comment - - edited

          I still think to using php code to solve non-deaklock timeout would create unwanted side effects and would lose more hair.

          My simple solution here:

          Increased the: TransactionDeadlockDetectionTimeout=5000
          (from 1200ms default)

          (from my point of view, the best way to handle one-way locking timeout is: don't leave there too soon)

          Show
          Tony Chan added a comment - - edited I still think to using php code to solve non-deaklock timeout would create unwanted side effects and would lose more hair. My simple solution here: Increased the: TransactionDeadlockDetectionTimeout=5000 (from 1200ms default) (from my point of view, the best way to handle one-way locking timeout is: don't leave there too soon)
          Hide
          Tony Chan added a comment -

          Further finding:

          seems the problem coming from everybody needs to lock Node-1 and node-2.

          Show
          Tony Chan added a comment - Further finding: seems the problem coming from everybody needs to lock Node-1 and node-2.
          Hide
          Tony Chan added a comment -

          [ Jul 26 2010 09:36:19 ] Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_treeSET remote_id='277a7380036daf5f1271e9ace6a85974'WHERE node_id='2'

          [ Jul 26 2010 09:36:19 ] eZDBInterface::commit TRANSID-8086466be2ac677e749bd0d6c91d26a1:Transaction in progress failed due to DB error, transaction was rollbacked. Transaction ID is TRANSID-8086466be2ac677e749bd0d6c91d26a1.

          Seems the real problem is everybody need to lock Node 2 when publishing new articles or changing layout in ez flow

          Show
          Tony Chan added a comment - [ Jul 26 2010 09:36:19 ] Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_treeSET remote_id='277a7380036daf5f1271e9ace6a85974'WHERE node_id='2' [ Jul 26 2010 09:36:19 ] eZDBInterface::commit TRANSID-8086466be2ac677e749bd0d6c91d26a1:Transaction in progress failed due to DB error, transaction was rollbacked. Transaction ID is TRANSID-8086466be2ac677e749bd0d6c91d26a1. Seems the real problem is everybody need to lock Node 2 when publishing new articles or changing layout in ez flow
          Hide
          Tony Chan added a comment -

          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree
          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280105331 WHERE node_id IN ( 1, 2 )
          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280105347 WHERE node_id IN ( 1, 5, 12, 83926 )
          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280105398 WHERE node_id IN ( 1, 2, 224, 85293 )
          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280105553 WHERE node_id IN ( 1, 43, 85220 )
          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280105553 WHERE node_id IN ( 1, 2 )
          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280109308 WHERE node_id IN ( 1, 2, 1072 )
          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280109313 WHERE node_id IN ( 1, 43, 51, 85308 )
          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree

          Show
          Tony Chan added a comment - Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280105331 WHERE node_id IN ( 1, 2 ) Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280105347 WHERE node_id IN ( 1, 5, 12, 83926 ) Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280105398 WHERE node_id IN ( 1, 2, 224, 85293 ) Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280105553 WHERE node_id IN ( 1, 43, 85220 ) Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280105553 WHERE node_id IN ( 1, 2 ) Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280109308 WHERE node_id IN ( 1, 2, 1072 ) Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree SET modified_subnode=1280109313 WHERE node_id IN ( 1, 43, 51, 85308 ) Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezcontentobject_tree
          Hide
          (inactive) Gunnstein Lye added a comment -

          Hi Tony,

          Did the increase of TransactionDeadlockDetectionTimeout solve your problem?

          "seems the problem coming from everybody needs to lock Node-1 and node-2."

          Yes, this is why Gaetano suggested to not update these nodes on publishing, but instead update them later via a cronjob. I am working on a test for this.

          Show
          (inactive) Gunnstein Lye added a comment - Hi Tony, Did the increase of TransactionDeadlockDetectionTimeout solve your problem? "seems the problem coming from everybody needs to lock Node-1 and node-2." Yes, this is why Gaetano suggested to not update these nodes on publishing, but instead update them later via a cronjob. I am working on a test for this.
          Hide
          Tony Chan added a comment - - edited

          Found and added one more parameter this afternoon:

          innodb_lock_wait_timeout=120 (default is 50 sec)

          (probably innodb_lock_wait_timeout is the right parameter to increase the wait_timeout)

          Show
          Tony Chan added a comment - - edited Found and added one more parameter this afternoon: innodb_lock_wait_timeout=120 (default is 50 sec) (probably innodb_lock_wait_timeout is the right parameter to increase the wait_timeout)
          Hide
          (inactive) Gunnstein Lye added a comment -

          The first test method suggested by Gaetano works (running two publishing scripts in parallel with a delay inserted in the code) - I get the same lock wait timeout.

          Interesting: The sleep( 100 ) I added to eZContentObjectTreeNode::updateAndStoreModified() causes the publish script to run for 8 minutes (even when running only one script instance)! I would expect less than 2 minutes. It turns out that updateAndStoreModified() is run 5 times when publishing one object! It is run once with "TEMPPATH" as node parameter, and 4 times with the real nodes (1, 2, 68 in my case). I need to find out why this happens. If it is a bug, it should be fixed before we implement Gaetanos cronjob suggestion.

          Show
          (inactive) Gunnstein Lye added a comment - The first test method suggested by Gaetano works (running two publishing scripts in parallel with a delay inserted in the code) - I get the same lock wait timeout. Interesting: The sleep( 100 ) I added to eZContentObjectTreeNode::updateAndStoreModified() causes the publish script to run for 8 minutes (even when running only one script instance)! I would expect less than 2 minutes. It turns out that updateAndStoreModified() is run 5 times when publishing one object! It is run once with "TEMPPATH" as node parameter, and 4 times with the real nodes (1, 2, 68 in my case). I need to find out why this happens. If it is a bug, it should be fixed before we implement Gaetanos cronjob suggestion.
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050978
          This was a false alarm. eZContentObjectTreeNode::updateAndStoreModified() runs only once when publishing from the web interface. It is my testing CLI script that causes it to be run 5 times.

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050978 This was a false alarm. eZContentObjectTreeNode::updateAndStoreModified() runs only once when publishing from the web interface. It is my testing CLI script that causes it to be run 5 times.
          Hide
          (inactive) Gunnstein Lye added a comment - - edited

          I have tried filtering out node 1 and 2 from eZContentObjectTreeNode::updateAndStoreModified() and rerunning the test with two simultaneous publishing scripts. This solves the lock wait timeout when updating ezcontentobject_tree.modified_subnode.

          However, I get another lock wait timeout instead, on the url alias query:

          INSERT INTO ezurlalias_ml (id, parent, lang_mask, text, text_md5, action, action_type, link, is_alias, is_original, alias_redirects)
          VALUES(63, 0, 3, 'Hello-World6', '74572d0a0c7152ba0b4cc23a78856511', 'eznode:87', 'eznode', 63, 0, 1, 1)
          

          This is a bit odd since such a simple INSERT query should only lock the inserted row, not the whole table:
          http://dev.mysql.com/doc/refman/5.1/en/innodb-locks-set.html
          However, there may be other queries on the table that hold the lock.

          This lock may not be a problem in real life however - in my test I am using a 100 second sleep in the code. Time will show.

          Show
          (inactive) Gunnstein Lye added a comment - - edited I have tried filtering out node 1 and 2 from eZContentObjectTreeNode::updateAndStoreModified() and rerunning the test with two simultaneous publishing scripts. This solves the lock wait timeout when updating ezcontentobject_tree.modified_subnode. However, I get another lock wait timeout instead, on the url alias query: INSERT INTO ezurlalias_ml (id, parent, lang_mask, text, text_md5, action, action_type, link, is_alias, is_original, alias_redirects) VALUES(63, 0, 3, 'Hello-World6', '74572d0a0c7152ba0b4cc23a78856511', 'eznode:87', 'eznode', 63, 0, 1, 1) This is a bit odd since such a simple INSERT query should only lock the inserted row, not the whole table: http://dev.mysql.com/doc/refman/5.1/en/innodb-locks-set.html However, there may be other queries on the table that hold the lock. This lock may not be a problem in real life however - in my test I am using a 100 second sleep in the code. Time will show.
          Hide
          Gaetano Giunta added a comment -

          In reply to comment #050981
          Could this be related to the fact that we did not add the autoincrementing id column to that table but used a separate, one-column table instead?

          I do not want to wander too far off, but it might be an easy thing to test - add that col to the table, and comment off the call to get a new url id via php...

          Show
          Gaetano Giunta added a comment - In reply to comment #050981 Could this be related to the fact that we did not add the autoincrementing id column to that table but used a separate, one-column table instead? I do not want to wander too far off, but it might be an easy thing to test - add that col to the table, and comment off the call to get a new url id via php...
          Hide
          Gaetano Giunta added a comment -

          In reply to comment #050981
          As a side note: could you test using READ COMMITTED transaction isolation level to see if it changes anything?

          Show
          Gaetano Giunta added a comment - In reply to comment #050981 As a side note: could you test using READ COMMITTED transaction isolation level to see if it changes anything?
          Hide
          Tony Chan added a comment - - edited

          Here, I changed some MYSQL config:

          #
          TransactionDeadlockDetectionTimeout=10000
          HeartbeatIntervalDbDb=3000
          HeartbeatIntervalDbApi=3000
          #
          innodb_lock_wait_timeout=500
          #

          In the past 4 hours, haven't found any lock timeout yet.

          Will see if this really works.

          Show
          Tony Chan added a comment - - edited Here, I changed some MYSQL config: # TransactionDeadlockDetectionTimeout=10000 HeartbeatIntervalDbDb=3000 HeartbeatIntervalDbApi=3000 # innodb_lock_wait_timeout=500 # In the past 4 hours, haven't found any lock timeout yet. Will see if this really works.
          Hide
          (inactive) Gunnstein Lye added a comment - - edited

          I have attached my suggestion for solving the timeouts on the modified_subnode query. It adds a setting for the node IDs that will not be updated during publishing. Instead they will be updated when the frequent cronjob part runs. The patch is for master/4.4.0 and can be applied using

          patch -p1 < issue-16340-modified_subnode-master.diff
          

          Please note that the patch is not activated by default. To activate, enter the nodes you want in your settings/override/content.ini.append.php
          for example like this:

          [NodeSettings]
          # Top level root node
          SkipUpdateModifiedNodes[]=1
          # Content root node
          SkipUpdateModifiedNodes[]=2
          # Users root node
          SkipUpdateModifiedNodes[]=5
          # Media root node
          SkipUpdateModifiedNodes[]=43
          

          issue-16340-modified_subnode-master.diff

          Show
          (inactive) Gunnstein Lye added a comment - - edited I have attached my suggestion for solving the timeouts on the modified_subnode query. It adds a setting for the node IDs that will not be updated during publishing. Instead they will be updated when the frequent cronjob part runs. The patch is for master/4.4.0 and can be applied using patch -p1 < issue-16340-modified_subnode-master.diff Please note that the patch is not activated by default. To activate, enter the nodes you want in your settings/override/content.ini.append.php for example like this: [NodeSettings] # Top level root node SkipUpdateModifiedNodes[]=1 # Content root node SkipUpdateModifiedNodes[]=2 # Users root node SkipUpdateModifiedNodes[]=5 # Media root node SkipUpdateModifiedNodes[]=43 issue-16340-modified_subnode-master.diff
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050985
          That's great, I hope it works!

          The danger of increasing the timout values is that your database can be flooded with waiting requests, so I think we still need to increase the efficiency of the publishing operation.

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050985 That's great, I hope it works! The danger of increasing the timout values is that your database can be flooded with waiting requests, so I think we still need to increase the efficiency of the publishing operation.
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050983
          @GG:
          I saw no difference when using READ COMMITTED transaction isolation level.

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050983 @GG: I saw no difference when using READ COMMITTED transaction isolation level.
          Hide
          (inactive) Gunnstein Lye added a comment -

          I'm reopening the issue due to continued timeout reports.

          Show
          (inactive) Gunnstein Lye added a comment - I'm reopening the issue due to continued timeout reports.
          Hide
          Tony Chan added a comment - - edited

          Got 10 Query errors this morning

          Show
          Tony Chan added a comment - - edited Got 10 Query errors this morning
          Hide
          Gaetano Giunta added a comment -

          Some more lateral thinking:

          changing the generated query from

          UPDATE ezcontentobject_tree SET modified_subnode=1280105347 WHERE node_id IN ( 1, 5, 12, 83926 )

          to

          UPDATE ezcontentobject_tree SET modified_subnode=1280105347 WHERE node_id IN ( 1, 5, 12, 83926 ) AND modified_subnode<1280105347

          might be beneficial, esp. when

          • the update query is run 5 or 6 times for publishing a single object
          • the db locks for update only the rows that satisfy the where condition
          • objects are very frequently updated (ie. two objects updated at the same second)

          might have unexpected side effects when

          • a server has a clock in the future, then the clock is reset to current time by an admin
          • there are multiple servers updating content in parallel, whose clocks are not in sync

          note that the side effects might in fact be beneficial in those two cases, not necessarily bad...

          Show
          Gaetano Giunta added a comment - Some more lateral thinking: changing the generated query from UPDATE ezcontentobject_tree SET modified_subnode=1280105347 WHERE node_id IN ( 1, 5, 12, 83926 ) to UPDATE ezcontentobject_tree SET modified_subnode=1280105347 WHERE node_id IN ( 1, 5, 12, 83926 ) AND modified_subnode<1280105347 might be beneficial, esp. when the update query is run 5 or 6 times for publishing a single object the db locks for update only the rows that satisfy the where condition objects are very frequently updated (ie. two objects updated at the same second) might have unexpected side effects when a server has a clock in the future, then the clock is reset to current time by an admin there are multiple servers updating content in parallel, whose clocks are not in sync note that the side effects might in fact be beneficial in those two cases, not necessarily bad...
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050991
          "the update query is run 5 or 6 times for publishing a single object "

          For the record: On web publishing it runs only once. On CLI publishing I see it running 5 or 6 times depending on method.

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050991 "the update query is run 5 or 6 times for publishing a single object " For the record: On web publishing it runs only once. On CLI publishing I see it running 5 or 6 times depending on method.
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050991
          "the db locks for update only the rows that satisfy the where condition"

          Do you know if it does? For which databases?

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050991 "the db locks for update only the rows that satisfy the where condition" Do you know if it does? For which databases?
          Hide
          Tony Chan added a comment -

          Applied your latest patch, seems no more timeout issues so far

          Show
          Tony Chan added a comment - Applied your latest patch, seems no more timeout issues so far
          Hide
          Tony Chan added a comment -

          Applied your latest patch, seems no more timeout issues so far

          Show
          Tony Chan added a comment - Applied your latest patch, seems no more timeout issues so far
          Hide
          Brendan Pike added a comment -

          I believe I'm experiencing this on a 4.2.0 site when trying to use updatesearchindex.php ie.

          Starting object re-indexing
          Number of objects to index: 23586
          Fatal error: A database transaction in eZ Publish failed.

          Error: (eZMySQLDB)
          Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezsearch_word SET object_count=( object_count - 1 ) WHERE id in ( 1 )

          Will this be patched soon for 4.2.0 also?

          Show
          Brendan Pike added a comment - I believe I'm experiencing this on a 4.2.0 site when trying to use updatesearchindex.php ie. Starting object re-indexing Number of objects to index: 23586 Fatal error: A database transaction in eZ Publish failed. Error: (eZMySQLDB) Query error: Lock wait timeout exceeded; try restarting transaction. Query: UPDATE ezsearch_word SET object_count=( object_count - 1 ) WHERE id in ( 1 ) Will this be patched soon for 4.2.0 also?
          Hide
          Gaetano Giunta added a comment -

          Any updates on the status of thepatch that prevents updating the modified_subnodes value for nodes at the root of the tree?
          I am seeing db deadlocks at another customer, where a lot of user accounts get created in parallel...

          Show
          Gaetano Giunta added a comment - Any updates on the status of thepatch that prevents updating the modified_subnodes value for nodes at the root of the tree? I am seeing db deadlocks at another customer, where a lot of user accounts get created in parallel...
          Hide
          (inactive) Gunnstein Lye added a comment -

          In reply to comment #050998
          The patch was rejected by the review board. OMS and AR have been discussing other solutions, but I have not heard of any results yet.

          Show
          (inactive) Gunnstein Lye added a comment - In reply to comment #050998 The patch was rejected by the review board. OMS and AR have been discussing other solutions, but I have not heard of any results yet.
          Hide
          Gabriel Finkelstein added a comment -

          I commented the "UPDATE ezcontentobject_tree SET modified_subnode=..." since I didn't need it, and now I'm getting the lock wait timeout exceeded message but with this query:
          INSERT INTO ezcontentobject_tree (parent_node_id, main_node_id, contentobject_id, contentobject_version, contentobject_is_published, depth, sort_field, sort_order, priority, modified_subnode, path_string, path_identification_string, remote_id, is_hidden, is_invisible) VALUES(3072, 0, 79053, 1, 0, 5, 1, 1, 0, 0, '/TEMPPATH', '', '09f0768ae8be5969ffbc6344d3d3f781', 0, 0)
          So I suppose the problem is not caused by updating the modified_subnode value, but by some other query that runs before both of these.

          Show
          Gabriel Finkelstein added a comment - I commented the "UPDATE ezcontentobject_tree SET modified_subnode=..." since I didn't need it, and now I'm getting the lock wait timeout exceeded message but with this query: INSERT INTO ezcontentobject_tree (parent_node_id, main_node_id, contentobject_id, contentobject_version, contentobject_is_published, depth, sort_field, sort_order, priority, modified_subnode, path_string, path_identification_string, remote_id, is_hidden, is_invisible) VALUES(3072, 0, 79053, 1, 0, 5, 1, 1, 0, 0, '/TEMPPATH', '', '09f0768ae8be5969ffbc6344d3d3f781', 0, 0) So I suppose the problem is not caused by updating the modified_subnode value, but by some other query that runs before both of these.
          Hide
          Bertrand Dunogier added a comment -

          As highlighted earlier by Gaetano, another possible track would be to limit the transaction that includes this query. A query that affects so many rows should NOT be part of such a huge transaction. Either this particular query could be moved outside the transaction (after, I'd say), or other queries can be moved out.

          The issue with delayed execution (e.g. asynchronous) of this query is that it affects clearing of subtree based cache blocks, and will delay their update. This is by itself not an issue, but ViewCaching would STILL be cleared during publishing, and this would lead to inconsistencies as ViewCache files could be updated with outdated cache blocks.

          If such a workaround improves things in an order of magnitude that resolves the current issues, eZ Engineering is thinking on the larger scales: what do we use this field for exactly; can we use something else instead ? if we need this field, what is the true purpose, and how can we make the feature behind it more efficient.

          I must also add that proper MySQL tuning will improve the situation; it won't remove the LOCK timeout issue, but will make it less likely to happen, as queries will execute faster.

          To conclude, about raising the wait timeout: yes, this is a valid workaround, but it has one major defect that makes it a non option from a product point of vue: it degrades the editor's experience. Publishing MUST be either fast or asynchronous.

          Show
          Bertrand Dunogier added a comment - As highlighted earlier by Gaetano, another possible track would be to limit the transaction that includes this query. A query that affects so many rows should NOT be part of such a huge transaction. Either this particular query could be moved outside the transaction (after, I'd say), or other queries can be moved out. The issue with delayed execution (e.g. asynchronous) of this query is that it affects clearing of subtree based cache blocks, and will delay their update. This is by itself not an issue, but ViewCaching would STILL be cleared during publishing, and this would lead to inconsistencies as ViewCache files could be updated with outdated cache blocks. If such a workaround improves things in an order of magnitude that resolves the current issues, eZ Engineering is thinking on the larger scales: what do we use this field for exactly; can we use something else instead ? if we need this field, what is the true purpose, and how can we make the feature behind it more efficient. I must also add that proper MySQL tuning will improve the situation; it won't remove the LOCK timeout issue, but will make it less likely to happen, as queries will execute faster. To conclude, about raising the wait timeout: yes, this is a valid workaround, but it has one major defect that makes it a non option from a product point of vue: it degrades the editor's experience. Publishing MUST be either fast or asynchronous.
          Hide
          Gaetano Giunta added a comment -

          http://dev.mysql.com/doc/refman/5.1/en/replication-sbr-rbr.html

          It says that when using replication, SBR mode needs more locking than RBR mode. Might be worth investigating...

          Show
          Gaetano Giunta added a comment - http://dev.mysql.com/doc/refman/5.1/en/replication-sbr-rbr.html It says that when using replication, SBR mode needs more locking than RBR mode. Might be worth investigating...
          Hide
          Bertrand Dunogier added a comment -

          This situation is now fixed by using the asynchronous publishing system introduced in eZ Publish 4.5:
          https://github.com/ezsystems/ezpublish/blob/master/doc/features/4.5/ezasynchronouspublisher.txt

          Show
          Bertrand Dunogier added a comment - This situation is now fixed by using the asynchronous publishing system introduced in eZ Publish 4.5: https://github.com/ezsystems/ezpublish/blob/master/doc/features/4.5/ezasynchronouspublisher.txt
          Hide
          Philipp Kamps added a comment -

          I don't think it's a problem of a single sql query. The entire publishing process (ezp core php code) is wrapped into a DB transaction. That transaction includes ez publishing pre/post trigger handling. So let's say you have a custom publishing post trigger which takes some time to execute, it will become more likely that you run into the "Lock wait timeout exceeded" error. The default setting for inno DB tables is 50 seconds.

          Show
          Philipp Kamps added a comment - I don't think it's a problem of a single sql query. The entire publishing process (ezp core php code) is wrapped into a DB transaction. That transaction includes ez publishing pre/post trigger handling. So let's say you have a custom publishing post trigger which takes some time to execute, it will become more likely that you run into the "Lock wait timeout exceeded" error. The default setting for inno DB tables is 50 seconds.
          Hide
          Geoff Bentley added a comment -

          Async publishing doesn't fix the problem - but it does delay it.

          Show
          Geoff Bentley added a comment - Async publishing doesn't fix the problem - but it does delay it.

            People

            • Assignee:
              Patrick Allaert
              Reporter:
              (inactive) Gunnstein Lye
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: