When using MySQL/MariaDB with Nutch, some settings need to be tuned for Nutch to work as expected.

Because Nutch by default buries exceptions under piles of log output, you may think you have fetched all the URLs when in fact you have not.

For example, here is an exception "java.sql.BatchUpdateException: The last packet successfully received from the server was 73,558,766 milliseconds ago. The last packet sent successfully to the server was 73,558,766 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem."

To avoid this, set:

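  • JDBC: add autoReconnect=true to the connection URL, e.g. url="jdbc:mysql://localhost:3306/confluence?autoReconnect=true"
  • MySQL: raise wait_timeout and interactive_timeout, either
    • in my.cnf or my.cnf.d/server.cnf (takes effect after a server restart)
    • or at runtime, without restarting the MySQL server (mysqld):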

$ mysql -uroot -p -e "SET GLOBAL wait_timeout=100000; SET GLOBAL interactive_timeout=100000;"

To check the current global and session settings:

> SELECT @@GLOBAL.wait_timeout, @@GLOBAL.interactive_timeout, @@SESSION.wait_timeout, @@SESSION.interactive_timeout;
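
For the my.cnf route mentioned above, a minimal sketch of the persistent setting could look like the fragment below. It assumes the options belong in the [mysqld] section of whichever option file your installation reads, and it simply reuses the 100000-second value from the SET GLOBAL command; pick a value that matches how long your Nutch jobs can leave a connection idle.

```ini
# Assumed location: my.cnf or my.cnf.d/server.cnf, in the server section.
[mysqld]
# Seconds the server waits on an idle non-interactive connection (e.g. JDBC) before closing it.
wait_timeout = 100000
# Same limit for interactive clients such as the mysql command-line tool.
interactive_timeout = 100000
```

Values set with SET GLOBAL are lost when mysqld restarts, so the option file is the place to keep them permanently. Restart the server after editing the file for the change to take effect.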