Monday, February 25, 2013

Post-Mortem Analysis Of Virus Database Push Issues

On Thursday, 14 Feb 2013, in preparation for the coming ClamAV 0.98 release, a new database was scheduled to be made available to users. We had a set of issues while performing this upgrade, and we feel that it is appropriate to let our users and mirror providers know what happened, what has done to fix the issues, and what is being done to prevent these issues from happening again.

So first, What Happened?

  1. 14 Feb 2013 0800 EST: Start of our scheduled work on our infrastructure.
  2. 14 Feb 2013 0815 EST: A new, custom daily.cvd (our virus definition database) was published. This database was generated with ClamAV 0.98, which in turn caused freshclam to think that a new version of ClamAV was available (not yet, but there will be).
  3. 14 Feb 2013 0830 EST: Published a new daily.cvd, generated with ClamAV 0.97.6, the current version of ClamAV. This corrected the issue with incorrect notifications of a new version of ClamAV.
  4. 14 Feb 2013 1100 EST: Clients report errors with updating. Investigation starts.
  5. 14 Feb 2013 1130 EST: The problem was isolated. The new database wasn't copied into a critical directory on our internal Signature server. The database publishing infrastructure didn't know that a custom database had been published. The custom database was overwritten with a new database.  This resulted in some users being unable to use the .cdiff files (our incremental update files) for updating, leading to users who had downloaded the custom database to be unable to update.
  6. 14 Feb 2013 1330 EST: A new database was published to resolve the issues. Issues should now be resolved for most users.
  7. 19 Feb 2013 1700 EST: Issues resolved for all remaining users by modifying the set of available .cdiff files.
Fixes That Have Been Performed

We've deleted all database files that would cause errors. This should fix the remainder of issues for our users.  However, any users who are still seeing errors should delete the mirrors.dat file in their database directory to force a reset of mirror selection.

Prevention

We've put in place a workflow that will prevent issues like this from popping up. A full change-management process is in place, with an emphasis on peer-reviewed planning, comprehensive test plans and appropriate personnel assignments.  Change plans will be approved by a senior administrator, a ClamAV developer and a representative from the analyst team.

For the convenience of our mirror providers, there is now a set maintenance window for routine changes: Monday 5pm EST through midnight EST.  As always, we will aim to notify mirror providers a week in advance of any change.  In the case of emergent issues, a different time or a shorter notification may be required.

We apologize for any inconvenience caused by the problems outlined in this post.  We will continue to review our processes to ensure that we are providing the best experience for both our users and our mirror providers.