Interrupted compact can leave extra hashes in volumes #5184

Open
Jojo-1000 opened this issue May 11, 2024 · 2 comments

Jojo-1000 commented May 11, 2024

Taken from a merged PR, because the issue still exists. An interrupted compact operation can leave the database in an inconsistent state, where some blocks were deleted from the Block table but still exist in the actual dblock volume. This causes an error when full remote verification is turned on.

Test case to reproduce: Jojo-1000@aab7c63


foreach (var entry in new AsyncDownloader(volumesToDownload, backend))
{
    using (var tmpfile = entry.TempFile)
    {
        if (m_result.TaskControlRendevouz() == TaskControlState.Stop)
        {
            backend.WaitForComplete(db, transaction);
            return false;
        }

        downloadedVolumes.Add(new KeyValuePair<string, long>(entry.Name, entry.Size));
        var inst = VolumeBase.ParseFilename(entry.Name);
        using (var f = new BlockVolumeReader(inst.CompressionModule, tmpfile, m_options))
        {
            foreach (var e in f.Blocks)
            {
                if (q.UseBlock(e.Key, e.Value, transaction))
                {
                    //TODO: How do we get the compression hint? Reverse query for filename in db?
                    var s = f.ReadBlock(e.Key, buffer);
                    if (s != e.Value)
                        throw new Exception(string.Format("Size mismatch problem for block {0}, {1} vs {2}", e.Key, s, e.Value));

                    newvol.AddBlock(e.Key, buffer, 0, s, Duplicati.Library.Interface.CompressionHint.Compressible);
                    if (newvolindex != null)
                        newvolindex.AddBlock(e.Key, e.Value);

                    db.MoveBlockToNewVolume(e.Key, e.Value, newvol.VolumeID, transaction);
                    blocksInVolume++;

                    if (newvol.Filesize > m_options.VolumeSize)
                    {
                        FinishVolumeAndUpload(db, backend, newvol, newvolindex, uploadedVolumes);

                        newvol = new BlockVolumeWriter(m_options);
                        newvol.VolumeID = db.RegisterRemoteVolume(newvol.RemoteFilename, RemoteVolumeType.Blocks, RemoteVolumeState.Temporary, transaction);

                        if (m_options.IndexfilePolicy != Options.IndexFileStrategy.None)
                        {
                            newvolindex = new IndexVolumeWriter(m_options);
                            newvolindex.VolumeID = db.RegisterRemoteVolume(newvolindex.RemoteFilename, RemoteVolumeType.Index, RemoteVolumeState.Temporary, transaction);
                            db.AddIndexBlockLink(newvolindex.VolumeID, newvol.VolumeID, transaction);
                            newvolindex.StartVolume(newvol.RemoteFilename);
                        }

                        blocksInVolume = 0;

                        //After we upload this volume, we can delete all previous encountered volumes
                        deletedVolumes.AddRange(DoDelete(db, backend, deleteableVolumes, ref transaction));
                        deleteableVolumes = new List<IRemoteVolume>();
                    }
                }
            }
        }

        deleteableVolumes.Add(entry);
    }
}

  • Compact runs through all the blocks in each volume (entry) that should be compacted (Line 146)
  • Used blocks are added to the new volume newvol. In the database transaction, their volume IDs are moved from entry to newvol
  • If newvol is full, it is uploaded and the deletable volumes are removed. DoDelete causes a transaction commit, so that the backend changes are not lost if Duplicati crashes (Line 194)
  • Most likely, there are still blocks left in the current volume, so a new newvol is started and the remaining blocks are added
  • Once all blocks from entry are handled, entry is added to deleteableVolumes (Line 201), which will be deleted after the next upload
  • entry is finished, so a new volume is downloaded
  • A network error throws an exception
  • The transaction is rolled back; the current newvol was not uploaded, so the volumes in deleteableVolumes can't be deleted
  • This leaves entry with
    • blocks that were removed from its DB entries, because they were already uploaded in the previous newvol, and
    • blocks that were rolled back from the aborted newvol to entry
    All of these blocks are still present on the remote, so the former show up as extra hashes (a minimal sketch of this partial-commit state follows below).

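To make the partial-commit effect concrete, here is a minimal, self-contained sketch (not Duplicati code; it only assumes a simplified Block table with Hash and VolumeID columns) showing how a commit in the middle of the loop makes the earlier block moves permanent while the later ones are rolled back, using Microsoft.Data.Sqlite:

using Microsoft.Data.Sqlite;

class PartialCommitSketch
{
    static void Main()
    {
        using var conn = new SqliteConnection("Data Source=:memory:");
        conn.Open();
        Exec(conn, null, "CREATE TABLE Block (Hash TEXT, VolumeID INTEGER)");
        // Volume 1 plays the role of entry: it holds blocks A, B and C on the remote.
        Exec(conn, null, "INSERT INTO Block VALUES ('A', 1), ('B', 1), ('C', 1)");

        // First part of the loop: block A is moved to the new volume (2),
        // then DoDelete forces a commit before any remote file is deleted.
        var tx = conn.BeginTransaction();
        Exec(conn, tx, "UPDATE Block SET VolumeID = 2 WHERE Hash = 'A'");
        tx.Commit();

        // Second part: B and C are moved as well, but then a network error aborts
        // the operation and the current transaction is rolled back.
        tx = conn.BeginTransaction();
        Exec(conn, tx, "UPDATE Block SET VolumeID = 2 WHERE Hash = 'B'");
        Exec(conn, tx, "UPDATE Block SET VolumeID = 2 WHERE Hash = 'C'");
        tx.Rollback();

        // Result: A points to volume 2, B and C still point to volume 1, but the
        // remote file for volume 1 still physically contains A, B and C. Full
        // verification therefore reports A as an extra hash in volume 1.
        using var query = conn.CreateCommand();
        query.CommandText = "SELECT Hash, VolumeID FROM Block ORDER BY Hash";
        using var reader = query.ExecuteReader();
        while (reader.Read())
            System.Console.WriteLine($"{reader.GetString(0)} -> volume {reader.GetInt64(1)}");
    }

    static void Exec(SqliteConnection conn, SqliteTransaction tx, string sql)
    {
        using var cmd = conn.CreateCommand();
        cmd.Transaction = tx;
        cmd.CommandText = sql;
        cmd.ExecuteNonQuery();
    }
}
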
At least in my test, the verification always reported the extra blocks twice: once for an index volume and once for a block volume. There is never more than one block volume with extra hashes, which is consistent with the explanation above.

The transaction commit points are not trivial to change:

  • Any remote file deletion requires a commit before the delete is performed, otherwise data can be lost (a sketch of this ordering follows the list)
  • We can't commit before the volumes are uploaded, otherwise blocks from entry move to newvol in the database while newvol has not yet been uploaded correctly
  • This means the code already commits at every point where that is possible
  • We also can't upload the current newvol whenever entry is finished, because that would prevent compact from combining smaller volumes into bigger ones
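As an illustration of the first constraint, here is a hedged sketch of the ordering DoDelete has to follow, written as a method inside the compact handler. The shape and the signatures (db.BeginTransaction, backend.Delete) are hypothetical and only serve to show the commit-before-delete rule; this is not the actual Duplicati method:

// Hypothetical sketch: commit the database first, then delete remote files, then
// continue in a fresh transaction. If Duplicati crashes between the commit and the
// deletes, the database merely over-reports remote files; the reverse order could
// lose track of files that were already deleted.
private IEnumerable<KeyValuePair<string, long>> DoDeleteSketch(
    LocalDeleteDatabase db, BackendManager backend,
    List<IRemoteVolume> deleteableVolumes, ref System.Data.IDbTransaction transaction)
{
    // 1. Make the pending block moves and volume-state changes durable.
    transaction.Commit();
    transaction = db.BeginTransaction();   // hypothetical helper for a fresh transaction

    // 2. Only now is it safe to remove the old dblock/dindex files remotely.
    var deleted = new List<KeyValuePair<string, long>>();
    foreach (var vol in deleteableVolumes)
    {
        backend.Delete(vol.Name, vol.Size); // hypothetical signature
        deleted.Add(new KeyValuePair<string, long>(vol.Name, vol.Size));
    }
    return deleted;
}
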

Unless someone can come up with a different transaction scheme that resolves these issues, the moved blocks in entry have to be marked as duplicate, obsolete (in a new separate table), or deleted, in the same transaction that moves them to newvol (a rough sketch follows the list below).

If the blocks are set to deleted (which I would prefer on first thought), it needs to be verified that:

  • the DeletedBlock and Block tables can contain the same block hashes with different volumes
  • this doesn't cause problems when the same block is deleted again, which would put multiple identical hashes in DeletedBlock
  • the interrupted entry finishes compacting in the next compact operation that runs, otherwise a half-wasted volume remains (this should not be an issue if the deleted blocks work correctly)
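A rough sketch of the "mark as deleted in the same transaction" variant is shown below. The helper is hypothetical; it assumes Block and DeletedBlock both have Hash, Size and VolumeID columns and uses the same positional-parameter style as the existing database code, none of which is verified here:

// Hypothetical helper: in one transaction, remember the old copy as deleted and
// re-point the live Block entry to the new volume, so no state is lost on rollback.
void MoveBlockToNewVolumeKeepingDeletedCopy(
    System.Data.IDbConnection connection, string hash, long size,
    long oldVolumeId, long newVolumeId, System.Data.IDbTransaction transaction)
{
    using (var cmd = connection.CreateCommand())
    {
        cmd.Transaction = transaction;

        // The old volume still physically contains this block; record that fact.
        cmd.CommandText = @"INSERT INTO ""DeletedBlock"" (""Hash"", ""Size"", ""VolumeID"") VALUES (?, ?, ?)";
        AddParameter(cmd, hash);
        AddParameter(cmd, size);
        AddParameter(cmd, oldVolumeId);
        cmd.ExecuteNonQuery();

        // Re-point the live Block entry to the new volume.
        cmd.Parameters.Clear();
        cmd.CommandText = @"UPDATE ""Block"" SET ""VolumeID"" = ? WHERE ""Hash"" = ? AND ""Size"" = ? AND ""VolumeID"" = ?";
        AddParameter(cmd, newVolumeId);
        AddParameter(cmd, hash);
        AddParameter(cmd, size);
        AddParameter(cmd, oldVolumeId);
        cmd.ExecuteNonQuery();
    }
}

static void AddParameter(System.Data.IDbCommand cmd, object value)
{
    var p = cmd.CreateParameter();
    p.Value = value;
    cmd.Parameters.Add(p);
}
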

Originally posted by @Jojo-1000 in #4967 (comment)

@Jojo-1000 (Contributor, Author)

I thought a bit more about this, especially how it will work with database recreate.

I think the only possibility is to record the blocks in the DuplicateBlock table, because a database recreate can't know which version of the block should be the deleted one. Instead it will mark all of the duplicate blocks, and only record the first occurrence in the Block table.

To make this work properly, I suggest these changes:

  • When moving the blocks to the new volume, put the existing one in the DuplicateBlock table
  • When the old volume gets deleted after uploading the new one, the duplicated blocks from that volume also need to be cleared from the table
  • In the wasted space calculation, treat duplicated blocks as wasted space (same as deleted blocks); a rough sketch of such a query follows the list
  • In the test operation, check the duplicated blocks and don't treat them as errors or extra hashes (this should also fix all other cases of "extra hash" after a database recreate)
  • There needs to be a way to have duplicated deleted blocks (after a duplicated block is deleted)
  • (Optional) With database recreate, make sure that all but the last occurrence of the block (based on volume timestamp) are marked as duplicate. This ensures that after an interrupted compact, the recreated database also has the duplicated blocks in the un-compacted volume instead of the new one, so no space is wasted in the new volume and the duplicates disappear with the next compact
  • (Optional) When compacting a volume (maybe even in the waste calculation?), check whether there are duplicate blocks in other volumes that will not be compacted. Then move the "real" Block entry to one of those volumes and remove it from the compacted volume. This would, over time, compact away any duplicated blocks, no matter how they are distributed

The optional changes seem low-priority to me, because this entire case should only happen rarely and it is better to be correct than space efficient.
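For the wasted-space point above, here is a hedged sketch of what a per-volume waste query could look like when DuplicateBlock entries are counted alongside DeletedBlock entries. The table and column names are assumptions based on this discussion (DuplicateBlock is assumed to reference Block by ID) and are not checked against the actual schema:

// Hypothetical query: bytes in one volume that would be reclaimed by compacting,
// counting deleted blocks plus blocks whose live copy now resides in another volume.
const string WastedSizeForVolumeQuery = @"
SELECT
      (SELECT IFNULL(SUM(""Size""), 0)
         FROM ""DeletedBlock""
        WHERE ""VolumeID"" = @VolumeId)
    + (SELECT IFNULL(SUM(""Block"".""Size""), 0)
         FROM ""DuplicateBlock""
         JOIN ""Block"" ON ""Block"".""ID"" = ""DuplicateBlock"".""BlockID""
        WHERE ""DuplicateBlock"".""VolumeID"" = @VolumeId)
    AS ""WastedSize""";

Treating the duplicates as waste would also mean an interrupted compact leaves the old volume looking mostly wasted, so a later compact naturally picks it up again.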

Originally posted by @Jojo-1000 in #4967 (comment)

An alternative solution is to temporarily allow an inconsistent database. In combination with #4982, the next compact would remove the extra blocks. However, the space calculation would be incorrect in the meantime.


ts678 commented May 13, 2024

Thanks for the continuing work. I've been digging through history in 2022 and 2023 about this bug, as it's been a long story.

Taken from merged PR

I assume this references the PR from the original post, which was only partially merged in 2023, taking only the more critical missing-file error fix.
The Extra: hashes seen in test with full-remote-verification had been hoped to be an annoyance without other impact.

I'll need to read this a while longer to see if that has changed, but one thing that bothers me is the collection of imperfections that Duplicati gains in its records over time, as they might show up as mystery issues after more runs.

They're also potential land mines for new code that makes assumptions about the data being by-the-book, when it isn't really. Having said that, finite resources tend to focus on the things that break right now, and I can't disagree. Glad we're catching up.

Another option is to write a book, so "by-the-book" has meaning. Document all the oddities, e.g. the NULL and sentinel values. People coming in from the Duplicati Inc. side might want that as training, and the pieces that I post in the forum only go so far.

So that's an editorial break, but going back to March 2022, I found one note in my files on a possible test case which might be

test all with full-remote-verification shows "Extra" hashes from error in compact #4693 (but it's not the network error, as here)

and has a note on it that it may be fixed by

Fix "test" error: extra hashes #4982, which is still a draft and is not the OP PR. I find its note repeating the editorial point I made above:

Reuse existing deleted blocks to prevent uploading duplicated copies that can cause problems later

Ultimately I think we need some opinion on this that is more expert than either of us, and I'm glad that might be possible again.

Even the top expert might still have some limits, as I've been asking annoying questions about the correctness of the commit design, including how well it handles different threads doing commits based on their needs. Is a good time for one also good for the rest? Hoping to get a design document from someone sometime on the concurrency plan and its interaction with the transaction scheme.
