Fix "test" error: extra hashes #4982

Draft
Jojo-1000 wants to merge 5 commits into master

Conversation

Jojo-1000
Contributor

Closes #4693

This fixes multiple issues with deleted / duplicated blocks:

  • Reuse existing deleted blocks to prevent uploading duplicate copies that can cause problems later (see the sketch after this list)
  • When running recreate database, fill the DeletedBlock table for correct space calculations
  • When running compact, check that there is no duplicate block in another volume which would be modified
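A minimal sketch of the reuse idea, in SQL for illustration only. The actual change presumably lives in the C# database code; table and column names follow the existing local database schema, and :hash / :size are placeholder parameters, not part of the PR.

-- Before uploading a block during backup, check whether an identical block
-- already exists in DeletedBlock; if so, reinstate one such entry in the
-- Block table instead of writing a duplicate copy to a new dblock volume.
INSERT INTO "Block" ("Hash", "Size", "VolumeID")
SELECT "Hash", "Size", "VolumeID"
FROM "DeletedBlock"
WHERE "Hash" = :hash AND "Size" = :size
LIMIT 1;

-- Drop exactly one matching entry from DeletedBlock. If several volumes hold
-- the same deleted block, the real code has to pick one consistently.
DELETE FROM "DeletedBlock"
WHERE "rowid" IN (
    SELECT "rowid"
    FROM "DeletedBlock"
    WHERE "Hash" = :hash AND "Size" = :size
    LIMIT 1
);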

Steps to reproduce

Error with compact (comment)
Error with recreate (comment)

Duplicate blocks:

  • Create backup with --no-auto-compact and --no-encryption
  • Create file A.txt with content A, backup 1
  • Delete A.txt, backup 2
  • Recreate A.txt with same content, backup 3

Expected result:
Backup 3 should not upload block for A.txt (hash VZrq0IJk1XldOQlxjN0Fq9SVcuhP5VWQ7vMaiKCP3/0=), as it is already contained in backup 1.

Actual result:
The dblock file for backup 3 contains a block with hash VZrq0IJk1XldOQlxjN0Fq9SVcuhP5VWQ7vMaiKCP3/0=, as does the dblock file for backup 1.

TODO:

Performance impact

Check the performance impact of reusing existing deleted blocks. If necessary, add an index over (Hash, Size) to the DeletedBlock table.
A test of the before/after performance on a large backup with deleted blocks would be appreciated.
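If the (Hash, Size) lookup on DeletedBlock does turn out to be a bottleneck, the index would presumably look something like this (the index name here is chosen for illustration and is not part of the current schema):

CREATE INDEX IF NOT EXISTS "DeletedBlockHashSize"
ON "DeletedBlock" ("Hash", "Size");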

The worst case performance could be tested as follows:

  • Create a test scenario where large amounts of new files are added every backup
  • 24% of those are deleted in the next version (less than the compact threshold, so those blocks are kept). It would need to be ensured that the deleted files are spread over the dblock volumes (see the query after this list).
  • Compare backup times before and after this change
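A query like the following could confirm that the deleted test data really is spread across volumes rather than concentrated in a few; it only touches the DeletedBlock table, so it should work on any current database.

-- Wasted space per remote volume, largest first.
SELECT "VolumeID",
       SUM("Size") AS "WastedBytes",
       COUNT(*)    AS "DeletedBlocks"
FROM "DeletedBlock"
GROUP BY "VolumeID"
ORDER BY "WastedBytes" DESC;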

Validate database recreate change

The change in database recreate works in my tests, but I am not completely sure that the query catches all usages of blocks (I first missed the BlocklistHash table, for example). It is possible that some blocks are moved to DeletedBlock incorrectly.
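One possible (partial) sanity check after a recreate: any row returned by a query along these lines would be a block that landed in DeletedBlock even though it is still referenced as file data or as a blocklist hash. It compares hashes only and will not catch every kind of incorrect move, so it is a sketch rather than a complete validation.

SELECT "Hash", "Size"
FROM "DeletedBlock"
WHERE "Hash" IN (SELECT "Hash" FROM "BlocklistHash")
   OR "Hash" IN (
        -- blocks still referenced by some blockset entry
        SELECT "Block"."Hash"
        FROM "BlocksetEntry"
        INNER JOIN "Block" ON "Block"."ID" = "BlocksetEntry"."BlockID"
   );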

Commits:

  • This prevents duplicated blocks after a block was deleted and re-added (duplicati#4693).
  • Also fix RemoveMissingBlocks in LocalListBrokenFilesDatabase, which did not clear the DeletedBlock table.
  • The DeletedBlock table was not filled after a database recreate. This results in incorrect compact size calculations and possible duplicate blocks.
  • To detect deleted blocks, add all blocks not referenced in a blockset or used as a blocklist hash.
  • Check that blocks which are moved are recorded for the volume to be deleted. If duplicate blocks exist and one is in the DeletedBlock table, this can erase a block entry on an unrelated volume (duplicati#4693).
@ts678
Collaborator

ts678 commented Jun 28, 2023

I first missed the BlocklistHash table, for example

I meant to comment in this issue, where you wrote:

check the Block table for blocks that do not appear in any file.

but you got the PR out first. I think that's all the flavors. A blockset might be data or metadata though (two aspects of a file).
Reducing Time Spent Deleting Blocks SQL shows that idea, although my proposed speedup rewrite didn't get much testing...
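To illustrate the two aspects: both end up as a blockset, the file's data directly via its BlocksetID and its metadata via the Metadataset table, which is why a blockset-based query covers data and metadata alike. Table and column names below follow my reading of the schema and may not match exactly.

-- Each file row points at two blocksets: one for data, one (indirectly) for metadata.
SELECT "File"."Path",
       "File"."BlocksetID"        AS "DataBlocksetID",
       "Metadataset"."BlocksetID" AS "MetadataBlocksetID"
FROM "File"
INNER JOIN "Metadataset" ON "Metadataset"."ID" = "File"."MetadataID"
LIMIT 5;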

@ts678
Collaborator

ts678 commented Jun 28, 2023

When running recreate database, fill the DeletedBlock table for correct space calculations

Does the current brute-force Deleting Blocks SQL in topic linked above reduce such needs?

cmd.ExecuteNonQuery(@"INSERT INTO ""DeletedBlock"" (""Hash"", ""Size"", ""VolumeID"") SELECT ""Hash"", ""Size"", ""VolumeID"" FROM ""Block"" WHERE ""ID"" NOT IN (SELECT DISTINCT ""BlockID"" AS ""BlockID"" FROM ""BlocksetEntry"" UNION SELECT DISTINCT ""ID"" FROM ""Block"", ""BlocklistHash"" WHERE ""Block"".""Hash"" = ""BlocklistHash"".""Hash"") ");

Of course it has to run, but it likely will eventually run. Regardless, are the goals equivalent?
That would give an opportunity to pick the best scheme, and use it in both of the locations.

@Jojo-1000
Contributor Author

Does the current brute-force Deleting Blocks SQL in topic linked above reduce such needs?

If I understand it correctly, the suggested change runs the query when specific files or filesets are deleted, so blocks don't have to be looked up in the full table.

In a database recreate, all of the blocks have to be examined to see if they are used, so I don't think the change can be applied here.

The current queries are very similar, although I use a temporary table to avoid looking up the block IDs twice.
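For comparison, a rough sketch of the temporary-table shape (not the exact code in this PR): the set of used block IDs is materialized once and then reused, instead of evaluating the NOT IN subquery separately for the insert and the later delete.

-- Collect the IDs of all blocks still in use, once.
CREATE TEMPORARY TABLE "UsedBlockIds" AS
SELECT "BlockID" AS "ID"
FROM "BlocksetEntry"
UNION
SELECT "Block"."ID"
FROM "Block"
INNER JOIN "BlocklistHash" ON "Block"."Hash" = "BlocklistHash"."Hash";

-- Fill DeletedBlock from everything outside the used set...
INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID")
SELECT "Hash", "Size", "VolumeID"
FROM "Block"
WHERE "ID" NOT IN (SELECT "ID" FROM "UsedBlockIds");

-- ...and remove the same rows from Block, reusing the same ID set.
DELETE FROM "Block"
WHERE "ID" NOT IN (SELECT "ID" FROM "UsedBlockIds");

DROP TABLE "UsedBlockIds";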

@ts678
Collaborator

ts678 commented Jun 28, 2023

the suggested change runs the query when specific files or filesets are deleted

PoC change would probably not fit here, which is why I suggest brute-force plan:

all of the blocks have to be examined

A better-formatted version of the cited query (courtesy of poorsql.com) shows that it does look at all of the blocks.
The part above UNION is blockset (file data and metadata) oriented. Below the UNION are blocklist blocks.

INSERT INTO "DeletedBlock" (
 	"Hash",
 	"Size",
 	"VolumeID"
 	)
SELECT "Hash",
 	"Size",
 	"VolumeID"
FROM "Block"
WHERE "ID" NOT IN (
 	 	SELECT DISTINCT "BlockID" AS "BlockID"
 	 	FROM "BlocksetEntry"
 	 	
 	 	UNION
 	 	
 	 	SELECT DISTINCT "ID"
 	 	FROM "Block",
 	 	 	"BlocklistHash"
 	 	WHERE "Block"."Hash" = "BlocklistHash"."Hash"
 	 	)

Questions about DISTINCT and UNION ALL versus UNION got many query variations benchmarked here:
Backup Runtime after 2.0.7.1 update. It would be interesting to time your plan against the favorites there,
assuming of course that all give the correct answer. Maybe whichever way runs best gets put to wide use.
In the other use, there was also some moaning about having to do a slow query twice in a row. Any help?
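For reference, the UNION ALL variant of the brute-force query: inside a NOT IN (...) membership test, duplicate IDs in the subquery do not change the result, so the deduplication work of UNION/DISTINCT can be dropped. Whether that is actually faster on real databases is what the benchmarks in the linked topic try to answer.

INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID")
SELECT "Hash", "Size", "VolumeID"
FROM "Block"
WHERE "ID" NOT IN (
    SELECT "BlockID"
    FROM "BlocksetEntry"

    UNION ALL

    SELECT "Block"."ID"
    FROM "Block"
    INNER JOIN "BlocklistHash" ON "Block"."Hash" = "BlocklistHash"."Hash"
    );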

@ts678
Collaborator

ts678 commented Jun 29, 2023

When running recreate database, fill the DeletedBlock table

Reuse existing deleted blocks

These actually pair well together for one occasional use case:

Migrate from Linux to Windows
or more generally with a longer writeup across two requests:
Changing source OS

and there might be others, e.g. if database gets very broken.

Idea is to reattach source file blocks rather than reuploading.
Ideally a recreate can know them by just reading dindex files.
If new design puts the entire destination in DeletedBlock, it's
good that an initial backup on new OS etc. can get them out.

throw new UserInformationException(string.Format("The backup contains files that belong to another operating system. Proceeding with a backup would cause the database to contain paths from two different operation systems, which is not supported. To proceed without losing remote data, delete all filesets and make sure the --{0} option is set, then run the backup again to re-use the existing data on the remote store.", "no-auto-compact"), "CrossOsDatabaseReuseNotSupported");

@duplicatibot

This pull request has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/how-to-reuse-remote-data-when-changing-os/16717/4

@duplicatibot

This pull request has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/database-recreation-not-really-starting/16948/87

@duplicatibot

This pull request has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/how-to-fix-missing-volumes/17377/9

Successfully merging this pull request may close these issues.

test all with full-remote-verification shows "Extra" hashes from error in compact