Fix "test" error: extra hashes #4982

Draft
Jojo-1000 wants to merge 5 commits into master

Conversation

Jojo-1000
Contributor

Closes #4693

This fixes multiple issues with deleted / duplicated blocks:

  • Reuse existing deleted blocks to prevent uploading duplicate copies that can cause problems later (see the sketch after this list)
  • When running recreate database, fill the DeletedBlock table for correct space calculations
  • When running compact, check that there is no duplicate block in another volume which would be modified
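A minimal sketch of the reuse idea, in SQL for illustration only. The actual change presumably lives in the C# database code; table and column names follow the existing local database schema, and :hash / :size are placeholder parameters, not part of the PR.

-- Before uploading a block during backup, check whether an identical block
-- already exists in DeletedBlock; if so, reinstate one such entry in the
-- Block table instead of writing a duplicate copy to a new dblock volume.
INSERT INTO "Block" ("Hash", "Size", "VolumeID")
SELECT "Hash", "Size", "VolumeID"
FROM "DeletedBlock"
WHERE "Hash" = :hash AND "Size" = :size
LIMIT 1;

-- Drop exactly one matching entry from DeletedBlock. If several volumes hold
-- the same deleted block, the real code has to pick one consistently.
DELETE FROM "DeletedBlock"
WHERE "rowid" IN (
    SELECT "rowid"
    FROM "DeletedBlock"
    WHERE "Hash" = :hash AND "Size" = :size
    LIMIT 1
);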

Steps to reproduce

Error with compact (comment)
Error with recreate (comment)

Duplicate blocks:

  • Create backup with --no-auto-compact and --no-encryption
  • Create file A.txt with content A, backup 1
  • Delete A.txt, backup 2
  • Recreate A.txt with same content, backup 3

Expected result:
Backup 3 should not upload block for A.txt (hash VZrq0IJk1XldOQlxjN0Fq9SVcuhP5VWQ7vMaiKCP3/0=), as it is already contained in backup 1.

Actual result:
The dblock file for backup 3 contains a block with hash VZrq0IJk1XldOQlxjN0Fq9SVcuhP5VWQ7vMaiKCP3/0=, as does the dblock file for backup 1.

TODO:

Performance impact

Check the performance impact of reusing existing deleted blocks. If necessary, add an index over (Hash, Size) to the DeletedBlock table.
A test of the before/after performance on a large backup with deleted blocks would be appreciated.
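If the (Hash, Size) lookup on DeletedBlock does turn out to be a bottleneck, the index would presumably look something like this (the index name here is chosen for illustration and is not part of the current schema):

CREATE INDEX IF NOT EXISTS "DeletedBlockHashSize"
ON "DeletedBlock" ("Hash", "Size");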

The worst case performance could be tested as follows:

  • Create a test scenario where large amounts of new files are added every backup
  • 24% of those are deleted in the next version (less than the compact threshold, so those blocks are kept). It would need to be ensured that the deleted files are spread over the dblock volumes (see the query after this list).
  • Compare backup times before and after this change
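A query like the following could confirm that the deleted test data really is spread across volumes rather than concentrated in a few; it only touches the DeletedBlock table, so it should work on any current database.

-- Wasted space per remote volume, largest first.
SELECT "VolumeID",
       SUM("Size") AS "WastedBytes",
       COUNT(*)    AS "DeletedBlocks"
FROM "DeletedBlock"
GROUP BY "VolumeID"
ORDER BY "WastedBytes" DESC;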

Validate database recreate change

The change in database recreate works in my tests, but I am not completely sure that the query catches all usages of blocks (I first missed the BlocklistHash table, for example). It is possible that some blocks are moved to DeletedBlock incorrectly.
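One possible (partial) sanity check after a recreate: any row returned by a query along these lines would be a block that landed in DeletedBlock even though it is still referenced as file data or as a blocklist hash. It compares hashes only and will not catch every kind of incorrect move, so it is a sketch rather than a complete validation.

SELECT "Hash", "Size"
FROM "DeletedBlock"
WHERE "Hash" IN (SELECT "Hash" FROM "BlocklistHash")
   OR "Hash" IN (
        -- blocks still referenced by some blockset entry
        SELECT "Block"."Hash"
        FROM "BlocksetEntry"
        INNER JOIN "Block" ON "Block"."ID" = "BlocksetEntry"."BlockID"
   );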

Commits:

  • This prevents duplicated blocks after a block was deleted and re-added (duplicati#4693).
  • Also fix RemoveMissingBlocks in LocalListBrokenFilesDatabase, which did not clear the DeletedBlock table.
  • The DeletedBlock table was not filled after a database recreate. This results in incorrect compact size calculations and possible duplicate blocks.
  • To detect deleted blocks, add all blocks not referenced in a blockset or used as a blocklist hash.
  • Check that blocks which are moved are recorded for the volume to be deleted. If duplicate blocks exist and one is in the DeletedBlock table, this can erase a block entry on an unrelated volume (duplicati#4693).
@ts678
Collaborator

ts678 commented Jun 28, 2023

I first missed the BlocklistHash table, for example

I meant to comment in this issue, where you wrote:

check the Block table for blocks that do not appear in any file.

but you got the PR out first. I think that's all the flavors. A blockset might be data or metadata though (two aspects of a file).
Reducing Time Spent Deleting Blocks SQL shows that idea, although my proposed speedup rewrite didn't get much testing...
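To illustrate the two aspects: both end up as a blockset, the file's data directly via its BlocksetID and its metadata via the Metadataset table, which is why a blockset-based query covers data and metadata alike. Table and column names below follow my reading of the schema and may not match exactly.

-- Each file row points at two blocksets: one for data, one (indirectly) for metadata.
SELECT "File"."Path",
       "File"."BlocksetID"        AS "DataBlocksetID",
       "Metadataset"."BlocksetID" AS "MetadataBlocksetID"
FROM "File"
INNER JOIN "Metadataset" ON "Metadataset"."ID" = "File"."MetadataID"
LIMIT 5;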

@ts678
Collaborator

ts678 commented Jun 28, 2023

When running recreate database, fill the DeletedBlock table for correct space calculations

Does the current brute-force Deleting Blocks SQL in topic linked above reduce such needs?

cmd.ExecuteNonQuery(@"INSERT INTO ""DeletedBlock"" (""Hash"", ""Size"", ""VolumeID"") SELECT ""Hash"", ""Size"", ""VolumeID"" FROM ""Block"" WHERE ""ID"" NOT IN (SELECT DISTINCT ""BlockID"" AS ""BlockID"" FROM ""BlocksetEntry"" UNION SELECT DISTINCT ""ID"" FROM ""Block"", ""BlocklistHash"" WHERE ""Block"".""Hash"" = ""BlocklistHash"".""Hash"") ");

Of course it has to run, but it likely will eventually run. Regardless, are the goals equivalent?
That would give an opportunity to pick the best scheme, and use it in both of the locations.

@Jojo-1000
Contributor Author

Does the current brute-force Deleting Blocks SQL in topic linked above reduce such needs?

If I understand it correctly, the suggested change runs the query when specific files or filesets are deleted, so blocks don't have to be looked up in the full table.

In a database recreate, all of the blocks have to be examined to see if they are used, so I don't think the change can be applied here.

The current queries are very similar, although I use a temporary table to avoid looking up the block IDs twice.
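For comparison, a rough sketch of the temporary-table shape (not the exact code in this PR): the set of used block IDs is materialized once and then reused, instead of evaluating the NOT IN subquery separately for the insert and the later delete.

-- Collect the IDs of all blocks still in use, once.
CREATE TEMPORARY TABLE "UsedBlockIds" AS
SELECT "BlockID" AS "ID"
FROM "BlocksetEntry"
UNION
SELECT "Block"."ID"
FROM "Block"
INNER JOIN "BlocklistHash" ON "Block"."Hash" = "BlocklistHash"."Hash";

-- Fill DeletedBlock from everything outside the used set...
INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID")
SELECT "Hash", "Size", "VolumeID"
FROM "Block"
WHERE "ID" NOT IN (SELECT "ID" FROM "UsedBlockIds");

-- ...and remove the same rows from Block, reusing the same ID set.
DELETE FROM "Block"
WHERE "ID" NOT IN (SELECT "ID" FROM "UsedBlockIds");

DROP TABLE "UsedBlockIds";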

@ts678
Collaborator

ts678 commented Jun 28, 2023

the suggested change runs the query when specific files or filesets are deleted

PoC change would probably not fit here, which is why I suggest brute-force plan:

all of the blocks have to be examined

A better-formatted version of the cited query (courtesy of poorsql.com) shows that it does look at all of the blocks.
The part above UNION is blockset (file data and metadata) oriented. Below the UNION are blocklist blocks.

INSERT INTO "DeletedBlock" (
 	"Hash",
 	"Size",
 	"VolumeID"
 	)
SELECT "Hash",
 	"Size",
 	"VolumeID"
FROM "Block"
WHERE "ID" NOT IN (
 	 	SELECT DISTINCT "BlockID" AS "BlockID"
 	 	FROM "BlocksetEntry"
 	 	
 	 	UNION
 	 	
 	 	SELECT DISTINCT "ID"
 	 	FROM "Block",
 	 	 	"BlocklistHash"
 	 	WHERE "Block"."Hash" = "BlocklistHash"."Hash"
 	 	)

Questions about DISTINCT and UNION ALL versus UNION got many query variations benchmarked here:
Backup Runtime after 2.0.7.1 update. It would be interesting to time your plan against the favorites there,
assuming of course that all give the correct answer. Maybe whichever way runs best gets put to wide use.
In the other use, there was also some moaning about having to do a slow query twice in a row. Any help?
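For reference, the UNION ALL variant of the brute-force query: inside a NOT IN (...) membership test, duplicate IDs in the subquery do not change the result, so the deduplication work of UNION/DISTINCT can be dropped. Whether that is actually faster on real databases is what the benchmarks in the linked topic try to answer.

INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID")
SELECT "Hash", "Size", "VolumeID"
FROM "Block"
WHERE "ID" NOT IN (
    SELECT "BlockID"
    FROM "BlocksetEntry"

    UNION ALL

    SELECT "Block"."ID"
    FROM "Block"
    INNER JOIN "BlocklistHash" ON "Block"."Hash" = "BlocklistHash"."Hash"
    );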

@ts678
Collaborator

ts678 commented Jun 29, 2023

When running recreate database, fill the DeletedBlock table

Reuse existing deleted blocks

These actually pair well together for one occasional use case:

Migrate from Linux to Windows
or more generally with a longer writeup across two requests:
Changing source OS

and there might be others, e.g. if database gets very broken.

Idea is to reattach source file blocks rather than reuploading.
Ideally a recreate can know them by just reading dindex files.
If new design puts the entire destination in DeletedBlock, it's
good that an initial backup on new OS etc. can get them out.

throw new UserInformationException(string.Format("The backup contains files that belong to another operating system. Proceeding with a backup would cause the database to contain paths from two different operation systems, which is not supported. To proceed without losing remote data, delete all filesets and make sure the --{0} option is set, then run the backup again to re-use the existing data on the remote store.", "no-auto-compact"), "CrossOsDatabaseReuseNotSupported");

@duplicatibot

This pull request has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/how-to-reuse-remote-data-when-changing-os/16717/4

@duplicatibot

This pull request has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/database-recreation-not-really-starting/16948/87

@duplicatibot

This pull request has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/how-to-fix-missing-volumes/17377/9

Successfully merging this pull request may close these issues.

test all with full-remote-verification shows "Extra" hashes from error in compact