fix(fs): duplicate entries handling in FileSystem API. #756
Conversation
Thanks for drafting a fix for this issue. If the files are the same, skipping the syntax error and keeping everything else the same seems like a good solution. When the files are different, it's probably useful for most users to be able to see all versions. I don't have a deep understanding of these file formats. When two files exist with the same name, what would actually happen in a typical system that used the format? Would only one version of the file show up and the other be hidden? If that's the case (and there's a way to tell which one would show up), perhaps selecting the normal file for the standard unblob output location would be good, and then adding the second (or third, etc.) into a secondary output location?
I tried this out and found there's still an issue with some firmware images.
Yes, that's usually what happens. Silently.
Let's start with an implementation that simply overwrites and see what comes up from the team during review.
Sounds good to me. In terms of this patch and the error I found above, I initially tried fixing it by making it support deleting the directory when it already existed. That turned out to be a bad idea, as it would end up deleting a directory full of extracted files only to replace it with an empty directory. Things seem better if it overwrites duplicate files and ignores existing directories:

```diff
diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index c96da27..68727d7 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -477,13 +477,22 @@ class FileSystem:
         fs_path = self._fs_path(path)
         if fs_path.absolute_path.exists():
-            report = ExtractionProblem(
-                path=str(fs_path.relative_path),
-                problem=f"Attempting to create a file that already exists through {path_use_description}",
-                resolution="Overwrite.",
-            )
-            fs_path.absolute_path.unlink()
-            self.record_problem(report)
+            if fs_path.absolute_path.is_file():
+                report = ExtractionProblem(
+                    path=str(fs_path.relative_path),
+                    problem=f"Attempting to create a file that already exists through {path_use_description}",
+                    resolution="Overwrite.",
+                )
+                self.record_problem(report)
+                fs_path.absolute_path.unlink()
+
+            elif fs_path.absolute_path.is_dir():
+                report = ExtractionProblem(
+                    path=str(fs_path.relative_path),
+                    problem=f"Attempting to create a directory that already exists through {path_use_description}",
+                    resolution="Ignore",
+                )
+                self.record_problem(report)

         if not fs_path.is_safe:
             report = PathTraversalProblem(
```
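The behavior the diff above aims for can be illustrated with a small standalone sketch (the function name `prepare_target` and the return values are illustrative only, not the actual unblob API): existing regular files are removed so they can be overwritten, while existing directories are left alone so previously extracted contents survive.

```python
from pathlib import Path


def prepare_target(path: Path) -> str:
    """Describe how an already-existing extraction target was handled.

    Hypothetical helper mirroring the patched FileSystem behavior:
    overwrite files, ignore (keep) directories.
    """
    if path.exists():
        if path.is_file():
            path.unlink()  # remove the stale file so it can be rewritten
            return "overwritten"
        if path.is_dir():
            return "ignored"  # keep the directory and everything inside it
    return "new"
```

Note that the directory branch deliberately does nothing destructive; the earlier attempt of deleting the whole directory is exactly what wiped out already-extracted files.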
And there's another issue in here too, in `_get_checked_link`:

```diff
diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index 65d7ad6..4ee7f97 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -565,7 +565,7 @@ class FileSystem:
     def _get_checked_link(self, src: Path, dst: Path) -> Optional[_FSLink]:
         link = _FSLink(root=self.root, src=src, dst=dst)
-        if link.src.absolute_path.exists():
+        if link.dst.absolute_path.exists():
             self.record_problem(link.format_report("File already exists."))
             return None
         if not link.is_safe:
```
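The one-character fix above matters because the *destination* is the path being created; the source existing is the normal, expected case for a link. A minimal sketch of the corrected check (`checked_link` is a hypothetical stand-in, not the real `_get_checked_link`):

```python
from pathlib import Path
from typing import Optional, Tuple


def checked_link(src: Path, dst: Path) -> Optional[Tuple[Path, Path]]:
    """Return the (src, dst) pair if the link can be created, else None.

    Checks dst, not src: dst is the entry we are about to create, so a
    pre-existing dst is the duplicate-entry conflict worth reporting.
    """
    if dst.exists():
        return None  # duplicate entry at the destination: skip it
    return (src, dst)
```

With the original `src` check, every link whose target file had already been extracted would have been wrongly reported as "File already exists" and dropped.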
The code I pushed is the work of @qkaiser and @AndrewFasano.
The issue is that we are tripping two tests:
@nyuware for The
The original problem is a CPIO extractor problem. The fix is to call fs.unlink(entry.path) before fs.carve(entry.path, self.file, entry.start_offset, entry.size) in the CPIO extractor. I would also add a test file with duplicate files to cover the new overwriting behavior. An alternative implementation could be to smart up the FileSystem API.
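The unlink-before-carve pattern suggested here can be sketched with a simplified, hypothetical `FileSystem` stand-in (`MiniFS` and `dump_entry` are illustrative names, not unblob code): removing any previously extracted file for the same entry path before carving makes the last archive entry win, which mirrors what a real CPIO extraction does silently.

```python
from pathlib import Path


class MiniFS:
    """Toy stand-in for unblob's FileSystem, rooted at a directory."""

    def __init__(self, root: Path):
        self.root = root

    def unlink(self, rel: str) -> None:
        target = self.root / rel
        if target.is_file():
            target.unlink()  # drop the earlier duplicate, if any

    def carve(self, rel: str, data: bytes) -> None:
        target = self.root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)


def dump_entry(fs: MiniFS, rel: str, data: bytes) -> None:
    fs.unlink(rel)  # the suggested call, placed before carving
    fs.carve(rel, data)
```

Extracting the same path twice then leaves only the later entry's contents on disk, instead of tripping the "file already exists" problem report.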
@e3krisztian yes that's the solution we agreed on with @nyuware, tyvm for your input! :)
Force-pushed from 9fbdf4b to c2a5c76.
Force-pushed from 52af630 to b66df80.
According to @e3krisztian's comment, we decided to unlink the existing entry before carving in the CPIO extractor. The 3 pushes were me failing to rebase the branch and forgetting the integration tests; apologies about that.
Force-pushed from 31dcbc0 to 0f38e67.
Force-pushed from 0f38e67 to e1dbe74.
Force-pushed from e1dbe74 to 1ba3e04.
Early draft to take care of archives and filesystems holding duplicate entries.
We need to decide on a strategy when that happens:

- Do we want to overwrite? (rewrite)
- Do we want to rename duplicate entries with a `.1` suffix (à la wget)?
- Do we want to compare hashes before creating a renamed entry or not?
- What happens if we have triplicate or quadruplicate entries?
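For reference, the rename-à-la-wget option sketched above could look like this (purely illustrative, not unblob code); counting upward also answers the triplicate/quadruplicate question, since each further duplicate just gets the next free suffix:

```python
from pathlib import Path


def deduplicated(path: Path) -> Path:
    """Return a collision-free variant of path, wget-style.

    "data" -> "data" if free, else "data.1", "data.2", ...
    """
    if not path.exists():
        return path
    counter = 1
    while True:
        candidate = path.with_name(f"{path.name}.{counter}")
        if not candidate.exists():
            return candidate
        counter += 1
```

Hash comparison could then be layered on top: only fall back to a renamed entry when the new content actually differs from the file already on disk.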
Resolve #754