
fix(fs): duplicate entries handling in FileSystem API. #756

Merged: 1 commit into main from 754-duplicate-entries on Mar 25, 2024

Conversation

@qkaiser (Contributor) commented Feb 11, 2024

Early draft to take care of archives and filesystems holding duplicate entries.

We need to decide on a strategy when that happens:

  • overwrite the previous entry with the new one? (overwrite)
  • do not overwrite the previous entry with the new one? (skip)
  • write the file with a new suffix (e.g. .1 à la wget)? (rewrite)

Do we want to compare hashes before creating a renamed entry or not?

What happens if we have triplicate or quadruplicate entries?
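
For illustration, here is a minimal sketch of the three strategies, including suffix counting to handle triplicate and quadruplicate entries (resolve_duplicate is a hypothetical helper, not part of the FileSystem API):

from pathlib import Path
from typing import Optional

def resolve_duplicate(out: Path, strategy: str = "overwrite") -> Optional[Path]:
    # Hypothetical helper: decide where (or whether) to write a duplicate entry.
    if not out.exists():
        return out
    if strategy == "overwrite":
        out.unlink()  # drop the previous entry; the new one takes its place
        return out
    if strategy == "skip":
        return None  # keep the previous entry, ignore the new one
    if strategy == "rewrite":
        # append .1, .2, ... à la wget; counting up covers triplicates and beyond
        n = 1
        while (candidate := out.with_name(f"{out.name}.{n}")).exists():
            n += 1
        return candidate
    raise ValueError(f"unknown strategy: {strategy}")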

Resolve #754

@qkaiser added the bug label Feb 11, 2024
@qkaiser self-assigned this Feb 11, 2024
@qkaiser marked this pull request as draft February 11, 2024 18:47
@AndrewFasano

Thanks for drafting a fix for this issue. If the files are the same, skipping the duplicate and keeping everything else the same seems like a good solution.

When the files are different, it's probably useful for most users to be able to see all versions. I don't love the .1 suffix though, just because it makes the output directory look less like the files would on a standard system using the filesystem.

I don't have a deep understanding of these file formats. When two files exist with the same name, what would actually happen in a typical system that used the format? Would only one version of the file show up and one be hidden? If that's the case (and there's a way to tell which one would show up), perhaps selecting the normal file for the standard unblob output location would be good, and then adding the second (or third, etc.) into a secondary output location?

@AndrewFasano

I tried this out and found there's still an issue with some firmware images: if fs_path.absolute_path.exists() and fs_path.absolute_path is a directory, the unlink call fails:

2024-02-12 02:15.14 [error    ] Unknown error happened while extracting chunk pid=2445276
Traceback (most recent call last):
  File "/unblob/unblob/processing.py", line 607, in _extract_chunk
    if result := chunk.extract(inpath, extract_dir):
  File "/unblob/unblob/models.py", line 118, in extract
    return self.handler.extract(inpath, outdir)
  File "/unblob/unblob/models.py", line 455, in extract
    return self.EXTRACTOR.extract(inpath, outdir)
  File "/unblob/unblob/handlers/archive/cpio.py", line 384, in extract
    parser.dump_entries(fs)
  File "/unblob/unblob/handlers/archive/cpio.py", line 217, in dump_entries
    fs.mkdir(
  File "/unblob/unblob/file_utils.py", line 524, in mkdir
    safe_path = self._get_extraction_path(path, "mkdir")
  File "/unblob/unblob/file_utils.py", line 485, in _get_extraction_path
    fs_path.absolute_path.unlink()
  File "/usr/lib/python3.10/pathlib.py", line 1206, in unlink
    self._accessor.unlink(self)
IsADirectoryError: [Errno 21] Is a directory: '/tmp/tmpg86lywif/DIR-655_REVB_FIRMWARE_2.00.ZIP_extract/dir655_revB_FW_200NA/DIR655B1_FW200NAb33.bin_extract/64-6212776.gzip_extract/gzip.uncompressed_extract/2906880-7696567.gzip_extract/gzip.uncompressed_extract/dev'

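The root cause is reproducible with plain pathlib: Path.unlink() only removes files and symlinks, so calling it on a directory raises IsADirectoryError on Linux (a standalone sketch, independent of the unblob code):

import tempfile
from pathlib import Path

d = Path(tempfile.mkdtemp()) / "dev"
d.mkdir()
try:
    d.unlink()  # pathlib refuses to unlink a directory
except IsADirectoryError as exc:
    print(exc)  # [Errno 21] Is a directory: '.../dev'
d.rmdir()  # empty directories need rmdir(); shutil.rmtree() for non-empty ones
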
@qkaiser (Contributor, Author) commented Feb 12, 2024

> I don't have a deep understanding of these file formats. When two files exist with the same name, what would actually happen in a typical system that used the format? Would only one version of the file show up and one be hidden?

Yes, that's usually what happens. Silently.

> If that's the case (and there's a way to tell which one would show up), perhaps selecting the normal file for the standard unblob output location would be good, and then adding the second (or third, etc.) into a secondary output location?

Let's start with an implementation that simply overwrites and see what comes up from the team during review.

@AndrewFasano commented Feb 12, 2024

Sounds good to me.

In terms of this patch and the error I found above, I initially tried fixing it by making it support deleting the directory when it already existed. That turned out to be a bad idea, as it would end up deleting a directory full of extracted files only to replace it with an empty one. Things work better if it overwrites duplicate files and ignores existing directories:

diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index c96da27..68727d7 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -477,13 +477,22 @@ class FileSystem:
         fs_path = self._fs_path(path)

         if fs_path.absolute_path.exists():
-            report = ExtractionProblem(
-                path=str(fs_path.relative_path),
-                problem=f"Attempting to create a file that already exists through {path_use_description}",
-                resolution="Overwrite.",
-            )
-            fs_path.absolute_path.unlink()
-            self.record_problem(report)
+            if fs_path.absolute_path.is_file():
+                report = ExtractionProblem(
+                    path=str(fs_path.relative_path),
+                    problem=f"Attempting to create a file that already exists through {path_use_description}",
+                    resolution="Overwrite.",
+                )
+                self.record_problem(report)
+                fs_path.absolute_path.unlink()
+
+            elif fs_path.absolute_path.is_dir():
+                report = ExtractionProblem(
+                    path=str(fs_path.relative_path),
+                    problem=f"Attempting to create a directory that already exists through {path_use_description}",
+                    resolution="Ignore",
+                )
+                self.record_problem(report)

         if not fs_path.is_safe:
             report = PathTraversalProblem(

@AndrewFasano

And there's another issue in here too: in _get_checked_link, the src is the link target, while the dst is the file that's being created (discussion here). These names are very confusing. So the "file already exists" error should only be raised if the dst already exists.

diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index 65d7ad6..4ee7f97 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -565,7 +565,7 @@ class FileSystem:
     def _get_checked_link(self, src: Path, dst: Path) -> Optional[_FSLink]:
         link = _FSLink(root=self.root, src=src, dst=dst)

-        if link.src.absolute_path.exists():
+        if link.dst.absolute_path.exists():
             self.record_problem(link.format_report("File already exists."))
             return None
         if not link.is_safe:
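
The naming follows os.symlink(src, dst): src is the target the link points at, dst is the path that gets created. A short illustration of why only dst can meaningfully "already exist" (paths are made up):

import os
import tempfile

tmp = tempfile.mkdtemp()
target = os.path.join(tmp, "target")  # src: what the link will point at
open(target, "w").close()
link = os.path.join(tmp, "link")  # dst: the path being created
os.symlink(target, link)  # succeeds because dst does not exist yet
# src existing is perfectly normal (hard links require it), and a symlink may
# even point at a path that does not exist, so checking src was the wrong test.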

@AndrewFasano mentioned this pull request Feb 14, 2024
@qkaiser assigned nyuware and unassigned qkaiser Feb 29, 2024
@nyuware (Contributor) commented Mar 11, 2024

The code I pushed is the work of @qkaiser and @AndrewFasano. It:

  1. Overwrites files that already exist.
  2. Ignores directories that already exist.

The issue is that we are tripping two tests:

  1. test_open_no_path_traversal fails on assert sandbox.problems == [], which is no longer true because we now report an "Overwriting already existing file" problem on file overwrites.
  2. test_open fails on with sandbox.open(path, "rb+") as f: assert f.read() == bytes(100) + b"text", because the file already exists.

I think 1. is a non-issue. For 2., do we accept that a file CAN BE overwritten, thus dropping that part of the test, or do we not overwrite?

@qkaiser (Contributor, Author) commented Mar 12, 2024

@nyuware for test_open_no_path_traversal, you can filter on sandbox.problems. The test should verify that sandbox.problems does not contain a PathTraversalProblem. Since the problem reported by your change is an ExtractionProblem, it will be ok.
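
A sketch of that filtered assertion (assuming PathTraversalProblem can be imported from unblob's report module; sandbox is the fixture from the existing test):

from unblob.report import PathTraversalProblem  # assumed import location

traversals = [p for p in sandbox.problems if isinstance(p, PathTraversalProblem)]
assert traversals == []  # ExtractionProblem entries from overwrites are tolerated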

The test_open failure is indicative of something that must change in the API. We should only report a problem when trying to create a file that already exists. In test_open, we're creating a file and then simply reading from it. Opening a file in read mode should not lead to an overwrite.

_get_extraction_path could take a mode parameter that defaults to wb; if the mode is not a write mode, then we should neither report an ExtractionProblem nor overwrite.
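
The mode check could look roughly like this (a standalone sketch; the helper name and the exact set of flags are illustrative, not the real _get_extraction_path signature):

def is_destructive(mode: str = "wb") -> bool:
    # Only modes that can create or truncate a file ("w", "a", "x", or "+")
    # warrant reporting an ExtractionProblem and unlinking the old file;
    # a plain read such as "rb" must leave the existing file untouched.
    return any(flag in mode for flag in ("w", "a", "x", "+"))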

@e3krisztian (Contributor)

The FileSystem class (as its doc-string says) is intended to limit the output to a well-known directory, and the class's responsibility should end there.

The original problem is a CPIO extractor problem.
I think the resolution of conflicting output files should be handled mostly in the CPIO extractor.

The cpio extractor is using the FileSystem.carve() method, which will die if the output already exists.
Fortunately there is already a FileSystem.unlink(path) method, which will delete path but will not complain if path does not exist.
So I would simply put

fs.unlink(entry.path)

before

fs.carve(entry.path, self.file, entry.start_offset, entry.size)

in the cpio extractor.

I would also add a test file with duplicate files to cover the new overwriting behavior.


An alternative implementation could be to make the carve() function and method smarter with an overwrite_existing=False default parameter, and to change the carve call in cpio to pass overwrite_existing=True.
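
That variant might look roughly like this (a simplified stand-in: the real carve() reads from a file at an offset, and overwrite_existing is the proposed, not yet existing, parameter):

from pathlib import Path

def carve(out: Path, data: bytes, overwrite_existing: bool = False) -> None:
    if out.exists():
        if not overwrite_existing:
            raise FileExistsError(out)  # current behavior: die on duplicates
        out.unlink()  # proposed behavior, opted into by the cpio extractor
    out.write_bytes(data)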

@qkaiser (Contributor, Author) commented Mar 13, 2024

@e3krisztian yes, that's the solution @nyuware and I agreed on, tyvm for your input! :)

@nyuware force-pushed the 754-duplicate-entries branch from 9fbdf4b to c2a5c76 on March 21, 2024 08:41
@nyuware marked this pull request as ready for review March 21, 2024 08:42
@nyuware force-pushed the 754-duplicate-entries branch 2 times, most recently from 52af630 to b66df80 on March 21, 2024 08:50
@nyuware (Contributor) commented Mar 21, 2024

Following @e3krisztian's comment, we decided to overwrite files that already exist by calling fs.unlink(entry.path) before the entry gets carved again, which would otherwise fail.

The 3 force-pushes were me failing to rebase the branch and forgetting the integration tests; apologies about that.

@nyuware force-pushed the 754-duplicate-entries branch 5 times, most recently from 31dcbc0 to 0f38e67 on March 21, 2024 13:55
@e3krisztian force-pushed the 754-duplicate-entries branch from 0f38e67 to e1dbe74 on March 22, 2024 10:42
@qkaiser force-pushed the 754-duplicate-entries branch from e1dbe74 to 1ba3e04 on March 25, 2024 06:22
@qkaiser enabled auto-merge March 25, 2024 06:23
@qkaiser merged commit 695b59f into main Mar 25, 2024
13 checks passed
@qkaiser deleted the 754-duplicate-entries branch March 25, 2024 06:25

Successfully merging this pull request may close: FileExistsError during extraction with Netgear firmware (#754)