
fix(fs): duplicate entries handling in FileSystem API. #756

Merged: 1 commit into main from 754-duplicate-entries on Mar 25, 2024

Conversation

@qkaiser (Contributor) commented Feb 11, 2024

Early draft to take care of archives and filesystems holding duplicate entries.

We need to decide on a strategy when that happens:

  • overwrite the previous entry with the new one? (overwrite)
  • do not overwrite the previous entry with the new one? (skip)
  • write the file with a new suffix (e.g. .1 à la wget)? (rewrite)

Do we want to compare hashes before creating a renamed entry or not?

What happens if we have triplicate or quadruplicate entries?
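
For illustration, here is a minimal sketch of the three strategies, including suffix counting to handle triplicate and quadruplicate entries (resolve_duplicate is a hypothetical helper, not part of the FileSystem API):

from pathlib import Path
from typing import Optional

def resolve_duplicate(out: Path, strategy: str = "overwrite") -> Optional[Path]:
    # Hypothetical helper: decide where (or whether) to write a duplicate entry.
    if not out.exists():
        return out
    if strategy == "overwrite":
        out.unlink()  # drop the previous entry; the new one takes its place
        return out
    if strategy == "skip":
        return None  # keep the previous entry, ignore the new one
    if strategy == "rewrite":
        # append .1, .2, ... à la wget; counting up covers triplicates and beyond
        n = 1
        while (candidate := out.with_name(f"{out.name}.{n}")).exists():
            n += 1
        return candidate
    raise ValueError(f"unknown strategy: {strategy}")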

Resolve #754

@qkaiser added the bug label Feb 11, 2024
@qkaiser self-assigned this Feb 11, 2024
@qkaiser marked this pull request as draft February 11, 2024 18:47
@AndrewFasano

Thanks for drafting a fix for this issue. If the files are the same, skipping the duplicate and keeping everything else the same seems like a good solution.

When the files are different, it's probably useful for most users to be able to see all versions. I don't love the .1 suffix though, just because it makes the output directory look less like the files would on a standard system using the filesystem.

I don't have a deep understanding of these file formats. When two files exist with the same name, what would actually happen in a typical system that used the format? Would only one version of the file show up and one be hidden? If that's the case (and there's a way to tell which one would show up), perhaps selecting the normal file for the standard unblob output location would be good, and then adding the second (or third, etc.) into a secondary output location?

@AndrewFasano

I tried this out and found there's still an issue with some firmware images: if fs_path.absolute_path.exists() and fs_path.absolute_path is a directory, the unlink call fails:

2024-02-12 02:15.14 [error    ] Unknown error happened while extracting chunk pid=2445276
Traceback (most recent call last):
  File "/unblob/unblob/processing.py", line 607, in _extract_chunk
    if result := chunk.extract(inpath, extract_dir):
  File "/unblob/unblob/models.py", line 118, in extract
    return self.handler.extract(inpath, outdir)
  File "/unblob/unblob/models.py", line 455, in extract
    return self.EXTRACTOR.extract(inpath, outdir)
  File "/unblob/unblob/handlers/archive/cpio.py", line 384, in extract
    parser.dump_entries(fs)
  File "/unblob/unblob/handlers/archive/cpio.py", line 217, in dump_entries
    fs.mkdir(
  File "/unblob/unblob/file_utils.py", line 524, in mkdir
    safe_path = self._get_extraction_path(path, "mkdir")
  File "/unblob/unblob/file_utils.py", line 485, in _get_extraction_path
    fs_path.absolute_path.unlink()
  File "/usr/lib/python3.10/pathlib.py", line 1206, in unlink
    self._accessor.unlink(self)
IsADirectoryError: [Errno 21] Is a directory: '/tmp/tmpg86lywif/DIR-655_REVB_FIRMWARE_2.00.ZIP_extract/dir655_revB_FW_200NA/DIR655B1_FW200NAb33.bin_extract/64-6212776.gzip_extract/gzip.uncompressed_extract/2906880-7696567.gzip_extract/gzip.uncompressed_extract/dev'

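The root cause is reproducible with plain pathlib: Path.unlink() only removes files and symlinks, so calling it on a directory raises IsADirectoryError on Linux (a standalone sketch, independent of the unblob code):

import tempfile
from pathlib import Path

d = Path(tempfile.mkdtemp()) / "dev"
d.mkdir()
try:
    d.unlink()  # pathlib refuses to unlink a directory
except IsADirectoryError as exc:
    print(exc)  # [Errno 21] Is a directory: '.../dev'
d.rmdir()  # empty directories need rmdir(); shutil.rmtree() for non-empty ones
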
@qkaiser (Contributor, Author) commented Feb 12, 2024

> I don't have a deep understanding of these file formats. When two files exist with the same name, what would actually happen in a typical system that used the format? Would only one version of the file show up and one be hidden?

Yes, that's usually what happens. Silently.

> If that's the case (and there's a way to tell which one would show up), perhaps selecting the normal file for the standard unblob output location would be good, and then adding the second (or third, etc.) into a secondary output location?

Let's start with an implementation that simply overwrites and see what comes up from the team during review.

@AndrewFasano commented Feb 12, 2024

Sounds good to me.

In terms of this patch and the error I found above, I initially tried fixing it by making it support deleting the directory when it already existed. That turned out to be a bad idea, as it would end up deleting a directory full of extracted files only to replace it with an empty one. Things work better if it overwrites duplicate files and ignores existing directories:

diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index c96da27..68727d7 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -477,13 +477,22 @@ class FileSystem:
         fs_path = self._fs_path(path)

         if fs_path.absolute_path.exists():
-            report = ExtractionProblem(
-                path=str(fs_path.relative_path),
-                problem=f"Attempting to create a file that already exists through {path_use_description}",
-                resolution="Overwrite.",
-            )
-            fs_path.absolute_path.unlink()
-            self.record_problem(report)
+            if fs_path.absolute_path.is_file():
+                report = ExtractionProblem(
+                    path=str(fs_path.relative_path),
+                    problem=f"Attempting to create a file that already exists through {path_use_description}",
+                    resolution="Overwrite.",
+                )
+                self.record_problem(report)
+                fs_path.absolute_path.unlink()
+
+            elif fs_path.absolute_path.is_dir():
+                report = ExtractionProblem(
+                    path=str(fs_path.relative_path),
+                    problem=f"Attempting to create a directory that already exists through {path_use_description}",
+                    resolution="Ignore",
+                )
+                self.record_problem(report)

         if not fs_path.is_safe:
             report = PathTraversalProblem(

@AndrewFasano

And there's another issue in here too: in _get_checked_link, the src is the link target, while the dst is the file that's being created (discussion here). These names are very confusing. So the "file already exists" error should only be raised if the dst already exists.

diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index 65d7ad6..4ee7f97 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -565,7 +565,7 @@ class FileSystem:
     def _get_checked_link(self, src: Path, dst: Path) -> Optional[_FSLink]:
         link = _FSLink(root=self.root, src=src, dst=dst)

-        if link.src.absolute_path.exists():
+        if link.dst.absolute_path.exists():
             self.record_problem(link.format_report("File already exists."))
             return None
         if not link.is_safe:
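
The naming follows os.symlink(src, dst): src is the target the link points at, dst is the path that gets created. A short illustration of why only dst can meaningfully "already exist" (paths are made up):

import os
import tempfile

tmp = tempfile.mkdtemp()
target = os.path.join(tmp, "target")  # src: what the link will point at
open(target, "w").close()
link = os.path.join(tmp, "link")  # dst: the path being created
os.symlink(target, link)  # succeeds because dst does not exist yet
# src existing is perfectly normal (hard links require it), and a symlink may
# even point at a path that does not exist, so checking src was the wrong test.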

@AndrewFasano mentioned this pull request Feb 14, 2024
@qkaiser assigned nyuware and unassigned qkaiser Feb 29, 2024
@nyuware (Contributor) commented Mar 11, 2024

The code I pushed is the work of @qkaiser and @AndrewFasano. It:

  1. Overwrites files that already exist.
  2. Ignores directories that already exist.

The issue is that we are tripping two tests:

  1. test_open_no_path_traversal fails on assert sandbox.problems == [], which is no longer true because we now report an "Overwriting already existing file" problem on file overwrites.
  2. test_open fails on with sandbox.open(path, "rb+") as f: assert f.read() == bytes(100) + b"text", because the file already exists.

I think 1. is a non-issue. For 2., do we accept that a file CAN BE overwritten, thus dropping that part of the test, or do we not overwrite?

@qkaiser (Contributor, Author) commented Mar 12, 2024

@nyuware for test_open_no_path_traversal, you can filter on sandbox.problems. The test should verify that sandbox.problems does not contain a PathTraversalProblem. Since the problem reported by your change is an ExtractionProblem, it will be ok.
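
A sketch of that filtered assertion (assuming PathTraversalProblem can be imported from unblob's report module; sandbox is the fixture from the existing test):

from unblob.report import PathTraversalProblem  # assumed import location

traversals = [p for p in sandbox.problems if isinstance(p, PathTraversalProblem)]
assert traversals == []  # ExtractionProblem entries from overwrites are tolerated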

The test_open failure is indicative of something that must change in the API. We should only report a problem when trying to create a file that already exists. In test_open, we're creating a file and then simply reading from it. Opening a file in read mode should not lead to an overwrite.

_get_extraction_path could take a mode parameter that defaults to wb; if the mode is not a write mode, then we should neither report an ExtractionProblem nor overwrite.
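
The mode check could look roughly like this (a standalone sketch; the helper name and the exact set of flags are illustrative, not the real _get_extraction_path signature):

def is_destructive(mode: str = "wb") -> bool:
    # Only modes that can create or truncate a file ("w", "a", "x", or "+")
    # warrant reporting an ExtractionProblem and unlinking the old file;
    # a plain read such as "rb" must leave the existing file untouched.
    return any(flag in mode for flag in ("w", "a", "x", "+"))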

@e3krisztian (Contributor)

The FileSystem class (as its doc-string says) is intended to limit the output to a well-known directory, and the class's responsibility should end there.

The original problem is a CPIO extractor problem.
I think the resolution of conflicting output files should be handled mostly in the CPIO extractor.

The cpio extractor is using the FileSystem.carve() method, which will die if the output already exists.
Fortunately there is already a FileSystem.unlink(path) method, which will delete path but will not complain if path does not exist.
So I would simply put

fs.unlink(entry.path)

before

fs.carve(entry.path, self.file, entry.start_offset, entry.size)

in the cpio extractor.

I would also add a test file with duplicate files to cover the new overwriting behavior.


An alternative implementation could be to make the carve() function and method smarter with an overwrite_existing=False default parameter, and to change the carve call in cpio to pass overwrite_existing=True.
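
That variant might look roughly like this (a simplified stand-in: the real carve() reads from a file at an offset, and overwrite_existing is the proposed, not yet existing, parameter):

from pathlib import Path

def carve(out: Path, data: bytes, overwrite_existing: bool = False) -> None:
    if out.exists():
        if not overwrite_existing:
            raise FileExistsError(out)  # current behavior: die on duplicates
        out.unlink()  # proposed behavior, opted into by the cpio extractor
    out.write_bytes(data)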

@qkaiser (Contributor, Author) commented Mar 13, 2024

@e3krisztian yes, that's the solution @nyuware and I agreed on, tyvm for your input! :)

@nyuware force-pushed the 754-duplicate-entries branch from 9fbdf4b to c2a5c76 on March 21, 2024 08:41
@nyuware marked this pull request as ready for review March 21, 2024 08:42
@nyuware force-pushed the 754-duplicate-entries branch 2 times, most recently from 52af630 to b66df80 on March 21, 2024 08:50
@nyuware (Contributor) commented Mar 21, 2024

Following @e3krisztian's comment, we decided to overwrite files that already exist by calling fs.unlink(entry.path) before the entry gets carved again, which would otherwise fail.

The 3 force-pushes were me failing to rebase the branch and forgetting the integration tests; apologies about that.

@nyuware force-pushed the 754-duplicate-entries branch 5 times, most recently from 31dcbc0 to 0f38e67 on March 21, 2024 13:55
@e3krisztian force-pushed the 754-duplicate-entries branch from 0f38e67 to e1dbe74 on March 22, 2024 10:42
@qkaiser force-pushed the 754-duplicate-entries branch from e1dbe74 to 1ba3e04 on March 25, 2024 06:22
@qkaiser enabled auto-merge March 25, 2024 06:23
@qkaiser merged commit 695b59f into main Mar 25, 2024
13 checks passed
@qkaiser deleted the 754-duplicate-entries branch March 25, 2024 06:25

Successfully merging this pull request may close: FileExistsError during extraction with Netgear firmware (#754)