Program stalls/can't download entire blog #8

Open
ddescent opened this issue Jul 2, 2023 · 14 comments

Comments

ddescent commented Jul 2, 2023

Downloaded 3 days ago, have been trying since then and don't know what I'm doing wrong. I know pretty much nothing about Python or any coding language so this is all pretty new to me.

I've tried all of these variations of the command:
tumblr_backup.py -i --save-video --save-audio --tag-index blog-name
tumblr_backup.py --save-video --save-audio --tag-index blog-name
tumblr_backup.py -i --save-video --save-audio --tag-index -p year blog-name
tumblr_backup.py -i --save-video --save-audio --tag-index -p year-month blog-name
tumblr_backup.py --save-video --save-audio --tag-index -p year-month blog-name

The first two would work at first but eventually resulted in a stall, with a message like "downloading 7000 to 7050" that never moved again. I saw people saying this would be fixed with the -p option, so I tried that. It worked for most of my blog (2016 to 2020), but I got the same stall once I tried 2021. So then I tried adding the month to the command. After some frustration with the program telling me "Stopping backup: Incremental backup complete, 0 posts backed up", I took out the -i option and that seemed to work. But now I am stuck again, this time on the message "Waiting for worker threads to finish." I don't know what's causing these stalls or how to fix them. I had seen some people saying it could be caused by the fancy/colored text offered in more recent Tumblr updates, but the post that seemed to stall one of my "year-month" attempts didn't have any of that; it was just an image.

cebtenzzre (Owner) commented Jul 3, 2023

If you can apply this patch, either by hand or with GNU patch (copy it to a text file, including the whitespace at the end, and run patch -Np1 -i /path/to/saved/patch in the same directory as tumblr_backup.py), it will tell me which threads are getting stuck and where, instead of just stopping at "Waiting for worker threads to finish".

This assumes 10 seconds should be enough for everything to finish, but if you're more patient, you could try changing the number on the timeout = time.time() + 10 line to maybe 20 or 30 for a more accurate result.

diff --git a/tumblr_backup.py b/tumblr_backup.py
index d9fb4ea..292fbc7 100755
--- a/tumblr_backup.py
+++ b/tumblr_backup.py
@@ -1520,7 +1520,7 @@ class ThreadPool:
         self.queue = LockedQueue(threading.RLock(), max_queue)
         self.quit = threading.Event()
         self.abort = threading.Event()
-        self.threads = [threading.Thread(target=self.handler) for _ in range(thread_count)]
+        self.threads = [threading.Thread(target=self.handler, daemon=True) for _ in range(thread_count)]
         for t in self.threads:
             t.start()
 
@@ -1540,9 +1540,16 @@ class ThreadPool:
     def cancel(self):
         self.abort.set()
         no_internet.destroy()
+
+        import traceback
+        timeout = time.time() + 10
         for i, t in enumerate(self.threads, start=1):
             logger.status('Stopping threads {}{}\r'.format(' ' * i, '.' * (len(self.threads) - i)))
-            t.join()
+            t.join(max(1, timeout - time.time()))
+        for t in self.threads:
+            if t.is_alive():
+                print(t, 'is stuck')
+                traceback.print_stack(sys._current_frames()[t.ident])
 
         logger.info('Backup canceled.\n')

ddescent closed this as completed Jul 6, 2023
ddescent (Author) commented Jul 6, 2023

Thank you for your response! Unfortunately, I haven't been able to recreate the issue because a new one has come up. The program now gets stuck with the message "DNS probe finished: No internet. Waiting...o finish", which is confusing because it says this despite my computer being connected to the internet and being able to load websites. (Sorry that it said I marked this as completed; I am apparently bad with websites too and clicked that by accident, haha)

ddescent reopened this Jul 6, 2023
cebtenzzre (Owner) commented:

Hm, that's weird. That would imply that your computer is somehow unable to reach Google DNS (8.8.8.8), which the script queries when a web request fails, in case you don't have internet. Can you ping 8.8.8.8 OK? What about dig google.com @8.8.8.8 (Linux/macOS) or nslookup google.com 8.8.8.8 (Windows)?

ddescent (Author) commented Jul 7, 2023

I didn't have any issues pinging/connecting to 8.8.8.8 with those commands.

cebtenzzre (Owner) commented:

For now you can bypass the check by adding a line to is_dns_working in util.py, like this:

 util.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/util.py b/util.py
index 3bbd5c3..dfef1dc 100644
--- a/util.py
+++ b/util.py
@@ -97,6 +97,7 @@ DNS_QUERY = b'\xf1\xe1\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00\x06google\x03com\x00\x00\x01\x00\x01'
 
 
 def is_dns_working(timeout=None):
+    return True
     try:
         with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
             if timeout is not None:

I haven't decided what to do about this yet. I suppose having a way to specify an alternate DNS server or disable the feature entirely might be useful if Google DNS isn't available. I can't think of any reason why dig or nslookup would succeed but the check in the script would fail, unless your internet connection is so slow that it takes more than 5 seconds to get a reply - maybe an option to change the timeout would help?
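
Something along those lines is what I'm imagining - an untested sketch, not actual code from util.py (the server parameter is hypothetical, and DNS_QUERY is the same raw query for google.com that util.py already defines):

import socket

# Same minimal raw DNS query for google.com that util.py defines.
DNS_QUERY = (b'\xf1\xe1\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00'
             b'\x06google\x03com\x00\x00\x01\x00\x01')

def is_dns_working(timeout=5, server='8.8.8.8'):
    # Hypothetical parameterized version: the DNS server and the timeout
    # become options instead of being hard-coded.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(DNS_QUERY, (server, 53))
            sock.recv(512)  # any reply at all means DNS is reachable
    except OSError:
        return False
    return True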

cebtenzzre (Owner) commented:

I just pushed 91d872a, which provides a --skip-dns-check option you can use to work around that issue. Let me know if you run into anything else.


Demirath commented Feb 24, 2024

I might be able to add more context, as it seems to be specific posts that throw the DNS error for me. A specific .[post id].html.[string] file will refuse to download after the error is thrown. When I wait for all the other queued files to finish (so I can tell which one it is), get the post ID, delete my reblog from Tumblr, and rerun, it continues until it hits the next one. I'm unsure what the posts have in common, but this one threw the error twice, once in a 2022 reblog and once in a 2021 reblog: https://www.tumblr.com/bunjywunjy/669018562974957568/petermorwood-caitlynlynch-the1920sinpictures

cebtenzzre (Owner) commented Feb 24, 2024

I might be able to add more context, as it seems to be specific posts which throw the DNS error for me.

This is known - the script only attempts to check for a working internet connection when some network request fails. I had assumed that basically everyone with a working internet connection would be able to send a DNS query to Google, but apparently this is not true - some people are simply unable to e.g. dig google.com @8.8.8.8 (Linux/macOS) or nslookup google.com 8.8.8.8 (Windows), despite otherwise having functioning internet access.

I think the only reason this DNS request would (falsely) fail would be if your internet connection is aggressively firewalled, e.g. because you are using a VPN client that tries to prevent leaks of DNS traffic onto the public internet. Does that apply to you?

I suppose this should be changed to a simple HTTP request - perhaps a HEAD request to Tumblr's homepage.
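
Roughly like this - an untested sketch (the is_internet_working name is a placeholder, not anything in the codebase yet):

import urllib.error
import urllib.request

def is_internet_working(timeout=5):
    # HEAD request to Tumblr's homepage. Any HTTP response, even an
    # error status, proves the connection works; only a network-level
    # failure (DNS, TCP, TLS, timeout) suggests we are offline.
    req = urllib.request.Request('https://www.tumblr.com/', method='HEAD')
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            pass
    except urllib.error.HTTPError:
        return True  # got an HTTP response, so the network is up
    except OSError:  # URLError is a subclass of OSError
        return False
    return True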

cebtenzzre reopened this Feb 24, 2024
Demirath commented:

No, as far as I know my internet connection is completely VPN-free.

crispin-cas9 commented:

I'm currently having the same problem as OP originally had when I try to back up my blog - it stalls at around 7700/51000. No DNS error messages on my end, though. I assume it must be getting stuck on a particular post. Any thoughts on how I could try to bypass it? Would the same fixes suggested earlier in the thread be worth trying?


hibiscera commented Feb 26, 2024

Also seconding having the same problem as OP: my backup gets consistently stuck at 25200/33725, all four times I've tried to back up the blog! I also tried by year and immediately got the stall once I tried 2012.

aureliawisenri commented:

Also having the same problem - on two of my sub-1k-post sideblogs, everything was fine, but when I moved on to the first of my more moderately sized sideblogs, it started consistently stalling at 2250 to 2299 (of 4449 expected).

Mental-Heretic commented:

I also have this issue when trying to back up a larger blog. I had originally run the command to back up only the original posts, and that worked fine, but when trying to back up all of it, the program became a lot slower and stalled.

cebtenzzre (Owner) commented:

This hasn't occurred for me yet, but I made a version with stall detection that you can run if you are seeing this.

From an e-mail I sent to one user:

I thought about this and realized that there wasn't a good way to get useful info about what threads are stalled without modifying the code. So I found the time to make a new version that should help debug this.

If you run pip install -U https://github.com/cebtenzzre/tumblr-utils/archive/stall-debug.zip, you should get a version of tumblr-backup (pip show tumblr-backup will report version 1.0.6.dev0) that contains some extra debugging code for this purpose.

When you run this version of tumblr-backup, it will create a log file called tumblr-backup-log.txt in the current directory. It logs each post that is backed up, and if it detects that no progress is made for five minutes, it will write a dump of stack traces to the end of this file (starting with something like "Timeout (0:05:00)!" followed by a list of threads, files, and line numbers).

If you are able to reproduce the stall with this version of tumblr-backup, please send me that file - especially if it contains the stack traces at the end. As the timeout is conservatively set to five minutes, you will need to wait until there is no more output from tumblr-backup for at least five minutes to get the full debug info.
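
For anyone curious how the stall detection works: it's essentially a watchdog thread. This is just a rough sketch of the idea, not the actual stall-debug code (the StallWatchdog class and its method names are made up for illustration):

import sys
import threading
import time
import traceback

class StallWatchdog:
    def __init__(self, logfile='tumblr-backup-log.txt', timeout=300):
        self.logfile = logfile
        self.timeout = timeout  # five minutes by default
        self.last_progress = time.monotonic()
        threading.Thread(target=self._watch, daemon=True).start()

    def progress(self, message):
        # Called whenever a post is backed up; logs it and resets the timer.
        self.last_progress = time.monotonic()
        with open(self.logfile, 'a') as f:
            f.write(message + '\n')

    def _watch(self):
        while True:
            time.sleep(10)
            stalled_for = time.monotonic() - self.last_progress
            if stalled_for >= self.timeout:
                with open(self.logfile, 'a') as f:
                    f.write('Timeout ({:.0f}s)! Dumping thread stacks:\n'
                            .format(stalled_for))
                    for thread in threading.enumerate():
                        frame = sys._current_frames().get(thread.ident)
                        if frame is not None:
                            f.write('\n{}:\n'.format(thread))
                            traceback.print_stack(frame, file=f)
                self.last_progress = time.monotonic()  # avoid repeated dumps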
