Program stalls/can't download entire blog #8

Open
ddescent opened this issue Jul 2, 2023 · 14 comments

Comments

ddescent commented Jul 2, 2023

Downloaded 3 days ago, have been trying since then and don't know what I'm doing wrong. I know pretty much nothing about Python or any coding language so this is all pretty new to me.

I've tried all of these variations of the command:
tumblr_backup.py -i --save-video --save-audio --tag-index blog-name
tumblr_backup.py --save-video --save-audio --tag-index blog-name
tumblr_backup.py -i --save-video --save-audio --tag-index -p year blog-name
tumblr_backup.py -i --save-video --save-audio --tag-index -p year-month blog-name
tumblr_backup.py --save-video --save-audio --tag-index -p year-month blog-name

The first two would work at first but eventually resulted in a stall, with a message like "downloading 7000 to 7050" that never moved again. I saw people saying this would be fixed with the -p option, so I tried that. It worked for most of my blog (2016 to 2020), but I got the same stall once I tried 2021. So then I tried adding the month to the command. After some frustration with the program telling me "Stopping backup: Incremental backup complete, 0 posts backed up", I took out the -i option and that seemed to work. But now I am stuck again, this time on the message "Waiting for worker threads to finish." I don't know what's causing these stalls or how to fix them. I had seen some people saying it could be caused by the fancy/colored text offered in more recent Tumblr updates, but the post that seemed to stall one of my "year-month" attempts didn't have any of that; it was just an image.

cebtenzzre (Owner) commented Jul 3, 2023

If you can apply this patch, either by hand or with GNU patch (copy it to a text file, including the whitespace at the end, and run patch -Np1 -i /path/to/saved/patch in the same directory as tumblr_backup.py), it will tell me which threads are getting stuck and where, instead of just stopping at "Waiting for worker threads to finish".

This assumes 10 seconds should be enough for everything to finish, but if you're more patient, you could try changing the number on the timeout = time.time() + 10 line to maybe 20 or 30 for a more accurate result.

diff --git a/tumblr_backup.py b/tumblr_backup.py
index d9fb4ea..292fbc7 100755
--- a/tumblr_backup.py
+++ b/tumblr_backup.py
@@ -1520,7 +1520,7 @@ class ThreadPool:
         self.queue = LockedQueue(threading.RLock(), max_queue)
         self.quit = threading.Event()
         self.abort = threading.Event()
-        self.threads = [threading.Thread(target=self.handler) for _ in range(thread_count)]
+        self.threads = [threading.Thread(target=self.handler, daemon=True) for _ in range(thread_count)]
         for t in self.threads:
             t.start()
 
@@ -1540,9 +1540,16 @@ class ThreadPool:
     def cancel(self):
         self.abort.set()
         no_internet.destroy()
+
+        import traceback
+        timeout = time.time() + 10
         for i, t in enumerate(self.threads, start=1):
             logger.status('Stopping threads {}{}\r'.format(' ' * i, '.' * (len(self.threads) - i)))
-            t.join()
+            t.join(max(1, timeout - time.time()))
+        for t in self.threads:
+            if t.is_alive():
+                print(t, 'is stuck')
+                traceback.print_stack(sys._current_frames()[t.ident])
 
         logger.info('Backup canceled.\n')

ddescent closed this as completed Jul 6, 2023
ddescent (Author) commented Jul 6, 2023

Thank you for your response! Unfortunately, I haven't been able to recreate the issue because a new one has come up. The program now gets stuck with the message "DNS probe finished: No internet. Waiting...o finish", which is confusing because it says this despite my computer being connected to the internet and being able to load websites. (Sorry that it said I marked this as completed; I am apparently bad with websites too and clicked that by accident, haha)

ddescent reopened this Jul 6, 2023
cebtenzzre (Owner) commented:

Hm, that's weird. That would imply that your computer is somehow unable to reach Google DNS (8.8.8.8), which the script queries when a web request fails, in case you don't have internet. Can you ping 8.8.8.8 OK? What about dig google.com @8.8.8.8 (Linux/macOS) or nslookup google.com 8.8.8.8 (Windows)?

ddescent (Author) commented Jul 7, 2023

I didn't have any issues pinging/connecting to 8.8.8.8 with those commands.

cebtenzzre (Owner) commented:

For now you can bypass the check by adding a line to is_dns_working in util.py, like this:

 util.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/util.py b/util.py
index 3bbd5c3..dfef1dc 100644
--- a/util.py
+++ b/util.py
@@ -97,6 +97,7 @@ DNS_QUERY = b'\xf1\xe1\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00\x06google\x03com\x00\x00\x01\x00\x01'
 
 
 def is_dns_working(timeout=None):
+    return True
     try:
         with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
             if timeout is not None:

I haven't decided what to do about this yet. I suppose having a way to specify an alternate DNS server or disable the feature entirely might be useful if Google DNS isn't available. I can't think of any reason why dig or nslookup would succeed but the check in the script would fail, unless your internet connection is so slow that it takes more than 5 seconds to get a reply - maybe an option to change the timeout would help?
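
Something along those lines is what I'm imagining - an untested sketch, not actual code from util.py (the server parameter is hypothetical, and DNS_QUERY is the same raw query for google.com that util.py already defines):

import socket

# Same minimal raw DNS query for google.com that util.py defines.
DNS_QUERY = (b'\xf1\xe1\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00'
             b'\x06google\x03com\x00\x00\x01\x00\x01')

def is_dns_working(timeout=5, server='8.8.8.8'):
    # Hypothetical parameterized version: the DNS server and the timeout
    # become options instead of being hard-coded.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(DNS_QUERY, (server, 53))
            sock.recv(512)  # any reply at all means DNS is reachable
    except OSError:
        return False
    return True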

cebtenzzre (Owner) commented:

I just pushed 91d872a, which provides a --skip-dns-check option you can use to work around that issue. Let me know if you run into anything else.


Demirath commented Feb 24, 2024

I might be able to add more context, as it seems to be specific posts that throw the DNS error for me. A specific .[post id].html.[string] file will refuse to download after the error is thrown. When I wait for all the other queued files to finish (so I can tell which one it is), get the post ID, delete my reblog from Tumblr, and rerun, it continues until it hits the next one. I'm unsure what the posts have in common, but this one threw the error twice, once in a 2022 reblog and once in a 2021 reblog: https://www.tumblr.com/bunjywunjy/669018562974957568/petermorwood-caitlynlynch-the1920sinpictures

cebtenzzre (Owner) commented Feb 24, 2024

I might be able to add more context, as it seems to be specific posts which throw the DNS error for me.

This is known - the script only attempts to check for a working internet connection when some network request fails. I had assumed that basically everyone with a working internet connection would be able to send a DNS query to Google, but apparently this is not true - some people are simply unable to e.g. dig google.com @8.8.8.8 (Linux/macOS) or nslookup google.com 8.8.8.8 (Windows), despite otherwise having functioning internet access.

I think the only reason this DNS request would (falsely) fail would be if your internet connection is aggressively firewalled, e.g. because you are using a VPN client that tries to prevent leaks of DNS traffic onto the public internet. Does that apply to you?

I suppose this should be changed to a simple HTTP request - perhaps a HEAD request to Tumblr's homepage.
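
Roughly like this - an untested sketch (the is_internet_working name is a placeholder, not anything in the codebase yet):

import urllib.error
import urllib.request

def is_internet_working(timeout=5):
    # HEAD request to Tumblr's homepage. Any HTTP response, even an
    # error status, proves the connection works; only a network-level
    # failure (DNS, TCP, TLS, timeout) suggests we are offline.
    req = urllib.request.Request('https://www.tumblr.com/', method='HEAD')
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            pass
    except urllib.error.HTTPError:
        return True  # got an HTTP response, so the network is up
    except OSError:  # URLError is a subclass of OSError
        return False
    return True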

cebtenzzre reopened this Feb 24, 2024
Demirath commented:

No, as far as I know my internet connection is completely VPN-free.

crispin-cas9 commented:

I'm currently having the same problem as OP originally had when I try to back up my blog - it stalls at around 7700/51000. No DNS error messages on my end, though. I assume it must be getting stuck on a particular post. Any thoughts on how I could try to bypass it? Would the same fixes suggested earlier in the thread be worth trying?


hibiscera commented Feb 26, 2024

Also seconding having the same problem as OP: my backup gets consistently stuck at 25200/33725, all four times I've tried to back up the blog! I also tried by year and immediately got the stall once I tried 2012.

aureliawisenri commented:

Also having the same problem - on two of my sub-1k-post sideblogs, everything was fine, but when I moved on to the first of my more moderately sized sideblogs, it started consistently stalling at 2250 to 2299 (of 4449 expected).

Mental-Heretic commented:

I also have this issue when trying to back up a larger blog. I had originally run the command to back up only the original posts, and that worked fine, but when trying to back up all of it, the program became a lot slower and stalled.

cebtenzzre (Owner) commented:

This hasn't occurred for me yet, but I made a version with stall detection that you can run if you are seeing this.

From an e-mail I sent to one user:

I thought about this and realized that there wasn't a good way to get useful info about what threads are stalled without modifying the code. So I found the time to make a new version that should help debug this.

If you run pip install -U https://github.com/cebtenzzre/tumblr-utils/archive/stall-debug.zip, you should get a version of tumblr-backup (pip show tumblr-backup will report version 1.0.6.dev0) that contains some extra debugging code for this purpose.

When you run this version of tumblr-backup, it will create a log file called tumblr-backup-log.txt in the current directory. It logs each post that is backed up, and if it detects that no progress is made for five minutes, it will write a dump of stack traces to the end of this file (starting with something like "Timeout (0:05:00)!" followed by a list of threads, files, and line numbers).

If you are able to reproduce the stall with this version of tumblr-backup, please send me that file - especially if it contains the stack traces at the end. As the timeout is conservatively set to five minutes, you will need to wait until there is no more output from tumblr-backup for at least five minutes to get the full debug info.
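
For anyone curious how the stall detection works: it's essentially a watchdog thread. This is just a rough sketch of the idea, not the actual stall-debug code (the StallWatchdog class and its method names are made up for illustration):

import sys
import threading
import time
import traceback

class StallWatchdog:
    def __init__(self, logfile='tumblr-backup-log.txt', timeout=300):
        self.logfile = logfile
        self.timeout = timeout  # five minutes by default
        self.last_progress = time.monotonic()
        threading.Thread(target=self._watch, daemon=True).start()

    def progress(self, message):
        # Called whenever a post is backed up; logs it and resets the timer.
        self.last_progress = time.monotonic()
        with open(self.logfile, 'a') as f:
            f.write(message + '\n')

    def _watch(self):
        while True:
            time.sleep(10)
            stalled_for = time.monotonic() - self.last_progress
            if stalled_for >= self.timeout:
                with open(self.logfile, 'a') as f:
                    f.write('Timeout ({:.0f}s)! Dumping thread stacks:\n'
                            .format(stalled_for))
                    for thread in threading.enumerate():
                        frame = sys._current_frames().get(thread.ident)
                        if frame is not None:
                            f.write('\n{}:\n'.format(thread))
                            traceback.print_stack(frame, file=f)
                self.last_progress = time.monotonic()  # avoid repeated dumps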
