Tutorial
Download some wiki backups from Available Backups
This tutorial explains how to back up a wiki using the WikiTeam tools. You can back up your own wiki or any public wiki.
Documentation for developers is available at http://wikiteam.readthedocs.io
If you have shell access on the web server, follow the instructions on this page: http://www.mediawiki.org/wiki/Manual:DumpBackup.php for an XML backup. For an image backup, just copy the image directory (see your LocalSettings.php for details) or use dumpUploads.php to produce a list of files that you can feed to tar or a similar tool.
If you have no shell access, then use the WikiTeam dumpgenerator.py (download), available in the repository.
There are thousands of wikis you can help archive, for which our wonderful friends at WikiApiary could not find a dump.
You will need Python: http://www.python.org/download/
The dump generator will need to be run from a DOS or GNU/Linux command-line. If you encounter issues with missing dependencies, follow this help guide.
There are two types of backups that can be made: XML dumps (current and history) and image dumps. (But you can do both in one dump - see below.)
An XML dump contains the meta-data of the edits (author, date, comment) and the text (wikitext). An XML dump may be "current" or "history". A "history" dump contains the complete history of every page, which is better for historical and research purposes and is the default. A "current" dump contains only the last edit for every page.
An image dump contains all the images available in a wiki, plus their descriptions.
There are two ways to start a backup: through the API (api.php) or through index.php. The API is the better method, but index.php can be used when the API is not available. (Note: not all MediaWiki wikis have api.php.) To find out whether the wiki you want to back up has api.php, open the wiki's Main Page in a browser and click the "View history" tab. You will see a URL such as this: http://en.wikipedia.org/w/index.php?title=Main_Page&action=history . Now edit the URL: remove everything after /w/ and put api.php in its place (in the example, the result would be http://en.wikipedia.org/w/api.php). If you see a page with the API documentation, copy the URL to the clipboard. If not, check the URL and try again; if it still fails, the wiki probably does not offer the API, so use index.php instead.
If there is no api.php, then you'll need to use index.php. For the correct URL, in the example above, remove everything after the "?" and copy the URL to the clipboard. (In the example above the correct URL would be: http://en.wikipedia.org/w/index.php)
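The URL editing described above can be sketched in Python. This is only an illustration: the function name `candidate_endpoints` is hypothetical, and it assumes that api.php sits in the same directory as index.php, which is true for the example but not guaranteed for every wiki. You still need to open the resulting URL in a browser to confirm that the API is there.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_endpoints(history_url):
    """Derive candidate api.php and index.php URLs from a "View history" URL.

    Hypothetical helper: it only rewrites the URL, it does not check
    that either endpoint actually exists on the wiki.
    """
    parts = urlsplit(history_url)
    # Keep the path up to and including the last "/" (for example "/w/").
    base_path = parts.path.rsplit("/", 1)[0] + "/"
    api_url = urlunsplit((parts.scheme, parts.netloc, base_path + "api.php", "", ""))
    # For index.php, drop everything from the "?" onwards.
    index_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return api_url, index_url


api_url, index_url = candidate_endpoints(
    "http://en.wikipedia.org/w/index.php?title=Main_Page&action=history"
)
print(api_url)
print(index_url)
```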
Note: DO NOT try to dump the Wikipedia site! It would result in a file that would quite likely be in the terabyte range. We are just using the Wikipedia URL as an example.
At a DOS prompt or GNU/Linux command line, type the following:
If the wiki you want to back up has api.php, use this:
python dumpgenerator.py --api=http://en.wikipedia.org/w/api.php --xml
If you need to use index.php, then use this:
python dumpgenerator.py --index=http://en.wikipedia.org/w/index.php --xml
For a complete dump with both xml and images, use both '--xml' and '--images' in the same command, like this:
python dumpgenerator.py --api=http://en.wikipedia.org/w/api.php --xml --images
The --xml option, by default, downloads the full history of every page. If you only want the last revision for every page, you will need to use --xml --curonly
To reduce the demands of downloading wiki pages one after another, it is recommended that you use a delay. Use --delay to add any length of delay you want (in seconds). For example, a delay of 5 seconds between each request: --delay=5
To resume an interrupted dump, use the parameters --resume and --path to indicate where the incomplete dump is located. The command will look something like this:
python dumpgenerator.py --api=http://en.wikipedia.org/w/api.php --xml --images --resume --path=dumpdirectory
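If you start dumps from your own scripts, the options described above can be assembled into an argument list. The sketch below uses only the flags mentioned in this tutorial (--api, --index, --xml, --curonly, --images, --delay, --resume, --path); the helper `build_dump_command` is hypothetical and not part of the WikiTeam tools, and requiring exactly one of `api` or `index` is a simplification made here for clarity.

```python
def build_dump_command(api=None, index=None, images=False, curonly=False,
                       delay=None, resume_path=None):
    """Return the dumpgenerator.py command line as a list of arguments.

    Hypothetical helper; it only builds the list and runs nothing.
    """
    # Simplification for this sketch: pass either api or index, not both.
    if (api is None) == (index is None):
        raise ValueError("pass exactly one of api or index")

    command = ["python", "dumpgenerator.py"]
    if api is not None:
        command.append("--api=" + api)
    else:
        command.append("--index=" + index)

    command.append("--xml")          # full history by default
    if curonly:
        command.append("--curonly")  # only the last revision of every page
    if images:
        command.append("--images")
    if delay is not None:
        command.append("--delay=%d" % delay)  # seconds between requests
    if resume_path is not None:
        command += ["--resume", "--path=" + resume_path]
    return command


command = build_dump_command(
    api="http://en.wikipedia.org/w/api.php", images=True, delay=5
)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` from the directory that contains dumpgenerator.py.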
When you start a new dump, the script will create a directory (or folder) in the form of: domainorg-20110520-wikidump . Sometimes there will be a sub-directory in the name, or "wiki" added in - that's OK.
If you want to check the XML dump integrity (optional), there are two possible methods:
1. You can try to import it into a MediaWiki (using this method, which may be slow if the wiki is large).
or
2. Type this into your command line (GNU/Linux) to count the lines containing the `<title>`, `<page>`, `</page>`, `<revision>` and `</revision>` XML tags:
grep "<title>" *.xml -c;grep "<page>" *.xml -c;grep "</page>" *.xml -c;grep "<revision>" *.xml -c;grep "</revision>" *.xml -c
You should see something similar to this (not the actual numbers - the first three numbers should be the same and the last two should be the same as each other):
580
580
580
5677
5677
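If grep is not available (for example on Windows), the same check can be sketched in Python. Like `grep -c`, this counts lines that contain each tag rather than individual occurrences. The function names are hypothetical, and the sample lines are made up for the demonstration; for a real dump you would pass an open file instead.

```python
TAGS = ("<title>", "<page>", "</page>", "<revision>", "</revision>")


def count_tag_lines(lines, tags=TAGS):
    """Count, for each tag, the number of lines that contain it."""
    counts = dict.fromkeys(tags, 0)
    for line in lines:
        for tag in tags:
            if tag in line:
                counts[tag] += 1
    return counts


def looks_consistent(counts):
    """The first three counts should match, and so should the last two."""
    return (counts["<title>"] == counts["<page>"] == counts["</page>"]
            and counts["<revision>"] == counts["</revision>"])


# Made-up sample standing in for the lines of an XML dump.
sample = [
    "<page>",
    "<title>Example</title>",
    "<revision>",
    "</revision>",
    "</page>",
]
sample_counts = count_tag_lines(sample)
print(sample_counts, looks_consistent(sample_counts))
```

For a real dump, something like `with open("dump.xml", encoding="utf-8", errors="replace") as handle: counts = count_tag_lines(handle)` reads the file line by line without loading it all into memory (`dump.xml` is a placeholder name).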
If you don't mind losing the corrupt pages, exclude them with this script (TO DO).
Otherwise, if you noted where the download was interrupted and had to be resumed (by taking a screenshot or similar), you can search the XML for that name and check that the <page> and </page> tags are opened and closed properly around the page and around the two pages above and below it.
If you encounter corruption, the usual pattern is that a page is missing its closing </page> tag and has been downloaded a second time, so that the two copies run together as a sort of Frankenstein page. You can close the tag properly and delete the second copy of the page. On the original wiki, you can visit the Special:Export page by hand, look up these 3 pages and compare them with the dump to check that everything is as it should be (hashes not being transferred for "File:" pages seems normal).
After the wiki dump is complete, you will need to archive the files in the wikidump directory. The WikiTeam project uses the 7-Zip file archiver; 7z is a compressed archive file format. We use it because it is free software (GNU LGPL license) and has a high compression ratio. Read about it in Wikipedia. There is a version for GNU/Linux (sudo apt-get install p7zip). There is also a graphical front-end.
You can use a graphical front-end to compress and archive the files, or run one of these commands:
- For XML only:
- For the whole folder including all files and subfolders:
It is suggested that you use the following file-naming convention: domainorg-20110520-history.xml - where "domainorg" is the domain name, "20110520" is the date the dump was started. (You can use the domain name and date from the directory that was created when the dump was started.)
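As an illustration of the naming convention, the sketch below builds such a file name from a domain and a start date. It assumes that "domainorg" is simply the domain name with the dots removed, which matches the example but is a guess about the general rule; as noted earlier, the directory created by the script sometimes contains extra parts, so prefer the name of that directory when in doubt. The function name `history_dump_name` is hypothetical.

```python
from datetime import date


def history_dump_name(domain, started):
    """Build a name like domainorg-20110520-history.xml.

    Assumption: the prefix is the domain with its dots removed.
    """
    prefix = domain.replace(".", "")
    return "%s-%s-history.xml" % (prefix, started.strftime("%Y%m%d"))


name = history_dump_name("domain.org", date(2011, 5, 20))
print(name)
```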
launcher.py does all this automatically for you; especially useful for multiple wikis at once (it takes a text list of wikis as input): see "Download a list of wikis" below.
And of course, you then have to release your dumps! You can easily do so on the Internet Archive's wikiteam collection. Login or register first.
Normally, you can just use uploader.py (especially if you have multiple wikis): the script takes the filename of a list of wikis as argument and uploads their dumps to archive.org. If you've used launcher.py or followed the instructions above closely, you only need to:
- Retrieve your S3 keys and save them one per line (in the order provided) in a `keys.txt` file in the same directory as uploader.py;
- Run the script like `python uploader.py mywikis`, where `mywikis` is the filename of a list of the api.php URL(s) of the wiki(s) to upload, one per line.
- Create a new item for each wiki.
- Use "wiki-DOMAIN" for the page URL and "Wiki - NAME" for the page title (technically, item identifier and title respectively).
- Add "MediaWiki; wiki; wikiteam" as subject/keywords so that it's possible to find the item.
- If possible, enter more info in the relevant fields, like the license of the wiki.
- Follow the instructions to upload the archives.
- In the last step, put the item in texts/opensource (community texts); it will then be moved to the WikiTeam collection (until then, it should be listed with this search).
- For particularly big archives you might want to use S3 (for API keys, see this page), or this tool if you have many items.
- If you use such command line tools, choose "web" as media type: it's more correct, doesn't need authorisation and will make the items appear in the web crawls collection.
- To facilitate future dumps and further metadata fixes, please add the api.php URL of the wiki to the `originalurl` field.
- Beware that if you specify a non-Creative Commons license in the licenseurl parameter, a bug will prevent you from editing the item metadata via the graphical interface afterwards, so enter all the details from the start.
A script has been created to mass download lists of wikis, initially for the TaskForce: launcher.py.
Just place it in the same directory as dumpgenerator.py and run it like this: python launcher.py mylist.txt. The list must contain URLs to the api.php of each wiki you have to download, one per line.
The script will try to dump both pages and images of all those wikis, to verify the integrity of the XML dump and to create archives ready for upload to the Internet Archive. It doesn't handle upload yet, but it will soon.
Skimming the output of the script is not very handy, but you can simply rerun it after completion to make sure that it did everything and did not abandon a download because of an error. On each run the script checks whether the XML is complete (closed by `</mediawiki>`) and whether all images have been downloaded (that is, the last one on the list has been downloaded), and resumes the dump if not; it creates the compressed archive if it has not been produced yet; and it downloads any missing wikis.
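The XML completeness check mentioned above can be approximated by looking at the end of the file. The following is a minimal sketch of that kind of check, not the code used by launcher.py; the function name `xml_is_closed` and the 1024-byte tail size are assumptions made for the example, and the two files are throwaway samples rather than real dumps.

```python
import os
import tempfile


def xml_is_closed(path, closing_tag=b"</mediawiki>", tail_bytes=1024):
    """Return True if the file ends with the closing tag (ignoring whitespace)."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        # Read only the tail, so that very large dumps are cheap to check.
        handle.seek(max(0, size - tail_bytes))
        tail = handle.read()
    return tail.rstrip().endswith(closing_tag)


# Demonstration on two throwaway files (not real dumps).
with tempfile.TemporaryDirectory() as workdir:
    complete = os.path.join(workdir, "complete.xml")
    truncated = os.path.join(workdir, "truncated.xml")
    with open(complete, "wb") as handle:
        handle.write(b"<mediawiki>\n<page></page>\n</mediawiki>\n")
    with open(truncated, "wb") as handle:
        handle.write(b"<mediawiki>\n<page>\n")
    complete_ok = xml_is_closed(complete)
    truncated_ok = xml_is_closed(truncated)

print(complete_ok, truncated_ok)
```

A dump that passes this check can still be corrupt in the middle, so the tag counts described in the integrity section remain useful.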
Read http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps and http://www.mediawiki.org/wiki/Manual:ImportImages.php.
Remember that you can always see the inline help with the --help parameter.
If you have further questions, you can send a message to our mailing list. If you detect any error, report an issue. Be bold!