API-only export option (without Special:Export) #311

nemobis · 2018-05-07T22:26:36Z

Wanted for various reasons. Current implementation: --xmlrevisions, false by default. If the default method to download wikis doesn't work for you, please try using the flag --xmlrevisions and let us know how it went.
https://groups.google.com/forum/#!topic/wikiteam-discuss/ba2K-WeRJ-0

Previous takes:
#195
#280

The text was updated successfully, but these errors were encountered:

nemobis · 2018-05-07T22:28:26Z

Does not yet work for Wikia, partly because they return a blank page for exportnowrap used in getXMLHeader(). Have to use wikitools there as well?

  File "./dumpgenerator.py", line 2195, in <module>
    main()
  File "./dumpgenerator.py", line 2187, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1756, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 717, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 792, in getXMLRevisions
    for page in result['query']['allrevisions']:
KeyError: 'query'
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

nemobis · 2018-05-09T05:15:58Z

Before even downloading the first revisions, there is some wiki where the export gets stuck in an endless loop of "Invalid JSON response. Trying the request again" or similar message:

Analysing http://www.haplozone.net/wiki/index.php
Trying generating a new dump into a new directory...
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again

nemobis · 2018-05-18T21:30:12Z

Now tested with a 1.12 wiki,http://meritbadge.org/wiki/index.php/Main_Page , courtesy https://lists.wikimedia.org/pipermail/wikitech-l/2018-May/090004.html : 27cbdfd 680145e

nemobis · 2018-05-18T22:04:26Z

For Wikia, the API export works without exportnowrap: http://00eggsontoast00.wikia.com/api.php?action=query&prop=revisions&meta=siteinfo&titles=Main%20Page&export&format=json

But facepalm, where the API help says "Export the current revisions of all given or generated pages" it really means that any revision other than the current one is ignored: http://00eggsontoast00.wikia.com/api.php?action=query&revids=3|80|85&export is the same as http://00eggsontoast00.wikia.com/api.php?action=query&revids=85&export

nemobis · 2018-05-19T15:53:25Z

Here we go: 7143f7e

It's very fast on most wikis, because it makes way less requests if your average number of revisions per page is less than 50.

The first dump produced with this method is: https://archive.org/download/wiki-ferstaberindecom_f2_en/ferstaberindecom_f2_en-20180519-history.xml.7z

nemobis · 2018-05-19T19:39:14Z

And now also Wikia, without the allrevisions module: https://archive.org/details/wiki-00eggsontoast00wikiacom

The XML built "manually" with --xmlrevisions is almost the same as usual (at the cost of making at least one request per page), but it's missing parentid and at the moment minoredit.

nemobis · 2018-05-19T21:49:10Z

Analysing http://nimiarkisto.fi/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
29 namespaces found
    Retrieving titles in the namespace 0
.Traceback (most recent call last):
  File "./dumpgenerator.py", line 2288, in <module>
    main()
  File "./dumpgenerator.py", line 2280, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1844, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "./dumpgenerator.py", line 416, in getPageTitles
    for title in titles:
  File "./dumpgenerator.py", line 292, in getPageTitlesAPI
    allpages = jsontitles['query']['allpages']
KeyError: 'query'

nemobis · 2018-05-21T06:04:09Z

In testing this for Wikia, remember that the number of edits on Special:Statistics isn't always truthful (this is normal on MediaWiki). For instance http://themodifyers.wikia.com/wiki/Special:Statistics says 2333 edits, but dumpgenerator.py exports 1864, and that's the right amount: entering all the titles on themodifyers.wikia.com/wiki/Special:Export and exporting all revisions gives the same amount.

Also, a page with 53 revisions on that wiki was correctly exported, which means that API continuation works; that's something!

nemobis · 2018-05-21T20:14:05Z

Not sure what's going on at http://zh.asoiaf.wikia.com/api.php

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2308, in <module>
    main()
  File "./dumpgenerator.py", line 2300, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1864, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "./dumpgenerator.py", line 429, in getPageTitles
    for title in titles:
  File "./dumpgenerator.py", line 252, in getPageTitlesAPI
    config=config, session=session)
TypeError: 'NoneType' object is not iterable
tail: cannot open 'zhasoiafwikiacom-20180521-wikidump/zhasoiafwikiacom-20180521-history.xml' for reading: No such file or directory
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

http://zhpad.wikia.com/api.php seems to eventually fail as well

nemobis · 2018-05-22T06:51:26Z

Next step: implementing resuming. I'll probably take the readTitles() part out of getXMLRevisions() to make things clearer.

I think it would be the occasion to make sure that we log something to error.log when we catch an exception or call sys.exit(1), so that it's easier to inspect failed dumps and see what happened when they stopped. I have almost 4k interrupted Wikia dumps.

nemobis · 2018-05-25T06:43:45Z

Later I'll post a series of errors.log from failed dumps.

For now I tend to believe that, when the dump runs to the end, the XML really is as complete as possible. For instance, on a biggish wiki like http://finalfantasy.wikia.com/wiki/Special:Statistics :

$ grep -c "<revision>" finalfantasywikiacom-20180523-history.xml
1638424
$ grep -c "<page>" finalfantasywikiacom-20180523-history.xml
311259

That's over a million "missing" revisions compared to what Special:Statistics says, which however cannot really be trusted. The number of pages is pretty close.

On the other hand, it could be that the continuation is not working in some cases... In clubpenguinwikiacom-20180523-history.xml, I'm not sure I see the 3200 revisions that the main page ought to have.

Otherwise the query continuation may fail and only the top revisions will be exported. Tested with Wikia: http://clubpenguin.wikia.com/api.php?action=query&prop=revisions&titles=Club_Penguin_Wiki Also add parentid since it's available after all. #311 (comment)

nemobis · 2018-05-27T16:04:07Z

Some wiki might be in a loop...

1062 more revisions exported
1060 more revisions exported
1061 more revisions exported
1061 more revisions exported
1062 more revisions exported
1061 more revisions exported
1062 more revisions exported
1060 more revisions exported
1061 more revisions exported
1062 more revisions exported

Or not: it seems legit, some bot is editing a series of pages every day. http://runescape.wikia.com/wiki/Module:Exchange/Dragon_crossbow_(u)/Data?limit=1000&action=history

nemobis · 2020-02-08T09:44:59Z

Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)

Getting the XML header from the API
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 5 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 10 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 15 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 20 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 25 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 30 seconds

* It was just an old trick to get past some barriers which were waived with GET. * It's not conformant and doesn't play well with some redirects. * Some recent wikis seem to not like it at all, see also issue WikiTeam#311.

nemobis · 2020-02-09T11:09:03Z

Sometimes allpages works until it doesn't:

Analysing http://xn--b1amah.xn--d1ad.xn--p1ai/w/api.php

Warning!: "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump" path exists
There is a dump in "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the current session will be ignored
and the parameters available in "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump/config.txt" will be loaded.
Do you want to resume ([yes, y], [no, n])? n
You have selected: NO
Trying to use path "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump-2"...
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
16 namespaces found
    Retrieving titles in the namespace 0
..    602 titles retrieved in the namespace 0
    Retrieving titles in the namespace 1
.    1 titles retrieved in the namespace 1
    Retrieving titles in the namespace 2
.    3 titles retrieved in the namespace 2
    Retrieving titles in the namespace 3
.    3 titles retrieved in the namespace 3
    Retrieving titles in the namespace 4
.The allpages API returned nothing. Exit.

nemobis · 2020-02-09T11:13:04Z

How nice some webservers are:

Titles saved at... halachipediacom-20200209-titles.txt
2364 page titles loaded
http://www.halachipedia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 503: Service Unavailable trying request again in 5 seconds
HTTPError: HTTP Error 503: Service Unavailable trying request again in 10 seconds
HTTPError: HTTP Error 503: Service Unavailable trying request again in 15 seconds
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found trying request again in 20 seconds
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found trying request again in 25 seconds

nemobis · 2020-02-09T11:13:58Z

Gotta check for actual presence of the export field in the response:

Titles saved at... aroundisleofwightinfo-20200209-titles.txt
3230 page titles loaded
http://www.aroundisleofwight.info/api.php
Getting the XML header from the API
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2323, in <module>
    main()
  File "./dumpgenerator.py", line 2315, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1882, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 731, in generateXMLDump
    header, config = getXMLHeader(config=config, session=session)
  File "./dumpgenerator.py", line 471, in getXMLHeader
    xml = r.json()['query']['export']['*']
KeyError: 'export'
tail: cannot open ‘aroundisleofwightinfo-20200208-wikidump/aroundisleofwightinfo-20200208-history.xml’ for reading: No such file or directory
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

nemobis · 2020-02-09T11:15:24Z

HTTP 405:

Titles saved at... wikiainigmaeu-20200209-titles.txt
139 page titles loaded
http://wiki.ainigma.eu/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 5 seconds
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 10 seconds
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 15 seconds

nemobis · 2020-02-09T11:19:32Z

Or even the query:

Titles saved at... masu6fsk-20200209-titles.txt
247 page titles loaded
http://masu.6f.sk/api.php
Getting the XML header from the API
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2323, in <module>
    main()
  File "./dumpgenerator.py", line 2315, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1882, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 731, in generateXMLDump
    header, config = getXMLHeader(config=config, session=session)
  File "./dumpgenerator.py", line 471, in getXMLHeader
    xml = r.json()['query']['export']['*']
KeyError: 'query'

nemobis · 2020-02-10T09:24:16Z

HTTP Error 493 :o

Titles saved at... opendiagnostixorg-20200210-titles.txt
28095 page titles loaded
http://opendiagnostix.org/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
16 namespaces found
Trying to export all revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
/home/federico/.local/lib/python2.7/site-packages/wikitools/api.py:155: FutureWarning: The querycontinue option is deprecated and will be removed
in a future release, use the new queryGen function instead
for queries requring multiple requests
  for queries requring multiple requests""", FutureWarning)
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
3 more revisions exported
1 more revisions exported
1 more revisions exported
4 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
5 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
HTTPError: HTTP Error 493: Forbidden WAF trying request again in 5 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 10 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 15 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 20 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 25 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 30 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 35 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 40 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 45 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 50 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 55 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 60 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 65 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 70 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 75 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 80 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 85 seconds

nemobis · 2020-02-10T09:28:19Z

Examples of wikis where --xmlrevisions didn't work and and dumpgenerator had to be killed manually:

http://encyclopaedia.herdereditorial.com/w/api.php
http://gobblerpedia.org/w/api.php
http://opendiagnostix.org/api.php
http://semantic.wiki/wiki-de/api.php
http://wiki.elitesoft.com.br/api.php
http://wiki.rabenthal.net/api.php
http://www.insult.wiki/w/api.php
http://nichework.com/w/api.php
http://mediawiki.xn--klarmachen-ndert-5nb.de/api.php
http://www.archi-wiki.org/api.php
http://www.gremiopedia.com/api.php
http://whythisway.org/w/api.php
http://wiki.delia-derbyshire.net/api.php
http://www.harmfrielink.nl/wiki/api.php
http://en.wiki.spotwizard.org/api.php|
http://fgo.wiki/api.php
http://nordicnames.de/w/api.php
http://nichework.com/w/api.php
http://roksao.com/api.php
http://secret-wiki.de/mediawiki/api.php
https://evilbabes.fandom.com/api.php
http://overwiki.ru/api.php
http://wiki.ainigma.eu/api.php
http://wiki.debianforum.de/wiki/api.php
http://wiki.dcinside.com/api.php
http://www.halachipedia.com/api.php
http://www.icp.uni-stuttgart.de/~icp/mediawiki/api.php
http://www.tinymicros.com/mediawiki/api.php

nemobis · 2020-02-10T11:33:18Z

I'm not quite sure why this happens in my latest local code, will need to check:

  <page>
    <title>Main Page</title>
    <ns>0</ns>
    <id>1</id>
    <redirect title="Main page" />
    <revision>
      <id>3677</id>
      <parentid>1</parentid>
      <timestamp>2018-12-19T22:15:31Z</timestamp>
      <contributor>
        <username>Wiki-admin</username>
        <id>45</id>
      </contributor>
      <comment>Redirected page to [[Main page]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve" bytes="23">#REDIRECT [[Main_page]]</text>
      <sha1>o2jw5c565achwt31azfnu9zc2zxgqpr</sha1>
    </revision>
  </page>
<page>
  <title>Main Page</title>
  <ns>0</ns>
  <id>1</id>
  <revision>
    <id>3677</id>
    <parentid>1</parentid>
    <timestamp>2018-12-19T22:15:31Z</timestamp>
    <contributor>
      <id>45</id>
      <username>Wiki-admin</username>
    </contributor>
    <comment>Redirected page to [[Main page]]</comment>
    <text bytes="23" space="preserve">#REDIRECT [[Main_page]]</text>
    <model>wikitext</model>
    <sha1>ce111c28c158bacd1ad89fbacb33e48d0e2e383f</sha1>
  </revision>
  <revision>
    <id>1</id>
    <parentid>0</parentid>
    <timestamp>2018-12-13T21:14:03Z</timestamp>
    <contributor>
      <id>0</id>
      <username>MediaWiki default</username>
    </contributor>
    <comment></comment>
    <text bytes="735" space="preserve">&lt;strong&gt;MediaWiki has been installed.&lt;/strong&gt;

Consult the [https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents User's Guide] for information on using the wiki software.

== Getting started ==
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Configuration_settings Configuration settings list]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ MediaWiki FAQ]
* [https://lists.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Localisation#Translation_resources Localise MediaWiki for your language]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Combating_spam Learn how to combat spam on your wiki]</text>
    <model>wikitext</model>
    <sha1>5702e4d5fd9173246331a889294caf01a3ad3706</sha1>
  </revision>
</page>

nemobis · 2020-02-10T12:27:41Z

28095 page titles loaded
http://opendiagnostix.org/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2363, in <module>
    main()
  File "./dumpgenerator.py", line 2355, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1922, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 755, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 814, in getXMLRevisions
    site = mwclient.Site(apiurl.netloc, apiurl.path.replace("api.php", ""))
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 131, in __init__
    self.site_init()
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 153, in site_init
    retry_on_error=False)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 235, in get
    return self.api(action, 'GET', *args, **kwargs)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 286, in api
    info = self.raw_api(action, http_method, **kwargs)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 434, in raw_api
    http_method=http_method)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 395, in raw_call
    stream = self.connection.request(http_method, url, **args)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 486, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 598, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 370, in send
    timeout=timeout
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 544, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 314, in _raise_timeout
    if 'timed out' in str(err) or 'did not complete (read)' in str(err):  # Python 2.6
TypeError: __str__ returned non-string (type SysCallError)
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

nemobis · 2020-02-10T13:44:48Z

mwclient doesn't seem to handle retries very well, need to check:

Traceback (most recent call last):
  File "dumpgenerator.py", line 2375, in <module>
    
  File "dumpgenerator.py", line 2367, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1934, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 755, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "dumpgenerator.py", line 875, in getXMLRevisions
    exportrequest = site.api(**exportparams)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 286, in api
    info = self.raw_api(action, http_method, **kwargs)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 434, in raw_api
    http_method=http_method)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 395, in raw_call
    stream = self.connection.request(http_method, url, **args)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='gobblerpedia.org', port=443): Read timed out. (read timeout=30)

nemobis · 2020-02-10T15:49:55Z

Seems fine now on a MediaWiki 1.16 wiki. There are some differences in what we get for some optional fields like parentid, userid, size of a revision; and our XML made by etree is less eager to escape Unicode characters. Hopefully doesn't matter, although we should ideally test an import on a recent MediaWiki.
wikirabenthalnet-20200210-history-test.zip

nemobis · 2020-02-10T20:04:48Z

HTTP Error 493 :o

This comes and goes, could try adding to status_forcelist together with 406 seen for other wikis.

http://masu.6f.sk/api.php

Here we can do little, the index.php and api.php responses confuse the script but indeed there isn't much we can do as even the most basic response gets a DB error:

internal_api_error_DBQueryError
http://masu.6f.sk/api.php?action=query&meta=siteinfo&siprop=general

HTTPError: HTTP Error 405: Method Not Allowed trying request again in 5 seconds

This is not helped by setting http_method="GET" (https://mwclient.readthedocs.io/en/latest/reference/site.html#mwclient.client.Site.api ). It's a MediaWiki 1.21.1 wiki so allrevisions is not available, but the HTTPError prevented the exception from making us switch to the next strategy. Once we catch that, it works via GET: 49017e3 . Ideally we'd need to check this only once at the beginning, but it seems that the webservers do not want to afford us this luxury.

http://www.aroundisleofwight.info/api.php

This is a misconfigured wiki, see #355 (comment)

http://www.halachipedia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again

This one now (MediaWiki 1.31.1) gives:

http://www.halachipedia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
20 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
This mwclient version seems not to work for us. Exiting.

Sometimes allpages works until it doesn't:
Analysing http://xn--b1amah.xn--d1ad.xn--p1ai/w/api.php

Still broken (MediaWiki 1.23)

Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)

Still broken (MediaWiki 1.27).

Analysing http://nimiarkisto.fi/w/api.php

Still broken (MediaWiki 1.31)

nemobis · 2020-02-13T07:15:52Z

The number of revisions cannot always be a multiple of 50 (example from https://villainsrpg.fandom.com/ ):

    Eve Man                                                                                                                                                                                    4 more revisions exported
    Event Horizon
50 more revisions exported
50 more revisions exported
    Evil
50 more revisions exported
50 more revisions exported
    Existence (Secret)
9 more revisions exported
Downloaded 400 pages
    Extinction
10 more revisions exported

It should be 51 in https://villainsrpg.fandom.com/wiki/Evil?offset=20111224190533&action=history
We're getting 49 revisions again and then the 1 we were missing. Not a big deal but not ideal either.

Ouch no, we were not using the new batch at all. Ahem.

nemobis · 2020-02-13T12:23:34Z

The XML doesn't validate against the respective schema:

$ xmllint --schema ../export-0.10.xsd --noout girlfriend_karifandomcom-20200213-history.xml
...
girlfriend_karifandomcom-20200213-history.xml:76504: element text: Schemas validity error : Element '{http://www.mediawiki.org/xml/export-0.10/}text': This element is not expected. Expected is one of ( {http://www.mediawiki.org/xml/export-0.10/}minor, {http://www.mediawiki.org/xml/export-0.10/}comment, {http://www.mediawiki.org/xml/export-0.10/}model ).
girlfriend_karifandomcom-20200213-history.xml fails to validate

But then even the vanilla Special:Export output doesn't. Makes me sad.

$ xmllint --schema export-0.10.xsd --noout /tmp/Girlfriend+Kari+Wiki-20200213070422.xml
/tmp/Girlfriend+Kari+Wiki-20200213070422.xml:52: element text: Schemas validity error : Element '{http://www.mediawiki.org/xml/export-0.10/}text': This element is not expected. Expected is one of ( {http://www.mediawiki.org/xml/export-0.10/}minor, {http://www.mediawiki.org/xml/export-0.10/}comment, {http://www.mediawiki.org/xml/export-0.10/}model ).
/tmp/Girlfriend+Kari+Wiki-20200213070422.xml fails to validate
$ xmllint --version
xmllint: using libxml version 20909
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma

nemobis · 2020-02-13T14:00:35Z

http://xn--b1amah.xn--d1ad.xn--p1ai/w/api.php

Fine now

Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)

Fixed with API limit 50 at b162e7b

Analysing http://nimiarkisto.fi/w/api.php

Fixed with automatic switch to HTTPS at d543f7d

nemobis · 2020-02-14T10:24:17Z

Still have to implement resume:

Analysing https://gundam.fandom.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "File:G Saviour Bugu2 rear view.JPG"
https://gundam.fandom.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
40 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Warning. Could not use allrevisions. Wiki too old?
Getting titles to export all the revisions of each
    "Kurenai Musha" Red Warrior Amazing
        1 more revisions exported
    ...So We Meet Again
        5 more revisions exported
    0-Riser
        3 more revisions exported

It should just be a matter of passing start to getXMLRevisions() in generateXMLDump().

nemobis · 2020-02-14T11:23:39Z

I'm happy to see that we sometimes receive less than the requested 50 revisions and nothing bad happens:

"This result was truncated because it would otherwise be larger than the limit of 8388608 bytes"

nemobis · 2020-02-15T13:16:03Z

nothing bad happens

Except that they didn't check whether they had revisions bigger than that:
https://pvx.fandom.com/wiki/User_talk:PVX-Misfate?offset=20071116000000&limit=20&action=history

nemobis · 2020-02-24T17:33:42Z

Hm, I wonder why so many errors on this MediaWiki 1.25 wiki (the XML became half of the previous round) https://archive.org/download/wiki-wikimarionorg/wikimarionorg-20200224-history.xml.7z/errors.log

nemobis · 2020-03-02T11:39:31Z

        2 more revisions exported
'*'
Traceback (most recent call last):
  File "dumpgenerator.py", line 2528, in <module>
    main()
  File "dumpgenerator.py", line 2518, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 2165, in resumePreviousDump
    session=other['session'])
  File "dumpgenerator.py", line 727, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session, start=start):
  File "dumpgenerator.py", line 829, in getXMLRevisions
    yield makeXmlFromPage(page)
  File "dumpgenerator.py", line 1083, in makeXmlFromPage
    raise PageMissingError(page['title'], e)
__main__.PageMissingError: page 'DevStack' not found

nemobis · 2020-03-07T21:52:07Z

http://www.veikkos-archiv.com/api.php fails completely

nemobis · 2020-03-07T21:53:13Z

Simple command with which I found some XML files which were actually empty (only the header):

find -maxdepth 1 -type f -name "*7z" -size -500k -print0 | xargs -0 -P32 -n1 7z l | grep xml | grep -E " [0-9]{4} " | grep -Ev " [0-9]{5,} "  | grep -Eo "[^ ]+$" | sed 's,.xml$,.xml.7z,g'

nemobis added enhancement help wanted labels May 7, 2018

nemobis added this to the 0.4 milestone May 7, 2018

nemobis mentioned this issue May 22, 2018

Add support to backup pages using API:Query instead of Special:Export #280

Closed

nemobis mentioned this issue May 25, 2018

Fall back from queryGen() to query() on versions with querycontinue only alexz-enwp/wikitools#56

Open

nemobis mentioned this issue Feb 7, 2020

Print titles of pages which dumpgenerator.py failed to download for MemoryError or other fatals #282

Closed

nemobis self-assigned this Feb 7, 2020

nemobis pinned this issue Feb 7, 2020

nemobis mentioned this issue Feb 7, 2020

Large histories memory error #8

Open

nemobis mentioned this issue Feb 8, 2020

Use GET rather than POST for API requests #349

Merged

nemobis mentioned this issue Feb 10, 2020

An error have occurred while retrieving "Main_Page" #318

Closed

API-only export option (without Special:Export) #311

API-only export option (without Special:Export) #311

Comments

nemobis commented May 7, 2018 • edited Loading

nemobis commented May 7, 2018

nemobis commented May 9, 2018 • edited Loading

nemobis commented May 18, 2018

nemobis commented May 18, 2018

nemobis commented May 19, 2018

nemobis commented May 19, 2018

nemobis commented May 19, 2018

nemobis commented May 21, 2018

nemobis commented May 21, 2018

nemobis commented May 22, 2018

nemobis commented May 25, 2018

nemobis commented May 27, 2018

nemobis commented Feb 8, 2020 • edited Loading

nemobis commented Feb 9, 2020

nemobis commented Feb 9, 2020

nemobis commented Feb 9, 2020

nemobis commented Feb 9, 2020

nemobis commented Feb 9, 2020

nemobis commented Feb 10, 2020

nemobis commented Feb 10, 2020

nemobis commented Feb 10, 2020

nemobis commented Feb 10, 2020

nemobis commented Feb 10, 2020

nemobis commented Feb 10, 2020

nemobis commented Feb 10, 2020

nemobis commented Feb 13, 2020 • edited Loading

nemobis commented Feb 13, 2020

nemobis commented Feb 13, 2020

nemobis commented Feb 14, 2020

nemobis commented Feb 14, 2020

nemobis commented Feb 15, 2020

nemobis commented Feb 24, 2020

nemobis commented Mar 2, 2020

nemobis commented Mar 7, 2020

nemobis commented Mar 7, 2020

nemobis commented May 7, 2018 •

edited

Loading

nemobis commented May 9, 2018 •

edited

Loading

nemobis commented Feb 8, 2020 •

edited

Loading

nemobis commented Feb 13, 2020 •

edited

Loading