While trying to generate an XML dump of the Touhou Wiki with dumpgenerator.py (master#d7b6924), I noticed that no XML besides the Main Page was being generated, with every other entry being marked as "missing in the wiki (probably deleted)" in the errors.log file.
Executing the script would successfully find and load all page titles from all namespaces, and then start "downloading pages":
[...]
Analysing https://en.touhouwiki.net/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
24 namespaces found
Retrieving titles in the namespace 0
28061 titles retrieved in the namespace 0
[...]
71698 page titles loaded
https://en.touhouwiki.net/api.php
Retrieving the XML for every page from "start"
Downloaded 10 pages
[...]
But pausing the script and checking the errors.log file would result in:
2022-02-28 21:26:22: The page "!?" was missing in the wiki (probably deleted)
2022-02-28 21:26:22: The page ""Activity" Case:04 -Cosmic Horoscope-" was missing in the wiki (probably deleted)
2022-02-28 21:26:22: The page ""Activity" Case:01 -Graveyard Memory-" was missing in the wiki (probably deleted)
2022-02-28 21:26:22: The page ""Activity" Case:02 -Nightmare Counselor-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:03 -Historical Vacation-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:05 -Forgotten Paradise-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:06 -Shining Future-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:07 -Dominated Realism-" was missing in the wiki (probably deleted)
2022-02-28 21:26:24: The page ""Activity" Case:08 -Midnight Syndrome-" was missing in the wiki (probably deleted)
2022-02-28 21:26:24: The page ""Everflowering" Masterpieces of Hatsunetsumiko's 2011 - 2013" was missing in the wiki (probably deleted)
2022-02-28 21:26:24: The page ""Everything but the Girl" Hatsunetsumiko's Dance Vocal Collection Vol.2" was missing in the wiki (probably deleted)
These pages actually exist, though. Looking into it further, I was able to generate an XML dump (albeit with just one revision per page, as the wiki's API does not seem to support more) by changing the script's code to make a GET request, instead of a POST request, during the XML extraction process. More precisely:
--- a/dumpgenerator.py
+++ b/dumpgenerator.py
@@ -579,7 +579,7 @@ def getXMLPageCore(headers={}, params={}, config={}, session=None):
             return ''  # empty xml
         # FIXME HANDLE HTTP Errors HERE
         try:
-            r = session.post(url=config['index'], params=params, headers=headers, timeout=10)
+            r = session.get(url=config['index'], params=params, headers=headers, timeout=10)
             handleStatusCode(r)
             xml = fixBOM(r)
         except requests.exceptions.ConnectionError as e:
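The change is easy to reproduce in isolation. With requests, a `params` dict is serialized into the URL's query string for both verbs, so the only on-the-wire difference between the two calls is the HTTP method (and the POST's empty body), which some webservers apparently mishandle for index.php exports. A minimal sketch (the Special:Export parameters here are illustrative, not copied from dumpgenerator.py):

```python
import requests

# Parameters of the kind dumpgenerator.py sends to index.php for an export
# (illustrative values; the real script builds them from its config).
params = {'title': 'Special:Export', 'pages': 'Main Page', 'action': 'submit'}
url = 'https://en.touhouwiki.net/index.php'

post = requests.Request('POST', url, params=params).prepare()
get = requests.Request('GET', url, params=params).prepare()

# Both requests target the exact same URL; only the verb differs,
# and the POST carries no body because `params` went into the query string.
assert post.url == get.url
assert post.body is None
print(post.method, get.method)  # POST GET
```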
This seems to work because doing a POST returned XML without a closing </page> tag for the page, which would result in a PageMissingError during this code section:
while doing a GET returns the page XML with the closing tag, so it is saved to the main XML file.
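The detection described above can be approximated like this (a sketch, not the actual dumpgenerator.py code; this PageMissingError is a stand-in for the script's own exception class):

```python
class PageMissingError(Exception):
    """Stand-in for dumpgenerator.py's exception of the same name."""

def extract_page_xml(xml):
    # A truncated export (e.g. from a POST the server mishandles) lacks the
    # closing </page> tag, so the page is treated as missing/deleted.
    if '</page>' not in xml:
        raise PageMissingError('no closing </page> tag in the export')
    return xml[xml.index('<page>'):xml.index('</page>') + len('</page>')]

complete = '<mediawiki><page><title>!?</title></page></mediawiki>'
truncated = '<mediawiki><page><title>!?</title>'

print(extract_page_xml(complete))  # <page><title>!?</title></page>
try:
    extract_page_xml(truncated)
except PageMissingError as e:
    print('missing:', e)
```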
I'm not really sure why this is the case, or even if making this kind of change would break the dump generation for other wikis (I tested with the InstallGentoo Wiki as well and the XML dump seemed to work just fine).
I'm not really sure why this is the case, or even if making this kind of change would break the dump generation for other wikis
Yes, it does. We try to guess which method to use for each wiki, but in
the end we can't account for the quirks of every webserver. Maybe we
could add a command line option.
Federico
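Such an option could look like the following (a hypothetical --http-method flag sketched with argparse; dumpgenerator.py does not currently have it):

```python
import argparse

parser = argparse.ArgumentParser(description='dump generator (sketch)')
# Hypothetical flag: let the user override the guessed HTTP method
# instead of relying on per-webserver autodetection.
parser.add_argument('--http-method', choices=['get', 'post'], default='post',
                    help='HTTP method to use when fetching page XML from index.php')
args = parser.parse_args(['--http-method', 'get'])

# The fetch code could then dispatch on the chosen method, e.g.:
# fetch = session.get if args.http_method == 'get' else session.post
print(args.http_method)  # get
```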