diff --git a/doc/widgets/SciHubator.rst b/doc/widgets/SciHubator.rst
new file mode 100644
index 00000000..f2ee005c
--- /dev/null
+++ b/doc/widgets/SciHubator.rst
@@ -0,0 +1,179 @@
+.. meta::
+   :description: Orange3 Textable Prototypes documentation, SciHubator widget
+   :keywords: Orange3, Textable, Prototypes, documentation, SciHubator, widget
+
+.. _SciHubator:
+
+SciHubator
+==============
+
+.. image:: https://github.com/sarahperettipoix/orange3-textable-prototypes/blob/master/orangecontrib/textable_prototypes/widgets/icons/scihubator.png
+
+Download PDF files from `Sci-HUB `_ and extract their textual content into segmentations.
+
+Authors
+-------
+Peretti-Poix Sarah, Borgeaud Matthias, Chétioui Orsowen, Luginbühl Colin
+
+Signals
+-------
+
+Inputs: ``None``
+
+  None
+
+
+Outputs: ``Text data``
+
+  Segmentation covering the content of downloaded PDF files
+
+Requirements
+------------
+
+* Orange 3.38.1
+* Orange Textable 3.2.2
+* scidownl (provides ``scihub_download``)
+* pdfplumber
+
+Description
+-----------
+
+This widget downloads PDF files from the Sci-Hub project and outputs their content
+as an annotated text segmentation.
+
+
+Basic interface
+~~~~~~~~~~~~~~~
+
+In its basic version,
+the **SciHubator** widget is limited to the import of a single DOI.
+The interface contains a **Source** section enabling the user to type the DOI.
+
+.. _SciHubator_basicinterface:
+
+.. figure:: https://github.com/sarahperettipoix/orange3-textable-prototypes/blob/master/specs/images/scihubator_minimal.png
+    :align: center
+    :alt: Basic interface of the SciHubator widget
+
+    Figure 1: **SciHubator** widget (basic interface).
+
+Note that pdfplumber might not work properly with non-Latin alphabets
+and serif typefaces.
+
+The **Send** button triggers the emission of a segmentation to the output
+connection(s). When it is selected, the **Send automatically** checkbox
+disables the button and the widget attempts to automatically emit a
+segmentation at every modification of its interface.
+
+The text below the **Send** button indicates the number of characters in the single
+segment contained in the output segmentation, or the reasons why no
+segmentation is emitted (no input data, encoding issue, etc.).
+
+Advanced interface
+~~~~~~~~~~~~~~~~~~
+
+The advanced version of **SciHubator** allows the user to type several DOIs
+in a determined order; each extracted text can moreover be divided into
+specific sections (introduction, main body and bibliography) with specific
+annotations. The emitted segmentation contains a segment
+for each imported file.
+
+.. _scihubator_advancedinterface:
+
+.. figure:: https://github.com/sarahperettipoix/orange3-textable-prototypes/blob/master/specs/images/scihubator_principal.png
+    :align: center
+    :alt: Advanced interface of the SciHubator widget
+    :scale: 80%
+
+    Figure 2: **SciHubator** widget (advanced interface).
+
+The advanced interface presents similarities with that of the **URLs** and **Segment**
+widgets. The **Sources** section allows the user to select the input
+DOI(s). The list
+of imported files appears at the top of the window; the columns of this list
+indicate (a) the name of each file, (b) the corresponding annotation (if any),
+and (c) the encoding with which each is associated.
+
+The first buttons on the right of the imported files' list enable the user to
+modify the order in which they appear in the output segmentation (**Move Up**
+and **Move Down**), to delete a file from the list (**Remove**) or to
+completely empty it (**Clear All**). Except for **Clear All**, all these
+buttons require the user to first select an entry from the list.
+
+The **Send** button triggers the emission of a segmentation to the output
+connection(s). When it is selected, the **Send automatically** checkbox
+disables the button and the widget attempts to automatically emit a
+segmentation at every modification of its interface.
+
+The text below the **Send** button indicates the length of the output segmentation in
+characters, or the reasons why no segmentation is emitted (no selected file,
+encoding issue, etc.). In the example, the two segments corresponding to the
+imported files thus total 1'262'145 characters.
+
+Messages
+--------
+
+Information
+~~~~~~~~~~~
+
+*Data correctly sent to output: <n> segments (<m> characters).*
+    This confirms that the widget has operated properly.
+
+*Settings were* (or *Input has*) *changed, please click 'Send' when ready.*
+    Settings and/or input have changed but the **Send automatically** checkbox
+    has not been selected, so the user is prompted to click the **Send**
+    button (or equivalently check the box) in order for computation and data
+    emission to proceed.
+
+*No data sent to output yet: no DOI selected.*
+    The widget instance is not able to emit data to output because no input
+    DOI has been selected.
+
+*No data sent to output yet, see 'Widget state' below.*
+    A problem with the instance's parameters and/or input data prevents it
+    from operating properly, and additional diagnostic information can be
+    found in the **Widget state** box at the bottom of the instance's
+    interface (see `Warnings`_ and `Errors`_ below).
+
+*Duplicate DOI(s) found and deleted.*
+    A duplicate DOI was found in the DOI list.
+    The add operation is halted so that no duplicates appear in the list.
+
+
+Warnings
+~~~~~~~~
+
+*Please enter one or many valid DOIs.*
+    Sci-Hub requires a valid DOI to process a request.
+    This warning indicates that nothing was typed in the DOI field.
+
+*Not all sections were segmented*
+    The regex was not able to segment the content retrieved for certain DOIs.
+
+*Step 1/3: Pre-processing...*
+    The PDF is being downloaded.
+*Step 2/3: Processing...*
+    The PDF is being converted to raw text.
+*Step 3/3: Post-processing...*
+    Segmentations are applied to the text.
+
+
+
+
+Errors
+~~~~~~
+
+*SciHub inaccessible - verify your connexion.*
+    Please verify your internet connection or check whether `Sci-HUB `_ is down.
+
+*An error occurred when downloading.*
+    The PDF could not be downloaded; please try again.
+
+*Error occurred when reading PDF:*
+    An unexpected error occurred when reading the downloaded PDF. Please try again; if the error persists, the document for this DOI may not be compatible.
+
+*Download failed. Please, verify DOI or connexion.*
+    Sci-Hub is accessible but SciHubator could not download the PDF. The connection may have dropped during the download, or the DOI provided is not valid.
+
+
+
diff --git a/orangecontrib/textable_prototypes/widgets/DemoSciHub.py b/orangecontrib/textable_prototypes/widgets/DemoSciHub.py
new file mode 100644
index 00000000..84488f38
--- /dev/null
+++ b/orangecontrib/textable_prototypes/widgets/DemoSciHub.py
@@ -0,0 +1,372 @@
+#/(?<=\n)\n((biblio|r(e|é)f)\w*\W*\n)(.|\n)*/
+#/(Abstract.+?\n{1,})((.|\n)*)(?=\n\n)/gmi
+"""
+Class DemoSciHUB
+Copyright 2025 University of Lausanne
+-----------------------------------------------------------------------------
+This file is part of the Orange3-Textable-Prototypes package.
+
+Orange3-Textable-Prototypes is free software: you can redistribute
+it and/or modify it under the terms of the GNU General Public License
+as published by the Free Software Foundation, either version 3 of the
+License, or (at your option) any later version.
+
+Orange3-Textable-Prototypes is distributed in the hope that it will
+be useful, but WITHOUT ANY WARRANTY; without even the implied warranty
+of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with Orange3-Textable-Prototypes. If not, see
+<http://www.gnu.org/licenses/>.
+"""
+
+__version__ = u"0.0.1"
+__author__ = "Sarah Peretti-Poix, Borgeaud Matthias, Chétioui Orsowen, Luginbühl Colin"
+__maintainer__ = "Aris Xanthos"
+__email__ = "aris.xanthos@unil.ch"
+
+
+from functools import partial
+import time
+import tempfile
+from scidownl import scihub_download
+import pdfplumber
+import os
+import requests
+
+from _textable.widgets.TextableUtils import (
+    OWTextableBaseWidget, VersionedSettingsHandler, ProgressBar,
+    InfoBox, SendButton, pluralize, Task
+)
+
+from LTTL.Segmentation import Segmentation
+#from LTTL.Input import Input
+
+# Using the threaded version of LTTL.Segmenter to create
+# a "responsive" widget.
+import LTTL.SegmenterThread as Segmenter
+
+from Orange.widgets import widget, gui, settings
+from Orange.widgets.utils.widgetpreview import WidgetPreview
+from LTTL.Input import Input
+
+
+class DemoSciHUB(OWTextableBaseWidget):
+    """Demo Orange3-Textable widget"""
+
+    name = "Demo Scihub"
+    description = "Export a text segmentation from a DOI or URL"
+    icon = "icons/scihubator.png"
+    priority = 99
+
+    # Input and output channels (remove if not needed)...
+    #inputs = [("Segmentation", Segmentation, "inputData")]
+    outputs = [("New segmentation", Segmentation)]
+
+    # Copied verbatim in every Textable widget to facilitate
+    # settings management.
+    settingsHandler = VersionedSettingsHandler(
+        version=__version__.rsplit(".", 1)[0]
+    )
+
+    # Settings...
+    DOIContent = settings.Setting("")
+    #numberOfSegments = settings.Setting("10")
+
+    want_main_area = False
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.inputSegmentationLength = 0
+        # The following attribute is required by every widget
+        # that imports new strings into Textable.
+        self.createdInputs = list()
+
+        self.infoBox = InfoBox(widget=self.controlArea)
+        self.sendButton = SendButton(
+            widget=self.controlArea,
+            master=self,
+            callback=self.sendData,
+            cancelCallback=self.cancel_manually,
+            infoBoxAttribute="infoBox",
+        )
+
+        # GUI...
+        # Top-level GUI boxes are created using method
+        # create_widgetbox(), so that they are automatically
+        # enabled/disabled when processes are running.
+        sourceBox = self.create_widgetbox(
+            box=u'Options',
+            orientation='vertical',
+            addSpace=False,
+        )
+
+        # GUI elements can be assigned to variables or even
+        # attributes (e.g. self.DOIContentLineEdit) if
+        # they must be referred to elsewhere, e.g., to enable
+        # or disable them, etc. It is not the case below.
+        gui.lineEdit(
+            widget=sourceBox,
+            master=self,
+            value="DOIContent",
+            orientation="horizontal",
+            label="DOI:",
+            labelWidth=130,
+            # self.sendButton.settingsChanged should be used
+            # in cases where using a GUI element should result
+            # in sending data to output. If it should result in
+            # other operations being done, use a custom method
+            # instead, and at the end of it, if data should be
+            # sent to output, call self.sendButton.settingsChanged().
+            # If using the GUI element should not result in
+            # anything at that moment, delete the "callback"
+            # parameter.
+            callback=self.sendButton.settingsChanged,
+            tooltip=(
+                "One or more DOIs, separated "
+                "by commas."
+            ),
+        )
+
+        # Stretchable vertical spacing between "options"
+        # and Send button etc.
+        gui.rubber(self.controlArea)
+
+        # Draw send button & Info box...
+        self.sendButton.draw()
+        self.infoBox.draw()
+
+        # Send data if needed.
+        self.sendButton.settingsChanged()
+
+    def sendData(self):
+        """Perform every required check and operation
+        before calling the method that does the actual
+        processing.
+        """
+
+        if self.DOIContent == "":
+            # Use mode "warning" when user needs to do some
+            # action or provide some information; use mode "error"
+            # when invalid parameters have been provided;
+            # for notifications that don't require user action,
+            # don't use a mode. Use formulations that emphasize
+            # what should be done rather than what is wrong or
+            # missing.
+            self.infoBox.setText("Please type a valid DOI.",
+                                 "warning")
+            # Make sure to send None and return if the widget
+            # cannot operate properly at this point.
+            self.send("New segmentation", None)
+            return
+
+        # If the widget creates new LTTL.Input objects (i.e.
+        # if it imports new strings in Textable), make sure to
+        # clear previously created Inputs with this method.
+        self.clearCreatedInputs()
+
+        # Notify processing in infobox. Typically, there should
+        # always be a "processing" step, with optional "pre-
+        # processing" and "post-processing" steps before and
+        # after it. If there are no optional steps, notify
+        # "Preprocessing...".
+        self.infoBox.setText("Step 1/2: Pre-processing...", "warning")
+
+        # Progress bar should be initialized at this point.
+        self.progressBarInit()
+
+        # Create a threaded function to do the actual processing
+        # and specify its arguments (here there are none).
+        threaded_function = partial(
+            self.processData,
+            # argument1,
+            # argument2,
+            # ...
+        )
+
+        # Run the threaded function...
+        self.threading(threaded_function)
+
+    def processData(self):
+        """Actual processing takes place in this method,
+        which is run in a worker thread so that GUI stays
+        responsive and operations can be cancelled
+        """
+
+        # At start of processing, set progress bar to 1%.
+        # Within this method, this is done using the following
+        # instruction.
+        self.signal_prog.emit(1, False)
+
+        DOIList = [doi.strip() for doi in self.DOIContent.split(",")]
+        #DOIList.append(self.DOIContent)
+
+        # Indicate the total number of iterations that the
+        # progress bar will go through (e.g. number of input
+        # segments, number of selected files, etc.), then
+        # set current iteration to 1.
+        max_itr = len(DOIList)
+        cur_itr = 1
+
+        # Check that Sci-Hub is reachable before downloading
+        if not test_scihub_accessible():
+            self.sendNoneToOutputs()
+            self.infoBox.setText("SciHub inaccessible - verify your connexion", 'error')
+            return
+        # Actual processing...
+
+        # For each progress bar iteration...
+        tempdir = tempfile.TemporaryDirectory()
+        for DOI in DOIList:
+
+            # Update progress bar manually...
+            self.signal_prog.emit(int(100*cur_itr/max_itr), False)
+            cur_itr += 1
+
+            # Code added here: download the PDF for this DOI
+            paper = DOI
+            paper_type = "doi"
+            out = f"{tempdir.name}/{DOIList.index(DOI)}"
+            try:
+                scihub_download(paper, paper_type=paper_type, out=out)
+            except Exception as ex:
+                print(ex)
+                self.sendNoneToOutputs()
+                self.infoBox.setText("An error occurred when downloading", 'error')
+                return
+            # Cancel operation if requested by user...
+            time.sleep(0.00001)  # Needed somehow!
+            if self.cancel_operation:
+                self.signal_prog.emit(100, False)
+                return
+
+        # Update infobox and reset progress bar...
+        self.signal_text.emit("Step 2/2: Processing...",
+                              "warning")
+        cur_itr = 0
+        self.signal_prog.emit(0, True)
+        for DOI in DOIList:
+            DOIText = ""
+            if os.path.exists(f"{tempdir.name}/{DOIList.index(DOI)}.pdf"):
+                try:
+                    with pdfplumber.open(f"{tempdir.name}/{DOIList.index(DOI)}.pdf") as pdf:
+                        for page in pdf.pages:
+                            self.signal_prog.emit(int(100 * cur_itr / max_itr), False)
+                            cur_itr += (1 / len(pdf.pages))
+                            DOIText += page.extract_text() or ""
+                except Exception as e:
+                    self.sendNoneToOutputs()
+                    self.infoBox.setText(f"Error occurred when reading PDF: {str(e)}", 'error')
+                    return
+            else:
+                self.sendNoneToOutputs()
+                self.infoBox.setText("Download failed. Please, verify DOI or connexion", 'error')
+                return
+            ########
+
+            # Create an LTTL.Input...
+            if len(DOIList) == 1:
+                # self.captionTitle is the name of the widget,
+                # which will become the label of the output
+                # segmentation.
+                label = self.captionTitle
+            else:
+                label = None  # will be set later.
+            print(DOIText)
+            myInput = Input(DOIText, label)
+
+            # Extract the first (and single) segment in the
+            # newly created LTTL.Input and annotate it with
+            # the DOI it was downloaded for.
+            segment = myInput[0]
+            segment.annotations["DOI"] \
+                = DOI
+            # For the annotation to be saved in the LTTL.Input,
+            # the extracted and annotated segment must be re-assigned
+            # to the first (and only) segment of the LTTL.Input.
+            myInput[0] = segment
+
+            # Add the LTTL.Input to self.createdInputs.
+            self.createdInputs.append(myInput)
+
+            # Cancel operation if requested by user...
+            time.sleep(0.00001)  # Needed somehow!
+            if self.cancel_operation:
+                self.signal_prog.emit(100, False)
+                return
+        tempdir.cleanup()
+
+
+        # If there's only one LTTL.Input created, it is the
+        # widget's output...
+        if len(DOIList) == 1:
+            return self.createdInputs[0]
+
+        # Otherwise the widget's output is a concatenation...
+        else:
+            return Segmenter.concatenate(
+                caller=self,
+                segmentations=self.createdInputs,
+                label=self.captionTitle,
+                import_labels_as=None,
+            )
+
+    @OWTextableBaseWidget.task_decorator
+    def task_finished(self, f):
+        """All operations following the successful termination
+        of self.processData
+        """
+
+        # Get the result value of self.processData.
+        processed_data = f.result()
+
+        # If it is not None...
+        if processed_data:
+            message = f"{len(processed_data)} text@p sent to output."
+            message = pluralize(message, len(processed_data))
+            """numChars = 0
+            for segment in processed_data:
+                segmentLength = len(Segmentation.get_data(segment.str_index))
+                numChars += segmentLength
+            message += f"({numChars} character@p)."
+            message = pluralize(message, numChars)"""
+            self.infoBox.setText(message)
+            self.send("New segmentation", processed_data)
+
+    # The following method should be copied verbatim in
+    # every Textable widget.
+    def setCaption(self, title):
+        """Register captionTitle changes and send if needed"""
+        if 'captionTitle' in dir(self):
+            changed = title != self.captionTitle
+            super().setCaption(title)
+            if changed:
+                self.cancel()  # Cancel current operation
+                self.sendButton.settingsChanged()
+        else:
+            super().setCaption(title)
+
+    # The following two methods should be copied verbatim in
+    # every Textable widget that creates LTTL.Input objects.
+
+    def clearCreatedInputs(self):
+        """Clear created inputs"""
+        for i in self.createdInputs:
+            Segmentation.set_data(i[0].str_index, None)
+        del self.createdInputs[:]
+
+    def onDeleteWidget(self):
+        """Clear created inputs on widget deletion"""
+        self.clearCreatedInputs()
+
+
+def test_scihub_accessible():
+    try:
+        response = requests.get("https://sci-hub.se", timeout=10)
+        return response.status_code == 200
+    except requests.RequestException:
+        return False
+
+if __name__ == '__main__':
+    WidgetPreview(DemoSciHUB).run()
diff --git a/orangecontrib/textable_prototypes/widgets/DemoTextableWidget.py b/orangecontrib/textable_prototypes/widgets/DemoTextableWidget.py
new file mode 100644
index 00000000..30c058bc
--- /dev/null
+++ b/orangecontrib/textable_prototypes/widgets/DemoTextableWidget.py
@@ -0,0 +1,340 @@
+"""
+Class DemoTextableWidget
+Copyright 2025 University of Lausanne
+-----------------------------------------------------------------------------
+This file is part of the Orange3-Textable-Prototypes package.
+
+Orange3-Textable-Prototypes is free software: you can redistribute
+it and/or modify it under the terms of the GNU General Public License
+as published by the Free Software Foundation, either version 3 of the
+License, or (at your option) any later version.
+
+Orange3-Textable-Prototypes is distributed in the hope that it will
+be useful, but WITHOUT ANY WARRANTY; without even the implied warranty
+of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+ +You should have received a copy of the GNU General Public License +along with Orange3-Textable-Prototypes. If not, see + . +""" + +__version__ = '0.0.1' +__author__ = "Aris Xanthos" +__maintainer__ = "Aris Xanthos" +__email__ = "aris.xanthos@unil.ch" + + +from functools import partial +import time + +from _textable.widgets.TextableUtils import ( + OWTextableBaseWidget, VersionedSettingsHandler, ProgressBar, + InfoBox, SendButton, pluralize, Task +) + +from LTTL.Segmentation import Segmentation +from LTTL.Input import Input + +# Using the threaded version of LTTL.Segmenter to create +# a "responsive" widget. +import LTTL.SegmenterThread as Segmenter + +from Orange.widgets import widget, gui, settings +from Orange.widgets.utils.widgetpreview import WidgetPreview + + +class DemoTextableWidget(OWTextableBaseWidget): + """Demo Orange3-Textable widget""" + + name = "Demo widget" + description = "Illustrates common code behind Textable widgets" + icon = "icons/someIcon.svg" + priority = 99 + + # Input and output channels (remove if not needed)... + inputs = [("Segmentation", Segmentation, "inputData")] + outputs = [("New segmentation", Segmentation)] + + # Copied verbatim in every Textable widget to facilitate + # settings management. + settingsHandler = VersionedSettingsHandler( + version=__version__.rsplit(".", 1)[0] + ) + + # Settings... + segmentContent = settings.Setting("sample text") + numberOfSegments = settings.Setting("10") + + want_main_area = False + + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + # Attributes... + self.inputSegmentationLength = 0 + + # The following attribute is required by every widget + # that imports new strings into Textable. + self.createdInputs = list() + + self.infoBox = InfoBox(widget=self.controlArea) + self.sendButton = SendButton( + widget=self.controlArea, + master=self, + callback=self.sendData, + cancelCallback=self.cancel_manually, + infoBoxAttribute="infoBox", + ) + + # GUI... 
+ + # Top-level GUI boxes are created using method + # create_widgetbox(), so that they are automatically + # enabled/disabled when processes are running. + optionsBox = self.create_widgetbox( + box=u'Options', + orientation='vertical', + addSpace=False, + ) + + # GUI elements can be assigned to variables or even + # attributes (e.g. self.segmentContentLineEdit) if + # they must be referred to elsewhere, e.g., to enable + # or disable them, etc. It is not the case below. + gui.lineEdit( + widget=optionsBox, + master=self, + value="segmentContent", + orientation="horizontal", + label="Segment text:", + labelWidth=130, + # self.sendButton.settingsChanged should be used in + # in cases where using a GUI element should result + # in sending data to output. If it should result in + # other operations being done, use a custom method + # instead, and at the end of it, if data should be + # sent to output, call self.sendButton.settingsChanged(). + # If using the GUI element should not result in + # anything at that moment, delete the "callback" + # parameter. + callback=self.sendButton.settingsChanged, + tooltip=( + "A string that defines the content " + "each segment." + ), + ) + + gui.comboBox( + widget=optionsBox, + master=self, + value="numberOfSegments", + items=["1", "10", "100", "1000", "10000"], + sendSelectedValue=True, + orientation='horizontal', + label="Number of segments:", + labelWidth=130, + callback=self.sendButton.settingsChanged, + tooltip="Number of segments to create.", + ) + + # Stretchable vertical spacing between "options" + # and Send button etc. + gui.rubber(self.controlArea) + + # Draw send button & Info box... + self.sendButton.draw() + self.infoBox.draw() + + # Send data if needed. + self.sendButton.settingsChanged() + + def inputData(self, segmentation): + """Handle segmentation on input connection""" + + # If the input is None and it is needed for the widget + # to operate, send None to output(s) then return. 
+ # Here, the widget can still operate without input. + if segmentation is None: + self.inputSegmentationLength = 0 + else: + self.inputSegmentationLength = len(segmentation) + + # Display the standard message for "input changed". + self.infoBox.inputChanged() + + def sendData(self): + """Perform every required check and operation + before calling the method that does the actual + processing. + """ + + if self.segmentContent == "": + # Use mode "warning" when user needs to do some + # action or provide some information; use mode "error" + # when invalid parameters have been provided; + # for notifications that don't require user action, + # don't use a mode. Use formulations that emphasize + # what should be done rather than what is wrong or + # missing. + self.infoBox.setText("Please type segment content.", + "warning") + # Make sure to send None and return if the widget + # cannot operate properly at this point. + self.send("New segmentation", None) + return + + # If the widget creates new LTTL.Input objects (i.e. + # if it imports new strings in Textable), make sure to + # clear previously created Inputs with this method. + self.clearCreatedInputs() + + # Notify processing in infobox. Typically, there should + # always be a "processing" step, with optional "pre- + # processing" and "post-processing" steps before and + # after it. If there are no optional steps, notify + # "Preprocessing...". + self.infoBox.setText("Step 1/2: Processing...", "warning") + + # Progress bar should be initialized at this point. + self.progressBarInit() + + # Create a threaded function to do the actual processing + # and specify its arguments (here there are none). + threaded_function = partial( + self.processData, + # argument1, + # argument2, + # ... + ) + + # Run the threaded function... 
+ self.threading(threaded_function) + + def processData(self): + """Actual processing takes place in this method, + which is run in a worker thread so that GUI stays + responsive and operations can be cancelled + """ + + # At start of processing, set progress bar to 1%. + # Within this method, this is done using the following + # instruction. + self.signal_prog.emit(1, False) + + # Indicate the total number of iterations that the + # progress bar will go through (e.g. number of input + # segments, number of selected files, etc.), then + # set current iteration to 1. + max_itr = int(self.numberOfSegments) + cur_itr = 1 + + # Actual processing... + + # For each progress bar iteration... + for _ in range(int(self.numberOfSegments)): + + # Update progress bar manually... + self.signal_prog.emit(int(100*cur_itr/max_itr), False) + cur_itr += 1 + + # Create an LTTL.Input... + if int(self.numberOfSegments) == 1: + # self.captionTitle is the name of the widget, + # which will become the label of the output + # segmentation. + label = self.captionTitle + else: + label = None # will be set later. + myInput = Input(self.segmentContent, label) + + # Extract the first (and single) segment in the + # newly created LTTL.Input and annotate it with + # the length of the input segmentation. + segment = myInput[0] + segment.annotations["demo_annotation"] \ + = self.inputSegmentationLength + # For the annotation to be saved in the LTTL.Input, + # the extracted and annotated segment must be re-assigned + # to the first (and only) segment of the LTTL.Input. + myInput[0] = segment + + # Add the LTTL.Input to self.createdInputs. + self.createdInputs.append(myInput) + + # Cancel operation if requested by user... + time.sleep(0.00001) # Needed somehow! + if self.cancel_operation: + self.signal_prog.emit(100, False) + return + + # Update infobox and reset progress bar... 
+ self.signal_text.emit("Step 2/2: Post-processing...", + "warning") + self.signal_prog.emit(1, True) + + # If there's only one LTTL.Input created, it is the + # widget's output... + if int(self.numberOfSegments) == 1: + return self.createdInputs[0] + + # Otherwise the widget's output is a concatenation... + else: + return Segmenter.concatenate( + caller=self, + segmentations=self.createdInputs, + label=self.captionTitle, + import_labels_as=None, + ) + + @OWTextableBaseWidget.task_decorator + def task_finished(self, f): + """All operations following the successful termination + of self.processData + """ + + # Get the result value of self.processData. + processed_data = f.result() + + # If it is not None... + if processed_data: + message = f"{len(processed_data)} segment@p sent to output " + message = pluralize(message, len(processed_data)) + numChars = 0 + for segment in processed_data: + segmentLength = len(Segmentation.get_data(segment.str_index)) + numChars += segmentLength + message += f"({numChars} character@p)." + message = pluralize(message, numChars) + self.infoBox.setText(message) + self.send("New segmentation", processed_data) + + # The following method should be copied verbatim in + # every Textable widget. + def setCaption(self, title): + """Register captionTitle changes and send if needed""" + if 'captionTitle' in dir(self): + changed = title != self.captionTitle + super().setCaption(title) + if changed: + self.cancel() # Cancel current operation + self.sendButton.settingsChanged() + else: + super().setCaption(title) + + # The following two methods should be copied verbatim in + # every Textable widget that creates LTTL.Input objects. 
+
+    def clearCreatedInputs(self):
+        """Clear created inputs"""
+        for i in self.createdInputs:
+            Segmentation.set_data(i[0].str_index, None)
+        del self.createdInputs[:]
+
+    def onDeleteWidget(self):
+        """Clear created inputs on widget deletion"""
+        self.clearCreatedInputs()
+
+
+if __name__ == '__main__':
+    WidgetPreview(DemoTextableWidget).run()
diff --git a/orangecontrib/textable_prototypes/widgets/SciHubator.py b/orangecontrib/textable_prototypes/widgets/SciHubator.py
new file mode 100644
index 00000000..8ac497cd
--- /dev/null
+++ b/orangecontrib/textable_prototypes/widgets/SciHubator.py
@@ -0,0 +1,585 @@
+"""
+Class SciHubator
+Copyright 2020-2025 University of Lausanne
+-----------------------------------------------------------------------------
+This file is part of the Orange3-Textable-Prototypes package and based on the
+file OWTextableTextFiles of the Orange3-Textable package.
+
+Orange3-Textable-Prototypes is free software: you can redistribute it
+and/or modify it under the terms of the GNU General Public License as published
+by the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+Orange3-Textable-Prototypes is distributed in the hope that it will be
+useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with Orange3-Textable-Prototypes. If not, see
+<http://www.gnu.org/licenses/>.
+"""
+
+__version__ = "0.0.1"
+__author__ = "Sarah Peretti-Poix, Borgeaud Matthias, Chétioui Orsowen, Luginbühl Colin"
+__maintainer__ = "Aris Xanthos"
+__email__ = "aris.xanthos@unil.ch"
+
+# Standard imports...
+import re
+import time
+import tempfile
+import os
+
+from functools import partial
+import pdfplumber
+import requests
+from scidownl import scihub_download
+from _textable.widgets.TextableUtils import (
+    OWTextableBaseWidget,
+    InfoBox, SendButton, pluralize
+)
+import LTTL.SegmenterThread as Segmenter
+from LTTL.Segmenter import tokenize
+from LTTL.Segmentation import Segmentation
+from LTTL.Input import Input
+from Orange.widgets import gui, settings
+from Orange.widgets.utils.widgetpreview import WidgetPreview
+from Orange.widgets.settings import Setting
+from PyQt5.QtWidgets import QMessageBox
+
+class SciHubator(OWTextableBaseWidget):
+    """
+    Orange widget for importing and segmenting text from DOIs using Sci-Hub.
+
+    Attributes:
+        URLLabel (list): List of labels for the DOIs.
+        selectedURLLabel (list): List of selected labels from the URL list.
+        newDOI (str): DOI entered by the user for addition.
+        extractedText (str): Text extracted from the downloaded PDF.
+        DOI (str): Single DOI value.
+        DOIs (list): List of DOIs added by the user.
+        createdInputs (list): List of created LTTL.Inputs.
+    """
+
+    # Minimal version
+
+    # ----------------------------------------------------------------------
+    # Widget's metadata...
+
+    name = "Sci-Hubator"
+    description = "Export a text segmentation from a DOI or URL"
+    icon = "icons/scihubator.svg"
+    priority = 10
+
+    # ----------------------------------------------------------------------
+    # Channel definitions (NB: no input in this case)...
+
+    outputs = [('Segmentation', Segmentation)]
+
+    # ----------------------------------------------------------------------
+    # GUI layout parameters...
+
+    want_main_area = False
+    resizing_enabled = True
+
+    # ----------------------------------------------------------------------
+    # Settings declaration and initializations (default values)...
+ + DOIs = Setting([]) + encoding = Setting('(auto-detect)') + autoNumber = Setting(False) + autoNumberKey = Setting('num') + autoSend = settings.Setting(False) + importDOIs = Setting(True) + importDOIsKey = Setting('url') + lastLocation = Setting('.') + DOI = Setting('') + + # The settings below were not copied from the original widget; they were designed specifically for SciHubator + importAllorBib = Setting(0) + + def __init__(self): + """ + Initializes the SciHubator widget, including the GUI components and settings + """ + super().__init__() + self.URLLabel = self.DOIs[:] + print(self.URLLabel) + self.selectedURLLabel = [] + self.newDOI = '' + self.extractedText = '' + self.DOI = '' + self.createdInputs = [] + + self.infoBox = InfoBox(widget=self.controlArea) + self.sendButton = SendButton( + widget=self.controlArea, + master=self, + callback=self.sendData, + cancelCallback=self.cancel_manually, + infoBoxAttribute="infoBox", + ) + # ---------------------------------------------------------------------- + # User interface... + + # ADVANCED GUI... + + # URL box + URLBox = gui.widgetBox( + widget=self.controlArea, + box='Sources', + orientation='vertical', + addSpace=False, + ) + URLBoxLine1 = gui.widgetBox( + widget=URLBox, + box=False, + orientation='horizontal', + addSpace=True, + ) + self.fileListbox = gui.listBox( + widget=URLBoxLine1, + master=self, + value='selectedURLLabel', + labels='URLLabel', + callback=self.updateURLBoxButtons, + tooltip=( + "The list of DOIs whose content will be imported.\n" + "\nIn the output segmentation, the content of each\n" + "DOI appears in the same position as in the list.\n" + ), + ) + URLBoxCol2 = gui.widgetBox( + widget=URLBoxLine1, + orientation='vertical', + ) + self.removeButton = gui.button( + widget=URLBoxCol2, + master=self, + label='Remove', + callback=self.remove, + tooltip=( + "Remove the selected DOI from the list." 
+ ), + disabled = True, + ) + self.clearAllButton = gui.button( + widget=URLBoxCol2, + master=self, + label='Clear All', + callback=self.clearAll, + tooltip=( + "Remove all DOIs from the list." + ), + disabled = True, + ) + URLBoxLine2 = gui.widgetBox( + widget=URLBox, + box=False, + orientation='vertical', + ) + # Add URL box + addURLBox = gui.widgetBox( + widget=URLBoxLine2, + box=True, + orientation='vertical', + addSpace=False, + ) + gui.lineEdit( + widget=addURLBox, + master=self, + value='newDOI', + orientation='horizontal', + label='DOI(s):', + labelWidth=101, + callback=self.updateURLBoxButtons, + tooltip=( + "The DOI(s) that will be added to the list when\n" + "button 'Add' is clicked.\n\n" + "Successive DOIs must be separated with ' , '. \n" + "Their order in the list\n" + " will be the same as in this field." + ), + ) + advOptionsBox = gui.widgetBox( + widget=self.controlArea, + box='Options', + orientation='vertical', + addSpace=False, + ) + gui.separator(widget=advOptionsBox, height=3) + gui.radioButtonsInBox( + widget=advOptionsBox, + master=self, + value='importAllorBib', + btnLabels=['All in one Segment', 'Bibliography'], + label='Choose what to import', + callback=self.sendButton.settingsChanged, + tooltips=[ + "Import all article's content in one segment", "Import only bibliography (if found)" + ] + ) + gui.separator(widget=addURLBox, height=3) + self.addButton = gui.button( + widget=addURLBox, + master=self, + label='Add', + callback=self.add, + tooltip=( + "Add the DOI(s) currently displayed in the 'DOI'\n" + "text field to the list." + ), + disabled = True, + ) + gui.rubber(self.controlArea) + self.URLLabel = self.URLLabel + self.updateURLBoxButtons() + self.sendButton.draw() + self.infoBox.draw() + self.sendButton.sendIf() + + def sendData(self): + """ + Trigger the data processing workflow from user-provided DOIs. + + This method: + - Validates the presence of at least one DOI. + - Displays a warning if no DOI is provided. 
+ - Clears any previously created inputs. + - Updates the UI to indicate the start of preprocessing. + - Launches the processing asynchronously using a background thread + """ + # Verify DOIs + if not self.DOIs: + self.infoBox.setText("Please enter one or many valid DOIs.", "warning") + self.send("Segmentation", None) + return + + self.clearCreatedInputs() + + # Notify processing in infobox. Typically, there should + # always be a "processing" step, with optional "pre- + # processing" and "post-processing" steps before and + # after it. If there are no optional steps, notify + # "Preprocessing...". + self.infoBox.setText("Step 1/3: Pre-processing...", "warning") + + # Progress bar should be initialized at this point. + self.progressBarInit() + + # Create a threaded function to do the actual processing + # and specify its arguments (here there are none). + threaded_function = partial( + self.processData, + # argument1, + # argument2, + # ... + ) + + # Run the threaded function... + self.threading(threaded_function) + + def processData(self): + """ + Download and process academic articles from DOIs using Sci-Hub. + + This method handles the full pipeline for downloading PDFs via Sci-Hub, + extracting their text content, and converting them into LTTL-compatible + input segmentations. + + Steps: + 1. Verifies Sci-Hub accessibility. + 2. Downloads PDFs for each DOI. + 3. Extracts text from each PDF using pdfplumber. + 4. Wraps extracted text into LTTL.Inputs with DOI annotations. + 5. Concatenates inputs if multiple DOIs are processed. + + Returns : + Segmentation: A single or concatenated segmentation(s) ready for output. + + Raises: + Emits error messages and halts processing if: + - Sci-Hub is unreachable. + - A download fails. + - A PDF cannot be parsed. + """ + + # At start of processing, set progress bar to 1%. + # Within this method, this is done using the following + # instruction. 
+ self.signal_prog.emit(1, False) + + # DOIList.append(self.DOIContent) + + # Indicate the total number of iterations that the + # progress bar will go through (e.g. number of input + # segments, number of selected files, etc.), then + # set current iteration to 1. + max_itr = len(self.DOIs) + cur_itr = 1 + + # Check that Sci-Hub is reachable + if not test_scihub_accessible(): + self.sendNoneToOutputs() + self.infoBox.setText("SciHub inaccessible - verify your connection", 'error') + return + # Actual processing... + + # For each progress bar iteration... + tempdir = tempfile.TemporaryDirectory() + for DOI in self.DOIs: + + # Update progress bar manually... + self.signal_prog.emit(int(100 * cur_itr / max_itr), False) + cur_itr += 1 + + # code added here + paper = DOI + paper_type = "doi" + out = f"{tempdir.name}/{self.DOIs.index(DOI)}" + try: + scihub_download(paper, paper_type=paper_type, out=out) + except Exception as ex: + print(ex) + self.sendNoneToOutputs() + self.infoBox.setText("An error occurred when downloading", 'error') + return + # Cancel operation if requested by user... + time.sleep(0.00001) # Needed somehow! + if self.cancel_operation: + self.signal_prog.emit(100, False) + return + + # Update infobox and reset progress bar... + self.signal_text.emit("Step 2/3: Processing...", + "warning") + cur_itr = 0 + cur_itr_p3 = 0 + self.signal_prog.emit(0, True) + empty_re = False + for DOI in self.DOIs: + DOIText = "" + if os.path.exists(f"{tempdir.name}/{self.DOIs.index(DOI)}.pdf"): + try: + with pdfplumber.open(f"{tempdir.name}/{self.DOIs.index(DOI)}.pdf") as pdf: + for page in pdf.pages: + self.signal_prog.emit(int(100 * cur_itr / max_itr), False) + cur_itr += (1 / len(pdf.pages)) + DOIText += page.extract_text() + except Exception as e: + self.sendNoneToOutputs() + self.infoBox.setText(f"Error occurred when reading PDF: {str(e)}", 'error') + return + else: + self.sendNoneToOutputs() + self.infoBox.setText("Download failed. 
Please verify the DOI or your connection", 'error') + return + ######## + + # Create an LTTL.Input... + if len(self.DOIs) == 1: + # self.captionTitle is the name of the widget, + # which will become the label of the output + # segmentation. + label = self.captionTitle + else: + label = None # will be set later. + + myInput = Input(DOIText, label) + + self.signal_text.emit("Step 3/3: Post-processing...", + "warning") + max_itr = 2*len(self.DOIs) #+ int(self.importText) + if self.importAllorBib == 0: + cur_itr_p3 += 1 + # Extract the first (and single) segment in the + # newly created LTTL.Input and annotate it with + # the DOI it was downloaded from. + segment = myInput[0] + segment.annotations["DOI"] \ + = DOI + # For the annotation to be saved in the LTTL.Input, + # the extracted and annotated segment must be re-assigned + # to the first (and only) segment of the LTTL.Input. + myInput[0] = segment + # Add the LTTL.Input to self.createdInputs. + self.createdInputs.append(myInput) + if self.importAllorBib == 1: + cur_itr_p3 += 1 + ma_regex = re.compile(r'(?<=\n)\n?(([Bb]iblio|[Rr][eé]f)\w*\W*\n)(.|\n)*') + regexes = [(ma_regex, 'tokenize')] + self.signal_prog.emit(int(100 * cur_itr_p3 / max_itr), False) + new_segmentation = tokenize(myInput, regexes) + if len(new_segmentation) == 0: + empty_re = True + new_input = Input( + f"Empty search Bib for DOI: {DOI}", "Empty Bibliography section" + ) + else: + new_input = Input(new_segmentation.to_string(), "Bibliographies") + segment = new_input[0] + segment.annotations["part"] = "Bibliography" + segment.annotations["DOI"] = DOI + new_input[0] = segment + self.createdInputs.append(new_input) + + # Cancel operation if requested by user... + time.sleep(0.00001) # Needed somehow! + if self.cancel_operation: + self.signal_prog.emit(100, False) + return + tempdir.cleanup() + + + # If there's only one LTTL.Input created, it is the + # widget's output... 
+ if empty_re: + QMessageBox.warning( + None, "SciHubator", "Not all sections were segmented", + QMessageBox.Ok + ) + if len(self.DOIs) == 1: + return self.createdInputs[0] + # Otherwise the widget's output is a concatenation... + return Segmenter.concatenate( + caller=self, + segmentations=self.createdInputs, + label=self.captionTitle, + import_labels_as=None, + ) + + @OWTextableBaseWidget.task_decorator + def task_finished(self, f): + """ + Handle the output after asynchronous DOI processing is complete. + + This method : + - Retrieves the result of the processing task. + - Calculates the number of segments and total characters. + - Displays an informative message to the user. + - Sends the processed data to the output. + + Args : + f (Future): A Future object containing the result from `processData`. + + """ + + # Get the result value of self.processData. + processed_data = f.result() + + # If it is not None... + if processed_data: + message = f"{len(processed_data)} segment@p sent to output " + message = pluralize(message, len(processed_data)) + self.infoBox.setText(message) + self.send("Segmentation", processed_data) + + # The following method should be copied verbatim in + # every Textable widget. + def setCaption(self, title): + """ + Set or update the widget's caption title. + + If the caption has changed, it triggers cancellation of ongoing tasks + and marks the settings as changed to prompt UI updates. + + Args : + title (str): The new caption/title to be displayed on the widget. + """ + if 'captionTitle' in dir(self): + changed = title != self.captionTitle + super().setCaption(title) + if changed: + self.cancel() # Cancel current operation + self.sendButton.settingsChanged() + else: + super().setCaption(title) + + def clearAll(self): + """ + Clear all stored DOIs and reset related UI elements. + + This method empties the DOI list and selection, + disables the 'Clear All' button, + and updates the interface state. 
+ """ + del self.DOIs[:] + del self.selectedURLLabel[:] + self.sendButton.settingsChanged() + self.URLLabel = self.DOIs + self.clearAllButton.setDisabled(True) + self.removeButton.setDisabled(True) + + def remove(self): + """ + Remove the selected DOI from the list. + + Removes the DOI corresponding to the currently selected index in the GUI, + updates the list of DOIs and labels, and disables the clear button if the + list is empty. + """ + if self.selectedURLLabel: + index = self.selectedURLLabel[0] + self.DOIs.pop(index) + del self.selectedURLLabel[:] + self.sendButton.settingsChanged() + self.URLLabel = self.URLLabel + self.clearAllButton.setDisabled(not bool(self.URLLabel)) + + def add(self): + """ + Add new DOI(s) from the input field to the list. + + Parses the input string for comma-separated DOIs, adds them to the internal list, + removes duplicates if any, updates the display labels, and enables relevant UI buttons. + Shows a message box if duplicates are found and removed. + """ + DOIList = [x.strip() for x in self.newDOI.strip().split(',')] + + for DOI in DOIList: + self.DOIs.append(DOI) + if self.DOIs: + tempSet = set(self.DOIs) + if len(tempSet) + + + + + + + + + + + + + + + + + + diff --git a/specs/Sci-Hubator.rst b/specs/Sci-Hubator.rst new file mode 100644 index 00000000..3228770f --- /dev/null +++ b/specs/Sci-Hubator.rst @@ -0,0 +1,117 @@ +################################ +SCI-HUbator widget specification +################################ + +1 Introduction +************** + +1.1 Project goal +================= +Create a widget for Orange Textable (v3.2.2) that imports and extracts corpora taken from `Sci-HUB `_. + +1.2 Timeline overview +===================== +* First version of the specification: 13.03.2025 +* Specification handed in: 20.03.2025 +* Alpha version of the project: 17.04.2025 +* Final version of the project: 22.05.2025 + +1.3 Team and responsibilities +============================== + +* Luginbühl Colin 
(`colin.luginbuhl@unil.ch`_): + +.. _colin.luginbuhl@unil.ch: mailto:colin.luginbuhl@unil.ch + + - Specification + - Data extraction + - Code + - Documentation + +* Borgeaud Matthias (`matthias.borgeaud@unil.ch`_): + +.. _matthias.borgeaud@unil.ch: mailto:matthias.borgeaud@unil.ch + + - Specification + - Code + - Documentation + - Spell checking + +* Peretti-Poix Sarah (`sarah.peretti-poix@unil.ch`_): + +.. _sarah.peretti-poix@unil.ch: mailto:sarah.peretti-poix@unil.ch + + - Specification + - GitHub + - Code + - Debugging + +* Chétioui Orsowen (`orsowen.chetioui@unil.ch`_): + +.. _orsowen.chetioui@unil.ch: mailto:orsowen.chetioui@unil.ch + + - Documentation + - Code + - Debugging + +2. Technical +************ + +2.1 Dependencies +================ +* Orange 3.38.1 +* Orange Textable 3.2.2 +* `scidownl `_ 1.0.2 +* `pdfplumber `_ 0.11.6 (already present for SuperTextFiles) + + +2.2 Minimal features +============================= + +.. image:: images/scihubator_minimal.png + +* allow importing PDFs from SCI-HUB using a DOI and extracting the textual corpus. +* create and emit a segmentation with one segment (=Input) containing the entire text of the PDF. + +2.3 Main features +=============================== + +.. image:: images/scihubator_principal_specs.png + +* allow importing PDFs from SCI-HUB (from a DOI). +* allow extracting their text. +* allow building a selection of multiple corpora (add/remove/clear). +* create and emit a segmentation with one segment (=Input) + for each part of the imported corpus (summary/abstract, bibliography...). +* correct handling of references + +2.4 Optional features +================================ +* create and emit a segmentation by topic. +* create and emit a summary/abstract. +* create and emit a cross-reference table. +* import a JSON file containing several DOIs. + +2.5 Tests +========= + +TODO + +3. 
Steps +********* + +3.1 Alpha version +================= +* The graphical interface is fully built. +* The minimal features are supported by the software. + +3.2 Delivery and presentation +============================= +* The main features are fully supported by the software. +* The software documentation is complete. + + +4. Infrastructure +***************** +The project is available on GitHub at `https://github.com/sarahperettipoix/orange3-textable-prototypes +`_ diff --git a/specs/images/scihubator_minimal.png b/specs/images/scihubator_minimal.png new file mode 100644 index 00000000..ab8ceaf0 Binary files /dev/null and b/specs/images/scihubator_minimal.png differ diff --git a/specs/images/scihubator_principal.png b/specs/images/scihubator_principal.png new file mode 100644 index 00000000..f3b3e84e Binary files /dev/null and b/specs/images/scihubator_principal.png differ diff --git a/specs/images/scihubator_principal_specs.png b/specs/images/scihubator_principal_specs.png new file mode 100644 index 00000000..59d95c77 Binary files /dev/null and b/specs/images/scihubator_principal_specs.png differ
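Note on the bibliography extraction in `processData`: the "Bibliography" option depends entirely on the regular expression compiled there, so it may help to see what that pattern captures in isolation. The sketch below is a minimal illustration, not the widget's actual code path. It copies the pattern from the diff but applies it with plain `re.search` instead of LTTL's `tokenize`, which is an assumption about equivalent matching behaviour. The helper name `extract_bibliography` and the sample text are hypothetical.

```python
import re

# Pattern copied from processData in the diff above.
BIB_REGEX = re.compile(r'(?<=\n)\n?(([Bb]iblio|[Rr][eé]f)\w*\W*\n)(.|\n)*')


def extract_bibliography(text):
    """Return the text from the bibliography heading to the end, or None.

    Hypothetical helper: the widget itself goes through LTTL's tokenize.
    """
    match = BIB_REGEX.search(text)
    if match is None:
        return None
    # The match may start with the blank line preceding the heading.
    return match.group(0).strip()


# Made-up sample resembling text extracted from a PDF.
sample = (
    "Title\n\n"
    "Body paragraph.\n\n"
    "References\n"
    "[1] Doe, J. (2020).\n"
    "[2] Roe, R. (2021).\n"
)
print(extract_bibliography(sample))
print(extract_bibliography("No such heading here.\n"))
```

The pattern only fires when a line begins with "Biblio..." or "Ref.../Réf..." and contains nothing else but non-word characters, so a heading such as "Works cited" or a numbered "7. References" would presumably fall through to the "Empty Bibliography section" branch.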