Markus Jelsma committed Jan 19, 2024
2 parents d95e1a7 + 85fea6e commit 6b04554
Showing 342 changed files with 4,407 additions and 7,127 deletions.
65 changes: 51 additions & 14 deletions .github/workflows/master-build.yml
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
@@ -13,29 +12,67 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

name: master pr build
name: master pull request ci

on:
push:
branches: [ master ]
branches: [master]
pull_request:
branches: [ master ]
types: [opened, synchronize, reopened]
branches: [master]

jobs:
build:
runs-on: ubuntu-latest
javadoc:
strategy:
matrix:
java: [ '11' ]

java: ['11']
os: [ubuntu-latest]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- name: Set up JDK ${{ matrix.java }}
uses: actions/setup-java@v3
with:
java-version: ${{ matrix.java }}
distribution: 'temurin'
- name: Javadoc
run: ant clean javadoc -buildfile build.xml
rat:
strategy:
matrix:
java: ['11']
os: [ubuntu-latest]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- name: Set up JDK ${{ matrix.java }}
uses: actions/setup-java@v3
with:
java-version: ${{ matrix.java }}
distribution: 'temurin'
- name: Run Apache Rat
run: ant clean run-rat -buildfile build.xml
- name: Cache unknown licenses
run: echo "UNKNOWN_LICENSES=$(sed -n 18p /home/runner/work/nutch/nutch/build/apache-rat-report.txt)" >> $GITHUB_ENV
- name: Versions
run: |
echo $UNKNOWN_LICENSES
- name: Fail if any unknown licenses
if: ${{ env.UNKNOWN_LICENSES != '0 Unknown Licenses' }}
run: exit 1
test:
strategy:
matrix:
java: ['11']
os: [ubuntu-latest, macos-latest]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Set up JDK ${{ matrix.java }}
uses: actions/setup-java@v1
uses: actions/setup-java@v3
with:
java-version: ${{ matrix.java }}
- name: Build with Ant
run: ant clean nightly javadoc -buildfile build.xml
distribution: 'temurin'
- name: Test
run: ant clean test -buildfile build.xml
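Review note on the `rat` job above: the license gate reads line 18 of the Rat report and fails unless it equals `0 Unknown Licenses`. The same check can be reproduced locally with a minimal sketch like the one below. The helper name `check_unknown_licenses` is hypothetical and not part of this commit. It assumes, as the workflow does, that line 18 of the report carries the unknown-license summary.

```shell
#!/usr/bin/env bash
# Sketch only: check_unknown_licenses is a hypothetical helper, not part of the commit.
# It mirrors the workflow's assumption that line 18 of the Rat report
# holds the unknown-license summary.
check_unknown_licenses() {
  local report="$1"
  local summary
  # Same extraction as the "Cache unknown licenses" step.
  summary="$(sed -n 18p "$report")"
  # Same as the "Versions" step: print what was found.
  echo "$summary"
  # Succeed only on the exact string the workflow compares against.
  [ "$summary" = "0 Unknown Licenses" ]
}
```

From the repository root, after `ant clean run-rat -buildfile build.xml`, this could be called as `check_unknown_licenses build/apache-rat-report.txt`. Reading a fixed line number is fragile if the report header ever changes, so the printed summary is worth checking when the gate fails unexpectedly.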
2 changes: 2 additions & 0 deletions .gitignore
@@ -27,3 +27,5 @@ naivebayes-model
csvindexwriter
lib/spotbugs-*
ivy/dependency-check-ant/*
.gradle*
ivy/apache-rat-*
8 changes: 1 addition & 7 deletions LICENSE-binary
@@ -208,7 +208,6 @@ This product bundles some components that are also licensed under
the Apache License Version 2.0:


ch.qos.reload4j:reload4j
com.101tec:zkclient
com.amazonaws:aws-java-sdk-cloudsearch
com.amazonaws:aws-java-sdk-core
@@ -327,11 +326,6 @@ net.sourceforge.owlapi:owlapi-impl
net.sourceforge.owlapi:owlapi-parsers
net.sourceforge.owlapi:owlapi-rio
net.sourceforge.owlapi:owlapi-tools
org.apache.any23:apache-any23-api
org.apache.any23:apache-any23-core
org.apache.any23:apache-any23-csvutils
org.apache.any23:apache-any23-encoding
org.apache.any23:apache-any23-mime
org.apache.avro:avro
org.apache.commons:commons-collections4
org.apache.commons:commons-compress
@@ -505,6 +499,7 @@ org.jetbrains.kotlin:kotlin-stdlib-jdk8
org.lz4:lz4-java
org.mapdb:mapdb
org.netpreserve.commons:webarchive-commons
org.opensearch.client:opensearch-rest-high-level-client
org.seleniumhq.selenium:htmlunit-driver
org.seleniumhq.selenium:selenium-api
org.seleniumhq.selenium:selenium-chrome-driver
@@ -757,7 +752,6 @@ org.jsoup:jsoup
org.rypt:f8
org.slf4j:jcl-over-slf4j
org.slf4j:slf4j-api
org.slf4j:slf4j-reload4j


Mozilla Public License 1.1 (MPL 1.1)
17 changes: 5 additions & 12 deletions NOTICE-binary
@@ -29,7 +29,7 @@ code and source code.

The following provides more details on the included cryptographic software:

The plugins parse-tika and any23 use Apache Tika and the Bouncy Castle
The parse-tika plugin uses Apache Tika and the Bouncy Castle
generic encryption libraries for extracting text content and metadata
from encrypted PDF files. See <https://www.bouncycastle.org/> for more
details on Bouncy Castle and <https://tika.apache.org/> for details
@@ -46,9 +46,6 @@ on Apache Tika.
Apache projects
---------------

Apache Any23 (https://any23.apache.org/)
see https://github.com/apache/any23/blob/master/NOTICE.txt

Apache Avro (https://avro.apache.org)
see https://github.com/apache/avro/blob/master/NOTICE.txt

@@ -163,10 +160,6 @@ AOP alliance (http://aopalliance.sourceforge.net)
- license: Public Domain
(licenses-binary/LICENSE-public-domain.txt)

# ch.qos.reload4j:reload4j
reload4j (https://reload4j.qos.ch)
- license: The Apache Software License, Version 2.0

# com.101tec:zkclient
ZkClient (https://github.com/sgroschupf/zkclient)
- license: The Apache Software License, Version 2.0
@@ -1021,6 +1014,10 @@ mapdb (http://www.mapdb.org)
webarchive-commons (https://github.com/iipc/webarchive-commons)
- license: The Apache Software License, Version 2.0

# org.opensearch.client:opensearch-rest-high-level-client
opensearch-rest-high-level-client (https://opensearch.org/)
- license: The Apache Software License, Version 2.0

# org.ow2.asm:asm
asm (http://asm.ow2.io/)
- license: BSD-3-Clause
@@ -1096,10 +1093,6 @@ JCL 1.2 implemented over SLF4J (http://www.slf4j.org)
(licenses-binary/LICENSE-mit-license.txt)
# org.slf4j:slf4j-api
SLF4J API Module (http://www.slf4j.org)
- license: MIT License
(licenses-binary/LICENSE-mit-license.txt)
# org.slf4j:slf4j-reload4j
SLF4J Reload4j Binding (http://reload4j.qos.ch)
- license: MIT License
(licenses-binary/LICENSE-mit-license.txt)

2 changes: 1 addition & 1 deletion NOTICE.txt
@@ -29,7 +29,7 @@ code and source code.

The following provides more details on the included cryptographic software:

The plugins parse-tika and any23 use Apache Tika and the Bouncy Castle
The parse-tika plugin uses Apache Tika and the Bouncy Castle
generic encryption libraries for extracting text content and metadata
from encrypted PDF files. See <https://www.bouncycastle.org/> for more
details on Bouncy Castle and <https://tika.apache.org/> for details
44 changes: 39 additions & 5 deletions README.md
@@ -40,6 +40,8 @@ To contribute a patch, follow these instructions (note that installing
IDE setup
=========

### Eclipse

Generate Eclipse project files

```
@@ -48,13 +50,45 @@ ant eclipse

and follow the instructions in [Importing existing projects](https://help.eclipse.org/2019-06/topic/org.eclipse.platform.doc.user/tasks/tasks-importproject.htm).

For Intellij IDEA, first install the [IvyIDEA Plugin](https://plugins.jetbrains.com/plugin/3612-ivyidea). then run ```ant eclipse```.
You must [configure the nutch-site.xml](https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse) before running. Make sure, you've added ```http.agent.name``` and ```plugin.folders``` properties. The plugin.folders normally points to ```<project_root>/build/plugins```.

Now create a Java Application Configuration, choose org.apache.nutch.crawl.Injector, add two paths as arguments. First one is the crawldb directory, second one is the URL directory where, the injector can read urls. Now run your configuration.

Then open the project in IntelliJ. You may see popups like "Ant build scripts found", "Frameworks detected - IvyIDEA Framework detected". Just follow the simple steps in these dialogs.
If we still see the ```No plugins found on paths of property plugin.folders="plugins"```, update the plugin.folders in the nutch-default.xml, this is a quick fix, but should not be used.

You must [configure the nutch-site.xml](https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse) before running. Make sure, you've added ```http.agent.name``` and ```plugin.folders``` properties. The plugin.folders normally points to ```<project_root>/build/plugins```.

Now create a Java Application Configuration, choose org.apache.nutch.crawl.Injector, add two paths as arguments. First one is the crawldb directory, second one is the URL directory where, the injector can read urls. Now run your configuration.
### Intellij IDEA

If we still see the ```No plugins found on paths of property plugin.folders="plugins"```, update the plugin.folders in the nutch-default.xml, this is a quick fix, but should not be used.
First install the [IvyIDEA Plugin](https://plugins.jetbrains.com/plugin/3612-ivyidea), then run ```ant eclipse```. This will create the necessary
.classpath and .project files so that Intellij can import the project in the next step.

In Intellij IDEA, select File > New > Project from Existing Sources. Select the nutch home directory and click "Open".

On the "Import Project" screen select the "Import project from external model" radio button and select "Eclipse".
Click "Create". On the next screen the "Eclipse projects directory" should be already set to the nutch folder.
Leave the "Create module files near .classpath files" radio button selected.
Click "Next" on the next screens. On the project SDK screen select Java 11 and click "Create".

Once the project is imported, you will see a popup saying "Ant build scripts found", "Frameworks detected - IvyIDEA Framework detected". Click "Import".
If you don't get the pop-up, I'd suggest going through the steps again as this happens from time to time. There is another
Ant popup that asks you to configure the project. Do NOT click "Configure".

To import the code style, go to Intellij IDEA > Preferences > Editor > Code Style > Java.

For the Scheme dropdown select "Project". Click the gear icon and select "Import Scheme" > "Eclipse XML file".

Select the eclipse-format.xml file and click "Open". On the next screen check the "Current Scheme" checkbox and hit OK.

### Running in Intellij IDEA

Running in Intellij

- Open Run/Debug Configurations
- Select "+" to create a new configuration and select "Application"
- For "Main Class" enter a class with a main function (e.g. org.apache.nutch.indexer.IndexingJob).
- For "Program Arguments" add the arguments needed for the class. You can get these by running the crawl executable for your job. Use fully qualified paths. (e.g. /Users/kamil/workspace/external/nutch/crawl/crawldb /Users/kamil/workspace/external/nutch/crawl/segments/20221222160141 -deleteGone)
- For "Working Directory" enter "/Users/kamil/workspace/external/nutch/runtime/local".
- Select "Modify options" > "Modify Classpath" and add the config directory belonging to the "Working Directory" from the previous step (e.g. /Users/kamil/workspace/external/nutch/runtime/local/conf). This will allow the resource loader to load that configuration.
- Select "Modify options" > "Add VM Options". Add the VM options needed. You can get these by running the crawl executable for your job (e.g. -Xmx4096m -Dhadoop.log.dir=/Users/kamil/workspace/external/nutch/runtime/local/logs -Dhadoop.log.file=hadoop.log -Dmapreduce.job.reduces=2 -Dmapreduce.reduce.speculative=false -Dmapreduce.map.speculative=false -Dmapreduce.map.output.compress=true)

**Note**: You will need to manually trigger a build through Ant to pick up the latest changes when running. This is because the Ant build system is separate from the Intellij one.
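Review note on the manual Ant build mentioned above: a small wrapper can make the rebuild a one-liner before each IDE run. The sketch below is only an illustration. The function name `rebuild_nutch` and the `ANT` override variable are hypothetical. It also assumes that the default target of `build.xml` refreshes the `runtime/local` tree used as the working directory, which should be verified against the build file.

```shell
#!/usr/bin/env bash
# Sketch only: rebuild_nutch and the ANT override are hypothetical names.
# Assumption: the default target of build.xml rebuilds runtime/local.
rebuild_nutch() {
  local nutch_home="$1"
  # Allow a different Ant launcher (or a stub) to be supplied via ANT.
  local ant_cmd="${ANT:-ant}"
  # Run in a subshell so the caller's working directory is unchanged.
  ( cd "$nutch_home" && "$ant_cmd" -buildfile build.xml )
}
```

For example, `rebuild_nutch /path/to/nutch` would be run from a terminal before starting the run configuration, where `/path/to/nutch` stands for the local checkout.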