What options I have if I want to index a huge, continuously changing code base? #3963

W-Sebastian · 2015-05-25T15:05:40Z

I have close to 1 TB of data to index, covering several different code bases. Each one is modified multiple times a day. I don't mind the index being out of sync, since it is regenerated each night, but missing files throw FileNotFoundException on the search page, which makes certain code bases unsearchable.

What options do I have to work around this? What I can think of is:

Have the code modified so it no longer throws those errors. I'd rather avoid this one.
Have a way to automatically update the index as soon as a file changes (with a watcher), thus keeping the directories perfectly in sync with the index.

Are there any other suggestions I could try?

If relevant, I am using OpenGrok on a Windows machine.

vladak (Maintainer) · 2015-05-25T15:10:27Z

Ideally this should be fixed in OpenGrok itself. This is tracked by #851. I will keep this open as a question to see if there are any workarounds we can do. Also see #517.

W-Sebastian (Author) · 2015-05-25T15:13:37Z

Thank you. I had noticed #851, but the behavior described there could be considered intended as it is, so I thought it best to open a new question about my specific scenario.

tarzanek · 2015-05-26T09:11:41Z

A workaround is to sync to a separate dataset and flip over once both index and source match. On Windows this is harder, of course (but doable, e.g. by renaming dirs, flipping configs around, and sending the config from the indexer to a dummy address). On Linux, btrfs might help; on Solaris, ZFS is the obvious fix (I am using it in my test setup, for example, and this was originally done for the public Solaris source browser). The docs describe how to have 2 DATA_DIRs, and you can do the same for SOURCE_DIRS.

I understand 1 TB is huge (that's why ZFS might have eased your disk space pain by providing dataset cloning), but this space is only a problem on Windows ... so I'd consider switching to a real (server) OS ;)
I will also try to have a look at #851.
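
The flip-over described above can be sketched roughly as follows. This is only an illustration in Python, not OpenGrok tooling: the helper name `flip_over` and the directory layout are hypothetical, and it assumes the indexer has already finished writing to the staging directory. On Windows a rename may fail while another process still holds files open inside the directory, which is part of why the approach is harder there.

```python
import os
import shutil


def flip_over(live_dir: str, staging_dir: str) -> None:
    """Make the freshly built staging_dir the new live_dir (hypothetical helper)."""
    retired_dir = live_dir + ".old"
    # Clear out anything left behind by a previous flip.
    shutil.rmtree(retired_dir, ignore_errors=True)
    # Move the current live data aside, if there is any.
    if os.path.isdir(live_dir):
        os.rename(live_dir, retired_dir)
    # The consistent copy becomes live.
    os.rename(staging_dir, live_dir)
    # Drop the retired copy once the swap has succeeded.
    shutil.rmtree(retired_dir, ignore_errors=True)
```

There is a short window between the two renames in which the live directory does not exist, so a real setup would more likely point the configuration at the new directory instead of renaming in place.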

W-Sebastian (Author) · 2015-05-26T15:58:45Z

Thank you for your suggestion.

Unfortunately, I don't see how that would work. Indexing takes quite a while and the code base is continuously updated. As I stated, I don't mind some files being out of sync, but obviously it is no good if I get an error.

I can understand if OpenGrok does not aim to support such scenarios and restricts itself to stable or relatively small code bases whose index can be updated quickly.

If you do plan to support such a feature, my suggestion would be to have the search results include the file with no link and mark it visually as deleted. A faster option is to simply skip it (maybe adding a notice that some files were deleted and are not included). Even just skipping, with no notice, would make OpenGrok totally usable.
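
To make the "skip it and maybe show a notice" suggestion concrete, here is a minimal sketch in Python. It is not OpenGrok code: the function `drop_missing` is hypothetical, and it assumes search hits are available as paths relative to the source root.

```python
import os
from typing import Iterable, List, Tuple


def drop_missing(hit_paths: Iterable[str], source_root: str) -> Tuple[List[str], int]:
    """Split search hits into those still on disk and a count of the rest."""
    present: List[str] = []
    missing = 0
    for rel_path in hit_paths:
        # Assumption: rel_path is relative to source_root.
        if os.path.isfile(os.path.join(source_root, rel_path)):
            present.append(rel_path)
        else:
            missing += 1  # file was deleted after the index was built
    return present, missing
```

The returned count could drive a notice such as "N files were deleted and are not included", or simply be ignored.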

tarzanek · 2015-05-26T17:17:01Z

Well, we do support even big deployments.
I think our internal mirror does a fresh update of its indexes every 4 hours!

And I agree it's a bug if we break our search because of a missing file (I have already picked up the original bug and will try to fix it).

vladak (Maintainer) · 2015-05-28T19:29:53Z

What I've done for our internal deployment was partial indexing of projects once they are updated. The reason this was done is a bit different from rapidly changing repositories: in our case it was the mirroring and indexing process. Before the change, a mirror script was kicked off which mirrored all the projects in parallel in the first phase of the run. Most of the mirrors finish soon; however, there are a couple of repositories which have significant latency for fetching individual files, and the SCM has to check the repo file by file (think Teamware and NFS across the big pond). After all the mirrors were done, there was a second phase in which the indexer was kicked off to index all the projects. Obviously, this created quite a big time window during which the index was not consistent with what was on disk.

The solution is based on changes done in #876 (you can find actual examples of the commands there). Basically, after each individual mirror finishes, the name of the project is added to a queue, from where it is picked up by a script which processes the requests sequentially and performs a partial index just for the given project. This is done in order not to overwhelm the machine with hundreds of indexers running in parallel (in the next version the script will spawn a bunch of partial indexers up to a certain limit). This means that while the mirroring is still in progress, those projects which have finished mirroring can be indexed. To discover new projects and remove deleted projects, an indexer of all the projects is run at the end of the mirroring phase (now the mirroring and partial indexing phase), just like before.
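
The queue-driven flow above might look something like this in Python. It is a sketch under assumptions, not the actual script from #876: `process_index_requests` and the `index_project` callback are hypothetical names, and the callback stands in for whatever command performs the partial index of one project.

```python
import queue
from typing import Callable, Optional


def process_index_requests(
    requests: "queue.Queue[Optional[str]]",
    index_project: Callable[[str], None],
) -> int:
    """Index queued projects one at a time until a None sentinel arrives."""
    processed = 0
    while True:
        project = requests.get()  # blocks until a mirror reports completion
        if project is None:       # sentinel: the mirroring phase is over
            return processed
        index_project(project)    # hypothetical: partial index of one project
        processed += 1
```

Each mirror job would put its project name on the queue when it finishes. Handling requests one at a time is what keeps the machine from running hundreds of indexers at once, and a full indexer run would still follow at the end to pick up new and deleted projects.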

vladak (Maintainer) · 2016-11-03T13:33:06Z

Thinking of this again, maybe we should take a more radical approach and be closer to what I believe most Lucene-based search engines do: continuously add small pieces of data and immediately index them. For SCMs which operate on changesets, this means OpenGrok would watch for changes in the repositories and pull+index one changeset at a time. This would mean that OpenGrok should be pulling the updates itself.
