What options do I have if I want to index a huge, continuously changing code base? #3963
Replies: 7 comments
-
Ideally this should be fixed in OpenGrok itself. This is tracked by #851. I will keep this open as a question to see whether there are any workarounds we can use. Also see #517.
-
Thank you. I had seen #851, but the behavior described there could be considered intended as it is, so I thought it best to open a new question about my specific scenario.
-
A workaround is to sync into a separate dataset and flip over once the index and the source match. On Windows this is harder, of course, but doable, e.g. by renaming directories, flipping configs around, and sending the configuration from the indexer to a dummy address. On Linux, btrfs might help; on Solaris, ZFS is the obvious fix. I use ZFS in my test setup, and this was originally done for the public Solaris source browser. The docs describe how to have two DATA_DIRs, and you can do the same for SOURCE_DIRS. I understand 1 TB is huge (that is why ZFS might have eased your disk space pain by providing dataset cloning), but space is probably only a problem on Windows, so I'd consider switching to a real (server) OS ;)
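A minimal sketch of the flip-over idea using plain directory renames, which is the variant mentioned for Windows. The function name `flip_over` and the `.previous` suffix are hypothetical and not part of OpenGrok; the sketch assumes the staging tree has been fully synced and indexed before it is promoted.

```python
import shutil
from pathlib import Path


def flip_over(live: Path, staging: Path) -> Path:
    """Promote a fully built staging tree to the live location.

    Returns the path where the previous live tree was parked.
    Hypothetical helper, not an OpenGrok API.
    """
    previous = live.with_name(live.name + ".previous")
    if previous.exists():
        shutil.rmtree(previous)   # drop the generation before last
    if live.exists():
        live.rename(previous)     # park the tree that is currently served
    staging.rename(live)          # promote the freshly built tree
    return previous
```

Between the two renames the live path briefly does not exist, and on Windows a rename can fail while files in the tree are open, so a real deployment would likely pause the web application or repoint its configuration instead. Filesystem snapshots or clones (ZFS, btrfs) avoid keeping two full copies on disk.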
-
Thank you for your suggestion. Unfortunately, I don't see how that would work. Indexing takes quite a while and the code base is continuously updated. As I stated, I don't mind some files being out of sync, but it is obviously no good if I get an error. I can understand if OpenGrok does not aim to support such scenarios and restricts itself to stable or relatively small code bases whose index can be updated quickly. If you do plan to support such a feature, my suggestion would be to have the search results include the file with no link and mark it visually as deleted. A faster option is to simply skip it (maybe with a notice that some files were deleted and are not included). Even skipping with no notice would make OpenGrok totally usable.
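The "skip or mark as deleted" suggestion can be sketched as a filter over search hits. This is an illustration only, not how OpenGrok renders results; `split_hits` and its arguments are hypothetical names, and hits are assumed to be paths relative to the source root.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def split_hits(source_root: Path, hit_paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate index hits into files still on disk and files that have vanished.

    Hypothetical helper: the caller can link the first list normally and
    either hide the second list or show it unlinked and marked as deleted.
    """
    present: List[str] = []
    missing: List[str] = []
    for rel in hit_paths:
        if (source_root / rel).is_file():
            present.append(rel)
        else:
            missing.append(rel)
    return present, missing
```

Checking for existence up front, rather than letting a failed file open propagate, is what keeps one stale hit from breaking the whole results page.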
-
Well, we do support even big deployments, and I agree it's a bug if search breaks because of a missing file. (I have already picked up the original bug and will try to fix it.)
-
What I've done for our internal deployment is partial indexing of projects as soon as they are updated. The motivation was a bit different from rapidly changing repositories: in our case it was the mirroring and indexing process. Before the change, a mirror script was kicked off that mirrored all the projects in parallel in the first phase of the run. Most of the mirrors finish quickly, but there are a couple of repositories with significant latency for fetching individual files, where the SCM has to check the repository file by file (think Teamware and NFS across the big pond). After all the mirrors were done, a second phase kicked off the indexer to index all the projects. Obviously, this created quite a big time window during which the index was not consistent with what was on disk.
The solution is based on the changes done in #876 (you can find actual examples of the commands there). Basically, after each individual mirror finishes, the name of the project is added to a queue. A script picks it up from there, processes the requests sequentially, and performs a partial index just for the given project. This is done so as not to overwhelm the machine with hundreds of indexers running in parallel (in the next version the script will spawn a bunch of partial indexers up to a certain limit). This means that while mirroring is still in progress, the projects that have finished mirroring can already be indexed.
To discover new projects and remove deleted ones, the indexer is run over all the projects at the end of the mirroring phase (now the mirroring and partial indexing phase), just like before.
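The queue-based flow described above can be sketched as follows. This is not the script from #876; `mirror`, `index_project`, and `index_all` are hypothetical callables standing in for the real mirroring and indexer commands, and the pool size of 4 is an arbitrary assumption.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

_STOP = object()  # sentinel telling the worker that no more requests will arrive


def mirror_and_index(
    projects: Iterable[str],
    mirror: Callable[[str], None],         # hypothetical: sync one project
    index_project: Callable[[str], None],  # hypothetical: partial index of one project
    index_all: Callable[[], None],         # hypothetical: full pass over all projects
) -> None:
    requests: queue.Queue = queue.Queue()

    def worker() -> None:
        # One partial index at a time, so the machine is never flooded with indexers.
        while True:
            item = requests.get()
            if item is _STOP:
                return
            index_project(item)

    indexer = threading.Thread(target=worker)
    indexer.start()

    def mirror_then_enqueue(name: str) -> None:
        mirror(name)
        requests.put(name)  # ready for indexing while other mirrors still run

    # Mirrors run in parallel; slow repositories no longer delay the fast ones.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(mirror_then_enqueue, projects))

    requests.put(_STOP)
    indexer.join()
    index_all()  # final pass to discover new projects and drop deleted ones
```

A single consumer thread keeps indexing sequential; raising the number of consumers would correspond to the bounded pool of partial indexers mentioned for the next version.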
-
Thinking about this again, maybe we should take a more radical approach, closer to what I believe most Lucene-based search engines do: continuously add small pieces of data and index them immediately. For SCMs that operate on changesets, this would mean OpenGrok watches the repositories for changes and pulls and indexes one changeset at a time. OpenGrok would then have to pull the updates itself.
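To make the changeset-at-a-time idea concrete, here is a toy in-memory inverted index that is updated per changeset instead of being rebuilt. It does not use Lucene and is not a proposal for OpenGrok's design; `Changeset` and `IncrementalIndex` are hypothetical names, and whitespace tokenization is a deliberate simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Changeset:
    """Hypothetical changeset: new content per changed path, plus deleted paths."""
    changed: Dict[str, str] = field(default_factory=dict)
    deleted: Set[str] = field(default_factory=set)


class IncrementalIndex:
    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = {}        # token -> paths containing it
        self.tokens_by_path: Dict[str, Set[str]] = {}  # path -> tokens, for cheap removal

    def _remove(self, path: str) -> None:
        for token in self.tokens_by_path.pop(path, set()):
            paths = self.postings[token]
            paths.discard(path)
            if not paths:
                del self.postings[token]

    def apply(self, cs: Changeset) -> None:
        """Index one changeset; only the files it touches are processed."""
        for path in cs.deleted:
            self._remove(path)
        for path, content in cs.changed.items():
            self._remove(path)  # replace any stale entry for this file
            tokens = set(content.split())
            self.tokens_by_path[path] = tokens
            for token in tokens:
                self.postings.setdefault(token, set()).add(path)

    def search(self, token: str) -> List[str]:
        return sorted(self.postings.get(token, set()))
```

Because deletions arrive in the same changeset as the edits, the index never points at a file that the indexed revision no longer contains, which is the inconsistency that causes the missing-file errors in the first place.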
-
I have close to 1 TB of data that I want to index, covering several different codebases. Each one is modified multiple times a day. I don't mind being out of sync, since the index is regenerated each night, but missing files throw FileNotFoundException on the search page, making certain codebases unsearchable.
What options do I have to work around this? What I can think of is:
Are there any other suggestions I could try?
If relevant, I am using OpenGrok on a Windows machine.