Need way to mirror definitions #386
@jeffmcaffer Today, if the harvested data and the curations are out of scope and a definition cannot be found in the replicated definitions store, the crawler won't be notified about the missing definition and it won't be queued up. Regarding the syncing of definitions, the following options are possible, starting with the least amount of effort. I am also open to any other options and ideas.
Additionally, Azure Data Factory provides a blob compression capability. It looks like this capability only applies to each blob individually, so it won't be possible to compress all blobs, or an incremental set of blobs, into a single archive. I haven't tested this option, but that seems to be the case based on the official documentation.
@bduranc @iamwillbar @pombredanne Could you please review and comment here?
My preferred option would be to use
@pombredanne We can set up an Azure Data Factory pipeline to incrementally copy blobs to an Azure file share. The files would then be accessible via the Server Message Block (SMB) protocol, which works with rsync. Azure file shares are more expensive than Azure blob storage but still relatively cheap.
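For illustration only, here is a minimal sketch of what an incremental copy could look like if done directly with the Azure Storage SDK for Python instead of Data Factory. The container name, connection-string variable, and SMB mount path are assumptions, not part of the actual ClearlyDefined setup.

```python
import os
from datetime import datetime, timezone
from azure.storage.blob import ContainerClient

# Assumed names: the real container, connection string, and mount point will differ.
CONNECTION_STRING = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
CONTAINER = "definitions"            # hypothetical container holding definition blobs
MOUNTED_SHARE = "/mnt/cd-mirror"     # Azure file share mounted locally over SMB

def sync_since(last_sync: datetime) -> None:
    """Copy blobs modified after last_sync into the mounted file share."""
    container = ContainerClient.from_connection_string(CONNECTION_STRING, CONTAINER)
    for blob in container.list_blobs():
        if blob.last_modified <= last_sync:
            continue
        target = os.path.join(MOUNTED_SHARE, blob.name)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as out:
            out.write(container.download_blob(blob.name).readall())

if __name__ == "__main__":
    # First run: copy everything; later runs would persist and reuse the timestamp.
    sync_since(datetime(1970, 1, 1, tzinfo=timezone.utc))
```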
Hi Everyone,
I am a bit confused about this statement. I thought a definition consists of both harvested and curated data? Is this intended to mean that individual curations and harvest (tool) data would not be available in the mirror, but the computed definition itself (with its baseline harvest data and any applied curations) would still be?

@geneh: I think it may be a good idea to study a few more use-cases where folks may want to consume CD data and processes internally so we have a clear idea of which option would work best. I propose we create a Google Doc and call out to the community to obtain specific use-cases. I think this would also help us better understand which business needs would require the data to be replicated/refreshed "on-demand" vs. "periodically" (hourly, daily, etc.).

For one of my team's own potential use-cases, @jeffshi88 did provide a bit of detail in #650. What's proposed in #650 is, I think, a bit different, as it focuses more on mirroring part of the actual CD process in one's own internal environment while still being able to contribute back to the public. The main reasons for this would be priority of harvesting resources for one's own workloads and tighter integration with one's own internal systems. But I feel there is a general interest in having offline, read-only access to the data as well. I should be transparent and say my colleagues and I are still determining whether this is something our project requires and to what extent.

The existing REST APIs do appear to support most common operations of the CD process (harvesting, curation, definition retrieval, etc.) but might become problematic when working with very large volumes of data (in contrast to the "rsync" approach described here, which as I understand it gives people access to the actual definition blobs). Also, if an "rsync" approach does indeed "expose the user to internal details of ClearlyDefined", or otherwise provides more verbose data than would normally be available through the REST APIs, with proper documentation this could be beneficial.
@geneh sorry for the delay in responding. The proposal here is to mirror only definitions. That is all the locally running replica service would access. No harvested data, no curations, no recomputing definitions, ... It is a dumb, read-only replica of the clearlydefined.io service. The replica service both responds to a limited set of API calls and is responsible for mirroring the definitions "locally" (i.e., into the control of the party running the replica).

Given the volume of data, the locally replicated data should be locally persistent. If the replica service crashes or needs to be redeployed/updated, the data should not have to be re-mirrored.

It is highly likely that different replica hosts will want to persist their mirrored copy differently. Amazon folks, for example, will reasonably want to put the data in S3 or some such; whatever folks are using to support their engineering system. So we cannot assume that they have it, for example, on a local file system. Replica hosts should be able to plug in different definition storage providers. Unfortunately, this implies that the mirroring system needs to call the storage provider APIs internally and not be concerned about how the data is stored.

I'm not sure if/how rsync can help in that context, as IIRC it assumes a filesystem copy of the millions of definition files and then syncs those copies with the master. Assuming that the replica has its copy of the definitions available in an rsync-compatible filesystem is likely too restrictive.
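To make the "pluggable storage provider" idea concrete, here is a minimal sketch of what such an interface could look like, assuming a Python replica. The class and method names are hypothetical illustrations, not an existing ClearlyDefined API.

```python
import json
import os
from abc import ABC, abstractmethod
from typing import Optional


class DefinitionStore(ABC):
    """Hypothetical interface the mirroring loop would call; hosts plug in their own backend."""

    @abstractmethod
    def put(self, coordinates: str, definition: dict) -> None:
        """Persist one definition under its coordinates, e.g. 'npm/npmjs/-/lodash/4.17.21'."""

    @abstractmethod
    def get(self, coordinates: str) -> Optional[dict]:
        """Return the stored definition, or None if it has not been mirrored yet."""


class FileSystemStore(DefinitionStore):
    """One possible backend; an S3 or Azure Blob backend would implement the same two methods."""

    def __init__(self, root: str):
        self.root = root

    def _path(self, coordinates: str) -> str:
        return os.path.join(self.root, coordinates + ".json")

    def put(self, coordinates: str, definition: dict) -> None:
        path = self._path(coordinates)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(definition, f)

    def get(self, coordinates: str) -> Optional[dict]:
        path = self._path(coordinates)
        if not os.path.exists(path):
            return None
        with open(path) as f:
            return json.load(f)
```

The mirroring loop would only ever see the abstract interface, which is what allows S3, Azure Blob, or plain-filesystem backends to be swapped in by different replica hosts.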
@bduranc great feedback. For this issue we are looking to enable "at scale" users of ClearlyDefined to operate with confidence, independence, and privacy. These are concerns we've heard from several folks (including my own team at GitHub). clearlydefined.io has best-effort SLAs. While many of the folks running the service are responsible for their own company's engineering system, they are not responsible for yours 😄 Folks reasonably want better control over systems that can, in theory, break their builds and disrupt their engineering systems.

In that scenario, only the definitions need be mirrored. As you observed, the definitions are a synthesis of the harvested data and optional curations. Since that processing has already been done and the resultant definitions mirrored, the harvested data and curations need not be mirrored.

Your points about tool chain integration and curation contribution are great (thanks for keeping that in mind!). Given the range of tools and workflows, I suggest that we keep that separate from this discussion and see how it evolves. Make sense?
Thanks, @jeffmcaffer! In this case I think the following should be the best option, both for users concerned about their confidence, independence, and privacy and for the ClearlyDefined project, which is interested in harvesting as much data as possible:
Please let me know if it makes sense.
Step 2 should just be:
Everything else is lower priority (maybe do it in the next iteration) and should be behind a feature flag when implemented. Those who care about security/privacy will not want an automated call to api.clearlydefined.io based on their internal usage. Those who are mirroring for reliability reasons only could use this proxy/caching behavior, but we might be better served by working on a more robust copying tool and scheme (up for discussion).
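As a concrete illustration of the feature-flagged fallback being discussed, here is a minimal sketch of a read-through lookup that only calls api.clearlydefined.io when explicitly enabled. The store object and flag name are assumptions for illustration; `/definitions/{coordinates}` is the public ClearlyDefined REST API route.

```python
import requests

API_BASE = "https://api.clearlydefined.io"

def get_definition(store, coordinates: str, allow_upstream_fallback: bool = False):
    """Serve from the local mirror; optionally fall back to clearlydefined.io.

    `store` is any object with get/put methods (see the storage-provider sketch above);
    `allow_upstream_fallback` is the hypothetical feature flag from this discussion.
    """
    definition = store.get(coordinates)
    if definition is not None or not allow_upstream_fallback:
        return definition

    # Privacy note: this request reveals the requested coordinates to clearlydefined.io,
    # which is exactly why it should stay behind a flag that defaults to off.
    response = requests.get(f"{API_BASE}/definitions/{coordinates}", timeout=30)
    response.raise_for_status()
    definition = response.json()
    store.put(coordinates, definition)   # cache for subsequent lookups
    return definition
```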
Most of the mirroring scenarios I've seen were some combination of:
In light of these requirements, I think we need a solution that includes
It is that last point that has the most questions around it. Would love to talk about options there. On-demand fallback to clearlydefined.io is an interesting option to have but goes against several of the requirements.
I'd like to use CD to identify open source components in closed-source firmware by cross-referencing hashes of extracted files with those from CD. Making API requests isn't a realistic option because we have several million hashes and would rather avoid bottlenecking on web requests. The rsync-like option seems optimal for my use case, as I'll almost certainly need to transform the data again, and I have no problem dealing with CD internals where necessary. I know that AWS allows for requester-pays buckets; this seems like a good application of that, assuming Azure has a similar option.
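For illustration, here is a hedged sketch of the kind of offline cross-referencing described above, assuming the mirrored definition JSON exposes per-file hashes under `files[].hashes` and coordinates at the top level; the directory layout and index-building step are assumptions, not part of any existing tool.

```python
import hashlib
import json
from pathlib import Path

def build_sha1_index(mirror_root: str) -> dict:
    """Map file SHA-1 -> definition coordinates from locally mirrored definition JSON."""
    index = {}
    for path in Path(mirror_root).rglob("*.json"):
        definition = json.loads(path.read_text())
        coordinates = definition.get("coordinates", {})
        for entry in definition.get("files", []):
            sha1 = entry.get("hashes", {}).get("sha1")
            if sha1:
                index[sha1] = coordinates
    return index

def sha1_of(path: Path) -> str:
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def match_firmware_files(firmware_root: str, index: dict) -> dict:
    """Return {extracted file -> matching package coordinates} for files found in the index."""
    matches = {}
    for p in Path(firmware_root).rglob("*"):
        if not p.is_file():
            continue
        digest = sha1_of(p)
        if digest in index:
            matches[str(p)] = index[digest]
    return matches
```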
Thanks for the detail @eShuttleworth. A quick clarification on your use case to ensure that ClearlyDefined is applicable here... Are these extracted files source? Binaries taken from packages? Binaries that you (someone other than the package publisher) built? ClearlyDefined will only have data related to files found in the canonical packages or corresponding source repos.
These are files that have been extracted from images of *nix devices, usually containing a mix of binaries compiled from source and from packages. I don't expect to be able to use ClearlyDefined to get too much information about these binaries, but I am hoping that it will help identify packages from interpreted/JIT languages like Python, JS, and Ruby.
I am making this a Google Summer of Code idea.
Together with @jeffmendoza and @pombredanne, we have drafted a GSoC topic in the context of Software Heritage for mirroring and integrating ClearlyDefined data into the Software Heritage archive (and also work on integration aspects in the converse direction, which are not relevant for this issue, so I'm leaving them aside for now). Completing such a task will not address this issue in its full generality, but hopefully it will provide a useful return on experience from a first mirroring attempt of ClearlyDefined data.
My team and I recently integrated with CD to pull package metadata, focusing on licensing. Our use case is quite simple, but we faced challenges that a mirroring solution could greatly help with. One thing that wasn't mentioned earlier but can be a potential use-case (it is for us) is owning the data (definitions) to run more complex, business-specific queries on it. For this use-case, replicating the service won't work. I see 2 scenarios for this:
Other challenges were already mentioned but I'll add them here for completeness.
We have a solution for these challenges, but it's flaky and will require further work and tweaking for stability and robustness and to avoid rate limits. It's also not very scalable as the number of packages relevant to us grows.
A few examples of use-cases where data replication will be more beneficial than service replication:
These are just a few examples, but anything that requires consuming or processing a large dataset will be easier with access to the data instead of a web service.
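As a small illustration of the kind of bulk, business-specific processing that is awkward over a web API but straightforward over mirrored data, here is a hedged sketch that tallies declared licenses across locally mirrored definition files. The mirror path is an assumption; `licensed.declared` is the field ClearlyDefined definitions use for the declared license.

```python
import json
from collections import Counter
from pathlib import Path

def declared_license_histogram(mirror_root: str) -> Counter:
    """Count declared licenses across all mirrored definition JSON files."""
    counts = Counter()
    for path in Path(mirror_root).rglob("*.json"):
        definition = json.loads(path.read_text())
        declared = definition.get("licensed", {}).get("declared", "NOASSERTION")
        counts[declared] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical local mirror path; point this at wherever the definitions were synced.
    for license_id, count in declared_license_histogram("/data/cd-mirror").most_common(20):
        print(f"{count:>10}  {license_id}")
```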
FWIW, I have developed a reasonably robust tool that can sync and keep syncing the whole ClearlyDefined dataset, and I will release it tomorrow to present in the weekly call.
@zacchiro @romanlab @kpfleming @jefshi88 @bduranc @fossygirl and all:

@romanlab I added a new option to fetch only the definitions in https://github.com/nexB/clearcode-toolkit/tree/only-def-and-misc based on your request during the meeting :)

@jeffmcaffer you reached out to get the details since you could not join yesterday; here they are ^

All: at this stage my suggested and preferred course of action would be to adopt this tool as the solution for this ticket. It works, it is tested, and it is available now. Any objections? If so, please voice them here. Separately, it would be greatly helpful to work on fixing some of the issues I faced, such as:
And, to a lesser degree:
Thanks for this @pombredanne. Interesting. I took a quick look at the slides and have some thoughts (below). Happy to follow up in a call with interested parties. My apologies for missing the call where I suspect at least some of this was discussed.
Would love to know more about the Cloudflare issues. That feels like a separate topic. If Cloudflare is causing problems, they should be fixed or Cloudflare eliminated. People should not *have* to set up a mirror to get reliability.
Hi @jeffmcaffer, thank you for your reply! You wrote:
That's a one-time operation, so it's not a big concern now that I am over that hump. I am working out something so we can make a seed DB available for public consumption, making this a non-event.
I am assuming you mean step 3 and step 1 in the presentation at https://github.com/nexB/clearcode-toolkit/tree/32f310669603d17c9adc594104694db0a3f0a878/docs Steps 1 and 3 are the same: items are fetched (so I guess pulled) from the ClearlyDefined API and stored in the filesystem and/or in a database. There is also a command-line utility to import a filesystem layout into a DB layout, since we started with files until that proved impractical and then switched to using a DB.
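For readers unfamiliar with that fetch-and-store flow, here is a heavily simplified, illustrative sketch of pulling a definition from the public ClearlyDefined API and writing it to the filesystem. It is not the clearcode-toolkit implementation, and the output directory is an assumption.

```python
import json
from pathlib import Path

import requests

API_BASE = "https://api.clearlydefined.io"

def fetch_and_store(coordinates: str, mirror_root: str = "/data/cd-mirror") -> Path:
    """Fetch one definition by its coordinates and write it as a local JSON file.

    Example coordinates: 'npm/npmjs/-/lodash/4.17.21'.
    """
    response = requests.get(f"{API_BASE}/definitions/{coordinates}", timeout=30)
    response.raise_for_status()
    definition = response.json()

    target = Path(mirror_root) / f"{coordinates}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(definition, indent=2))
    return target
```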
It makes sense to me and I will maintain that tool whether this happens or not.
This is a tool that you can check out and run as-is with minimal configuration, so IMHO that's not an issue, especially since Python is already part of the node.js stack and required for node-gyp.
It uses the API, so it will have to adapt to any API changes. I hope that such changes are uncommon and small, as they have been in the past, so this is unlikely to be an issue.
The scenario for step 5 is a case where I use the data in an air-gapped, isolated environment with no internet access, so by definition I cannot call ClearlyDefined or anything else from there. The process is therefore:
It is completely unrelated, yet it is a highly similar and simplified version that could easily be made to have the same semantics.
I sure do, and I have mirrored them all. I have a GSoC student who is looking at the data to help spot ScanCode license detection inconsistencies using stats and machine learning. All the cases where you want to trust but verify would need them too, as well as cases where there are curations that require more digging and getting back to actual ScanCode scans (say, if you report and/or curate a license as
I agree that's a separate topic. I have no idea what the core issue may be; I just happen to have traced Cloudflare as the cause of hiccups in sustained API calls. It could just be that any glitch looks like a Cloudflare issue since they front everything? That's minor anyway now that I have a complete base seed DB mirror and only fetch increments.
@romanlab FYI, I just merged aboutcode-org/clearcode-toolkit#20, which added the ability to mirror only definitions as we discussed during the call. Feel free to reach out for help as needed.
We briefly discussed during and after the call how to go about this. Various interested parties (including Software Heritage) can easily offer storage space for the seed DB, but the problem is supporting egress traffic for downstream mirrors starting from scratch. Torrent is also a possibility, but it would still require a decent number of seeders participating.
I agree both with Philippe's point about the existence of such use cases and with you that mirroring only the definitions will probably be a much more common use case. @pombredanne: given the new ability to mirror only definitions, would it also be possible to seed an initial mirror whose main aim is to mirror only definitions? If so, how much would that change the volume of the initial seed DB to be hosted? It might be worth having two different kinds of initial seed databases if the demand from use cases is very different (as I think it is).
@zacchiro re:
My hunch is that this should be a few tens of GB, and therefore much easier to handle indeed. @MaJuRG since you own the DB, it would be interesting to get a rough-cut estimate of what a dump of definitions only would be.
@zacchiro re:
To your earlier point, tens of GB as an order of magnitude becomes much simpler, and I could (relatively) straightforwardly open up a server for reliable, resumable rsync fetching of the seed data (possibly split in chunks, but that's JSON so no biggie).
If that number is in bytes, then it's ~105 GB for all the definitions (in compressed form).
@zacchiro that makes it a much smaller data set, roughly 10 times smaller and much simpler to seed, including doing increments! Let's do it! @MaJuRG I wonder what the most effective way to craft selective dumps would be then:
I believe the REST API could be modified to do what you describe to handle increments, @pombredanne. The mirrors could even update via HTTP if desired. A PostgreSQL COPY script would probably work as well. In my opinion, it depends on the use-case and hardware setup.
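To make the COPY idea concrete, here is a hedged sketch of dumping recently updated definitions from a PostgreSQL mirror using psycopg2. The connection details and the table/column names (clearcode_cditem, path, content, last_modified) are guesses for illustration and may well differ from the real schema.

```python
import sys
import psycopg2

# All connection details and the table/column names below are illustrative assumptions.
DSN = "dbname=clearcode user=clearcode host=localhost"
QUERY = """
    COPY (
        SELECT path, content, last_modified
        FROM clearcode_cditem                -- hypothetical table of mirrored items
        WHERE last_modified > %s             -- only rows changed since the given cutoff
    ) TO STDOUT WITH (FORMAT csv, HEADER true)
"""

def dump_increment(since: str, out_path: str) -> None:
    """Write an incremental CSV dump of definitions modified after `since` (ISO timestamp)."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur, open(out_path, "w") as out:
        # mogrify interpolates the timestamp client-side, since COPY cannot take parameters.
        cur.copy_expert(cur.mogrify(QUERY, (since,)).decode(), out)

if __name__ == "__main__":
    dump_increment(sys.argv[1], sys.argv[2])   # e.g. "2020-06-01T00:00:00Z" increment.csv
```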
@MaJuRG Thanks!
I would like to move swiftly enough here, so unless there is an objection in the next few days, my plan would be to:
@jeffmcaffer re again:
Actually, that's even a non-issue IMHO, since ScanCode is already part of the overall stack and requires Python.
Hi,
@sdavtaker you wrote:
ClearCode did not receive any interest or traction to mirror and unlock the data here. But "ClearCode" is not stale and lives a happy life in the AboutCode org: it was originally at https://github.com/aboutcode-org/clearcode-toolkit and has since been integrated and merged into the PurlDB project https://github.com/aboutcode-org/purldb, which is a database of all PURLs (https://github.com/package-url/purl-spec). Beyond that, I am working with AboutCode maintainers to eventually mirror all of PurlDB (including and beyond ClearlyDefined) using the code/work in progress in https://github.com/aboutcode-org/federatedcode, which is also documented in this paper: https://www.tdcommons.org/dpubs_series/5632/ Feel free to reach out directly if you want to understand more about it: [email protected]
There is interest in people replicating definition data locally to support robustness/performance, infrastructure control, and privacy.
Principles:
Options:
Random thoughts/topics
cc: @jeffmendoza