Support streamed download to a path #94

Open
59e5aaf4 opened this issue May 3, 2023 · 5 comments
Labels
enhancement 🌟 New feature or request

Comments

@59e5aaf4

59e5aaf4 commented May 3, 2023

There is currently no way, in either FalconPy or Caracara, to stream a file down to disk. As a result, a 4GB response ends up entirely loaded in memory at some point.

We're using ugly hacks to pass stream=True to a raw requests.request and call it a day, so that we can stream a large file to disk chunk by chunk.

from requests import request
from tqdm import tqdm

def rtr_session_download_to_path(self, session_id, sha256, destination, known_size=None):
    '''
    Downloads an extracted file straight into a file (7z -pinfected) using
    chunks, so that we don't have a 4GB single HTTP request in memory at
    some point. Or several in parallel.

    destination is expected to be a pathlib.Path.
    '''
    # First, prepare an HTTP request by stealing the self.falcon config for URL & token
    url = f'{self.falcon.base_url}/real-time-response/entities/extracted-file-contents/v1'
    params = {
        'session_id': session_id,
        'sha256': sha256,
    }
    self.logger.debug(f'Getting file sha256={sha256}, session_id={session_id} into {destination}')
    total_written_bytes = 0
    with request(
        'get', url,
        # Here we assume the token is fresh enough, which is usually the case since we just listed the file properties.
        headers=self.falcon.headers(),
        verify=self.falcon.ssl_verify,
        stream=True,  # do not read the body until we iterate over it
        params=params,
    ) as r:
        # Fail early so that an error body is never written to disk as if it were the file
        r.raise_for_status()
        if not destination.parent.exists():
            self.logger.info(f'Creating folder {destination.parent}')
            destination.parent.mkdir(parents=True, exist_ok=True)
        with destination.open('wb') as f, tqdm(
            desc=str(destination),
            total=known_size,
            unit='iB',
            unit_scale=True,
            unit_divisor=1024,
        ) as bar:
            self.logger.debug('Actual download iteration start')
            # Write the body 10 KiB at a time and keep the progress bar in sync
            for chunk in r.iter_content(chunk_size=10*1024):
                written_bytes = f.write(chunk)
                bar.update(written_bytes)
                total_written_bytes += written_bytes

    return destination, total_written_bytes

Could this be done natively by Caracara? I'm no asyncio expert, but there's some HTTP + file magic to be done here imo.

Thanks !

@ChristopherHammond13
Member

This is a really neat trick, and actually close to something I implemented internally for this type of use case. I agree that we should do this.

In Caracara, we just call the endpoint directly in FalconPy -- https://github.com/CrowdStrike/caracara/blob/3cf2e49440b6ea2d05dc2a9a65b3e8052de144b8/caracara/modules/rtr/get_file.py#LL72C66-L76C66

@jshcodes is this something that we can build into FalconPy? Otherwise, I could short-circuit FalconPy here and perform this request manually, but obviously we would like API operations to be handled natively by FalconPy where possible to take advantage of the abstraction layer there. It might be that you'll need to return a request object back directly, or at least enable stream=True so that we do not have to download the whole blob at once.
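
As a rough illustration of the "return a request object back directly" option: if the SDK handed back a streamed response, the caller-side chunking could be kept independent of the HTTP layer. The sketch below is only an assumption about how that split might look; `write_chunks` is a hypothetical helper, not part of FalconPy or Caracara, and it accepts any iterable of bytes (for example the iterator returned by `iter_content` on a `requests` response opened with `stream=True`).

```python
from pathlib import Path
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], destination: Path) -> int:
    """Write an iterable of byte chunks to destination, returning bytes written.

    Hypothetical helper for illustration only. Only one chunk is held in
    memory at a time, so the size of the whole download does not matter.
    """
    # Create the target folder if it does not exist yet
    destination.parent.mkdir(parents=True, exist_ok=True)
    total = 0
    with destination.open('wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                total += f.write(chunk)
    return total


# Possible use with a streamed requests response (not executed here):
#     with requests.get(url, headers=headers, params=params, stream=True) as r:
#         r.raise_for_status()
#         write_chunks(r.iter_content(chunk_size=10 * 1024), destination)
```

Keeping the writer separate from the request would let either library own the HTTP call while the file handling stays the same.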

@59e5aaf4
Author

59e5aaf4 commented May 3, 2023

Also, ahem, https://eu-1.ideas.crowdstrike.com/ideas/IDEA-I-10248 : there's no support for the Range HTTP header, which prevents partial downloads (on either the normal API or the WebUI API). Another pretty relevant missing header is Content-Length; the closest thing we can get is the size field of an RTR file object, but that describes the size of the file inside the 7z archive, not of the 7z archive itself.

If you know anyone that might be able to take a look at this, feel free to share the concern :D I am really surprised we only get 1 try to download 4GB files, and then have to start over from offset=0 if the VPN gods decide to banish a socket into the void.
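
For context, this is roughly what a client-side resume could look like if the endpoint ever honored Range (per the above, it currently does not, so this is purely a sketch). Both helper names are hypothetical. The first builds a standard `Range: bytes=<offset>-` header from whatever is already on disk; the second decides whether to append or overwrite based on whether the server answered 206 Partial Content.

```python
from pathlib import Path


def resume_request_headers(partial: Path) -> dict:
    """Return a Range header asking for the bytes not yet on disk.

    Hypothetical helper. Returns an empty dict when nothing has been
    downloaded yet, so the request is a normal full download.
    """
    offset = partial.stat().st_size if partial.exists() else 0
    return {'Range': f'bytes={offset}-'} if offset else {}


def open_mode_for(status_code: int) -> str:
    """Pick the file mode for the response body.

    206 means the server honored the Range header, so append to the
    partial file. Any other status (typically 200) carries the full
    body, so the partial file must be overwritten from the start.
    """
    return 'ab' if status_code == 206 else 'wb'
```

Checking the status code matters because a server that ignores Range replies 200 with the whole file, and appending that to a partial file would corrupt it.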

@ChristopherHammond13
Member

@59e5aaf4 This is great feedback. I have posed the question over to the RTR team to see what they say. I suspect it could require a few teams to get involved, but I'll keep tabs on it internally. For now, I think getting the chunking support provided first party would be an important first step. I am speaking with @jshcodes on the side to figure out which code should own this functionality, and will update this ticket once we figure out the best place for this to live.

@jshcodes
Member

jshcodes commented May 3, 2023

Providing support for the stream keyword in requests is a neat idea and should not be difficult to implement.

Content-Length will require a bit more effort, but it makes sense and we should support it if we can.
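
On the consuming side, a client could use Content-Length for the progress bar total when it is present and fall back to another estimate otherwise. A minimal sketch, assuming the header is eventually sent; `expected_size` is a hypothetical helper and not an existing FalconPy function.

```python
from typing import Mapping, Optional


def expected_size(headers: Mapping[str, str], fallback: Optional[int] = None) -> Optional[int]:
    """Return the body size announced by Content-Length, or fallback.

    Hypothetical helper. A plain dict lookup is case-sensitive; the
    headers object on a requests response is case-insensitive, so the
    exact spelling only matters for plain dicts like the ones below.
    """
    value = headers.get('Content-Length')
    if value is not None and value.strip().isdigit():
        return int(value)
    # Header missing or malformed: use the caller's estimate, if any
    return fallback
```

The result could be passed as `total` to a progress bar, with `None` meaning the size is unknown.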

FalconPy v1.2.16 development will start here soon, and this enhancement has been added to the punch list. 👊

@ChristopherHammond13
Member

Awesome, thanks @jshcodes! Once 1.2.16 is launched, we can figure out the best way to expose this to Caracara users in a way that makes the most sense. Ultimately this SDK is about doing as much for users as possible, so I'm open to taking something more raw from FalconPy, or perhaps we can get FalconPy to perform the download and move the chunking there?


3 participants