Conversation
Thanks for opening the PR! I think that log schema is almost right, but we also need an "input_ids" field so we can trace a row in the table (i.e., a log) back to its inputs. Every row will then represent one output document of an operation. I don't think we need a "log_message" field; this table is more about enabling us to reconstruct the lineage of any DocETL pipeline output, which just requires attaching an id to every output and having it point to its input_ids in the table. We will also want a second table with just the id of an input or output and its value, so two tables overall. I don't think we need to do anything regarding console logging! I'm imagining the PR logic can be something like:
Maybe we can make a PR with this code for just a map operation, then extend step 4 to the other operators? LMK what makes sense or doesn't!
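To make the two-table idea concrete, here is a minimal sketch of how it could look with SQLite: one table mapping an id to a document's value, and one lineage table mapping each output id to its operation and the ids of its inputs. The table and function names (`documents`, `lineage`, `record_output`) are illustrative assumptions, not DocETL's actual API or schema.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, value TEXT)")
conn.execute(
    "CREATE TABLE lineage (id TEXT PRIMARY KEY, operation TEXT, input_ids TEXT)"
)

def record_document(value):
    """Store one input or output document and return its generated id."""
    doc_id = str(uuid.uuid4())
    conn.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, json.dumps(value)))
    return doc_id

def record_output(operation, input_ids, output_value):
    """Store an operation's output document and link it back to its input ids."""
    out_id = record_document(output_value)
    conn.execute(
        "INSERT INTO lineage VALUES (?, ?, ?)",
        (out_id, operation, json.dumps(input_ids)),
    )
    return out_id

# One map output traced back to its single input document:
in_id = record_document({"text": "hello"})
out_id = record_output("map", [in_id], {"summary": "a greeting"})
row = conn.execute("SELECT input_ids FROM lineage WHERE id = ?", (out_id,)).fetchone()
parents = json.loads(row[0])
```

Walking the `lineage` table recursively from any output id would reconstruct the full provenance of a pipeline result, which is all this design needs to support.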
Hi @shreyashankar, again this will take some time to make more changes. Yes, this makes sense: the log message is basically the operation that ran and the data. I am thinking every log message would be logged, or do we want the intermediate data to be logged as well? I haven't run in debug mode, so I don't know much about the log traces. The major problem I see is the schema for the table: if we confine it to one default schema, that's fine, but if users can give their own schema it becomes problematic, especially for doing any mappings against it.
Yes, let me get back to you with an example in the next few days. I'm trying to wrap up some things before the end of the week for a paper deadline.
@shreyashankar I've kind of lost track of how we should structure this.

So, what is in the PR:
- the log schema I am considering
- I want lineage only for operations; every YAML file given can have a "process_id", or one will be randomly generated, as a key for the logs
- the database connection happens at the start of the application
- I want the console logging function itself to have an extra feature: it uses that connection to push each log message to the logging table

Where I am stuck:
- user-defined schemas mess this up, because the code needs to be altered for each specific schema
- obviously I haven't tested any of it yet :(
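The "process_id" and console-logging ideas above could be sketched roughly as follows: take the id from the pipeline's YAML config if present, otherwise generate one, and have a logging helper that both prints to the console and pushes the message into a logging table. All names here (`logs` table, `get_process_id`, `log_to_db`) are hypothetical, just to illustrate the shape of the change.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (process_id TEXT, operation TEXT, message TEXT)")

def get_process_id(config):
    # Use the user-supplied "process_id" from the YAML config if it exists;
    # otherwise fall back to a randomly generated one.
    return config.get("process_id") or str(uuid.uuid4())

def log_to_db(process_id, operation, message):
    # Mirror the console log into the logging table using the shared connection
    # that was opened at application start.
    print(f"[{process_id}] {operation}: {message}")
    conn.execute(
        "INSERT INTO logs VALUES (?, ?, ?)", (process_id, operation, message)
    )

pid = get_process_id({"name": "my_pipeline"})  # no explicit process_id -> random
log_to_db(pid, "map", "processed 10 documents")
```

Keying every row by the process_id would let one table hold logs from many pipeline runs, which sidesteps part of the per-user-schema problem.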