
Store List of Fields in Segment #2279

Draft · wants to merge 1 commit into main
Conversation

PSeitz (Contributor) commented Dec 13, 2023:

Fields may be encoded in the columnar storage or in the inverted index
for JSON fields.
Add a new segment file that contains the list of fields (schema +
encoded).

Comment on lines +304 to 322
if let Some(list_fields_file) = self.list_fields_file.as_ref() {
let file = list_fields_file.read_bytes()?;
let fields_metadata =
read_split_fields(file)?.collect::<io::Result<Vec<FieldMetadata>>>();
fields_metadata.map_err(|e| e.into())
} else {
// Schema fallback
Ok(self
.schema()
.fields()
.map(|(_field, entry)| FieldMetadata {
field_name: entry.name().to_string(),
typ: entry.field_type().value_type(),
indexed: entry.is_indexed(),
stored: entry.is_stored(),
fast: entry.is_fast(),
})
.collect())
}
Collaborator:

It is admittedly code golfing, but I think it makes it a bit less noisy:

Suggested change
let fields_metadata = if let Some(list_fields_file) = self.list_fields_file.as_ref() {
let file = list_fields_file.read_bytes()?;
read_split_fields(file)?.collect::<io::Result<Vec<FieldMetadata>>>()?
} else {
// Schema fallback
self.schema()
.fields()
.map(|(_field, entry)| FieldMetadata {
field_name: entry.name().to_string(),
typ: entry.field_type().value_type(),
indexed: entry.is_indexed(),
stored: entry.is_stored(),
fast: entry.is_fast(),
})
.collect()
};
Ok(fields_metadata)

@@ -24,34 +26,44 @@ impl From<u32> for OrderedPathId {

#[derive(Default)]
pub(crate) struct PathToUnorderedId {
map: FnvHashMap<String, u32>,
/// TinySet contains the type codes of the columns in the path.
Collaborator:

I think defining a struct with named fields would improve readability here, likely making the comment unnecessary or at least attaching it to the named field directly.

let mut sorted_ids: Vec<(&str, (u32, TinySet))> = self
.map
.iter()
.map(|(k, (id, typ_code))| (k.as_str(), (*id, *typ_code)))
Collaborator:

I think using the plural typ_codes would be good here and below, or maybe all_codes or typ_code_bitvec as above, to visually indicate that this is not a single code as in e.g. insert_new_path.

@@ -81,6 +84,11 @@ impl SegmentSerializer {
&mut self.postings_serializer
}

/// Accessor to the ``.
Collaborator:

Missing name inside the backticks?

serializer: &mut InvertedIndexSerializer,
) -> crate::Result<()> {
// Replace unordered ids by ordered ids to be able to sort
let unordered_id_to_ordered_id: Vec<OrderedPathId> =
ctx.path_to_unordered_id.unordered_id_to_ordered_id();
let ordered_id_to_path = ctx.path_to_unordered_id.ordered_id_to_path();
Collaborator:

Why move the binding up here if it is not used in the loop? I would suggest leaving it below to make that clear to the reader.

Contributor Author (PSeitz):

The Vec is reused in another method

Collaborator:

Isn't it unordered_id_to_ordered_id that is reused in serialize_segment_fields? I was thinking about the binding for ordered_id_to_path which moved up here from line 79 without any change to how it is used.

// In this case we need to map the potential fast field to the field name
// accepted by the query parser.
let create_canonical =
!field_entry.is_expand_dots_enabled() && json_path.contains('.');
Collaborator:

It seems dangerous to pass the name (which is derived from field_entry) as an argument but access field_entry as a capture. Maybe build_path should be a free function taking field_entry and json_path (and map_to_canonical) as arguments to make it obvious what it accesses and how.

write_field(field_metadata, &mut payload);
}
let compression_level = 3;
let payload_compressed = zstd::stream::encode_all(&mut &payload[..], compression_level)
Collaborator:

Since you are compressing from full buffers to full buffers, Zstd's bulk API might yield better results (as @trinity-1686a found elsewhere IIRC).

@trinity-1686a (Contributor) commented Dec 13, 2023:

see #1946 (comment) and #1946 (comment). This might be more of an issue when decompressing than compressing, but either way the bulk API is faster, so if it's easy to use, we should use it

mut reader: R,
) -> io::Result<impl Iterator<Item = io::Result<FieldMetadata>>> {
let format_version = read_exact_array::<_, 1>(&mut reader)?[0];
assert_eq!(format_version, 1);
Collaborator:

Shouldn't this be an error instead of a panic?

}

/// Reads the Split fields from a stream of bytes
fn read_split_fields_from_zstd<R: Read>(
Collaborator:

To be honest, I would suggest merging this into read_split_fields. The name was surprising to me as it does nothing specific to Zstd, and I think following the logic of constructing the iterator is easier if it is not split over multiple functions.

(If you do go for Zstd's bulk API, this would probably become simpler as well, since a full in-memory representation (if that is not considered too memory hungry) can simply be chunked instead of calling read_exact repeatedly.)

) -> io::Result<impl Iterator<Item = io::Result<FieldMetadata>>> {
let mut num_fields = u32::from_le_bytes(read_exact_array::<_, 4>(&mut reader)?);

Ok(std::iter::from_fn(move || {
Collaborator:

I am not sure if this breaks anything w.r.t. error handling, but wouldn't something like

Ok((0..num_fields).map(move |_| read_field(&mut reader)))

work as well?

@PSeitz marked this pull request as draft December 14, 2023 07:11