CQL Vector support #1165

Open
wants to merge 7 commits into base: main

Conversation

@smoczy123 smoczy123 commented Jan 6, 2025

This PR adds serialization and deserialization of the CQL Vector type as implemented in Cassandra, achieving compatibility with Cassandra's Vector type. Note that Cassandra serializes and deserializes vectors in a way that contradicts the CQL protocol: for vectors whose element type has variable length, it encodes each element's size as an [unsigned vint] instead of an [int].
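
To make the difference concrete, here is a minimal sketch of the two length prefixes. It assumes the [unsigned vint] layout described in the native protocol v5 specification (the number of leading 1-bits in the first byte gives the number of extra bytes that follow). The function names are illustrative and are not the driver's API.

```rust
/// Number of bytes an [unsigned vint] needs for `value` (between 1 and 9).
fn unsigned_vint_size(value: u64) -> usize {
    // `value | 1` makes zero behave like one, so leading_zeros is at most 63.
    let magnitude = (value | 1).leading_zeros() as usize;
    (639 - magnitude * 9) >> 6
}

/// Appends `value` as an [unsigned vint] (assumed layout, see the lead-in).
fn encode_unsigned_vint(mut value: u64, out: &mut Vec<u8>) {
    let size = unsigned_vint_size(value);
    let extra_bytes = size - 1;
    let mut bytes = [0u8; 9];
    // Fill the used bytes from the least significant end (big-endian result).
    for i in (0..size).rev() {
        bytes[i] = value as u8;
        value >>= 8;
    }
    // Mark the number of extra bytes with leading 1-bits in the first byte.
    // The shift is done in u16 so that `extra_bytes == 8` does not overflow.
    bytes[0] |= !(0xffu16 >> extra_bytes) as u8;
    out.extend_from_slice(&bytes[..size]);
}

/// Appends `len` as a protocol [int]: four bytes, big-endian.
fn encode_int_length(len: i32, out: &mut Vec<u8>) {
    out.extend_from_slice(&len.to_be_bytes());
}
```

Under these assumptions, a 3-byte element is prefixed with `00 00 00 03` when its size is written as an [int], but with the single byte `03` when written as an [unsigned vint]; a 300-byte element gets the two-byte prefix `81 2C`.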

Fixes #1014

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@github-actions github-actions bot added the semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes label Jan 6, 2025

github-actions bot commented Jan 6, 2025

cargo semver-checks detected some API incompatibilities in this PR.
Checked commit: 8e128e1

See the following report for details:

cargo semver-checks output
./scripts/semver-checks.sh --baseline-rev 4a9367c1e1671773a704670de3c6acd089dc70d0
+ cargo semver-checks -p scylla -p scylla-cql --baseline-rev 4a9367c1e1671773a704670de3c6acd089dc70d0
     Cloning 4a9367c1e1671773a704670de3c6acd089dc70d0
    Building scylla v0.15.0 (current)
       Built [  22.570s] (current)
     Parsing scylla v0.15.0 (current)
      Parsed [   0.051s] (current)
    Building scylla v0.15.0 (baseline)
       Built [  22.889s] (baseline)
     Parsing scylla v0.15.0 (baseline)
      Parsed [   0.051s] (baseline)
    Checking scylla v0.15.0 -> v0.15.0 (no change)
     Checked [   0.110s] 107 checks: 107 pass, 0 skip
     Summary no semver update required
    Finished [  47.245s] scylla
    Building scylla-cql v0.4.0 (current)
       Built [  11.481s] (current)
     Parsing scylla-cql v0.4.0 (current)
      Parsed [   0.036s] (current)
    Building scylla-cql v0.4.0 (baseline)
       Built [  11.335s] (baseline)
     Parsing scylla-cql v0.4.0 (baseline)
      Parsed [   0.033s] (baseline)
    Checking scylla-cql v0.4.0 -> v0.4.0 (no change)
     Checked [   0.111s] 107 checks: 106 pass, 1 fail, 0 warn, 0 skip

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.38.0/src/lints/enum_variant_added.ron

Failed in:
  variant ColumnType:Vector in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla-cql/src/frame/response/result.rs:87
  variant CqlValue:Vector in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla-cql/src/frame/response/result.rs:216

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  23.727s] scylla-cql
make: *** [Makefile:61: semver-rev] Error 1

@smoczy123
Author

I'm not sure this is the correct way to split this PR into commits (I'm pretty sure it isn't, as the commits won't compile), but I can't think of a proper way.

@smoczy123 smoczy123 marked this pull request as ready for review January 10, 2025 03:19
@smoczy123 smoczy123 force-pushed the vector-type branch 2 times, most recently from 6aee097 to 440d63a Compare January 13, 2025 12:38
@wprzytula wprzytula added this to the 0.16.0 milestone Jan 13, 2025
Contributor

@muzarski muzarski left a comment

I only reviewed the first commit (introduction of TypeParser)

Some general comments:

  1. The logic of TypeParser is quite complex. I suggest adding some docstrings next to the type definitions and methods. For example, I have no idea what TypeParser::from_hex does. Docstrings will also help a lot in the future in case some other developer touches this piece of code.
  2. It's worth adding some comments next to the non-intuitive parts of the code. Example:
        if name.is_empty() {
            if !self.is_eos() {
                return Err(CqlTypeParseError::AbstractTypeParseError());
            }
            return Ok(ColumnType::Blob);
        }

It's not obvious why we return Blob if name is empty. A link to the corresponding part of original source code would be helpful.

  3. Please, add some unit tests. I saw that there is some small test of TypeParser in a later commit. I think we should add more tests and try to handle as many parsing cases as we can. In addition, I think that in this case, unit tests should be added in the same commit (they help during review - it's easier to reason about the complex code when there are some use case examples one can look at).
  4. This implementation is based on some existing (probably Java) implementation, correct? If so, please, provide the link to the source in the commit. Ideally, the link should be placed in the comments in code as well.

Comment on lines 487 to 491
InvalidInetLength(u8),
#[error("UTF8 deserialization failed: {0}")]
UTF8DeserializationError(#[from] std::str::Utf8Error),
#[error(transparent)]
ParseIntError(#[from] std::num::ParseIntError),
Contributor

I believe this is some remnant from one of your initial implementations. It's not needed anymore AFAIU (I deleted it locally and everything compiled).

Comment on lines 457 to 458
#[error("Failed to parse abstract type")]
AbstractTypeParseError(),
Contributor

  1. What is an abstract type? I thought that what TypeParser does is it parses some specific Custom CQL types. Maybe it should be called CustomTypeParseError.
  2. Need more context - what exactly failed during parsing of custom type? I propose to create a new error type called CustomTypeParseError. It should be an enum with variants corresponding to the possible cause of failures. Then CqlTypeParseError could have a variant like:
#[error("Failed to parse custom CQL type: {0}")]
CustomTypeParseError(#[from] CustomTypeParseError)

cc: @wprzytula
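
A minimal sketch of what such an error hierarchy could look like. The variant names are hypothetical, chosen only for illustration. The crate derives its error messages with `#[error(...)]` attributes, but the sketch implements `Display` by hand so that it depends on the standard library only.

```rust
use std::fmt;

/// Hypothetical causes of a custom-type parse failure (illustrative names).
#[derive(Debug, Clone, PartialEq)]
pub enum CustomTypeParseError {
    UnexpectedEndOfInput,
    UnexpectedCharacter { found: char, position: usize },
    InvalidParameter(String),
}

impl fmt::Display for CustomTypeParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UnexpectedEndOfInput => write!(f, "unexpected end of input"),
            Self::UnexpectedCharacter { found, position } => {
                write!(f, "unexpected character {found:?} at position {position}")
            }
            Self::InvalidParameter(p) => write!(f, "invalid type parameter: {p}"),
        }
    }
}

impl std::error::Error for CustomTypeParseError {}

/// Reduced stand-in for the broader error type; only the new variant is shown.
#[derive(Debug, Clone, PartialEq)]
pub enum CqlTypeParseError {
    CustomTypeParseError(CustomTypeParseError),
}

impl fmt::Display for CqlTypeParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::CustomTypeParseError(e) => {
                write!(f, "Failed to parse custom CQL type: {e}")
            }
        }
    }
}

// Hand-written equivalent of the `#[from]` attribute in the snippet above.
impl From<CustomTypeParseError> for CqlTypeParseError {
    fn from(e: CustomTypeParseError) -> Self {
        Self::CustomTypeParseError(e)
    }
}
```

With this shape, parser functions can return the narrow `CustomTypeParseError`, and callers that need the broad type convert with `?` or `.into()`.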

Comment on lines +6 to +10
type UDTParameters<'result> = (
Cow<'result, str>,
Cow<'result, str>,
Vec<(Cow<'result, str>, ColumnType<'result>)>,
);
Contributor

@muzarski muzarski Jan 13, 2025

nit: I'd prefer to have it as a struct instead of a type alias. Cow<'result, str> type appears twice and it's hard to reason about it without explicit field names.

Collaborator

nit: I'd prefer to have it as a struct instead of a newtype.

Actually, a struct is called a newtype. type is a type alias, which is not a new type, but merely a new name for an existing type.
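
A minimal sketch of the struct variant being suggested. The field names (`keyspace`, `type_name`, `fields`) are guesses at what each tuple position means, and `ColumnType` is reduced to a two-variant stand-in without a lifetime parameter so that the example stays self-contained.

```rust
use std::borrow::Cow;

/// Simplified stand-in for the driver's `ColumnType` (assumption).
#[derive(Debug, PartialEq)]
enum ColumnType {
    Int,
    Text,
}

/// Named-field replacement for the `UDTParameters` tuple alias.
/// Field names are hypothetical.
#[derive(Debug, PartialEq)]
struct UdtParameters<'result> {
    keyspace: Cow<'result, str>,
    type_name: Cow<'result, str>,
    fields: Vec<(Cow<'result, str>, ColumnType)>,
}

/// Call sites read fields by name instead of by tuple position.
fn describe(params: &UdtParameters<'_>) -> String {
    format!(
        "{}.{} ({} fields)",
        params.keyspace,
        params.type_name,
        params.fields.len()
    )
}
```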

Comment on lines +12 to +15
pub(crate) struct TypeParser<'result> {
pos: usize,
str: Cow<'result, str>,
}
Contributor

nit: Probably need to rename it to CustomTypeParser (or AbstractTypeParser if we decide to stick to abstract naming convention). Same goes for the name of the module - type_parser.rs is not specific enough IMO.

Comment on lines 22 to 27
pub(crate) fn parse(str: Cow<'result, str>) -> Result<ColumnType<'result>, CqlTypeParseError> {
let mut parser = TypeParser::new(str);
parser.do_parse()
}
Contributor

All of the functions/methods in this module unnecessarily return such a broad error type as CqlTypeParseError. We could narrow it - see my other comment about introducing a separate error type for custom type parsing failures.

Comment on lines 283 to 296
if !self.is_eos() && self.str.as_bytes()[self.pos] == b':' {
self.pos += 1;
let _ = usize::from_str_radix(&name, 16)
.map_err(|_| CqlTypeParseError::AbstractTypeParseError());
name = self.read_next_identifier();
}
Contributor

@muzarski muzarski Jan 13, 2025

What does this part do? Is it tested somewhere?

@smoczy123
Author

The whole TypeParser logic was taken straight from ScyllaDB's vector implementation; however, as it is still in development and probably won't be merged for a while, it will be hard to link to it directly. IIRC there are a lot of tests there for this functionality, so they can also be borrowed.

@muzarski
Contributor

The whole TypeParser logic was taken straight from ScyllaDB's vector implementation; however, as it is still in development and probably won't be merged for a while, it will be hard to link to it directly. IIRC there are a lot of tests there for this functionality, so they can also be borrowed.

Ok, makes sense. And let's borrow the tests in that case :)

Collaborator

@wprzytula wprzytula left a comment

💯 What a great piece of code! Thank you for the contribution!

There are quite a lot of comments, though.
I think that the new parser module needs many more unit tests.
Also, tests for particular errors upon serialization and deserialization of Vector are missing.

Comment on lines 18 to 27
fn new(str: Cow<'result, str>) -> TypeParser<'result> {
TypeParser { pos: 0, str }
}

pub(crate) fn parse(str: Cow<'result, str>) -> Result<ColumnType<'result>, CqlTypeParseError> {
let mut parser = TypeParser::new(str);
parser.do_parse()
}
Collaborator

⛏️ Let's not use str as a name for a variable - it's a name of a type.

Comment on lines +39 to +43
fn char_at_pos(&self) -> char {
self.str.as_bytes()[self.pos] as char
}
Collaborator

🔧 This may panic. Wouldn't it be better to use the checked get(self.pos) method?
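
A minimal sketch of the checked variant, using the standard slice method `get`. The `Parser` struct here is a reduced, hypothetical stand-in for the PR's TypeParser (with the field named `input` rather than `str`).

```rust
/// Reduced stand-in for the parser state (hypothetical).
struct Parser<'a> {
    pos: usize,
    input: &'a str,
}

impl Parser<'_> {
    /// Returns the byte at the current position as a `char`,
    /// or `None` when the position is at or past the end of the input.
    fn char_at_pos(&self) -> Option<char> {
        // `get` returns `None` for an out-of-bounds index instead of panicking.
        self.input.as_bytes().get(self.pos).map(|&b| b as char)
    }
}
```

Callers then have to handle the end-of-input case explicitly, for example by mapping `None` to a parse error. Note that casting a byte to `char` is only meaningful here because type names are expected to be ASCII.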

Comment on lines +43 to +53
fn read_next_identifier(&mut self) -> Cow<'result, str> {
let start = self.pos;
while !self.is_eos() && TypeParser::is_identifier_char(self.char_at_pos()) {
self.pos += 1;
}
match &self.str {
Cow::Borrowed(s) => Cow::Borrowed(&s[start..self.pos]),
Cow::Owned(s) => Cow::Owned(s[start..self.pos].to_owned()),
}
Collaborator

🔧 This logic requires comments.
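
One way the requested comments could look, as a sketch. The parser struct is a reduced stand-in, and the set of identifier characters is an assumption modelled on Cassandra's Java TypeParser; the PR's `is_identifier_char` may differ.

```rust
use std::borrow::Cow;

/// Reduced stand-in for the parser state (hypothetical).
struct Parser<'result> {
    pos: usize,
    input: Cow<'result, str>,
}

impl<'result> Parser<'result> {
    /// Assumed identifier alphabet: ASCII letters, digits and `- + . _ &`.
    fn is_identifier_char(c: u8) -> bool {
        c.is_ascii_alphanumeric() || matches!(c, b'-' | b'+' | b'.' | b'_' | b'&')
    }

    /// Consumes the longest run of identifier characters starting at `pos`
    /// and returns it. Returns an empty string if there is none.
    fn read_next_identifier(&mut self) -> Cow<'result, str> {
        let start = self.pos;
        let bytes = self.input.as_bytes();
        // Advance until end of input or the first non-identifier byte.
        // All identifier bytes are ASCII, so `start..pos` always falls on
        // UTF-8 character boundaries and slicing below cannot panic.
        while self.pos < bytes.len() && Self::is_identifier_char(bytes[self.pos]) {
            self.pos += 1;
        }
        match self.input {
            // Borrowed input: the identifier can borrow from the original
            // buffer for the whole `'result` lifetime, with no allocation.
            Cow::Borrowed(s) => Cow::Borrowed(&s[start..self.pos]),
            // Owned input: a slice would borrow from `self`, which does not
            // live for `'result`, so the identifier has to be copied.
            Cow::Owned(ref s) => Cow::Owned(s[start..self.pos].to_owned()),
        }
    }
}
```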


pub(crate) struct TypeParser<'result> {
pos: usize,
str: Cow<'result, str>,
Collaborator

⛏️ Let's not use str as a name for a field - it's a name of a type.

Comment on lines 77 to 81
pub struct CellWriter<'buf> {
buf: &'buf mut Vec<u8>,
cell_len: Option<usize>,
size_as_uvarint: bool,
}
Collaborator

@Lorak-mmk As you know the serialization framework quite well, could you please aid in review of this commit?

Comment on lines 207 to 211
impl<'buf> CellValueBuilder<'buf> {
#[inline]
fn new(buf: &'buf mut Vec<u8>) -> Self {
fn new(buf: &'buf mut Vec<u8>, size_as_uvar_int: bool) -> Self {
// "Length" of a [bytes] frame can either be a non-negative i32,
// -1 (null) or -1 (not set). Push an invalid value here. It will be
Collaborator

🏕️ This looks like a typo in the comment: -1 is mentioned twice.
According to the CQL specs, not set is represented using -2.

Could you please fix this as a bonus, @smoczy123?

Collaborator

This typo is present in more than one place in this file.
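
For reference, the three cases of the [value] length prefix mentioned above can be sketched as follows. The enum and function names are illustrative and do not exist in the driver.

```rust
use std::convert::TryFrom;

/// Meaning of the 4-byte length prefix of a CQL [value] (illustrative type).
#[derive(Debug, PartialEq)]
enum ValueLength {
    /// -1: the value is null.
    Null,
    /// -2: the value is "not set".
    NotSet,
    /// A non-negative length: that many bytes follow.
    Bytes(usize),
}

/// Interprets an already-decoded [int] length; other negatives are invalid.
fn classify_value_length(raw: i32) -> Option<ValueLength> {
    match raw {
        -1 => Some(ValueLength::Null),
        -2 => Some(ValueLength::NotSet),
        n if n >= 0 => Some(ValueLength::Bytes(n as usize)),
        _ => None,
    }
}

/// Reads the big-endian [int] prefix from the start of `buf`.
fn read_value_length(buf: &[u8]) -> Option<ValueLength> {
    let prefix = <[u8; 4]>::try_from(buf.get(..4)?).ok()?;
    classify_value_length(i32::from_be_bytes(prefix))
}
```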

Comment on lines 190 to 205
pub struct CellValueBuilder<'buf> {
// Buffer that this value should be serialized to.
buf: &'buf mut Vec<u8>,
pub(crate) buf: &'buf mut Vec<u8>,

// Starting position of the value in the buffer.
starting_pos: usize,

// If writing to a fixed length type vector, the type length.
cell_len: Option<usize>,

//If serializing a variable length vector cell, the size is encoded as a varint.
is_variable_length: bool,

// Buffer for variable length vector cell.
variable_length_buffer: Option<Vec<u8>>,
}
Collaborator

💭 🔧 I believe that with the new possible fields, CellValueBuilder should be made an enum, with distinct variants for non-Vector, const Vector and variable Vector cases (though I'm not sure about separating the latter two cases).

Collaborator

The same goes for CellWriter - I think it would benefit from being made an enum.

Author

I think it would be hard to split it into enums properly, as the behavior here depends not only on whether we are serializing a vector, but also on whether we are serializing an element of a vector.

Comment on lines +305 to +317
if let Some(buffer) = self.variable_length_buffer {
let value_len = buffer.len();
let mut len = Vec::new();
types::unsigned_vint_encode(value_len as u64, &mut len);
self.buf.extend_from_slice(&len);
self.buf.extend_from_slice(&buffer);
} else {
let value_len: i32 = (self.buf.len() - self.starting_pos - 4)
.try_into()
.map_err(|_| CellOverflowError)?;
self.buf[self.starting_pos..self.starting_pos + 4]
.copy_from_slice(&value_len.to_be_bytes());
}
Collaborator

💭 The logic around Cell* became much more convoluted with these alterations. Let's think about how we can make it more digestible. @Lorak-mmk @muzarski

Comment on lines 3172 to 3173
use std::vec;

Collaborator

❓ ♻️ Is this used anywhere?

@@ -454,6 +454,8 @@ pub enum CqlTypeParseError {
TupleLengthParseError(LowLevelDeserializationError),
#[error("CQL Type not yet implemented, id: {0}")]
TypeNotImplemented(u16),
#[error("Failed to parse abstract type")]
AbstractTypeParseError(),
Collaborator

Nit: it's idiomatic to avoid the parentheses if the list of arguments for the variant is empty

Vec<(Cow<'result, str>, ColumnType<'result>)>,
);

pub(crate) struct TypeParser<'result> {
Collaborator

It's a bit sad that you had to introduce this type from scratch; we already have very similar parsing utilities in the scylla crate (scylla::utils::parse::ParserState).

I learned from @wprzytula that he suggested moving the scylla::utils::parse module to scylla-cql and then reworking your TypeParser to reuse the existing code. I strongly suggest that you do that; we would rather avoid maintaining two separate parsers.

@@ -864,17 +913,12 @@ fn deser_type_generic<'frame, 'result, StrT: Into<Cow<'result, str>>>(
types::read_short(buf).map_err(|err| CqlTypeParseError::TypeIdParseError(err.into()))?;
Ok(match id {
0x0000 => {
// We use types::read_string instead of read_string argument here on purpose.
Collaborator

It's a bit sad that Cassandra folks didn't bother to add proper support for expressing the vector type in the protocol, instead relying on the custom type...

@@ -0,0 +1,26 @@
## Vector (for Cassandra only!)
Collaborator

I'd rather skip "for Cassandra only" here. This piece of text will eventually become outdated when Scylla starts supporting the type, and nobody will remember to remove it (and you can't really remove it in released versions of the driver, I suppose).

}
}

pub fn type_size(&self) -> Option<usize> {
Collaborator

This is a public method with unclear meaning, please add a docstring.
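
A sketch of what such a docstring might say. The semantics shown here are an assumption (the method is taken to return the fixed serialized length of a type, if it has one); the PR's actual behavior, especially for vectors, may differ. `ColumnType` is reduced to a few variants for illustration.

```rust
/// Reduced stand-in for the driver's `ColumnType` (assumption).
pub enum ColumnType {
    Boolean,
    Int,
    BigInt,
    Float,
    Double,
    Uuid,
    Text,
    Blob,
    /// Element type and dimension; the `u16` dimension is an assumption.
    Vector(Box<ColumnType>, u16),
}

impl ColumnType {
    /// Returns the serialized length in bytes shared by all values of this
    /// type, or `None` if values of this type can differ in length.
    ///
    /// Assumed semantics: a vector of fixed-length elements is itself
    /// fixed-length (element size times dimension); a vector of
    /// variable-length elements is not.
    pub fn type_size(&self) -> Option<usize> {
        match self {
            ColumnType::Boolean => Some(1),
            ColumnType::Int | ColumnType::Float => Some(4),
            ColumnType::BigInt | ColumnType::Double => Some(8),
            ColumnType::Uuid => Some(16),
            ColumnType::Text | ColumnType::Blob => None,
            ColumnType::Vector(elem, dim) => {
                elem.type_size()?.checked_mul(usize::from(*dim))
            }
        }
    }
}
```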

Comment on lines 30 to 95
#[test]
fn test_cassandra_type_parser() {
let type_name =
"org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, 5)";
assert_eq!(
TypeParser::parse(Cow::Borrowed(type_name)).unwrap(),
ColumnType::Vector(Box::new(ColumnType::Int), 5)
)
}
Collaborator

You introduced a beast of a module (type_parser) which is capable of parsing the syntax of any type (be it a primitive type, list, UDT, vector, etc...), but you added only this one, short test. Please add more tests for that module (preferably in the commit which introduced it) in order to increase the coverage.
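
As an illustration of the kind of coverage being asked for, the sketch below defines a deliberately tiny parser that such tests could target. It is not the PR's TypeParser: the `ColumnType` enum, the `u16` dimension and `parse_type` are simplified stand-ins assumed for this example, and only `Int32Type`, `UTF8Type` and `VectorType` are recognized.

```rust
#[derive(Debug, PartialEq)]
enum ColumnType {
    Int,
    Text,
    Vector(Box<ColumnType>, u16),
}

const MARSHAL_PREFIX: &str = "org.apache.cassandra.db.marshal.";

/// Parses a whole type string; trailing input is an error.
fn parse_type(input: &str) -> Result<ColumnType, String> {
    let (typ, rest) = parse_inner(input)?;
    if rest.trim().is_empty() {
        Ok(typ)
    } else {
        Err(format!("unexpected trailing input: {:?}", rest))
    }
}

/// Parses one type and returns it together with the unconsumed remainder.
fn parse_inner(input: &str) -> Result<(ColumnType, &str), String> {
    let input = input.trim_start();
    // The class name runs until the first character outside [A-Za-z0-9._].
    let name_len = input
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '.' || c == '_'))
        .unwrap_or(input.len());
    let (name, rest) = input.split_at(name_len);
    match name.strip_prefix(MARSHAL_PREFIX).unwrap_or(name) {
        "Int32Type" => Ok((ColumnType::Int, rest)),
        "UTF8Type" => Ok((ColumnType::Text, rest)),
        "VectorType" => {
            // Expected shape: VectorType(<element type>, <dimension>)
            let rest = rest.trim_start().strip_prefix('(').ok_or("expected '('")?;
            let (elem, rest) = parse_inner(rest)?;
            let rest = rest.trim_start().strip_prefix(',').ok_or("expected ','")?;
            let rest = rest.trim_start();
            let digits = rest
                .find(|c: char| !c.is_ascii_digit())
                .unwrap_or(rest.len());
            let (num, rest) = rest.split_at(digits);
            let dim: u16 = num
                .parse()
                .map_err(|_| format!("invalid dimension: {:?}", num))?;
            let rest = rest.trim_start().strip_prefix(')').ok_or("expected ')'")?;
            Ok((ColumnType::Vector(Box::new(elem), dim), rest))
        }
        other => Err(format!("unsupported type: {:?}", other)),
    }
}
```

Cases worth covering for a parser like this include nested vectors, short class names without the package prefix, a missing comma or parenthesis, a missing or out-of-range dimension, and trailing garbage after a complete type.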

let string_class_name: String;
let class_name: Cow<'result, str>;
if name.contains("org.apache.cassandra.db.marshal.") {
class_name = name
Collaborator

Nit: semicolon missing at the end of line

@@ -1534,6 +1534,10 @@ mod legacy {
CqlValue::Map(m) => serialize_map(m.iter().map(|p| (&p.0, &p.1)), m.len(), buf),
CqlValue::Tuple(t) => serialize_tuple(t.iter(), buf),

CqlValue::Vector(_) => {
unimplemented!("Vector serialization is not implemented yet");
Collaborator

You should start by introducing lower layers of the code (i.e. serialization / deserialization) and only then move to extend the ColumnType/CqlValue. This way you will avoid awkward unimplemented! invocations which are a bit cumbersome for reviewers to track and make sure they are removed at the end.

_ => Err(mk_typck_err::<Self>(
typ,
BuiltinTypeCheckErrorKind::SetOrListError(
SetOrListTypeCheckErrorKind::NotSetOrList,
Collaborator

NotSetListOrVector?

Besides, now I realize that the name of this error type is bad: we will basically have to change its name every time we add support for a new type which deserializes to Vec. That may or may not happen again in the future, but it did happen for the vector data type.

Comment on lines 1079 to 1121
impl<'frame, 'metadata, T> Iterator for VariableLengthVectorIterator<'frame, 'metadata, T>
where
T: DeserializeValue<'frame, 'metadata>,
{
type Item = Result<T, DeserializationError>;

fn next(&mut self) -> Option<Self::Item> {
self.remaining = self.remaining.checked_sub(1)?;
let size = types::unsigned_vint_decode(self.slice.as_slice_mut()).map_err(|err| {
mk_deser_err::<Self>(
self.coll_typ,
BuiltinDeserializationErrorKind::RawCqlBytesReadError(
LowLevelDeserializationError::IoError(Arc::new(err)),
),
)
});
let raw = size.and_then(|size| {
self.slice
.read_subslice(size.try_into().unwrap())
.map_err(|err| {
mk_deser_err::<Self>(
self.coll_typ,
BuiltinDeserializationErrorKind::RawCqlBytesReadError(err),
)
})
});

Some(raw.and_then(|raw| {
T::deserialize(self.elem_typ, raw).map_err(|err| {
mk_deser_err::<Self>(
self.coll_typ,
VectorDeserializationErrorKind::ElementDeserializationFailed(err),
)
})
}))
}
}
Collaborator

Is VectorBytesSequenceIterator used anywhere outside of VariableLengthVectorIterator? Does it make sense to keep it separate? Maybe we can inline it?
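
For context, the core of what the iterator above does (read an [unsigned vint] size, then take that many bytes) can be sketched in a few lines. This is a simplified stand-in and not the PR's code: the function names are hypothetical, errors are collapsed into `Option`, and the vint layout is the one assumed from the native protocol v5 specification. Unlike the `size.try_into().unwrap()` call in the excerpt, an oversized length is reported as a failure instead of panicking.

```rust
use std::convert::TryFrom;

/// Decodes one [unsigned vint] from the front of `buf` and advances `buf`.
/// Returns `None` if the buffer ends before the value is complete.
fn decode_unsigned_vint(buf: &mut &[u8]) -> Option<u64> {
    let data: &[u8] = *buf;
    let (&first, rest) = data.split_first()?;
    // The count of leading 1-bits is the number of extra bytes that follow.
    let extra = first.leading_ones() as usize;
    if rest.len() < extra {
        return None;
    }
    // The remaining low bits of the first byte are the top bits of the value.
    let mut value = if extra >= 8 {
        0
    } else {
        u64::from(first & (0xffu8 >> extra))
    };
    for &b in &rest[..extra] {
        value = (value << 8) | u64::from(b);
    }
    *buf = &rest[extra..];
    Some(value)
}

/// Splits `count` length-prefixed elements out of `buf`.
fn split_vector_elements(mut buf: &[u8], count: usize) -> Option<Vec<&[u8]>> {
    let mut elems = Vec::new();
    for _ in 0..count {
        // A checked conversion replaces `try_into().unwrap()`.
        let len = usize::try_from(decode_unsigned_vint(&mut buf)?).ok()?;
        if buf.len() < len {
            return None;
        }
        let (elem, rest) = buf.split_at(len);
        elems.push(elem);
        buf = rest;
    }
    Some(elems)
}
```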

@smoczy123 smoczy123 force-pushed the vector-type branch 2 times, most recently from db320e6 to 78fb8f2 Compare January 22, 2025 20:51
@smoczy123 smoczy123 force-pushed the vector-type branch 2 times, most recently from 5939507 to 23d6dad Compare January 22, 2025 21:32
This is needed to deserialize vector metadata,
as it is implemented as a Custom type with
VectorType as its class.

Due to the fact that Cassandra implements
variable type length vectors in a way
that contradicts the CQL protocol, special
care must be given when deserializing them,
as sizes of their elements are encoded as
unsigned vint instead of an int.

Similarly to the previous commit, special care
must be given when serializing variable type length
vectors, as sizes of their elements must be written
as an unsigned varint.
Labels
semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CQL Vector type
4 participants