SELECT behavior #97
Comments
According to the documentation, the metadata predicate of SELECT should match only metadata whose name is exactly equal to the one specified, e.g., only metadata named exactly "provider" (without a prefix of any kind). I think the matter of the separator (".") is only relevant for the default/EXACT/FULL options of the semijoin/joinby/groupby predicates. Therefore, @pp86, I would fix the metadata predicate to match only the indicated metadata. |
I confirm; it must be fixed. Thanks @pp86 for spotting it |
Can this be fixed by @OlgaGorlova ? |
In the case we have a sample with: According to @sunbrn it should not, right? @marcomass @OlgaGorlova I am taking care of it. |
The definition in the doc is the following: So, the query original.provider = "Campaner" should match only attributes original.provider, or *.original.provider not just provider |
@marcomass but these three options should only be defined for the semijoin predicate of the SELECT operator. @pp86's example is concerning the metadata predicate of the SELECT operator. |
I do believe that the issue at hand is more simply resolved by changing the way that "by metadata_attribute_name" selection (point 1) works in the above definition by @marcomass. In particular, checking that the metadata value ends in a certain way creates ambiguous cases such as the one found by @pp86 (cf. example below). What was probably intended is that the last token of the metadata name after splitting according to the '.' separator should match the metadata attribute mentioned in the SELECT clause. Some examples to clarify. Let the following SELECT be of interest: Let S1 have the following metadata: If I use the "ending with" definition as above, all samples will be valid (as the criterion is equivalent to looking for strings in the form "*data", where * is a wildcard). Instead, if I tokenize metadata with the '.' separator and match only the last element, only S1 and S3 will be chosen. Note that it is entirely possible that a sample S4 having the
EDIT: some confusion between dataset and sample |
Thanks @sunbrn for the clarification and to @Erlaad for the examples. We introduced the: As @sunbrn clarified, the three options: Please note that in doing so, the issue pointed out by @Erlaad would no longer hold, since in the provided example it would be possible to disambiguate between S1 and S3 using one of the three options (default, EXACT, FULL), according to the desired behavior. In summary, with respect to the issue initially pointed out in this thread by @pp86, SELECT behavior should be fixed as described above. Please comment if you see anything unclear/wrong in the above. |
I can do that, but keep in mind that this is a major change (Compiler, Dag, Executor, API, Python, R have to be touched). Since in select we can use AND, OR, NOT to compose complex conditions, EXACT and FULL may not be strictly necessary (although surely helpful). Just fixing the behaviour to split on "." should be much easier (only the implementation will be touched). |
Yes, the fix of splitting on "." must be done for sure. I understand that the other change (also introducing EXACT and FULL in the main metadata predicate) requires more effort. My opinion is that it would be much better to do it now, while we are closing the API implementation and, at the same time, actively working on closing RGMQL and fixing bugs in pyGMQL. Postponing it would require much more effort, since the other projects (RGMQL, pyGMQL) would have to be reopened once they are closed. |
Refactor the Select Execution in Spark, so as to produce a smaller execution plan
The problem of attribute name splitting has been solved by my last commit. I still have to work on the EXACT/FULL options. |
I started thinking on EXACT/FULL. |
@pp86 |
I am missing the difference between the two. |
I copy here below the current definition in the manual for metajoin conditions (we will extend it also to select, when implemented). ● In a metajoin condition (i.e., for all operators that include such condition: SELECT has semijoin, DIFFERENCE, MAP, and JOIN have joinby, GROUP, MERGE, and COVER have groupby), different matching options can be used:
|
Just to understand: let
I’d like to understand a bit more of the example. Let D be the following dataset:
Sample_ID attribute value
S1 att 1
S2 other_ds.att 1
S3 other_ds.yet_another_ds.att 1
S4 other_ds.att 2
S5 att 1
S6 other_ds.att 1
S7 other_ds.yet_another_ds.att 1
S8 other_ds.yet_another_ds.att 2
then, using GROUPBY as an example, the groups formed are as follows:
* GROUPBY(att): S1+S2+S3+S5+S6+S7, S4+S8 (2 groups)
* GROUPBY(EXACT(att)): S1+S5 (1 group*?)
* GROUPBY(FULL(att)): S1+S5, S2+S6, S3+S7, S4, S3+S7, S8 (6 groups)
Is this correct? What happens to leftover samples that are not grouped in EXACT(att)? Are samples in subgroups of FULL(att) then also split in their own groups by attribute, like in my example?
Regards,
Stefano Perna, MSc
PhD Student
Dipartimento di Elettronica, Informazione e Bioingegneria - Politecnico di Milano
Skype Contact: stefano.e.perna
On 10 Jan 2018, at 15:22, marcomass wrote:
I copy here below the current definition in the manual for metajoin conditions (we will extend it also to select, when implemented).
● In a metajoin condition (i.e., for all operators that include such condition: SELECT has semijoin, DIFFERENCE, MAP, and JOIN have joinby, GROUP, MERGE, and COVER have groupby), different matching options can be used:
○ metadata_attribute_name: it matches all attributes that are equal to the specified name OR end with it as a dot-separated suffix (regardless of additional dot-separated prefixes not explicitly specified);
○ EXACT(metadata_attribute_name): it matches all attributes that are equal to the specified name (without any prefixes);
○ FULL(metadata_attribute_name): it matches two attributes if they end with the specified name AND their full names are equal;
For instance, if we consider the following attributes:
1. pref1.pref2.att
2. pref1.att
3. att
4. pref1.att
Then:
• att matches all of the above attributes;
• EXACT(att) matches only attribute 3. (i.e., att);
• FULL(att) matches attributes 2. and 4. (i.e., pref1.att).
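The three options quoted above can be sketched in a few lines of Python. This is only an illustration of the manual's wording, not the actual GMQL implementation; the function names `matches_default`, `matches_exact` and `matches_full` are hypothetical and introduced here for the example.

```python
def matches_default(attribute, name):
    """Default option: the attribute equals `name` or ends with it
    as a dot-separated suffix (extra prefixes are ignored)."""
    attr_tokens = attribute.split(".")
    name_tokens = name.split(".")
    # Compare the trailing tokens of the attribute with the tokens of the name.
    return attr_tokens[-len(name_tokens):] == name_tokens


def matches_exact(attribute, name):
    """EXACT option: the attribute equals `name`, with no prefix at all."""
    return attribute == name


def matches_full(left, right, name):
    """FULL option: two attributes match if both end with `name`
    AND their full names are equal."""
    return (
        matches_default(left, name)
        and matches_default(right, name)
        and left == right
    )


# The four attributes from the example in the manual (numbered from 1).
attributes = ["pref1.pref2.att", "pref1.att", "att", "pref1.att"]

default_hits = [i for i, a in enumerate(attributes, 1) if matches_default(a, "att")]
exact_hits = [i for i, a in enumerate(attributes, 1) if matches_exact(a, "att")]
full_pairs = [
    (i, j)
    for i, a in enumerate(attributes, 1)
    for j, b in enumerate(attributes, 1)
    if i < j and matches_full(a, b, "att")
]
```

Under these assumptions, `default_hits` covers all four attributes, `exact_hits` contains only attribute 3, and `full_pairs` contains only the pair (2, 4), which agrees with the example in the manual. Note that FULL is defined on a pair of attribute names, which is why it takes two attributes rather than one.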
|
You did not understand my point. Definition and example describe metajoin or groupby, where we compare two attribute names in order to tell if they are in the same group. If not please provide a counter example |
I think @pp86 is correct.
Therefore default and EXACT should be enough to handle the requested cases. |
@sunbrn |
Please consider this query:
It returns two samples.
Btw, no sample with provider = "Campaner" exists. The two output samples have the metadata original_provider = "Campaner". What happens is that the SELECT checks whether the attribute name ends with "provider".
I remember we decided to have a SQL-like approach, i.e. the names are split on "." (similar to the "marriage"/"age" example we discussed). So the results would be correct if the metadata were "original.provider", but not "original_provider".
Is anything changed in the specification, or should this be fixed?
@marcomass @sunbrn
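To make the difference concrete, here is a minimal Python sketch contrasting a plain "ends with" check on the attribute name with a check that first splits the name on ".". Both function names are hypothetical and only illustrate the two behaviours discussed in this issue; they are not taken from the GMQL code base.

```python
def ends_with_check(attribute, name):
    """Plain string suffix check: the behaviour reported in this issue."""
    return attribute.endswith(name)


def dot_split_check(attribute, name):
    """SQL-like check: split both names on '.' and compare trailing tokens."""
    attr_tokens = attribute.split(".")
    name_tokens = name.split(".")
    return attr_tokens[-len(name_tokens):] == name_tokens


# Attribute names discussed in the thread, tested against "provider".
for attribute in ["provider", "original.provider", "original_provider"]:
    print(
        attribute,
        ends_with_check(attribute, "provider"),
        dot_split_check(attribute, "provider"),
    )
```

With the plain suffix check, "original_provider" is selected by a predicate on "provider"; with the dot-split check it is not, while "original.provider" is still selected. The same reasoning applies to "marriage" and "age".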