Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tv_grab_zz_sdjson: improve episode/season metadata handling #222

Conversation

kgroeneveld
Copy link
Contributor

What type of Pull Request is this?

  • adds new functionality
  • fixes/improves existing functionality

Does this PR close any currently open issues?

#221

Please explain what this PR does

The program metadata from SD is an array of different providers. The previous code was specifically looking for 'Gracenote' provider in the first array element and ignoring everything else. For example, it seems the first element is now often 'TVmaze' (at least for the lineups I use) and the episode/season data was all being ignored.

The code now iterates over all the metadata providers generally preferring the earlier entries unless a later one looks more complete.

Where have you tested these changes?

Operating System:
ubuntu 22.04

Perl Version:
v5.34.0

I have only tested this very quickly so far. I would like to do more testing / review before merging. But I thought I would create the PR now in case @rob-vh who reported the issue wants to test it or is very eager for a fix.

@kgroeneveld kgroeneveld self-assigned this Jan 20, 2024
@kgroeneveld
Copy link
Contributor Author

I just noticed my commit changes the file permissions which I should probably fix before merging.

Copy link

@rob-vh rob-vh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure what the "|| !length($season)" in line 1277 purports to do. It looks like

if we have not seen the season number yet, but we did see the episode number before, we will discard the episode number previously found in favor of whatever we found this time (even if $_season is empty!).

The code seems to work as intended, but I just cannot get my head around the if stmt.

@honir
Copy link
Contributor

honir commented Jan 21, 2024

Does the fix work when the metadata contains just an episode number but no series?

@rob-vh
Copy link

rob-vh commented Jan 21, 2024

I've seen programs (using EPG data from Channel 5+1) with only the episode number getting extracted correctly. In that case $season = $_season is executed only once and passed down for storing (unless there is also a provider that only contains the Season later on in the array of providers).

@kgroeneveld
Copy link
Contributor Author

It is hard to decide if the code is correct without agreeing on the requirements. I didn’t even fully formulate a full set of requirements in my mind when I wrote and pushed my first attempt at fix.

Some thoughts I had while working it:

When there is more than one source of metadata, which do you choose to include in the xml output?

Do you ever mix and match info from different providers?

If one provider only has the season and one only has the episode do you combine them?

What if the providers have conflicting information, which do you use?

My initial thinking was to never mix and match info from different providers but instead try to find the most detailed entry, preferring earlier entries with the same level of detail but possibly conflicting info.

But if one source only has season and one only has episode which one is “more detailed”?

The API for the source data from Schedules Direct states the only mandatory field is season. So I was giving preference to season over episode and that is where the !length($season) part of one of the statements comes from.

Does the fix work when the metadata contains just an episode number but no series?

If the source data follows the spec this should never happen. But maybe it should be considered anyway, especially since it seems @rob-vh has found a case where it happens . Right now I think it will only use an isolated episode number if it was the first metadata entry in the array. So there is probably room for improvement there.

But before tweaking the code further I would be interested to hear others thoughts on possibly combining info from multiple providers or other requirements in general.

@rob-vh
Copy link

rob-vh commented Jan 21, 2024

Scanning through Channel5 +1 EPG I found an occurrence where Gracenote has only the season but TVMaze has both fields:

"metadata":[{"TVmaze":{"season":2,"episode":34,"url":"https://www.tvmaze.com/episodes/2215086/the-adventures-of-paddington-2x34-paddington-feels-the-music"}},{"Gracenote":{"season":2}}]

So I would not like to wipe the info collected from TVmaze when Gracenote only contains partial information.
If there is no provider with both fields, I would not mind taking season from one provider and episode from another, but don't overwrite info from a previous provider with null.

Earlier I saw isolated episode numbers with no season, but not in the data I have checked today.

@rob-vh
Copy link

rob-vh commented Jan 21, 2024

I realize why I have isolated episode numbers in my data. I also parse the Smm/Enn info from over the air EPG, esp. for sat channels that SD does not cover. Just now I saw Doctor Who mentioned on ONS with E01.

@honir
Copy link
Contributor

honir commented Jan 22, 2024

I probably wouldn't recommend merging different sources. I have seen many instances where Gracenote, Epguides website, IMDb website have different opinions. Althoough, to be fair, it mostly seems to be IMDb which is wrong (Gracenote and Epguides usually agree).

@honir
Copy link
Contributor

honir commented Jan 22, 2024

(In the UK at least) we used to get isolated episode numbers, particularly for things like soaps, or multi-part documentaries. Recently though it seems the soaps are coming through with the year as the season (S/E = year/num-in-year). I assume that is in the Gracenote data, and seems to be something invented by Gracenote.

For example, EastEnders tonight is listed on TVGuide as Episode 6843, but I think Gracenote has it as Season 2024 Episode 12 !

TVGuide as Ep. 6843
BBC use the broadcast date as the Ep. 22/01/2024
Sky hedge their bets and call it S/E 2024/6843 !

Many guides use the isolated episode number ("6843")

It seems the json feed S/E numbers are invented by Gracenote (or SD?).

Especially since they are actually out of sequence:

S/E 2024/11 Friday 19 January, 2024 (EP100330186088)
S/E 2024/12 Monday 22 January, 2024 (EP100330186900)
S/E 2024/13 Wednesday 17 January, 2024 (EP100330187077) **** OUT OF SEQUENCE
S/E 2024/14 Tuesday 23 January, 2024 (EP100330187121)
S/E 2024/15 Wednesday 24 January, 2024 (EP100330187338)
.

Whether Gracenote / SD should be inventing S/E numbers is maybe a discussion for another thread.

@garybuhrmaster
Copy link
Contributor

(In the UK at least) we used to get isolated episode numbers, particularly for things like soaps, or multi-part documentaries. Recently though it seems the soaps are coming through with the year as the season (S/E = year/num-in-year). I assume that is in the Gracenote data, and seems to be something invented by Gracenote.

I don't know who invented it, but I have seen the series being the year in more than one guide source, so it seems to have become some new form of normal.

Especially since they are actually out of sequence:

Are they really out of some, arguably valid, sequence? As discussed before, there are numerous orderings to chose from. Some may be original production numbers/order (as assigned during the planning process at the beginning of the series just to have unique numbers to fill in the blanks later), some may be order of start, or completion, of production (some shows may start production earlier than others but take longer to complete than others), some may be scheduled broadcast order, and some may be actual broadcast order (which due to preemption may not be the same as scheduled broadcast order). All of those orders may correct to some of the people some of the time for some of the shows (and I have seen some of all of those values from different sources).

@honir
Copy link
Contributor

honir commented Jan 22, 2024

All of those orders may correct to some of the people some of the time for some of the shows

Spoken like a true academic :-) Anyone watching this serial in that s/e order is going to get very confused though (with dramatic events happening out of sequence)

@rob-vh
Copy link

rob-vh commented Jan 22, 2024

I would, however, be great to have consistency in the choice of season number. I use PVR automation logic to record episodes of favorite shows for season > season last completed. So flipping between "sequence season" and "annual season" numbers is going to hurt.

@rmeden
Copy link
Contributor

rmeden commented Jan 22, 2024 via email

@garybuhrmaster
Copy link
Contributor

All of those orders may correct to some of the people some of the time for some of the shows

Spoken like a true academic :-) Anyone watching this serial in that s/e order is going to get very confused though (with dramatic events happening out of sequence)

And then there is the order that the episodes SHOULD have been broadcast in (and should be viewed in) except for the stupidity of certain network executives (cough Firefly cough).

And when a series gets rebooted, some (but not all) guide sources start with a new series (season) number of 1.

Standards are wonderful things, and there are many of them: https://xkcd.com/927/

@honir
Copy link
Contributor

honir commented Jan 25, 2024

I think a grabber should report all available episode-num options using different "system" attributes, and report its best guess (based on knowledge of the data source)  in the "xmltv_ns" system.

That sounds very sensible. Then it is up to the consuming program to pick what it thinks best.

( I can easily imagine a situation where a source has good data for US programmes but poor for UK ones (in which they are not really interested!). And vice-versa. )

BST (blue-sky thinking): Perhaps the grabber could offer a choice of which source to use for the xmltv_ns system?

@rmeden
Copy link
Contributor

rmeden commented Jan 25, 2024 via email

@kgroeneveld
Copy link
Contributor Author

I am not sure if all of the above helps or just leaves me even more confused...

I think everyone pretty much agrees that mixing and matching the data from different providers for the xmltv_ns system entry should not be done, which was my original thought as well.

I think a grabber should report all available episode-num options using different "system" attributes, and report its best guess (based on knowledge of the data source) in the "xmltv_ns" system.

This sounds reasonable. And it looks like the tv_grab_zz_sdjson_sqlite grabber is doing something along these lines.

But it is unclear to me how the SD JSON data should be formatted in the XML. Sample output from tv_grab_zz_sdjson_sqlite:

<episode-num system="dd_progid">EP02696547.0168</episode-num>
<episode-num system="xmltv_ns"> 5 . 21 .  </episode-num>
<episode-num system="tvmaze.com">episode/22</episode-num>
<episode-num system="tvmaze.com">series/6</episode-num>
<episode-num system="tvmaze.com">url/https://www.tvmaze.com/episodes/2505976/swat-6x22-legacy</episode-num>
<episode-num system="schedulesdirect.org">programID/EP026965470168</episode-num>
<episode-num system="schedulesdirect.org">series/6</episode-num>
<episode-num system="schedulesdirect.org">episode/22</episode-num>
<episode-num system="schedulesdirect.org">resourceID/14158903</episode-num>
<episode-num system="schedulesdirect.org">originalAirDate/20230519 +0000</episode-num>

Is this schema defined anywhere? Or does each grabber author just make up whatever they want?

It seems strange to have multiple entries with the same system. It seems instead it should be something more like:

<episode-num system="tvmaze.com">
  <episode>22</episode>
  <series>6</series>
  <url>https://www.tvmaze.com/episodes/2505976/swat-6x22-legacy</url>
</episode-num>

In some ways it seems system is being used as the "provider" and not a "system" in the above.

And if we are just trying to pass on all the keys in the SD JSON data in a uniform way maybe it should be something more like:

<episode-num system="sdjson">
  <provider name="Gracenote">
    <season>6</season>
    <episode>22</episode>
  </provider>
  <provider name="TVmaze">
    <season>6</season>
    <episode>22</episode>
    <url>https://www.tvmaze.com/episodes/2505976/swat-6x22-legacy</url>
  </provider>
<episode-num>

(plus the standard dd_progid and xmltv_ns entries)

I don't really have a strong preference myself on any of this. Maybe this is already defined a bit better somewhere. Ideally the different xmltv grabbers would be somewhat consistent.

@garybuhrmaster
Copy link
Contributor

Some random thoughts (which should likely be ignored):

  • One should never try to merge data from multiple sources.

  • One thing to be aware of is that while Gracenote (the upstream source for Schedules Direct) has professional paid curation of their data(*), and sources such as tvmaze, or thetvdb (or imdb) depend on user contributions, which may be of different correctness (sometimes the "crowd" gets it more right, sometimes less so, but most of the time the episode/series values end up the same, but one should not depend on being consistent).

  • As this is a grabber for Schedules Direct data and that means Gracenote data (and not 3rdparty data) I would claim the xmltv_ns values SHOULD (IETF definition) use the gracenote upstream's data (as all the rest of the data is from schedulesdirect (and gracenote), and is consistent internally.

  • Customers of Schedules Direct can open cases with them to have the data corrected.

  • It should also be noted that in this case the root cause of the error (no episode/series) appears to actually be due to coding error presuming that the first element of an unordered array (with potentially multiple sources) would be from a particular source 'gracenote' (the upstream definition clearly stated that this was an unordered array).  When the other sources did not exist (almost certainly true when the code was first written) that worked, since the first element was also the only element, and it was almost always 'gracenote'.  Once some other sources appeared, it would randomly fail, as the first element was not always the 'gracenote' case (sometimes it was 'gracenote', sometimes 'tvmaze', and, at least for a while, 'thetvdb'). It would be my opinion that the grabber should have always iterated through the unordered array looking for the correct ('gracenote') element (as the "other" SD grabber did, as was defined in the Schedules Direct API)

  • While expanding the DTD to allow other elements might a nice idea, in practice it is a non-starter most of the time (as the consumers of the data sometimes depend on the existing DTD).

  • For better or worse (mostly worse), the episode_num element is the only mechanism to (under the DTD) expand the returned data, as it is the only (semi) unstructured return value for systems outside of xmltv_ns and onscreen.

(*) At least for content originated in the US. Outside of the US Gracenote has to depend (to various extents) on the local guide source providers, but, importantly, Gracenote is still responsible for curating the data, and they pay people to do it.

@honir
Copy link
Contributor

honir commented Jan 28, 2024

I agree. There has not been a general call for numbering systems outside of xmltv_ns, and this is the one which is pretty-much universally output and used by downstream programs.

Any other numbering systems (i.e. other 'system' elements/attributes) is simply froth. It MAY be used by a downstream program but I think that would be a niche use case.

The focus here should be on populating xmltv_ns and, as pointed out, I would expect Gracenote data in preference to any other. (Gracenote data is the reason I signed up to Schedules Direct.)

Only if Gracenote was fully empty should another provider be considered for populating xmltv_ns. (On balance I'd rather have 'some' data than 'none', even if it's potentially unreliable.)

@kgroeneveld kgroeneveld force-pushed the sdjson-improve-episode-metadata-handling branch from a0c8c88 to a3bd1b4 Compare January 29, 2024 00:03
@kgroeneveld
Copy link
Contributor Author

Thanks everyone for the feedback. I have been getting more helpful feedback than expected.

It should also be noted that in this case the root cause of the error (no episode/series) appears to actually be due to coding error presuming that the first element of an unordered array

That is about the only thing I have been sure of in this discussion. It was so long ago when I wrote that code I don't remember what I was thinking at the time but I would like to think I knew it was dumb / wrong at the time and that it was just a quick temporary way to get the the other parts working and then I forget to fix it.

Based on the most recent comments I am also starting to think it is best to keep things simple for now (and probably for a long time) and only populate xmltv_ns (and dd_progid) preferring the Gracenote data if available. Anything beyond that seems unlikely to even be used with standard way to represent it anyway.

I just pushed an updated version of my branch which implements the above.

The program metadata from SD is an array of different providers. The
previous code was specifically looking for 'Gracenote' provider in the
first array element and ignoring everything else. For example, it seems the
first element is now often 'TVmaze' (at least for the lineups I use) and
the episode/season data was all being ignored.

The code now iterates over all the metadata providers preferring Gracenote
but using the first available as a fallback.

Fixes: XMLTV#221
@kgroeneveld kgroeneveld force-pushed the sdjson-improve-episode-metadata-handling branch from a3bd1b4 to ad15def Compare January 29, 2024 00:06
@kgroeneveld
Copy link
Contributor Author

And of course immediately after pushing the update I see a mistake... so I just pushed again. But with my luck there is still something stupid in there.

@knowledgejunkie
Copy link
Contributor

@honir @rmeden @garybuhrmaster @kgroeneveld With a view to getting a new release out very soon, are we happy that the latest changes provide the expected data (if available)?

@kgroeneveld
Copy link
Contributor Author

With a view to getting a new release out very soon, are we happy that the latest changes provide the expected data (if available)?

No news is good news?

I have been using the latest version of this PR without any issues on my MythTV system from around the time I pushed it. Although, I must admit I don't pay much attention to the episode / season metadata. I did also do some more specific testing with various hand created test data and everything I tried worked as per my expectations.

Unless someone spots a major bug I say we should merge it before the next release. It is definitely wrong in the current release.

Maybe I should just merge it myself in a couple days if I don't hear anything. I think I can merge it... I have not made an actual commit / merge since XMLTV was still in CVS!

@knowledgejunkie knowledgejunkie merged commit 86ac9c4 into XMLTV:master Feb 12, 2024
25 checks passed
@knowledgejunkie
Copy link
Contributor

Thank you @kgroeneveld !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants