
DataCite timeout: Datasets repeatedly lock on publication for >30min and process fails, second try works #10992

Open
kbrueckmann opened this issue Oct 31, 2024 · 15 comments
Labels: Type: Bug (a defect)

@kbrueckmann

What happens?

When trying to publish datasets, we repeatedly see them locked for an extraordinarily long time (up to 40 minutes) before the publication process fails. On the second try, however, the dataset is published within a few seconds and without any issues. The error message after the first try says that the dataset "could not be published due to a failure to register, or update the Global Identifier for the dataset or one of the files in it."

Here is a server log entry from an attempt earlier today:

[2024-10-31T13:41:47.539+0100] [Payara 6.2024.6] [WARNING] [] [edu.harvard.iq.dataverse.pidproviders.doi.datacite.DataCiteDOIProvider] [tid: _ThreadID=259 _ThreadName=__ejb-thread-pool15] [timeMillis: 1730378507539] [levelValue: 900] [[
modifyMetadata failed: Connection timed out
java.net.SocketException: Connection timed out
at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:330)
at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:484)
at java.base/sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:478)
at java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70)
at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1465)
at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1069)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at edu.harvard.iq.dataverse.pidproviders.doi.datacite.DataCiteRESTfullClient.postMetadata(DataCiteRESTfullClient.java:163)
at edu.harvard.iq.dataverse.pidproviders.doi.datacite.DOIDataCiteRegisterService.registerIdentifier(DOIDataCiteRegisterService.java:73)
at edu.harvard.iq.dataverse.pidproviders.doi.datacite.DataCiteDOIProvider.publicizeIdentifier(DataCiteDOIProvider.java:206)
at edu.harvard.iq.dataverse.engine.command.impl.FinalizeDatasetPublicationCommand.publicizeExternalIdentifier(FinalizeDatasetPublicationCommand.java:418)
at edu.harvard.iq.dataverse.engine.command.impl.FinalizeDatasetPublicationCommand.execute(FinalizeDatasetPublicationCommand.java:216)
...

[2024-10-31T13:41:47.541+0100] [Payara 6.2024.6] [WARNING] [] [edu.harvard.iq.dataverse.engine.command.impl.FinalizeDatasetPublicationCommand] [tid: _ThreadID=259 _ThreadName=__ejb-thread-pool15] [timeMillis: 1730378507541] [levelValue: 900] [[
Failed to register the identifier doi:10.11588/data/IUTCDN, or to register a file in the dataset; notifying the user(s), unlocking the dataset]]

It seems like it might be connected to a timeout of the DataCite API. Outages reported at https://status.datacite.org/ don't seem to be the cause, as they occur far less often than our problems with dataset publication.
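
For reference, a quick way to time the individual stages of a request to DataCite from the affected server (a sketch, assuming curl is available and the default MDS URL; mds.datacite.org is the host the client in the stack trace above talks to, and an unauthenticated request just returns 401, but the timings are still informative):

# Time DNS lookup, TCP connect, TLS handshake, and total response time.
# Expect http=401 without credentials; what matters here is the timing.
curl -s -o /dev/null \
  -w 'dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s http=%{http_code}\n' \
  https://mds.datacite.org/doi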

What steps does it take to reproduce the issue?

We are unsure whether others have experienced the same problems and whether the described behavior is reproducible. For us, we simply try to publish a dataset via the publish button. The second try can take place right after the first one fails, or later on.

When does this issue occur?

On hitting the publish button on any dataset for the first time. It happens for new datasets as well as for new versions of existing datasets. The datasets involved can have very few small files (e.g. doi:10.11588/data/NCQJJR -> 5 files of max. 90 KB each, without PIDs registered for the individual files) or be larger; we see similar behaviour either way.

To whom does it occur (all users, curators, superusers)?
Superusers, curators.

What did you expect to happen?
Ideally, a publication on the first try without delays above a few minutes.

Which version of Dataverse are you using?
v. 6.3 build v6.3+10797_dls

Any related open or closed issues to this bug report?

Possibly #7393

@kbrueckmann added the Type: Bug (a defect) label Oct 31, 2024
@qqmyers
Member

qqmyers commented Oct 31, 2024

FWIW: DataCite is reporting Degraded Service at the moment, and there have also been times when I've seen timeouts while their status was all green. So I suspect the issue is real in your case and is a problem at DataCite.

We have discussed at times whether we should change how Dataverse responds to that. It seems unwise to publish without being sure that the PIDs are actually searchable, with failures only reported in the log, so we've so far stayed with having the publication fail so you can retry later. Work could probably be done to track the failures and retry them later, warn in the UI that not all PIDs are findable yet, etc. I think the problem has so far been rare enough that no one has prioritized anything like this, though.

@lmaylein
Contributor

lmaylein commented Nov 4, 2024

and there have also been times when I've seen timeouts when their status is all green

I'm glad that we're not the only ones who notice this (even with services other than Dataverse that use Datacite).

Unfortunately, we often run into this timeout (after checking the checksums of the individual files for a very long time).

We have now tried to publish the record in question again. Usually that succeeds very quickly; this time, however, we ran into the timeout again.

[2024-11-04T14:12:41.058+0100] [Payara 6.2024.6] [WARNING] [] [edu.harvard.iq.dataverse.pidproviders.doi.datacite.DataCiteDOIProvider] [tid: _ThreadID=261 _ThreadName=__ejb-thread-pool10] [timeMillis: 1730725961058] [levelValue: 900] [[
  modifyMetadata failed: Connection timed out
java.net.SocketException: Connection timed out
        at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:330)
        at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)

...

[2024-11-04T14:12:41.059+0100] [Payara 6.2024.6] [WARNING] [] [edu.harvard.iq.dataverse.engine.command.impl.FinalizeDatasetPublicationCommand] [tid: _ThreadID=261 _ThreadName=__ejb-thread-pool10] [timeMillis: 1730725961059] [levelValue: 900] [[
  Failed to register the identifier doi:10.11588/data/IUTCDN, or to register a file in the dataset; notifying the user(s), unlocking the dataset]]

status.datacite.org says:

"All Systems Operational "

@lmaylein
Contributor

lmaylein commented Nov 4, 2024

Just to be on the safe side: the time it takes to validate the file checksums has no influence on the timeout during the registration of the DOI (which presumably only starts afterwards), right? In this case, we have a dataset with just under 100 GB of files. That takes some time.

@pdurbin
Member

pdurbin commented Nov 4, 2024

@lmaylein yes, I would think these are unrelated.
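
Unrelated to the DataCite timeout, but if validating ~100 GB of files on every publish attempt is itself painful, there's a database setting to skip the physical file validation step on publish. A sketch from memory (please double-check the setting name in the installation guide for your version before relying on it):

# Turn off recalculation of physical file checksums at publish time.
# Setting name recalled from memory; verify against your version's docs.
curl -X PUT -d false http://localhost:8080/api/admin/settings/:FileValidationOnPublishEnabled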

@lmaylein
Contributor

lmaylein commented Nov 5, 2024

Same problem:

[2024-11-04T22:56:37.541+0100] [Payara 6.2024.6] [INFO] [] [edu.harvard.iq.dataverse.engine.command.impl.FinalizeDatasetPublicationCommand] [tid: _ThreadID=255 _ThreadName=__ejb-thread-pool4] [timeMillis: 1730757397541] [levelValue: 800] [[
  Finalizing publication of the dataset doi:10.11588/data/VUWDMY]]

[2024-11-04T22:56:37.768+0100] [Payara 6.2024.6] [INFO] [] [edu.harvard.iq.dataverse.util.FileUtil] [tid: _ThreadID=255 _ThreadName=__ejb-thread-pool4] [timeMillis: 1730757397768] [levelValue: 800] [[
  successfully validated DataFile 22533; checksum daf443bfacf5a0664ac0a4bd0bda83f2]]

[2024-11-04T22:56:37.841+0100] [Payara 6.2024.6] [INFO] [] [edu.harvard.iq.dataverse.util.FileUtil] [tid: _ThreadID=255 _ThreadName=__ejb-thread-pool4] [timeMillis: 1730757397841] [levelValue: 800] [[
  successfully validated DataFile 22534; checksum 7c93de2395f7ede8812cf484a536528e]]

[2024-11-04T22:56:37.845+0100] [Payara 6.2024.6] [INFO] [] [edu.harvard.iq.dataverse.util.FileUtil] [tid: _ThreadID=255 _ThreadName=__ejb-thread-pool4] [timeMillis: 1730757397845] [levelValue: 800] [[
  successfully validated DataFile 22535; checksum 96daf9d8e6ca25152fcc06cd37db9bc6]]

...

[2024-11-04T23:12:36.321+0100] [Payara 6.2024.6] [WARNING] [] [edu.harvard.iq.dataverse.pidproviders.doi.datacite.DataCiteDOIProvider] [tid: _ThreadID=255 _ThreadName=__ejb-thread-pool4] [timeMillis: 1730758356321] [levelValue: 900] [[
  modifyMetadata failed: Connection timed out
java.net.SocketException: Connection timed out
        at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:330)
        at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
        at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
        at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
...

[2024-11-04T23:12:36.322+0100] [Payara 6.2024.6] [WARNING] [] [edu.harvard.iq.dataverse.engine.command.impl.FinalizeDatasetPublicationCommand] [tid: _ThreadID=255 _ThreadName=__ejb-thread-pool4] [timeMillis: 1730758356322] [levelValue: 900] [[
  Failed to register the identifier doi:10.11588/data/VUWDMY, or to register a file in the dataset; notifying the user(s), unlocking the dataset]]

And then a few seconds later:


[2024-11-04T23:13:13.028+0100] [Payara 6.2024.6] [INFO] [] [edu.harvard.iq.dataverse.engine.command.impl.FinalizeDatasetPublicationCommand] [tid: _ThreadID=252 _ThreadName=__ejb-thread-pool1] [timeMillis: 1730758393028] [levelValue: 800] [[
  Finalizing publication of the dataset doi:10.11588/data/VUWDMY]]

[2024-11-04T23:13:13.038+0100] [Payara 6.2024.6] [INFO] [] [edu.harvard.iq.dataverse.util.FileUtil] [tid: _ThreadID=252 _ThreadName=__ejb-thread-pool1] [timeMillis: 1730758393038] [levelValue: 800] [[
  successfully validated DataFile 22533; checksum daf443bfacf5a0664ac0a4bd0bda83f2]]

[2024-11-04T23:13:13.039+0100] [Payara 6.2024.6] [INFO] [] [edu.harvard.iq.dataverse.util.FileUtil] [tid: _ThreadID=252 _ThreadName=__ejb-thread-pool1] [timeMillis: 1730758393039] [levelValue: 800] [[
  successfully validated DataFile 22534; checksum 7c93de2395f7ede8812cf484a536528e]]

[2024-11-04T23:13:13.040+0100] [Payara 6.2024.6] [INFO] [] [edu.harvard.iq.dataverse.util.FileUtil] [tid: _ThreadID=252 _ThreadName=__ejb-thread-pool1] [timeMillis: 1730758393040] [levelValue: 800] [[
  successfully validated DataFile 22535; checksum 96daf9d8e6ca25152fcc06cd37db9bc6]]

[2024-11-04T23:13:14.092+0100] [Payara 6.2024.6] [INFO] [] [edu.harvard.iq.dataverse.engine.command.impl.FinalizeDatasetPublicationCommand] [tid: _ThreadID=252 _ThreadName=__ejb-thread-pool1] [timeMillis: 1730758394092] [levelValue: 800] [[
  Successfully published the dataset doi:10.11588/data/VUWDMY]]

In significantly more than 50 percent of our publications, it only works on the second attempt.

@lmaylein
Contributor

Unfortunately, this problem also seems to affect the creation of new datasets. I suspect that in this case DOIs are being reserved at DataCite. The timeouts here very frequently result in datasets being created twice, because the first attempt apparently doesn't work and the frontend no longer responds :-(

@pdurbin
Member

pdurbin commented Nov 22, 2024

Hmm, when creating datasets too, huh? It sounds like we need more robust error handling of DataCite outages.

@lmaylein
Contributor

I'm just wondering whether this isn't a local problem here. Given how often it occurs, I would have assumed that more users would be complaining, and quite a few Dataverse installations presumably use DataCite. The strange thing is that at other times publishing or dataset creation is very fast again.

@qqmyers
Member

qqmyers commented Nov 22, 2024

I haven't heard of problems like this elsewhere (except in rare cases). It's definitely worth asking DataCite whether they can see anything. You might also be able to make the same DataCite API calls Dataverse makes using curl (maybe against their test server, or just registering/deleting rather than trying to make the DOI findable/permanent) to see if those also time out and/or to get more information about how long the calls are taking.

(I looked in the code: if you run bin/asadmin set-log-levels edu.harvard.iq.dataverse.pidproviders.doi.datacite=FINE, Dataverse should print a statement when it starts most of the calls to DataCite, so you might be able to see how long the timeout is, but I don't see any logging that would give more info about the error itself.)
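
As a sketch of what that curl test could look like against their test system (placeholders: TEST_USER/TEST_PASS are your DataCite test credentials, and 10.5072/curl-timing-test stands in for a DOI under your test prefix; note this uses DataCite's JSON REST API, whereas the Dataverse client in the stack trace posts XML to the MDS API, so it exercises the same service but not the identical endpoint):

# Create a draft DOI on the test system and report status plus timing.
curl -s -o /dev/null --user "$TEST_USER:$TEST_PASS" \
  -H 'Content-Type: application/vnd.api+json' \
  -d '{"data":{"type":"dois","attributes":{"doi":"10.5072/curl-timing-test"}}}' \
  -w 'create: http=%{http_code} total=%{time_total}s\n' \
  https://api.test.datacite.org/dois

# Delete the draft again (drafts, unlike findable DOIs, can be deleted).
curl -s -o /dev/null --user "$TEST_USER:$TEST_PASS" -X DELETE \
  -w 'delete: http=%{http_code} total=%{time_total}s\n' \
  https://api.test.datacite.org/dois/10.5072/curl-timing-test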

@jggautier
Contributor

Hey all. I just found this GitHub issue and wanted to mention that support staff at Harvard Dataverse and I sometimes have to unlock datasets that have been locked for a long time, although we haven't looked into why these datasets are locked for so long. In https://github.com/IQSS/dataverse-HDV-Curation/issues/402 I've been reporting the number of these unlocks.

@kbrueckmann
Author

@jggautier Somehow, I get "error 404" when I try to open the link. I checked for leading/trailing dots in the URL etc., but that doesn't seem to be the issue.

@jggautier
Contributor

Ah, sorry about that @kbrueckmann. I didn't realize that the repository of that issue is private. In the issue I've mostly just been reporting the number of times repository staff need to remove long publication locks. I mentioned it here just to say that these long locks happen regularly for a relatively small number of Harvard Dataverse users, who used to report their stuck datasets more often before I started checking for locks daily.
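
In case it helps others do the same kind of daily check: the native API can list and clear locks (a sketch assuming a superuser API token; I'd confirm the exact lock-type spelling against the output of a dataset's /locks endpoint first):

# List all publication-finalization locks across the installation (superuser).
curl -H "X-Dataverse-key: $API_TOKEN" \
  "$SERVER_URL/api/datasets/locks?type=finalizePublication"

# Clear the locks on one stuck dataset, by database id (superuser).
curl -X DELETE -H "X-Dataverse-key: $API_TOKEN" \
  "$SERVER_URL/api/datasets/$DATASET_ID/locks"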

@qqmyers
Member

qqmyers commented Jan 22, 2025

FWIW: I was hoping that #10794 in 6.5 would help, closing the one known way of causing a stuck lock. Have you seen a reduction since 6.5? If not, or if it is still occurring, it would be useful to track the circumstances so we can identify and fix the cause.

@jggautier
Contributor

Heya @qqmyers, did you mean a reduction in Harvard Dataverse since 6.5 or in a repository where @kbrueckmann has seen stuck datasets, or both?

In case you meant for Harvard Dataverse, I think v6.5 was applied to that repository in early January, so it's too soon to say if there's been a reduction compared to months when it was using earlier versions of Dataverse. I agree about tracking circumstances, although I think there hasn't been the bandwidth for this.

@qqmyers
Member

qqmyers commented Jan 22, 2025

Harvard, w.r.t. stuck locks. I may be misunderstanding, but it doesn't sound like the problem in this issue actually leaves a stuck lock - the log ends in "notifying the user(s), unlocking the dataset". FWIW: the fix in #10794 avoids a stuck lock when a user has tried to publish more than once (e.g. in two separate browsers/panes), so it's not directly related to DataCite, but it was one case where a lock could be left forever, or until cleared by an admin.
