
Akamai-ULS Stops Sending Messages to Splunk Due to HTTP Aggregation Queue Being Full #77

Open
sethumadhav07 opened this issue Oct 4, 2024 · 11 comments


sethumadhav07 commented Oct 4, 2024

Description:
I am encountering an issue with the Akamai ULS app: in some cases it stops sending EAA access log messages to Splunk and gets stuck for an unknown reason. The issue appears to be related to the HTTP aggregation queue getting full and never clearing. As a result, no messages are sent to Splunk for extended periods.

Observed Behavior:

  • The following messages appear repeatedly in the logs for various EAA access log messages:
    UlsOutput Trying to send data via HTTP
    UlsOutput HTTP Aggregation queue is already full - not adding any more entries. Size: (1/1)
    MSG[29349] Delivery (output) attempt 1 of 10
    ULS was not able to deliver the log message after 10 attempts - (continuing anyway as my config says)
    
  • This pattern repeats for many access log messages over several hours, during which no messages are delivered to Splunk (see the sketch after this list).
  • The issue only appears after the pod has been running for a few days without any problems. After an undefined period, the above messages begin to appear and messages stop being delivered.
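
To illustrate the pattern above: a minimal sketch, assuming the aggregation queue behaves like a bounded Python queue.Queue of size ULS_HTTP_AGGREGATE (an assumption about the implementation, not taken from the ULS source), shows how a stalled consumer would reproduce these log lines:

import queue

# Hypothetical stand-in for the HTTP aggregation queue, sized via ULS_HTTP_AGGREGATE.
agg_queue = queue.Queue(maxsize=1)

def deliver(msg, attempts=10):
    """Try to enqueue a message for HTTP output, mimicking the observed retry loop."""
    for attempt in range(1, attempts + 1):
        print("UlsOutput Trying to send data via HTTP")
        try:
            agg_queue.put_nowait(msg)  # fails immediately once the queue is full
            return True
        except queue.Full:
            print(f"HTTP Aggregation queue is already full - not adding any more entries. "
                  f"Size: ({agg_queue.qsize()}/{agg_queue.maxsize})")
            print(f"Delivery (output) attempt {attempt} of {attempts}")
    print(f"Not able to deliver the log message after {attempts} attempts - continuing anyway")
    return False

# If whatever drains agg_queue ever stalls, the first message fills the queue and
# every subsequent message fails all of its attempts, as seen in the logs.
agg_queue.put_nowait("first message")  # queue is now permanently full in this sketch
deliver("next access log line")

In that model, once the first message fills the size-1 queue and nothing drains it, every later message fails all ten attempts, which matches what we see in the logs.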

Configuration:
Below is the relevant configuration:

akamai_uls:
  eaa_access:
    environment:
      ULS_LOGLEVEL: "DEBUG"
      ULS_INPUT: "EAA"
      ULS_FEED: "ACCESS"
      ULS_OUTPUT: "HTTP"
      ULS_HTTP_AGGREGATE: 1
      ULS_EDGERC: /opt/akamai-uls/.edgerc
      ULS_DEBUGLOGLINES: "True"
      ULS_AUTORESUME: "True"
      ULS_NOCALLHOME: "True"
      ULS_HTTP_INSECURE: "True"
      NO_PROXY: "localhost,127.0.0.1,::1"
      REQUESTS_CA_BUNDLE: /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt
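
To help rule out the Splunk side, a minimal sketch of posting one test event directly to the HEC endpoint with Python requests - the URL and token below are placeholders, not values from this setup:

import requests

# Placeholder values - substitute the actual HEC endpoint and token.
HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

resp = requests.post(
    HEC_URL,
    headers={"Authorization": f"Splunk {HEC_TOKEN}"},
    json={"event": "uls connectivity test"},
    verify=False,   # mirrors ULS_HTTP_INSECURE: "True" above
    timeout=10,
)
# If the Splunk endpoint itself is the problem, it should show up here as a
# timeout, TLS error, or non-2xx status.
print(resp.status_code, resp.text)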

Expected Behavior:

  • The HTTP aggregation queue should not get stuck. EAA access log messages should be consistently delivered to Splunk without repeated failures.
  • If delivery of an EAA access log message fails, the error message should state the reason for the failure so it is easier to troubleshoot.

Request for Assistance:

  1. Has anyone else observed this issue?
  2. Is there a known solution or workaround?
  3. Can you suggest a way to get more detailed error messages in the logs, so it’s clear why HTTP sends are failing?

Additional Information:
It's unclear why the HTTP sends fail, as the error message does not indicate the specific reason. It would help if a more detailed error message were logged, stating the reason for the failure.


ULS Version
1.8.3

@sethumadhav07 sethumadhav07 added the bug Something isn't working label Oct 4, 2024
@MikeSchiessl
Collaborator

Hi @sethumadhav07 ,
I have not seen this behavior yet.

Do you have any insight (maybe on the Splunk end) into why the data could not be delivered? Any error code or anything else you could point us to?

I see your point that ULS should be a little more verbose about what's going on at the HTTP level - I will try to get a grip on the sending function and force it to spit out more logging details.

@MikeSchiessl
Collaborator

I have added more verbose output for HTTP transmission to the "development" branch - the version tag there should be 1.8.4-alpha.

Feel free to drop that version into your setup and see if you're able to catch the reason why the transmission fails.
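
The kind of extra detail in question is basically logging the concrete exception and response around the HTTP send; a rough sketch of that idea (illustrative only, not the actual ULS 1.8.4-alpha code):

import logging
import requests

log = logging.getLogger("UlsOutput")

def send_via_http(url, payload, timeout=10):
    """Illustrative only: log the concrete failure reason instead of a generic retry message."""
    try:
        resp = requests.post(url, data=payload, timeout=timeout)
        if resp.status_code >= 400:
            log.error("HTTP send failed: status=%s body=%r", resp.status_code, resp.text[:200])
            return False
        return True
    except requests.exceptions.RequestException as err:
        # Timeouts, connection resets, TLS errors etc. end up here with their real cause.
        log.error("HTTP send raised %s: %s", type(err).__name__, err)
        return False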

best
Mike

@sethumadhav07
Author

Thanks, Mike. I will incorporate the changes into my setup and see. It only happens after a certain period, so it can't be easily reproduced.

@sethumadhav07
Author

I am closing this issue for now. I will reopen it if I determine that the issue is on your side.

@sethumadhav07
Author

logs.csv

I am attaching the logs. You can see that it gets stuck without any error.

All you can see is this pattern:

UlsOutput Trying to send data via HTTP
UlsOutput HTTP Aggregation queue is already full - not adding any more entries. Size: (1/1)
MSG[29349] Delivery (output) attempt 1 of 10
ULS was not able to deliver the log message after 10 attempts - (continuing anyway as my config says)

Please let me know if you need anything else from me.

@sethumadhav07 sethumadhav07 reopened this Oct 15, 2024
@MikeSchiessl
Collaborator

Hi @sethumadhav07 , I'm gonna have a look at this!
Too weird - really - I would have expected some output on the HTTP side of the house ... I'm gonna review the code and may need to add a couple more debug points to it.

@MikeSchiessl
Collaborator

Hi @sethumadhav07 ,

sorry for being a little silent the last couple of days - lots of stuff going on ;)

So I have now done two things:
a) optimized the code a little in the HTTP stack (I am still unaware of what is happening in your case)
b) copied and modified a Python test web server which you can use to fire data at (and see the result) - it is in test/opt/webserver.py
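
A minimal sketch of such a receiver (illustrative only, not the actual test/opt/webserver.py) could look like this:

from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read whatever ULS sends and print it, so deliveries and failures are visible.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(f"Received {length} bytes: {body[:200]!r}")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    # Point the ULS HTTP output at http://localhost:8080/ to watch deliveries arrive.
    HTTPServer(("0.0.0.0", 8080), EchoHandler).serve_forever()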

Please give the latest development version a run and report back to me.

Bizarre thing! Even with the test webserver I can produce specific errors and get proper output ... so I am more than keen to understand what is going on here ;)

best
Mike


sethumadhav07 commented Oct 24, 2024

I'll try out the latest development version and let you know what happens. Unfortunately, the problem is difficult to reproduce, and only occurs after the pod has been running for several days.

@mschiessl

Hi @sethumadhav07 , a quick ping to ask whether you have spotted the behavior again - I am planning to release the "new" version within the next couple of $days/$weeks.

@sethumadhav07
Author

Hi Mike,

This issue is still happening. I have given it lower priority because it is not happening in production, only in our testing environment. Whenever I get time, I will try to investigate further. If you can put your changes in a separate branch and share it here, that would be good; if not, I can always look through the history.

Please go ahead and release the new version.

Regards,
Sethu

@MikeSchiessl
Collaborator

Hi Sethu,

Happy New Year. We have just released the latest ULS version.
How is your observation going so far?

What should we do with this ticket?

Best
Mike
