
Akamai-ULS Stops Sending Messages to Splunk Due to HTTP Aggregation Queue Being Full #77

Open
sethumadhav07 opened this issue Oct 4, 2024 · 11 comments


sethumadhav07 commented Oct 4, 2024

Description:
I am encountering an issue with the Akamai ULS app: in some cases it stops sending EAA access log messages to Splunk and gets stuck for an unknown reason. The issue appears to be related to the HTTP aggregation queue getting full and never clearing. As a result, no messages are sent to Splunk for extended periods.

Observed Behavior:

  • The following messages appear repeatedly in the logs for various EAA access log messages:
    UlsOutput Trying to send data via HTTP
    UlsOutput HTTP Aggregation queue is already full - not adding any more entries. Size: (1/1)
    MSG[29349] Delivery (output) attempt 1 of 10
    ULS was not able to deliver the log message after 10 attempts - (continuing anyway as my config says)
    
  • This pattern repeats for many access log messages over several hours, during which no messages are delivered to Splunk (see the sketch after this list).
  • The issue only appears after the pod has been running for a few days without any problems. After an undefined period, the above messages begin to appear and messages stop being delivered.
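
To illustrate the pattern above: a minimal sketch, assuming the aggregation queue behaves like a bounded Python queue.Queue of size ULS_HTTP_AGGREGATE (an assumption about the implementation, not taken from the ULS source), shows how a stalled consumer would reproduce these log lines:

import queue

# Hypothetical stand-in for the HTTP aggregation queue, sized via ULS_HTTP_AGGREGATE.
agg_queue = queue.Queue(maxsize=1)

def deliver(msg, attempts=10):
    """Try to enqueue a message for HTTP output, mimicking the observed retry loop."""
    for attempt in range(1, attempts + 1):
        print("UlsOutput Trying to send data via HTTP")
        try:
            agg_queue.put_nowait(msg)  # fails immediately once the queue is full
            return True
        except queue.Full:
            print(f"HTTP Aggregation queue is already full - not adding any more entries. "
                  f"Size: ({agg_queue.qsize()}/{agg_queue.maxsize})")
            print(f"Delivery (output) attempt {attempt} of {attempts}")
    print(f"Not able to deliver the log message after {attempts} attempts - continuing anyway")
    return False

# If whatever drains agg_queue ever stalls, the first message fills the queue and
# every subsequent message fails all of its attempts, as seen in the logs.
agg_queue.put_nowait("first message")  # queue is now permanently full in this sketch
deliver("next access log line")

In that model, once the first message fills the size-1 queue and nothing drains it, every later message fails all ten attempts, which matches what we see in the logs.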

Configuration:
Below is the relevant configuration:

akamai_uls:
  eaa_access:
    environment:
      ULS_LOGLEVEL: "DEBUG"
      ULS_INPUT: "EAA"
      ULS_FEED: "ACCESS"
      ULS_OUTPUT: "HTTP"
      ULS_HTTP_AGGREGATE: 1
      ULS_EDGERC: /opt/akamai-uls/.edgerc
      ULS_DEBUGLOGLINES: "True"
      ULS_AUTORESUME: "True"
      ULS_NOCALLHOME: "True"
      ULS_HTTP_INSECURE: "True"
      NO_PROXY: "localhost,127.0.0.1,::1"
      REQUESTS_CA_BUNDLE: /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt
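
To help rule out the Splunk side, a minimal sketch of posting one test event directly to the HEC endpoint with Python requests - the URL and token below are placeholders, not values from this setup:

import requests

# Placeholder values - substitute the actual HEC endpoint and token.
HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

resp = requests.post(
    HEC_URL,
    headers={"Authorization": f"Splunk {HEC_TOKEN}"},
    json={"event": "uls connectivity test"},
    verify=False,   # mirrors ULS_HTTP_INSECURE: "True" above
    timeout=10,
)
# If the Splunk endpoint itself is the problem, it should show up here as a
# timeout, TLS error, or non-2xx status.
print(resp.status_code, resp.text)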

Expected Behavior:

  • The HTTP aggregation queue should not get stuck. EAA access log messages should be consistently delivered to Splunk without repeated failures.
  • If delivery of an EAA access log message fails, the error message should state the reason for the failure so it is easier to troubleshoot.

Request for Assistance:

  1. Has anyone else observed this issue?
  2. Is there a known solution or workaround?
  3. Can you suggest a way to get more detailed error messages in the logs, so it’s clear why HTTP sends are failing?

Additional Information:
It's unclear why the HTTP sends fail, as the error message does not indicate the specific reason. It would help if a more detailed error message were logged, stating the reason for the failure.


ULS Version
1.8.3

@sethumadhav07 sethumadhav07 added the bug Something isn't working label Oct 4, 2024
@MikeSchiessl
Collaborator

Hi @sethumadhav07 ,
I have not seen this behavior yet.

Do you have any insight (maybe on the Splunk end) into why the data could not be delivered? Any error code or anything else you could point us to?

I see your point that ULS should be a little more verbose about what's going on at the HTTP level - I will try to get a grip on the sending function and force it to spit out more logging details.

@MikeSchiessl
Collaborator

I have added more verbose output for HTTP transmission to the "development" branch - the version tag there should be 1.8.4-alpha.

Feel free to drop that version into your setup and see if you're able to catch the reason why the transmission fails.
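
The kind of extra detail in question is basically logging the concrete exception and response around the HTTP send; a rough sketch of that idea (illustrative only, not the actual ULS 1.8.4-alpha code):

import logging
import requests

log = logging.getLogger("UlsOutput")

def send_via_http(url, payload, timeout=10):
    """Illustrative only: log the concrete failure reason instead of a generic retry message."""
    try:
        resp = requests.post(url, data=payload, timeout=timeout)
        if resp.status_code >= 400:
            log.error("HTTP send failed: status=%s body=%r", resp.status_code, resp.text[:200])
            return False
        return True
    except requests.exceptions.RequestException as err:
        # Timeouts, connection resets, TLS errors etc. end up here with their real cause.
        log.error("HTTP send raised %s: %s", type(err).__name__, err)
        return False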

best
Mike

@sethumadhav07
Author

Thanks, Mike. I will incorporate the changes into my setup and see. It only happens after a certain period, so it can't be easily reproduced.

@sethumadhav07
Author

I am closing this issue for now. I will reopen it if I determine that the issue is on your side.

@sethumadhav07
Author

logs.csv

I am attaching the logs. You can see that it gets stuck without any error.

All you can see is this pattern:

UlsOutput Trying to send data via HTTP
UlsOutput HTTP Aggregation queue is already full - not adding any more entries. Size: (1/1)
MSG[29349] Delivery (output) attempt 1 of 10
ULS was not able to deliver the log message after 10 attempts - (continuing anyway as my config says)

Please let me know if you need anything else from me.

@sethumadhav07 sethumadhav07 reopened this Oct 15, 2024
@MikeSchiessl
Collaborator

Hi @sethumadhav07 , I'm gonna have a look at this!
Too weird - really - I would have expected some output on the HTTP side of the house ... I'm gonna review the code and may need to add a couple more debug points to it.

@MikeSchiessl
Collaborator

Hi @sethumadhav07 ,

sorry for being a little silent the last couple of days - lots of stuff going on ;)

So I have now done two things:
a) optimized the code a little in the HTTP stack (I am still unaware of what is happening in your case)
b) copied and modified a Python test web server which you can use to fire data at (and see the result) - it is in test/opt/webserver.py
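
A minimal sketch of such a receiver (illustrative only, not the actual test/opt/webserver.py) could look like this:

from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read whatever ULS sends and print it, so deliveries and failures are visible.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(f"Received {length} bytes: {body[:200]!r}")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    # Point the ULS HTTP output at http://localhost:8080/ to watch deliveries arrive.
    HTTPServer(("0.0.0.0", 8080), EchoHandler).serve_forever()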

Please give the latest development version a run and report back to me.

Bizarre thing! Even with the test webserver I can produce specific errors and get proper output ... so I am more than keen to understand what is going on here ;)

best
Mike


sethumadhav07 commented Oct 24, 2024

I'll try out the latest development version and let you know what happens. Unfortunately, the problem is difficult to reproduce, and only occurs after the pod has been running for several days.

@mschiessl

Hi @sethumadhav07 , a quick ping to ask whether you have spotted the behavior again - I am planning to release the "new" version within the next couple of $days/$weeks.

@sethumadhav07
Author

Hi Mike,

This issue is still happening. I have given it lower priority because it is not happening in production, only in our testing environment. Whenever I get time, I will try to investigate further. If you can put your changes in a separate branch and share it here, that would be good; if not, I can always look through the history.

Please go ahead and release the new version.

Regards,
Sethu

@MikeSchiessl
Collaborator

Hi Sethu,

Happy New Year. We have just released the latest ULS version.
How is your observation going so far?

What should we do with this ticket?

Best
Mike
