Skip to content

Releases: openzipkin/zipkin

Zipkin 2.17

24 Sep 15:20
Compare
Choose a tag to compare

Zipkin 2.17 publishes a "master" Docker tag, adds a traceMany endpoint, and improves startup performance.

Before we get into these features, let's talk about logo :) You'll notice our website has been sporting a new logo designed by LINE recently. Thanks to @making, you can see this when booting zipkin, too!

Screenshot 2019-09-24 at 8 20 42 PM

Publishing a "master" docker tag

We've been doing docker images since 2015. One thing missing for a long time was test images. For example, if you wanted to verify Zipkin before a release in Docker, you'd have to build the image yourself. @anuraaga sorted that out by setting up automatic publication of the master branch as the tag "master" in docker.

Ex. "latest" is the last release, "master" is the last commit.

$ docker pull openzipkin/zipkin:master

Probably more interesting to some is how he used this. With significant advice from @bsideup, @anuraaga setup integrated benchmarks of integrated zipkin servers. This allows soak tests without the usual environment volatility.

https://github.com/openzipkin/zipkin/blob/master/benchmarks/src/test/java/zipkin2/server/ServerIntegratedBenchmark.java

A keen eye will notice that these benchmarks literally run our sleuth example! This is really amazing work, so let's all thank Rag!

Api for getting many traces by ID

While the primary consumer of our api is the Lens UI, we know there are multiple 3rd party consumers. Many of these prefer external discovery of trace IDs and just need an api to grab the corresponding json.

Here are some use cases:

  • out-of-band aggregations, such histograms with attached representative IDs
  • trace comparison views
  • decoupled indexing and retrieval (Ex lucene over a blob store)
  • 3rd party apis that offer stable pagination (via a cursor over trace IDs).

of the api was made to me like other endpoints in the same namespace, and be easy to discover. We have a rationale document if you are interested in details.

Thanks much for the code work from @zeagord and @llinder and api design review by @jcarres-mdsol @jeqo @devinsba and @jcchavezs!

Quicker startup

After hearing some complaints, we did a lot of digging in the last release to find ways to reduce startup time. For example, we found some folks were experiencing nearly 10 second delays booting up an instance of Zipkin, eventhough many were in the 3.5s range. Among many things, we turned off configuration not in use and deleted some dependencies we didn't need. Next, we directly implemented our supported admin endpoints with Armeria. Here are those endpoints in case you didn't read our README

  • /health - Returns 200 status if OK
  • /info - Provides the version of the running instance
  • /metrics - Includes collector metrics broken down by transport type
  • /prometheus - Prometheus scrape endpoint

Off the bat, Zipkin 2.17 should boot faster. For example, one "laptop test" went from "best of 5" 3.4s down to under 3s with no reduction in functionality. While docker containers always start slower than java directly, you should see similar reductions in startup time there, too.

For those still unsatisfied, please upvote our slim distribution which can put things in the 2s range with the same hardware, for those who are only using in-memory storage or Elasticsearch.

While we know bootstrap time is unimportant to most sites, it is very important emotionally to some. We'd like to make things as fun as possible, and hope this sort of effort keeps things snappy for you!

Zipkin 2.16

26 Aug 08:19
Compare
Choose a tag to compare

Zipkin 2.16 includes revamps of two components, our Lens UI and our Elasticsearch storage implementation. Thanks in particular to @tacigar and @anuraaga who championed these two important improvements. Thanks also to all the community members who gave time to guide, help with and test this work! Finally, thanks to the Armeria project whose work much of this layers on.

If you are interested in joining us, or have any questions about Zipkin, join our chat channel!

Lens UI Revamp

As so much has improved in the Zipkin Lens UI since 2.15, let's look at the top 5, in order of top of screen to bottom. The lion's share of effort below is with thanks to @tacigar who has put hundreds of hours effort into this release.

To understand these, you can refer to the following images which annotate the number discussed. The first image in each section refers to 2.15 and the latter 2.16.2 (latest patch at the time)

2 15 search
2 16 search
2 15 detail
2 16 detail
2 15 dependencies
2 16 dependencies

1. Default search to 15minutes, not 1 hour, and pre-can 5 minute search

Before, we had an hour search default which unnecessarily hammers the backend. Interviewing folks, we realized that more often it is 5-15minutes window of interest when searching for traces. By changing this, we give back a lot of performance with zero tuning on the back end. Thanks @zeagord for the help implementing this.

2. Global search parameters apply to the dependency diagram

One feature of Expedia Haystack we really enjoy is the global search. This is where you can re-use context added by the user for trace queries, for other screens such as network diagrams. Zipkin 2.16 is the first version to share this, as before the feature was stubbed out with different controls.

3. Single-click into a trace

Before, we had a feature to preview traces by clicking on them. The presumed use case was to compare multiple traces. However, this didn't really work as you can't guarantee traces will be near eachother in a list. Moreover, large traces are not comparable this way. We dumped the feature for a simpler single-click into the trace similar to what we had before Lens. This is notably better when combined with network improvements described in 5. below.

4. So much better naming

Before, in both the trace list and also detail, names focused on the trace ID as opposed to what most are interested in (the top-level span name). By switching this out, and generally polishing the display, we think the user interface is a lot more intuitive than before.

5. Fast switching between Trace search and detail screen.

You cannot see 5 unless you are recording, because the 5th is about network performance. Lens now shares data between the trace search and the trace detail screen, allowing you to quickly move back and forth with no network requests and reduced rendering overhead.

2 15 network
2 16 network

Elasticsearch client refactor

Our first Elasticsearch implementation allowed requests to multiple HTTP endpoints to failover on error. However, it did not support multiple HTTPS endpoints, nor any load balancing features such round-robin or health checked pools.

For over two years, Zipkin sites have asked us to support sending data to an Elasticsearch cluster of multiple https endpoints. While folks have been patient, workarounds such as "setup a load balancer", or change your hostnames and certificates, have not been received well. It was beyond clear we needed to do the work client-side. Now, ES_HOSTS can take a list of https endpoints.

Under the scenes, any endpoints listed receive periodic requests to /_cluster/health. Endpoints that pass this check receive traffic in a round-robin fashion, while those that don't are marked bad. You can see detailed status from the Prometheus endpoint:

$ curl -sSL localhost:9411/prometheus|grep ^armeria_client_endpointGroup_healthy
armeria_client_endpointGroup_healthy{authority="search-zipkin-2rlyh66ibw43ftlk4342ceeewu.ap-southeast-1.es.amazonaws.com:443",ip="52.76.120.49",name="elasticsearch",} 1.0
armeria_client_endpointGroup_healthy{authority="search-zipkin-2rlyh66ibw43ftlk4342ceeewu.ap-southeast-1.es.amazonaws.com:443",ip="13.228.185.43",name="elasticsearch",} 1.0

Note: If you wish to disable health checks for any reason, set zipkin.storage.elasticsearch.health-check.enabled=false using any mechanism supported by Spring Boot.

The mammoth of effort here is with thanks to @anuraaga. Even though he doesn't use Elasticsearch anymore, he volunteered a massive amount of time to ensure everything works end-to-end all the way to prometheus metrics and client-side health checks. A fun fact is Rag also wrote the first Elasticsearch implementation! Thanks also to the brave who tried early versions of this work, including @jorgheymans, @jcarres-mdsol and stanltam

If you have any feedback on this feature, or more questions about us, please reach out on gitter

Test refactoring

Keeping the project going is not automatic. Over time, things take longer because we are doing more, testing more, testing more dimensions. We ran into a timeout problem in our CI server. Basically, Travis has an absolute time of 45 minutes for any task. When running certain integration tests, and publishing at the same time, we were hitting near that routinely, especially if the build cache was purged. @anuraaga did a couple things to fix this. First, he ported the test runtime from classic junit to jupiter, which allows more flexibility in how things are wired. Then, he scrutinized some expensive cleanup code, which was unnecessary when consider containers were throwaway. At the end of the day, this bought back 15 minutes for us to.. later fill up again 😄 Thanks, Rag!

Small changes

Background on Elasticsearch client migration

The OkHttp java library is everywhere in Zipkin.. first HTTP instrumentation in Brave, the encouraged way to report spans, and relevant to this topic, even how we send data to Elasticsearch!

For years, the reason we didn't support multiple HTTPS endpoints was the feature we needed was on OkHttp backlog. This is no criticism of OkHttp as it is both an edge case feature, and there are ways including layering a client-side load balancer on top. This stalled out for lack of volunteers to implement the OkHttp side or an alternative. Yet, people kept asking for the feature!

We recently moved our server to Armeria, resulting in increasing stake, experience and hands to do work. Even though its client side code is much newer than OkHttp, it was designed for advanced features such as client-side load balancing. The idea of re-using Armeria as an Elasticsearch client was interesting to @anuraaga, who volunteered both ideas and time to implement them. The result was a working implementation complete with client-side health checking, supported by over a month of Rag's time.

The process of switching off OkHttp taught us more about its elegance, and directly influenced improvements in Armeria. For example, Armeria's test package now includes utilities inspired by OkHttp's MockWebServer.

What we want to say is.. thanks OkHttp! Thanks for the formative years of our Elasticsearch client and years ahead as we use OkHttp in other places in Zipkin. Keep up the great work!

Zipkin 2.15

05 Jul 08:48
Compare
Choose a tag to compare

ActiveMQ 5.x span transport

Due to popular demand, we've added support for ActiveMQ 5.x. Zipkin server will connect to ActiveMQ when the env variable ACTIVEMQ_URL is set to a valid broker. Thanks very much to @IAMTJW for work on this feature and @thanhct for testing it against AWS MQ.

Ex. simple usage against a local broker

ACTIVEMQ_URL=tcp://localhost:61616 java -jar zipkin.jar

Ex. usage with docker against a remote AWS MQ failover group

docker run -d -p 9411:9411 -e ACTIVEMQ_URL='failover:(ssl://b-da18ebe4-54ff-4dfc-835f-3862a6c144b1-1.mq.ap-southeast-1.amazonaws.com:61617,ssl://b-da18ebe4-54ff-4dfc-835f-3862a6c144b1-2.mq.ap-southeast-1.amazonaws.com:61617)' -e ACTIVEMQ_USERNAME=zipkin -e ACTIVEMQ_PASSWORD=zipkin12345678 -e ACTIVEMQ_CONCURRENCY=8 openzipkin/zipkin

Rewrite of Zipkin Lens global search component

One of the most important roles in open source is making sure the project is maintainable. As features were added in our new UI, maintainability started to degrade. Thanks to an immense amount of effort by @tacigar, we now have new, easier to maintain search component. Under the covers, it is implemented in Material-UI and React Hooks.

Screenshot 2019-07-05 at 2 54 37 PM

Behind the crisp new look is clean code that really helps the sustainability of our project. Thanks very much to @tacigar for his relentless attention.

Refreshed Grafana Dashboard

While many have tried our Grafana dashboard either directly or via our docker setup, @mstaalesen really dug deep. He noticed some things drifted or were in less than ideal places. Through a couple weeks of revision, we now have a tighter dashboard. If you have suggestions, please bring them to Gitter as well!

Screenshot 2019-07-05 at 2 41 30 PM

Small, but appreciated fixes

  • Fixes a bug where a Java 8 class could be accidentally loaded when in Java 1.7
  • Ensures special characters are not used in RabbitMQ consumer tags (thx @bianxiaojin)
  • Shows connect exceptions when using RabbitMQ (thx @thanhct)
  • Fixes glitch where health check wasn't reported properly when throttled (thx @lambcode)

Zipkin 2.14

15 May 15:44
Compare
Choose a tag to compare

Zipkin 2.14 adds storage throttling and Elasticsearch 7 support. We've also improved efficiency around span collection and enhanced the UI. As mentioned last time, this release drops support for Elasticsearch v2.x and Kafka v0.8.x. Here's a run-down of what's new.

Storage Throttling (Experimental)

How to manage surge problems in collector architecture is non-trivial. While we've collected resources for years about this, only recently we had a champion to take on some mechanics in practical ways. @Logic-32 fleshed out concerns in collector surge handling and did an excellent job evaluating options for those running pure http sites.

Towards that end, @Logic-32 created an experimental storage throttling feature (bundled for your convenience). When STORAGE_THROTTLE_ENABLED=true calls to store spans pay attention to storage errors and adjust backlog accordingly. Under the hood, this uses Netfix concurrency limits.

Craig tested this at his Elasticsearch site, and it resulted in far less dropped spans than before. If you are interested in helping test this feature, please see the configuration notes and join gitter to let us know how it works for you.

Elasticsearch 7.x

Our server now supports Elasticsearch 6-7.x formally (and 5.x as best efforts). Most notably, you'll no longer see colons in your index patterns if using Elasticsearch 7.x. Thank to @making and @chefky for the early testing of this feature as quite a lot changed under the hood!

Lens UI improvements

@tacigar continues to improve Lens so that it can become the default user interface. He's helped tune the trace detail screen, notably displaying the minimap more intuitively based on how many spans are in the trace. You'll also notice the minimap has a slider now, which can help stabilize the area of the trace you are investigating.

Significant efficiency improvements

Our Armeria collectors (http and grpc) now work natively using pooled buffers as opposed to byte arrays with renovated protobuf parsers. The sum this is more efficient trace collection when using protobuf encoding. Thanks very much to @anuraaga for leading and closely reviewing the most important parts of this work.

No more support for Elasticsearch 2.x and Kafka 0.8.x

We no longer support Elasticsearch 2.x or Kafka 0.8.x. Please see advice mentioned in our last release if you are still on these products.

Scribe is now bundled (again)

We used to bundle Scribe (Thrift RPC span collector), but eventually moved it to a separate module due to it being archived technology with library conflicts. Our server is now powered by Armeria, which natively supports thrift. Thanks to help from @anuraaga, the server has built-in scribe support for those running legacy applications. set SCRIBE_ENABLED=true to use this.

Other notable updates

  • Elasticsearch span documents are written with ID ${traceID}-${MD5(json)} to allow for server-side deduplication
  • Zipkin Server is now using the latest Spring Boot 2.1.5 and Armeria 0.85.0

Zipkin 2.13.0

01 May 11:07
Compare
Choose a tag to compare

Zipkin 2.13 includes several new features, notably a gRPC collection endpoint and remote service name indexing. Lens, our new UI, is fast approaching feature parity, which means it will soon be default. End users should note this is the last release to support Elasticsearch v2.x and Kafka v0.8.x. Finally, this our first Apache Incubating release, and we are thankful to the community's support and patience towards this.

Lens UI Improvements

Led by @tacigar, Lens has been fast improving. Let's look at a couple recent improvements: Given a trace ID or json, the page will load what you want.

Open trace json

Go to trace ID

Right now, you can opt-in to Lens by clicking a button. The next step is when Lens is default and finally when the classic UI is deleted. Follow the appropriate projects for status on this.

gRPC Collection endpoint

Due to popular demand, we now publish a gRPC endpoint /zipkin.proto3.SpanService/Report which accepts the same protocol buffers ListOfSpans message as our POST /api/v2/spans endpoint. This listens on the same port as normal http traffic when COLLECTOR_GRPC_ENABLED=true. We will enable this by default after the feature gains more experience.

We chose to publish a unary gRPC endpoint first, as that is most portable with limited clients such as grpc-web. Our interop tests use the popular Android and Java client Square Wire. Special thanks to @ewhauser for leading this effort and @anuraaga who championed much of the work in Armeria.

Remote Service Name indexing

One rather important change in v2.13 is remote service name indexing. This means that the UI no longer confuses local and remote service name in the same drop-down. The impact is that some sites will have much shorter and more relevant drop-downs, and more efficient indexing. Here are some screen shots from Lens and Classic UIs:

Schema impact

We have tests to ensure the server can be upgraded ahead of schema change. Also, most storage types have the ability to automatically upgrade the schema. Here are relevant info if you are manually upgrading:

STORAGE_TYPE=cassandra

If you set CASSANDRA_ENSURE_SCHEMA=false, you are opting out of automatic schema management. This means you need to execute these CQL commands manually to update your keyspace

STORAGE_TYPE=cassandra3

If you set CASSANDRA_ENSURE_SCHEMA=false, you are opting out of automatic schema management. This means you need to execute these CQL commands manually to update your keyspace

STORAGE_TYPE=elasticsearch

No index changes were needed

STORAGE_TYPE=mysql

Logs include the following message until instructions are followed:

zipkin_spans.remote_service_name doesn't exist, so queries for remote service names will return empty.
Execute: ALTER TABLE zipkin_spans ADD `remote_service_name` VARCHAR(255);
ALTER TABLE zipkin_spans ADD INDEX `remote_service_name`;

Dependency group ID change

For those using Maven to download, note that the group ID for libraries changed from "io.zipkin.zipkin2" to "org.apache.zipkin.zipkin2". Our server components group ID changed from "io.zipkin.java" to "org.apache.zipkin"

This is the last version to support Elasticsearch 2.x

Elastic's current support policy is latest major version (currently 7) and last minor (currently 6.7). This limits our ability to support you. For example, Elasticsearch's hadoop library is currently broken for versions 2.x and 5.x making our dependencies job unable to work on that range and also work on version 7.x.

Our next release will support Elasticsearch 7.x, but we have to drop Elasticsearch 2.x support. Elasticsearch 5.x will be best efforts. We advise users to be current with Elastic's supported version policy, to avoid being unable to upgrade Zipkin.

This is the last version to support Kafka 0.8x

This is the last release of Zipkin to support connecting to a Kafka 0.8 broker (last release almost 4 years ago). Notably, this means those using KAFKA_ZOOKEEPER to configure their broker need to switch to KAFKA_BOOTSTRAP_SERVERS instead.

Other notable updates

  • Zipkin Server is now using the latest Spring boot 2.1.4

Zipkin 2.12.6

10 Mar 19:00
Compare
Choose a tag to compare

Zipkin 2.12.6 migrates to the Armeria http engine. We also move to Distroless to use JRE 11 in our docker images.

Interest in Armeria 3 years ago originated at LINE, a long time supporter of Zipkin. This was around its competency in http/2 and asynchronous i/o. Back then, we were shifting towards a more modular server so that they could create their own. Over time, interest and our use case for Armeria grown. Notably, @ewhauser has led an interest in a gRPC endpoint for zipkin. Typically, people present different listen ports for gRPC, but Armeria allows the same engine to be used for both usual web requests and also gRPC. Moreover, now more than ever LINE, the team behind Armeria are involved deeply in our community, as are former LINE engineers like @anuraaga. So, we had a match of both supply and demand for the technology.

End users will see no difference in Zipkin after we replaced the http engine. Observant administrators will notice some of the console lines being a bit different. The whole experience has been drop-in thanks to spring boot integration efforts led by @anuraaga @trustin and @hyangtack. For those interested in the technology for their own apps, please check out Armeria's example repository.

There's more to making things like http/2 work well than the server framework code. For example, there is nuance around OpenSSL which isn't solved well until newer runtimes. For a long time, we used alpine JRE 1.8 because some users were, justifiably or not, very concerned about the size of our docker image. As years passed, we kept hitting fragile setup concerns around OpenSSL. This would play out as bad releases of stackdriver integration, as that used gRPC. As we owe more service to users than perceptions around dist sizes, we decided it appropriate to move a larger, but still slim distroless JRE 11 image.

The MVP of all of this is @anuraaga, the same person who made an armeria zipkin server at LINE 3 years ago, who today backfilled functionality where needed, and addressed the docker side of things. Now, you can use this more advanced technology without thinking. Thank you, Rag!

2.12.4 DO NOT USE

07 Mar 19:52
Compare
Choose a tag to compare
2.12.4 DO NOT USE Pre-release
Pre-release

This version returned corrupt data under /prometheus (/actuator/prometheus). Do not use it

Zipkin 2.12.3

02 Mar 12:33
Compare
Choose a tag to compare

Zipkin 2.12.3 provides an easy way to preview Lens, our new user interface.

Introduction to Zipkin Lens

Zipkin was open sourced by Twitter in 2012. Zipkin was the first OSS distributed tracing system shipped complete with instrumentation and a UI. This "classic" UI persisted in various forms for over six years before Lens was developed. We owe a lot of thanks to the people who maintained this, as it is deployed in countless sites. We also appreciate alternate Zipkin UIs attempts, and the work that went into them. Here are milestones in the classic UI, leading to Lens.

  • Mid 2012-2015 - Twitter designed and maintained Zipkin UI
  • Early 2016 - Eirik and Zoltan team up to change UI code and packaging from scala to pure javascript
  • Late 2016 - Roger Leads Experimental Angular UI
  • Late 2017 to Early 2018 - Mayank Leads Experimental React UI
  • Mid to Late 2018 - Raja renovates classic UI while we investigate options for a standard React UI
  • December 7, 2018 - LINE contributes their React UI as Zipkin Lens
  • Early 2019 - Igarashi and Raja complete Zipkin Lens with lots of usage feedback from Daniele

Lens took inspiration from other UIs, such as Haystack, and cited that influence. You'll notice using it that it has a feel of its own. Many thanks to the design lead Igarashi, who's attention to detail makes Lens a joy to use. Some design goals were to make more usable space, as well be able to assign site-specific details, such as tags. Lens is not complete in its vision. However, it has feature parity to the point where broad testing should occur.

screen shot 2019-03-02 at 8 32 22 pm

Trying out Lens

We spend a lot of time and effort in attempts to de-risk trying new things. Notably, the Zipkin server ships with both the classic and the Lens UI, until the latter is complete. With design help from @kaiyzen and @bsideup, starting with Zipkin 2.12.3, end users can select the UI they prefer at runtime. All that's needed is to press the button "Try Lens UI", which reloads into the new codebase. There's then a button to revert: "Go back to classic Zipkin".

gobacktolens

Specifically the revert rigor was thanks to Tommy and Daniele who insisted on your behalf that giving a way out is as important as the way in. We hope you feel comfortable letting users try Lens now. If you want to prevent that, you can: set the variable ZIPKIN_UI_SUGGEST_LENS=false.

Auto-complete Keys

One design goal of Lens was to have it better reflect the priority of sites. The first work towards that is custom search criteria via auto-completion keys defined by your site. This was inspired by an alternative to Lens, Haystack UI, which has a similar "universal search" feature. Many thanks to Raja who put in ground work on the autocomplete api and storage integration needed by this feature. Thanks to Igarashi for developing the user interface in Lens for it.

To search by site-specific tags, such as environment names, you first need to tell Zipkin which keys you want. Start your zipkin servers with an environment variable like below that includes keys whitelisted from Span.tags. Note that keys you list should have a fixed set of values. In other words, do not use keys that have thousands of values.

AUTOCOMPLETE_KEYS=client_asg_name,client_cluster_name,service_asg_name,service_cluster_name java -jar zipkin.jar

Here's a screen shot of custom auto-completion, using an example trace from Netflix.

screen shot 2019-03-02 at 8 15 01 pm

Feedback

Maintaining two UIs is a lot of burden on Zipkin volunteers. We want to transition to Lens as soon as it is ready. Please upgrade and give feedback on Gitter as you use the tool. With luck, in a month or two we will be able to complete the migration, and divert the extra energy towards new features you desire, maybe with your help implementing them! Thanks for the attention, and as always star our repo, if you think we are doing a good job for open source tracing!

Zipkin 2.11.1

05 Aug 11:51
Compare
Choose a tag to compare

Zipkin 2.11.1 fixes a bug which prevented Cassandra 3.11.3+ storage from initializing properly. Thanks @nollbit and @drolando

Zipkin 2.11

03 Aug 13:25
Compare
Choose a tag to compare

Zipkin 2.11 dramatically improves Cassandra 3 indexing and fixes some UI glitches

Cassandra 3 indexing

We've had cassandra3 storage type for a while, which uses SASI indexing. One thing @Mobel123 noticed was particular high disk usage for indexing tags. This resulted in upstream work in cassandra to introduce a new indexing option which results in 20x performance for the type of indexing we use.

See #1948 for details, but here are the notes:

  1. You must upgrade to Cassandra 3.11.3 or higher first
  2. Choose a path for dealing with the old indexes
  • easiest is re-create your keyspace (which will drop trace data)
  • advanced users can run zipkin2-schema-indexes.cql, which will leave the data alone but recreate the index
  1. Update your zipkin servers to latest patch (2.11.1+)

Any questions, find us on gitter!

Thanks very much @michaelsembwever for championing this, and @llinder for review and testing this before release. Thanks also to the Apache Cassandra project for accepting this feature as it is of dramatic help!

UI fixes

@zeagord fixed bugs relating to custom time queries. @drolando helped make messages a little less scary when search is disabled. Zipkin's UI is a bit smaller as we've updated some javascript infra which minimizes better. This should reduce initial load times. Thanks tons for all the volunteering here!