Update README for the initial release of cloud-tpu-diagnostics package to production

PiperOrigin-RevId: 538581626
SurbhiJainUSC authored and copybara-github committed Jun 7, 2023
1 parent d1fe368 commit 053e6b0
Showing 2 changed files with 84 additions and 2 deletions.
5 changes: 3 additions & 2 deletions pip_package/CHANGELOG.md
@@ -38,9 +38,10 @@ To release a new version (e.g. from `1.0.0` -> `2.0.0`):

## [Unreleased]

## [0.1.0] - YYYY-MM-DD
## [0.1.0] - 2023-06-08

* Initial release
* Initial release of cloud-tpu-diagnostics PyPI package
* FEATURE: Contains debug module to collect stack traces on faults

[Unreleased]: https://github.com/google/cloud-tpu-monitoring-debugging/compare/v0.1.0...HEAD
[0.1.0]: https://github.com/google/cloud-tpu-monitoring-debugging/releases/tag/v0.1.0
81 changes: 81 additions & 0 deletions pip_package/README.md
@@ -17,3 +17,84 @@

This is a comprehensive library to monitor, debug and profile the jobs running on Cloud TPU.
To learn about Cloud TPU, refer to the [full documentation](https://cloud.google.com/tpu/docs/intro-to-tpu).

## Features
### 1. Debugging
#### 1.1 Collect Stack Traces
This module dumps the Python stack traces when a fault such as a segmentation fault, floating-point exception, or illegal operation exception occurs in the program. It also collects stack traces periodically to help debug a program running on Cloud TPU that is stuck or hung.
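
The README does not show how the module is implemented. As a rough illustration of the behavior described above, the sketch below uses Python's standard `faulthandler` module, which can dump tracebacks on fatal signals and on a repeating timer. The helper names `enable_fault_traces` and `disable_fault_traces` are hypothetical and are not part of cloud-tpu-diagnostics.

```python
import faulthandler
import sys


def enable_fault_traces(interval_seconds, stream=None):
    """Dump Python tracebacks on fatal signals and on a repeating timer.

    Hypothetical helper for illustration; not the package's own code.
    """
    out = sys.stderr if stream is None else stream
    # Install handlers for fatal signals such as SIGSEGV, SIGFPE and SIGILL.
    faulthandler.enable(file=out, all_threads=True)
    # Start a watchdog that dumps every thread's stack each interval_seconds,
    # which helps when the program is hung rather than crashed.
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=out)


def disable_fault_traces():
    """Stop the periodic dump and remove the fault handlers."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
```

The stream passed in must be a real file (it needs a file descriptor), because `faulthandler` writes to it directly from the signal handler.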

## Installation
To install the package, run the following command on the TPU VM:

```
pip install cloud-tpu-diagnostics
```

## Usage
To use this package, first import the modules:

```
from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration
```

Then, create a configuration object for stack traces. The module collects stack traces only when the `collect_stack_trace` parameter is set to `True`. The following scenarios are currently supported:

##### Scenario 1: Do not collect stack traces on faults

```
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=False)
```
With this configuration, no stack traces are collected in the event of a fault or process hang.

##### Scenario 2: Collect stack traces on faults and display on console

```
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=False)
```
If there is a fault or process hang, this configuration will show the stack traces on the console (stderr).

##### Scenario 3: Collect stack traces on faults and upload to cloud

```
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True)
```
This configuration temporarily collects stack traces inside the `/tmp/debugging` directory on the TPU host if there is a fault or process hang. The traces collected in TPU host memory are then uploaded to Google Cloud Logging, which makes it easier to troubleshoot and fix the problems.
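
The collect-then-upload flow can be pictured with a small standard-library sketch. It writes a traceback to a file in a temporary directory, then hands each line to an `upload` callback. Both `collect_then_upload` and the callback are hypothetical stand-ins: the sketch does not call Google Cloud Logging and does not reflect how the package performs the upload.

```python
import faulthandler
import os
import tempfile


def collect_then_upload(trace_dir, upload):
    """Write the current stack traces to a file, then pass each line to upload.

    Hypothetical illustration; `upload` stands in for a real logging client.
    """
    os.makedirs(trace_dir, exist_ok=True)
    path = os.path.join(trace_dir, f"stack_trace_{os.getpid()}.txt")
    # Step 1: collect the traces of all threads into a local file.
    with open(path, "w") as trace_file:
        faulthandler.dump_traceback(file=trace_file, all_threads=True)
    # Step 2: forward the collected lines, then drop the temporary file.
    with open(path) as trace_file:
        for line in trace_file:
            upload(line.rstrip("\n"))
    os.remove(path)
    return path


uploaded = []  # a list stands in for the remote log sink
collect_then_upload(tempfile.mkdtemp(), uploaded.append)
```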

By default, stack traces are collected every 10 minutes. To change the interval between two stack trace collection events, add the following configuration:

```
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True,
    stack_trace_interval_seconds=300)
```
This configuration collects stack traces every 5 minutes and uploads them to the cloud.
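
To summarize how the three parameters interact, here is a small self-contained sketch that maps a parameter combination to the behavior described in the scenarios above. The function `describe_stack_trace_behavior` is hypothetical and only restates this README; the 600-second default comes from the "every 10 minutes" statement, and `stack_trace_to_cloud` is kept as a required argument because the README does not state its default.

```python
DEFAULT_INTERVAL_SECONDS = 600  # 10 minutes, the default stated in the README


def describe_stack_trace_behavior(
    collect_stack_trace,
    stack_trace_to_cloud,
    stack_trace_interval_seconds=DEFAULT_INTERVAL_SECONDS,
):
    """Return a one-line summary of what a given configuration does.

    Hypothetical helper for illustration; not part of cloud-tpu-diagnostics.
    """
    if not collect_stack_trace:
        # Scenario 1: collection is switched off entirely.
        return "no stack traces collected"
    if stack_trace_to_cloud:
        # Scenario 3: staged on the TPU host, then uploaded.
        destination = "/tmp/debugging, then Google Cloud Logging"
    else:
        # Scenario 2: shown on the console only.
        destination = "console (stderr)"
    minutes = stack_trace_interval_seconds / 60
    return f"traces every {minutes:g} min -> {destination}"
```

For example, `describe_stack_trace_behavior(True, True, 300)` describes the 5-minute cloud configuration shown above.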

Then, create the configuration object for debug:

```
debug_config = debug_configuration.DebugConfig(
    stack_trace_config=stack_trace_config)
```

Then, create the configuration object for diagnostic:

```
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
    debug_config=debug_config)
```

Finally, call `diagnose()` in a `with` statement and place inside the context manager the statements for which you want to collect stack traces.

```
with diagnostic.diagnose(diagnostic_config):
    run_job(...)
```
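
The README does not show what happens inside `diagnose()`. As a hedged sketch of the general pattern, a context manager can turn trace collection on when the block is entered and always turn it off when the block exits, even if the job raises. The name `diagnose_sketch` and its parameters are hypothetical, and the body uses the standard `faulthandler` module as an assumed stand-in for the package's internals.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def diagnose_sketch(collect_stack_trace, interval_seconds=600, stream=None):
    """Hypothetical stand-in for a diagnose()-style context manager."""
    if not collect_stack_trace:
        # Collection disabled: run the wrapped block unchanged.
        yield
        return
    out = sys.stderr if stream is None else stream
    faulthandler.enable(file=out, all_threads=True)
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=out)
    try:
        yield  # the statements inside the `with` block run here
    finally:
        # Always undo the setup, even if the wrapped job raised.
        faulthandler.cancel_dump_traceback_later()
        faulthandler.disable()
```

Scoping the setup and teardown to the `with` block means only the wrapped statements are monitored, which matches the usage shown above.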
