[WIP] Add fine tuning dataset creation from docs #52

Draft
wants to merge 1 commit into
base: master
Add dataset creation from docs
Shreyanand committed May 25, 2023

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
commit 2f07c370ca0bacf26f88070450a20583dd293bfe
206 changes: 195 additions & 11 deletions notebooks/create-validation-dataset.ipynb
@@ -3,15 +3,17 @@
{
"cell_type": "markdown",
"id": "51a2b01a-6ed3-40ba-91e1-9ce499d11a07",
"metadata": {},
"metadata": {
"tags": []
},
"source": [
"## Validation dataset\n",
"This notebook takes the FAQ questionnaire from the ROSA workshop documents and creates a fine-tuning or validation dataset for text generation models."
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 3,
"id": "42bf2292-252e-4bd7-8fe3-a93ba7134a0a",
"metadata": {},
"outputs": [],
@@ -22,7 +24,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 4,
"id": "73e7542a-45a2-4ce7-aa74-a8a5317e08af",
"metadata": {
"tags": []
@@ -71,7 +73,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 5,
"id": "9e3bd68b-823d-4333-9985-e423c7fb2a0d",
"metadata": {},
"outputs": [
@@ -162,8 +164,8 @@
"</div>"
],
"text/plain": [
" Question \\\n",
"0 What is Red Hat OpenShift Service on AWS (ROSA)? \n",
" Question \n",
"0 What is Red Hat OpenShift Service on AWS (ROSA)? \\\n",
"1 Where can I go to get more information/details? \n",
"2 What are the benefits of Red Hat OpenShift Service on AWS (Key Features)? \n",
"3 What are the differences between Red Hat OpenShift Service on AWS and Kubernetes? \n",
@@ -191,7 +193,7 @@
"[65 rows x 2 columns]"
]
},
"execution_count": 3,
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@@ -204,26 +206,208 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 6,
"id": "2dc72548-849e-4dd1-ae45-d66d3a197708",
"metadata": {},
"outputs": [],
"source": [
"validation_set.to_csv('../data/processed/validation_data.csv')"
]
},
{
"cell_type": "markdown",
"id": "f9c2ea8a-c693-425e-bf7f-bba4bf1ab69d",
"metadata": {},
"source": [
"## Create question-answer pairs from the documentation dataset"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "6ff9a273-6fbf-46a6-91a3-5424cbd05be1",
"metadata": {},
"outputs": [],
"source": [
"import markdown\n",
"\n",
"# Open the Markdown file and read its contents\n",
"with open(\"../data/external/rosaworkshop/1-account_setup.md\", \"r\") as file:\n",
"    md_text = file.read()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "177fcd51-d1a8-4680-8dfa-6193ccee7467",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'There are currently two supported credential methods when creating a ROSA cluster. One method uses an IAM user with the *AdministratorAccess* policy (only for the account using ROSA). The other, more recent, and **recommended** method uses AWS STS. Please see the section \"[ROSA with STS Explained](15-sts_explained.md)\" for a detailed explanation. In this workshop we will only be using the STS method.\\n\\n## Prerequisites\\n\\nPlease review the prerequisites found in the documentation at [Prerequisites for ROSA w/STS](https://docs.openshift.com/rosa/rosa_planning/rosa-sts-aws-prereqs.html) before getting started.\\n\\n\\nYou will need the following pieces of information from your AWS account:\\n\\n- AWS IAM User\\n- AWS Access Key ID\\n- AWS Secret Access Key\\n\\n### A Red Hat account\\nIf you do not have a Red Hat account, create one here <https://console.redhat.com/>. Accept the required terms and conditions. Then check your email for a verification link.\\n\\n### Install the AWS CLI\\n[Install the AWS CLI](https://aws.amazon.com/cli/) as per your operating system.\\n\\n### Enable ROSA\\nComplete this step if you have *not* enabled ROSA in your AWS account.\\n\\n- Visit <https://console.aws.amazon.com/rosa> to enable your account to use ROSA.\\n- Click on the orange \"Enable OpenShift\" button on the right.\\n\\n ![Enable](images/1-enable.png)\\n\\n- It will take about a minute and then you will see a green \"service enabled\" bar at the top.\\n\\n ![Enabled](images/1-enabled.png)\\n\\n### Install the ROSA CLI\\n- Install the [ROSA CLI](https://console.redhat.com/openshift/downloads) as per your operating system.\\n- Download and extract the relevant file for your operating system\\n - ex: `tar -xvf rosa-linux.tar.gz`\\n- Save it to a location within your \"PATH\".\\n - ex: `sudo mv rosa /usr/local/bin/rosa`\\n- Run `rosa version` to make sure it works and that it returns the version number.\\n\\n### Install the OpenShift CLI\\nThere are a few ways to 
install the `oc` CLI:\\n\\n1. If you have the `rosa` CLI installed, the simplest way is to run `rosa download oc`\\n 1. Once downloaded, untar (or unzip) the file and move the executables into a directory in your PATH\\n1. Or, you can [download and install](https://docs.openshift.com/container-platform/latest/cli_reference/openshift_cli/getting-started-cli.html#installing-openshift-cli) the latest OpenShift CLI (oc) \\n1. Or, if you already have an OpenShift cluster you can access the command line tools page by clicking on the *Question mark > Command Line Tools*. Then download the relevant one for your operating system.\\n\\n ![CLI Tools](images/0-cli_tools_page.png)\\n\\n**Why use `oc` over `kubectl`**<br>\\nBeing Kubernetes, one can definitely use `kubectl` with their OpenShift cluster. `oc` is specific to OpenShift in that it includes the standard set of features from `kubectl` plus additional support for OpenShift functionality. See [Usage of oc and kubectl commands](https://docs.openshift.com/container-platform/latest/cli_reference/openshift_cli/usage-oc-kubectl.html) for more details.\\n\\n### Configure the AWS CLI\\nIf you\\'ve just installed the AWS CLI, or simply want to make sure it is using the correct AWS account, follow these steps in a terminal:\\n\\n1. Enter `aws configure` in the terminal\\n1. Enter the AWS Access Key ID and press enter\\n1. Enter the AWS Secret Access Key and press enter\\n1. Enter the default region you want to deploy into\\n1. Enter the output format you want (“table” or “json”). For this guide you can choose “table” as it is easier to read but either is fine.\\n\\n It should look like the following as an example:\\n\\n $ aws configure\\n AWS Access Key ID: AKIA0000000000000000\\n AWS Secret Access Key: NGvmP0000000000000000000000000\\n Default region name: us-east-1\\n Default output format: table\\n\\n\\n### Verify the configuration\\nVerify that the configuration is correct.\\n\\n1. 
Run the following command to query the AWS API \\n\\n aws sts get-caller-identity\\n\\n2. You should see a table (or JSON if that’s what you set it to above) like the below. Verify that the account information is correct.\\n\\n $ aws sts get-caller-identity\\n ------------------------------------------------------------------------------\\n | GetCallerIdentity |\\n +--------------+----------------------------------------+--------------------+\\n | Account | Arn | UserId |\\n +--------------+----------------------------------------+--------------------+\\n | 000000000000| arn:aws:iam::00000000000:user/myuser | AIDA00000000000000|\\n +--------------+----------------------------------------+--------------------+\\n\\n\\n### Ensure the ELB service role exists\\nMake sure that the service role for ELB already exists, otherwise the cluster deployment could fail. As such, run the following to check for the role and create it if it is missing.\\n\\n aws iam get-role --role-name \"AWSServiceRoleForElasticLoadBalancing\" || aws iam create-service-linked-role --aws-service-name \"elasticloadbalancing.amazonaws.com\"\\n\\nIf you received an error during cluster creation like below, then the above should correct it.\\n\\n Error: Error creating network Load Balancer: AccessDenied: User: arn:aws:sts::970xxxxxxxxx:assumed-role/ManagedOpenShift-Installer-Role/163xxxxxxxxxxxxxxxx is not authorized to perform: iam:CreateServiceLinkedRole on resource: arn:aws:iam::970xxxxxxxxx:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing\"\\n\\n### Log into your Red Hat account\\n1. Enter `rosa login` in a terminal.\\n2. It will prompt you to open a web browser and go to:\\n\\n <https://console.redhat.com/openshift/token/rosa>\\n\\n3. If you are asked to log in, then please do.\\n4. Click on the \"Load token\" button.\\n5. Copy the token and paste it back into the CLI prompt and press enter. 
Alternatively, you can just copy the full `rosa login --token=abc...` command and paste that in the terminal.\\n\\n ![CLI Tools](images/1-token.png)\\n\\n### Verify credentials\\nVerify that all the credentials set up are correctly.\\n\\n1. Run `rosa whoami`\\n\\n You should see an output like below:\\n\\n AWS Account ID: 000000000000\\n AWS Default Region: us-east-2\\n AWS ARN: arn:aws:iam::000000000000:user/myuser\\n OCM API: https://api.openshift.com\\n OCM Account ID: 1DzGIdIhqEWy000000000000000\\n OCM Account Name: Your Name\\n OCM Account Username: [email protected]\\n OCM Account Email: [email protected]\\n OCM Organization ID: 1HopHfA20000000000000000000\\n OCM Organization Name: Red Hat\\n OCM Organization External ID: 0000000\\n\\n2. Please check all information for accuracy before proceeding.\\n\\n### Verify quota\\nVerify that your AWS account has ample quota in the region you will be deploying your cluster to. Run the following:\\n\\n rosa verify quota\\n\\nShould return a response like\\n\\n I: Validating AWS quota...\\n I: AWS quota ok. If cluster installation fails, validate actual AWS resource usage against https://docs.openshift.com/rosa/rosa_getting_started/rosa-required-aws-service-quotas.html\\n\\nSee [the documentation](https://docs.openshift.com/rosa/rosa_planning/rosa-sts-required-aws-service-quotas.html) for more details regarding quotas.\\n\\n### Verify `oc` CLI\\nVerify that the `oc` CLI is installed correctly\\n\\n rosa verify openshift-client\\n\\nWe have now successfully set up our account and environment and are ready to deploy our cluster.\\n\\n### Cluster Deployment\\nIn the next section you will deploy your cluster. There are two mechanisms to do so:\\n\\n- Using the ROSA CLI\\n- Using the OCM Web User Interface\\n\\nEither way is perfectly fine for the purposes of this workshop. 
Though keep in mind that if you are using the OCM UI, there will be a few extra steps to set it up in order to deploy into your AWS account for the first time. This will not need to be repeated for subsequent deployments using the OCM UI for the same AWS account.\\n\\nPlease select the desired mechanism in the left menu under \"Deploy the cluster\".\\n\\n*[ROSA]: Red Hat OpenShift Service on AWS\\n*[STS]: AWS Security Token Service\\n*[OCM]: OpenShift Cluster Manager\\n'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"md_text"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cf9cfa45-d0e0-407f-8703-6786c7f70d00",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"import re\n",
"import time\n",
"\n",
"import nltk\n",
"import pandas as pd\n",
"from dotenv import find_dotenv, load_dotenv\n",
"from langchain.chains.qa_with_sources import load_qa_with_sources_chain\n",
"from langchain.chains.question_answering import load_qa_chain\n",
"from langchain.docstore.document import Document\n",
"from langchain.document_loaders import DirectoryLoader, TextLoader\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.llms import OpenAI\n",
"from langchain.prompts import PromptTemplate\n",
"from langchain.text_splitter import MarkdownTextSplitter\n",
"from langchain.vectorstores import Chroma\n",
"\n",
"pd.set_option('display.max_colwidth', None)\n",
"\n",
"# Load environment variables from credentials.env\n",
"load_dotenv(find_dotenv(\"credentials.env\"), override=True)\n",
"os.environ[\"LANGCHAIN_TRACING\"] = \"true\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60c2da7a-2d4b-41b0-8109-d62a3142c88a",
"metadata": {},
"outputs": [],
"source": [
"# NOTE: `docs` and `query` are not defined in any cell added by this change; define them before running this cell.\n",
"chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type=\"stuff\")\n",
"answer = chain({\"input_documents\": docs, \"question\": query}, return_only_outputs=True)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "56be814b-ee55-469a-be51-bd265fba454b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import PromptTemplate\n",
"from langchain.llms import OpenAI\n",
"\n",
"llm = OpenAI(temperature=0, max_tokens=-1)\n",
"prompt = PromptTemplate(\n",
"    input_variables=[\"md\"],\n",
"    template=\"{md} \n List and describe in detail the 15 major points covered in this guide. Write 100 words for each point\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "868d95be-247e-4a03-bddd-917cb26351fa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"OpenAI(cache=None, verbose=False, callback_manager=<langchain.callbacks.shared.SharedCallbackManager object at 0x7f15fa1d9610>, client=<class 'openai.api_resources.completion.Completion'>, model_name='text-davinci-003', temperature=0.0, max_tokens=-1, top_p=1, frequency_penalty=0, presence_penalty=0, n=1, best_of=1, model_kwargs={}, openai_api_key=None, openai_api_base=None, openai_organization=None, batch_size=20, request_timeout=None, logit_bias={}, max_retries=6, streaming=False, allowed_special=set(), disallowed_special='all')"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm"
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "2471e322-e7ab-4b9a-a7bd-1f2032002892",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import LLMChain\n",
"chain = LLMChain(llm=llm, prompt=prompt)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "b6d6062f-518b-4535-92c3-41f97576573d",
"metadata": {},
"outputs": [],
"source": [
"ans = chain.run(md_text)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "ff4ffca5-8823-4c41-a50a-f00fbc2f24d4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
".\n",
"\n",
"1. Credential Methods: There are two supported credential methods when creating a ROSA cluster. The first method uses an IAM user with the *AdministratorAccess* policy and the second, more recent, and recommended method uses AWS STS.\n",
"2. Prerequisites: Before getting started, it is important to review the prerequisites found in the documentation at [Prerequisites for ROSA w/STS](https://docs.openshift.com/rosa/rosa_planning/rosa-sts-aws-prereqs.html). This includes having a Red Hat account, installing the AWS CLI, enabling ROSA, installing the ROSA CLI, and installing the OpenShift CLI.\n",
"3. Configure the AWS CLI: After installing the AWS CLI, it is important to configure it with the correct AWS Access Key ID, AWS Secret Access Key, default region, and output format.\n",
"4. Ensure the ELB Service Role Exists: It is important to make sure that the service role for ELB already exists, otherwise the cluster deployment could fail. As such, it is important to check for the role and create it if it is missing.\n",
"5. Log into your Red Hat Account: To log into your Red Hat account, you must enter `rosa login` in a terminal and follow the instructions to copy the token and paste it back into the CLI prompt.\n",
"6. Verify Credentials: After logging into your Red Hat account, it is important to verify that all the credentials set up are correctly by running `rosa whoami`.\n",
"7. Verify Quota: It is important to verify that your AWS account has ample quota in the region you will be deploying your cluster to by running `rosa verify quota`.\n",
"8. Verify `oc` CLI: To verify that the `oc` CLI is installed correctly, run `rosa verify openshift-client`.\n",
"9. Cluster Deployment: There are two mechanisms to deploy your cluster: using the ROSA CLI or using the OCM Web User Interface.\n",
"10. Create Cluster: To create a cluster, you must run `rosa create cluster` and provide the necessary parameters.\n",
"11. Configure Cluster: After creating the cluster, you must configure it by running `rosa configure cluster` and providing the necessary parameters.\n",
"12. Verify Cluster: To verify that the cluster is running correctly, run `rosa verify cluster`.\n",
"13. Access Cluster: To access the cluster, you must run `rosa access cluster` and provide the necessary parameters.\n",
"14. Delete Cluster: To delete the cluster, you must run `rosa delete cluster` and provide the necessary parameters.\n",
"15. Troubleshooting: If you encounter any issues while creating, configuring, or accessing the cluster, you can refer to the ROSA troubleshooting guide for help.\n"
]
}
],
"source": [
"print(ans)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7fda387-277c-49a8-9a6b-2658cfce2be1",
"id": "073c3a04-3364-4499-a475-6372551eab2c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "Python 3.9.14",
"language": "python",
"name": "python3"
},
@@ -237,7 +421,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.9.14"
}
},
"nbformat": 4,