diff --git a/README.md b/README.md
index c3b0eeb3..1a0cf5a9 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
-GreenplumPython is a Python library that enables the user to interact with Greenplum in a Pythonic way.
+GreenplumPython is a Python library that enables the user to interact with a database in a Pythonic way.
GreenplumPython provides a [pandas](https://pandas.pydata.org/)-like DataFrame API that
1. looks familiar and intuitive to Python users
diff --git a/doc/source/db.rst b/doc/source/db.rst
index 74adc243..206d8e12 100644
--- a/doc/source/db.rst
+++ b/doc/source/db.rst
@@ -1,6 +1,5 @@
Database
========
-.. module:: greenplumpython
.. automodule:: db
:members:
diff --git a/doc/source/index.rst b/doc/source/index.rst
index bd5ffe4d..a31fcd1f 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -19,6 +19,7 @@ There are explanations about the implementation and examples.
:maxdepth: 2
:caption: Contents:
- install
+ req
+ req_advanced
tutorials
modules
diff --git a/doc/source/install.rst b/doc/source/install.rst
deleted file mode 100644
index dcf343fd..00000000
--- a/doc/source/install.rst
+++ /dev/null
@@ -1,54 +0,0 @@
-Requirements
-============
-
-GreenplumPython currently requires at least Python 3.9 to run, this is because:
- * Python 3.9 is the version we officially support and release with PL/Python3 and GPDB 6.
- * Python 3.9 is the default version in Rocky Linux 9 and is officially supported in Rocky Linux 8 (and also probably in RHEL 8 as well).
-
-GreenplumPython requires `plpython3 `_
-extension to be installed on Greenplum/Postgres.
-
-`dill `_ as an optional dependency for GreenplumPython `plpython` side,
-which provides convenient features like auto-importing modules in the `plpython` functions. (auto-import is available even when dill is NOT installed on server.
-`dill` is require to include outside dependencies in the same file/module, like functions or classes.)
-
-To install `dill` or any other python modules on the `plpython` side, refer to `GPDB plpython document `_ for more details.
-
-Installation
-============
-
-Outside Virtual Environments
-----------------------------
-
-You can install latest release of the **GreenplumPython** library with pip3:
-
-.. code-block:: bash
-
- pip3 install --user greenplum-python
-
-To install the latest development version, do
-
-.. code-block:: bash
-
- pip3 install --user git+https://github.com/greenplum-db/GreenplumPython
-
-Inside a Virtual Environment
-----------------------------
-
-You can install latest release of the **GreenplumPython** library with pip3:
-
-.. code-block:: bash
-
- pip3 install greenplum-python
-
-or to install the latest development version:
-
-.. code-block:: bash
-
- pip3 install git+https://github.com/greenplum-db/GreenplumPython
-
-The `--user` option in an active virtual environment will install to the local user python location.
-Since a user location doesn't make sense for a virtual environment, to install the **GreenplumPython** library,
-just remove `--user` from the above commands.
-
-
diff --git a/doc/source/notebooks/embedding.ipynb b/doc/source/notebooks/embedding.ipynb
index f8b0cdb6..4b5f90f7 100644
--- a/doc/source/notebooks/embedding.ipynb
+++ b/doc/source/notebooks/embedding.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# (Experimental) Generating, Indexing and Searching Embeddings\n",
+ "# Generating, Indexing and Searching Embeddings (Experimental)\n",
"\n",
"**WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.**\n",
"\n",
@@ -314,7 +314,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.9.13"
+ "version": "3.9.18"
}
},
"nbformat": 4,
diff --git a/doc/source/notebooks/package.ipynb b/doc/source/notebooks/package.ipynb
index 98b2a5b6..4d4fec18 100644
--- a/doc/source/notebooks/package.ipynb
+++ b/doc/source/notebooks/package.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# (Experimental) Installing Python Packages on Server without Internet Access\n",
+ "# Installing Python Packages on Server without Internet (Experimental)\n",
"\n",
"**WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.**\n",
"\n",
@@ -18,18 +18,20 @@
"\n",
"All these happen automatically and the user only need to declare what packages are needed.\n",
"\n",
- "In this way, as long as there is a database connection on a client with Internet access, the user can easily install the required packages, even if the database server cannot access the Internet by itself."
+ "In this way, as long as there is a database connection on a client with Internet access, the user can easily install the required packages, even if the database server cannot access the Internet by itself.\n",
+ "\n",
+ "**NOTE: This function only installs packages on the server host that GreenplumPython directly connects to. If your database server spreads across multiple hosts, additional operations are required to make the packages available on all hosts.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## (Optional) Prerequisite: Setting-up an NFS Mount for Cluster\n",
+ "## (Optional) Prerequisite: Sharing Python Environments in a Cluster with NFS\n",
"\n",
"Setting up a NFS mount makes it easier to share a Python environment on multiple hosts and containers.\n",
"\n",
- "This is important for distributed database systems such as [Greenplum](https://greenplum.org/) because otherwise the same set of pcakges need to be installed on every host in the cluster.\n",
+ "This is important for distributed database systems such as [Greenplum](https://greenplum.org/) because otherwise the same set of packages needs to be copied to every host in the cluster.\n",
"\n",
"### Starting an NFS server\n",
"\n",
@@ -37,7 +39,7 @@
"\n",
"For how to do this, please refer to the documentation of the OS. For example, if you are using [Rocky Linux](https://rockylinux.org/), you might want to refer to [the NFS page](https://docs.rockylinux.org/guides/file_sharing/nfsserver/).\n",
"\n",
- "### Mount a Python environment with NFS\n",
+ "### Mount a Python environment with NFS on Each Host\n",
"\n",
"Next, we can mount a Python environment with NFS and share it to all hosts in the cluster. \n",
"\n",
diff --git a/doc/source/op.rst b/doc/source/op.rst
index 621adef2..12693837 100644
--- a/doc/source/op.rst
+++ b/doc/source/op.rst
@@ -1,6 +1,6 @@
Operators and Indexes
======================
-.. module:: greenplumpython
+module:: greenplumpython
.. automodule:: op
:members:
diff --git a/doc/source/req.rst b/doc/source/req.rst
new file mode 100644
index 00000000..a89bc1bb
--- /dev/null
+++ b/doc/source/req.rst
@@ -0,0 +1,61 @@
+Requirements
+============
+
+On Client
+---------
+
+On the client side, e.g., on our laptop or workstation, installing the `greenplum-python` Python package is all we need:
+
+.. code-block:: bash
+
+ python3 -m pip install greenplum-python
+
+This installs the latest released version. To try the latest development version, we can install it with
+
+.. code-block:: bash
+
+ python3 -m pip install git+https://github.com/greenplum-db/GreenplumPython
+
+Please note that the Python version needs to be at least 3.9 to install.
+
+On Server
+---------
+
+GreenplumPython works best with Greenplum. All features will be developed and tested on Greenplum first.
+
+We also try our best to support PostgreSQL and other PostgreSQL-derived databases, but some features might **NOT** be available when working with them.
+
+.. _Getting Started:
+
+Getting Started
+^^^^^^^^^^^^^^^
+
+To get started, all we need is a database that we have the permission to access.
+
+After connecting to the database, we can create :class:`~dataframe.DataFrame` s and manipulate them like using `pandas `_.
+
+.. _Creating Functions:
+
+Creating Functions
+^^^^^^^^^^^^^^^^^^
+
+Even though we can call existing functions in the database to manipulate DataFrames, sometimes they might not fit our needs and we need to create new UDFs.
+
+To create a UDF, we need to install the PL/Python package on the server and enable it in the database with SQL:
+
+.. code-block:: sql
+
+ CREATE EXTENSION plpython3u;
+
+There are a few points to note when working with PL/Python:
+
+- To use the extension :code:`plpython3u`, we need to log in as a :code:`SUPERUSER`.
+ This might cause some security concerns. We will remove this limitation soon by supporting
+ `PL/Container `_.
+- Python 3.x is required on server. And it is recommended that the Python version on server
+ is greater than or equal to the one on client. This is to ensure all Python features are available
+ when writing UDFs.
+
+With all of the above set up, we are ready to go through the :doc:`tutorial <./sql>` to see how GreenplumPython compares with SQL.
+
+For other, more advanced, features, please refer to :doc:`./req_advanced`.
diff --git a/doc/source/req_advanced.rst b/doc/source/req_advanced.rst
new file mode 100644
index 00000000..b6dd32e8
--- /dev/null
+++ b/doc/source/req_advanced.rst
@@ -0,0 +1,70 @@
+Requirements on Server for Advanced Features
+============================================
+
+Using Non-Built-in Modules in a UDF
+-----------------------------------
+
+Modules installed in :code:`sys.path` on server will be available for use in a UDF. It is recommended to use a dedicated virtual
+environment on server for UDFs. To achieve this, one way is to activate the environment before starting the database server.
+For example, for PostgreSQL:
+
+ .. code-block:: bash
+
+ python3 -m venv /path/to/venv
+ source /path/to/venv/bin/activate
+ pg_ctl start
+
+In this way, UDFs executed in the PostgreSQL server can only use packages installed in the new virtual environment. This avoids
+polluting, or being polluted by, the system environment.
+
+Defining Classes and Functions Outside UDFs
+-------------------------------------------
+
+GreenplumPython will use the `dill` pickler to serialize and deserialize UDFs if it is available.
+Using a pickler like `dill` makes UDFs easier to write and to maintain because it allows us to refer to a function or class
+defined outside of the UDF. This means we don't need to copy it around. To use dill, we need to
+
+ - Make sure that the Python minor version on client equals the one on server;
+ - Make sure that the dill version on server is no less than the one on client, based on
+ `dill's statement `_ on backward compatibility.
+
+With all in
+
+- the `Using Non-Built-in Modules in a UDF`_ section and
+- the `Defining Classes and Functions Outside UDFs`_ section
+
+set up, we are now ready to go through the :doc:`tutorial <./abalone>` on how to do Machine Learning (ML) in database with UDFs.
+
+Creating and Searching Embeddings (Experimental)
+------------------------------------------------
+
+Embeddings enable us to search unstructured data, e.g. texts and images, based on semantic similarity.
+
+To create and search embeddings, we will need all in :doc:`./req`, plus the
+`sentence-transformers `_ package installed
+in the server's Python environment.
+
+Please refer to the :doc:`tutorial <./tutorial_embedding>` for a simple working example to validate your setup.
+
+Uploading Data Files from Localhost (Experimental)
+--------------------------------------------------
+
+With GreenplumPython, we can upload data files of any format from localhost to server and parse them with a UDF.
+
+This feature requires all in :doc:`./req` to create UDFs.
+
+Please refer to the doc of :meth:`DataFrame.from_files() ` for detailed usage.
+
+Installing Python Packages (Experimental)
+-----------------------------------------
+
+With GreenplumPython, we can upload packages from localhost and install them on server.
+
+This can greatly simplify the process when the server cannot access the PyPI service directly.
+
+Since the installation is done by executing a UDF on server, this feature requires all in :doc:`./req`.
+
+Please refer to
+
+- the doc of :meth:`Database.install_packages() ` for detailed usage, and
+- the :doc:`tutorial <./tutorial_package>` for a simple working example.
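Review note on doc/source/req_advanced.rst: the two dill requirements (same Python minor version on client and server, and a dill version on server no older than the one on client) can be checked mechanically. The following is only a minimal sketch of such a check. The helper names `parse_version` and `dill_setup_ok` are hypothetical and not part of GreenplumPython, and the version strings are assumed to be plain dotted numbers that the caller has already collected from both sides.

```python
def parse_version(text: str) -> tuple[int, ...]:
    # Assumption: a plain dotted numeric version such as "0.3.7".
    # Pre-release or local version tags are not handled in this sketch.
    return tuple(int(part) for part in text.strip().split("."))


def dill_setup_ok(
    client_python: str, server_python: str, client_dill: str, server_dill: str
) -> bool:
    # Requirement 1: Python major and minor versions are identical on both sides.
    same_minor = parse_version(client_python)[:2] == parse_version(server_python)[:2]
    # Requirement 2: dill on server is no older than dill on client.
    server_dill_new_enough = parse_version(server_dill) >= parse_version(client_dill)
    return same_minor and server_dill_new_enough
```

For example, a client on Python 3.9.18 and a server on Python 3.9.13 satisfy the first requirement, because only the major and minor components are compared.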
diff --git a/doc/source/tutorial_embedding.rst b/doc/source/tutorial_embedding.rst
index fdcb879c..7dd69c05 100644
--- a/doc/source/tutorial_embedding.rst
+++ b/doc/source/tutorial_embedding.rst
@@ -2,6 +2,6 @@
.. toctree::
:maxdepth: 2
- :caption: (Experimental) Generating, Indexing and Searching Embeddings
+ :caption: Generating, Indexing and Searching Embeddings (Experimental)
notebooks/embedding
diff --git a/doc/source/tutorial_package.rst b/doc/source/tutorial_package.rst
index 22ab5dc9..b4c46e17 100644
--- a/doc/source/tutorial_package.rst
+++ b/doc/source/tutorial_package.rst
@@ -2,6 +2,6 @@
.. toctree::
:maxdepth: 2
- :caption: (Experimental) Installing Python Packages on Server without Internet Access
+ :caption: Installing Python Packages on Server without Internet (Experimental)
notebooks/package
\ No newline at end of file
diff --git a/greenplumpython/dataframe.py b/greenplumpython/dataframe.py
index eba3772f..e8be2dd1 100644
--- a/greenplumpython/dataframe.py
+++ b/greenplumpython/dataframe.py
@@ -1230,7 +1230,12 @@ def embedding(self) -> "Embedding":
"""
Enable embedding-based similarity search on columns of the current :class:`~DataFrame`.
- See :ref:`tutorial-embedding` for more details.
+ Example:
+ See :ref:`tutorial-embedding` for more details.
+
+ Warning:
+ This function is currently **experimental** and the interface is
+ subject to change.
"""
raise NotImplementedError(
"Please import greenplumpython.experimental.embedding to load the implementation."
@@ -1242,7 +1247,8 @@ def from_files(cls, files: list[str], parser: "NormalFunction", db: Database) ->
Create a DataFrame with data read from files.
Args:
- files: list of file paths.
+ files: list of file paths. Each path ends with the path of the
+ same file on client, without links resolved.
parser: a UDF that parses the given files on server. The UDF is required to
- take the file path as its only argument and
- returns a set of parsed records in the returing DataFrame.
@@ -1250,6 +1256,10 @@ def from_files(cls, files: list[str], parser: "NormalFunction", db: Database) ->
Returns:
DataFrame containing the parsed data from the given files.
+
+ Warning:
+ This function is currently **experimental** and the interface is
+ subject to change.
"""
raise NotImplementedError(
"Please import greenplumpython.experimental.file to load the implementation."
diff --git a/greenplumpython/db.py b/greenplumpython/db.py
index f06d7cb1..88cfd316 100644
--- a/greenplumpython/db.py
+++ b/greenplumpython/db.py
@@ -257,6 +257,20 @@ def install_packages(self, requirements: str) -> None:
Example:
See :ref:`tutorial-package` for more details.
+
+ Note:
+ This function only installs packages on the server host that
+ GreenplumPython directly connects to. If your database server
+ spreads across multiple hosts, additional operations are required
+ to make the packages available on all hosts.
+
+ One simple way to achieve this is to set up an NFS share on all
+ hosts. Please refer to :ref:`tutorial-package` for a simple working
+ example.
+
+ Warning:
+ This function is currently **experimental** and the interface is
+ subject to change.
"""
raise NotImplementedError(
"Please import greenplumpython.experimental.file to load the implementation."
diff --git a/greenplumpython/experimental/file.py b/greenplumpython/experimental/file.py
index f291dffd..d488d933 100644
--- a/greenplumpython/experimental/file.py
+++ b/greenplumpython/experimental/file.py
@@ -35,12 +35,12 @@ def _extract_files(tmp_archive_name: str, returning: str) -> list[str]:
tmp_archive.extractall(str(extracted_root))
tmp_archive_path.unlink()
if returning == "root":
- yield str(extracted_root.resolve())
+ yield str(extracted_root)
else:
assert returning == "files"
for path in extracted_root.rglob("*"):
if path.is_file() and not path.is_symlink():
- yield str(path.resolve())
+ yield str(path)
def _archive_and_upload(tmp_archive_name: str, files: list[str], db: gp.Database):
@@ -98,6 +98,15 @@ def _install_on_server(pkg_dir: str, requirements: str) -> str:
import sys
assert sys.executable, "Python executable is required to install packages."
+ try:
+ exec_version = sp.check_output([sys.executable, "--version"], text=True, stderr=sp.STDOUT)
+ except sp.CalledProcessError as e:
+ raise Exception(e.stdout)
+
+ lib_version = f"Python {sys.version_info.major}.{sys.version_info.minor}."
+ assert exec_version.startswith(
+ lib_version
+ ), f"Python major and minor versions mismatch (executable {exec_version}, library {lib_version})"
cmd = [
sys.executable,
"-m",
@@ -135,7 +144,7 @@ def _install_packages(db: gp.Database, requirements: str):
sp.check_output(cmd, text=True, stderr=sp.STDOUT, input=requirements)
except sp.CalledProcessError as e:
raise e from Exception(e.stdout)
- _archive_and_upload(tmp_archive_name, [local_dir.resolve()], db)
+ _archive_and_upload(tmp_archive_name, [local_dir], db)
extracted = db.apply(lambda: _extract_files(tmp_archive_name, "root"), column_name="cache_dir")
assert len(list(extracted)) == 1
server_dir = (