Home
Our data infrastructure aims to curate, process, and provide access to the depths of bitcoin's technical ecosystem.
Additional Resources:
- Centralized Access: Aggregate and organize data for efficient user and system access.
- Discoverability: Leverage topics and metadata for advanced search and contextual exploration.
- Automation: Automate ingestion, processing, and enrichment tasks, such as summarization and topic extraction, to ensure up-to-date, actionable data.
- Scalability: Build adaptable systems to accommodate growing sources and evolving needs, while maintaining flexibility to integrate new tools and use cases.
- Continuous Improvement: Monitor performance and iterate based on feedback and metrics.
These are cross-cutting concerns and challenges that affect all stages of the data pipeline:
- Audit and Document the Data Pipeline
- Streamline Data Ingestion and Processing
- Develop Metrics to Guide Improvements
- Improve Information Retrieval
- Monitor System Health and Reliability
Each action item is prefixed with a label that represents its current status: [Backlog], [In Progress], or [Done].
Audit and Document the Data Pipeline
Analyze and document the scattered components of the data pipeline to uncover inconsistencies and define a unified framework for understanding our data infrastructure, enabling better decision-making.
- [Done] Create a Sources Explorer for easy access to available sources, resources, and individual documents.
  - [Done] Enhanced with PR#150
  - [In Progress] Replace `/sources` with `/explore` page, generalizing document exploration to include sources, authors, and tags. PR#155
- [In Progress] Create a point of reference for the data infrastructure at bitcoinsearch/infrastructure.
- [In Progress] Address terminology inconsistencies across the infrastructure to simplify understanding and documentation.
  - See Proposal for Terminology Standardization.
  - scraperV2 is using this new terminology.
Streamline Data Ingestion and Processing
Establish a flexible and modular framework for ingesting and managing diverse data sources efficiently. This will create a more scalable and maintainable infrastructure for future growth.
- [Done] Introduced scraperV2 with PR#81
- [Done] Add support for Bitcoin Core PR Review Club
- [Done] Add new StackExchange scraper
- [In Progress] Activate scraperV2
- [Backlog] Assess support for different types of sources: research papers, release docs, awesome-x pages, Twitter threads, Medium
Develop Metrics to Guide Improvements
The data pipeline has many levers to tweak, which makes it hard to know what is going wrong, what to change, and where to start. Introducing measurable evaluation strategies across the data pipeline will help identify weak points, assess changes, and guide data-driven decisions instead of relying on intuition.
- [In Progress] Research evaluation strategies for our RAG pipeline
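As one possible starting point for such an evaluation, the retrieval half of a RAG pipeline can be scored with standard ranking metrics. The sketch below is illustrative only: it assumes a hand-labeled set of relevant document ids per query, and the function names are hypothetical rather than part of any existing tool in this infrastructure.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved document ids that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant document ids found in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging these scores over a fixed set of test queries before and after a change gives a simple way to tell whether the change helped.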
Improve Information Retrieval
Develop strategies and tools to enable more precise and contextually relevant information retrieval, benefiting both current and future products.
- [In Progress] Research chunking strategies
- [Done] Create topics-index as a comprehensive list of topics that can be used across the infrastructure
- [Backlog] Refine topic extraction using the new topics-index and integrate into scraperV2.
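To make the chunking item concrete, here is a minimal sketch of one common baseline: fixed-size word windows with overlap. It is not the strategy chosen for this project (that research is still in progress); the function name and default sizes are assumptions picked for illustration.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of some duplicated text in the index.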
Monitor System Health and Reliability
Ensure the data pipeline's stability and quality through centralized monitoring and automated alerts.
- [Backlog] Central logging for data collection events
- [Backlog] Automated alerts for scraper failures or data collection issues
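A rough sketch of how these two backlog items could fit together: each scraper run is logged as a structured event, and an alert is signaled after repeated failures for the same source. The class, logger name, and threshold are hypothetical; nothing like this exists in the infrastructure yet.

```python
import json
import logging
from collections import defaultdict
from typing import Dict

logger = logging.getLogger("data_collection")  # hypothetical logger name


class CollectionMonitor:
    """Records scraper runs and flags sources that fail repeatedly."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self._consecutive_failures: Dict[str, int] = defaultdict(int)

    def record(self, source: str, success: bool, documents: int = 0) -> bool:
        """Log one run as a JSON event; return True when an alert should fire."""
        event = {"source": source, "success": success, "documents": documents}
        logger.info(json.dumps(event))
        if success:
            # A successful run resets the failure streak for this source.
            self._consecutive_failures[source] = 0
            return False
        self._consecutive_failures[source] += 1
        return self._consecutive_failures[source] >= self.failure_threshold
```

Alerting on consecutive failures rather than on every failure is one way to avoid noise from transient network errors in nightly jobs.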
The data pipeline can be broken down into the following stages that handle the complete lifecycle of data:
- Discover: Identify relevant data sources
- Ingest: Bring data into the system
- Process: Make the data more useful
- Store: Persist and organize the data
- Serve: Provide access points and efficient search/query capabilities
- Consume: Retrieve and use the data to create value
Discover
Focuses on WHAT data we need and where to find it.
CURRENT STATE
- Sources are identified through human input.
- Users can suggest sources for inclusion, but there is no established pipeline for handling those suggestions; including them is manual work.
AREAS FOR IMPROVEMENT and EXPLORATION
- [Backlog] Automate source suggestion workflows.
- [Backlog] Automate source discovery using web crawlers, API monitors, and aggregated lists (e.g., awesome-x pages).
Ingest
Focuses on HOW to bring data into the system.
CURRENT STATE
- Sources Management
  - Sources are managed through a registry of predefined knowledge sources; see scraper/README.md.
  - Scheduling is done using GitHub Actions with nightly cron jobs for periodic updates, while one-off scrapers collect data on an ad hoc basis.
- Data Collection
  - Supported sources via scraperV2
    - Documents in GitHub repositories (GitHub scraper)
    - Websites (scrapy scraper)
  - Original content is stored in its native format in `body_formatted` and identified by `type`.
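To illustrate the idea of a source registry that routes each source to a scraper, here is a minimal sketch. The actual registry format is described in scraper/README.md and may differ; the `Source` fields, the entries, and the `sources_for` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Source:
    name: str     # human-readable source name
    url: str      # where the scraper starts
    scraper: str  # which scraper handles it, e.g. "github" or "scrapy"


# Hypothetical entries for illustration only.
REGISTRY: Dict[str, Source] = {
    "example-docs": Source("example-docs", "https://github.com/example/docs", "github"),
    "example-blog": Source("example-blog", "https://blog.example.com", "scrapy"),
}


def sources_for(scraper: str) -> List[Source]:
    """Return every registered source handled by the given scraper."""
    return [s for s in REGISTRY.values() if s.scraper == scraper]
```

A scheduled job could then iterate over `sources_for("github")` and `sources_for("scrapy")` instead of hard-coding each source.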
AREAS FOR IMPROVEMENT and EXPLORATION
- Health, maintenance, and usage of the `coredev` index that powers the CoreDev bot
Process
Focuses on HOW to make the data more useful.
CURRENT STATE
- Processing only happens on resources from the bitcoin-dev mailing list and Delving Bitcoin.
- Summarization
  - The summarizer creates individual post summaries and combined thread summaries daily using `gpt-4-turbo-preview`.
  - Workflow inefficiencies: Duplicate storage (XML and Elasticsearch), questionable utility of individual summaries, and potential quality issues.
- Topic Extraction
  - Current topic extraction relies on an outdated list of Bitcoin-related topics.
  - Primary and secondary topics as generated by topic extraction are not utilized anywhere on our data infrastructure.
- Embeddings
  - Vector embeddings are generated from the document's title and summary using SentenceTransformer (`intfloat/e5-large-v2`). There's no cost, but the embedding size is fixed at 1024 dimensions.
  - Outputs are stored in Elasticsearch but currently unused.
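Since the stored embeddings are currently unused, the sketch below shows one way they could be put to work: comparing vectors by cosine similarity. It also shows how the title and summary might be combined into the model input. The `passage: ` prefix follows the usage notes published for E5 models; whether the current pipeline applies it is an assumption, and both function names are hypothetical.

```python
import math
from typing import Sequence


def build_embedding_input(title: str, summary: str) -> str:
    """Combine title and summary into one string for the embedding model.

    Assumption: E5 models expect a "passage: " prefix for indexed text.
    The result would then be encoded with something like
    SentenceTransformer("intfloat/e5-large-v2").encode(text).
    """
    return f"passage: {title.strip()}\n{summary.strip()}"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for zero vectors; treat as unrelated
    return dot / (norm_a * norm_b)
```

In practice the similarity search would more likely be delegated to Elasticsearch's vector search than computed in application code; the function is only meant to show what is being compared.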
AREAS FOR IMPROVEMENT and EXPLORATION
- [Backlog] Integrate summarization into scraperV2.
- [Backlog] Refine the summarization logic for individual summaries.
- [Backlog] Consider additional enrichment tasks, such as named entity recognition or relationship mapping.
  - Identify key entities (e.g., people, topics, locations) within data.
- [Backlog] Fix incorrect URL in Combined Summaries (issue#64)
Store
Focuses on HOW to persist and organize the data.
CURRENT STATE
- Elasticsearch stores all data, with full-text search support. Full-text search may be less effective at capturing semantic similarities between concepts.
- Bitcoin Transcripts are managed autonomously in their own GitHub repository, but also scraped for integration with Elasticsearch.
- Summaries are duplicated as XML files in the summarizer.
AREAS FOR IMPROVEMENT and EXPLORATION
- [Backlog] Implement a central thread resource document to consolidate summaries.
- [Backlog] Consolidate all data in Elasticsearch to avoid duplications (e.g., between XML files and index).
Serve
Focuses on HOW to make the data available for consumption.
CURRENT STATE
- Access Methods
  - The Bitcoin Search API serves as the primary interface for querying data, powering both Bitcoin Search and chat-btc.
  - The API lacks documentation.
- Relevancy Challenges
  - Poor relevancy ranking affects the usability of both search results and chat-btc responses.
AREAS FOR IMPROVEMENT and EXPLORATION
- [In Progress] Research improved ranking methodologies for Elasticsearch queries.
- [Backlog] Better documentation for API
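As an example of the kind of ranking lever this research could explore, the sketch below builds an Elasticsearch `multi_match` query that weights some fields more heavily than others. The field names (`title`, `summary`, `body`) and boost values are assumptions for illustration, not the query the Bitcoin Search API currently sends.

```python
from typing import Any, Dict


def build_search_query(text: str, size: int = 10) -> Dict[str, Any]:
    """Build a query body that boosts matches in the title and summary."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                # "field^N" multiplies that field's score contribution by N.
                "fields": ["title^3", "summary^2", "body"],
                "type": "best_fields",
            }
        },
    }
```

Boost values like these are best tuned against a labeled set of queries rather than chosen by intuition.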
Consume
CURRENT STATE
- Bitcoin Search: Displays search results from Elasticsearch using the Bitcoin Search API.
- Chat-BTC: ChatGPT wrapper using a RAG pipeline to fetch relevant documents for queries.
  - Retrieval Strategy
    - Prompt `gpt-3.5-turbo` to extract keywords from the user's last 10 questions
    - Use keywords (or query as fallback) to get relevant documents from Bitcoin Search API
    - Filter out StackExchange Questions
  - Response Generation
    - 7000 tokens context window
    - Include the last 4 to 6 messages of chat history
- Bitcoin TLDR: Displays Combined Summaries and individual Summaries
  - Currently in redesign
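The Chat-BTC retrieval strategy described above can be sketched as a small function. This is a simplified illustration, not the actual chat-btc code: the keyword extractor and search client are passed in as callables, and the `domain` and `type` document fields used to detect StackExchange questions are assumptions.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]


def is_stackexchange_question(doc: Document) -> bool:
    """Hypothetical check; assumes documents carry `domain` and `type` fields."""
    return "stackexchange" in doc.get("domain", "") and doc.get("type") == "question"


def retrieve_documents(
    questions: List[str],
    extract_keywords: Callable[[List[str]], str],
    search: Callable[[str], List[Document]],
) -> List[Document]:
    """Keyword extraction, search with fallback, then filtering."""
    if not questions:
        return []
    recent = questions[-10:]  # the user's last 10 questions
    # Fall back to the latest raw question when no keywords are returned.
    query = extract_keywords(recent) or recent[-1]
    results = search(query)
    return [doc for doc in results if not is_stackexchange_question(doc)]
```

In the real pipeline, `extract_keywords` would wrap the `gpt-3.5-turbo` prompt and `search` would call the Bitcoin Search API; injecting them keeps the flow testable without network access.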