Data Source Connectors: Complete Reference
Data source connectors are the foundation of your AI content pipeline. This comprehensive reference covers all available connectors, their configuration options, and best practices for implementation.
Overview
What are Data Source Connectors?
Data source connectors are specialized interfaces that allow AiWDC to extract content from various sources automatically. Each connector is designed to handle specific data formats, APIs, and protocols.
Available Connector Types
- RSS Feeds: Blog and news syndication
- Web Scraping: Website content extraction
- Database Integration: SQL and NoSQL databases
- API Integration: REST and GraphQL APIs
- Search Engines: Trend and content discovery
- File Processing: Document and media files
- Social Media: Platform-specific content
- Email: Newsletter and communication
RSS Feed Connectors
Configuration Options
connector_type: "rss_feed"
basic_config:
  feed_url: "https://example.com/feed.xml"
  update_frequency: "30 minutes"
  max_items_per_update: 10
content_filtering:
  include_keywords: ["AI", "machine learning", "automation"]
  exclude_keywords: ["spam", "irrelevant"]
  min_content_length: 500
  max_content_length: 5000
content_processing:
  extract_images: true
  clean_html: true
  remove_ads: true
  preserve_formatting: true
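The content_filtering keys above could be applied along the following lines. This is a minimal sketch, not AiWDC's actual implementation: the `ContentFilter` class and its `accepts` method are hypothetical names, lengths are assumed to be counted in characters, and keywords are assumed to match whole words case-insensitively.

```python
import re
from dataclasses import dataclass, field


def _has_keyword(text: str, keywords: list[str]) -> bool:
    # Whole-word, case-insensitive match so "AI" does not hit "maintain".
    return any(
        re.search(rf"\b{re.escape(k)}\b", text, re.IGNORECASE) for k in keywords
    )


@dataclass
class ContentFilter:
    """Hypothetical mirror of the content_filtering keys."""

    include_keywords: list[str] = field(default_factory=list)
    exclude_keywords: list[str] = field(default_factory=list)
    min_content_length: int = 0
    max_content_length: int = 10**9

    def accepts(self, text: str) -> bool:
        # Reject items outside the configured length bounds (characters).
        if not self.min_content_length <= len(text) <= self.max_content_length:
            return False
        # A single excluded keyword rejects the item.
        if _has_keyword(text, self.exclude_keywords):
            return False
        # An empty include list means "accept everything that is left".
        if not self.include_keywords:
            return True
        return _has_keyword(text, self.include_keywords)
```

Checking exclusions before inclusions means an item that mentions both "AI" and "spam" is dropped, which seems the safer default for a relevance filter.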
Supported Formats
- RSS 2.0: Standard RSS feeds
- Atom 1.0: Atom syndication format
- JSON Feed: Modern JSON format
- Media RSS: RSS with media enclosures
Best Practices
- Feed Validation: Ensure feeds are well-formed (valid XML for RSS and Atom, valid JSON for JSON Feed)
- Update Frequency: Don’t overwhelm source servers
- Content Filtering: Use filters to maintain relevance
- Error Handling: Handle feed downtime gracefully
Web Scraping Connectors
Configuration Options
connector_type: "web_scraper"
target_config:
  start_url: "https://example.com/articles"
  crawl_depth: 2
  respect_robots_txt: true
  rate_limit: "2 requests per second"
content_extraction:
  content_selectors:
    title: "h1.article-title"
    body: "div.article-content"
    author: "span.author-name"
    date: "time.published-date"
  cleanup_rules:
    remove_elements: ["nav", "footer", "sidebar"]
    preserve_structure: true
    extract_images: true
advanced_options:
  javascript_rendering: true
  handle_pagination: true
  custom_headers:
    User-Agent: "AiWDC-Bot/1.0"
Advanced Features
- JavaScript Rendering: Execute client-side JavaScript
- Pagination: Handle multi-page content
- Login Support: Authenticate for protected content
- Proxy Support: Rotate IP addresses
- CAPTCHA Handling: Solve automated challenges
Legal Considerations
- robots.txt: Always respect robots.txt directives
- Terms of Service: Review website terms
- Rate Limiting: Be respectful of server resources
- Copyright: Only scrape content you have rights to
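The robots.txt and rate-limiting points above can be illustrated with Python's standard library. This is a sketch under assumptions: `build_robots`, `allowed_urls`, and `RateLimiter` are hypothetical helpers, the robots.txt text is passed in as a string rather than downloaded, and the user agent is the one from the sample config.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "AiWDC-Bot/1.0"  # taken from custom_headers in the sample config


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, urls: list[str]) -> list[str]:
    """Keep only the URLs that robots.txt permits for our user agent."""
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


class RateLimiter:
    """Spaces successive calls at least 1/requests_per_second apart."""

    def __init__(self, requests_per_second: float):
        self.min_interval = 1.0 / requests_per_second
        self._last = None

    def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

A crawler would call `limiter.wait()` before each request, for example with `RateLimiter(2)` to match the "2 requests per second" setting.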
Database Connectors
MySQL Configuration
connector_type: "mysql"
connection:
  host: "your-database-host"
  port: 3306
  database: "content_db"
  username: "content_user"
  password: "secure_password"
  ssl: true
query_config:
  query: "SELECT * FROM articles WHERE status = 'published' AND created_at > LAST_UPDATE"
  id_column: "article_id"
  last_update_column: "updated_at"
content_mapping:
  title: "title"
  content: "body"
  author: "author_name"
  publish_date: "publication_date"
  categories: "category_names"
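One way to read the LAST_UPDATE placeholder is as a watermark that advances after every sync. The sketch below assumes that interpretation and filters on the `updated_at` column named in last_update_column. It uses the standard-library `sqlite3` module as a stand-in so it runs without a MySQL server; MySQL drivers such as PyMySQL use `%s` rather than `?` as the parameter marker. The helper names and demo rows are hypothetical.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """In-memory table with made-up rows, standing in for the articles table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (article_id INTEGER, title TEXT, "
        "status TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?, ?, ?)",
        [
            (1, "Old", "published", "2023-12-31T10:00:00"),
            (2, "New", "published", "2024-01-02T09:00:00"),
            (3, "Newer", "published", "2024-01-03T09:00:00"),
            (4, "Draft", "draft", "2024-01-04T09:00:00"),
        ],
    )
    return conn


def sync(conn: sqlite3.Connection, last_update: str):
    """Return rows changed since last_update plus the new watermark."""
    # The timestamp is bound as a parameter, never spliced into the SQL text.
    rows = conn.execute(
        "SELECT article_id, title, updated_at FROM articles "
        "WHERE status = 'published' AND updated_at > ? ORDER BY updated_at",
        (last_update,),
    ).fetchall()
    # Advance the watermark to the newest row seen; keep it if nothing changed.
    new_mark = rows[-1][2] if rows else last_update
    return rows, new_mark
```

Storing the returned watermark and passing it to the next run keeps each sync incremental.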
PostgreSQL Configuration
connector_type: "postgresql"
connection:
  host: "localhost"
  port: 5432
  database: "content_repository"
  username: "content_admin"
  password: "database_password"
advanced_options:
  connection_pool: true
  pool_size: 10
  timeout: "30 seconds"
MongoDB Configuration
connector_type: "mongodb"
connection:
  uri: "mongodb+srv://user:password@cluster.mongodb.net/"
  database: "content_db"
  collection: "articles"
query:
  filter: {"status": "published", "published_at": {"$gt": "$LAST_RUN"}}
  projection: {"title": 1, "content": 1, "author": 1, "published_at": 1}
  sort: {"published_at": -1}
  limit: 100
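The "$LAST_RUN" value in the filter is presumably replaced with the timestamp of the previous run before the query is sent. How AiWDC performs that substitution is not documented here, so the following is only a sketch of one plausible approach; `resolve_placeholders` is a hypothetical helper.

```python
from datetime import datetime, timezone

PLACEHOLDER = "$LAST_RUN"


def resolve_placeholders(value, last_run: datetime):
    """Recursively swap the placeholder string for the real timestamp."""
    if isinstance(value, dict):
        return {k: resolve_placeholders(v, last_run) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_placeholders(v, last_run) for v in value]
    # Only an exact match is replaced; every other value passes through.
    return last_run if value == PLACEHOLDER else value


query_filter = {"status": "published", "published_at": {"$gt": "$LAST_RUN"}}
last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)
resolved = resolve_placeholders(query_filter, last_run)
```

The resolved dictionary could then be passed to a MongoDB driver as the query filter, with the original config left unmodified for the next run.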
Database Best Practices
- Security: Use encrypted connections
- Performance: Optimize queries with indexes
- Connection Management: Use connection pooling
- Error Handling: Handle connection failures
- Data Validation: Validate data before processing
API Integration Connectors
REST API Configuration
connector_type: "rest_api"
endpoint:
  base_url: "https://api.example.com/v1"
  authentication:
    type: "bearer_token"
    token: "your_api_token"
endpoints:
  - name: "articles"
    path: "/articles"
    method: "GET"
    parameters:
      limit: 50
      offset: 0
      status: "published"
    headers:
      Accept: "application/json"
pagination:
  type: "page_number"
  page_param: "page"
  max_pages: 10
rate_limiting:
  requests_per_minute: 60
  burst_limit: 10
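The pagination block above suggests a loop that requests page 1, 2, 3 and so on until a page comes back empty or max_pages is reached. A minimal sketch of that loop follows; `paginate` and the `fetch_page` callback are hypothetical names, and the HTTP call itself is left to the caller so the example stays self-contained.

```python
from typing import Callable, Iterator


def paginate(
    fetch_page: Callable[[int], list],
    max_pages: int = 10,
    first_page: int = 1,
) -> Iterator[dict]:
    """Yield items page by page, stopping at an empty page or max_pages."""
    for page in range(first_page, first_page + max_pages):
        items = fetch_page(page)
        if not items:
            break  # An empty page is treated as the end of the collection.
        yield from items


# Fake two-page API used in place of a real HTTP request.
_PAGES = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}


def fake_fetch(page: int) -> list:
    return _PAGES.get(page, [])


all_items = list(paginate(fake_fetch, max_pages=10))
```

In a real connector, `fetch_page` would send the page number under the configured page_param and return the decoded list of articles.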
GraphQL Configuration
connector_type: "graphql"
endpoint:
  url: "https://api.example.com/graphql"
  headers:
    Authorization: "Bearer your_token"
    Content-Type: "application/json"
query: |
  query GetContent($limit: Int!, $offset: Int!) {
    articles(limit: $limit, offset: $offset) {
      id
      title
      content
      author
      publishedAt
      tags
    }
  }
variables:
  limit: 50
  offset: 0
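GraphQL servers conventionally accept a POST whose JSON body carries the query text and its variables. The sketch below builds that body from the config values and advances the offset for the next page; `build_payload` and `next_variables` are hypothetical helpers, and sending the request is left out.

```python
import json

QUERY = """
query GetContent($limit: Int!, $offset: Int!) {
  articles(limit: $limit, offset: $offset) {
    id
    title
  }
}
"""


def build_payload(query: str, variables: dict) -> str:
    """Serialize the query and variables into the JSON request body."""
    return json.dumps({"query": query, "variables": variables})


def next_variables(variables: dict) -> dict:
    """Return a copy of the variables with the offset moved one page forward."""
    return {**variables, "offset": variables["offset"] + variables["limit"]}


first = {"limit": 50, "offset": 0}
payload = build_payload(QUERY, first)
second = next_variables(first)
```

Returning a new dictionary instead of mutating the old one keeps the configured starting values intact between runs.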
Authentication Methods
- Bearer Token: Simple token-based auth
- API Key: Key in header or query parameter
- OAuth 2.0: Full OAuth2 flow support
- Basic Auth: Username/password authentication
- Custom: Custom authentication schemes
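Three of the methods listed above reduce to setting a request header. The following sketch shows how such headers are commonly built; `auth_headers` is a hypothetical helper, the method names other than "bearer_token" are assumptions, and "X-API-Key" is only a commonly used default header name, not something AiWDC prescribes.

```python
import base64


def auth_headers(method: str, **cred: str) -> dict:
    """Build HTTP headers for a few common authentication methods."""
    if method == "bearer_token":
        return {"Authorization": f"Bearer {cred['token']}"}
    if method == "api_key":
        # The header name varies by provider, so it can be overridden.
        return {cred.get("header", "X-API-Key"): cred["key"]}
    if method == "basic":
        # Basic auth is base64("username:password"); it is encoding, not encryption.
        raw = f"{cred['username']}:{cred['password']}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    raise ValueError(f"unsupported authentication method: {method}")
```

OAuth 2.0 is omitted because it needs a token exchange before any header can be built; once a token is obtained, it is typically sent as a bearer token.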
Search Engine Connectors
Google Trends Configuration
connector_type: "google_trends"
authentication:
  api_key: "your_google_api_key"
queries:
  - keyword: "artificial intelligence"
    timeframe: "now 7-d"
    geo: "US"
    category: "technology"
  - keyword: "machine learning"
    timeframe: "now 7-d"
    geo: "US"
    category: "technology"
processing:
  min_interest_score: 50
  trending_threshold: 80
Custom Search Configuration
connector_type: "custom_search"
search_engine:
  name: "google"
  api_key: "your_api_key"
  search_engine_id: "your_cse_id"
query_config:
  query: "AI content automation"
  results_per_page: 10
  safe_search: "moderate"
  language: "en"
filtering:
  exclude_domains: ["spam.com", "low-quality.com"]
  include_domains: ["techcrunch.com", "venturebeat.com"]
  date_range: "last_7_days"
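The include_domains and exclude_domains lists could be applied to search results as shown below. This is a sketch with assumed semantics: exclusions win over inclusions, subdomains count as part of the listed domain, and an empty include list accepts every domain that is not excluded. The function names are hypothetical.

```python
from urllib.parse import urlparse


def _in_domains(host: str, domains: list[str]) -> bool:
    # Match the domain itself or any subdomain, but not look-alike suffixes.
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_results(
    urls: list[str], include_domains: list[str], exclude_domains: list[str]
) -> list[str]:
    """Keep URLs whose host passes the exclude and include lists."""
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if _in_domains(host, exclude_domains):
            continue
        if include_domains and not _in_domains(host, include_domains):
            continue
        kept.append(url)
    return kept
```

Comparing against "." plus the domain prevents a host such as nottechcrunch.com from slipping through a plain suffix check.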
File Processing Connectors
Document Processing
connector_type: "file_processor"
source:
  type: "s3"
  bucket: "content-bucket"
  prefix: "documents/"
  region: "us-east-1"
file_types:
  - ".pdf"
  - ".docx"
  - ".txt"
  - ".md"
processing:
  extract_text: true
  extract_metadata: true
  ocr_enabled: true
  language_detection: true
output_format:
  structure: "markdown"
  include_metadata: true
  preserve_formatting: true
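Before any text extraction runs, the connector has to decide which objects under the prefix are worth processing. The sketch below shows that selection step for a list of object keys; `select_keys` is a hypothetical helper, and extension matching is assumed to be case-insensitive.

```python
from pathlib import PurePosixPath


def select_keys(keys: list[str], prefix: str, file_types: list[str]) -> list[str]:
    """Keep keys under the prefix whose extension is in file_types."""
    allowed = {ext.lower() for ext in file_types}
    return [
        key
        for key in keys
        if key.startswith(prefix) and PurePosixPath(key).suffix.lower() in allowed
    ]


keys = [
    "documents/report.PDF",
    "documents/notes.md",
    "documents/photo.png",
    "archive/old.txt",
]
selected = select_keys(keys, "documents/", [".pdf", ".docx", ".txt", ".md"])
```

Object keys use forward slashes regardless of operating system, which is why `PurePosixPath` is used instead of `Path`.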
Image Processing
connector_type: "image_processor"
source:
  type: "local_directory"
  path: "/content/images"
processing:
  extract_text: true  # OCR
  detect_objects: true
  generate_captions: true
  classify_content: true
output:
  include_image: false
  include_alt_text: true
  include_tags: true
Social Media Connectors
Twitter Configuration
connector_type: "twitter"
authentication:
  api_key: "your_api_key"
  api_secret: "your_api_secret"
  access_token: "your_access_token"
  access_token_secret: "your_access_token_secret"
data_collection:
  type: "search"
  query: "#AIcontent automation"
  result_type: "recent"
  count: 100
filters:
  min_retweets: 5
  min_likes: 10
  exclude_replies: true
  language: "en"
LinkedIn Configuration
connector_type: "linkedin"
authentication:
  client_id: "your_client_id"
  client_secret: "your_client_secret"
  access_token: "your_access_token"
endpoints:
  - name: "company_updates"
    path: "/companies/{company_id}/updates"
    fields: ["id", "text", "created", "author"]
rate_limiting:
  requests_per_hour: 500
Email Connectors
Newsletter Processing
connector_type: "email"
email_config:
  server: "imap.gmail.com"
  port: 993
  encryption: "ssl"
  username: "your_email@gmail.com"
  password: "your_password"
folder_config:
  folder: "INBOX/Newsletters"
  mark_as_read: true
  move_to_folder: "Processed"
content_extraction:
  extract_html: true
  extract_text: true
  remove_tracking: true
  extract_links: true
filtering:
  from_domains: ["substack.com", "medium.com"]
  subject_keywords: ["AI", "technology", "automation"]
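The filtering block above can be read as "the sender's domain must be listed and the subject must contain a keyword". The sketch below implements that reading with the standard-library `email` package. Combining the two conditions with AND is an assumption, as are the whole-word keyword match and the `is_wanted` helper name.

```python
import re
from email import message_from_string
from email.utils import parseaddr


def is_wanted(raw: str, from_domains: list[str], subject_keywords: list[str]) -> bool:
    """Check a raw RFC 822 message against sender-domain and subject filters."""
    msg = message_from_string(raw)
    # parseaddr splits "Name <user@host>" into (name, address).
    _, address = parseaddr(msg.get("From", ""))
    domain = address.rpartition("@")[2].lower()
    if not any(domain == d or domain.endswith("." + d) for d in from_domains):
        return False
    subject = str(msg.get("Subject", ""))
    return any(
        re.search(rf"\b{re.escape(k)}\b", subject, re.IGNORECASE)
        for k in subject_keywords
    )


RAW = "From: Digest <news@substack.com>\nSubject: Weekly AI roundup\n\nBody text"
wanted = is_wanted(RAW, ["substack.com", "medium.com"], ["AI", "technology"])
```

Messages fetched over IMAP arrive as bytes; `email.message_from_bytes` is the matching entry point in that case.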
Configuration Management
Environment Variables
environment:
  DEVELOPMENT:
    database_host: "localhost"
    api_rate_limit: "1000/hour"
  PRODUCTION:
    database_host: "prod-db.example.com"
    api_rate_limit: "10000/hour"
    ssl_required: true
Secret Management
secrets:
  database_password: "${DATABASE_PASSWORD}"
  api_token: "${API_TOKEN}"
  encryption_key: "${ENCRYPTION_KEY}"
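The "${NAME}" syntax above implies that values are read from environment variables when the config is loaded. A minimal sketch of that expansion follows; `expand` and `resolve_secrets` are hypothetical names, and failing loudly on a missing variable is a design assumption rather than documented AiWDC behavior.

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand(value: str, env=None) -> str:
    """Replace every ${NAME} in value with the matching environment variable."""
    env = os.environ if env is None else env

    def _lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            # Fail early instead of silently sending an empty password.
            raise KeyError(f"missing environment variable: {name}")
        return env[name]

    return _VAR.sub(_lookup, value)


def resolve_secrets(secrets: dict, env=None) -> dict:
    """Expand all values of a secrets mapping."""
    return {key: expand(value, env) for key, value in secrets.items()}
```

Keeping only the placeholders in the config file means the file can be committed to version control without exposing credentials.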
Monitoring and Logging
Performance Monitoring
monitoring:
  metrics:
    - name: "extraction_success_rate"
      target: "> 95%"
    - name: "average_processing_time"
      target: "< 30 seconds"
    - name: "error_rate"
      target: "< 5%"
  alerts:
    - condition: "error_rate > 10%"
      action: "send_alert"
    - condition: "average_processing_time > 60 seconds"
      action: "log_warning"
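Alert conditions of the form "metric > number unit" can be evaluated against current metric values with a small parser. The sketch below is one possible approach, not the product's evaluator: `triggered_actions` is a hypothetical helper, only `<` and `>` are supported, and the unit suffix is ignored on the assumption that metrics are reported in the same unit as the threshold.

```python
import operator
import re

_OPS = {">": operator.gt, "<": operator.lt}
_CONDITION = re.compile(r"^\s*(\w+)\s*([<>])\s*(\d+(?:\.\d+)?)")


def triggered_actions(alerts: list[dict], metrics: dict) -> list[str]:
    """Return the actions of every alert whose condition currently holds."""
    actions = []
    for alert in alerts:
        match = _CONDITION.match(alert["condition"])
        if match is None:
            raise ValueError(f"cannot parse condition: {alert['condition']!r}")
        name, op, threshold = match.group(1), match.group(2), float(match.group(3))
        # Metrics that have not been reported yet never trigger an alert.
        if name in metrics and _OPS[op](metrics[name], threshold):
            actions.append(alert["action"])
    return actions


ALERTS = [
    {"condition": "error_rate > 10%", "action": "send_alert"},
    {"condition": "average_processing_time > 60 seconds", "action": "log_warning"},
]
```

For example, an error rate of 12 with an average processing time of 45 would trigger only "send_alert".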
Logging Configuration
logging:
  level: "INFO"
  format: "json"
  outputs:
    - type: "file"
      path: "/var/log/aiwdc/connectors.log"
    - type: "cloud"
      service: "cloudwatch"
      region: "us-east-1"
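The "json" format at level "INFO" can be reproduced with Python's standard `logging` module, which may help when testing log pipelines locally. This is a sketch: the field names in the JSON record and the logger name "aiwdc.connectors" are assumptions, and an in-memory stream stands in for the file and cloud outputs.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "time": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


def build_logger(stream, level: str = "INFO") -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("aiwdc.connectors")  # hypothetical logger name
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.debug("not written at INFO level")
log.info("feed fetched")
```

Swapping the stream handler for `logging.FileHandler` with the configured path would cover the "file" output.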
Error Handling
Common Error Types
- Connection Errors: Network issues, server downtime
- Authentication Errors: Invalid credentials, expired tokens
- Rate Limiting: Too many requests
- Data Format Errors: Unexpected data structures
- Permission Errors: Access denied
Error Recovery Strategies
error_handling:
  retries: 3
  backoff_strategy: "exponential"
  max_backoff: "5 minutes"
  fallback_actions:
    - type: "use_cache"
      ttl: "1 hour"
    - type: "alternative_source"
      priority: 2
    - type: "human_notification"
      channel: "slack"
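Exponential backoff with a cap, as configured above, doubles the wait after each failed attempt until max_backoff is reached. The sketch below assumes a base delay of one second (the config does not state one) and expresses "5 minutes" as 300 seconds; the function names are hypothetical.

```python
import time


def backoff_delays(retries: int, base: float = 1.0, max_backoff: float = 300.0) -> list:
    """Delay before each retry: base, 2*base, 4*base, ... capped at max_backoff."""
    return [min(base * 2**attempt, max_backoff) for attempt in range(retries)]


def call_with_retries(func, retries: int = 3, base: float = 1.0,
                      max_backoff: float = 300.0, sleep=time.sleep):
    """Call func, retrying up to `retries` times with exponential backoff."""
    delays = backoff_delays(retries, base, max_backoff)
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # Retries exhausted: let the caller run fallback actions.
            sleep(delays[attempt])
```

The `sleep` parameter is injectable so the retry logic can be tested without real waiting. Production code would usually catch only transient errors such as timeouts, and often adds random jitter to the delays.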
Best Practices
1. Security
- Use encryption for sensitive data
- Implement proper authentication
- Follow principle of least privilege
- Regular security audits
2. Performance
- Implement caching strategies
- Use connection pooling
- Optimize queries and requests
- Monitor resource usage
3. Reliability
- Implement retry logic
- Use circuit breakers
- Have fallback mechanisms
- Monitor health metrics
4. Compliance
- Follow data protection regulations
- Respect terms of service
- Maintain audit trails
- Document data usage
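The circuit breaker mentioned under Reliability stops calling a failing source for a while instead of retrying it on every run. A minimal sketch of the pattern follows; the class, its parameters, and their default values are illustrative assumptions, not an AiWDC API.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed.

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # Open, or re-open after a failed trial.
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A connector would wrap each fetch in `breaker.call(...)` and treat `CircuitOpenError` as a signal to use a fallback such as cached content.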
Troubleshooting
Common Issues and Solutions
1. Connection Failures
- Check network connectivity
- Verify authentication credentials
- Review firewall settings
- Test with manual connections
2. Data Quality Issues
- Validate data schemas
- Implement data cleaning
- Use quality scoring
- Set up alerts for anomalies
3. Performance Problems
- Monitor resource usage
- Optimize queries
- Scale resources
- Implement caching
Future Enhancements
Upcoming Features
- AI-Powered Extraction: Smart content identification
- Real-time Processing: Stream processing capabilities
- Advanced Analytics: Predictive maintenance
- Multi-language Support: Global content sources
- Blockchain Integration: Verified content sources
Conclusion
Data source connectors are the critical foundation of your AI content pipeline. By understanding and properly configuring these connectors, you ensure a reliable, efficient, and scalable content automation system.
Start with basic configurations, add advanced features gradually, and keep tuning based on the performance data you collect. Connectors that are set up carefully and maintained regularly are far less likely to fail silently.
Resources
- Connector Documentation: Detailed API references
- Configuration Templates: Ready-to-use configs
- Troubleshooting Guide: Common issues and solutions
- Best Practices: Industry-standard approaches
- Community Support: User forums and discussions
By mastering these data source connectors, you’ll build a robust foundation for your AI content automation strategy.