Files
delphi-database/ADDRESS_VALIDATION_SERVICE.md
2025-08-11 21:58:25 -05:00

304 lines
7.2 KiB
Markdown

# Address Validation Service Design
## Overview
A separate Docker service for validating and standardizing addresses using a hybrid approach that prioritizes privacy and minimizes external API calls.
## Architecture
### Service Design
- **Standalone FastAPI service** running on port 8001
- **SQLite database** containing USPS ZIP+4 data (~500MB)
- **USPS API integration** for street-level validation when needed
- **Redis cache** for validated addresses
- **Internal HTTP API** for communication with main legal application
### Data Flow
```
1. Legal App → Address Service (POST /validate)
2. Address Service checks local ZIP database
3. If ZIP/city/state valid → return immediately
4. If street validation needed → call USPS API
5. Cache result in Redis
6. Return standardized address to Legal App
```
## Technical Requirements
### Dependencies
- FastAPI framework
- SQLAlchemy for database operations
- SQLite for ZIP+4 database storage
- Redis for caching validated addresses
- httpx for USPS API calls
- Pydantic for request/response validation
### Database Schema
```sql
-- ZIP+4 Database (from USPS monthly files)
CREATE TABLE zip_codes (
zip_code TEXT,
plus4 TEXT,
city TEXT,
state TEXT,
county TEXT,
delivery_point TEXT,
PRIMARY KEY (zip_code, plus4)
);
CREATE INDEX idx_zip_city ON zip_codes(zip_code, city);
CREATE INDEX idx_city_state ON zip_codes(city, state);
```
### API Endpoints
#### POST /validate
Validate and standardize an address.
**Request:**
```json
{
"street": "123 Main St",
"city": "Anytown",
"state": "CA",
"zip": "90210",
"strict": false // Optional: require exact match
}
```
**Response:**
```json
{
"valid": true,
"confidence": 0.95,
"source": "local", // "local", "usps_api", "cached"
"standardized": {
"street": "123 MAIN ST",
"city": "ANYTOWN",
"state": "CA",
"zip": "90210",
"plus4": "1234",
"delivery_point": "12"
},
"corrections": [
"Standardized street abbreviation ST"
]
}
```
#### GET /health
Health check endpoint.
#### POST /batch-validate
Batch validation for multiple addresses (up to 50).
#### GET /stats
Service statistics (cache hit rate, API usage, etc.).
## Privacy & Security Features
### Data Minimization
- Only street numbers/names sent to USPS API when necessary
- ZIP/city/state validation happens offline first
- Validated addresses cached to avoid repeat API calls
- No logging of personal addresses
### Rate Limiting
- USPS API limited to 5 requests/second
- Internal queue system for burst requests
- Fallback to local-only validation when rate limited
### Caching Strategy
- Redis cache with 30-day TTL for validated addresses
- Cache key: SHA256 hash of normalized address
- Cache hit ratio target: >80% after initial warmup
## Data Sources
### USPS ZIP+4 Database
- **Source:** USPS Address Management System
- **Update frequency:** Monthly
- **Size:** ~500MB compressed, ~2GB uncompressed
- **Format:** Fixed-width text files (legacy format)
- **Download:** Automated monthly sync via USPS FTP
### USPS Address Validation API
- **Endpoint:** https://secure.shippingapis.com/ShippingAPI.dll
- **Rate limit:** 5 requests/second, 10,000/day free
- **Authentication:** USPS Web Tools User ID required
- **Response format:** XML (convert to JSON internally)
## Implementation Phases
### Phase 1: Basic Service (1-2 days)
- FastAPI service setup
- Basic ZIP code validation using downloaded USPS data
- Docker containerization
- Simple /validate endpoint
### Phase 2: USPS Integration (1 day)
- USPS API client implementation
- Street-level validation
- Error handling and fallbacks
### Phase 3: Caching & Optimization (1 day)
- Redis integration
- Performance optimization
- Batch validation endpoint
### Phase 4: Data Management (1 day)
- Automated USPS data downloads
- Database update procedures
- Monitoring and alerting
### Phase 5: Integration (0.5 day)
- Update legal app to use address service
- Form validation integration
- Error handling in UI
## Docker Configuration
### Dockerfile
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8001
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
```
### Docker Compose Addition
```yaml
services:
address-service:
build: ./address-service
ports:
- "8001:8001"
environment:
- USPS_USER_ID=${USPS_USER_ID}
- REDIS_URL=redis://redis:6379
depends_on:
- redis
volumes:
- ./address-service/data:/app/data
```
## Configuration
### Environment Variables
- `USPS_USER_ID`: USPS Web Tools user ID
- `REDIS_URL`: Redis connection string
- `ZIP_DB_PATH`: Path to SQLite ZIP database
- `UPDATE_SCHEDULE`: Cron schedule for data updates
- `API_RATE_LIMIT`: USPS API rate limit (default: 5/second)
- `CACHE_TTL`: Cache time-to-live in seconds (default: 2592000 = 30 days)
## Monitoring & Metrics
### Key Metrics
- Cache hit ratio
- USPS API usage/limits
- Response times (local vs API)
- Validation success rates
- Database update status
### Health Checks
- Service availability
- Database connectivity
- Redis connectivity
- USPS API connectivity
- Disk space for ZIP database
## Error Handling
### Graceful Degradation
1. USPS API down → Fall back to local ZIP validation only
2. Redis down → Skip caching, direct validation
3. ZIP database corrupt → Use USPS API only
4. All systems down → Return input address with warning
### Error Responses
```json
{
"valid": false,
"error": "USPS_API_UNAVAILABLE",
"message": "Street validation temporarily unavailable",
"fallback_used": "local_zip_only"
}
```
## Testing Strategy
### Unit Tests
- Address normalization functions
- ZIP database queries
- USPS API client
- Caching logic
### Integration Tests
- Full validation workflow
- Error handling scenarios
- Performance benchmarks
- Data update procedures
### Load Testing
- Concurrent validation requests
- Cache performance under load
- USPS API rate limiting behavior
## Security Considerations
### Input Validation
- Sanitize all address inputs
- Prevent SQL injection in ZIP queries
- Validate against malicious payloads
### Network Security
- Internal service communication only
- No direct external access to service
- HTTPS for USPS API calls
- Redis authentication if exposed
### Data Protection
- No persistent logging of addresses
- Secure cache key generation
- Regular security updates for dependencies
## Future Enhancements
### Phase 2 Features
- International address validation (Google/SmartyStreets)
- Address autocomplete suggestions
- Geocoding integration
- Delivery route optimization
### Performance Optimizations
- Database partitioning by state
- Compressed cache storage
- Async batch processing
- CDN for static ZIP data
## Cost Analysis
### Infrastructure Costs
- Additional container resources: ~$10/month
- Redis cache: ~$5/month
- USPS ZIP data storage: Minimal
- USPS API: Free tier (10K requests/day)
### Development Time
- Initial implementation: 3-5 days
- Testing and refinement: 1-2 days
- Documentation and deployment: 0.5 day
- **Total: 4.5-7.5 days**
### ROI
- Improved data quality
- Reduced shipping errors
- Better client communication
- Compliance with data standards
- Foundation for future location-based features