# Atlas Record Cleaner Report - Implementation Guide

## Overview

This application validates and verifies species occurrence records from the NBN Atlas using the BRC Record Cleaner Service. It provides a web interface for users to select records by dataset or search URL, then processes them through validation and verification steps, producing a comprehensive report with downloadable CSV export.

## Features

- **Two Input Methods**: Select records by dataset or paste an NBN Atlas search URL
- **Validation**: Checks record format, dates, spatial references, and taxon names
- **Verification**: Tests records against biological rules (distribution, phenology, period)
- **Summary Statistics**: Visual dashboard showing pass/warn/fail rates
- **CSV Export**: Download complete results with all fields for spreadsheet analysis
- **Caching**: Dataset list cached for 20 minutes to reduce API calls
- **Loading Indicators**: User-friendly loading overlays during processing

## Architecture

### Service Layer Pattern

The application follows Django best practices with thin views and business logic in service classes:

```
atlas_record_cleaner_report/
├── services/
│   ├── atlas_service.py          # NBN Atlas API interactions
│   ├── record_cleaner_service.py # Record Cleaner API client
│   ├── data_mapper.py            # Data transformation functions
│   └── csv_service.py            # CSV export generation
├── templates/
│   └── atlas_record_cleaner_report/
│       ├── form.html             # Input form page
│       └── results.html          # Results display page
├── views.py                      # View functions (thin)
├── urls.py                       # URL routing
└── tests.py                      # Comprehensive test suite
```

### Data Flow

1. **User Input** → Form submission (dataset UID or search URL)
2. **Atlas Fetch** → Retrieve occurrences from NBN Atlas API
3. **Data Mapping** → Transform Atlas format to Record Cleaner format
4. **Validation** → Send records to Record Cleaner /validate endpoint
5. **Verification** → Send passing records to /verify endpoint
6. **Merge Results** → Combine original data with validation/verification results
7. **Display** → Show summary statistics and detailed table
8. **Export** → Generate CSV for download
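
Steps 4 to 6 can be sketched as follows. This is a minimal illustration, not the app's actual code: the `run_pipeline` name, the `id` and `result` keys, and the injected `validate`/`verify` callables are all hypothetical stand-ins for the Record Cleaner client calls, and it assumes that only records with a `pass` validation result are forwarded to verification.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def run_pipeline(
    records: List[Record],
    validate: Callable[[List[Record]], List[Record]],
    verify: Callable[[List[Record]], List[Record]],
) -> List[Record]:
    """Validate all records, verify only those that passed, merge by id."""
    # Step 4: every mapped record goes to validation.
    validated = {r["id"]: r for r in validate(records)}

    # Step 5: only records whose validation passed are sent on (assumption).
    passing = [r for r in records if validated[r["id"]]["result"] == "pass"]
    verified = {r["id"]: r for r in verify(passing)}

    # Step 6: merge by id; records that were never verified are marked as such.
    merged = []
    for record in records:
        rid = record["id"]
        merged.append({
            **record,
            "validation_result": validated[rid]["result"],
            "verification_result": verified.get(rid, {}).get("result", "not_verified"),
        })
    return merged
```

Keying both result sets by record id keeps the merge linear in the number of records and independent of the order in which the API returns them.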

## Configuration

All configuration constants are defined at the top of `views.py`:

```python
# Configuration Constants
CACHE_TIMEOUT = 1200        # Cache duration in seconds (20 minutes)
CACHE_KEY_DATASETS = 'atlas_datasets_list'  # Cache key for datasets
```

### Configuration Management

**Sensitive credentials are stored in `.env` files using `django-environ`.**

Keeping these values in `.env` files keeps secrets out of version control and allows separate configurations for development and production.

**Setup:**

1. Run the setup script:
   ```bash
   ./setup_config.sh
   ```

2. Or manually configure:
   ```bash
   # Option 1: Project root (local dev)
   cp env.example .env
   
   # Option 2: External directory (production)
   export APP_CONFIG_DIR="$HOME/.config/devs-gone-wild"
   mkdir -p "$APP_CONFIG_DIR"            # make sure the directory exists
   cp env.example "$APP_CONFIG_DIR/.env"
   ```

3. Edit the `.env` file with your credentials:
   ```bash
   RECORD_CLEANER_USERNAME=your-username
   RECORD_CLEANER_PASSWORD=your-password
   DEBUG=True
   ```

**Configuration is loaded in `config/settings.py`:**

```python
import environ

env = environ.Env()
environ.Env.read_env()  # Reads from .env file

RECORD_CLEANER_USERNAME = env('RECORD_CLEANER_USERNAME')
RECORD_CLEANER_PASSWORD = env('RECORD_CLEANER_PASSWORD')
```
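
If the external directory option is used, the settings module needs to know where the `.env` file lives. One possible way to resolve the path is sketched below; `resolve_env_file` is a hypothetical helper, and the precedence (use `APP_CONFIG_DIR` when set, otherwise the project root) is an assumption about how the project is wired.

```python
import os
from pathlib import Path


def resolve_env_file(project_root: Path) -> Path:
    """Return the .env path, preferring APP_CONFIG_DIR when it is set."""
    config_dir = os.environ.get("APP_CONFIG_DIR")
    if config_dir:
        return Path(config_dir) / ".env"
    return project_root / ".env"


# In settings.py the result could then be handed to django-environ, e.g.:
# environ.Env.read_env(str(resolve_env_file(BASE_DIR)))
```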

**Usage in code:**

```python
from django.conf import settings

# In views.py
rc_client = RecordCleanerClient(
    username=settings.RECORD_CLEANER_USERNAME,
    password=settings.RECORD_CLEANER_PASSWORD
)
```

See [CONFIG_SETUP.md](../CONFIG_SETUP.md) in the project root for detailed configuration instructions.

### Important: Production Configuration

**Before deploying to production:**

1. Set up production `.env` file (preferably in external directory via `APP_CONFIG_DIR`)
2. Set `DEBUG=False` in your `.env` file
3. Generate and set a strong `SECRET_KEY`
4. Set restrictive file permissions: `chmod 600 $APP_CONFIG_DIR/.env` or `chmod 600 .env`
5. Consider setting a reasonable `MAX_RECORDS` limit (e.g., 10000) for API calls

## Caching Implementation

### Dataset List Caching

The application uses Django's built-in caching framework to store the list of datasets from the NBN Atlas Registry.

- **Cache Key**: `atlas_datasets_list`
- **Timeout**: 1200 seconds (20 minutes)
- **Backend**: Django default cache (LocMemCache for development)

### How It Works

1. On first page load, datasets are fetched from NBN Atlas Registry API
2. Results are stored in cache with 20-minute expiry
3. Subsequent requests within 20 minutes use cached data
4. After 20 minutes, cache expires and fresh data is fetched on next request
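
The get-or-fetch pattern described in these steps might look like the sketch below. The `get_datasets` function and the injected `fetch` callable are hypothetical; the `cache` argument is assumed to behave like Django's `django.core.cache.cache`, which provides `get(key)` and `set(key, value, timeout)`.

```python
from typing import Any, Callable, List

CACHE_TIMEOUT = 1200  # seconds (20 minutes), as in views.py
CACHE_KEY_DATASETS = "atlas_datasets_list"


def get_datasets(cache: Any, fetch: Callable[[], List[dict]]) -> List[dict]:
    """Return the cached dataset list, fetching and storing it on a miss."""
    datasets = cache.get(CACHE_KEY_DATASETS)
    if datasets is None:
        # Cache miss or expired entry: call the Registry API once.
        datasets = fetch()
        cache.set(CACHE_KEY_DATASETS, datasets, CACHE_TIMEOUT)
    return datasets
```

Passing the cache and the fetch function in as arguments keeps the helper easy to test without touching the network.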

### Cache Refresh

An expired entry is repopulated automatically on the next request. No manual intervention is needed.

### Configuring Cache Timeout

To change the cache duration, modify `CACHE_TIMEOUT` in `views.py`:

```python
CACHE_TIMEOUT = 3600    # 1 hour
# CACHE_TIMEOUT = 7200  # or 2 hours (set only one value)
```

### Production Cache Backend

For production, consider using a more robust cache backend in `config/settings.py`:

**Option 1: Database Cache**
```python
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
        'LOCATION': 'app_cache_table',
    }
}
```

Run: `python manage.py createcachetable`

**Option 2: Redis (Recommended for High Traffic)**
```python
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.redis.RedisCache',
        'LOCATION': 'redis://127.0.0.1:6379/1',
    }
}
```

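Requires Django 4.0 or later and the `redis` Python package: `pip install redis`. (The separate `django-redis` package uses a different backend path, `django_redis.cache.RedisCache`.)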

## API Integration

### NBN Atlas API

**Base URL**: `https://records-ws.nbnatlas.org`

**Endpoints Used**:
- `/occurrences/search` - Fetch occurrence records
- `https://registry.nbnatlas.org/ws/dataResource` - Fetch dataset list (NBN Atlas Registry, a separate host from the base URL)

**Fields Retrieved**:
- occurrenceID
- taxonConceptLSID
- scientificName
- eventDate (UTC timestamp in milliseconds)
- gridReference
- decimalLatitude
- decimalLongitude
- coordinateUncertaintyInMeters
- lifeStage

### Record Cleaner API

**Base URL**: `https://record-cleaner.brc.ac.uk`

**Endpoints Used**:
- `POST /token` - Authentication (JWT token, 15-minute expiry)
- `POST /validate` - Validate record format and data quality
- `POST /verify` - Verify records against biological rules

**Token Management**: Tokens are automatically refreshed 1 minute before expiry.
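
A refresh-before-expiry policy of this kind can be sketched as below. The `TokenCache` class, the injected `fetch_token` callable (standing in for the `POST /token` request), and the injectable clock are hypothetical and are not the actual `RecordCleanerClient` internals.

```python
import time
from typing import Callable, Optional

TOKEN_LIFETIME = 15 * 60  # seconds; the 15-minute expiry stated above
REFRESH_MARGIN = 60       # refresh 1 minute before expiry


class TokenCache:
    """Hold a token and request a new one shortly before it expires."""

    def __init__(self, fetch_token: Callable[[], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_token = fetch_token
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Refresh when there is no token yet or the margin has been reached.
        if self._token is None or now >= self._expires_at - REFRESH_MARGIN:
            self._token = self._fetch_token()
            self._expires_at = now + TOKEN_LIFETIME
        return self._token
```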

## Data Mapping

### Atlas to Record Cleaner Format

| Atlas Field | Record Cleaner Field | Transformation |
|-------------|---------------------|----------------|
| taxonConceptLSID | tvk | Direct copy |
| scientificName | name | Direct copy (fallback if no TVK) |
| eventDate | date | Milliseconds → DD/MM/YYYY |
| decimalLatitude | sref.latitude | Direct copy |
| decimalLongitude | sref.longitude | Direct copy |
| coordinateUncertaintyInMeters | sref.accuracy | Round up to 1, 10, 100, 1000, etc. |
| N/A | sref.srid | Always 4326 (WGS84) |
| lifeStage | stage | Direct copy |
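
As an illustration of the `eventDate` transformation in the table, the sketch below converts a UTC millisecond timestamp to `DD/MM/YYYY`. The function name `map_event_date` is hypothetical; the actual helper in `data_mapper.py` may be named and structured differently.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def map_event_date(event_date_ms: int) -> str:
    """Convert a UTC epoch timestamp in milliseconds to DD/MM/YYYY."""
    # Adding a timedelta to the epoch also handles dates before 1970,
    # which are common in historical occurrence records.
    moment = EPOCH + timedelta(milliseconds=event_date_ms)
    return moment.strftime("%d/%m/%Y")
```

Working in UTC throughout avoids a record shifting by a day depending on the server's local timezone.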

### Coordinate Uncertainty Rounding

The `map_coordinate_uncertainty()` function rounds uncertainty values up to the next power of 10:

- 1-9 → 10
- 10 → 10
- 11-99 → 100
- 100 → 100
- 101-999 → 1000

This ensures compatibility with Record Cleaner's expected accuracy format.
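
A minimal sketch of this rounding is shown below. It reproduces the ranges listed above and assumes a floor of 10; the mapping table also mentions an accuracy of 1, so the real function may treat values at or below 1 differently.

```python
def map_coordinate_uncertainty(uncertainty: float) -> int:
    """Round an uncertainty in metres up to the next power of 10 (minimum 10)."""
    accuracy = 10  # assumed floor, taken from the "1-9 → 10" range
    while accuracy < uncertainty:
        accuracy *= 10
    return accuracy
```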

## Testing

### Running Tests

```bash
# Run all tests for the app
python manage.py test atlas_record_cleaner_report

# Run specific test class
python manage.py test atlas_record_cleaner_report.tests.DataMapperTests

# Run with verbose output
python manage.py test atlas_record_cleaner_report --verbosity=2
```

### Test Coverage

The test suite includes:

- **Data Mapping Tests**: 10 tests covering all transformation functions
- **Atlas Service Tests**: 4 tests for URL parsing and building
- **Record Cleaner Service Tests**: 3 tests for authentication and validation
- **View Tests**: 4 tests for form display, caching, and CSV export

### Test Fixtures

Tests use mocking to avoid real API calls:
- `@patch('requests.post')` for Record Cleaner API
- `@patch(...)` on `fetch_data_resources` for Atlas Registry API (the target must be the full dotted path to the module where the function is looked up)

## URL Structure

- `/record-cleaner/` - Input form
- `/record-cleaner/generate/` - Generate report (POST)
- `/record-cleaner/download-csv/` - Download CSV export

## User Workflow

1. **Select Input Method**:
   - Option A: Choose a dataset from dropdown
   - Option B: Paste NBN Atlas search URL from browser

2. **Submit Form**: Click "Generate Record Cleaner Report"

3. **Processing** (with loading overlay):
   - Fetch occurrences from Atlas
   - Map data to Record Cleaner format
   - Validate all records
   - Verify records that passed validation

4. **View Results**:
   - Summary statistics (pass/warn/fail counts and percentages)
   - Detailed table with all records
   - Color-coded badges for quick scanning

5. **Download CSV**: Export complete results for spreadsheet analysis
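
The summary statistics in step 4 amount to counting results per status and expressing each as a share of the total. A rough sketch, with a hypothetical `summarise` helper and output shape:

```python
from collections import Counter
from typing import Dict, List

STATUSES = ("pass", "warn", "fail")


def summarise(results: List[str]) -> Dict[str, Dict[str, float]]:
    """Return count and percentage for each status in STATUSES."""
    counts = Counter(results)
    total = len(results)
    summary = {}
    for status in STATUSES:
        count = counts[status]
        # Guard against division by zero when there are no records.
        percent = round(100 * count / total, 1) if total else 0.0
        summary[status] = {"count": count, "percent": percent}
    return summary
```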

## CSV Export Format

The CSV includes all original Atlas fields plus Record Cleaner results:

**Columns**:
- occurrence_id
- taxon_concept_lsid
- scientific_name
- occurrence_date
- grid_ref
- latitude
- longitude
- coordinate_uncertainty
- life_stage
- validation_result (pass/warn/fail)
- validation_messages
- preferred_tvk
- verification_result (pass/warn/fail/not_verified)
- verification_messages
- organism_key
- id_difficulty (1-5)
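
Writing rows with these columns can be done with the standard library `csv` module. The sketch below is an assumption about how `csv_service.py` might work, not its actual code; `build_csv` is a hypothetical name, and each row is assumed to be a dict keyed by column name.

```python
import csv
import io
from typing import Dict, Iterable

CSV_COLUMNS = [
    "occurrence_id", "taxon_concept_lsid", "scientific_name", "occurrence_date",
    "grid_ref", "latitude", "longitude", "coordinate_uncertainty", "life_stage",
    "validation_result", "validation_messages", "preferred_tvk",
    "verification_result", "verification_messages", "organism_key", "id_difficulty",
]


def build_csv(rows: Iterable[Dict[str, object]]) -> str:
    """Serialise rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=CSV_COLUMNS,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # unknown keys are dropped rather than raising
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In a Django view the returned text could be sent in an `HttpResponse` with `content_type="text/csv"` and a `Content-Disposition: attachment` header.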

## Error Handling

### Atlas API Errors
- Connection timeout: Caught and displayed to user
- Invalid URL: Validated client-side before submission
- No records found: Friendly message displayed

### Record Cleaner API Errors
- Authentication failure: Error message with troubleshooting
- Token expiry: Automatic refresh before expiry
- Validation/verification errors: Caught and reported

### Form Validation
- Missing dataset selection: Client-side validation
- Invalid URL format: Client-side validation
- Empty search results: Informative error page

## Performance Considerations

### Pagination
- Atlas API called with `PAGE_SIZE` parameter
- Multiple pages fetched automatically if needed
- Configurable maximum records to prevent timeouts
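
The paging loop can be sketched as follows. The `fetch_all` function and the injected `fetch_page(start, page_size)` callable are hypothetical stand-ins for the Atlas request, and the `PAGE_SIZE` value shown is an example rather than the app's actual setting.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 500      # hypothetical value
MAX_RECORDS = 10000  # example limit from the production notes


def fetch_all(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = PAGE_SIZE,
    max_records: int = MAX_RECORDS,
) -> List[Dict]:
    """Fetch pages until the results run out or max_records is reached."""
    records: List[Dict] = []
    start = 0
    while len(records) < max_records:
        # Never request more than is still allowed by the overall cap.
        size = min(page_size, max_records - len(records))
        page = fetch_page(start, size)
        records.extend(page)
        if len(page) < size:
            break  # a short (or empty) page means there is nothing more
        start += len(page)
    return records
```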

### Batch Processing
- All records validated in single API call (up to Record Cleaner's limit)
- Only passing records sent for verification
- Efficient ID-based merging of results

### Caching
- Dataset list cached for 20 minutes
- The Registry API is queried only when the cache is empty or expired, not on every page load

### Session Storage
- Results stored in Django session for CSV download
- Avoids re-processing for export
- Automatically cleared on new report

## Accessibility & UX

- **Loading Overlays**: Clear feedback during long operations
- **Color-Coded Results**: Green (pass), yellow (warn), red (fail), gray (not verified)
- **Truncated Display**: Long fields truncated in table, full data in CSV
- **Breadcrumbs**: Easy navigation back to form
- **Toast Notifications**: Client-side validation feedback

## Future Enhancements

### Potential Improvements
1. **Asynchronous Processing**: Use Celery for large datasets
2. **Progress Indicators**: Real-time progress updates via WebSockets
3. **Report History**: Save and retrieve previous reports
4. **Filtering**: Filter results table by validation/verification status
5. **Pagination**: Paginate results table for very large datasets
6. **Export Formats**: Add JSON and Excel export options
7. **Scheduled Reports**: Automate regular quality checks on datasets
8. **Rule Selection**: Allow users to specify which verification rules to run

## Troubleshooting

### Common Issues

**Problem**: Datasets not loading
**Solution**: Check internet connection and NBN Atlas Registry API status

**Problem**: Record Cleaner authentication fails
**Solution**: Verify that `RECORD_CLEANER_USERNAME` and `RECORD_CLEANER_PASSWORD` in your `.env` file are correct

**Problem**: Timeout on large datasets
**Solution**: Reduce `PAGE_SIZE` or lower the maximum records limit

**Problem**: CSV download shows 404
**Solution**: Ensure session is active and report was generated first

**Problem**: Cache not working
**Solution**: Check Django cache backend configuration in `settings.py`

## Development Tips

### Local Development
```bash
# Activate virtual environment
source .venv/bin/activate

# Run development server
python manage.py runserver

# Access the app in a browser at:
#   http://localhost:8000/record-cleaner/
```

### Debugging
- Set `PAGE_SIZE = 10` for faster testing
- Use Django Debug Toolbar for SQL query analysis
- Check `request.session['report_data']` for stored results
- Check for a cache hit: `cache.get(CACHE_KEY_DATASETS)` returns `None` on a miss

## Security Notes

**Production Checklist**:
- [ ] Keep credentials in the `.env` file or environment variables, never in source code
- [ ] Set `DEBUG=False` in your `.env` file
- [ ] Configure ALLOWED_HOSTS
- [ ] Enable CSRF protection (already enabled)
- [ ] Use HTTPS in production
- [ ] Implement rate limiting for API calls
- [ ] Add user authentication if needed
- [ ] Sanitize user-provided search URLs
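
For the last item, one possible server-side check is to accept only HTTPS URLs on known Atlas hosts before extracting the query parameters. The sketch below is illustrative: `parse_search_url` is a hypothetical helper and the host allow-list is an assumption that should be adjusted to the hosts the app actually supports.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlsplit

# Assumed allow-list; adjust to the Atlas hosts the app should accept.
ALLOWED_ATLAS_HOSTS = {"records.nbnatlas.org", "records-ws.nbnatlas.org"}


def parse_search_url(url: str) -> Dict[str, List[str]]:
    """Validate a pasted search URL and return its query parameters."""
    parts = urlsplit(url.strip())
    if parts.scheme != "https":
        raise ValueError("Search URL must use https")
    # hostname ignores any userinfo, so "allowed-host@evil.example" is rejected.
    if parts.hostname not in ALLOWED_ATLAS_HOSTS:
        raise ValueError("Search URL host is not an accepted NBN Atlas host")
    return parse_qs(parts.query)
```

Rebuilding the outgoing API request from the parsed parameters, rather than forwarding the pasted URL as-is, keeps the server from being pointed at arbitrary hosts.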

## Support & Documentation

- **Record Cleaner API Docs**: https://record-cleaner.brc.ac.uk/docs
- **NBN Atlas API**: https://records-ws.nbnatlas.org
- **Django Caching**: https://docs.djangoproject.com/en/5.2/topics/cache/
- **BRC Contact**: brc@ceh.ac.uk
