UESPWiki:File Database Design

A UESPWiki – Sua fonte de The Elder Scrolls desde 1995

Overview

Currently any files available for download on the site are manually ftped to the server and hard-linked. This greatly limits the amount of files we can host simply due to the effort involved in maintaining these files and prevents us from easily including content such as Morrowind and Oblivion mods. A simple database system, preferably integrated with the existing Wiki, needs to be developed to allow file downloads to be easily added and maintained.


Basic Requirements and Ideas

Important Requirements

The following are the more important requirements for whatever system is used.

  • Integrated with the Wiki as much as possible (reuse Wiki user information).
  • Web front end for adding new files and editing existing ones.
  • Track basic stats such as number of downloads, last update, etc....
  • Each file is associated with an owner (Wiki user) that can edit and update the file details.
  • File information is stored in database form but the actual files are stored in the file system.
  • Both the file database and files themselves must be capable of being backed up and restored in some manner.
  • Simple search capabilities.

Other Ideas

The following are less important ideas that may be desired in the future.

  • Optional ability to upload files via FTP (useful for large files).
  • Some form of version tracking may be nice. Instead of having different versions of the same download being seperate, combine them into a single download resource record. By default the newest version is downloaded, but older versions may still be available.
  • Possibly use a custom domain for storing of files (ex: files.uesp.net) rather than a local/absolute path.
  • Files can belong to multiple categories.
  • Simple rating system (0-5 stars, default at 3).
  • Optional comments thread for each file.
  • Optional Screenshots (useful for mods but probably not worth it initially).
  • Add a MediaWiki extension for easily adding file links to pages (<file>SomeFile</File>).

Design Concepts

1. Existing Wiki Add-on/Extension 
The Wiki itself is capable of handling binary files (through the Image: namespace) but doesn't provide anything to properly organize and manipulate 1000s of files. The binary file capability could be expanded. The main advantage to this design is that almost no work needs to be done as everything is handled by the Wiki. The major downside is that the file uploading/downloading and file organization of the Wiki is extremely basic and may lead to problems later on. Extending the Wiki to better handle binary files is a possibility.
2. Article for Each File 
File information could be stored in a Wiki article with just the file itself having to be stored somewhere. This is nice from the user perspective and requires relatively little effort to implement. This still requires a good amount of custom work to be added as well as system for handling the file uploads and downloads. It does save us the effort of managing the file page content as it is handled by the Wiki.
3. Reusing Image Namespace 
Simply use the unmodified Image: namespace to handle files.
4. Modified Image Namespace 
Use the existing Image: namespace and tweak it slightly to better handle general file uploads/downloads.
5. Custom File Namespace 
This design uses a system similar to the existing behavior of the Image: namespace. A custom File: namespace will be created with additional handling for the extra file information.
6. Custom File Database
The most complex design but also the most flexible as it allows essentially any content and presentation without having to conform to the existing Wiki structure.

For the remainder of this design we will be generally considering options #3/4/5 as they appear to be the best compromise between a fully custom solution and integration into the existing Wiki code.

Reusing Image Namespace

Simply using the exisiting image namespace is the easiest solution as it requires no modification.

Advantages Disadvantages
  • No modification required
  • Reuses existing and well tested code
  • Image namespace is meant primarily for images and not binary files in general
  • Would be better to seperate image and binary file content
  • Image articles not intended for binary files (possibly confusing)
  • Many desired features are not possible or difficult to implement (proper versioning, patrols, seperate upload/download servers, etc...)
Example: Ficheiro:Test.zip

Modified Image Namespace

This extends the previous design concept further by taking the existing Image: namespace and modifying it slightly to better handle binary files.

Advantages Disadvantages
  • Limited modifications required
  • Reuses existing and well tested code
  • Can relatively easily change the format of the image articles for binary file content
  • Amount of tweaking is limited to small changes (larger features may render it too difficult and complex)
  • Would be better to seperate image and binary file content
  • Still leaves some important features behind
Example: UESPWiki:File_Database_Design/Example2

Custom File Namespace

This design adds a custom File: namespace that operates in a manner similar to the Image: namespace except specifically for binary files.

Advantages Disadvantages
  • Can possibly reuse some of the existing image code
  • Can easily modify file article content and implement desired features
  • Large amount of modification required
  • Possible confusion between image upload and file upload (requires clarification of links in the rest of Wiki)
Example: UESPWiki:File_Database_Design/Example2

File Record Information

The following is an incomplete list of data that can possibly be stored for each file record.

  • File name (no path information)
  • File size (cached for easy access)
  • Description
  • Last update date/time
  • Original record creation date/time
  • Author's user ID (preferably a Wiki user ID if possible)
  • Download count
  • RatingSum and RatingCount (used to compute an average rating)
  • Version information (depending on how we implement this)
  • Requirements (useful to list required Expansions for mods)
  • Category list
  • Contact information (may not be needed if available from the Wiki user information)
  • Website (useful for mods or utilities that have their own website/forums)
  • Alternate resource (another place the mod can be downloaded)


Database Structure

  • Wiki Database (existing)
    • User Table (existing)
      • UserID (long, PK)
      • ...
    • Files
      • ID (long, PK)
      • UserID (long, FK)
      • Name (text)
      • UpdateDate (date/time)
      • CreationDate (date/time)
      • Description (text)
      • Downloads (long)
      • RatingSum (long)
      • RatingCount (long)
      • VersionID (long, FK)
      • Website (text)
    • FileCategory
      • ID (long, pk)
      • Name (text)
      • Description (text)
    • FileCategoryLink
      • CategoryID (long, FK)
      • FileID (long, FK)
    • FileVersions
      • ID (long, PK)
      • FileID (long, FK)
      • Name (text)
      • Description (text)
      • ReleaseDate (date/time)
      • Filename (text)
      • Size (long64)
      • Requirements (text)

File Table Discussion

The file table holds the basic information about the file that won't change (or won't change much) from version to version. Some of the information here could be technically stored in the file version table (ratings for example) but this would likely only be useful in the rare case that a file changes dramatically between versions (in which case a new file record can be created instead). The specific file information (such as filename and size) is stored in the file version table which allows an arbitrary number of versions to exist for each file. The current/default version is stored in the VersionID field of the table which allows the owner of the file to be able to explicitly state which file is the default (merely choosing the most recent version by date could possibly cause problems).

File Category Discussion

The categories for a file are stored in a seperate table which allows each file to belong to an arbitrary number of them. The FileCategory table holds a list of the currently available categories. This would include the applicable game (Morrowind, Oblivion, None, etc...) and any other specific types (plug-in, utility, editor, quest pack, etc...). New categories can only be created/edited by the database administrator.

The FileCategoryLink table connects one file to a specific category. To get all categories for a file then requires a join to be performed between this table the file table, matching the specific FileID. Performance wise this shouldn't be a concern with a proper index. Even with 10,000 files with an average of 3 categories each the table size will be ~30,000 records.

File Version Discussion

The FileVersions table holds the information for each version of all files in the File table. This design allows an unlimited number of versions to exist for each file. The version of the file is stored in the Name field. The Filename field holds the actual filename which will be used to download the file from file server (ex: files.uesp.net). Relative paths may be added to the filename to better organize the file server rather than having 1000s of files in a single directory. Specific version information can be stored in the Description and Requirement fields. The ReleaseDate field stores when the version was made available (not necessarily the creation date of the version record).

Possible Issues

File Organization

The given design of file categories is good in that's quite simple and allows unlimited categories per file. A problem arises, however, when actually storing the file on the file server as all files would naturally be stored in a single directory. This isn't a huge deal, but when (or if) we start having to deal with 1000s, or 10000s of files in a single path it may being to be annoying.

It would be nice to have a very basic directory structure for the file locations to indicate their general type: \Morrow\Mods, \Morrow\Editors, \Morrow\Patches, etc.... This would make any manual maintenance and backup procedures easier. With the given category design, however, implementing this structure is non-trivial. It is likely possible to general a semi-appropriate path from a file's list of categories, but this would probably be a complex algorithm and also makes changing a file's category non-trivial (instead of just editing the database you now have to worry about physically moving a file).

A design that makes choosing a file's physical location easier to determine involves using two forms of category information. A seperate category field would be added to the FileVersion table (or File table) which would specify the file's base category (Morrowind-Mods, Daggerfall-Patches, etc...). The existing category design would still be used for specifying additonal category levels as needed. This also as the side effect of reducing the size of the FileCategoryLink table considerably. The splitting of category information does make some things more difficult, such as the searching specific categories for example.

Wiki Upgrades

By storing the file related tables in the wiki database and reusing the wiki User table, we introduce a possible issue when upgrading to future Wiki versions. There is no guarantee that future versions will use the same user structure or that the upgrade procedure won't nuke our custom file tables. We should look at how flexible and forgiving a Wiki upgrade is to the database and ensure that including the file tables to the Wiki database won't be regretted.

Wiki Extension

A Wiki extension allows functions to be developed to parse custom tags, <newtag></newtag>, in article the markup. There is a potential to use this capability to easily link files from article text.

  • Tag attributes may or may not be allowed (must confirm either way). If possible it would make adding a file tag trivial, <file id="1234" version="1.5">Some File</file>
  • Files can potentially be referred to by their unique ID or their name. Referring to files by name might be easier from a user standpoint but it causes problems since the name is not necessarily unique and can be modified.
  • Referring by file ID prevents us from specifying anything other than the current version of the file.
  • Referring to a file by ID yields a tag format of: <file>12345</file>. The custom extension function would replace this with a valid URL for the file: [a href="files.uesp.net/download.php?id=12345"]Some Filename[/A].

Web Interface

A web interface for the adding, editing, downloading, and browsing of files in the file database will need to be created.

Possible Features

  • Add new file (requires user login)
  • Browse files (table format with basic information, sortable)
  • Search files (simple text search with basic category selection)
  • View file details (view all available information on a file including all versions)
  • Download file
  • Edit file information (only available to sysops and the file author)
  • Add file version (sysops and file author only)
  • Edit file version information (sysops and file author only)
  • Database statistics (read only)

Detailed Design

The following is more specific design notes for a file upload/download system using a custom File: namespace operating similarily to the existing MediaWiki Image: namespace.

Sample Procedures

The following are example procedures on how the file upload/download system should function.

Upload a New File

  1. User uploads a new file selecting the initial article name and optionally any categories and license.
  2. New file page is created with default content (much like a new image page). The file is not yet available for download until verified (patrolled).
  3. A file patroller checks the new file upload request. This includes checking the content to ensure it is what is says it is, is not corrupted, is not copyrighted material, scanning for virus/trojan, selects an appropriate filename, etc.... If the file is verified it now becomes available for download and shows up in searches. If refused the file is either requested to be fixed and downloaded again or the file page deleted (depending on why it was refused).
  4. Upon verification the file is moved/copied from the temporary upload location to the permanent download location.

Upload a New Version of Existing File

  1. User uploads a new version of an existing file by clicking on a link from the file's page (much like reloading a new image).
  2. The new version is subject to the same verification as a new file with the addition of verifying that it is indeed a new version of the given file.
  3. Upon a successful verification the new version is moved/copied to the permanent download location and is now available.
  4. Upon a failed verification the new version info is simply deleted.
  5. The default version for a file will either be the most recently released (not uploaded) or might be set manually.

Download a File

  1. Each file page will contain a conspicuous download link for the current version.
  2. Older versions can also be downloaded from the file page.
  3. Downloading will forward you to a short page (for statistic purposes) which will automatically begin the download.
  4. Linking to a file download should be made to this itermittant page in order for the file statistics to be updated.

Find a File

  1. Article name and text should be searched by the default search function.
  2. A custom search function for just files can be written if additional functionality is required.

Delete a File or Version

  1. A deletion request is handled similarily as a request to delete an image page.
  2. If a file version is to be deleted the specific version file and database records are moved into an archive location where they can be later restored if needed.
  3. Similarily, a complete file deletion would move all file versions and database records to an archive location.

Database Structure

The following is the initial database structure for the basic file upload/download system. Not all features are included and the intention will be to add them at a later date.

Files

The files table stores the basic information on each file in the same manner that the existing image table stores data on images.

Field Type Description
file_id int(8) Unique identifier for the file.
file_name varchar(255) The file article title.
file_versionid int(8) The default/current file version or NULL if there are no current versions.
file_downloads int(8) Total downloads for all versions of this file. This should just be a sum of the version download counts but is included for simplicity.
file_author varchar(255) Author name and optionally an e-mail or website link. This is not a Wiki username. Allows Wiki code or HTML.

File_Patrollers

This table serves the same purpose as the existing patrollers table used for patrolling articles. It is used to ensure only that one file patroller sees a particular file upload request.

Field Type Description
fileptr_versionid int(8) The version record being patrolled.
fileptr_timestamp varchar(14) The timestamp for the start of the patrolling session.

FileVersions

This table holds information on the specific versions available for each file.

Field Type Description
filever_id int(8) Unique identifier for the file version.
filever_fileid int(8) The parent file this version belongs to (cannot be NULL).
filever_title varchar(255) Short description identifying this version.
filever_userid int(8) User that uploaded this version.
filever_releasets varchar(14) The release timestamp of this version (if available, otherwise NULL). The format matches the MediaWiki timestamp fields (YYYYMMDDHHmmSS).
filever_uploadts varchar(14) The upload timestamp of this version.
filever_name varchar(255) The filename of this version as used in the file system.
filever_size int(16) The size of the file in bytes.
filever_verified tinyint(1) Has this version been verified yet? Unverified versions will not be available for download.
filever_downloads int(8) Number of times this version has been downloaded.

Archives

When deleting versions or entire files the data will be moved into archive tables rather than simply removed. The archive tables (archive_files and archive_fileversions) will have the identical structure as their matching tables.

Pages

The following describes the needed custom and modified MediaWiki pages/files needed along with their requirements.

Special:FileUpload

Similar to the Special:Upload page for uploading images except for uploading files.

Special:Upload

This might be modified to forward to the new Special:FileUpload page if the input filename is not an image/movie/sound file. It is also possible that no Special:FileUpload page will be created but all images and files will be downloaded from this single page.

File.php

The main file for serving up file content, both the actual file page as well as for downloading.

Special:Log

The logging page may need to be updated to support proper output of file uploads and patrols.

Special:FilePatrol

A new page for patrolling (verifying) file uploads needs to be created. On this page the patroller can download new uploads, perform the necessary checks and then verify or refuse the upload. This should be very similar to the existing Special:Patrol page.

File System

The following describes the file system specific design.

Uploads

All uploads will be in their own directory until their are verified (moved) or refused (deleted). This directory may also be available for uploads via anonymous FTP (upload.uesp.net). Preferrably uploads should be saved using the lighttpd web server to reduce the load of the main Apache server.

Downloads

Once verified all files will be moved from the upload directory to their permanent download location.

  • The directory structure will have some manual structure as choosen by the file patrollers. Files will be roughly organized by game and type. This organization is optional and has no impact on the Wiki side of things but may be needed if the download site is available via FTP.
  • The filename is largely irrelevant and is not used to uniquely identify the file (the file article name and file/version ID is used instead).
  • The filename can be changed at anytime by an admin or file patroller.
  • Downloads may be accessible via anonymous ftp (downloads.uesp.net).
  • The lighttpd web server will be used to serve file content and reduce the load of the main Apache server.
  • Bandwidth/connection limiting will likely be needed to prevent download abuse of a single user.

Archives

When a file/version is deleted the actual file will be moved into an archive location in case it needs to be undeleted at a later date. The archives will not be pubically accessible.

Backups

Backups of the files can be done in a similar manner to the image backups, rsync via a private NFS share.

Scalability

Initially all three file systems (uploads, downloads, and archives) will be on a single server. In the future, however, it may be necessary to seperate one or more of these to their own server. The system must be capable of supporting this without requiring too much effort.

  • Uploads can easily be moved to their own server as it is essentially just a file storage. Files can be moved to the download server once verified via a NFS (or similar) share or via FTP. The only obvious issue is timing...once verified it may take a significant amount of time before a file has been copied from the upload to the download server. To avoid this the file transfer can be first to a temporary file which is then moved to its permanent name once the transfer is complete.
  • The download server will likely see the most traffic by far and is just a file server. The only special features required is the ability to move/copy files from the upload server and to the archive server.
  • Similarily the archive server only needs to be able to copy files from the download server and occasionally copy them back to the download server. This server does not require any public access and does not actually serve any of the files it stores.
  • If may be possible to code the download source to check the load/availability of the download server and use the backup (or mirrors) to serve a file. This requires a live backup of the file downloads to be in place. Multiple download servers can be implemented in this case with the one with the lowest load used.