Joseph Wilk

Things with code, creativity and computation.

Automatic Admin Systems - Semantics With Rails & Django

The Magically Appearing Admin

Web developers using an MVC framework build their websites by playing with their models, views and controllers. Then, by adding a few lines of magic, an admin system appears which allows users to add/edit/delete/view/search their models.

Examples: Django’s Magic Admin (also NewFormsAdmin – a branch of Django focused on making it easier to customise the auto-admin) and various Ruby on Rails plugins.

Limitations of Database types and Semantics

While we could have an admin that displayed all of a model’s attributes as text inputs, it would be nice if it were a bit more intelligent. We need the magic admin to infer what type of form input should be used from the database types.

If within our database (and hence in the model) we have a

partyDate -> Datetime

Then we would like the admin to display a date styled input.

Django Datetime Input In Auto Admin

We have a limited set of database types:

  • varchar
  • text
  • integer
  • blob
  • datetime
  • timestamp

For very simple magic admins these database types are good enough.

But we find in more complex instances the database types are not expressive enough. Within our models we have an implicit idea about the meaning of the attributes (generally derived from our naming of variables), not just their type.

For example, we may have the concept of an IPAddress within our model. We have a clear idea of the semantic meaning of an IPAddress and of how it should be presented as a form input. e.g.

IP Address Form

The database type is just char(12), so the database types alone are not enough for our magic admin to build a more complex admin.

Django Approach – Explicit & High level

Within Django we explicitly define all our Model fields in Python. We can introduce high level concepts like model attributes of type ‘IPAddressField’ and not worry about the way this maps to the db (Django deals with the transformation between IPAddressField and the database types).

[viewcode] src=../projects/python/examples/model.py geshi=python [/viewcode]

Ruby on Rails Approach – Implicit & Low Level

In Rails our models define relationships between themselves, but we generally do not list all model fields explicitly. That’s one of the beauties of Rails compared to heavy ORM layers.

[viewcode] src=../projects/ruby/examples/Post.rb geshi=ruby [/viewcode]

To quote the Rails Active Record API

“Adding, removing, and changing attributes and their type is done directly in the database”

Hence we cannot easily attach metadata to attributes which are only defined in the database. We can overcome this by listing these fields in the model and attaching metadata to describe the semantic meaning of the fields (rather than using special types for the fields in the model, which keeps ActiveRecord happy).

But does this remove the idea of having lean and simple models?

Thoughts

Django’s high-level model provides a very natural way to attach semantic meaning to model attributes, which the admin can use. It’s also all located in a single place: the model.

Rails has lower-level models embracing the database, and attaching semantic meaning requires us to add more detail to the model, which feels like it’s breaking the DRY principle. Since Rails embraces the idea that the only place certain model attributes are defined is the database, having the attributes mentioned in the model as well feels like duplication.

Streamlined’s attempt (which is defined in a separate file to the model):

[viewcode] src=../projects/ruby/examples/streamlined.rb geshi=ruby [/viewcode]

There are potentially other ways of handling Rails representation of semantic information:

  • Storing the semantics in the database.

  • Imply semantic meaning from naming of database columns (‘myZipCode’ should use zipcode input type in the admin).

  • Customize Migrations to map semantic types to db types.
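The second option can be sketched in a few lines of Python. Everything here is hypothetical: the rule names and input-type labels are made up, and a real implementation would need to plug into the admin’s form rendering.

```python
# Hypothetical sketch of name-based semantic inference: guess an admin
# input type from a column's name, falling back to its database type.
# Rules are an ordered list so more specific hints win first.
SEMANTIC_NAME_RULES = [
    ("zipcode", "zipcode_input"),
    ("zip", "zipcode_input"),
    ("ipaddress", "ip_address_input"),
    ("email", "email_input"),
]

DB_TYPE_DEFAULTS = {
    "varchar": "text_input",
    "text": "textarea",
    "integer": "number_input",
    "datetime": "date_input",
    "timestamp": "date_input",
}

def infer_input_type(column_name, db_type):
    """Pick a form input from the column name, else from the db type."""
    lowered = column_name.lower()
    for hint, input_type in SEMANTIC_NAME_RULES:
        if hint in lowered:
            return input_type
    return DB_TYPE_DEFAULTS.get(db_type, "text_input")
```

So ‘myZipCode’ (a varchar) would get a zipcode input, while ‘partyDate’ (a datetime) would fall back to the date input inferred from its type.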

Rails has the flexibility to mirror Django, so can we find a good way for Rails to represent semantic meaning for models?

Interesting Further Projects

SemanticAttributes – http://code.google.com/p/semanticattributes/

Semantic Web, meet Ruby on Rails – http://www.jroller.com/obie/entry/more_about_ontologies

Latent Semantic Analysis in Python

Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document isolated from the others it looks at all the documents as a whole and the terms within them to identify relationships.

An example of LSA: Using a search engine search for “sand”.

Documents are returned which do not contain the search term “sand” but contains terms like “beach”.

LSA has identified a latent relationship, “sand” is semantically close to “beach”.

There are some very good papers describing LSA in detail.

This is an implementation of LSA in Python (2.4+). Thanks to scipy it’s rather simple!

1 Create the term-document matrix

We use the previous work in Vector Space Search to build this matrix.

2 tf-idf Transform

Apply the tf-idf transform to the term-document matrix. This generally tends to help improve results with LSA.

def tfidfTransform(self):
    """ Apply TermFrequency(tf)*InverseDocumentFrequency(idf) to each matrix element.
        This evaluates how important a word is to a document in a corpus.

        With a document-term matrix: matrix[x][y]
            tf[x][y] = frequency of term y in document x / frequency of all terms in document x
            idf[y] = log( total number of documents in corpus / number of documents with term y )
        Note: This is not the only way to calculate tf*idf
    """
    documentTotal = len(self.matrix)
    rows, cols = self.matrix.shape

    for row in xrange(0, rows):  # For each document
        wordTotal = reduce(lambda x, y: x + y, self.matrix[row])

        for col in xrange(0, cols):  # For each term
            # For consistency ensure all self.matrix values are floats
            self.matrix[row][col] = float(self.matrix[row][col])

            if self.matrix[row][col] != 0:
                termDocumentOccurences = self.__getTermDocumentOccurences(col)

                termFrequency = self.matrix[row][col] / float(wordTotal)
                inverseDocumentFrequency = log(documentTotal / float(termDocumentOccurences))
                self.matrix[row][col] = termFrequency * inverseDocumentFrequency

3 Singular Value Decomposition

SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition

Determine U, Sigma, VT from our MATRIX from previous steps.

 U . SIGMA . VT = MATRIX

We use the scipy svd implementation here. Note that numpy (version 1.0.4) also has an implementation of svd, but I had lots of problems with it; I found it did not work with anything larger than a 2-dimensional matrix.

#Sigma comes out as a list rather than a matrix
u, sigma, vt = linalg.svd(self.matrix)

4 Reduce the dimensions of Sigma

We generally delete the smallest coefficients in the diagonal matrix Sigma to produce Sigma’. Reducing the dimensions of Sigma combines some dimensions, so that a single dimension can depend on more than one term. The number of coefficients deleted can depend on the corpus used. It should be large enough to fit the real structure in the data, but small enough that noise or unimportant details are not modelled.

The real difficulty and weakness of LSA is knowing how many dimensions to remove. There is no exact method of finding the right number of dimensions. Generally the L2-norm or Frobenius norm is used.
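A minimal numpy sketch of this reduction, relying on svd returning the singular values sorted in descending order (as scipy/numpy do); the choice of `k`, the number of coefficients to delete, remains the judgement call discussed above:

```python
import numpy as np

def reduce_sigma(sigma, k):
    """Return a copy of the singular values with the k smallest zeroed.

    svd returns singular values in descending order, so the k smallest
    are simply the last k entries.
    """
    reduced = np.array(sigma, dtype=float)
    if k > 0:
        reduced[-k:] = 0.0
    return reduced
```

For example, reduce_sigma([2.1, 1.5, 0.4, 0.1], 2) gives [2.1, 1.5, 0.0, 0.0], ready to be rebuilt into a diagonal matrix for the reconstruction step.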

5 Calculate the Product with New Sigma’

Finally we calculate:

U . SIGMA' . VT = MATRIX'
#Reconstruct MATRIX'
reconstructedMatrix = dot(dot(u, linalg.diagsvd(sigma, len(self.matrix), len(vt))), vt)

Giving us our final LSA matrix. We can now apply the same functionality used in the vector space search: searching and finding relatedness scores for documents.

LSA In Action – Matrices

We start with our document-term frequency matrix, which is generated from creating a Vector Space Search with four documents (D1-D4): D1: “The cat in the hat disabled” D2: “A cat is a fine pet ponies.” D3: “Dogs and cats make good pets” D4: “I haven’t got a hat.”

     good  pet   hat   make  dog   cat   poni  fine  disabl
D1 [+0.00 +0.00 +1.00 +0.00 +0.00 +1.00 +0.00 +0.00 +1.00 ]
D2 [+0.00 +1.00 +0.00 +0.00 +0.00 +1.00 +1.00 +1.00 +0.00 ]
D3 [+1.00 +1.00 +0.00 +1.00 +1.00 +1.00 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +1.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]

Apply tf-idf transform:

D1 [+0.00 +0.00 +0.23 +0.00 +0.00 +0.10 +0.00 +0.00 +0.46 ]
D2 [+0.00 +0.17 +0.00 +0.00 +0.00 +0.07 +0.35 +0.35 +0.00 ]
D3 [+0.28 +0.14 +0.00 +0.28 +0.28 +0.06 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +0.69 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]

Perform SVD and reduce Sigma’s dimensions (removing the 2 smallest coefficients):

D1 [+0.01 +0.01 +0.34 +0.01 +0.01 +0.03 +0.02 +0.02 +0.11 ]
D2 [-0.00 +0.17 -0.01 -0.00 -0.00 +0.08 +0.35 +0.35 +0.02 ]
D3 [+0.28 +0.14 -0.01 +0.28 +0.28 +0.06 -0.00 -0.00 +0.02 ]
D4 [-0.01 -0.01 +0.63 -0.01 -0.01 +0.04 -0.01 -0.01 +0.19 ]

Note that the word ‘disabl’, despite not appearing in D4, now has a weighting in that document.

Dependencies

http://www.scipy.org/

Problems

LSA assumes term occurrences follow a normal distribution, whereas a Poisson distribution has actually been observed.

Source Code

Available at github:

git clone git://github.com/josephwilk/semanticpy.git

Building a Vector Space Search Engine in Python

A vector space search involves converting documents into vectors. Each dimension within the vectors represents a term. If a document contains that term then the value within the vector is greater than zero.

Here is an implementation of Vector space searching using python (2.4+).

1 Stemming & Stop words

Fetch all terms within the documents and clean them, using a stemmer to reduce them. A stemmer takes a word and tries to reduce it to its base or root. Words which have a common stem often have similar meanings. Example: CONNECTED, CONNECTING, CONNECTION and CONNECTIONS all map to CONNECT.

We also remove any stopwords from the documents. [a,am,an,also,any,and] are all examples of stopwords in English. Stop words have little value in search so we strip them. The stoplist used was taken from: ftp://ftp.cs.cornell.edu/pub/smart/english.stop

 self.stemmer = PorterStemmer()
def removeStopWords(self, list):
    """ Remove common words which have no search value """
    return [word for word in list if word not in self.stopwords]


def tokenise(self, string):
    """ break string up into tokens and stem words """
    string = self.clean(string)
    words = string.split(" ")

    return [self.stemmer.stem(word, 0, len(word) - 1) for word in words]

2 Map Keywords to Vector Dimensions

Map the vector dimensions to keywords using a dictionary: keyword=>position

def getVectorKeywordIndex(self, documentList):
    """ Create the keyword index mapping keywords to positions within the document vectors """

    # Map documents into a single word string
    vocabularyString = " ".join(documentList)

    vocabularyList = self.parser.tokenise(vocabularyString)
    # Remove common words which have no search value
    vocabularyList = self.parser.removeStopWords(vocabularyList)
    uniqueVocabularyList = util.removeDuplicates(vocabularyList)

    vectorIndex = {}
    offset = 0
    # Associate a position with each keyword; this is the dimension of
    # the vector used to represent that word
    for word in uniqueVocabularyList:
        vectorIndex[word] = offset
        offset += 1
    return vectorIndex  # (keyword:position)

3 Map Document Strings to Vectors.

We use the simple term count model. A more accurate model would be tf-idf (term frequency-inverse document frequency).

def makeVector(self, wordString):
    """ @pre: unique(vectorIndex) """

    # Initialise vector with 0's
    vector = [0] * len(self.vectorKeywordIndex)
    wordList = self.parser.tokenise(wordString)
    wordList = self.parser.removeStopWords(wordList)
    for word in wordList:
        vector[self.vectorKeywordIndex[word]] += 1  # Use simple Term Count Model
    return vector

4 Find Related Documents

We now have the ability to find related documents. We can test if two documents are in the concept space by looking at the cosine of the angle between the document vectors. We use the cosine of the angle as a metric for comparison. If the cosine is 1 then the angle is 0° and hence the vectors are parallel (and the document terms are related). If the cosine is 0 then the angle is 90° and the vectors are perpendicular (and the document terms are not related).

We calculate the cosine between the document vectors in python using scipy.

def cosine(vector1, vector2):
    """ related documents j and q are in the concept space by comparing the vectors:
        cosine = ( V1 * V2 ) / ||V1|| x ||V2|| """
    return float(dot(vector1, vector2) / (norm(vector1) * norm(vector2)))

5 Search the Vector Space!

In order to perform a search across keywords we need to map the keywords to the vector space. We create a temporary document which represents the search terms and then we compare it against the document vectors using the same cosine measurement mentioned for relatedness.

def search(self, searchList):
    """ search for documents that match based on a list of terms """
    queryVector = self.buildQueryVector(searchList)

    ratings = [util.cosine(queryVector, documentVector) for documentVector in self.documentVectors]
    ratings.sort(reverse=True)
    return ratings

Further Extensions

  1. Use tf-idf rather than the Term count model for term weightings

  2. Instead of linear processing of all document vectors when searching for related content use: Lanczos methods OR a neural network-like approach.

  3. Moving towards Latent Semantic analysis, Probabilistic latent semantic analysis or Latent Dirichlet allocation.

Third Party tools

The stemmer used comes from: http://tartarus.org/~martin/PorterStemmer/python.txt

And the library for performing cosine calculations comes from NumPy: http://www.scipy.org/

Source

https://github.com/josephwilk/semanticpy.git

Prolog ASLDICN Event Calculus Planner

The event calculus planner used within my thesis was based on Dr. Murray Shanahan’s ASLDICN (Abductive SLD with Integrity Constraints and proof by Negation) planner with compound action support. This planner is an adaptation of one published in one of Dr. Shanahan’s research papers.

http://casbah.ee.ic.ac.uk/%7Empsha/planners.html

The original planner only supports the generation of a single plan. I needed to support conditional planning: I wanted the planner to generate multiple plans representing the different ways of reaching the goal. The problem was how to convert the planner to generate all possible plans, while ensuring that this does not cause infinite looping and that no redundant plan solutions are generated.

My version of the planner adds the following features:

  • Conditional Planning

  • Impossible Predicate

  • Occured And NotOccured predicates

Prolog Event Calculus Planner

Download eventCalculusPlanner.pl

[viewcode] src=../projects/prolog/eventCalculusPlanner.pl geshi=fortran[/viewcode]

Running Prolog as CGI

Prolog can be run as CGI by using a PHP wrapper script which invokes the Prolog engine from within PHP. Prolog can be invoked with an indication of which Prolog files to load and which goals to achieve once loaded.

Executing the following in PHP can spawn a process which runs Prolog.

$cgiOutput = `sicstus --goal $goal. -l "$cgiPrologScriptToLoad"`;

This specific example is for SICStus, but most Prolog command lines have a similar format. Another possibility is to set up Prolog as CGI directly, since any language can be CGI. I was running my code on a Windows box and found it impossible to have Prolog direct the content to the command line and capture it for returning. If you’re going the Unix route you may want to look at PiLLoW’s guide.

For form postings you can catch the post in PHP or a scripting language and create a Prolog-formatted file which is passed to the Prolog script when invoked.
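As a rough illustration of that step (in Python rather than the PHP used here; the `formField` predicate name is made up for the example), the posted fields can be rendered as a small file of Prolog facts:

```python
def form_to_prolog(fields):
    """Render a dict of posted form fields as formField/2 Prolog facts."""
    lines = []
    for name, value in sorted(fields.items()):
        # Quote the value as a Prolog atom, escaping any single quotes.
        escaped = str(value).replace("'", "\\'")
        lines.append("formField(%s, '%s')." % (name, escaped))
    return "\n".join(lines)
```

For example, form_to_prolog({"age": "30", "name": "joe"}) produces two facts, formField(age, '30'). and formField(name, 'joe')., ready to be consulted by the planner script.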

You may want to have Prolog maintain state. This can be achieved by using a database. The database that I have used is Berkeley DB, which SICStus has built-in support for.

Intelligent Workflow Management System

Download PDF Thesis http://www.doc.ic.ac.uk/teaching/projects/Distinguished04/JosephWilk.pdf

This project took the HTML form systems as a model and built a Workflow Management System that uses artificial intelligence planning methodologies and Event Calculus workflow specifications to try to overcome some of the problems of Workflow Management Systems. Logic, server side languages and planning all rolled into one.

The development of the Workflow Management System with AI uncovered interesting issues in modelling situations in the Event Calculus and the problems that need to be overcome to use AI with workflow. The problems and solutions developed in the project cover a wide spectrum of domains, looking at logic programming, server-side languages and getting the two to talk to each other. Areas covered include such interesting topics as typing of HTML to new frameworks for Prolog running as CGI.

Achievements

  • Workflow specification language Using the Event Calculus and extensions to specify workflow.

  • HTML form typing in Prolog A typing engine for ensuring that the HTML form element specifications are correct when used in workflow specifications.

  • A Visualisation tool for Event Calculus plans A tool that generates Scalable Vector Graphic graphs for Event Calculus plans.

  • A HTML/PHP iWFMS engine Using the plans generated from the workflow specifications to support the running and management of a system.

  • A JavaScript plan execution engine Facilitates the following of workflow plans in a scripting language that runs while the user is viewing and interacting with a web page.

  • Logic programming running as Common Gateway Interface (CGI) A framework for the use of high-level declarative programming languages functioning as CGI.

  • Interaction support between PHP and Prolog An interaction model allowing server-side languages used for generating web pages to interact with logic programming languages.

  • A Hospital model working example An example of how the specification can be utilised for a real world scenario in a hospital. Providing the full functionality within the iWFMS to run and manage this system.

PDO & Zend Framework Playing Nicely With MSSQL

The task was to use the MSSQL database adapter (Zend_Db_Adapter_Pdo_Mssql) from the Zend-Framework and ensure it worked on both windows and Unix platforms.

The PDO drivers were a little tricky. The main problem we found was that different drivers require the date to be inserted with different formats and the date that comes back from the db is in different formats.

Aside from having to deal with dates (which we handled as suggested by Bill Karwin: we use Zend_Date and at the last point convert it to a string date, http://framework.zend.com/issues/browse/ZF-181)

and the Limit function not working (http://framework.zend.com/issues/browse/ZF-1037),

we have had no other problems connecting to MSSQL 2005 and MSSQL 2002 SQL Server.

The drivers we use are:

Windows – DB-LIB (MS SQL, Sybase) 5.1.6.6, http://pecl4win.php.net/

Unix – http://pecl.php.net/package/PDO_DBLIB

Under Unix we use our own Zend_Db_Adapter_Pdo_Dblib which just extends Zend_Db_Adapter_Pdo_Mssql. We do this just to change the date format for insertion (we store the format required for a PDO adapter in each Zend_Db_Adapter_Pdo_* and use that when converting the date from a Zend_Date to a string).

As far as PHP/PDO is concerned: under Windows PHP runs pdo (mssql), and under Unix PHP runs pdo (dblib).

I would be interested to hear what the performance is like using ODBC to talk to MSSQL Server. Looking at all the problems we had with dates and inconsistent drivers, going the ODBC route does seem appealing just to get some consistency.

Squid and Members

Task

Use Squid to manage a cache for a website where there are member users (logged in to site) and public users. Squid must cache both member views of a page and public views.

Squid needs to check the authentication of the user and decide whether it should redirect them to a cache for members or for public users. There are only two discrete sets of users, and any content that is specific to individual users is handled via AJAX.

  • Squid will be operating as a transparent proxy.

  • Usernames/Passwords are stored within a MSSQL database.

  • Squid is hosted on a unix box along with Apache

Notes

This project failed in its goal, but why it failed was interesting! The initial solution came to a halt due to Squid not providing the ability to filter return headers to the client’s browser. An application could have been written to do this, but the solution was becoming too complex, with too many bottlenecks and dependencies.

SQUID – Authentication methods

There are 3 different methods Squid provides for authentication:

  1. SAMBA – dealing with auth within a windows environment

  2. SNMP – Simple Network Management protocol

  3. Ident Protocol – Server Daemon on users computer

SAMBA – No windows authentication mechanism within architecture

SNMP – ?

Ident Protocol – Requires a daemon on the user’s computer. Impossible with an open web system.

Apache – Authentication through Proxy_Auth

Three techniques for receiving user credentials.

  1. HTTP Basic protocol – Considered insecure

  2. Digest authentication protocol –

  3. NTLM – proprietary protocol developed by Microsoft

So without a viable authentication method I decided to adopt a Kerberos-like authorisation token. The cookie is created using AES.

The user has a secret key S known by themselves and the Web application.

WebAppToken = Es{ TTL , emailaddress }

The email address is finally attached to the WebAppToken giving:

email@test.co.uk:WebAppToken

The web application uses the email to identify the secret key of the user and tries to decrypt the token. The web application checks that the TTL has not expired.

Note this mechanism is susceptible to replay attacks, with the replay window being the length of the TTL.
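A sketch of this token scheme in Python. The post uses AES, but the Python standard library has no AES, so this sketch signs the { TTL, emailaddress } payload with an HMAC instead of encrypting it; the structure, the email lookup and the TTL check are the same, and the helper names are made up.

```python
import base64
import hashlib
import hmac
import time

def make_token(secret, email, ttl_seconds=3600):
    """Build email:WebAppToken, where the token carries { TTL, emailaddress }."""
    expires = str(int(time.time()) + ttl_seconds)
    payload = expires + "|" + email
    signature = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    token = base64.urlsafe_b64encode((payload + "|" + signature).encode()).decode()
    return email + ":" + token

def check_token(secret, token):
    """Verify the signature with the user's secret key and check the TTL."""
    email, blob = token.split(":", 1)
    expires, payload_email, signature = base64.urlsafe_b64decode(blob).decode().split("|")
    expected = hmac.new(secret, (expires + "|" + payload_email).encode(),
                        hashlib.sha256).hexdigest()
    return (hmac.compare_digest(signature, expected)
            and payload_email == email
            and int(expires) > time.time())
```

The plaintext email prefix plays the same role as in the original scheme: it lets the web application look up which secret key S to verify the token with.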

Squid’s ACL

An external ACL script was used to control access to the redirector. Hence access to the redirector implies that the user had valid permissions to be a member.

external_acl_type type-name [options] format helper-command

Squid Redirectors.

  • Squirm

  • External Script

An external script was selected due to pressing time constraints: a simple Python script which rewrites URLs to be member URLs. Any request reaching the redirector is assumed to be from a member, due to the ACLs.

/file/101010101010/filename.html

becomes

/member/101010101010/filename.html

Apache has a mod-rewrite rule:

RewriteRule ^/file/(.*)$ /file.php?controller=$1 
RewriteRule ^/member/(.*)$ /file.php?controller=$1&member=true

Architecture

Squid Config

external_acl_type WebAppTokenCheck ttl=1 concurrency=10 %{Cookie} /home/esw/squid/acl/WebAppACL.php 
acl MemberCookieCheck external WebAppTokenCheck 
#Only allow redirection on those pages that pass security test 
redirector_access allow MemberCookieCheck 
redirector_access deny all 

#Redirection 
redirect_program /home/esw/squid/redirectors/redirectors.py 
redirect_children 5

Problems (sigh)

  • Client –> Squid request

  • Squid –> Apache

  • Apache –> Squid

  • Squid –> Client The response from Apache must have headers indicating cache settings. These are used by Squid to identify how long, and whether, it should cache the response.

These headers get returned to the user’s client, and their local browser detects the headers in the response and caches the file locally. Hence the client will not make another request to Squid until the page has expired or a refresh is forced.

The client always needs to send its requests to Squid, as the state of the page is decided at Squid (member/non-member).

It is possible to filter the response headers in Squid 3.0 via:

reply_header_access

Squid 2.5 is the current deployment version of Squid. So unless there is an alternative way to alter response headers, we need to move to Plan B.

OpenId

OpenID is an open, loosely distributed single sign-on protocol. It looks at why Microsoft’s single sign-on has not taken off on a large scale, concluding that no-one wants a single company storing all their details; hence it creates a distributed single sign-on protocol.

OpenIDs take the form of URLs:

exampleuser.livejournal.com

OpenID 1.1 Protocol Summary

OpenID specifications: http://openid.net/specs.bml

The OpenID protocol 1.1 specification in summary:

  • Identify the Identity Provider associated with the OpenID submitted by the End User.

  • Agree a shared key between the Consumer and the Identity Provider.

  • Redirect the End User to the Identity Provider to authenticate themselves with a password.

  • The End User gets redirected back to the Consumer with authentication data signed with the shared key.

Dumb mode –> The Consumer asks the Identity Provider whether the authentication data is valid.

Smart mode –> The Consumer checks the signed data itself with the shared key.
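Smart mode can be sketched as follows, assuming the association produced an HMAC-SHA1 key (as in OpenID 1.1) and using the spec’s key:value newline token format for the signed fields; the function name is our own:

```python
import base64
import hashlib
import hmac

def verify_signature(shared_key, signed_fields, response, claimed_sig):
    """Recompute the HMAC-SHA1 over the signed fields and compare.

    `response` maps field names (e.g. 'mode', 'identity') to the values
    the Identity Provider sent back; `claimed_sig` is openid.sig.
    """
    token = "".join("%s:%s\n" % (name, response[name]) for name in signed_fields)
    expected = base64.b64encode(
        hmac.new(shared_key, token.encode(), hashlib.sha1).digest()).decode()
    return hmac.compare_digest(expected, claimed_sig)
```

No round trip to the Identity Provider is needed, which is the whole point of smart mode: the shared key agreed during association stands in for the check-authentication request of dumb mode.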

Diffie-Hellman (http://en.wikipedia.org/wiki/Diffie-Hellman_key_exchange) can optionally be used when agreeing the shared key.
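The mechanics of that exchange, with toy numbers; real deployments use a much larger prime modulus, and the function names here are our own:

```python
# Toy Diffie-Hellman: each party keeps a private exponent, exchanges
# only the public values, and both arrive at the same shared secret.
def dh_public(generator, prime, private_key):
    """The value each party sends over the wire."""
    return pow(generator, private_key, prime)

def dh_shared(their_public, prime, private_key):
    """The shared key, computable by both parties but not by an eavesdropper."""
    return pow(their_public, private_key, prime)
```

With prime 23 and generator 5, private exponents 6 and 15 exchange publics 8 and 19 and both derive the same shared secret, which can then protect the association key.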

Trust

OpenID does not cover the concept of trust. Anyone could create an OpenID Identity Provider, create an ID and then use it to log in.

Hence it is important to realise that the OpenID library implementations do not cover trust. A whitelist should be made which is then used to validate users’ Identity Providers and decide whether the Consumer should trust them.

OpenID Provider Server URLs

There are other options than using white/black lists. Some of the more interesting ideas floating around concern the use of reputation.

Reputation for OpenID http://www.windley.com/archives/2007/03/reputation_for_openid.shtml

Wired – Herding the mob http://www.wired.com/wired/archive/15.03/herding.html

Security Issues (On-going analysis)

Malicious End User – Denial Of Service

A malicious End User could have an OpenID which resolves to a malicious internal network address. This connection could be, for example, a Tarpit (http://en.wikipedia.org/wiki/Tarpit_%28networking%29). The Consumer gets stuck in the tar pit, possibly timing out due to exceeding a maximum execution time.

Using a paranoid HTTP Library can help protect against this issue.

http://search.cpan.org/~bradfitz/LWPx-ParanoidAgent-1.02/lib/LWPx/ParanoidAgent.pm

Secret keys passed plain text

It is optional to agree the shared key using Diffie-Hellman. Without Diffie-Hellman, the key could be sniffed passing between the Consumer and the Identity Provider.

Replay attacks

Nonces (http://en.wikipedia.org/wiki/Cryptographic_nonce) are not an integral part of the OpenID protocol.

“OpenID Consumer’s SHOULD add a self-signed nonce with Consumer-local timestamp in the openid.return_to URL parameters to prevent replay attacks. Details of that are left up to the Consumer.”

Consumers can operate in dumb mode, meaning they store no state. Without storing state it is not possible to protect against replay attacks: the Consumer has no history of previous nonces, hence it cannot detect an old nonce used in a replay attack.
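A sketch of the state a Consumer would need: a store of recently seen nonces, with old entries expired so the store stays bounded. The class name and window size are our own.

```python
import time

class NonceStore(object):
    """Remember recently seen nonces and reject any repeat within the window."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # nonce -> timestamp first seen

    def accept(self, nonce, now=None):
        now = time.time() if now is None else now
        # Drop nonces older than the window so the store stays bounded;
        # a timestamp check elsewhere must reject responses older than it.
        self.seen = {n: t for n, t in self.seen.items() if now - t < self.window}
        if nonce in self.seen:
            return False  # replay
        self.seen[nonce] = now
        return True
```

This is exactly what dumb mode cannot do, which is why the quoted recommendation pushes the nonce into openid.return_to and leaves the details to the Consumer.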

Nonce Passing

The nonce generated by the Consumer can cross the network 3 times in plaintext as GET parameters.

  • Consumer Request to Identity Provider.

  • Identity Provider Redirection sent to End User (telling them to redirect to the Consumer)

  • End User request to Consumer.

The greater the exposure of the nonce, combined with the potential to sniff the shared key, means a malicious attacker could create a fake/replay/pre-play response that the Consumer would accept.

Delegating Authentication

OpenID supports a URL delegating to an Identity Provider. The URL delegates by providing ‘link’ tags in the HTML at the URL location.

Example:

<link rel="openid.server" href="http://www.livejournal.com/openid/server.bml">
<link rel="openid.delegate" href="http://exampleuser.livejournal.com/">

With a single point for delegation, which is resolved without user interaction, there is a risk that if this homepage is compromised the delegation could be changed to point at a malicious Identity Provider. This change could go unnoticed due to the automatic resolution.

This is a Trust issue and can be solved using whitelists and phishing protection at the Consumer.

Phishing

With the redirection from one login page to another there is a worry about phishing. How do you know you have not been redirected to some malicious sign-in page which steals your details rather than logging you in? URLs are an answer, but a lot of people do not examine URLs. This problem is not part of the OpenID protocol but rather a problem existing for all internet sites.

Verisign has a Firefox plugin, SeatBelt, which attempts to detect if the OpenID site is legitimate: https://pip.verisignlabs.com/seatbelt.do

Single Sign on – Break a single password = access to all accounts.

A person on average has 18 user accounts and 3.47 passwords.

So people are using the same password for multiple accounts already. Single email accounts are tied to many user accounts. Since forgotten-password reminders are sent to an email address, the compromise of an email account can mean access to many other user accounts.

Single sign-on is a risk and it would be safer to have separate user accounts each with different passwords. This however does not fit with how people are using the internet.

It’s a compromise of convenience vs security, and that decision needs to be made on a per-solution basis.

Security Summary

The OpenID specification has optional features which if not used decrease the security of the system.

There are also security issues that are outside the OpenID protocol.

When using OpenID it is important to assess the security requirements of the problem and ensure that library implementations provide these optional features.

Different OpenID libraries implement the protocol and go beyond it, dealing with issues like nonces being passed in the clear too many times. Assess each library’s support for these issues.

Work outside of OpenID is required to deal with the issue of trust.

Phishing remains an issue.

Openid Libraries

There are currently library implementations of OpenID available in:

  • C#

  • C++

  • Java

  • Perl

  • Python

  • Ruby

  • PHP

  • ColdFusion

Many libraries go beyond the OpenID protocol and add protection against replay attacks.

Yadis Protocol

Although not part of the specification, OpenID can use the Yadis (http://yadis.org/wiki/Main_Page) protocol:

The Yadis protocol enables discovery of service definitions from an http:// or https:// URL. The protocol consists of performing HTTP requests to obtain a Yadis Resource Descriptor.

This enables Identity Providers to be discovered through an OpenID. For our example OpenID we could look at the URL: exampleuser.livejournal.com

http://exampleuser.livejournal.com

Links

http://openid.net/

http://en.wikipedia.org/wiki/Diffie-Hellman_key_exchange

https://pip.verisignlabs.com/seatbelt.do

http://yadis.org/wiki/Main_Page

http://en.wikipedia.org/wiki/Cryptographic_nonce

http://en.wikipedia.org/wiki/Tarpit_%28networking%29

Curl and Certificates With Windows PHP

Curl on a Windows PHP installation does not know where to look for certificates, so when you try to curl an https URL it fails. The default value for CURLOPT_SSL_VERIFYPEER is true, which means curl will always try to validate SSL by default. I discovered this while working with an OpenID library (v1.2.3): http://openidenabled.com/php-openid/

There is the option of disabling the verification:

$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

But that’s ignoring the problem and opening a security hole! Instead download a reputable certificate bundle file, for example: http://curl.haxx.se/docs/caextract.html

Then set CURLOPT_CAINFO with the location of your certificate bundle.

if (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN') {
    curl_setopt($ch, CURLOPT_CAINFO, 'C:/certificates/cacert.pem');
}
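For comparison, the same principle in Python’s standard library: leave verification on and point the TLS context at a CA bundle (the path is illustrative, e.g. curl’s cacert.pem), rather than switching verification off.

```python
import ssl

def make_verified_context(ca_bundle=None):
    """TLS context with verification left on; pass a CA bundle path if the
    platform has no usable default certificate store."""
    context = ssl.create_default_context(cafile=ca_bundle)
    # create_default_context keeps CERT_REQUIRED and hostname checking on,
    # which is exactly what CURLOPT_SSL_VERIFYPEER=false throws away.
    return context
```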