We all have trouble filling in computerized forms that offer a choice of options, none of which fits the bill. So I was very much amused by the blog post "Misinformation and muppets" by Azaria Frost. She describes the problems of getting car insurance when your occupation is not on the list. Several people commented on how a different description such as "office worker" - which could apply to many people who also have a specialist title - can reduce your car insurance premium. One comment also explains that one of the reasons is the need to share information.
The problem is pretty universal. While it is useful to classify information into boxes, the real world is not like that. Often there is no clear borderline - for instance, the classification of living things into families, species and subspecies is always going to turn up intermediate cases. Attempts to analyse such data statistically can throw up problems. One of my first blog posts was "I discover Babel's Dawn", which had a review entitled "Last Common Language was in Africa" which I find very interesting. However, when you look in depth at the original paper, "Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa", there are problems with the quality of the source data, which comes from classified data about a large number of different languages. While this does not invalidate the analysis, the results must retain a considerable degree of uncertainty because of the classification assumptions which underlie the research.
The important thing to remember is that in classifying and statistically analysing any body of data there are likely to be difficult cases. Most computer systems simply sanitize the data by shoe-horning the "messy data" into a standard category - and "conveniently" forgetting that there was ever any uncertainty.
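
To make the point concrete, here is a rough sketch in Python of what a system could do instead of forgetting. It is only an illustration: the list of occupations, the `Classified` record, the `classify` function and the fallback category are all made up for this example and do not come from any real insurance form. The idea is simply to keep the original answer and a flag saying whether it really fitted the box.

```python
from dataclasses import dataclass

# Hypothetical list of options a form might offer (an assumption for this sketch).
KNOWN_OCCUPATIONS = {"office worker", "teacher", "engineer"}


@dataclass(frozen=True)
class Classified:
    raw: str       # what the person actually entered
    category: str  # the box the value was placed in
    exact: bool    # False when the value had to be shoe-horned


def classify(raw: str, fallback: str = "office worker") -> Classified:
    """Put a free-text occupation into a box, but remember how well it fitted."""
    cleaned = raw.strip().lower()
    if cleaned in KNOWN_OCCUPATIONS:
        return Classified(raw=raw, category=cleaned, exact=True)
    # No matching box: use the fallback category, but keep the original
    # text and record that this classification is uncertain.
    return Classified(raw=raw, category=fallback, exact=False)


records = [classify(entry) for entry in ["Teacher", "Palaeographer"]]

# The difficult cases are still visible to anyone analysing the data later.
uncertain = [r.raw for r in records if not r.exact]
```

With something like this, a later statistical analysis can at least count how many records were forced into a category, rather than treating every entry as equally reliable.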