It’s got me totally flummoxed, Nick. Can you give us a few examples of what one can read from it?
conrad
13 years ago
Mike,
I wondered that too. Here’s my guess. First, UDC = Universal Decimal Classification (a souped-up Dewey decimal system), which you might remember from the old days when you actually had to look up real academic books in a library versus just clicking on the electronic link (god knows why they still make hard copies of most of these).
The middle graph is the simplest. It’s just the UDC codes lined up, and I think the numbers represent the single digit added to what the UDC code would be (so 0, 1, 3, 5, 6 for psychology mean 150, 151, 153…, since psychology starts at 150; I assume not all are there since some would have lines that are very small under them). The lines represent the overlap between the categories in Wikipedia and the UDC system. The bigger the line, the more overlap, although it’s not clear to me how category overlap is actually calculated. It says “..corresponds to how many times a category term is found in a UDC main class”, but since the categories and classes don’t align perfectly (i.e., some exist in one but not the other), something else must be done here to account for this. Perhaps someone else knows. Maybe they just chuck lots of entries out.
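To make the “how many times a category term is found in a UDC main class” idea concrete, here is a rough sketch of one way such a count could work. This is only a guess at the approach, not the method behind the graphic: the class vocabularies, the wiki terms, and the `count_overlap` helper are all made up for illustration, and exact word matching is an assumption.

```python
from collections import Counter

# Hypothetical vocabularies for two UDC main classes (invented for illustration).
udc_terms = {
    "1 Philosophy. Psychology": {"philosophy", "psychology", "ethics", "logic"},
    "7 Arts. Recreation": {"art", "music", "painting", "sport"},
}

# Hypothetical Wikipedia category terms (also invented).
wiki_terms = ["music", "ethics", "painting", "football", "logic", "music"]


def count_overlap(wiki_terms, udc_terms):
    """Count how often each wiki term appears in a UDC class vocabulary.

    Terms that match no class are collected separately, which is one way
    entries could end up being "chucked out".
    """
    counts = Counter()
    unmatched = []
    for term in wiki_terms:
        hits = [name for name, vocab in udc_terms.items() if term in vocab]
        if not hits:
            unmatched.append(term)
        for name in hits:
            counts[name] += 1
    return counts, unmatched


counts, unmatched = count_overlap(wiki_terms, udc_terms)
print(counts)
print(unmatched)
```

In this toy version, a term like “football” that appears in no UDC vocabulary simply drops out of the count, which is the kind of mismatch I was wondering about above.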
I think the lower circle represents some of the categories and how they overlap. The inner circle, which doesn’t have labels, is just the size of the different UDC categories, each represented by a color, i.e., what there is more or less of (too bad no-one apart from librarians can probably remember which colors represent what). The outer circle is the wiki categories. I think, quite possibly incorrectly, that they are aligned category-wise (of course, since I don’t know what the colors of the UDC are, this is rather pointless to me).
The top graphs are very fancy graphs that try to represent high-dimensional data (i.e., 43 dimensions for wiki and 9 for UDC) in 2 dimensions. This is a lot like multi-dimensional scaling, but they are using a very fancy algorithm to do it versus the more traditional ones. The basic idea is that if you start with a priori categories, and then try to arrange the items in two dimensions so that things that are similar to each other are close and those that are not are far away, this is what you would get. The reason some of the things in the same category are not close to each other is that their content might be very different. However, because most things in a category are more similar to other things in that category than to things in other categories, you get clustering of categories. Again, since I’m not entirely sure how they are measuring distance, I’m not sure exactly what’s going on.
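For anyone who wants to see the basic idea in action, here is a minimal sketch of plain classical multi-dimensional scaling using NumPy. To be clear, this is the traditional technique, not whatever fancier algorithm the graphic uses, and the data is random: two invented “categories” of 20 items each in 9 dimensions, with Euclidean distance assumed as the similarity measure.

```python
import numpy as np


def classical_mds(points, n_components=2):
    """Project points into n_components dimensions with classical MDS."""
    points = np.asarray(points, dtype=float)
    # Squared Euclidean distances between every pair of items.
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    n = sq_dists.shape[0]
    # Double-centre the squared distances to get an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dists @ centering
    # Keep the directions with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


rng = np.random.default_rng(0)
# Two made-up "categories" of 20 items each, in 9 dimensions.
group_a = rng.normal(loc=0.0, scale=1.0, size=(20, 9))
group_b = rng.normal(loc=5.0, scale=1.0, size=(20, 9))
layout = classical_mds(np.vstack([group_a, group_b]))

# Items from the same category should land near each other in 2D.
print(layout[:20].mean(axis=0), layout[20:].mean(axis=0))
```

Plotting `layout` as a scatter, colored by category, would give two separated clouds, which is the same kind of clustering the top graphs show, just on toy data.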
I should note that I could be entirely wrong on this!
Patrick
13 years ago
I get all that, and it does seem fascinating to note the different distributions in each system and to reflect on what they might mean for what we think is important as opposed to what we actually care about, but at the end of the day I’m not sure what exactly I am supposed to take away from it!
Is it supposed to mean that intangibles like art and literature are much more important to us, and thus much more finely categorised, in real life than in abstract systems reflecting what intellectuals think is important?
Or that in real life we are stupidly distracted by triviality?
Or that formal classification systems over-classify ‘neat’ areas like science and under-classify messy but really important areas like music, fiction and paintings?
I think it’s time you added some thoughts of your own, Nick!
Hi – sorry for any unnecessary flummoxification.
Just thought it was a great viz for the reasons explained in the first line of Patrick’s contribution.