My summer job involves topic modelling, using machine learning tools to automatically learn different topics that some set of documents covers, so that the documents could then be classified by topic. I haven’t done this before, so I don’t yet have a good intuition of how currently available tools work. To develop that intuition, I’m playing around with different tools and datasets, to see what kinds of results different methods give.
One interesting case would be to run a topic modeler on an extended work of fiction with various story arcs and see if it could, for instance, identify specific story arcs. With 122 chapters and several distinct story arcs and cliques of characters, Harry Potter and the Methods of Rationality seemed like a good dataset to try this on. (The following might contain unmarked minor spoilers for the story; you’ve been warned.)
I went to hpmor.com and copy-pasted all the chapters into separate text files. I removed the author’s notes and the opening quotes and various dedications to Rowling in the early chapters, as well as the “the next chapter will be out on day X” mentions. I also omitted the Omake chapters.
I then used the free analysis tool Mallet to apply LDA to the dataset. LDA (Latent Dirichlet Allocation) is a topic modeling method in which a topic is formally defined to be a distribution over a vocabulary. For example, we might have a topic corresponding to HPMOR’s Azkaban arc, which would assign a high probability to words such as quirrell, dementor, azkaban, and bellatrix.
LDA assumes that documents are written according to the following process:
1. Randomly choose a distribution over topics.
2. For each word in the document:
a. Randomly choose a topic from the distribution of topics in step #1
b. Randomly choose a word from the corresponding distribution over the vocabulary
(David M. Blei 2012: Probabilistic Topic Models. Communications of the ACM. DOI:10.1145/2133806.2133826)
Of course, this isn’t the actual way that real-world documents are written, but we could kind of imagine that they were. For example, let’s imagine Eliezer Yudkowsky sitting down to write a chapter of HPMOR which he decides will mostly be the aftermath of the Azkaban arc (the “Stanford Prison Experiment”, or SPE, arc), and will also tie those events together with Harry’s friendship with Draco. This would correspond to step 1 in the above process: let’s say that he decides that the chapter will be 70% about the SPE arc and 30% about the Harry-Draco relationship.
Now he starts writing. Each word (maybe more realistically, each sentence) can be related to either the SPE arc or the Harry-Draco relationship, so he will alternate between those two topics as he ties them together, choosing between them with a 70-30 probability. Within either topic there are several different sub-topics that he can cover, so we can think of each word associated with that topic as having some probability of being selected. Of course, some words, like “Harry”, are likely to be associated with both topics.
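The generative story above, with its 70-30 split, can be sketched in a few lines of Python. This is only an illustration: the six-word vocabulary, the two topics, and their probabilities are made up, and the function name generate_document is hypothetical. In full LDA the 70-30 mix would itself be drawn at random (from a Dirichlet distribution) rather than fixed by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and two hypothetical topics (made-up probabilities).
vocab = ["harry", "azkaban", "dementor", "bellatrix", "draco", "science"]
topics = np.array([
    [0.20, 0.30, 0.25, 0.25, 0.00, 0.00],  # "Azkaban / SPE arc" topic
    [0.30, 0.00, 0.00, 0.00, 0.40, 0.30],  # "Harry-Draco" topic
])

def generate_document(theta, topics, n_words, rng):
    """For each word: pick a topic from theta, then a word from that topic."""
    topic_ids = rng.choice(len(theta), size=n_words, p=theta)
    word_ids = np.array([rng.choice(topics.shape[1], p=topics[k]) for k in topic_ids])
    return topic_ids, word_ids

# Step 1: fixed here to the 70/30 example; full LDA would draw it at random,
# e.g. with rng.dirichlet([0.5, 0.5]).
theta = np.array([0.7, 0.3])

# Step 2: generate the words of the "chapter".
topic_ids, word_ids = generate_document(theta, topics, n_words=1000, rng=rng)
chapter = [vocab[w] for w in word_ids]

print(" ".join(chapter[:12]))
print("share of words drawn from topic 0:", (topic_ids == 0).mean())
```

Note that “harry” has a non-zero probability in both topics, which is the sense in which a word can belong to more than one topic.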
When LDA is given an existing collection of documents, it then tries to reconstruct these original probabilities and distributions. In other words, it asks the question of “given this text, and given what I assume to have been the original process which generated it, which values would have been the most likely to produce this text?”. Mallet does this using Gibbs sampling: if you want to read more about that, see Wikipedia for Gibbs sampling in general or Steyvers & Griffiths (2006) for a discussion of it in the context of LDA.
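To give a feel for what “reconstructing the probabilities” means, here is a minimal textbook-style sketch of collapsed Gibbs sampling for LDA. It is not Mallet’s implementation, which is more elaborate and (in the runs here) also re-optimizes the hyperparameters. The function name gibbs_lda, the fixed alpha and beta values, and the tiny corpus are all assumptions made for illustration.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # tokens in doc d assigned to topic k
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    topic_total = np.zeros(n_topics)               # total tokens assigned to topic k

    # Start from a random topic assignment for every word token.
    assignments = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assignments[d]):
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d, k] -= 1
                topic_word[k, w] -= 1
                topic_total[k] -= 1
                # Probability of each topic for this token, given all other assignments.
                p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                     / (topic_total + vocab_size * beta))
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1
                topic_total[k] += 1

    # Point estimates of per-document topic mixes and per-topic word distributions.
    theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
    phi = (topic_word + beta) / (topic_word + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Tiny made-up corpus: word ids 0-2 and 3-5 never appear in the same document.
docs = [[0, 1, 2, 0, 1, 2], [3, 4, 5, 3, 4, 5], [0, 2, 1, 1, 0, 2], [4, 3, 5, 5, 4, 3]]
theta, phi = gibbs_lda(docs, n_topics=2, vocab_size=6)
print(theta.round(2))
```

Each row of theta is one document’s estimated mix of topics, and each row of phi is one topic’s estimated distribution over the vocabulary: the two kinds of quantity that the generative process starts from.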
But enough theory, let’s start experimenting! I start off by having Mallet extract the raw data from the documents into a form it can use, and ask it to consider 1- and 2-grams: that is, it will base its analysis both on individual words and pairs of words. Then I ask it to generate 20 topics for us, and to list the 20 most probable words in each topic.
(for all trials, I’m running LDA for 1000 iterations, re-optimizing the hyperparameters every 20 iterations, with a burn-in of 200 iterations)
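As a rough illustration of what “considering 1- and 2-grams” means, here is a small sketch of turning a sentence into the kind of terms that show up in the listings. This is not Mallet’s code: the stopword list is a tiny hypothetical one, the function name ngram_features is made up, and the sketch assumes that ignored words are dropped before neighbouring words are paired up.

```python
import re

# Hypothetical stopword list; Mallet ships its own, much longer default list.
STOPWORDS = {"the", "to", "be", "a", "was", "on", "his", "of", "and"}

def ngram_features(text, stopwords=STOPWORDS):
    """Return 1-grams plus underscore-joined 2-grams, after dropping stopwords."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = ngram_features("The Sorting Hat was pretending to be wise")
print(features)
```

Under these assumptions, “sorting”, “hat”, and “sorting_hat” all come out of the same two words of text, and a pair like “pretending_wise” can appear even though the two words were not adjacent in the original sentence.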
Here are the initial results:
0 0,02579 phoenix wizard fawkes war blaise millicent zabini black_mist mist wizard_voice envelope million save haukelid back_sleep phoenixes violence tower bulstrode
1 0,02547 dad verres petunia mum eraser felthorne books michael evans parents atoms verres_evans rianne mother transfigure miss_felthorne father experiment michael_verres
2 0,03884 snake iss hissed defense_professor hagrid sstone mr_hagrid unicorn thiss musst ssay monster defense chamber sspeak chamber_secrets secrets bed slytherin_monster
3 0,031 severus minerva potions_master albus potions lesath severus_snape master lestrange professor_snape snape neville fred lesath_lestrange time_turner george azkaban gryffindors discipline
4 0,04833 quirrell professor_quirrell professor mr_potter mr quirrell_voice quirrell_harry goyle potter_professor mr_goyle classroom lesson quirrell_face slytherins quirrell_points battle_magic derrick lose skeeter
5 0,04691 draco father draco_harry draco_didn draco_voice harry_draco ron science conspiracy draco_couldn draco_draco platform draco_nodded draco_looked mother station rival draco_turned muggleborns
6 0,03858 professor_mcgonagall mcgonagall galleons mr_potter gold transfiguration bag shop coins wizarding witch malkin alley money wizarding_world madam_malkin street kit gringotts
7 0,02716 bellatrix dementors azkaban amelia snake patronus metal bahry broomstick auror charm aurors corridor quirrell bellatrix_black hissed hole iss cell
8 0,03185 troll hagrid weasley forest centaur yeh tracey tick unicorn broomstick filch mr_hagrid weasley_twins twins forbidden_forest rubeus argus george fred
9 0,03086 voldemort lord mirror lord_voldemort stone dark_lord altar tom perenelle tom_riddle dark parseltongue gun horcrux riddle child iss sshall hissed
10 0,03643 malfoy lucius lucius_malfoy wizengamot lord_malfoy debt house_malfoy draco_malfoy son house_potter thousand_galleons veritaserum false_memory lies murder hall galleons podium troll
11 0,04617 dementor headmaster patronus fear patronus_charm cast_patronus chocolate cage corporeal headmaster_harry patronuses harry_headmaster seamus anthony happy corporeal_patronus happy_thought harry_wizard warm
12 0,02038 draco magic fred paper dr george powerful test harry_potter fading fred_george magic_fading skeeter rita dr_potter blood scientist shadowy spells
13 0,02188 draco general soldiers neville sunshine chaos army dragon zabini battle granger armies doom_doom doom malfoy dragon_army longbottom forest dragons
14 0,02725 moody elder_wand dawn elder experiment lesath aftermath ravenclaw_common horizon peverell graveyard vow milgram philosopher_stone bellatrix_black narrow labeled unicorn hermione_nodded
15 0,02818 moody lupin mad_eye eye prophecy mad amelia remus mr_lupin monroe voldemort bones albus minerva amelia_bones alastor line eye_moody lily
16 0,03993 daphne susan tracey hannah lavender bully bullies draco_malfoy greengrass year millicent corridor girl parvati bones sprout professor_snape davis susan_bones
17 0,04419 granger miss_granger hermione miss padma hero patil heroes padma_patil professor_sinistra hermione_voice girls sinistra humming witches hermione_didn cell girl hero_hermione
18 0,02098 hat sorting game points goyle neville sorting_hat note comed_tea comed paper slytherins ha_ha mr_goyle remembrall ha tea ernie madam_hooch
19 1,37501 harry professor potter hermione voice time didn back quirrell dumbledore professor_quirrell mr don thought boy dark wasn hogwarts eyes
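Each line of the listing has three parts: the topic number, a weight, and the topic’s most probable terms. In Mallet’s topic-keys output the weight is the Dirichlet parameter for that topic, so a much larger value (as for topic 19) marks a topic that contributes to nearly every chapter. A minimal sketch for reading such lines, using two shortened lines from the listing above and a hypothetical helper name parse_topic_line:

```python
def parse_topic_line(line):
    """Split one listing line into (topic id, weight, list of top terms)."""
    topic_id, weight, *terms = line.split()
    # The listings here use a decimal comma, so normalise it before converting.
    return int(topic_id), float(weight.replace(",", ".")), terms

# Two lines from the listing, shortened to their first few terms.
lines = [
    "7 0,02716 bellatrix dementors azkaban amelia snake patronus",
    "19 1,37501 harry professor potter hermione voice time",
]
topics = [parse_topic_line(line) for line in lines]

# Sort by weight so that the broad "background" topic stands out.
for topic_id, weight, terms in sorted(topics, key=lambda t: t[1], reverse=True):
    print(topic_id, weight, " ".join(terms[:4]))
```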
Not bad. The initial topics are a bit of a mixed bag, but they get better later on. The 0th topic seems to roughly be about the war. The 1st is mostly about Harry’s parents, but somewhat oddly, Rianne Felthorne gets included in the same topic.
Number 2 is interesting: it’s picking up Parseltongue words as being associated with the Defense Professor. This makes sense, because he occasionally speaks in Parseltongue, so if he’s present in a chapter, it’s also more likely that Parseltongue words will be present. Apparently Parseltongue words are also associated with unicorns and Hagrid, because both show up in this topic.
Number 3 seems to start out as a “senior staff of Hogwarts” topic, with Snape, McGonagall, and Dumbledore being included (but not Quirrell, interestingly enough), but then also has mentions of George, Azkaban, and Gryffindors at the end. Number 4 is clearly about Quirrell, and to a lesser extent Slytherins.
Number 5 seems to be the Draco-Harry chapters, and among the more informative words includes 2-grams such as “draco_nodded, draco_looked, draco_turned”. As an interesting observation, besides one hermione_nodded in topic number 14, Draco seems to be the only character whose nods, lookings, or turnings were picked up by the modeler: I wonder what’s up with that. Number 6 involves McGonagall, Harry, and Harry’s money; number 7 looks to be the Azkaban arc. Number 8 is a topic combining Hagrid, the Forbidden Forest, and apparently also the twins. And so on.
This looks pretty good, but we could try varying the number of topics. Also, Mallet allows me to add a list of words to ignore in the analysis. By default, it already ignores words like the, is, at, and so on. Let’s add a few: “didn didn’t couldn couldn’t nodded looked turned said wasn wasn’t ‘t t”
0 0,03768 hagrid troll weasley forest mr_hagrid centaur yeh unicorn tracey tick weasley_twins filch twins broomstick forbidden_forest rubeus fred forbidden argus
1 0,04613 snape potions_master professor_snape quidditch sprout potions professor_sprout felthorne severus_snape master mirror severus rianne game susan plant susan_bones miss_felthorne exam
2 0,04554 dementor patronus headmaster phoenix fear patronus_charm fawkes chocolate patronuses cage wise corporeal seamus harry_headmaster star anthony wizard_voice souls corporeal_patronus
3 0,02705 fawkes moody envelope comed_tea comed tea experiment bellatrix_black hat lesath pillow train milgram prefect compartment drink frodo cards experimental
4 0,03366 voldemort lord dark_lord lord_voldemort iss tom stone altar hissed dark horcrux wand perenelle riddle tom_riddle thiss parseltongue vow gun
5 0,0315 bellatrix azkaban dementors snake amelia patronus metal broomstick bahry auror charm professor_quirrell quirrell corridor woman aurors hissed iss hole
6 0,0284 severus minerva neville hat lesath sorting lestrange sorting_hat fred george lesath_lestrange severus_snape legilimens fred_george discipline severus_voice handsome professor_snape points_ravenclaw
7 0,04123 daphne susan tracey hannah lavender girl bullies bully hermione greengrass girls millicent parvati draco_malfoy padma slytherin jugson davis corridor
8 0,02515 draco soldiers neville sunshine general chaos army dragon granger zabini battle malfoy armies doom doom_doom dragon_army forest shield dragons
9 0,01942 draco magic fred harry_potter dr paper george fading fred_george skeeter test magic_fading powerful rita blood dr_potter scientist wizards shadowy
10 0,05943 quirrell professor_quirrell professor mr_potter quirrell_harry mr quirrell_voice chamber potter_professor lose quirrell_face battle_magic lesson secrets snake derrick salazar monster quirrell_smiling
11 0,02807 mirror transfiguration transfigure eraser flamel atoms ball harry_hermione page hermione_voice separate sentient frame plants subject solid pig free_transfiguration objects
12 0,03587 hermione granger miss_granger miss padma hero heroes patil padma_patil elder_wand elder hermione_voice humming professor_flitwick protest cell mysterious_wizard professor_sinistra sinistra
13 0,03058 albus moody minerva severus voldemort prophecy eye amelia mad mad_eye potions_master bones monroe alastor headmistress amelia_bones eye_moody potions mark
14 0,03796 mcgonagall professor_mcgonagall parents dad mum verres gold evans galleons petunia bag christmas michael father trunk shop wizarding verres_evans coins
15 0,03985 goyle mr_goyle points slytherins defence paper ha game ha_ha note classroom remembrall pie hooch neville ernie bars boys madam_hooch
16 0,04854 draco father draco_harry ron science draco_voice platform conspiracy harry_draco mother pettigrew slytherin_house patronus_charm train draco_eyes patronus draco_don narcissa station
17 0,03689 malfoy lucius lucius_malfoy wizengamot lupin lord_malfoy remus debt mr_lupin son house_malfoy house_potter james galleons veritaserum mad false_memory longbottom vote
18 1,40563 harry professor potter voice hermione time back dumbledore quirrell mr professor_quirrell don thought boy dark hogwarts eyes face lord
19 0,03283 blaise millicent country zabini black_mist mist traitors hospital violence professor_voice harry_wizard pedestals jugson leader lord_jugson wishes shrug blue_light lucius_malfoy
The order of topics is now somewhat different. The Draco/Harry science chapters, which were previously topic number 5, now look to be topic 16: they seem a little less distinct now that we told the program to remove words like “nodded”, “looked”, and “turned”, which were previously associated with Draco, and probably with Draco talking to Harry in particular. Having fewer words that co-occur specifically when Harry and Draco are talking makes “Harry and Draco talking” a less distinct cluster. Maybe we shouldn’t have asked the program to ignore those words. I’ll take them off the ignore list.
What happens if we try 10 or 30 topics?
Here are the results with 10:
0 0,0553 hat transfiguration sorting goyle mr_goyle sorting_hat points transfigure class defence eraser note game paper professor_mcgonagall ha classroom ha_ha shadowy
1 0,05378 severus minerva azkaban albus fawkes phoenix lesath lestrange bellatrix severus_snape moody potions_master lesath_lestrange bellatrix_black neville envelope alarm hours severus_voice
2 0,06163 professor_quirrell quirrell professor mirror lord_voldemort voldemort stone defense_professor defense perenelle quirrell_harry parseltongue chamber flamel quirrell_voice tom sprout horcrux quirrell_face
3 0,04948 bellatrix snake voldemort azkaban dementors iss hissed amelia wand patronus lord bahry dark_lord broomstick metal charm altar auror dark
4 0,0406 malfoy moody lucius wizengamot lucius_malfoy albus eye mad lord_malfoy amelia mad_eye azkaban minerva amelia_bones eye_moody debt monroe alastor line
5 0,06065 hagrid dementor troll lupin forest remus mr_lupin tracey mr_hagrid centaur yeh unicorn tick filch huge james weasley elder_wand rubeus
6 0,04647 draco soldiers general sunshine army chaos dragon zabini neville battle granger malfoy armies blaise doom_doom dragon_army dragons dr father
7 0,05774 father professor_mcgonagall parents mum dad money fred galleons george ron verres gold rita science skeeter books trunk evans bag
8 1,55479 harry professor potter voice hermione time back quirrell dumbledore mr professor_quirrell don thought boy dark draco hogwarts eyes harry_potter
9 0,05979 daphne susan hermione tracey hannah padma girl bullies lavender girls millicent bully greengrass miss davis parvati susan_bones hero jugson
0 jumps out at once: it looks like the sorting hat is now a major topic! But upon a closer inspection, it looks like this might be an artifact of the 1-gram and 2-gram versions of it being double-counted: “hat”, “sorting”, and “sorting_hat” are all included in the same topic. If we were to remove “hat” and “sorting”, the topic would become “transfiguration goyle mr_goyle sorting_hat points transfigure class defence eraser note game paper professor_mcgonagall ha classroom ha_ha shadowy”, which makes the topic look a lot less coherent. Notice that “goyle” also gets double-counted, with “goyle” and “mr_goyle”.
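The double-counting can be checked mechanically: for each 2-gram in a topic, see whether any of its parts is also listed as a 1-gram in that same topic. A minimal sketch, with a hypothetical helper name double_counted and the terms of topic 0 copied from the 10-topic listing:

```python
def double_counted(terms):
    """Map each 2-gram in a topic to those of its parts also listed as 1-grams."""
    unigrams = {t for t in terms if "_" not in t}
    overlaps = {}
    for term in terms:
        if "_" in term:
            # Deduplicate the parts so that e.g. "ha_ha" reports "ha" only once.
            shared = sorted(set(term.split("_")) & unigrams)
            if shared:
                overlaps[term] = shared
    return overlaps

topic_0 = ("hat transfiguration sorting goyle mr_goyle sorting_hat points transfigure "
           "class defence eraser note game paper professor_mcgonagall ha classroom "
           "ha_ha shadowy").split()
print(double_counted(topic_0))
```

For this topic the check flags “mr_goyle”, “sorting_hat”, and “ha_ha”, while “professor_mcgonagall” is not flagged because neither of its parts appears on its own among the listed terms.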
In general, most of these topics don’t look like they would correspond to any clear “real” topic, though there are a few exceptions like number 6 being related to the Quirrell Armies. Notice that the double-counting is also pretty prominent in general.
It seems useful to stop and reflect on why these results are now so bad. Here’s what I think: there are a lot of different events and storylines in HPMOR, each associated with its own specific vocabulary. For instance, Rianne Felthorne, who was picked up in the 20-topic version, only appears in chapters 71, 76, and 79. If you tell the model to assume that there are a lot of topics, then it might actually come up with the hypothesis that there’s a topic which covers those three chapters and which has a very high probability of talking about Rianne. But with a low number of topics assumed, it can’t “waste” any topics by dedicating them to such rare words. Instead, in order to cover most of the documents, it has to assume that Rianne is part of some much bigger topic which spans a lot of chapters. Since Rianne only appears in three chapters, such a wide-spanning topic would have to have a very low probability of generating Rianne’s name. This means that topics will become dominated by words which appear pretty often in the text, and in a lot of different contexts – but of course that makes the topics less distinctive and meaningful. The only distinctive topics will be those that are major enough to span several chapters, which is the case for the Quirrell Armies.
So how about the opposite direction, with 30 topics?
0 0,02023 hagrid troll forest tracey centaur yeh broomstick tick unicorn filch weasley forbidden_forest mr_hagrid huge forbidden unicorns argus rubeus half_giant
1 0,01219 draco harry_potter magic dr wizards powerful paper blood test father fading figure magic_fading spells dr_potter scientist muggles ll scientists
2 0,02164 elder pettigrew elder_wand hero vow rat dawn sirius_black prophecies rival sirius unicorn revived fingernails horizon hermione_harry girl_revived back_dead rooftop
3 0,02732 severus mum dad lesath verres evans parents father petunia lestrange neville michael verres_evans letter michael_verres lesath_lestrange books window roberta
4 0,02912 quirrell professor_quirrell professor mr_potter mr goyle classroom mr_goyle quirrell_harry quirrell_voice lose potter_professor skeeter quirrell_face quirrell_points derrick slytherins rita_skeeter rita
5 0,02385 mcgonagall professor_mcgonagall gold galleons mr_potter shop bag coins parents alley diagon malkin sighed wizarding_world street wizarding madam_malkin trunk pouch
6 0,02138 draco general soldiers neville sunshine chaos army dragon zabini battle granger malfoy armies doom doom_doom dragon_army dragons longbottom shield
7 0,01796 azkaban phoenix moody bellatrix fawkes envelope bellatrix_black aftermath lesath experiment harry_stared amelia milgram black_azkaban frodo bird clock pillow mask
8 0,02302 auror amelia defense_professor amelia_bones duel department mr_malfoy exam false_memory grade charmed law_enforcement beauxbatons trophy_room trophy enforcement magical_law department_magical memory_charm
9 0,02788 miss miss_granger granger padma hero heroes patil padma_patil professor_flitwick humming witches hermione girl hermione_voice cell professor_sinistra sinistra mysterious_wizard hero_hermione
10 0,0357 draco father ron draco_harry conspiracy platform draco_voice sad harry_draco station draco_nodded draco_turned mother narcissa lucius haired draco_eyes revenge slytherin_house
11 0,02313 responsible wards troll gryffindor head_table twins weasley hall weasley_twins minerva mr_hagrid great_hall cracked blame storeroom sinistra hagrid jugson year_witch
12 0,02151 malfoy lucius lucius_malfoy lord_malfoy wizengamot house_malfoy debt son house_potter house thousand_galleons ancient galleons plum_colored plum colored goblin colored_robes troll
13 0,00797 moody eye prophecy dark mad mad_eye dark_lord monroe albus mark mcgonagall scarred severus evidence lord david dark_mark scarred_man eye_moody
14 0,02847 voldemort dark_lord lord dark wand altar child gun iss hissed stone body vow master lord_voldemort girl_child apokatastethi graveyard sshall
15 0,02275 bellatrix dementors azkaban amelia broomstick snake metal bahry auror corridor professor_quirrell quirrell charm patronus bellatrix_black woman hole cell iss
16 0,02847 quidditch snape sprout professor_snape professor_sprout potions_master game bones susan plant susan_bones philosopher_stone mirror potions cedric snitch chamber broomstick tendrils
17 0,02856 lupin remus mr_lupin james lily remus_lupin peter nuclear stars star children_children haukelid tower edge million script ravenclaw_tower soft_voice godric_hollow
18 0,03067 dementor patronus headmaster patronus_charm fear patronuses chocolate cage corporeal happy cast_patronus presence anthony dementors expecto_patronum corporeal_patronus seamus happy_thought harry_headmaster
19 0,02245 snake iss hissed defense_professor hagrid mr_hagrid chamber infirmary unicorn monster chamber_secrets secrets slytherin_monster sstone sspeak yess ssay hissed_harry parseltongue
20 0,04184 daphne susan tracey hannah hermione girl lavender bully bullies greengrass girls parvati millicent davis slytherin corridor bones padma susan_bones
21 0,01251 wizard blaise millicent zabini war black_mist mist harry_wizard gregory violence jugson oaken_door pedestals bulstrode lord_jugson wizard_voice black_cloak half_moon black_hat
22 0,0332 hermione boy library book pages sentient page plate chocolate year_girl train talk flamel plants experiment compartment century research snakes
23 0,02155 hat sorting tea game sorting_hat comed note points ha ravenclaw comed_tea ha_ha pie neville bars paper largest hufflepuffs slytherins
24 0,01522 professor_quirrell quirrell mirror lord_voldemort voldemort dumbledore stone perenelle tom cauldron potion horcrux albus_dumbledore parseltongue tom_riddle riddle flamel david_monroe monroe
25 0,02873 severus minerva albus snape amelia voldemort potions_master bones potions master amelia_bones headmistress felthorne merlin moody severus_snape rianne professor_snape madam_bones
26 1,34252 harry professor potter hermione voice time quirrell back professor_quirrell dumbledore mr don thought boy dark hogwarts eyes face lord
27 0,01638 dumbledore goyle mr_goyle remembrall turner paper ah ernie discipline gargoyle madam_hooch hooch rock neville_remembrall points_ravenclaw thursday swamp gregory_goyle chicken
28 0,02038 transfiguration fred george transfigure fred_george eraser atoms skeeter rita twins ball minerva rita_skeeter flume impossible collection separate weasley_twins subject
29 0,02581 pansy traitors generals chant prismatic_wall wishes country samuel male_voice male audience crush vow pretty luminos_shouted parkinson luminos gate halls
Hmm. Not sure if this is so great, either: now we might have the opposite problem, that 30 topics is too much freedom for the model, and it can hypothesize all kinds of minitopics that aren’t actually there. Now I’m pretty sure that one *could* come up with 30 coherent topics if one did it manually, but that would require using more structure than a basic form of LDA is capable of using.
So 20 topics was probably best. Out of curiosity, what would it look like if we only considered 1-grams? That would eliminate some double-counting, but would it actually improve the results?
0 0,24225 albus severus voldemort moody mr minerva dark master prophecy lord eye mcgonagall potter potions bones azkaban mad snape monroe
1 0,20082 harry bellatrix azkaban professor quirrell snake dementors amelia metal charm auror bahry aurors lord woman wizard defense broomstick corridor
2 2,48546 voice boy time back looked eyes turned hand head door hogwarts place face heard words moment black robes stood
3 0,16911 draco granger neville general soldiers sunshine chaos army malfoy battle dragon zabini hermione armies shield blaise longbottom doom fight
4 0,44362 harry patronus dementor death charm light stars wand voice cast fear dementors die silver wouldn happy died bright aurors
5 0,21435 professor harry points mcgonagall mr time game slytherin ravenclaw goyle desk neville students year slytherins sprout classroom note quidditch
6 2,01788 thought dark mind life time lord dumbledore part man thing long power stop knew great world understand side true
7 0,64986 professor quirrell mr defense potter dark students lord miss spell true obvious room headmaster slytherin snape lose today slytherins
8 0,08225 voldemort harry lord stone dark mirror iss wand hissed altar riddle tom child horcrux parseltongue death dumbledore perenelle white
9 0,98236 harry wand hand air sense spell broomstick left ground fire body hit cloak mind red pouch moving pointed back
10 0,1403 hat sort ron sorting secrets tea book slytherin comed table talk neville train snake drink rule secret carriage pages
11 0,12296 hermione transfiguration lupin transfigure remus mr wand minerva mcgonagall eraser form tiny peter pettigrew brain atoms separate wood steel
12 0,38543 dumbledore headmaster wizard phoenix albus fawkes eyes fire flitwick war stone cloak mcgonagall office wizards understand shoulder back desk
13 0,33105 hermione granger miss professor mcgonagall defense hogwarts ve hero hagrid mr head ll year tracey forest heroes girl centaur
14 0,15878 hermione daphne susan tracey slytherin snape girl padma hannah year malfoy potions lavender bullies table miss house greengrass millicent
15 0,15193 severus weasley neville george fred minerva students twins lesath table snape mr skeeter tick rita gryffindor lestrange potions man
16 0,26817 draco father magic slytherin blood malfoy powerful ll wizards test figure paper potter spells lost fading muggles dr mother
17 2,62309 harry potter don people ve things make face good ll wouldn hogwarts made sort wanted thought thing put point
18 0,31149 malfoy lucius house granger son wizengamot hogwarts potter dumbledore lord chair lived ancient debt murder aurors magical britain room
19 0,25969 mcgonagall professor parents mr father evans verres mum dad witch galleons money gold books magic world mother family wizarding
I’d say that’s definitely worse: I have difficulty picking out anything sensible, though it’s interesting to look at what *does* remain identifiable. The Quirrell Armies show up once again, in topic number 3. They’re definitely the most resilient topic in the whole story. There are also a few others, like number 8, which is strongly related to Vold… He-Who-Shall-Not-Be-Named.
(I also tested whether 30 topics would work better for 1-grams; I won’t show you the results, because the answer was “not really”.)
What if we only considered 2-grams? That’s going to produce a mess, but I’m still curious to see what it looks like. Also, I want to see whether our heroes, the Quirrell Armies, manage to survive that challenge as well!
0 0,01128 sorting_hat comed_tea points_ravenclaw severus_voice lesath_lestrange potter_severus gryffindor_table harry_sat older_student trimmed_robes whisper_whisper school_discipline severus_smiling severus_face potions_professor students_looked red_trimmed black_robed perfect_occlumens
1 0,01493 potions_master professor_snape miss_felthorne false_memory severus_snape professor_sprout memory_charm rianne_felthorne empty_air sorting_hat theodore_nott attempted_murder trophy_room susan_bones snape_voice cedric_diggory wards_hogwarts felthorne_snape albus_quietly
2 0,01344 fred_george rita_skeeter mr_hagrid chamber_secrets slytherin_monster hissed_snake hissed_harry pale_blue miss_skeeter heir_slytherin source_magic mary_place green_snake rich_people imperius_curse people_sort solving_groups problem_solving order_chaos
3 0,0083 doom_doom dragon_army general_potter chaos_legion general_granger mr_goyle sunshine_regiment sunshine_soldiers draco_malfoy general_malfoy blaise_zabini sleep_hex sunshine_general neville_longbottom prisoner_dilemma mrs_davis dragon_general mr_thomas mr_mrs
4 0,01159 dark_lord mad_eye eye_moody mr_grim girl_child lord_voldemort mr_white death_eater dark_mark apokatastethi_apokatastethi scarred_man mr_moody voldemort_voice harry_scar high_voice april_pm voldemort_hissed apokatastethi_soma lord_spoke
5 0,01227 mr_goyle ha_ha older_slytherins cereal_bars largest_slytherin student_classroom mr_crabbe quirrell_points current_points dangerous_student martial_arts game_controller snapped_fingers green_study wearing_pyjamas box_cereal hint_hint hermione_mind ha_su
6 0,01779 professor_quirrell bellatrix_black defense_professor harry_thought metal_door guardian_charm thought_harry bellatrix_professor dark_lord muggle_device patronus_charm hole_wall harry_brain partial_transfiguration shadows_death harry_knew harry_turned life_eaterss green_spark
7 0,00985 amelia_bones bellatrix_black madam_bones mad_eye minerva_mcgonagall eye_moody chief_warlock line_merlin alastor_moody headmistress_mcgonagall merlin_unbroken black_azkaban harry_james harry_stared peter_pettigrew potter_evans order_phoenix muggle_weapons lesath_lestrange
8 0,00967 lord_voldemort tom_riddle baba_yaga david_monroe answer_parseltongue wizarding_war great_creation blackened_fire az_reth nicholas_flamel quirrell_dropped back_professor quirrell_looked professor_quirrell quidditch_game obtain_sstone lay_bed horcrux_spell harry_aloud
9 0,01368 seventh_year salazar_slytherin general_granger susan_bones draco_malfoy slytherin_ghost sunshine_general year_girl year_boy miss_davis fourth_year sixth_year ancient_house hufflepuff_girl hermione_harry doom_doom slytherin_girl daphne_greengrass ravenclaw_girl
10 0,01569 professor_mcgonagall mr_goyle madam_malkin mokeskin_pouch madam_hooch neville_remembrall diagon_alley gold_coins older_witch bag_gold healer_kit shake_hand mcgonagall_face gregory_goyle mcgonagall_sighed gold_harry cavern_level genetic_parents gold_silver
11 0,32335 professor_quirrell harry_potter mr_potter defense_professor professor_mcgonagall dark_lord hermione_granger harry_voice miss_granger draco_malfoy boy_lived albus_dumbledore patronus_charm professor_flitwick harry_looked shook_head harry_thought mr_malfoy harry_harry
12 0,01027 harry_wizard black_mist wizard_voice resurrection_stone harry_headmaster moon_glasses black_cloak lord_jugson oaken_door albus_dumbledore wizard_face black_hat headmaster_harry wizard_quietly death_eater save_lives dumbledore_voice blue_eyes pretending_wise
13 0,00772 hermione_voice elder_wand harry_hermione hermione_harry free_transfiguration unbreakable_vow liquid_gas transfigure_liquid narrow_keyhole start_year metal_ball hermione_nodded collection_atoms muggle_science ve_thinking unicorn_princess time_narrow girl_revived living_subject
14 0,02155 mr_lupin verres_evans michael_verres remus_lupin professor_verres professor_michael comed_tea harry_father evans_verres living_room cross_station letter_hogwarts godric_hollow christmas_eve parents_harry son_harry mr_bronze leo_granger dad_mum
15 0,01017 warm_happy back_sleep lord_voldemort albus_dumbledore state_mind ravenclaw_tower expecto_patronum long_ago golden_frame tattered_cloak corporeal_patronus red_gold light_years golden_back lay_beneath auror_goryanof master_flamel quirrell_pointed true_love
16 0,01284 mr_hagrid weasley_twins forbidden_forest half_giant great_hall tick_harry weasley_twin huge_man argus_filch rubeus_hagrid part_mind head_table unicorn_blood gryffindor_table fred_george magical_creatures false_memory ron_weasley fred_weasley
17 0,01825 harry_potter draco_voice magic_fading dr_potter draco_harry harry_draco shadowy_figure dr_malfoy death_eater draco_don draco_draco powerful_wizards green_light blood_purism draco_realized potter_draco don_draco fading_world paper_magic
18 0,01634 lucius_malfoy lord_malfoy house_malfoy house_potter plum_colored draco_malfoy thousand_galleons colored_robes dark_stone madam_longbottom ancient_hall blood_debt chief_warlock hundred_thousand noble_ancient malfoy_stood debt_owed lords_ladies hall_wizengamot
19 0,01762 miss_granger padma_patil hermione_voice professor_sinistra hero_hermione year_witch penelope_clearwater mysterious_wizard chaos_legion professor_vector hermione_turned amelia_bones endless_stair people_ve harry_friend beneath_half ravenclaw_girl common_sense leather_folder
The armies show up *very* distinctively as topic number 3. An interesting topic is number 12, which looks like it might involve Harry’s and Dumbledore’s debates about death and mortality, given the presence of 2-grams like “resurrection_stone, harry_headmaster, albus_dumbledore, wizard_face, wizard_quietly, death_eater, save_lives, dumbledore_voice, pretending_wise” (if some of these seem confusing, remember that Mallet ignores very common words by default, so e.g. pretending_wise was probably “pretending to be wise” in the raw text).
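To make the pretending_wise example concrete, here is a minimal Python sketch of how such a 2-gram can come about. This is not Mallet's actual code. The tiny STOPWORDS set and the function name are made up for illustration, and Mallet's real default stoplist is much longer. It only assumes that common words are dropped first and that the remaining neighbours are then joined with an underscore.

```python
import re

# Hypothetical mini stoplist for illustration only; not Mallet's real list.
STOPWORDS = {"to", "be", "the", "a", "of", "was", "he"}

def bigrams_after_stopword_removal(text):
    """Lowercase, drop stopwords, then join each adjacent pair of remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return [f"{a}_{b}" for a, b in zip(kept, kept[1:])]

print(bigrams_after_stopword_removal("He was pretending to be wise"))
# ['pretending_wise']
```

Because "to" and "be" are removed before pairing, "pretending" and "wise" end up adjacent, even though they never sit next to each other in the raw text.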
Still, 20 topics with 1- and 2-grams seems to work best. Let’s generate that kind of model again, and this time also have it tell us what percentage of each chapter is made up of each topic.
Here are the topics:
0 0,03474 moody eye monroe mad_eye mad voldemort amelia prophecy bones david amelia_bones albus david_monroe minerva eye_moody alastor line azkaban voldie
1 0,02881 draco father harry_potter blood dr draco_voice magic test muggles powerful paper wizards draco_harry fading scientist spells harry_draco magic_fading scientists
2 0,02842 miss_granger miss hermione hero heroes granger hermione_granger elder_wand elder humming sinistra hermione_voice cell mysterious_wizard professor_sinistra fingernails vow sparkling professor_vector
3 0,02272 hat sorting neville sorting_hat goyle note ha points slytherins remembrall game mr_goyle paper ha_ha comed ernie comed_tea defence rock
4 0,05541 quirrell professor_quirrell professor mr_potter mr lose quirrell_voice goyle mr_goyle lesson quirrell_harry quirrell_face potter_professor secrets monster quirrell_nodded quirrell_points derrick quirrell_looked
5 0,02989 father dad mum books ron verres evans science petunia parents platform verres_evans michael trunk scarf letter train son owl
6 0,03465 malfoy lucius lucius_malfoy lord_malfoy wizengamot debt son house_malfoy house_potter false longbottom colored podium thousand_galleons plum_colored plum false_memory law owed
7 0,03928 daphne susan tracey hannah snape lavender bullies bully professor_snape draco_malfoy greengrass millicent bones sprout parvati corridor susan_bones girl davis
8 0,03814 hagrid troll forest unicorn tracey mr_hagrid centaur yeh tick filch weasley broomstick rubeus forbidden_forest huge forbidden twins unicorns argus
9 0,03042 draco neville soldiers general sunshine chaos army dragon granger zabini battle malfoy armies doom_doom doom dragons forest dragon_army shield
10 0,03238 voldemort lord lord_voldemort mirror dark_lord stone iss altar tom horcrux riddle parseltongue hissed wand perenelle tom_riddle dark body gun
11 0,04766 fred george neville fred_george lesath skeeter weasley rita severus twins rita_skeeter lestrange weasley_twins lesath_lestrange gryffindors handsome legilimens flume occlumency
12 0,05025 padma girls girl patil pettigrew padma_patil table responsible rival astorga pansy granger rumor ravenclaw_table heroine rat morning madam_pomfrey year_witch
13 0,03567 bellatrix snake dementors azkaban amelia patronus broomstick professor_quirrell bahry metal quirrell auror charm woman iss hissed corridor aurors bellatrix_black
14 0,0177 phoenix fawkes war blaise aftermath millicent envelope azkaban moody black_mist mist zabini haukelid wizard_voice back_sleep tower gregory million violence
15 0,04214 dementor patronus lupin headmaster remus patronus_charm mr_lupin james lily cast_patronus godric cage corporeal patronuses fear death happy anthony chocolate
16 0,03449 severus minerva albus potions_master potions master snape severus_snape time_turner turner headmistress floo azkaban professor_snape discipline severus_voice points_ravenclaw escape headmaster_office
17 0,04336 mcgonagall professor_mcgonagall galleons gold alley shop bag mr_potter pouch diagon_alley coins diagon wizarding_world malkin witch vault wizarding street kit
18 1,37162 harry professor potter hermione voice time back dumbledore quirrell professor_quirrell mr don thought boy dark hogwarts eyes face lord
19 0,02203 transfiguration transfigure eraser atoms minerva page ball harry_hermione separate sentient hermione_voice library subject diamond collection snakes pig free_transfiguration research
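If you want to work with a listing like this programmatically, note that the weights are printed with decimal commas, which Python's float() won't accept as-is. The following is a small sketch, assuming each line has the shape "id weight word word ..." separated by whitespace, as in the listing above. The helper name parse_topic_key is my own, and the word lists are shortened.

```python
def parse_topic_key(line):
    """Split one 'id weight word word ...' line; the weight uses a decimal comma."""
    topic_id, weight, *words = line.split()
    return int(topic_id), float(weight.replace(",", ".")), words

# Three lines copied from the listing above (word lists shortened).
lines = [
    "4 0,05541 quirrell professor_quirrell professor mr_potter",
    "14 0,0177 phoenix fawkes war blaise",
    "18 1,37162 harry professor potter hermione",
]

topics = [parse_topic_key(line) for line in lines]
largest = max(topics, key=lambda t: t[1])
print(largest[0], largest[1])  # 18 1.37162
```

Sorting or taking the maximum by the second field makes it easy to spot a topic whose weight dwarfs all the others.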
To make things easier, I’m going to give each of those topics a more descriptive name. I went with these:
0: Mad-Eye Moody & David Monroe
1: Harry & Draco doing science together
2: Hermione
3: Sorting Hat & Mr. Goyle
4: Professor Quirrell
5: Harry’s parents
6: Lucius Malfoy & Harry’s debt
8: Hagrid & the Forest
9: Quirrell Armies
10: Lord Voldemort
11: Fred & George
12: Padma Patil and stuff
13: Azkaban Arc
15: Dementors & Patronuses
16: Albus, Minerva, and Snape
17: Diagon Alley & Money
18: Generic (this topic makes up by far the largest proportion of the story: it has a weight of 1,37 whereas none of the others reach even 0,06. You could call it the “whatever doesn’t fit into one of the other topics” topic)
That’s not too bad of a list of topics in HPMOR, though the proportion of the “generic” topic is kinda annoying. Here are some of the topic classifications the model gives us (only the largest percentages shown):
Chapter 1, A Day of Very Low Probability: 57,8% Harry’s Parents, 42,1% Generic
Chapter 2, Everything I Believe Is False: 47,5% Generic, 26,2% Diagon Alley & Money, 25,2% Harry’s Parents
Chapter 3, Comparing Reality To Its Alternatives: 49,6% Generic, 39,6% Diagon Alley & Money, 7% Harry’s Parents
Chapter 4, The Efficient Market Hypothesis: 59,7% Diagon Alley & Money, 40% Generic
Chapter 5, The Fundamental Attribution Error: 49,8% Diagon Alley & Money, 49,4% Generic
Chapter 6, The Planning Fallacy: 50,5% Generic, 47,9% Diagon Alley & Money
These topic classifications initially go roughly as one might expect, though the topic we termed “Diagon Alley & Money” shows up as early as Chapter 2, even though they only get to the Alley in Chapter 3.
Chapter 7, Reciprocation: 46,1% Generic, 42,0% Harry’s Parents, 10,2% Harry & Draco doing science together
After that, the Diagon Alley topic stays strong until Chapter 7, where it disappears entirely as the story moves away from the Alley to King’s Cross Station, Harry’s parents say goodbye to him, and Harry runs into Draco, among others.
Chapter 8, Positive Bias: 52,4% Generic, 38,5% Harry’s Parents, 4,4% Sorting Hat & Mr. Goyle (1.3432768379668802E-5 Hermione)
But then there’s Chapter 8, where Harry and Hermione have an extended discussion: besides Generic, this is classified as mostly being about Harry’s Parents (???), and a little bit about the oddball “Sorting Hat & Mr. Goyle”; the topic we had named “Hermione” comes in at a tiny fraction.
Chapter 9, Title Redacted, Part I: 50,3% Generic, 41,6% Sorting Hat & Mr. Goyle, 8,02% Fred & George
Chapter 9 is where people are sorted (and Fred & George make a minor appearance). It’s interesting to notice that Chapter 8 had a bit of Sorting Hat content, even though nothing about the sorting was mentioned; we also previously saw that the Diagon Alley classification showed up even before they went to Diagon Alley.
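The per-chapter summaries above can be produced from raw topic proportions with a few lines of Python. This is only a sketch: it assumes the proportions for a chapter have already been loaded into a dict keyed by topic number (the exact layout of Mallet's output file isn't covered here), and the summarize helper, the min_share cutoff, and the leftover entry for topic 3 are my own illustrative choices.

```python
# Names from the topic list above; only the ones needed for this example.
TOPIC_NAMES = {5: "Harry’s Parents", 17: "Diagon Alley & Money", 18: "Generic"}

def summarize(proportions, min_share=0.05):
    """Return 'xx,x% Name' strings for topics at or above min_share, largest first."""
    ranked = sorted(proportions.items(), key=lambda kv: kv[1], reverse=True)
    parts = []
    for topic_id, share in ranked:
        if share >= min_share:
            name = TOPIC_NAMES.get(topic_id, "topic " + str(topic_id))
            percent = f"{share * 100:.1f}".replace(".", ",")  # decimal comma, as above
            parts.append(percent + "% " + name)
    return ", ".join(parts)

# Chapter 2 figures as quoted above; the 0.011 for topic 3 is a hypothetical
# leftover, included only to show the cutoff dropping small topics.
chapter_2 = {18: 0.475, 17: 0.262, 5: 0.252, 3: 0.011}
print(summarize(chapter_2))
# 47,5% Generic, 26,2% Diagon Alley & Money, 25,2% Harry’s Parents
```

Lowering min_share would surface the small shares too, such as the trace of Sorting Hat content in Chapter 8.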
But now I need to leave work, so no time to do more analysis at this point. If anyone wants to do more analysis, the full results are here: http://pastebin.com/bGip7X4D