Similar code to create groups #7

inezpereira · 2020-05-19T11:59:03Z

Hi! This is more of a question/request.
Do you have code to mind-match participants into groups?
Meaning, your current output csv would be something like:

user_id,match_ids
1,93;217;463;645;783;1101
[...]
93, 1;217;463;645;783;1101 # has the same ids as before
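
For illustration, a minimal sketch of how a group assignment could be written out in the format above. The `groups` list and the `groups_to_rows` helper are hypothetical names used only for this example; they are not part of the repository's code.

```python
import csv
import io

# Hypothetical group assignment: each inner list is one group of user ids.
groups = [[1, 93, 217, 463, 645, 783, 1101]]

def groups_to_rows(groups):
    """Return one (user_id, match_ids) row per user, listing the other group members."""
    rows = []
    for members in groups:
        for user_id in members:
            others = [str(m) for m in members if m != user_id]
            rows.append((user_id, ";".join(others)))
    return rows

rows = groups_to_rows(groups)

# Write to an in-memory buffer; swap in open("output.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["user_id", "match_ids"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Every member of a group gets a row, so user 93 lists the same ids as user 1, with the two swapped.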

titipata · 2020-05-19T12:21:00Z

Yes, I do have it! I will put the code up as well. Mainly, we apply hierarchical clustering on top of the distance matrix calculated in the mind-matching part!

inezpereira · 2020-05-19T13:07:49Z

Perfect! :D
Oh, I am really looking forward to checking it out, then!

titipata · 2020-05-19T13:40:11Z

@inezpereira I put the group matching code here. This is the version that I use to rearrange users into order (by distance) and group them. It still needs some tweaks and a lot of documentation tho!

inezpereira · 2020-05-20T06:24:29Z

Very cool, just got a chance to try it out.
A few questions:

1. cluster is an array with the corresponding cluster number for the subject at index i, right?
2. Is the t argument under hcluster.fcluster() what I then need to tweak to get groups of size n? How can I do this in a principled way? Right now, with the current value for t=0.01, I get a cluster for each subject (on my data), which according to the documentation means t is too small.
3. In users_group_df the first column is called user_id, but actually it's cluster_id, right?

titipata · 2020-05-20T10:02:27Z

1. Yes. I use optimal ordering (here), so the order of the final dataframe should already be good. Basically, I just need to group every 5-6 consecutive users for the final use (or hand it to someone to tweak a bit).
2. Yes, t=0.01 is the threshold for clustering here. However, it tends to cluster everything into one big group, so we have to be careful when we use it. I tend to set it low first and tune it up a little to see what happens.
3. users_group_df is basically users_df rearranged into the optimal order; you can assign the groups later based on this ordered output. user_id is still the ID of each user here.
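
To make the effect of t concrete, here is a small sketch using SciPy's `fcluster` on made-up data. The random feature matrix `X`, the `average` linkage method, and `criterion="distance"` are assumptions for illustration; the repository's code may use different settings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical stand-in for user feature vectors (12 users, 8 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 8))

# Build the dendrogram from pairwise Euclidean distances.
Z = linkage(pdist(X), method="average")
heights = Z[:, 2]  # merge heights, one per merge step

# Try a threshold below every merge, one in the middle, and one above every merge.
thresholds = [heights.min() / 2, float(np.median(heights)), heights.max() + 1]
cluster_counts = []
for t in thresholds:
    labels = fcluster(Z, t=t, criterion="distance")
    cluster_counts.append(len(set(labels)))
    print(f"t={t:.3f} -> {cluster_counts[-1]} clusters")

# Alternative: ask for at most k flat clusters instead of picking a distance.
labels_k = fcluster(Z, t=2, criterion="maxclust")
```

A threshold below the smallest merge height leaves every user in its own cluster, and one above the largest merge height puts everyone into a single cluster. Neither criterion guarantees clusters of equal size, so it does not directly give groups of exactly n users.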

inezpereira · 2020-05-20T11:55:20Z

@titipata ok, so are you saying that I can just take consecutive chunks of the ordered users_group_df and define those as groups?
For example, if I want groups of 6:
users_group_df.iloc[0:6, :] # group 1
users_group_df.iloc[6:12, :] # group 2
etc.
Do I understand you correctly?
Also: I thought that, with lower values of t, each individual would be a cluster, so you wouldn't have one big cluster but as many clusters as you have subjects. At least that is what the documentation indicates. So I thought that you were doing the following: (1) get the optimal ordering using linkage() and then, (2) when using fcluster(), just set t to a very small value to not group people by clusters but instead give us back an ordered list of the user_id's. Did I get this at least partially right?
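
The two steps described above (an optimal ordering from linkage(), then consecutive chunks of the ordered dataframe) could be sketched roughly as follows. The random features `X`, the cosine metric, the `average` method, and the `group_id` column are assumptions for this example, not the repository's actual code.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist

# Hypothetical users and feature vectors (12 users, 8 features).
rng = np.random.default_rng(0)
users_df = pd.DataFrame({"user_id": np.arange(1, 13)})
X = rng.normal(size=(len(users_df), 8))

# Step 1: hierarchical clustering with optimal leaf ordering, so that
# neighbouring rows in the ordered output tend to be close in distance.
Z = linkage(pdist(X, metric="cosine"), method="average", optimal_ordering=True)
order = leaves_list(Z)
users_group_df = users_df.iloc[order].reset_index(drop=True)

# Step 2: label consecutive chunks of the ordered dataframe as groups.
group_size = 6
users_group_df["group_id"] = np.arange(len(users_group_df)) // group_size + 1
print(users_group_df)
```

With this labelling, `users_group_df.iloc[0:6, :]` is group 1 and `users_group_df.iloc[6:12, :]` is group 2. If the number of users is not a multiple of `group_size`, the last group is simply smaller.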

Thanks again for sharing the code and for being so responsive in dealing with issues and questions!

titipata · 2020-05-20T12:20:59Z

Yes for both answers! I'll elaborate more clearly quite soon!

inezpereira · 2020-05-20T14:20:29Z

Ok cool! I'll add a third question to the lot:
3. Right now, my dataset does not include abstracts per se, just the fields of interest in the abstracts column of the input csv file. And a lot of subjects have exactly the same word or pair of words (meaning the strings are exactly the same) in the abstracts column.
However, when I look at the users_group_df, not all subjects saying they are interested in the exact same field get grouped together. I see you add some randomness to the process. However, removing A_rand from the linkage() call (and thus only looking at A_cluster, instead of A_cluster + A_rand) doesn't solve the issue.

Again, let me illustrate. Without randomness, I would expect users_group_df to look like:
251, John Doe, field A
1, Jane J. Doe, field A
56, James Doe, field A
72, Richard Anonymous, field B
4, Joan Doe, field B
3, Et Al, field C

Where the subjects are neatly grouped by field of interest.

Instead, I get something like:
1, Jane J. Doe, field A
56, James Doe, field A
72, Richard Anonymous, field B
3, Et Al, field C
4, Joan Doe, field B
251, John Doe, field A

Do you know why this persists even after removing randomness (A_rand) from the process?

titipata · 2020-05-20T15:47:01Z

@inezpereira Maybe the random distance is too small? Depending on your matrix, you might have to try adding a random +1/-1 matrix instead, since in your case the distances come from a count of overlapping terms. In the example, it's a cosine distance, so adding a small +-0.01 randomness can tweak the result a bit.

I haven't experimented with different levels of randomness yet, but will try to follow up on this later!
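
As a rough illustration of how the noise scale interacts with the distance scale, here is a toy sketch. The 0/1 distance matrix, the `add_symmetric_noise` helper, and the `average` method are hypothetical choices for this example; only the names A_cluster and A_rand come from the discussion above.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

def add_symmetric_noise(A, scale, seed=0):
    """Return A plus symmetric uniform noise in [-scale, scale], zero on the diagonal."""
    rng = np.random.default_rng(seed)
    A_rand = rng.uniform(-scale, scale, size=A.shape)
    A_rand = (A_rand + A_rand.T) / 2  # keep the matrix symmetric
    np.fill_diagonal(A_rand, 0.0)     # a user has zero distance to themselves
    return np.clip(A + A_rand, 0.0, None)  # distances must stay non-negative

# Toy data: distance 0 within the same field, 1 across different fields.
fields = ["A", "A", "A", "B", "B", "C"]
A_cluster = np.array([[0.0 if fi == fj else 1.0 for fj in fields] for fi in fields])

orders = {}
for scale in (0.01, 1.0):
    A_noisy = add_symmetric_noise(A_cluster, scale)
    Z = linkage(squareform(A_noisy), method="average", optimal_ordering=True)
    orders[scale] = [fields[i] for i in leaves_list(Z)]
    print(scale, orders[scale])
```

With noise much smaller than the gap between same-field and cross-field distances, users in the same field stay adjacent in the ordering. With noise on the same scale as the distances themselves, that is no longer guaranteed, so the scale has to be chosen relative to the distance matrix in use.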
